# <del>Parser</del> ''done''
# Data Source Merger
# Database Insert (modify <code>models/</code> files with some mapping to database fields)
# Data Cleanup (reference [[Patent_Assignment_Data_Restructure|Marcela and Sonia's work]])
== Directory Layout ==
There are three interesting directories:
* <code>data/downloads/</code> is USPTO bulkdata, unmodified, straight from the scraper, and validated to have the correct file size.
* <code>data/extracts/</code> is a directory of a strict subset of the information stored in <code>data/downloads/</code>. It is the result of running a bulk 7-zip job on that directory to get everything unzipped in a flat data structure. Note that these files have the USPTO modified-by time since that metadata is stored in the zipfiles. To extract files in this nice format, select all of the zipfiles and set up an extraction job like in this [[media:7zip-params.png|screenshot]].
* <code>src/</code> is the main code repository for the Java project.
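The flat extraction described above was done with a 7-zip job. As a rough illustration only, the same idea can be sketched in Java with the standard <code>java.util.zip</code> classes. The class and method names (<code>FlatExtractor</code>, <code>extractFlat</code>) are hypothetical and are not part of the project's <code>src/</code> code; the sketch assumes that file names are unique across folders inside a zipfile, since a later entry with the same name would overwrite an earlier one.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.FileTime;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class FlatExtractor {
    /**
     * Extracts every file entry of one zipfile into outDir, discarding any
     * folder structure inside the archive (hypothetical helper).
     * Returns the number of files written.
     */
    public static int extractFlat(Path zipPath, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        int count = 0;
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                // Keep only the last path component so the output is flat.
                String name = entry.getName().replace('\\', '/');
                String flatName = name.substring(name.lastIndexOf('/') + 1);
                if (flatName.isEmpty() || flatName.equals(".") || flatName.equals("..")) {
                    continue;
                }
                Path target = outDir.resolve(flatName);
                try (InputStream in = zip.getInputStream(entry)) {
                    Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                }
                // Carry over the modified-by time stored in the zipfile, if any.
                FileTime modified = entry.getLastModifiedTime();
                if (modified != null) {
                    Files.setLastModifiedTime(target, modified);
                }
                count++;
            }
        }
        return count;
    }
}
```

Calling <code>FlatExtractor.extractFlat(zipfile, extractsDir)</code> once per zipfile would fill a single flat directory, which is the layout the 7-zip job produces.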
In addition, there are three interesting files in the base directory:
* <code>extracts.7z</code> is an archived version of the <code>extracts/</code> directory for backup and transfer reasons.
<nowiki>Name: extracts.7z
Size: 55847284301 bytes (53260 MB)
SHA256: C653E5B736530711DB2212191853EAABBF36CF48820915F8B57DB54E1990BDC0</nowiki>
* <code>hashes.tsv</code> is a tab-separated value file with SHA-256 hashes of the files as downloaded from the USPTO.
* <code>index.tsv</code> is a tab-separated value file with the URLs, modified-by datetime, and supposed filesize in bytes.
=== Input Files ===
All of the text-only Red Book files for '''granted patents''' from 1976 to 2016, inclusive. To find a specific year's XML file, look in

<code>E:\McNair\Projects\SimplerPatentData\data\extracts\granted\</code>

'''To find assignment data''', look in

<code>E:\McNair\Projects\SimplerPatentData\data\extracts\granted\</code>

'''To find maintenance fee data''', look in

<code>E:\McNair\Projects\SimplerPatentData\data\downloads\maintenance</code>
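The SHA-256 hashes recorded in <code>hashes.tsv</code> can be used to confirm that a downloaded file is intact. The following is a minimal sketch of such a check, not the project's actual code. The column layout of <code>hashes.tsv</code> is not documented here, so the sketch assumes one file per line with the file name in the first column and the hex hash in the second; <code>HashCheck</code> and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

public class HashCheck {
    /** Streams a file through SHA-256 and returns the digest as uppercase hex. */
    public static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[1 << 16];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02X", b));
        }
        return hex.toString();
    }

    /**
     * Reads a TSV file into a name -> hash map.
     * ASSUMPTION: column 0 is the file name and column 1 is the hex hash.
     */
    public static Map<String, String> readHashes(Path tsv) throws IOException {
        Map<String, String> hashes = new LinkedHashMap<>();
        for (String line : Files.readAllLines(tsv, StandardCharsets.UTF_8)) {
            String[] cols = line.split("\t");
            if (cols.length < 2) {
                continue; // skip blank or malformed lines
            }
            hashes.put(cols[0].trim(), cols[1].trim());
        }
        return hashes;
    }

    /** Returns true when the file's SHA-256 equals the expected hex string (case-insensitive). */
    public static boolean matches(Path file, String expectedHex)
            throws IOException, NoSuchAlgorithmException {
        return sha256Hex(file).equalsIgnoreCase(expectedHex);
    }
}
```

The digest is computed in a streaming fashion so that multi-gigabyte archives do not need to fit in memory.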
== Schema Reconciliation ==
{|
|January 1976 to December 2001
|APS
|Yes (syntactic parsing but little semantic knowledge)
|-
|<del>January 2001 to December 2001</del>
|January 2005 to December 2005
|XML Version 4.0 ICE
|Maybe
|-
|January 2006 to December 2006
|XML Version 4.1 ICE
|Maybe
|-
|January 2007 to December 2012
|XML Version 4.2 ICE
|Maybe
|-
|January 2013 to September 24, 2013
|style="background: green; color: white;" | Yes
|}
=== Processing ===
TODO
=== Attributes ===