|Does subsume=Redesigning Patent Database, Patent Assignment Data Restructure,
}}
A continuation of [[Redesigning Patent Database]] that aims to write faster, more centralized code for handling the USPTO data. An end-to-end pipeline lets us reproduce or update the data without worrying about unintentional side effects or missing data.
== Directory Layout ==
All of the information for this project is located at <code>E:\McNair\Projects\SimplerPatentData</code>.
There are three interesting directories:
* <code>zipfiles/</code> contains the USPTO bulk data, unmodified and validated to have the correct file size
* <code>extracts/</code> contains a strict subset of the information stored in <code>zipfiles/</code>. It is the result of running a bulk 7-Zip job on that directory to unzip everything into a flat directory structure. Note that these files keep the USPTO last-modified time, since that metadata is stored in the zipfiles.
* <code>src/</code> is the main code repository for the Java project
In addition, there are three interesting files in the base directory:
* <code>extracts.7z</code> is an archived version of the <code>extracts/</code> directory for backup and transfer reasons.
<nowiki>Name: extracts.7z
Size: 55847284301 bytes (53260 MB)
SHA256: C653E5B736530711DB2212191853EAABBF36CF48820915F8B57DB54E1990BDC0</nowiki>
* <code>hashes.tsv</code> is a tab-separated value file with SHA-256 hashes of the files as downloaded from the USPTO.
* <code>index.tsv</code> is a tab-separated value file with the URL, last-modified datetime, and expected file size in bytes of each file.
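The hashes in <code>hashes.tsv</code> can be used to re-check the downloaded files after a copy or transfer. The following is a minimal sketch only, not the project's actual code: the class name <code>HashCheck</code> and its methods are hypothetical, and it assumes that each row of <code>hashes.tsv</code> has the form file name, tab, hex SHA-256 with no header row, and that the hashed files live in <code>zipfiles/</code>. Adjust the column order if the real file differs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the project's src/ directory.
public class HashCheck {

    /** Streams a file through SHA-256 and returns the digest as uppercase hex. */
    static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[1 << 16];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02X", b));
        }
        return hex.toString();
    }

    /**
     * Returns the names of files that are missing or whose hash differs.
     * ASSUMPTION: each row is "<file name>\t<hex SHA-256>" with no header row.
     */
    static List<String> findMismatches(Path hashesTsv, Path dir)
            throws IOException, NoSuchAlgorithmException {
        List<String> bad = new ArrayList<>();
        for (String line : Files.readAllLines(hashesTsv, StandardCharsets.UTF_8)) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] cols = line.split("\t");
            if (cols.length < 2) {
                bad.add(line); // malformed row: report it as-is
                continue;
            }
            String name = cols[0].trim();
            Path file = dir.resolve(name);
            if (!Files.isRegularFile(file)
                    || !sha256Hex(file).equalsIgnoreCase(cols[1].trim())) {
                bad.add(name);
            }
        }
        return bad;
    }

    public static void main(String[] args) throws Exception {
        // The base directory defaults to the project location given on this page.
        Path base = Paths.get(args.length > 0 ? args[0]
                : "E:\\McNair\\Projects\\SimplerPatentData");
        List<String> bad = findMismatches(base.resolve("hashes.tsv"), base.resolve("zipfiles"));
        System.out.println(bad.isEmpty() ? "All files match" : "Mismatched or missing: " + bad);
    }
}
```

Reading each file in fixed-size chunks keeps memory use constant, which matters for multi-gigabyte archives. The same <code>sha256Hex</code> method could also be pointed at <code>extracts.7z</code> and its result compared with the SHA256 value listed above.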
=== Input Files ===
The input is all of the text-only Red Book files for granted patents from 1976 to 2016, inclusive. To find a specific year's XML file, look in
<code>E:\McNair\Projects\SimplerPatentData\extracts</code>
== Schema Reconciliation ==
TODO
=== Processing ===
TODO
=== Attributes ===
== Related Projects ==