Reproducible Patent Data
This project continues Redesigning Patent Database and aims to write faster, more centralized code for processing the USPTO data. With an end-to-end pipeline, we can reproduce or update the data without worrying about unintended side effects or missing data.
Reproducible Patent Data | |
---|---|
Project Information | |
Project Title | Reproducible Patent Data |
Owner | Oliver Chang |
Start Date | May 17 |
Deadline | |
Primary Billing | |
Notes | |
Has project status | Active |
Subsumes: | Redesigning Patent Database, Patent Assignment Data Restructure |
Directory Layout
All of the information for this project is located at E:\McNair\Projects\SimplerPatentData
There are three interesting directories:
- zipfiles/ is the USPTO bulk data, unmodified and validated to have the correct file size.
- extracts/ is a directory containing a strict subset of the information stored in zipfiles/. It is the result of running a bulk 7-zip job on that directory to unzip everything into a flat directory structure. Note that these files carry the USPTO modified-by time, since that metadata is stored in the zip files.
- src/ is the main code repository for the Java project.
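The flat extraction described above was done with a bulk 7-zip job. For illustration only, here is a minimal sketch of the same idea using `java.util.zip` from the standard library instead of 7-zip. The class name `FlatExtractor` and the method `extractAll` are hypothetical and are not part of the project's `src/` code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: a stand-in for the bulk 7-zip job, not the project's code.
final class FlatExtractor {

    /**
     * Unzips every *.zip file in zipDir into outDir, dropping any directory
     * structure inside the archives. Returns the number of files written.
     */
    static int extractAll(Path zipDir, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        int written = 0;
        try (DirectoryStream<Path> zips = Files.newDirectoryStream(zipDir, "*.zip")) {
            for (Path zip : zips) {
                try (ZipFile zf = new ZipFile(zip.toFile())) {
                    Enumeration<? extends ZipEntry> entries = zf.entries();
                    while (entries.hasMoreElements()) {
                        ZipEntry entry = entries.nextElement();
                        if (entry.isDirectory()) {
                            continue;
                        }
                        // Keep only the last path component so the output stays flat
                        // (this also prevents entries from escaping outDir).
                        Path name = Paths.get(entry.getName()).getFileName();
                        Path target = outDir.resolve(name.toString());
                        try (InputStream in = zf.getInputStream(entry)) {
                            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                        }
                        // Preserve the modified time recorded in the archive, if any.
                        if (entry.getLastModifiedTime() != null) {
                            Files.setLastModifiedTime(target, entry.getLastModifiedTime());
                        }
                        written++;
                    }
                }
            }
        }
        return written;
    }
}
```

Note that files with the same name in different archives would overwrite each other in this sketch; whether that can happen with the USPTO archives has not been checked here.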
In addition, there are three interesting files in the base directory:
- extracts.7z is an archived version of the extracts/ directory for backup and transfer reasons.
  Name: extracts.7z
  Size: 55847284301 bytes (53260 MB)
  SHA256: C653E5B736530711DB2212191853EAABBF36CF48820915F8B57DB54E1990BDC0
- hashes.tsv is a tab-separated values file with SHA-256 hashes of the files as downloaded from the USPTO.
- index.tsv is a tab-separated values file with the URLs, modified-by datetimes, and expected file sizes in bytes.
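To check a file against a recorded SHA-256 value (for example the hash listed for extracts.7z, or a row of hashes.tsv), something along these lines could be used. This is a sketch under assumptions: the class `Sha256Check` and its methods are hypothetical, and since the column layout of hashes.tsv is not documented here, the expected hash is passed in as a plain string.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for verifying downloads; not the project's code.
final class Sha256Check {

    /** Streams the file through SHA-256 and returns the digest as uppercase hex. */
    static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[1 << 16];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02X", b));
        }
        return hex.toString();
    }

    /** Compares the file's hash with an expected hex string, ignoring case. */
    static boolean matches(Path file, String expectedHex)
            throws IOException, NoSuchAlgorithmException {
        return sha256Hex(file).equalsIgnoreCase(expectedHex.trim());
    }
}
```

Streaming the file in chunks matters here because extracts.7z is roughly 53 GB and cannot be read into memory at once.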
Input Files
The inputs are all of the text-only Red Book files for granted patents from 1976 to 2016, inclusive. To locate the file for a specific year, look in
E:\McNair\Projects\SimplerPatentData\extracts
Schema Reconciliation
Date Range | Format |
---|---|
January 1976 to December 2001 | APS |
January 2002 to December 2004 | XML Version 2.5 |
January 2005 to December 2005 | XML Version 4.0 ICE |
January 2006 to December 2006 | XML Version 4.1 ICE |
January 2007 to December 2012 | XML Version 4.2 ICE |
January 2014 to June 2014 (to-check) | XML Version 4.3 ICE |
July 2014 to December 2014 (to-check) | XML Version 4.4 ICE |
January 2015 to December 2016 | XML Version 4.5 ICE |
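The table above can be turned into a lookup that picks a parser by grant date. The following is a sketch, not the project's implementation: `GrantFormats` and `formatFor` are hypothetical names, and the ranges are copied from the table as written, including the rows still marked to-check. Months the table does not cover return an empty result instead of a guess.

```java
import java.time.YearMonth;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical lookup mirroring the schema table; not the project's code.
final class GrantFormats {

    private static final class Range {
        final YearMonth from;
        final YearMonth to;
        final String format;

        Range(int fromYear, int fromMonth, int toYear, int toMonth, String format) {
            this.from = YearMonth.of(fromYear, fromMonth);
            this.to = YearMonth.of(toYear, toMonth);
            this.format = format;
        }
    }

    // Ranges are inclusive on both ends, exactly as listed in the table.
    private static final List<Range> RANGES = Arrays.asList(
        new Range(1976, 1, 2001, 12, "APS"),
        new Range(2002, 1, 2004, 12, "XML Version 2.5"),
        new Range(2005, 1, 2005, 12, "XML Version 4.0 ICE"),
        new Range(2006, 1, 2006, 12, "XML Version 4.1 ICE"),
        new Range(2007, 1, 2012, 12, "XML Version 4.2 ICE"),
        new Range(2014, 1, 2014, 6, "XML Version 4.3 ICE"),  // to-check
        new Range(2014, 7, 2014, 12, "XML Version 4.4 ICE"), // to-check
        new Range(2015, 1, 2016, 12, "XML Version 4.5 ICE")
    );

    /** Returns the format for a grant month, or empty if the table has no row for it. */
    static Optional<String> formatFor(YearMonth month) {
        for (Range r : RANGES) {
            if (!month.isBefore(r.from) && !month.isAfter(r.to)) {
                return Optional.of(r.format);
            }
        }
        return Optional.empty();
    }
}
```

For example, `GrantFormats.formatFor(YearMonth.of(1999, 5))` yields `APS`, while a month with no row in the table yields `Optional.empty()`.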
Processing
TODO
Attributes
Value | XML 2.5 | XML 4.0 | XML 4.1 | XML 4.2 | XML 4.3 | XML 4.4 | XML 4.5 |
---|---|---|---|---|---|---|---|
DTD Version | //us-patent-grant/@dtd-version | | | | | | |
Publication Reference ID | | | | | | | |
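As an illustration of how an XPath from the table could be evaluated, here is a minimal sketch using the JDK's built-in `javax.xml.xpath` API. The class `DtdVersionReader` is hypothetical, and the sketch assumes a single, self-contained patent document passed in as a string. Real bulk files may bundle many documents and reference external DTDs, so a real pipeline would likely need to split and prepare them first; that step is not shown.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper for the attribute table; not the project's code.
final class DtdVersionReader {

    // XPath taken from the "DTD Version" row of the table.
    private static final String DTD_VERSION_XPATH = "//us-patent-grant/@dtd-version";

    /**
     * Returns the dtd-version attribute of the us-patent-grant element,
     * or an empty string if the element or attribute is absent.
     */
    static String dtdVersion(String xml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(DTD_VERSION_XPATH, new InputSource(new StringReader(xml)));
    }
}
```

Calling `DtdVersionReader.dtdVersion("<us-patent-grant dtd-version=\"sample\"/>")` returns `sample`; the attribute value here is made up for the example.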
TODO (in order)
- Flesh out equivalencies for all XML schemas
- Ditto APS
- Ditto Maintenance Fee Data
- Create a new DB schema using a less centralized conception of a patent
- Check correctness versus existing data
- Store abstracts, processing metadata
- Investigate USPTO products for addressees
- Handle addressee data