Reproducible Patent Data
This project continues Redesigning Patent Database and aims to provide faster, more centralized code for handling USPTO data. With an end-to-end pipeline, we can reproduce or update the data without worrying about unintended side effects or missing data. It currently handles three stages: bulk downloading from the USPTO; streaming file splitting, that is, splitting large concatenated files into their component parts in memory; and parsing of XML to Java objects, APS to Java Maps, and maintenance fee data to Java objects.
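The streaming split described above can be sketched as follows. This is a minimal illustration, not the project's actual splitter: the class name `ConcatenatedXmlSplitter` and its `split` method are hypothetical, and the sketch assumes that every component document in a concatenated file begins with an XML declaration (`<?xml`) at the start of a line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

/** Hypothetical splitter for illustration; not the class used in src/. */
final class ConcatenatedXmlSplitter {
    // Assumption: each component document starts with an XML declaration on its own line.
    private static final String DOCUMENT_START = "<?xml";

    /**
     * Reads the source line by line and passes each complete document to the sink
     * as soon as the next document begins, so only one document is buffered at a time.
     */
    static void split(Reader source, Consumer<String> sink) {
        StringBuilder current = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // A new declaration closes the document collected so far.
                if (line.startsWith(DOCUMENT_START) && current.length() > 0) {
                    sink.accept(current.toString());
                    current.setLength(0);
                }
                current.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Emit the final document, which has no following declaration.
        if (current.length() > 0) {
            sink.accept(current.toString());
        }
    }
}
```

Because the sink receives one document at a time, a caller could hand each document straight to the parser without ever holding the whole concatenated file in memory.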
Reproducible Patent Data | |
---|---|
Project Information | |
Project Title | Reproducible Patent Data |
Owner | Oliver Chang |
Start Date | May 17 |
Deadline | |
Primary Billing | |
Notes | |
Has project status | Active |
Subsumes: | Redesigning Patent Database, Patent Assignment Data Restructure |
Progress
- Downloader (done)
- Splitter (done)
- Parser (done)
- Data Source Merger (only USPTO, not Harvard Dataverse or Lex Machina)
- Database Insert (modify models/ files with some mapping to database fields)
- Data Cleanup (reference Marcela and Sonia's work)
Directory Layout
All of the information for this project is located at E:\McNair\Projects\SimplerPatentData
There are four interesting directories:
- data/downloads/ is USPTO bulk data, unmodified, straight from the scraper.
- data/extracts/ holds a strict subset of the information stored in data/downloads/. It is the result of running a bulk 7-Zip job on that directory to unzip everything into a flat directory structure. Note that these files keep the USPTO modification times, since that metadata is stored in the zipfiles. To extract files in this format, select all of the zipfiles and set up a single extraction job.
- data/backups/ is a 7-Zip'd backup of the corresponding directory in extracts.
- src/ is the main code repository for the Java project.
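The flat extraction described for data/extracts/ is done with 7-Zip, but the same idea can be sketched in Java with the standard `java.util.zip` classes. This is an illustrative sketch only: the class name `FlatExtractor` and method `extractFlat` are hypothetical, and it assumes that dropping directory prefixes and overwriting files with identical names is acceptable.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.FileTime;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper for illustration; the project itself uses a 7-Zip job. */
final class FlatExtractor {

    /** Unpacks every file in the zip directly into targetDir, ignoring folders inside the archive. */
    static void extractFlat(Path zipPath, Path targetDir) {
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            Files.createDirectories(targetDir);
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                // Keep only the last path element so the output stays flat.
                Path fileName = Paths.get(entry.getName()).getFileName();
                Path target = targetDir.resolve(fileName.toString());
                try (InputStream in = zip.getInputStream(entry)) {
                    Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                }
                // Carry over the modification time stored in the zip entry, if present.
                FileTime modified = entry.getLastModifiedTime();
                if (modified != null) {
                    Files.setLastModifiedTime(target, modified);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `extractFlat` once per zipfile in data/downloads/ with the same target directory would approximate the single bulk job, with later files silently replacing earlier ones of the same name.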
Input Files
The inputs are all of the text-only Red Book files for granted patents from 1976 to 2016, inclusive. To find a specific year's file (APS or XML, depending on the year), look in
E:\McNair\Projects\SimplerPatentData\data\extracts\granted\
To find assignment data, look in
E:\McNair\Projects\SimplerPatentData\data\extracts\granted\
To find maintenance fee data, look in
E:\McNair\Projects\SimplerPatentData\data\downloads\maintenance
Schema Reconciliation
Dates Used | Format | Supported by Parser? |
---|---|---|
January 1976 to December 2001 | APS | Yes (syntactic parsing but little semantic knowledge) |
January 2002 to December 2004 | XML Version 2.5 | No |
January 2005 to December 2005 | XML Version 4.0 ICE | Maybe |
January 2006 to December 2006 | XML Version 4.1 ICE | Maybe |
January 2007 to December 2012 | XML Version 4.2 ICE | Maybe |
January 2013 to September 24, 2013 | XML Version 4.3 ICE | Yes |
October 8, 2013 to December 2014 | XML Version 4.4 ICE | Yes |
January 2015 to December 2016 | XML Version 4.5 ICE | Yes |
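The table above can be expressed as a small lookup from grant date to file format. This is a sketch under stated assumptions rather than the parser's real dispatch logic: the class `GrantFormats`, the enum `Format`, and the method `forGrantDate` are hypothetical names, and dates that the table does not cover (before 1976, after 2016, or between September 24 and October 8, 2013) are reported as `UNKNOWN`.

```java
import java.time.LocalDate;

/** Hypothetical lookup for illustration; mirrors the schema table row by row. */
final class GrantFormats {

    enum Format { APS, XML_2_5, XML_4_0_ICE, XML_4_1_ICE, XML_4_2_ICE, XML_4_3_ICE, XML_4_4_ICE, XML_4_5_ICE, UNKNOWN }

    // Boundary dates taken from the table.
    private static final LocalDate LAST_4_3 = LocalDate.of(2013, 9, 24);
    private static final LocalDate FIRST_4_4 = LocalDate.of(2013, 10, 8);

    static Format forGrantDate(LocalDate date) {
        int year = date.getYear();
        if (year < 1976 || year > 2016) return Format.UNKNOWN;
        if (year <= 2001) return Format.APS;
        if (year <= 2004) return Format.XML_2_5;
        if (year == 2005) return Format.XML_4_0_ICE;
        if (year == 2006) return Format.XML_4_1_ICE;
        if (year <= 2012) return Format.XML_4_2_ICE;
        if (year == 2013) {
            if (!date.isAfter(LAST_4_3)) return Format.XML_4_3_ICE;
            if (!date.isBefore(FIRST_4_4)) return Format.XML_4_4_ICE;
            // The table lists no format for dates strictly between the two boundaries.
            return Format.UNKNOWN;
        }
        if (year == 2014) return Format.XML_4_4_ICE;
        return Format.XML_4_5_ICE; // 2015 and 2016
    }
}
```

Returning `UNKNOWN` for uncovered dates, instead of guessing the nearest version, keeps the gap in the table visible to whoever calls the lookup.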
Attributes
Note: these values are likely to change without warning. For the latest version of these, see the actual files at E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\models.
Assignee
text fields: NAME, ADDR1, ADDR2, CITY, STATE, COUNTRY_NAME, POSTCODE
Assignment
text fields: REEL_NUMBER, FRAME_NUMBER, LAST_UPDATE_DATE, RECORDED_DATE, CONVEYANCE_TEXT
lists: correspondents, assignors, assignees
Assignment Summary
text fields: LAST_NAME, FIRST_NAME, ORG_NAME, CITY, COUNTRY, STATE, ADDRESS, POSTCODE
Assignor
text fields: NAME, EXECUTION_DATE, DATE_ACKNOWLEDGED
Citation
text fields: CITED_PATENT_NUMBER, CITED_PATENT_COUNTRY, CITED_PATENT_KIND, CITED_PATENT_CATEGORY
Correspondent
text fields: NAME, ADDR1, ADDR2, ADDR3, ADDR4
GrantedPatent
text fields: PATENT_TYPE, TITLE, PCT_DOCUMENT_NUMBER, PATENT_COUNTRY, PATENT_NUMBER,
PATENT_KIND, PATENT_GRANT_DATE,
APPLICATION_NUMBER, APPLICATION_FILING_DATE,
PRIORITY_CLAIMS_DATE, PRIORITY_CLAIMS_COUNTRY, PRIORITY_CLAIMS_PATENT_NUMBER,
CLASSIFICATION_NATIONAL_COUNTRY, CLASSIFICATION_NATIONAL_CLASS,
PRIMARY_EXAMINER_FIRST_NAME, PRIMARY_EXAMINER_LAST_NAME, PRIMARY_EXAMINER_DEPARTMENT
number fields: NUMBER_OF_CLAIMS
list fields: citations, scirefs, inventors, assignmentsummaries, lawyers
Inventor
text fields: SEQUENCE, LAST_NAME, FIRST_NAME, ORG_NAME, CITY, COUNTRY, STATE, ADDRESS, POSTCODE
Lawyer
text fields: SEQUENCE, LAST_NAME, FIRST_NAME, ORG_NAME, CITY, COUNTRY, STATE, ADDRESS, POSTCODE
MaintenanceFeeEvent
text fields: US_PATENT_NUMBER, US_APPLICATION_NUMBER, IS_SMALL_ENTITY,
US_APPLICATION_FILING_DATE, US_GRANT_ISSUE_DATE, EVENT_ENTRY_DATE,
EVENT_CODE
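To show how these fields might map onto a Java object, here is a rough sketch. It is not the model class from the repository: the class name `MaintenanceFeeEventSketch`, its field names, and `parseLine` are hypothetical, and the sketch assumes each input line holds the seven fields above in the listed order, separated by whitespace. The sample line in the usage note is invented for illustration.

```java
/** Hypothetical model for illustration; see the models directory for the real class. */
final class MaintenanceFeeEventSketch {
    // All fields are kept as text, matching the "text fields" listing above.
    final String usPatentNumber;
    final String usApplicationNumber;
    final String isSmallEntity;
    final String usApplicationFilingDate;
    final String usGrantIssueDate;
    final String eventEntryDate;
    final String eventCode;

    private MaintenanceFeeEventSketch(String[] fields) {
        usPatentNumber = fields[0];
        usApplicationNumber = fields[1];
        isSmallEntity = fields[2];
        usApplicationFilingDate = fields[3];
        usGrantIssueDate = fields[4];
        eventEntryDate = fields[5];
        eventCode = fields[6];
    }

    /** Assumption: one event per line, seven whitespace-separated fields in the listed order. */
    static MaintenanceFeeEventSketch parseLine(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 7) {
            throw new IllegalArgumentException("Expected 7 fields but found " + fields.length + ": " + line);
        }
        return new MaintenanceFeeEventSketch(fields);
    }
}
```

For example, `MaintenanceFeeEventSketch.parseLine("1234567 01234567 N 19800603 19810901 19880926 M170")` would yield an object whose `eventCode` is `M170`. Rejecting lines with the wrong field count surfaces format drift early instead of silently shifting values into the wrong fields.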
Sciref
text fields: CITATION_DESCRIPTION