Reproducible Patent Data
Reproducible Patent Data | |
---|---|
Project Information | |
Project Title | Reproducible Patent Data |
Owner | Oliver Chang |
Start Date | May 17 |
Deadline | |
Primary Billing | |
Notes | |
Has project status | Active |
Subsumes: | Redesigning Patent Database, Patent Assignment Data Restructure |
Copyright © 2016 edegan.com. All Rights Reserved. |
A continuation of Redesigning Patent Database that aims to write faster, more centralized code for handling the USPTO data. With an end-to-end pipeline, we can reproduce or update the data without worrying about unintended side effects or missing data. Currently, the pipeline handles three stages: bulk downloading from the USPTO; streaming file splitting (splitting large concatenated files into their component parts in memory); and parsing XML to Java objects, APS to Java Maps, and maintenance fee data to Java objects.
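The streaming split described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual splitter: the class and method names are hypothetical, and it assumes that every component document in a concatenated file begins with an XML declaration (`<?xml`) at the start of a line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch of a streaming splitter; names are illustrative only. */
final class ConcatenatedXmlSplitter {
    // Assumption: each component document starts with an XML declaration.
    private static final String XML_DECLARATION = "<?xml";

    /** Reads the input line by line and hands each complete document to the sink. */
    static void split(Reader input, Consumer<String> sink) throws IOException {
        BufferedReader reader = new BufferedReader(input);
        StringBuilder current = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            // A new declaration means the previous document is complete.
            if (line.startsWith(XML_DECLARATION) && current.length() > 0) {
                sink.accept(current.toString());
                current.setLength(0);
            }
            current.append(line).append('\n');
        }
        // Emit the final document, which has no declaration following it.
        if (current.length() > 0) {
            sink.accept(current.toString());
        }
    }

    /** Convenience wrapper for small inputs that fit in a single String. */
    static List<String> splitAll(String concatenated) {
        List<String> documents = new ArrayList<>();
        try {
            split(new StringReader(concatenated), documents::add);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return documents;
    }
}
```

Only one component document is held in memory at a time, so the sink can parse or discard each document before the next one is read.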
Progress
- Downloader (done)
- Splitter (done)
- Parser (done)
- Data Source Merger (only USPTO, not Harvard Dataverse or Lex Machina)
- Database Insert (modify models/ files with some mapping to database fields)
- Data Cleanup (reference Marcela and Sonia's work)
Directory Layout
All of the information for this project is located at E:\McNair\Projects\SimplerPatentData
There are four interesting directories:
- data/downloads/ is USPTO bulk data, unmodified, straight from the scraper
- data/extracts/ is a directory containing a strict subset of the information stored in data/downloads/. It is the result of running a bulk 7-Zip job on that directory to unzip everything into a flat directory structure. Note that these files have the USPTO modified-by time, since that metadata is stored in the zip files. To extract files in this format, select all of the zip files and set up an extraction job like the one in this screenshot
- data/backups/ is a 7-Zip'd backup of the corresponding directory in extracts
- src/ is the main code repository for the Java project
Input Files
The input files are all of the text-only Red Book files for granted patents from 1976 to 2016, inclusive. To find a specific year's XML file, look in
E:\McNair\Projects\SimplerPatentData\data\extracts\granted\
To find assignment data, look in
E:\McNair\Projects\SimplerPatentData\data\extracts\granted\
To find maintenance fee data, look in
E:\McNair\Projects\SimplerPatentData\data\downloads\maintenance
Schema Reconciliation
Dates Used | Format | Supported by Parser? |
---|---|---|
January 1976 to December 2001 | APS | Yes (syntactic parsing but little semantic knowledge) |
January 2002 to December 2004 | XML Version 2.5 | No |
January 2005 to December 2005 | XML Version 4.0 ICE | Maybe |
January 2006 to December 2006 | XML Version 4.1 ICE | Maybe |
January 2007 to December 2012 | XML Version 4.2 ICE | Maybe |
January 2013 to September 24, 2013 | XML Version 4.3 ICE | Yes |
October 8, 2013 to December 2014 | XML Version 4.4 ICE | Yes |
January 2015 to December 2016 | XML Version 4.5 ICE | Yes |
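The table above can be expressed as a lookup from issue date to file format, for example to pick which parser to dispatch to. This is only a sketch: the class, enum, and method names are hypothetical and do not come from the project's source. Dates that fall outside the table, including the gap between September 24 and October 8, 2013, are reported as unknown instead of being guessed.

```java
import java.time.LocalDate;
import java.util.Optional;

/** Hypothetical lookup that mirrors the Schema Reconciliation table. */
final class RedBookFormats {
    enum Format { APS, XML_2_5, XML_4_0_ICE, XML_4_1_ICE, XML_4_2_ICE, XML_4_3_ICE, XML_4_4_ICE, XML_4_5_ICE }

    // Boundary dates taken from the table for the 2013 changeover.
    private static final LocalDate LAST_4_3 = LocalDate.of(2013, 9, 24);
    private static final LocalDate FIRST_4_4 = LocalDate.of(2013, 10, 8);

    /** Returns the format used on the given date, or empty if the table does not cover it. */
    static Optional<Format> formatFor(LocalDate issueDate) {
        int year = issueDate.getYear();
        if (year < 1976 || year > 2016) return Optional.empty();
        if (year <= 2001) return Optional.of(Format.APS);
        if (year <= 2004) return Optional.of(Format.XML_2_5);
        if (year == 2005) return Optional.of(Format.XML_4_0_ICE);
        if (year == 2006) return Optional.of(Format.XML_4_1_ICE);
        if (year <= 2012) return Optional.of(Format.XML_4_2_ICE);
        if (year == 2013) {
            if (!issueDate.isAfter(LAST_4_3)) return Optional.of(Format.XML_4_3_ICE);
            if (!issueDate.isBefore(FIRST_4_4)) return Optional.of(Format.XML_4_4_ICE);
            return Optional.empty(); // between the two boundary dates: not covered by the table
        }
        if (year == 2014) return Optional.of(Format.XML_4_4_ICE);
        return Optional.of(Format.XML_4_5_ICE); // 2015 and 2016
    }
}
```

Returning `Optional` forces callers to decide what to do with uncovered dates, which seems safer than silently defaulting to the nearest version.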
Attributes
Note: these values are likely to change without warning. For the latest version of these, see the actual files at E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\models.
Assignee
text fields: NAME, ADDR1, ADDR2, CITY, STATE, COUNTRY_NAME, POSTCODE
Assignment
text fields: REEL_NUMBER, FRAME_NUMBER, LAST_UPDATE_DATE, RECORDED_DATE, CONVEYANCE_TEXT
lists: correspondents, assignors, assignees
Assignment Summary
text fields: LAST_NAME, FIRST_NAME, ORG_NAME, CITY, COUNTRY, STATE, ADDRESS, POSTCODE
Assignor
Citation
Correspondent
GrantedPatent
Inventor
Lawyer
MaintenanceFeeEvent
Sciref
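As an illustration of how the attributes listed above might map onto model classes, here is a minimal sketch of Assignee and Assignment. It is an assumption about shape only: the real classes in the models directory may use different field names, types, and accessors, and the Assignor and Correspondent placeholders are left empty because their fields are not listed on this page.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model sketch; field names are derived from the attribute list. */
final class Assignee {
    String name;
    String addr1;
    String addr2;
    String city;
    String state;
    String countryName;
    String postcode;
}

/** Placeholder: fields for this model are not listed on this page. */
final class Assignor { }

/** Placeholder: fields for this model are not listed on this page. */
final class Correspondent { }

final class Assignment {
    // Text fields are kept as raw strings in this sketch; no date parsing is attempted.
    String reelNumber;
    String frameNumber;
    String lastUpdateDate;
    String recordedDate;
    String conveyanceText;

    // One assignment can reference several parties of each kind.
    final List<Correspondent> correspondents = new ArrayList<>();
    final List<Assignor> assignors = new ArrayList<>();
    final List<Assignee> assignees = new ArrayList<>();
}
```

Keeping the text fields as plain strings defers cleanup, such as date normalization, to a later stage of the pipeline.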
New Schema
Rough sketch: https://app.quickdatabasediagrams.com/#/schema/Huo3bW9jK065GlXoTitReQ