Difference between revisions of "Reproducible Patent Data"
Revision as of 23:30, 31 May 2017
Reproducible Patent Data | |
---|---|
Project Information | |
Project Title | Reproducible Patent Data |
Owner | Oliver Chang |
Start Date | May 17 |
Deadline | |
Primary Billing | |
Notes | |
Has project status | Active |
Subsumes: | Redesigning Patent Database, Patent Assignment Data Restructure |
Copyright © 2016 edegan.com. All Rights Reserved. |
This project is a continuation of Redesigning Patent Database. It aims to provide faster, more centralized code for handling data from the United States Patent and Trademark Office (USPTO). With an end-to-end pipeline, we can easily reproduce or update the data without worrying about unintentional side effects or missing data. Currently, the pipeline can bulk download from the USPTO; perform streaming file splitting, that is, split large concatenated files into their component parts in memory; and parse XML to Java objects, APS to Java Maps, and maintenance fee data to Java objects.
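To illustrate what streaming file splitting means here, the following is a minimal sketch rather than the project's actual splitter. It assumes that every component document in a concatenated file begins with an XML declaration (`<?xml`) at the start of a line; the class name `ConcatenatedXmlSplitter` and its `split` method are hypothetical.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: splits a stream of concatenated XML documents into
// individual documents without ever holding the whole input in memory.
final class ConcatenatedXmlSplitter {
    // Assumption: each component document starts with an XML declaration
    // at the beginning of a line.
    private static final String XML_DECLARATION = "<?xml";

    static void split(BufferedReader reader, Consumer<String> sink) {
        StringBuilder current = new StringBuilder();
        reader.lines().forEach(line -> {
            // A new declaration closes the document collected so far.
            if (line.startsWith(XML_DECLARATION) && current.length() > 0) {
                sink.accept(current.toString());
                current.setLength(0);
            }
            current.append(line).append('\n');
        });
        // Emit the final document, which has no declaration after it.
        if (current.length() > 0) {
            sink.accept(current.toString());
        }
    }

    public static void main(String[] args) {
        String concatenated =
            "<?xml version=\"1.0\"?>\n<a/>\n<?xml version=\"1.0\"?>\n<b/>\n";
        List<String> documents = new ArrayList<>();
        split(new BufferedReader(new StringReader(concatenated)), documents::add);
        System.out.println(documents.size() + " documents"); // prints "2 documents"
    }
}
```

Only one document is buffered at a time, so memory use depends on the largest single document rather than on the size of the concatenated file. In a real run, the sink would hand each document to the parser instead of collecting it in a list.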
Progress
- Downloader (done)
- Splitter (done)
- Parser (done)
- Create tooling for minions
- Set up PostgreSQL JDBC
- Create naive schema based on previous approaches
- Create new data structures
- Database Insert (modify models/ files with some mapping to database fields)
- Data Cleanup (reference Marcela and Sonia's work)
- Set up pipeline script to complete all of these steps in series
- Data Source Merger (currently only USPTO granted, maintenance fee, and assignment data; not USPTO applications, Harvard Dataverse, or Lex Machina)
Directory Layout
All of the information for this project is located at E:\McNair\Projects\SimplerPatentData
There are four interesting directories:
- data/downloads/ is the USPTO bulk data, unmodified, straight from the scraper
- data/extracts/ is a directory of a strict subset of the information stored in data/downloads/. It is the result of running a bulk 7-Zip job on that directory to get everything unzipped into a flat directory structure. Note that these files have the USPTO modified-by time since that metadata is stored in the zip files. To extract files in this format, select all of the zip files and set up an extraction job like in this screenshot
- data/backups/ is a 7-Zip'd backup of the corresponding directory in extracts
- src/ is the main code repository for the Java project
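The extracts directory was produced with a 7-Zip job, not with code. For readers who want to reproduce the same flat layout programmatically, the following is a rough Java equivalent, offered as a sketch under assumptions: the class name `FlatExtractor` is hypothetical, entries with the same file name overwrite each other, and the modification time is copied from the zip entry when the archive records one.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.FileTime;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical sketch: unzips one archive into a flat directory and keeps
// the modification time stored in the archive.
final class FlatExtractor {
    /** Extracts every file entry directly into targetDir; returns the number of files written. */
    static int extractFlat(Path zipPath, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        int count = 0;
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                // Keep only the last path component so the output stays flat.
                String name = entry.getName();
                String flatName = name.substring(name.lastIndexOf('/') + 1);
                Path out = targetDir.resolve(flatName);
                try (InputStream in = zip.getInputStream(entry)) {
                    Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
                }
                // Preserve the timestamp recorded in the zip file, if any.
                FileTime modified = entry.getLastModifiedTime();
                if (modified != null) {
                    Files.setLastModifiedTime(out, modified);
                }
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java FlatExtractor <archive.zip> <target directory>
        int extracted = extractFlat(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Extracted " + extracted + " files");
    }
}
```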
Input Files
The inputs are all of the text-only Red Book files for granted patents from 1976 to 2016, inclusive. To find a specific year's file (APS or XML, depending on the year), look in
E:\McNair\Projects\SimplerPatentData\data\extracts\granted\
To find assignment data, look in
E:\McNair\Projects\SimplerPatentData\data\extracts\granted\
To find maintenance fee data, look in
E:\McNair\Projects\SimplerPatentData\data\downloads\maintenance
Schema Reconciliation
Dates Used | Format | Supported by Parser? |
---|---|---|
January 1976 to December 2001 | APS | Only syntax |
January 2002 to December 2004 | XML Version 2.5 | Only syntax |
January 2005 to December 2005 | XML Version 4.0 ICE | Maybe |
January 2006 to December 2006 | XML Version 4.1 ICE | Maybe |
January 2007 to December 2012 | XML Version 4.2 ICE | Maybe |
January 2013 to September 24, 2013 | XML Version 4.3 ICE | Yes |
October 8, 2013 to December 2014 | XML Version 4.4 ICE | Yes |
January 2015 to December 2016 | XML Version 4.5 ICE | Yes |
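The table above can be read as a lookup from grant date to file format, which a parser could use to choose a code path. The sketch below is illustrative only: the class `GrantFormatLookup` and its method are hypothetical, and it assumes that month-only bounds in the table mean the first and last day of that month. Dates that fall outside every row, including the gap between September 24 and October 8, 2013, return an empty result rather than a guess.

```java
import java.time.LocalDate;
import java.util.Optional;

// Hypothetical sketch: maps a grant date to the format listed in the table.
final class GrantFormatLookup {
    private static final class Range {
        final LocalDate first;
        final LocalDate last;
        final String format;

        Range(LocalDate first, LocalDate last, String format) {
            this.first = first;
            this.last = last;
            this.format = format;
        }
    }

    // Assumption: month-only bounds are widened to whole months.
    private static final Range[] RANGES = {
        new Range(LocalDate.of(1976, 1, 1), LocalDate.of(2001, 12, 31), "APS"),
        new Range(LocalDate.of(2002, 1, 1), LocalDate.of(2004, 12, 31), "XML Version 2.5"),
        new Range(LocalDate.of(2005, 1, 1), LocalDate.of(2005, 12, 31), "XML Version 4.0 ICE"),
        new Range(LocalDate.of(2006, 1, 1), LocalDate.of(2006, 12, 31), "XML Version 4.1 ICE"),
        new Range(LocalDate.of(2007, 1, 1), LocalDate.of(2012, 12, 31), "XML Version 4.2 ICE"),
        new Range(LocalDate.of(2013, 1, 1), LocalDate.of(2013, 9, 24), "XML Version 4.3 ICE"),
        new Range(LocalDate.of(2013, 10, 8), LocalDate.of(2014, 12, 31), "XML Version 4.4 ICE"),
        new Range(LocalDate.of(2015, 1, 1), LocalDate.of(2016, 12, 31), "XML Version 4.5 ICE"),
    };

    /** Returns the format for the given grant date, or empty if no row covers it. */
    static Optional<String> formatFor(LocalDate grantDate) {
        for (Range range : RANGES) {
            if (!grantDate.isBefore(range.first) && !grantDate.isAfter(range.last)) {
                return Optional.of(range.format);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // prints "XML Version 4.4 ICE"
        System.out.println(formatFor(LocalDate.of(2013, 10, 8)).orElse("unknown"));
    }
}
```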
Attributes
Note: these values are likely to change without warning. For the latest version, see the actual files at E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\models.
Assignee
text fields: NAME, ADDR1, ADDR2, CITY, STATE, COUNTRY_NAME, POSTCODE
Assignment
text fields: REEL_NUMBER, FRAME_NUMBER, LAST_UPDATE_DATE, RECORDED_DATE, CONVEYANCE_TEXT
list fields: correspondents, assignors, assignees
Assignment Summary
text fields: LAST_NAME, FIRST_NAME, ORG_NAME, CITY, COUNTRY, STATE, ADDRESS, POSTCODE
Assignor
text fields: NAME, EXECUTION_DATE, DATE_ACKNOWLEDGED
Citation
text fields: CITED_PATENT_NUMBER, CITED_PATENT_COUNTRY, CITED_PATENT_KIND, CITED_PATENT_CATEGORY
Correspondent
text fields: NAME, ADDR1, ADDR2, ADDR3, ADDR4
GrantedPatent
text fields: PATENT_TYPE, TITLE, PCT_DOCUMENT_NUMBER, PATENT_COUNTRY, PATENT_NUMBER, PATENT_KIND, PATENT_GRANT_DATE, APPLICATION_NUMBER, APPLICATION_FILING_DATE, PRIORITY_CLAIMS_DATE, PRIORITY_CLAIMS_COUNTRY, PRIORITY_CLAIMS_PATENT_NUMBER, CLASSIFICATION_NATIONAL_COUNTRY, CLASSIFICATION_NATIONAL_CLASS, PRIMARY_EXAMINER_FIRST_NAME, PRIMARY_EXAMINER_LAST_NAME, PRIMARY_EXAMINER_DEPARTMENT
number fields: NUMBER_OF_CLAIMS
list fields: citations, scirefs, inventors, assignmentsummaries, lawyers
Inventor
text fields: SEQUENCE, LAST_NAME, FIRST_NAME, ORG_NAME, CITY, COUNTRY, STATE, ADDRESS, POSTCODE
Lawyer
text fields: SEQUENCE, LAST_NAME, FIRST_NAME, ORG_NAME, CITY, COUNTRY, STATE, ADDRESS, POSTCODE
MaintenanceFeeEvent
text fields: US_PATENT_NUMBER, US_APPLICATION_NUMBER, IS_SMALL_ENTITY, US_APPLICATION_FILING_DATE, US_GRANT_ISSUE_DATE, EVENT_ENTRY_DATE, EVENT_CODE
Sciref
text fields: CITATION_DESCRIPTION
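As one possible way to map the field lists above onto database columns for the planned Database Insert step, the sketch below builds a parameterized INSERT statement from a list of field names. It is an assumption-laden illustration, not the project's mapping: the class `InsertSqlBuilder`, the table name `citation`, and the convention of lowercasing field names into column names are all hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical sketch: turns a model's text field names into a
// parameterized INSERT statement suitable for a JDBC PreparedStatement.
final class InsertSqlBuilder {
    static String insertSql(String table, List<String> fields) {
        if (fields.isEmpty()) {
            throw new IllegalArgumentException("at least one field is required");
        }
        // Assumption: column names are the lowercased field names.
        String columns = fields.stream()
            .map(field -> field.toLowerCase(Locale.ROOT))
            .collect(Collectors.joining(", "));
        // One "?" placeholder per field keeps the values out of the SQL text.
        String placeholders = String.join(", ", Collections.nCopies(fields.size(), "?"));
        return "INSERT INTO " + table + " (" + columns + ") VALUES (" + placeholders + ")";
    }

    public static void main(String[] args) {
        List<String> citationFields = Arrays.asList(
            "CITED_PATENT_NUMBER", "CITED_PATENT_COUNTRY",
            "CITED_PATENT_KIND", "CITED_PATENT_CATEGORY");
        // prints: INSERT INTO citation (cited_patent_number, cited_patent_country,
        // cited_patent_kind, cited_patent_category) VALUES (?, ?, ?, ?)
        System.out.println(insertSql("citation", citationFields));
    }
}
```

The resulting string can be passed to `java.sql.Connection.prepareStatement`, with each value bound by position. Because the table and column names are concatenated into the SQL text, they should come only from constants in the code, never from parsed patent data.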
New Schema
Rough sketch: https://app.quickdatabasediagrams.com/#/schema/Huo3bW9jK065GlXoTitReQ