Reproducible Patent Data
Reproducible Patent Data is a continuation of Redesigning Patent Database that aims to provide faster, more centralized code for working with data from the United States Patent and Trademark Office (USPTO). With an end-to-end pipeline, we can reproduce or update the data without worrying about unintentional side effects or missing data. Currently, the project handles three stages: bulk downloading from the USPTO; streaming file splitting, that is, splitting large concatenated files into their component parts in memory; and parsing (XML to Java objects, APS to Java Maps, and maintenance fee data to Java objects).
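The streaming split mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's actual splitter: it assumes every component document in a concatenated file starts with an XML declaration (`<?xml`) at the beginning of a line, and the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

/** Hypothetical sketch of a streaming splitter; names are illustrative only. */
class ConcatenatedXmlSplitter {
    /** Assumed delimiter: an XML declaration at the start of a line. */
    private static final String DECLARATION = "<?xml";

    /**
     * Reads the input line by line and passes each complete document to the sink,
     * so only one component document is buffered in memory at a time.
     * Returns the number of documents emitted.
     */
    static int split(Reader input, Consumer<String> sink) {
        int count = 0;
        StringBuilder current = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // A new declaration closes the document collected so far.
                if (line.startsWith(DECLARATION) && current.length() > 0) {
                    sink.accept(current.toString());
                    count++;
                    current.setLength(0);
                }
                current.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Emit the final document, which has no declaration after it.
        if (current.length() > 0) {
            sink.accept(current.toString());
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up input: two tiny documents concatenated into one stream.
        String concatenated =
            "<?xml version=\"1.0\"?>\n<a/>\n<?xml version=\"1.0\"?>\n<b/>\n";
        int n = split(new StringReader(concatenated),
            doc -> System.out.print("--- document ---\n" + doc));
        System.out.println(n + " documents");
    }
}
```

Passing a `Consumer` rather than returning a list lets the caller parse or store each document as soon as it is complete, which is what keeps memory use bounded by the largest single document.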
Reproducible Patent Data | |
---|---|
Project Information | |
Project Title | Reproducible Patent Data |
Owner | Oliver Chang |
Start Date | May 17 |
Deadline | |
Primary Billing | |
Notes | |
Has project status | Active |
Subsumes: | Redesigning Patent Database, Patent Assignment Data Restructure |
Progress
- Downloader (done)
- Splitter (done)
- Parser (done)
- Setup PostgreSQL JDBC (done)
- Create naive schema based on previous approaches (done)
- Create new data structures (done)
- Database Insert (modify models/ files with some mapping to database fields) (done)
- Create tooling for minions
- Create XPath queries for reissue, design patents (only utility right now)
- Data Cleanup (reference Marcela and Sonia's work)
- Investigate parallel speedup (e.g. multithread, mmap)
- Data Source Merger (currently covers only USPTO granted patents, maintenance fees, and assignments; not USPTO applications, Harvard Dataverse, or Lex Machina)
- Set up a pipeline script to run all of these steps in series
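For the XPath item in the list above, a query for the patent type might look something like the sketch below. It uses only the JDK's built-in `javax.xml.xpath` API. The element and attribute names (`us-patent-grant`, `application-reference/@appl-type`, `doc-number`) are assumptions about the 4.x ICE grant format and should be checked against the USPTO DTDs; the sample document and the class name are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical sketch; element names are assumptions, not confirmed by this page. */
class GrantXPathSketch {
    /** Made-up, heavily abbreviated grant record used only for this example. */
    static final String SAMPLE =
        "<us-patent-grant><us-bibliographic-data-grant>"
        + "<publication-reference><document-id><doc-number>RE012345</doc-number></document-id></publication-reference>"
        + "<application-reference appl-type=\"reissue\"><document-id><doc-number>12345678</doc-number></document-id></application-reference>"
        + "</us-bibliographic-data-grant></us-patent-grant>";

    /** Assumed location of the patent type (utility, reissue, design). */
    static final String TYPE_QUERY =
        "/us-patent-grant/us-bibliographic-data-grant/application-reference/@appl-type";

    /** Assumed location of the published patent number. */
    static final String NUMBER_QUERY =
        "/us-patent-grant/us-bibliographic-data-grant/publication-reference/document-id/doc-number";

    /** Parses the XML text and returns the string value of the XPath expression. */
    static String evaluate(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc).trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println("type:   " + evaluate(SAMPLE, TYPE_QUERY));
        System.out.println("number: " + evaluate(SAMPLE, NUMBER_QUERY));
    }
}
```

If the patent type really is exposed this way, reissue and design support could start by branching on that value before applying type-specific queries.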
Directory Layout
Where is the Data?
Directories
All of the information for this project is located at E:\McNair\Projects\SimplerPatentData
There are several interesting directories:
- data/downloads/ is USPTO bulk data, unmodified, straight from the scraper
- data/extracts/ is a directory of a strict subset of the information stored in data/downloads/. It is the result of running a bulk 7-Zip job on that directory to get everything unzipped in a flat data structure. Note that these files have the USPTO modified-by time, since that metadata is stored in the zip files. To extract files in this nice format, select all of the zip files and set up an extraction job like in this screenshot
- data/backups/ is a 7-Zip backup of the corresponding directory in extracts
- src/ is the main code repository for the Java project
Input Files
The inputs are all of the text-only Red Book files for granted patents from 1976 to 2016, inclusive. To find a specific year's file, look in
E:\McNair\Projects\SimplerPatentData\data\extracts\granted\
To find assignment data, look in
E:\McNair\Projects\SimplerPatentData\data\extracts\granted\
To find maintenance fee data, look in
E:\McNair\Projects\SimplerPatentData\data\downloads\maintenance
Where is the Code?
The code has the same parent directory as the data, so it is at E:\McNair\Projects\SimplerPatentData\src. You might notice a lot of single-entry directories; this is an idiomatic Java pattern that is used for package separation. If using IntelliJ or some other IDE, these directories are a bit less annoying.
The development environment is Java 8 JDK, IntelliJ Ultimate IDE, Maven build tools, and git VCS.
The git repository can be found at https://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent
Prior Art
This tool is not so concerned with adding new functionality; rather, it aims to take a bunch of spread out Perl scripts and create a faster system that is easier to work with. As such, its functionality is largely stolen from those scripts:
- Downloader:
E:\McNair\Software\Scripts\Patent\USPTO_Parser.pl
- XML Splitter:
E:\McNair\PatentData\splitter.pl
- XML Parsing:
E:\McNair\PatentData\Processed\xmlparser_4.5_4.4_4.3.pl
and E:\McNair\PatentData\Processed\*.pm
In addition, I used several non-standard Java libraries listed below:
- Unirest for easy HTTP requests (MIT License)
- Google Guava for immutable collections and Stream utilities (Apache v2.0 License)
- jsoup for HTML parsing (MIT License)
- Apache Commons Codec (Apache v2.0 License)
- Apache Commons Lang v3 (Apache v2.0 License)
- Jetbrains Annotations for enhanced null checks (Apache v2.0 License)
- PostgreSQL JDBC (BSD 3-clause per https://github.com/pgjdbc/pgjdbc-jre7/blob/master/LICENSE)
If using Maven, these dependencies are listed in the project's pom.xml and should be set up automatically.
Design
E:\McNair\Projects\Market for Ideas
E:\McNair\Projects\Little Guy Academic Paper
TODO
Using Code
TODO
Altering Code
TODO
Schema Reconciliation
Dates Used | Format | Supported by Parser? | Utility | Reissue | Design |
---|---|---|---|---|---|
January 1976 to December 2001 | APS | Only syntax | | | |
 | | Ignored; use concurrently recorded APS data | ✗ | ✗ | ✗ |
January 2002 to December 2004 | XML Version 2.5 | Only syntax | | | |
January 2005 to December 2005 | XML Version 4.0 ICE | Maybe | | | |
January 2006 to December 2006 | XML Version 4.1 ICE | Maybe | | | |
January 2007 to December 2012 | XML Version 4.2 ICE | Maybe | | | |
January 2013 to September 24, 2013 | XML Version 4.3 ICE | Yes | ✓ | ✗ | ✗ |
October 8, 2013 to December 2014 | XML Version 4.4 ICE | Yes | ✓ | ✗ | ✗ |
January 2015 to December 2016 | XML Version 4.5 ICE | Yes | ✓ | ✗ | ✗ |
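To pick a parser for a given file, the date ranges in the table above can be turned into a small lookup. The sketch below is one way to do that; the enum and method names are hypothetical and not taken from the project's code. Dates the table does not cover, including the gap between September 24 and October 8, 2013, deliberately return an empty result rather than a guess.

```java
import java.time.LocalDate;
import java.util.Optional;

/** Hypothetical lookup from grant date to file format, following the table above. */
class GrantFormats {
    enum Format { APS, XML_2_5, XML_4_0, XML_4_1, XML_4_2, XML_4_3, XML_4_4, XML_4_5 }

    // Boundary dates copied from the table.
    private static final LocalDate LAST_4_3 = LocalDate.of(2013, 9, 24);
    private static final LocalDate FIRST_4_4 = LocalDate.of(2013, 10, 8);

    /** Returns the format listed for the date, or empty if the table does not cover it. */
    static Optional<Format> formatFor(LocalDate date) {
        int year = date.getYear();
        if (year < 1976 || year > 2016) return Optional.empty();
        if (year <= 2001) return Optional.of(Format.APS);
        if (year <= 2004) return Optional.of(Format.XML_2_5);
        if (year == 2005) return Optional.of(Format.XML_4_0);
        if (year == 2006) return Optional.of(Format.XML_4_1);
        if (year <= 2012) return Optional.of(Format.XML_4_2);
        if (year == 2013) {
            if (!date.isAfter(LAST_4_3)) return Optional.of(Format.XML_4_3);
            if (!date.isBefore(FIRST_4_4)) return Optional.of(Format.XML_4_4);
            return Optional.empty(); // between the two listed ranges
        }
        if (year == 2014) return Optional.of(Format.XML_4_4);
        return Optional.of(Format.XML_4_5); // 2015 and 2016
    }

    public static void main(String[] args) {
        System.out.println(formatFor(LocalDate.of(1999, 6, 1)));
        System.out.println(formatFor(LocalDate.of(2013, 10, 8)));
    }
}
```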
Database
Because there isn't a compelling reason not to, I used the existing PostgreSQL infrastructure on the RDP. The "Java Way" of interacting with databases is the Java Database Connectivity API (JDBC), an implementation-agnostic API for interacting with databases. This project uses the stock PostgreSQL JDBC driver, version 42.1.1.
- Create an empty database:
$ createdb --username=postgres patents_june_2017 # password is tabspaceenter
- Create tables via the script at
E:\McNair\Projects\SimplerPatentData\src\db\NaiveSchema.sql
- Prior example:
E:\McNair\Software\Scripts\Patent\createTables.sql
- Aim to create a completely naive schema with as few constraints as possible, then iteratively add more constraints in the future
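As a rough illustration of mapping model fields to database columns over JDBC, the sketch below builds a parameterized INSERT from an ordered column-to-value map and runs it with a `PreparedStatement`. It is not the project's code: the table and column names, the host and port in the connection URL, and the class name are all hypothetical, and the PostgreSQL driver must be on the classpath for `main` to connect.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical sketch of a naive insert; table and column names are made up. */
class NaiveInsertSketch {
    /**
     * Builds "INSERT INTO t (a, b) VALUES (?, ?)" from the map's keys.
     * Table and column names are concatenated directly, so they must come
     * from trusted code (for example, the model classes), never from data.
     */
    static String insertSql(String table, Map<String, ?> row) {
        StringJoiner columns = new StringJoiner(", ");
        StringJoiner placeholders = new StringJoiner(", ");
        for (String column : row.keySet()) {
            columns.add(column);
            placeholders.add("?");
        }
        return "INSERT INTO " + table + " (" + columns + ") VALUES (" + placeholders + ")";
    }

    /** Binds the map's values in iteration order and executes the insert. */
    static void insert(Connection connection, String table, Map<String, ?> row) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(insertSql(table, row))) {
            int index = 1;
            for (Object value : row.values()) {
                statement.setObject(index++, value);
            }
            statement.executeUpdate();
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 1) {
            System.err.println("usage: NaiveInsertSketch <postgres password>");
            return;
        }
        // LinkedHashMap keeps column order stable between the SQL text and the bound values.
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("patent_number", "1234567");  // hypothetical column
        row.put("title", "Example widget");   // hypothetical column
        // Host and port are assumptions; the database name is the one created above.
        String url = "jdbc:postgresql://localhost:5432/patents_june_2017";
        try (Connection connection = DriverManager.getConnection(url, "postgres", args[0])) {
            insert(connection, "patent", row); // hypothetical table
        }
    }
}
```

Keeping the SQL construction in its own method means the column mapping can be checked without a live database, which fits the goal of starting with a naive schema and tightening it later.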
Related Pages
- Assignment Data Restructure, Spring 2017 by Marcela and Sonia
- Redesigning Patent Database, Spring 2017 by Shelby
- Patent Data Cleanup, June 2016 by Marcela
- Patent Data, Spring 2016 by Marcela
- Lex Machina
- USPTO Patent Litigation Research Dataset by Ed
- Patent Litigation and Review by Marcela
- Bag of Words Analysis
- Existing Database Schema
- My Work Log