A continuation of [[Redesigning Patent Database]] that aims to write faster, more centralized code for handling data from the United States Patent and Trademark Office (USPTO). With an end-to-end pipeline, we can reproduce or update the data easily, without worrying about unintentional side effects or missing data.
== Progress ==
# <del>Downloader</del> ''done''
# <del>Splitter</del> ''done''
# <del>Parser</del> ''done''
# <del>Set up PostgreSQL JDBC</del> ''done''
# <del>Create naive schema based on previous approaches</del> ''done''
# <del>Create new data structures</del> ''done''
# <del>Database Insert (modify <code>models/</code> files with some mapping to database fields)</del> ''done''
# <del>Create tooling for minions</del> ''skipped''
# <del>Investigate parallel speedup (e.g. multithread, mmap)</del> ''done''
# <del>Remove duplicate code through the addition of more abstract classes</del> ''done''
# <del>First 5 digits of zip code; centroid?</del> ''hackily done''
# <del>Patent ID</del> ''doneish''
# <del>Create XPath queries for reissue, design patents (only utility right now)</del> ''split off'' (see [[Equivalent_XPath_and_APS_Queries]])
# <del>Create semantic parser for APS files</del> ''see above''
# Data Cleanup (reference [[Patent_Assignment_Data_Restructure|Marcela and Sonia's work]])
# Data Source Merger (currently ''only USPTO granted, maintfee, and assignment''; not USPTO applications, Harvard Dataverse, or Lex Machina)
# Set up pipeline script to complete all of these steps in series
# Add constraints to database tables, e.g. correct types, foreign keys, not null, lookup tables
# Add deduplication
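The "pipeline script to complete all of these steps in series" item above can be sketched roughly as follows. This is a minimal, hypothetical driver, not the project's actual script: the stage names and commands are placeholders for the real Downloader, Splitter, Parser, and database-insert entry points, which live elsewhere in the repo. The one design point it illustrates is fail-fast sequencing, so a later stage never runs against partial output from an earlier one.

```python
import subprocess

def run_pipeline(stages):
    """Run each (name, command) stage in series, aborting on the first
    failure via subprocess's check=True (raises CalledProcessError)."""
    for name, cmd in stages:
        print(f"[pipeline] running stage: {name}")
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Placeholder commands; substitute the real stage invocations here.
    run_pipeline([
        ("download", ["echo", "fetch USPTO bulk-data grant archives"]),
        ("split",    ["echo", "split concatenated XML into per-patent files"]),
        ("parse",    ["echo", "parse patents and insert via JDBC"]),
    ])
```

Running the stages through one driver like this also gives a single place to later hook in the cleanup, merger, and deduplication steps listed above.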
== Directory Layout ==