Oliver Chang (Work Log)
To-do List:
- Expand XPath use in the patent data
- Edit to include Application data
- Finish ID joining
Projects:
- PostGIS Installation
- Reproducible Patent Data
- Predictive Patent Validity Machine Learning Ideas
- Equivalent XPath and APS Queries
- US Address Verification
- GPU Computer Build
- Parallel Enclosing Circle Algorithm
Uploads:
- File:PADX-File-Description-v2 Hague.pdf
- Describes patent kind codes (notably, what the hell X0 represents)
- File:PatentFullTextAPSDoc GreenBook pgs13-22.pdf
- Describes the fields in APS, their supposed character lengths, and if they are required/optional
- File:Aps-wku-modulus11.pdf
- Describes the layout of the check digit on magnetic tape
- File:Mod-11-algorithm.pdf
- Describes the algorithm used to calculate the check digit
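For orientation, a generic weighted modulus-11 check digit can be sketched as below. This is not taken from the uploaded PDFs: the weights (2, 3, 4, ... counted from the rightmost digit), the `X` symbol for a remainder of 10, and the function name `mod11_check_digit` are all assumptions chosen for illustration, and the APS/WKU scheme may differ in any of them.

```python
def mod11_check_digit(number: str) -> str:
    """Return a weighted mod-11 check digit for a string of decimal digits.

    Assumed scheme (NOT confirmed against the APS documentation):
    weights 2, 3, 4, ... applied from the rightmost digit leftwards,
    and a check value of 10 written as "X".
    """
    if not (number.isascii() and number.isdigit()):
        raise ValueError("expected a non-empty string of decimal digits")
    # Weighted sum: the rightmost digit gets weight 2, the next one 3, and so on.
    total = sum(int(d) * w for w, d in enumerate(reversed(number), start=2))
    # Pick the check value that makes (total + check) divisible by 11.
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


if __name__ == "__main__":
    print(mod11_check_digit("030640615"))
```

With these assumed weights, a nine-digit input yields the same check digit as ISBN-10, a well-known mod-11 scheme, which makes the sketch easy to sanity-check against published numbers.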
Day-by-Day (in reverse chronological order)
October 2017
- Oct 25: start ingestion of application xml files and deal with all the bugs which accompany that
- Oct 21: create xml explorer script to mass-inspect xpaths (can be found at E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\xml_schema_explorer)
- Oct 20: re-run assignment import, review xpaths
- Oct 19: repopulate patents database and create spreadsheet of assignment equivalencies
- Oct 11: add issues to work on to the PECA wiki page; clean up PECA git repo; start patent application code from granted patent code and start customizing it to the new domain
- Oct 3: troubleshoot vc_circles.py and make command line interface a little nicer
- Oct 2: discuss mapping strategies & investigate missing eca data
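The idea behind the Oct 21 XPath explorer (mass-inspecting which paths occur in an XML file) can be sketched roughly as follows. The actual script lives in the Java project above; this Python version is only an illustration, and the function name `count_xpaths` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_xpaths(xml_text: str) -> Counter:
    """Count how often each simple element/attribute path occurs in a document."""
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(element, parent_path):
        # Build an absolute path such as /doc/a/b (no positional predicates).
        path = f"{parent_path}/{element.tag}"
        counts[path] += 1
        # Record attributes as /path/@name so they show up in the inventory too.
        for name in element.attrib:
            counts[f"{path}/@{name}"] += 1
        for child in element:
            walk(child, path)

    walk(root, "")
    return counts


if __name__ == "__main__":
    # Hypothetical sample; real patent XML is far larger and uses other tags.
    sample = "<doc><a id='1'><b/></a><a><b/><b/></a></doc>"
    for path, n in sorted(count_xpaths(sample).items()):
        print(n, path)
```

Counting paths rather than just listing them makes rarely used elements stand out, which is usually where import bugs hide.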
September 2017
- Sept 23: make Project/OliverLovesCircles usable and add initial splitting ability
- Sept 22: goal setting & server debugging & meet with Yang
August 2017
- Aug 4: set up parallel instance python framework for job reporting; begin test run
- Aug 2: finish up some documentation of the code and for the wiki
- Aug 1: discuss alternatives to the Java port with Abhi & Ed, because of algorithmic constants that would be hard to port; run test batches on python with addition of equality operators and early stopping on convergence
July 2017
- July 31: sketch out parallel enclosing circle algorithm
- July 28: field questions and data cleanup questions from Kerda & Joe & Adrian
- travelling
- July 19: try to remove duplicated records (esp. those with empty titles) which are preventing the addition of a unique constraint
- July 18: run correspondent join on properties and correspondents table to match previous project; sync with Adrian and Abhi
- July 17: redo db operations after cleaning up granted patent number bugs
- July 13: powwow about parallelizing Enclosing Circle Algorithm; sketch out what to do for the rest of the summer; work more on joins
- July 12: generate some example data illustrating the difficulty of joining different tables
- July 11: track down some bugs that happen very rarely and were missed in the initial qa phase
- July 7: catch up on documentation
- July 6: try (unsuccessfully) to understand docid mapping...create exploration scripts
- July 5: add invention title to proper grouping of assignment properties; optimize XML parsing
June 2017
- June 30: powwow with James, Abhi, Ed about optimization issues; discuss document ids, X0 etc with Ed; pinpoint issues with APS doc numbers (see Repro Pat Dat#Gotchas for more info)
- June 29: add logging of copy commands, more chattiness to scripts, debug assignment data failure
- June 28: create examples for expansion to plant, reissue, design patent collection; start optimizing xml
- June 27: write SQL to replicate assignees, extract postcodes for ongoing projects
- June 26: speed up code, abstract in-memory file splitters to avoid repetition and some weird edge cases
- June 25: create mappings for APS, assignment properties, XML 2.5 for data import; run data imports for granted data
- June 23: clean up hacky models with a better set of abstractions; clean up IDE warnings; redefine patent-address mapping
- June 22: create postcode<->patent table
- June 21: document granted patent queries and equivalencies
- June 20: sketch out APS driver; discuss patent id problem; further document, with evidence, the validity of the zipcode data
- June 19: skim address regular expressions; cursory investigation of patent table
- June 16: create method of getting all data into the database, whether it likes it or not; copy over assignments, granted data using new scheme
- June 15: add more robust error reporting, fix race conditions; build out assignment driver; build out fee event driver; add error logging
- June 14: migrate bulk inserts to copy command; refresh on address data and start in on that; convert processor to multi-threaded application
- June 13: spot check SQL tables; fix broken final case endlessly looping; investigate smarter insert methods
- June 12: add XML printer, use it to inspect applications; extend BaseScraper to fetch patent application data; add applications documentation to my project page; add CREATE of other tables
- June 8: add foreign key inserts; create pretty printer for XML analysis
- June 7: finalize DB abstraction layer; migrate code to bulk inserts; upgrade webserver software and do optimization on RDP postgres with Ed
- June 6: add jdbc; create basic schema; add db interaction; schedule meeting for later in the week
- June 5: look into postgresql; refresh on postgis; add some notes to the Enclosing Circle Algorithm page
- June 1: add RDP git remote; add more documentation to wiki page; refactor downloader scripts; start creation of tooling for interacting with data
May 2017
- May 31: finish copy-pasting attributes into the wiki page; retroactively fill out work log; meet with Ed to discuss next steps
- May 30: update documentation on wiki, restructure large binary files to have more hierarchy instead of a flat listing at the root
- May 29: expand to APS; expand to raw assignment data
- May 27: expand to maintenance fee data
- May 26: create models, translate xmlparser*.pl file into Java; start using builder pattern
- May 25: sketch out OO design of project; download bulk data
- May 24: move wiki pages around; start git repository for project
- May 21: discuss technical details of previous work with Ed
- May 8: clean up dead links on wiki and start reading about previous work; discuss current project status with Ed
- May 4: set up wiki account, rdp account, database training