[http://mcnair.bakerinstitute.org/wiki/Patent_Data_Processing_-_SQL_Steps Patent Data Processing - SQL Steps] - explains the SQL needed to merge two existing databases: one containing the Harvard Dataverse data and one containing the USPTO data
Here are the instructions I'm developing for downloading, parsing, and hopefully adding new data to the database, since the existing documentation is very sparse. (These instructions can also be found under McNair/Projects/Redesigning Patent Database/Instructions on how to download patent data from USPTO bulk data.)
== How to Run Perl Scripts to Extract Patent Data ==
1) Map the network drive to your computer (instructions can be found under Help for New Staff: scroll down to "Working with the Infrastructure" and click on the link to "How to Map the Network Drive").
2) Consult this handy link on how to install and run Perl programs: https://www.thoughtco.com/how-to-install-and-run-perl-2641103
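Once Perl is installed, you can check the installation and run a script from a command prompt. This is generic Perl usage, not anything specific to our scripts:

<pre>
# Confirm that Perl is installed and on your PATH
perl -v

# Run a script, passing any arguments after the script name
perl some_script.pl arg1 arg2
</pre>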
3) Now it's time to download the data so it can be parsed.
For USPTO Assignment Data, there appears to be a script under McNair/usptoAssignment called USPTO_Assignee_Download, which lets a user pass it a URL and then downloads all the zip files available at that URL. It places the downloaded zip files in "E:/McNair/usptoAssigneeData/name", where "name" is the name of the zip file. If you want to check which files have already been processed, look in "McNair/usptoAssigneeData/Finished" to see the finished zip files. A rough sketch of what this script likely does appears below.
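As an illustration of the download-everything-at-a-URL pattern that USPTO_Assignee_Download appears to implement, here is a minimal Perl sketch. The link-matching regex, the destination path, and the lack of error recovery are all simplifications, and the real script may well differ:

<pre>
#!/usr/bin/perl
# Sketch only: fetch an index page, find every link to a .zip file,
# and download each one into the assignee data directory.
use strict;
use warnings;
use LWP::Simple qw(get getstore);

my $url = $ARGV[0] or die "Usage: $0 <url>\n";
my $dest_dir = "E:/McNair/usptoAssigneeData";

# Pull down the index page and scan it for zip-file links
my $page = get($url) or die "Could not fetch $url\n";
my @zips = $page =~ /href="([^"]+\.zip)"/gi;

for my $zip (@zips) {
    # Crudely resolve relative links against the index URL
    my $full = $zip =~ /^https?:/ ? $zip : "$url/$zip";
    my ($name) = $full =~ m{([^/]+\.zip)$};
    print "Downloading $name ...\n";
    getstore($full, "$dest_dir/$name");
}
</pre>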
The equivalent for patent data is called "USPTO_Parser" and can be found under McNair/Software/Scripts/Patent. Instead of taking a URL, as USPTO_Assignee_Download does, it takes two arguments, year1 and year2, which represent the range of years whose data you wish to download (for example, 2015 to 2016). The script is described as placing the downloaded zip files in "E:/PatentData/name", where "name" is the name of the zip file, but this location is not quite accurate: the files actually appear to be stored under McNair/PatentData. The folder "Processed" under McNair/PatentData appears to hold all the unzipped zip files that have already been downloaded and processed, organized by year, so if you are curious whether some files have already been processed, you can look there. A sketch of the year-range logic follows.
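The year-range behavior presumably boils down to a loop like the sketch below. The USPTO bulk-data URL pattern used here is an assumption; check the script itself for the actual source it downloads from:

<pre>
#!/usr/bin/perl
# Sketch only: iterate over the requested years and download each
# year's zip files, reusing the link-scanning loop sketched above.
use strict;
use warnings;

my ($year1, $year2) = @ARGV;
die "Usage: $0 <year1> <year2>\n" unless defined $year1 && defined $year2;

for my $year ($year1 .. $year2) {
    # Assumed layout: one index page of zipped grant files per year
    my $index = "https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/$year/";
    print "Would download all zip files listed at $index\n";
}
</pre>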
Now to actually run the scripts:
*Insert how one would do this once I figure it out. I searched online to troubleshoot why I could not run the scripts, but I couldn't figure it out. 4/11/2017*
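In the meantime, the generic way to invoke a Perl script from a command prompt is shown below. Whether these particular scripts accept this form is unconfirmed, and the .pl file names are assumptions:

<pre>
cd E:/McNair/Software/Scripts/Patent
perl USPTO_Parser.pl 2015 2016

cd E:/McNair/usptoAssignment
perl USPTO_Assignee_Download.pl <url-of-bulk-data-index-page>
</pre>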
4) The next step would be to parse the actual files.
For the patent data files, based on the existing documentation, it looks like PatentParser, found in McNair/Software/Scripts/Patent, has to be run on each XML file that was downloaded and unzipped during the previous step. It writes all the parsed output into a single text file called "Results.txt" (which I assume will have to be deleted afterward). This script uses the Claim.pm, Inventor.pm, PatentApplication.pm, and Loader.pm modules; it no longer uses the AddressBook.pm module. A hedged driver loop is sketched after the note below.
*If we have a Perl module for getting the inventor, why do we not have an inventors table in the database?*
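Since PatentParser has to be run once per XML file, the outer loop is presumably something like this sketch. The PatentParser.pl file name and the directory layout are assumptions based on the notes above:

<pre>
#!/usr/bin/perl
# Sketch only: run PatentParser on every unzipped XML file for a year.
use strict;
use warnings;

my $year = $ARGV[0] or die "Usage: $0 <year>\n";
my @xml_files = glob("E:/McNair/PatentData/Processed/$year/*.xml");

for my $file (@xml_files) {
    print "Parsing $file ...\n";
    system("perl", "PatentParser.pl", $file) == 0
        or warn "PatentParser failed on $file\n";
}
</pre>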
For the USPTO Assignment Data, the parsing file is called USPTO_Assignee_XML_parser. It takes the path to the files that need to be parsed (an example, I think, is "E:/PatentData/Processed/year", where "year" is the name of the folder in which you've placed the XML files to be parsed). It iterates through all the files in the "year" directory that you passed, loading the information directly into the database as it parses each file. A sketch of this parse-and-load pattern follows.
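The parse-and-load pattern described above would look roughly like the sketch below. The table name, column names, XML tags, and database connection details are all placeholders for illustration; the real parser defines its own schema:

<pre>
#!/usr/bin/perl
# Sketch only: walk every XML file in a directory, pull a couple of
# fields out of each record, and insert them into Postgres as we go.
use strict;
use warnings;
use DBI;
use XML::LibXML;

my $dir = $ARGV[0] or die "Usage: $0 <path-to-xml-directory>\n";

my $dbh = DBI->connect("dbi:Pg:dbname=patent", "postgres", "",
                       { RaiseError => 1, AutoCommit => 0 });
my $sth = $dbh->prepare(
    "INSERT INTO assignees (reel_frame, assignee_name) VALUES (?, ?)");

opendir(my $dh, $dir) or die "Cannot open $dir: $!\n";
for my $file (grep { /\.xml$/i } readdir($dh)) {
    my $doc = XML::LibXML->load_xml(location => "$dir/$file");
    # Tag names below are placeholders, not the real schema
    for my $rec ($doc->findnodes('//patent-assignment')) {
        my $reel = $rec->findvalue('.//reel-no');
        my $name = $rec->findvalue('.//patent-assignee/name');
        $sth->execute($reel, $name);
    }
    $dbh->commit;   # commit once per file
}
closedir($dh);
$dbh->disconnect;
</pre>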
5) For patent data, I assume the next step would be to create a table from the text file - possibly using CreateTables (a PostgreSQL file).
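If Results.txt turns out to be a delimited text file, loading it could be as simple as a CREATE TABLE plus a COPY, which DBD::Pg supports from Perl. The table definition below and the assumption that Results.txt is tab-delimited are guesses; CreateTables presumably defines the real schema:

<pre>
#!/usr/bin/perl
# Sketch only: bulk-load Results.txt into a Postgres table with COPY.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:Pg:dbname=patent", "postgres", "",
                       { RaiseError => 1, AutoCommit => 1 });

# Placeholder columns; the real schema would come from CreateTables
$dbh->do("CREATE TABLE IF NOT EXISTS patents_raw (
              patent_no text, title text, grant_date text)");

$dbh->do("COPY patents_raw FROM STDIN WITH (FORMAT text)");
open(my $fh, '<', 'Results.txt') or die "Cannot open Results.txt: $!\n";
while (my $line = <$fh>) {
    $dbh->pg_putcopydata($line);
}
close($fh);
$dbh->pg_putcopyend;
$dbh->disconnect;
</pre>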
== Specifications of USPTO Data To Extract ==