Note that the zip files should appear briefly (sequentially) in E:/McNair/Software/Scripts/Patent before disappearing and reappearing unzipped in E:/McNair/PatentData.
3b) Now we need to split the files into individual, valid XML files. To do this:
Move the files to be split into E:/McNair/PatentData/Queue
Run the command:
perl splitter.pl
Each file will then be split out into its own directory of XML files in E:/McNair/PatentData/Processed.
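To make the splitting step concrete, here is a minimal sketch in Python of what "splitting into individual, valid XML files" can look like. It is not the code of splitter.pl. It assumes that each bulk file is a concatenation of XML documents, each starting with its own `<?xml ...?>` declaration. The function names and the output naming scheme (`<source name>_<number>.xml`) are hypothetical.

```python
import re
from pathlib import Path

# Assumption: every document in the bulk file starts with an XML declaration.
XML_DECLARATION = re.compile(r"<\?xml\s")


def split_concatenated_xml(text):
    """Return the individual XML documents contained in one concatenated string.

    Anything that appears before the first XML declaration is dropped.
    If no declaration is found, the whole (non-empty) text is returned as one document.
    """
    starts = [match.start() for match in XML_DECLARATION.finditer(text)]
    if not starts:
        stripped = text.strip()
        return [stripped] if stripped else []
    ends = starts[1:] + [len(text)]
    return [text[start:end].strip() for start, end in zip(starts, ends)]


def split_file(source, processed_root):
    """Write each document of `source` into processed_root/<source stem>/.

    Returns the list of paths written, in document order.
    """
    source = Path(source)
    out_dir = Path(processed_root) / source.stem
    out_dir.mkdir(parents=True, exist_ok=True)

    documents = split_concatenated_xml(source.read_text(encoding="utf-8"))
    written = []
    for number, document in enumerate(documents, start=1):
        # Hypothetical naming scheme: <source stem>_<5-digit counter>.xml
        path = out_dir / f"{source.stem}_{number:05d}.xml"
        path.write_text(document + "\n", encoding="utf-8")
        written.append(path)
    return written
```

Under these assumptions, calling `split_file` on a file in E:/McNair/PatentData/Queue with E:/McNair/PatentData/Processed as the second argument would produce one directory per input file, which matches the layout described above.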
4) The next step is to parse the actual files.
For the patent data files, based on the existing documentation, it looks like PatentParser, found in McNair/Software/Scripts/Patent, has to be run on each XML file that was downloaded and unzipped during the previous step. *It then stores all of the parsed output in a single text file called "Results.txt" (which I assume will have to be deleted afterward). This script uses the Claim.pm, Inventor.pm, PatentApplication.pm, and Loader.pm modules. *It no longer uses the AddressBook.pm module. *If we have a Perl module for getting the inventor, why do we not have an inventors table in the database? *THIS IS A GOOD QUESTION!
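Since the parser has to be run once per XML file, a small driver can save a lot of manual work. The sketch below is only an illustration in Python. How PatentParser is actually invoked is not documented here, so the command is passed in as a parameter (`command_prefix`), and it is an assumption that the parser accepts the file path as its last argument. Both function names are hypothetical.

```python
import subprocess
from pathlib import Path


def find_xml_files(root):
    """Return every .xml file under root, sorted so runs are repeatable."""
    return sorted(p for p in Path(root).rglob("*.xml") if p.is_file())


def run_on_each(root, command_prefix):
    """Run `command_prefix + [xml_file]` for every XML file under root.

    Assumption: the parser takes the file path as its final argument.
    Returns the list of files for which the command exited with a non-zero status.
    """
    failed = []
    for xml_file in find_xml_files(root):
        result = subprocess.run(list(command_prefix) + [str(xml_file)])
        if result.returncode != 0:
            failed.append(xml_file)
    return failed
```

Here `command_prefix` would be whatever command launches PatentParser, given as a list of arguments. The returned list makes it easy to see which files need to be rerun or inspected by hand.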
For the USPTO Assignment Data, the parsing file is called USPTO_Assignee_XML_parser. It takes the path to the files that need to be parsed (an example, I think, is "E:/PatentData/Processed/year" where ... while it parses the file). This parser will open an ODBC (or similar) connection to a database on the RDP's installation of Postgres. It will then put the data, I assume, directly into this database. Once complete, we manually move the tables to the dbase server's database (i.e. patent).
5) For patent data, the next step would be to create a table from the text file, possibly using CreateTables (a PostgreSQL file).
== Specifications of USPTO Data To Extract ==