Difference between revisions of "Redesigning Patent Database"
(150 intermediate revisions by 3 users not shown) | |||
Line 1: | Line 1: | ||
− | + | {{Project | |
− | {{McNair | + | |Has project output=Data |
+ | |Has sponsor=McNair Center | ||
|Has title=Redesigning Patent Database | |Has title=Redesigning Patent Database | ||
|Has owner=Shelby Bice | |Has owner=Shelby Bice | ||
− | |Has start date= | + | |Has start date=201701 |
− | | | + | |Has deadline=201705 |
− | | | + | |Has keywords=Database, Patent |
− | | | + | |Has project status=Complete |
− | |||
}} | }} | ||
+ | Documentation on the process of updating the patent database with new data and eventual designs for the new patent database. Not to be confused with "Patent Data Restructure", which deals with condensing and cleaning USPTO assignees data. | ||
+ | |||
+ | NOTE FOR ED: Places I looked for the parser for the USPTO assignment data and did not find it are located in a txt file under McNair/Projects/Redesigning Patent Database called "Notes on where the correct parser for the USPTO Assignment Data is and is not". Hopefully that will help narrow your search or help you determine if you should keep debugging the one we have. | ||
+ | |||
+ | For anyone looking to pick up this project: | ||
+ | Ended with notes on how to add data to database (see Current Design and Scripts Documentation). | ||
+ | Below are some checklist items if we overhauled the database and rebuilt all the tables from scratch, which could be a next step. | ||
+ | There is a lot of notes on this page about designing the database and what tables are necessary and which have been created for projects. Feel free to consult these if you are overhauling the database. I would suggest going through the tables in an existing patent database and instead removing extraneous tables that no longer have any use instead of rebuilding the entire database, since I determined that the existing schema isn't that convoluted and the process of downloading, parsing, and cleaning data from USPTO, let alone the Harvard Dataverse, is a huge hassle, so there isn't any reason to reinvent | ||
+ | the wheel in my opinion. | ||
+ | |||
+ | If you're looking to just pull new data from USPTO and want to know the process, see "Current Design and Scripts Documentation." | ||
+ | |||
+ | If you're rewriting any of the Perl scripts that download data from USPTO, specifically USPTO Assignment data (sometimes referred to on the wiki as USPTO Assignee Data), please take a look at "Specifications of USPTO Assignment Data To Extract in Future Perl Scripts." I looked through the DTDs and documentation on the USPTO bulk website and found a couple things that we could be extracting from the data we download that could be useful, like abstracts and more inventor information. | ||
+ | |||
+ | |||
== '''Redesigning Patent Database''' == | == '''Redesigning Patent Database''' == | ||
+ | * Design a better representation for database | ||
+ | * Fix scripts if necessary | ||
+ | * Start moving data into new database by querying existing databases (using SQL) | ||
+ | * Use scripts to query new data | ||
+ | * Test database | ||
+ | * Remove extraneous information from database (copies, patents that we're not interested in, etc.) | ||
+ | |||
+ | '''Documentation I need to include:''' | ||
+ | |||
+ | * Schema of new database (with justification of design), would like to include a visual representation | ||
+ | * SQL commands that were used to fill database with explanation of what they do | ||
+ | * Clear instructions on where to find scripts in bulk drive and an explanation of what each script does | ||
+ | * Visual representation of example table entries that isn't just copied and pasted from a CSV file | ||
+ | |||
+ | ==Related Projects== | ||
+ | *[[Patent Assignment Data Restructure]] | ||
+ | *[[Small Inventors Project]] - uses Fee Status and Citations | ||
+ | *[[Medical Centers and Grants]] - uses patent assignees, specifically their zipcodes and organizations | ||
+ | |||
+ | ==Documentation Relevant to Current Patent Database== | ||
+ | |||
+ | '''Previous documentation on the patent database:''' | ||
+ | |||
+ | '''As of 3/21/2017 the most up-to-date database containing patent data is "patent" not "allpatent" or "allpatent_clone" and "patent" is the database that the the other patent data redesign project, Restructuring Patent Data (link above) is working with. The datbase "allpatent" has since been removed, but it can be restored if it is needed.''' | ||
+ | |||
+ | [[Patent Data]] - overview of what the data is and where it came from, probably starting point for changing documentation | ||
+ | |||
+ | [[Patent|Patent Database]] - overview of schema of database (specifically, the database patent, which includes data from Harvard dataverse (originally stored in patentdata) and USPTO (patent_2015) | ||
+ | |||
+ | [[USPTOAssigneesData|USPTO Assignees Database]] - enhances assignee info in patent database, also being redesigned | ||
+ | |||
+ | [[Patent_Data_Issues|Problems with Patent Database]] - lists issues with current schema | ||
+ | |||
+ | [[Data_Model|Previous ER Diagram]] - does not match up with schema described in [[Patent|Patent Database]] and contains outdated list of what we want to pull from XML files | ||
+ | |||
+ | [[Patent_Data_Processing_-_SQL_Steps|Processing Patent Data]] - states that allpatent is the newest database and an amalgamation of patentdata or patent_2015 | ||
== Description == | == Description == | ||
Line 18: | Line 69: | ||
== Development == | == Development == | ||
− | Design will be built upon a relational database model. I will be referencing this article on database design as I develop the design (http://en.tekstenuitleg.net/articles/software/database-design-tutorial/one-to-many.html), and I will be creating an ER diagram. | + | Design will be built upon a relational database model. I will be referencing this article on database design as I develop the design (http://en.tekstenuitleg.net/articles/software/database-design-tutorial/one-to-many.html), and I will be creating an ER diagram using [https://erdplus.com/#/standalone ERDPlus] or [https://creately.com/app/?tempID=hqdgwjki1&login_type=demo# Creately]. |
− | == Current Design and Scripts | + | For notes on the existing database called "patent" and its tables, suggestions on redesigning tables, etc, please see e:/McNair/Projects/Redesigning Patent Database. |
+ | |||
+ | == Current Design and Scripts Documentation == | ||
The scripts for querying data for the patent database exist in McNair/software/scripts/patent. If the design for the schema of the database tables are altered too much, may have to write new scripts. | The scripts for querying data for the patent database exist in McNair/software/scripts/patent. If the design for the schema of the database tables are altered too much, may have to write new scripts. | ||
+ | |||
+ | The following pages are relevant to how previous databases are built/how to build tables in the database: | ||
+ | |||
+ | [[Harvard_Dataverse_Data]] - explains how to make tables from Harvard Dataverse data, where to find scripts, etc. | ||
+ | |||
+ | [[USPTO_Bulk_Data_Processing|USPTO Data]] - explains how to make tables from USPTO data, where to find scripts, etc, specifically for assignment data. | ||
+ | |||
+ | [[Patent_Data_Extraction_Scripts_(Tool)|Patent Data Extraction]] - explains locations of XML files and lists (at the bottom) where the Perl scripts can be found | ||
+ | |||
+ | [[Patent_Data_Cleanup_(June_2016)|Patent Data Cleanup]] - explains changes that were made to clean up problems in the database allpatent as a result of merging the Harvard Dataverse data and the USPTO data | ||
+ | |||
+ | [[Patent_Data_Processing_-_SQL_Steps|Patent Data Processing - SQL Steps]] - explains SQL needed to merge two existing databases, one that contained the Harvard Dataverse data and one that contained the USPTO data | ||
+ | |||
+ | Here at the instructions I'm developing for downloading, parsing, and hopefully adding new data to the database since the documentation is very sparse (can also be found under McNair/Projects/Redesigning Patent Database/Instructions on how to download patent data form USPTO bulk data. | ||
+ | |||
+ | ===How to run Perl Scripts to extract Patent Data=== | ||
+ | |||
+ | #Log on to the RDP | ||
+ | #Open powershell | ||
+ | #Change directory to wherever the script is located by doing: cd e:/mcnair/whatever | ||
+ | #Run the script by doing: | ||
+ | perl scriptname arg1 arg2 ... | ||
+ | or | ||
+ | perl scriptname argument1=arg1 ... | ||
+ | where arg1 and arg2 are arguments passed to the script either directly or by name (e.g., argument1=) | ||
+ | |||
+ | ===For the Main Patent Data=== | ||
+ | |||
+ | You'll need to run a series of scripts: | ||
+ | #USPTO_Parser.pl to get the zip files from the USPTO and unzip them | ||
+ | #splitter.pl to break the files into individual xml files | ||
+ | |||
+ | ====USPTO_Parser.pl==== | ||
+ | |||
+ | USPTO_Parser.pl can be found under | ||
+ | E:/McNair/Software/Scripts/Patent | ||
+ | |||
+ | Notes: | ||
+ | *You may need to fix the base url in the script. The USPTO has changed it before. It is currently: https://bulkdata.uspto.gov/data2/patent/grant/redbook | ||
+ | *Instead of taking a text file containing a url, as the USPTO_Assignee Download does, it takes two arguments, year 1 and year2, which are supposed to represent the range of data that you wish to download (for example, 2015 to 2016). | ||
+ | *The perl script places the downloaded zip files into "E:/McNair/PatentData/name" where "name" is the name of the zip file. | ||
+ | *The folder "Processed" under McNair/PatentData appears to hold all the unzipped zip files that have been downloaded and processed already. So if you are curious if some files have already been processed, you could look there. They are organized by year. | ||
+ | |||
+ | Now to actually run the scripts: | ||
+ | Open a command line or powershell | ||
+ | e: | ||
+ | cd .\McNair\Software\Scripts\Patent | ||
+ | perl USPTO_Parser.pl 2016 2017 | ||
+ | |||
+ | Note the zip files should appear briefly (sequentially) in E:/McNair/Software/Scripts/Patent before disappearing and reappearing unzipped in E:/McNair/PatentData | ||
+ | |||
+ | ====splitter.pl==== | ||
+ | |||
+ | Now we need to split the files into individual, valid xml files. To do this: | ||
+ | Move the files to be split into E:/McNair/PatentData/Queue | ||
+ | Go to: E:/McNair/PatentData (this is the version of the splitter to use) | ||
+ | Run the command: | ||
+ | perl splitter.pl | ||
+ | Each file will then be blown out into a directory of xml files in E:/McNair/PatentData/Processed | ||
+ | |||
+ | Notes: | ||
+ | *Change line 26 of script to reflect the year (name of the directory that you want to put the split files into the appropriate directory) | ||
+ | this will go ahead and put the files in the directory so you don't have to copy them. | ||
+ | |||
+ | ====xmlparser_4.5_4.4_4.3.pl==== | ||
+ | |||
+ | The next step is to parse the actual files. '''Do not use the perl script PatentParser.pl'''. This script is out of date. | ||
+ | |||
+ | Instead go to: | ||
+ | E:/McNair/PatentData/Processed | ||
+ | |||
+ | and use the perl script called xmlparser_4.5_4.4_4.3.pl by running (for example): | ||
+ | perl xmlparser_4.5_4.4_4.3.pl E:\McNair\PatentData\Processed\2016 | ||
+ | |||
+ | This will load the data into the database. Running splitter.pl should have loaded the files into a directory. Pass the path to the directory to the parser and it will parse the data and load it. | ||
+ | Notes: | ||
+ | *The parser will open a connection to a database on the RDP's installation of postgres. It will then put the data directly into this database. Once complete. we manually move the tables to the dbase server's database (i.e. patent). | ||
+ | *The password hint is ''tsn''. You can run pgAdmin III to connect to it. | ||
+ | *The default database is called PatentDB | ||
+ | *If you want to make a new dbase, you'll have to run a sql script to make the tables. It is E:\McNair\Software\Scripts\Patent\createTables.sql | ||
+ | *The script populates an inventors table. We may have failed to move this table over to the production dbase server. | ||
+ | *If you are updating the data, make sure that you don't add duplicate records to the dbase. The easiest way to fix this is to make sure that you don't parse duplicate xml files. | ||
+ | |||
+ | During the last update: | ||
+ | *ipg160322_6745.xml was the last loaded XML file. It covered <doc-number>09295186</doc-number><kind>B2</kind><date>20160322</date>. I confirmed that this was the last file in the E:\McNair\PatentData\Processed\ipg160322. | ||
+ | *I also confirmed that this was the highest patent number with the highest grant data in the dbase PatentDB | ||
+ | *I therefore put every folder from ipg160329 to ipg161227 into the E:\McNair\PatentData\Processed\2016 folder | ||
+ | *The script (xmlparser_4.5_4.4_4.3.pl) required substantial modification. In particular, the data structure implied by E:\McNair\Software\Scripts\Patent\createTables.sql was inadequate. As a consequence, I created a new database called PatentUpdate and modified its structure. All of the additional data is now loaded into there. | ||
+ | |||
+ | ====Next steps==== | ||
+ | We now need to: | ||
+ | #Retrieve the data out of PatentUpdate and reprocess it so that it will fit into ''patent'' on the main db server | ||
+ | |||
+ | ====Cleaning up patent data and inserting it into the existing patent data table==== | ||
+ | |||
+ | =====For the table called "patent"===== | ||
+ | |||
+ | INSERT statement: | ||
+ | INSERT INTO patents_merged ( SELECT patents.patentnumber, patents.grantdate, patents.prioritydate, patents.prioritycountry, patents.prioritypatentnumber, patents.cpcsubgroup, patents.pctpatentnumber, patents.numberofclaims, | ||
+ | patents.applicationnumber, NULL, patents.filingdate, NULL, NULL, NULL, NULL FROM patents ); | ||
+ | Equivalent COPY statements: | ||
+ | |||
+ | First copy data from RDP to a CSV file: | ||
+ | |||
+ | \COPY (SELECT patents.patentnumber, patents.grantdate, patents.prioritydate, patents.prioritycountry, patents.prioritypatentnumber, patents.cpcsubgroup, patents.pctpatentnumber, patents.numberofclaims, | ||
+ | patents.applicationnumber, NULL, patents.filingdate, NULL, NULL, NULL, NULL FROM patents) TO '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; | ||
+ | --COPY 1646225 | ||
+ | Then copy CSV file into table: | ||
+ | \COPY patents_merged FROM '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; | ||
+ | -- RESULT : COPY 1646225 | ||
+ | where "patents" is the name of the table containing the data from the RDP update and "patents_merged" is the name of the table we want to insert the new data into. | ||
+ | |||
+ | Table that we want to insert in patent is called "patent". So you would replace "patents_merge" with "patent". | ||
+ | |||
+ | |||
+ | The data in columns nber, uspc, and uspc_sub is taken from historicalpatentdata. Here is the SQL code to do that after the new data from the RDP has been inserted into the table "patent". | ||
+ | |||
+ | UPDATE patent p | ||
+ | SET nber = hp.nber, | ||
+ | uspc = hp.uspc, | ||
+ | uspc_sub = hp.uspc | ||
+ | FROM historicalpatentdata hp | ||
+ | WHERE hp.patentnumber = CAST(p.patent AS varchar); | ||
+ | |||
+ | Missing information (represented by NULL): | ||
+ | gyear (int) | ||
+ | appear (int) | ||
+ | |||
+ | Schema of patent data when loaded into RDP (based on existin documentation - have not checked schema on RDP postgres server): | ||
+ | patentnumber | int | not null | ||
+ | kind | varchar | | ||
+ | grantdate | date | | ||
+ | type | varchar | | ||
+ | applicationnumber | varchar | | ||
+ | filingdate | date | | ||
+ | prioritydate | date | | ||
+ | prioritycountry | varchar | | ||
+ | prioritypatentnumber | varchar | | ||
+ | ussubclass | varchar | | ||
+ | maingroup | varchar | | ||
+ | subgroup | varchar | | ||
+ | cpcsubclass | varchar | | ||
+ | cpcmaingroup | varchar | | ||
+ | cpcsubgroup | varchar | | ||
+ | classificationnationalcountry | varchar | | ||
+ | classificationnationalclass | varchar | | ||
+ | title | varchar | | ||
+ | numberofclaims | int | | ||
+ | primaryexaminerfirstname | varchar | | ||
+ | primaryexaminerlastname | varchar | | ||
+ | primaryexaminerdepartment | varchar | | ||
+ | pctpatentnumber | varchar | | ||
+ | filename | varchar | | ||
+ | |||
+ | Schema of patent table in patent (called "patent"): | ||
+ | patent | integer | not null | plain | | | ||
+ | gdate | date | | plain | | | ||
+ | prioritydate | date | | plain | | | ||
+ | prioritycountry | character varying | | extended | | | ||
+ | prioritypatentnumber | character varying | | extended | | | ||
+ | cpcsubgroup | character varying | | extended | | | ||
+ | pctpatentnumber | character varying | | extended | | | ||
+ | claims | integer | | plain | | | ||
+ | appnum | integer | | plain | | | ||
+ | gyear | integer | | plain | | | ||
+ | appdate | date | | plain | | | ||
+ | appyear | integer | | plain | | | ||
+ | nber | integer | | plain | | | ||
+ | uspc | character varying | | extended | | | ||
+ | uspc_sub | character varying | | extended | | | ||
+ | Indexes: | ||
+ | "patents_pkey" PRIMARY KEY, btree (patent) | ||
+ | "patent_idx" UNIQUE, btree (patent) | ||
+ | |||
+ | =====For the table called "citation"===== | ||
+ | Schema of citations table in patent | ||
+ | |||
+ | Table "public.citation" | ||
+ | Column | Type | Modifiers | Storage | Stats target | Description | ||
+ | --------+--------+-----------+---------+--------------+------------- | ||
+ | patent | bigint | | plain | | | ||
+ | cited | bigint | | plain | | | ||
+ | |||
+ | We just need citingpatentnumber and citedpatentnumber where citedpatentnumber is an seven digit int from the RDP database. | ||
+ | |||
+ | Use this script to replace textual patent numbers as ints (from existing documentation) | ||
+ | |||
+ | CREATE OR REPLACE FUNCTION cleanpatno (text) RETURNS bigint AS $$ if ($_[0]) { my $var=$_[0]; if ($var=~/^\d*$/) {return $var;} return undef; } return undef; $$ LANGUAGE plperl; | ||
+ | INSERT statement | ||
+ | INSERT INTO citation ( SELECT new_citations.citingpatentnumber, new_citations.citedpatent number FROM patents ); | ||
+ | Equivalent COPY statement: | ||
+ | \COPY citations (SELECT new_citations.citingpatentnumber, new_citations.citedpatent number FROM patents) TO '/tmp/citations_merge_export.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; | ||
+ | --COPY 1646225 | ||
+ | Then copy CSV file into table: | ||
+ | \COPY citations_merged FROM '/tmp/citations_merge_export.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; --1607724 | ||
+ | |||
+ | where "citations" is the name of the table containing the data from the RDP update and "citations_merge" is the name of the table we want to insert the new data into. | ||
+ | |||
+ | =====For the table called assignee===== | ||
+ | |||
+ | INSERT statement (copies data from table we loaded from RDP into existing table in patent database | ||
+ | INSERT INTO assignees_merge ( SELECT assignees.lastname, assignees.firstname, assignees.address, assignees.postcode, assignees.orgname, assignees.city, assignees.country, assignees.patentnumber, assignees.state, | ||
+ | assignees.patentcountry, NULL, NULL, NULL, NULL FROM assignees ); | ||
+ | Equivalent COPY statemnets: | ||
+ | |||
+ | \COPY assignees (SELECT assignees.lastname, assignees.firstname, assignees.address, assignees.postcode, assignees.orgname, assignees.city, assignees.country, assignees.patentnumber, assignees.state, assignees.patentcountry, | ||
+ | NULL, NULL, NULL, NULL FROM assignees) TO '/tmp/merged_assignees_export.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; | ||
+ | --COPY 1646225 | ||
+ | Then copy CSV file into table: | ||
+ | \COPY assignees_merge FROM '/tmp/assignees_merge_export.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; --1607724 | ||
+ | where "assignees" is the name of the table containing the data from the RDP update and "assignees_merge" is the name of the table we want to insert the new data into. | ||
+ | |||
+ | Table that we want to insert in patent is called assigneeinfo. So you would replace "assignees_merge" with "assigneeinfo". | ||
+ | |||
+ | |||
+ | Missing information (represented by NULL): | ||
+ | |||
+ | nationality (char sequence) | ||
+ | |||
+ | residence (char sequence) | ||
+ | |||
+ | asgseq (int) | ||
+ | |||
+ | asgtype (int) | ||
+ | |||
+ | Schema of assignees data when loaded into RDP (based on exisitng docmentation - have not checked schema on RDP postgres server): | ||
+ | lastname | text | | ||
+ | firstname | text | | ||
+ | orgname | text | | ||
+ | city | text | | ||
+ | country | text | | ||
+ | patentcountry | text | | ||
+ | patentnumber | integer | | ||
+ | state | text | | ||
+ | address | text | | ||
+ | postcode | text | | ||
+ | |||
+ | Schema of assignees table in patent (named assigneeinfo) | ||
+ | lastname | character varying(200) | | extended | | | ||
+ | firstname | character varying(200) | | extended | | | ||
+ | address | character varying(200) | | extended | | | ||
+ | postcode | character varying(200) | | extended | | | ||
+ | orgname | character varying(500) | | extended | | | ||
+ | city | character varying(200) | | extended | | | ||
+ | country | character varying(200) | | extended | | | ||
+ | patent | character varying(200) | | extended | | | ||
+ | state | character varying(200) | | extended | | | ||
+ | patentcountry | character varying(200) | | extended | | | ||
+ | nationality | character varying(200) | | extended | | | ||
+ | residence | character varying(200) | | extended | | | ||
+ | asgseq | integer | | plain | | | ||
+ | asgtype | integer | | plain | | | ||
+ | |||
+ | ===For the USPTO Assignment Data=== | ||
+ | |||
+ | You'll need to run a series of scripts: | ||
+ | |||
+ | USPTO_Assignee_Download.pl to get the zip files from the USPTO and unzip them from USPTO bulk website. | ||
+ | |||
+ | USPTO_Assignee_Splitter.pl to break the files into individual xml files | ||
+ | |||
+ | ====USPTO_Assignee_Download.pl==== | ||
+ | |||
+ | This is correct: | ||
+ | |||
+ | Run USPTO_Assignee_Download.pl, which is located in McNair/usptoAssigneeData, to download the data from uspto bulk website. | ||
+ | |||
+ | Notes: | ||
+ | *Instead of taking two arguments, as the USPTO Patent Parser does to download the patent data, it takes text file in the same directory as USPTO_Assignee_Download.pl. The text file should contain one line, the url leading to the page on the USPTO bulk website where all the zip files are located for the USPTO Assignee data. The current url is https://bulkdata.uspto.gov/data2/patent/assignment/. An example of a text file that can be passed to USPTO_Assignee_Download.pl is Base_URLs, which is located under McNair/usptoAssigneeData. | ||
+ | *The perl script downloads all the zip files present on the page. It is not possible currently to specify which year to start the download at. Therefore, before the data has been split, you will have to check what the last reel number and frame number that were loaded into the database are. | ||
+ | |||
+ | Now to actually run the scripts: | ||
+ | Open a command line or powershell | ||
+ | e: | ||
+ | cd .\McNair\usptoAssignmentData | ||
+ | perl USPTO_Assignee_Download.pl Base_URLs | ||
+ | |||
+ | After the files have been downloaded, move them to McNair/usptoAssigneeData/processing. | ||
+ | |||
+ | ====USPTO_Assignee_Splitter.pl==== | ||
+ | |||
+ | Run USPTO_Assignee_Splitter.pl, which is located in McNair/usptoAssigneeData, to split the data.This will break up the large xml files. | ||
+ | |||
+ | To do this: | ||
+ | Go to: E:/McNair/usptoAssigneeData (this is the version of the splitter to use) | ||
+ | Run the command: | ||
+ | perl USPTO_Assignee_Splitter.pl | ||
+ | Each file will then be blown out into a directory of xml files in E:/McNair/usptoAssigneeData/name-of-directory-you-chose | ||
+ | |||
+ | Notes: | ||
+ | *Change line 26 of script to reflect the year (name-of-directory-you-chose, the directory that you want to put the split files into) | ||
+ | this will go ahead and put the files in the directory so you don't have to copy them. | ||
+ | |||
+ | ====xml_parser2.plx==== | ||
+ | |||
+ | Now to parse the data and load it into the RDP database. This parser, xml_parser2.plx, is located under McNair/usptoAssigneeData. Do not use the other parsers located in this directory - they are believed to be out-of-date. | ||
+ | |||
+ | Before parsing, look in the "patent" database and look for the most recent frameno and reelno so we don't copy over data that we already have in the database. Remove all files that have data we already loaded into the database before running the parser. | ||
+ | |||
+ | current max reelno: 39068 (4/20/2017) | ||
+ | current max frameno 9680 (4/20/2017) | ||
+ | |||
+ | This will load the data into the database. Running USPTO_Assignee_Splitter.pl should have loaded the files into a directory of your choosing. Pass the path to the directory to the parser and it will parse the data and load it. | ||
+ | |||
+ | Now to actually run the scripts: | ||
+ | Open a command line or powershell | ||
+ | e: | ||
+ | cd .\McNair\usptoAssigneeData | ||
+ | perl xml_parser2.plx name-of-directory-you-chose | ||
+ | |||
+ | Notes: | ||
+ | *The parser will open a connection to a database on the RDP's installation of postgres. It will then put the data directly into this database. Once complete. we manually move the tables to the dbase server's database (i.e. patent). | ||
+ | *The password hint is ''tsn''. You can run pgAdmin III to connect to it. | ||
+ | |||
+ | ====Cleaning up the USPTO Assignee Data==== | ||
+ | |||
+ | Existing documentation that seems relevant to cleaning/moving the UPSTO Assignee data over from the RDP database: | ||
+ | [[USPTO_Bulk_Data_Processing]] | ||
+ | [[PTO_Tables]] | ||
+ | [[USPTOAssigneesData]] | ||
+ | |||
+ | The xml_parser2.plx creates four tables: Assignment, Assignees, Assignors, and Properties. These appear to correspond to ptoassignment, ptoassignee, ptoassignor, and ptoproperty, respectively in the "patent" database. There is another pto table called "ptopatentfile" that has the following schema, but I cannot find out how it is populated; the xml_parser2.plx does not create this table. | ||
+ | |||
+ | Table "public.ptopatentfile" | ||
+ | Column | Type | Modifiers | Storage | Stats target | Description | ||
+ | reel_no | integer | | plain | | | ||
+ | frame_no | integer | | plain | | | ||
+ | action_key_code | character varying(10) | | extended | | | ||
+ | uspto_transaction_date | date | | plain | | | ||
+ | uspto_date_produced | date | | plain | | | ||
+ | version | numeric | | main | | | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | Disregard what is below, this is now believed to be incorrect: | ||
+ | |||
+ | For USPTO Assignment Data, there is a script, under McNair/usptoAssignment, called USPTO_Assignee_Download, which lets a user pass it a text file (file ending in.txt) which contains the url(s) of the assignment data that needs to be downloaded. The script then downloads all the zip files available at that URL. An example called BaseUrls.txt (containing the url that you will probably be using to download the assignment data, unless you're downloading the data from this currrent year, which is in a different link) can be found in McNair/usptoAssignment It then places the downloaded zip files in "E:/McNair/usptoAssigneeData/name", where "name" is the name of the file. If you want to check which files have already been processed, check "McNair/usptoAssigneeData/Finished" to see the finished zip files. (In the future, this should be updated, if possible to specify which years to download, since all assignment data that is not from this current year is under one url, and we've already downloaded most of it.) | ||
+ | |||
+ | Then, to parse the actual files, do the following: | ||
+ | |||
+ | For the USPTO Assignment Data, the parsing file is called USPTO_Assignee_XML_parser. It takes the path to the files that need to be parsed (an example would be ":E/McNair/Files_to_process" where | ||
+ | "Files_to_process" is the name of the folder you've placed the xml files to be parsed. It iterates through all the files in the "year" directory that you passed. This file directly loads the information into the database | ||
+ | while it parses the file. | ||
+ | |||
+ | ===For the USPTO Maintenance Fee Data=== | ||
+ | |||
+ | Download the file manually from https://bulkdata.uspto.gov/data2/patent/maintenancefee/and place file in McNair/Patent Data/Maintenance Fee Data. Then, go into the file and add headers for each column separated by spaces (make sure that each header lines up with the start of the column. To see an example of how to do this, look at the example file MaintFeeEvents_20170410-wHeader.txt under McNair/Patent Data/Maintenance Fee Data. | ||
+ | |||
+ | Then, run the normalizer on the text file. To do this: | ||
+ | |||
+ | Open a command line or powershell | ||
+ | e: | ||
+ | cd .\McNair\PatentData/Maintenance Fee Data | ||
+ | perl NormalizerFixedWidth.pl -file=MaintFeeEvents_20170410-wHeaders.txt | ||
+ | |||
+ | Where "MaintFeeEvents_20170410-wHeaders" is the name of the file with the added headers at the top. This script will put the normalized (cleaned) file in MaintFeeEvents_20170410-wHeader-normal.txt (basically appends "-normal" to whatever file name you pass it). | ||
+ | |||
+ | To then make a table out of the normalized text file, use the SQL detailed on [[Patent_Expiration_Rules]] | ||
+ | |||
+ | This will create entirely new tables from the maintenance fee data. To avoid repeating data, we will most likely just replace the existing tables in the database with the new tables. | ||
+ | |||
+ | ===For the USPTO Historical Patent Data=== | ||
+ | |||
+ | Has not been updated since 2015, but if you need to download Historical Patent data this is the link: https://bulkdata.uspto.gov/data2/patent/historical/2014/ | ||
+ | |||
+ | == Specifications of USPTO Assignment Data To Extract in Future Perl Scripts== | ||
+ | Go to https://bulkdata.uspto.gov/ to bulk data from USPTO. | ||
+ | |||
+ | To see a description of what each file the USPTO bulk data contains, go to the bulk drive and navigate to McNair/Projects/Redesigning Patent Database/2017BulkDataProductDescriptions. This gives an overview, but does not explain how the XML files are structured. Those are the DTDs. | ||
+ | |||
+ | For assignment data, we pull from https://bulkdata.uspto.gov/data2/patent/assignment/. A description of all the elements in an USPTO assignment XML file is called a DTD, and the one for the assignment data can be found in the bulk drive under McNair/Project/Redesigning Patent Database/USPTO Assignment DTD | ||
+ | |||
+ | I'm currently looking at the DTDs for USPTO patent data (2005 and up DTDs have to be opened by Microsoft Visual Studio) to ascertain if there are any fields we're not currently pulling from the bulk data that we should. I am using the following link to figure out how to read a DTD: http://www.ldodds.com/delta/dtd_guide.html. There are only DTDs for 2005 and up, but there is a very long pdf that appears to detail the format of patent files pre - 2005. It has been saved under McNair/Projects/Redesigning Patent Database/ | ||
+ | |||
+ | '''Pre - 2005''' | ||
+ | |||
+ | According to the documentation, the inventor's name and city are required to be listed, and the street, state, and country of the inventor may also be listed. | ||
+ | |||
+ | [[File:USPTO pre - 2005 inventors data description.png]] | ||
+ | |||
+ | Paragraphs of abstract may also be listed under logical group "Abstract" in a field called "abstract": | ||
+ | |||
+ | [[File:USPTO pre-2005 abstract data format.png]] | ||
+ | |||
+ | |||
+ | '''2005 - present''' | ||
+ | |||
+ | The USPTO patent data for 2005 seems to include multiple paragraphs for the abstract under an element called "abstract". I've included the line from the DTD below: | ||
+ | |||
+ | <!--A concise summary of the disclosure.--> | ||
+ | <!ELEMENT abstract (doc-page+ | (abst-problem , abst-solution) | p+)> | ||
+ | |||
+ | <!ATTLIST abstract id ID #IMPLIED lang CDATA #IMPLIED status CDATA #IMPLIED > | ||
+ | |||
+ | An abstract is required to be included for all patents that are not design patents, and according to the DTD must have at least one paragraph element. | ||
+ | |||
+ | As far as inventors go, it looks like for 2005 - up patents, the inventor is simply the applicant. this is copied from the DTD: | ||
+ | |||
+ | <!--(US: applicant is always the inventor. Use "applicant" with applicant type attribute of "applicant-inventor.")--> | ||
+ | <!ELEMENT applicant (addressbook+ , nationality , residence , us-rights* , designated-states? , designated-states-as-inventor?)> | ||
+ | |||
+ | <!ATTLIST applicant sequence CDATA #REQUIRED app-type (applicant | applicant-inventor ) #REQUIRED designation (all | all-except-us | us-only | as-indicated ) #REQUIRED > | ||
+ | |||
+ | If for some reason the inventor could not be the applicant, this element would contain the inventor: | ||
+ | |||
+ | <!--The individual or entity responsible for creating the matter that protection is sought for. (US: populated only when the inventor was unable to be the applicant.)--> | ||
+ | <!ELEMENT inventor (addressbook+ , designated-states?)> | ||
+ | |||
+ | <!ATTLIST inventor sequence CDATA #REQUIRED designation (all | all-except-us | us-only | as-indicated ) #IMPLIED > | ||
+ | |||
+ | <!--Inventors information.--> | ||
+ | <!ELEMENT inventors (inventor | deceased-inventor)+> | ||
+ | |||
+ | |||
+ | It seems that we have already pulled patent abstracts before with the existing scripts. I found a huge word document with abstracts under McNair/Software/Scripts/Patent/Abstracts_2016. | ||
== Test Plan == | == Test Plan == | ||
==Log:== | ==Log:== | ||
+ | '''2/16/2017''' - Talked over project with Ed, began reading existing wiki pages related to patent data and databases | ||
+ | |||
+ | '''2/21/2017''' - Brushed up on SQL, Entity - Relationship model of designing databases | ||
+ | * In the documentation, I want to briefly explain what the entity-relationship model is before including | ||
+ | the diagram so that readers have a little bit of background | ||
+ | * Found a tool for creating a visual representation called ERDPlus.com - create a standalone instead of an account, can download | ||
+ | Learning commands from Patent Data - SQL Steps | ||
+ | * copy command is PostgreSQL that copies a SQL table to a text file | ||
+ | ** DELIMITER set what will separate columns in text file | ||
+ | ** HEADER specifies that there will be a header in the text file with the names of the columns | ||
+ | * Definitely need to include more detail about what these do in the documentation | ||
+ | * insert into command inserts a new entry into the table | ||
+ | |||
+ | '''2/23/2017''' - Read great database design article, dug through some more wiki articles, started reviewing Perl | ||
+ | * What client do we use to interact with the current patent database? | ||
+ | * Will need to determine all the fields that need to be included in the database before finishing the design and ER diagram, will need Ed's input | ||
+ | |||
+ | '''3/2/2017''' - Started compiling a list of what fields to include and how they would be related. | ||
+ | * Created an Excel spreadsheet that records the each table, their current attributes in the existing patentdata table, what I think the attributes should be in the new table, their relationship to a patent (i.e. one-to-many, many-to-many, etc.), their primary key, questions I have relating to the table, future steps for cleaning up the data in the table (i.e., once all the data has been move to the new database, removing patents that are not US-based), and current problems that have been recorded with the existing table for that information (if an existing table exists) | ||
+ | * Once Excel spreadsheet is completed (and questions in the Questions column are answered or removed from the spreadsheet entirely) I will look into trying to embed it on my "Redesigning Patent Database" wiki page so that future users can sort of follow my thought process. I will also create separate wiki pages to explain each table once the new database is created | ||
+ | ** instructions for adding a table to a wiki page https://www.mediawiki.org/wiki/Help:Tables | ||
+ | * Making spreadsheet led me to realize there is some data that is repeated (filedate in fee table when it is also located in patent table, and the fee table includes the patent | ||
+ | |||
+ | '''3/7/2017''' - Continued working on spreadsheet, update project page with relevant links, developed following plan for documentation | ||
+ | * Most important thing for new documentation in my opinion is a clear flow, i.e., one central page on the database, which contains links to the following: | ||
+ | ** Where data is (patent data) where it came from (Harvard Dataverse, etc), the scripts/SQL used to obtain the data, and where original xml files can be located | ||
+ | ** Next, there should be an ER diagram that shows the general design, followed by short descriptions of each table is, what it's related to, and what it's use is | ||
+ | ** Each table should contain a link to a separate wiki page on the table which will explain all the columns, where the data came from, provide a visual example of data in the table, list issues, and also contain a log of what has been done to the table over time | ||
+ | ** there should also be instructions on the page for how, if someone is going to change the database (add a table, alter a column, etc.) which documentation they should update (the wikipage for the table, for example) | ||
+ | ** Links to USPTO Assignee Database with an explanation of how it is joined with the Patent Database | ||
+ | ** Lastly, everything should be dated. I know that each wiki-page lists when it was last edited, but the main page, next to table descriptions, should have the date the table was originally added. The ER diagram should be updated as needed and also dated so that readers know if it's the most current representation of the database. | ||
+ | |||
+ | '''3/9/2017''' - Finished up first draft of spreadsheet, found ER diagram tools | ||
+ | * Old ER diagram (see Documentation Relevant to Current Databases) is difficult to read - it appears to show tables that don't exist and data that doesn't exist in the current schema | ||
+ | * psql allpatent does not work. psql allpatent_clone does for some reason | ||
+ | |||
+ | '''3/21/2017''' - Working on determining "core" tablse that will always be in database and understanding work on assignees info | ||
+ | * http://postgresonline.com/special_feature.php?sf_name=postgresql83_psql_cheatsheet&outputformat=html | ||
+ | ** helpful commands for interacting with psql | ||
+ | * Notes on determining "core" tables for database are saved under Projects/Redesigning Patent Database/Determining which tables should be core tables | ||
+ | '''3/23/2017''' - Made spreadsheet with core tables and share it with other people working on patent data to get their approval | ||
+ | * This spreadsheet can be found under Projects/Redesigning Patent Database/Core Tables for New Patent Database | ||
+ | * Some tables that will later be deleted were included on the spreadsheet because their are currently being tables built to replace them | ||
+ | * May try to just move all the (twenty-something) "pto-" tables that have been created due to the "Restructuring Patent Data" project from "patent" to the new database | ||
+ | * Will work on understanding SQL for filling new database from this link next week [[Patent_Data_Processing_-_SQL_Steps]] and [[Patent_Data_Cleanup_(June_2016)]] | ||
− | + | '''4/4/2017''' - Found all the pages on extracting data and making tables and databases | |
+ | * Do not need to pull Harvard Dataverse data again, it's saved in CSV files on the bulk drive | ||
+ | * Started looking through DTDs for USPTO patent and assignment data to determine if there is extra information that we should extract from USPTO data. |
Latest revision as of 12:41, 21 September 2020
Redesigning Patent Database | |
---|---|
Project Information | |
Has title | Redesigning Patent Database |
Has owner | Shelby Bice |
Has start date | 201701 |
Has deadline date | |
Has keywords | Database, Patent |
Has project status | Complete |
Subsumed by: | Reproducible Patent Data |
Has sponsor | McNair Center |
Has project output | Data |
Copyright © 2019 edegan.com. All Rights Reserved. |
Documentation on the process of updating the patent database with new data and eventual designs for the new patent database. Not to be confused with "Patent Data Restructure", which deals with condensing and cleaning USPTO assignees data.
NOTE FOR ED: Places I looked for the parser for the USPTO assignment data and did not find it are located in a txt file under McNair/Projects/Redesigning Patent Database called "Notes on where the correct parser for the USPTO Assignment Data is and is not". Hopefully that will help narrow your search or help you determine if you should keep debugging the one we have.
For anyone looking to pick up this project: Ended with notes on how to add data to database (see Current Design and Scripts Documentation). Below are some checklist items if we overhauled the database and rebuilt all the tables from scratch, which could be a next step. There is a lot of notes on this page about designing the database and what tables are necessary and which have been created for projects. Feel free to consult these if you are overhauling the database. I would suggest going through the tables in an existing patent database and instead removing extraneous tables that no longer have any use instead of rebuilding the entire database, since I determined that the existing schema isn't that convoluted and the process of downloading, parsing, and cleaning data from USPTO, let alone the Harvard Dataverse, is a huge hassle, so there isn't any reason to reinvent the wheel in my opinion.
If you're looking to just pull new data from USPTO and want to know the process, see "Current Design and Scripts Documentation."
If you're rewriting any of the Perl scripts that download data from USPTO, specifically USPTO Assignment data (sometimes referred to on the wiki as USPTO Assignee Data), please take a look at "Specifications of USPTO Assignment Data To Extract in Future Perl Scripts." I looked through the DTDs and documentation on the USPTO bulk website and found a couple things that we could be extracting from the data we download that could be useful, like abstracts and more inventor information.
Contents
- 1 Redesigning Patent Database
- 2 Related Projects
- 3 Documentation Relevant to Current Patent Database
- 4 Description
- 5 Development
- 6 Current Design and Scripts Documentation
- 7 Specifications of USPTO Assignment Data To Extract in Future Perl Scripts
- 8 Test Plan
- 9 Log:
Redesigning Patent Database
- Design a better representation for database
- Fix scripts if necessary
- Start moving data into new database by querying existing databases (using SQL)
- Use scripts to query new data
- Test database
- Remove extraneous information from database (copies, patents that we're not interested in, etc.)
Documentation I need to include:
- Schema of new database (with justification of design), would like to include a visual representation
- SQL commands that were used to fill database with explanation of what they do
- Clear instructions on where to find scripts in bulk drive and an explanation of what each script does
- Visual representation of example table entries that isn't just copied and pasted from a CSV file
Related Projects
- Patent Assignment Data Restructure
- Small Inventors Project - uses Fee Status and Citations
- Medical Centers and Grants - uses patent assignees, specifically their zipcodes and organizations
Documentation Relevant to Current Patent Database
Previous documentation on the patent database:
As of 3/21/2017 the most up-to-date database containing patent data is "patent" not "allpatent" or "allpatent_clone" and "patent" is the database that the the other patent data redesign project, Restructuring Patent Data (link above) is working with. The datbase "allpatent" has since been removed, but it can be restored if it is needed.
Patent Data - overview of what the data is and where it came from, probably starting point for changing documentation
Patent Database - overview of schema of database (specifically, the database patent, which includes data from Harvard dataverse (originally stored in patentdata) and USPTO (patent_2015)
USPTO Assignees Database - enhances assignee info in patent database, also being redesigned
Problems with Patent Database - lists issues with current schema
Previous ER Diagram - does not match up with schema described in Patent Database and contains outdated list of what we want to pull from XML files
Processing Patent Data - states that allpatent is the newest database and an amalgamation of patentdata or patent_2015
Description
The purpose of this project is to create a new, redesigned database to hold all of the patent information that the McNair Center has accumulated and document the process so that the design can be easily understood and replicated or edited as needed.
This database will include design patents, utility patents, and reissues.
Development
Design will be built upon a relational database model. I will be referencing this article on database design as I develop the design (http://en.tekstenuitleg.net/articles/software/database-design-tutorial/one-to-many.html), and I will be creating an ER diagram using ERDPlus or Creately.
For notes on the existing database called "patent" and its tables, suggestions on redesigning tables, etc, please see e:/McNair/Projects/Redesigning Patent Database.
Current Design and Scripts Documentation
The scripts for querying data for the patent database exist in McNair/software/scripts/patent. If the design for the schema of the database tables are altered too much, may have to write new scripts.
The following pages are relevant to how previous databases are built/how to build tables in the database:
Harvard_Dataverse_Data - explains how to make tables from Harvard Dataverse data, where to find scripts, etc.
USPTO Data - explains how to make tables from USPTO data, where to find scripts, etc, specifically for assignment data.
Patent Data Extraction - explains locations of XML files and lists (at the bottom) where the Perl scripts can be found
Patent Data Cleanup - explains changes that were made to clean up problems in the database allpatent as a result of merging the Harvard Dataverse data and the USPTO data
Patent Data Processing - SQL Steps - explains SQL needed to merge two existing databases, one that contained the Harvard Dataverse data and one that contained the USPTO data
Here at the instructions I'm developing for downloading, parsing, and hopefully adding new data to the database since the documentation is very sparse (can also be found under McNair/Projects/Redesigning Patent Database/Instructions on how to download patent data form USPTO bulk data.
How to run Perl Scripts to extract Patent Data
- Log on to the RDP
- Open powershell
- Change directory to wherever the script is located by doing: cd e:/mcnair/whatever
- Run the script by doing:
perl scriptname arg1 arg2 ... or perl scriptname argument1=arg1 ... where arg1 and arg2 are arguments passed to the script either directly or by name (e.g., argument1=)
For the Main Patent Data
You'll need to run a series of scripts:
- USPTO_Parser.pl to get the zip files from the USPTO and unzip them
- splitter.pl to break the files into individual xml files
USPTO_Parser.pl
USPTO_Parser.pl can be found under
E:/McNair/Software/Scripts/Patent
Notes:
- You may need to fix the base url in the script. The USPTO has changed it before. It is currently: https://bulkdata.uspto.gov/data2/patent/grant/redbook
- Instead of taking a text file containing a url, as the USPTO_Assignee Download does, it takes two arguments, year 1 and year2, which are supposed to represent the range of data that you wish to download (for example, 2015 to 2016).
- The perl script places the downloaded zip files into "E:/McNair/PatentData/name" where "name" is the name of the zip file.
- The folder "Processed" under McNair/PatentData appears to hold all the unzipped zip files that have been downloaded and processed already. So if you are curious if some files have already been processed, you could look there. They are organized by year.
Now to actually run the scripts:
Open a command line or powershell e: cd .\McNair\Software\Scripts\Patent perl USPTO_Parser.pl 2016 2017
Note the zip files should appear briefly (sequentially) in E:/McNair/Software/Scripts/Patent before disappearing and reappearing unzipped in E:/McNair/PatentData
splitter.pl
Now we need to split the files into individual, valid xml files. To do this:
Move the files to be split into E:/McNair/PatentData/Queue Go to: E:/McNair/PatentData (this is the version of the splitter to use) Run the command: perl splitter.pl Each file will then be blown out into a directory of xml files in E:/McNair/PatentData/Processed
Notes:
- Change line 26 of script to reflect the year (name of the directory that you want to put the split files into the appropriate directory)
this will go ahead and put the files in the directory so you don't have to copy them.
xmlparser_4.5_4.4_4.3.pl
The next step is to parse the actual files. Do not use the perl script PatentParser.pl. This script is out of date.
Instead go to:
E:/McNair/PatentData/Processed
and use the perl script called xmlparser_4.5_4.4_4.3.pl by running (for example):
perl xmlparser_4.5_4.4_4.3.pl E:\McNair\PatentData\Processed\2016
This will load the data into the database. Running splitter.pl should have loaded the files into a directory. Pass the path to the directory to the parser and it will parse the data and load it. Notes:
- The parser will open a connection to a database on the RDP's installation of postgres. It will then put the data directly into this database. Once complete. we manually move the tables to the dbase server's database (i.e. patent).
- The password hint is tsn. You can run pgAdmin III to connect to it.
- The default database is called PatentDB
- If you want to make a new dbase, you'll have to run a sql script to make the tables. It is E:\McNair\Software\Scripts\Patent\createTables.sql
- The script populates an inventors table. We may have failed to move this table over to the production dbase server.
- If you are updating the data, make sure that you don't add duplicate records to the dbase. The easiest way to fix this is to make sure that you don't parse duplicate xml files.
During the last update:
- ipg160322_6745.xml was the last loaded XML file. It covered <doc-number>09295186</doc-number><kind>B2</kind><date>20160322</date>. I confirmed that this was the last file in the E:\McNair\PatentData\Processed\ipg160322.
- I also confirmed that this was the highest patent number with the highest grant data in the dbase PatentDB
- I therefore put every folder from ipg160329 to ipg161227 into the E:\McNair\PatentData\Processed\2016 folder
- The script (xmlparser_4.5_4.4_4.3.pl) required substantial modification. In particular, the data structure implied by E:\McNair\Software\Scripts\Patent\createTables.sql was inadequate. As a consequence, I created a new database called PatentUpdate and modified its structure. All of the additional data is now loaded into there.
Next steps
We now need to:
- Retrieve the data out of PatentUpdate and reprocess it so that it will fit into patent on the main db server
Cleaning up patent data and inserting it into the existing patent data table
For the table called "patent"
INSERT statement:
INSERT INTO patents_merged ( SELECT patents.patentnumber, patents.grantdate, patents.prioritydate, patents.prioritycountry, patents.prioritypatentnumber, patents.cpcsubgroup, patents.pctpatentnumber, patents.numberofclaims, patents.applicationnumber, NULL, patents.filingdate, NULL, NULL, NULL, NULL FROM patents );
Equivalent COPY statements:
First copy data from RDP to a CSV file:
\COPY (SELECT patents.patentnumber, patents.grantdate, patents.prioritydate, patents.prioritycountry, patents.prioritypatentnumber, patents.cpcsubgroup, patents.pctpatentnumber, patents.numberofclaims, patents.applicationnumber, NULL, patents.filingdate, NULL, NULL, NULL, NULL FROM patents) TO '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; --COPY 1646225
Then copy CSV file into table:
\COPY patents_merged FROM '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; -- RESULT : COPY 1646225
where "patents" is the name of the table containing the data from the RDP update and "patents_merged" is the name of the table we want to insert the new data into.
Table that we want to insert in patent is called "patent". So you would replace "patents_merge" with "patent".
The data in columns nber, uspc, and uspc_sub is taken from historicalpatentdata. Here is the SQL code to do that after the new data from the RDP has been inserted into the table "patent".
UPDATE patent p SET nber = hp.nber, uspc = hp.uspc, uspc_sub = hp.uspc FROM historicalpatentdata hp WHERE hp.patentnumber = CAST(p.patent AS varchar);
Missing information (represented by NULL): gyear (int) appear (int)
Schema of patent data when loaded into RDP (based on existin documentation - have not checked schema on RDP postgres server):
patentnumber | int | not null kind | varchar | grantdate | date | type | varchar | applicationnumber | varchar | filingdate | date | prioritydate | date | prioritycountry | varchar | prioritypatentnumber | varchar | ussubclass | varchar | maingroup | varchar | subgroup | varchar | cpcsubclass | varchar | cpcmaingroup | varchar | cpcsubgroup | varchar | classificationnationalcountry | varchar | classificationnationalclass | varchar | title | varchar | numberofclaims | int | primaryexaminerfirstname | varchar | primaryexaminerlastname | varchar | primaryexaminerdepartment | varchar | pctpatentnumber | varchar | filename | varchar |
Schema of patent table in patent (called "patent"):
patent | integer | not null | plain | | gdate | date | | plain | | prioritydate | date | | plain | | prioritycountry | character varying | | extended | | prioritypatentnumber | character varying | | extended | | cpcsubgroup | character varying | | extended | | pctpatentnumber | character varying | | extended | | claims | integer | | plain | | appnum | integer | | plain | | gyear | integer | | plain | | appdate | date | | plain | | appyear | integer | | plain | | nber | integer | | plain | | uspc | character varying | | extended | | uspc_sub | character varying | | extended | |
Indexes:
"patents_pkey" PRIMARY KEY, btree (patent) "patent_idx" UNIQUE, btree (patent)
For the table called "citation"
Schema of citations table in patent
Table "public.citation" Column | Type | Modifiers | Storage | Stats target | Description --------+--------+-----------+---------+--------------+------------- patent | bigint | | plain | | cited | bigint | | plain | |
We just need citingpatentnumber and citedpatentnumber where citedpatentnumber is an seven digit int from the RDP database.
Use this script to replace textual patent numbers as ints (from existing documentation)
CREATE OR REPLACE FUNCTION cleanpatno (text) RETURNS bigint AS $$ if ($_[0]) { my $var=$_[0]; if ($var=~/^\d*$/) {return $var;} return undef; } return undef; $$ LANGUAGE plperl;
INSERT statement
INSERT INTO citation ( SELECT new_citations.citingpatentnumber, new_citations.citedpatent number FROM patents );
Equivalent COPY statement:
\COPY citations (SELECT new_citations.citingpatentnumber, new_citations.citedpatent number FROM patents) TO '/tmp/citations_merge_export.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; --COPY 1646225
Then copy CSV file into table:
\COPY citations_merged FROM '/tmp/citations_merge_export.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; --1607724
where "citations" is the name of the table containing the data from the RDP update and "citations_merge" is the name of the table we want to insert the new data into.
For the table called assignee
INSERT statement (copies data from table we loaded from RDP into existing table in patent database
INSERT INTO assignees_merge ( SELECT assignees.lastname, assignees.firstname, assignees.address, assignees.postcode, assignees.orgname, assignees.city, assignees.country, assignees.patentnumber, assignees.state, assignees.patentcountry, NULL, NULL, NULL, NULL FROM assignees );
Equivalent COPY statemnets:
\COPY assignees (SELECT assignees.lastname, assignees.firstname, assignees.address, assignees.postcode, assignees.orgname, assignees.city, assignees.country, assignees.patentnumber, assignees.state, assignees.patentcountry, NULL, NULL, NULL, NULL FROM assignees) TO '/tmp/merged_assignees_export.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; --COPY 1646225 Then copy CSV file into table: \COPY assignees_merge FROM '/tmp/assignees_merge_export.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; --1607724
where "assignees" is the name of the table containing the data from the RDP update and "assignees_merge" is the name of the table we want to insert the new data into.
Table that we want to insert in patent is called assigneeinfo. So you would replace "assignees_merge" with "assigneeinfo".
Missing information (represented by NULL):
nationality (char sequence)
residence (char sequence)
asgseq (int)
asgtype (int)
Schema of assignees data when loaded into RDP (based on exisitng docmentation - have not checked schema on RDP postgres server): lastname | text | firstname | text | orgname | text | city | text | country | text | patentcountry | text | patentnumber | integer | state | text | address | text | postcode | text |
Schema of assignees table in patent (named assigneeinfo)
lastname | character varying(200) | | extended | | firstname | character varying(200) | | extended | | address | character varying(200) | | extended | | postcode | character varying(200) | | extended | | orgname | character varying(500) | | extended | | city | character varying(200) | | extended | | country | character varying(200) | | extended | | patent | character varying(200) | | extended | | state | character varying(200) | | extended | | patentcountry | character varying(200) | | extended | | nationality | character varying(200) | | extended | | residence | character varying(200) | | extended | | asgseq | integer | | plain | | asgtype | integer | | plain | |
For the USPTO Assignment Data
You'll need to run a series of scripts:
USPTO_Assignee_Download.pl to get the zip files from the USPTO and unzip them from USPTO bulk website.
USPTO_Assignee_Splitter.pl to break the files into individual xml files
USPTO_Assignee_Download.pl
This is correct:
Run USPTO_Assignee_Download.pl, which is located in McNair/usptoAssigneeData, to download the data from uspto bulk website.
Notes:
- Instead of taking two arguments, as the USPTO Patent Parser does to download the patent data, it takes text file in the same directory as USPTO_Assignee_Download.pl. The text file should contain one line, the url leading to the page on the USPTO bulk website where all the zip files are located for the USPTO Assignee data. The current url is https://bulkdata.uspto.gov/data2/patent/assignment/. An example of a text file that can be passed to USPTO_Assignee_Download.pl is Base_URLs, which is located under McNair/usptoAssigneeData.
- The perl script downloads all the zip files present on the page. It is not possible currently to specify which year to start the download at. Therefore, before the data has been split, you will have to check what the last reel number and frame number that were loaded into the database are.
Now to actually run the scripts:
Open a command line or powershell e: cd .\McNair\usptoAssignmentData perl USPTO_Assignee_Download.pl Base_URLs
After the files have been downloaded, move them to McNair/usptoAssigneeData/processing.
USPTO_Assignee_Splitter.pl
Run USPTO_Assignee_Splitter.pl, which is located in McNair/usptoAssigneeData, to split the data.This will break up the large xml files.
To do this:
Go to: E:/McNair/usptoAssigneeData (this is the version of the splitter to use) Run the command: perl USPTO_Assignee_Splitter.pl Each file will then be blown out into a directory of xml files in E:/McNair/usptoAssigneeData/name-of-directory-you-chose
Notes:
- Change line 26 of script to reflect the year (name-of-directory-you-chose, the directory that you want to put the split files into)
this will go ahead and put the files in the directory so you don't have to copy them.
xml_parser2.plx
Now to parse the data and load it into the RDP database. This parser, xml_parser2.plx, is located under McNair/usptoAssigneeData. Do not use the other parsers located in this directory - they are believed to be out-of-date.
Before parsing, look in the "patent" database and look for the most recent frameno and reelno so we don't copy over data that we already have in the database. Remove all files that have data we already loaded into the database before running the parser.
current max reelno: 39068 (4/20/2017) current max frameno 9680 (4/20/2017)
This will load the data into the database. Running USPTO_Assignee_Splitter.pl should have loaded the files into a directory of your choosing. Pass the path to the directory to the parser and it will parse the data and load it.
Now to actually run the scripts:
Open a command line or powershell e: cd .\McNair\usptoAssigneeData perl xml_parser2.plx name-of-directory-you-chose
Notes:
- The parser will open a connection to a database on the RDP's installation of postgres. It will then put the data directly into this database. Once complete. we manually move the tables to the dbase server's database (i.e. patent).
- The password hint is tsn. You can run pgAdmin III to connect to it.
Cleaning up the USPTO Assignee Data
Existing documentation that seems relevant to cleaning/moving the UPSTO Assignee data over from the RDP database: USPTO_Bulk_Data_Processing PTO_Tables USPTOAssigneesData
The xml_parser2.plx creates four tables: Assignment, Assignees, Assignors, and Properties. These appear to correspond to ptoassignment, ptoassignee, ptoassignor, and ptoproperty, respectively in the "patent" database. There is another pto table called "ptopatentfile" that has the following schema, but I cannot find out how it is populated; the xml_parser2.plx does not create this table.
Table "public.ptopatentfile" Column | Type | Modifiers | Storage | Stats target | Description reel_no | integer | | plain | | frame_no | integer | | plain | | action_key_code | character varying(10) | | extended | | uspto_transaction_date | date | | plain | | uspto_date_produced | date | | plain | | version | numeric | | main | |
Disregard what is below, this is now believed to be incorrect:
For USPTO Assignment Data, there is a script, under McNair/usptoAssignment, called USPTO_Assignee_Download, which lets a user pass it a text file (file ending in.txt) which contains the url(s) of the assignment data that needs to be downloaded. The script then downloads all the zip files available at that URL. An example called BaseUrls.txt (containing the url that you will probably be using to download the assignment data, unless you're downloading the data from this currrent year, which is in a different link) can be found in McNair/usptoAssignment It then places the downloaded zip files in "E:/McNair/usptoAssigneeData/name", where "name" is the name of the file. If you want to check which files have already been processed, check "McNair/usptoAssigneeData/Finished" to see the finished zip files. (In the future, this should be updated, if possible to specify which years to download, since all assignment data that is not from this current year is under one url, and we've already downloaded most of it.)
Then, to parse the actual files, do the following:
For the USPTO Assignment Data, the parsing file is called USPTO_Assignee_XML_parser. It takes the path to the files that need to be parsed (an example would be ":E/McNair/Files_to_process" where "Files_to_process" is the name of the folder you've placed the xml files to be parsed. It iterates through all the files in the "year" directory that you passed. This file directly loads the information into the database while it parses the file.
For the USPTO Maintenance Fee Data
Download the file manually from https://bulkdata.uspto.gov/data2/patent/maintenancefee/and place file in McNair/Patent Data/Maintenance Fee Data. Then, go into the file and add headers for each column separated by spaces (make sure that each header lines up with the start of the column. To see an example of how to do this, look at the example file MaintFeeEvents_20170410-wHeader.txt under McNair/Patent Data/Maintenance Fee Data.
Then, run the normalizer on the text file. To do this:
Open a command line or powershell e: cd .\McNair\PatentData/Maintenance Fee Data perl NormalizerFixedWidth.pl -file=MaintFeeEvents_20170410-wHeaders.txt
Where "MaintFeeEvents_20170410-wHeaders" is the name of the file with the added headers at the top. This script will put the normalized (cleaned) file in MaintFeeEvents_20170410-wHeader-normal.txt (basically appends "-normal" to whatever file name you pass it).
To then make a table out of the normalized text file, use the SQL detailed on Patent_Expiration_Rules
This will create entirely new tables from the maintenance fee data. To avoid repeating data, we will most likely just replace the existing tables in the database with the new tables.
For the USPTO Historical Patent Data
Has not been updated since 2015, but if you need to download Historical Patent data this is the link: https://bulkdata.uspto.gov/data2/patent/historical/2014/
Specifications of USPTO Assignment Data To Extract in Future Perl Scripts
Go to https://bulkdata.uspto.gov/ to bulk data from USPTO.
To see a description of what each file the USPTO bulk data contains, go to the bulk drive and navigate to McNair/Projects/Redesigning Patent Database/2017BulkDataProductDescriptions. This gives an overview, but does not explain how the XML files are structured. Those are the DTDs.
For assignment data, we pull from https://bulkdata.uspto.gov/data2/patent/assignment/. A description of all the elements in an USPTO assignment XML file is called a DTD, and the one for the assignment data can be found in the bulk drive under McNair/Project/Redesigning Patent Database/USPTO Assignment DTD
I'm currently looking at the DTDs for USPTO patent data (2005 and up DTDs have to be opened by Microsoft Visual Studio) to ascertain if there are any fields we're not currently pulling from the bulk data that we should. I am using the following link to figure out how to read a DTD: http://www.ldodds.com/delta/dtd_guide.html. There are only DTDs for 2005 and up, but there is a very long pdf that appears to detail the format of patent files pre - 2005. It has been saved under McNair/Projects/Redesigning Patent Database/
Pre - 2005
According to the documentation, the inventor's name and city are required to be listed, and the street, state, and country of the inventor may also be listed.
Paragraphs of abstract may also be listed under logical group "Abstract" in a field called "abstract":
2005 - present
The USPTO patent data for 2005 seems to include multiple paragraphs for the abstract under an element called "abstract". I've included the line from the DTD below:
<!ELEMENT abstract (doc-page+ | (abst-problem , abst-solution) | p+)>
<!ATTLIST abstract id ID #IMPLIED lang CDATA #IMPLIED status CDATA #IMPLIED >
An abstract is required to be included for all patents that are not design patents, and according to the DTD must have at least one paragraph element.
As far as inventors go, it looks like for 2005 - up patents, the inventor is simply the applicant. this is copied from the DTD:
<!ELEMENT applicant (addressbook+ , nationality , residence , us-rights* , designated-states? , designated-states-as-inventor?)>
<!ATTLIST applicant sequence CDATA #REQUIRED app-type (applicant | applicant-inventor ) #REQUIRED designation (all | all-except-us | us-only | as-indicated ) #REQUIRED >
If for some reason the inventor could not be the applicant, this element would contain the inventor:
<!ELEMENT inventor (addressbook+ , designated-states?)>
<!ATTLIST inventor sequence CDATA #REQUIRED designation (all | all-except-us | us-only | as-indicated ) #IMPLIED >
<!ELEMENT inventors (inventor | deceased-inventor)+>
It seems that we have already pulled patent abstracts before with the existing scripts. I found a huge word document with abstracts under McNair/Software/Scripts/Patent/Abstracts_2016.
Test Plan
Log:
2/16/2017 - Talked over project with Ed, began reading existing wiki pages related to patent data and databases
2/21/2017 - Brushed up on SQL, Entity - Relationship model of designing databases
- In the documentation, I want to briefly explain what the entity-relationship model is before including
the diagram so that readers have a little bit of background
- Found a tool for creating a visual representation called ERDPlus.com - create a standalone instead of an account, can download
Learning commands from Patent Data - SQL Steps
- copy command is PostgreSQL that copies a SQL table to a text file
- DELIMITER set what will separate columns in text file
- HEADER specifies that there will be a header in the text file with the names of the columns
- Definitely need to include more detail about what these do in the documentation
- insert into command inserts a new entry into the table
2/23/2017 - Read great database design article, dug through some more wiki articles, started reviewing Perl
- What client do we use to interact with the current patent database?
- Will need to determine all the fields that need to be included in the database before finishing the design and ER diagram, will need Ed's input
3/2/2017 - Started compiling a list of what fields to include and how they would be related.
- Created an Excel spreadsheet that records the each table, their current attributes in the existing patentdata table, what I think the attributes should be in the new table, their relationship to a patent (i.e. one-to-many, many-to-many, etc.), their primary key, questions I have relating to the table, future steps for cleaning up the data in the table (i.e., once all the data has been move to the new database, removing patents that are not US-based), and current problems that have been recorded with the existing table for that information (if an existing table exists)
- Once Excel spreadsheet is completed (and questions in the Questions column are answered or removed from the spreadsheet entirely) I will look into trying to embed it on my "Redesigning Patent Database" wiki page so that future users can sort of follow my thought process. I will also create separate wiki pages to explain each table once the new database is created
- instructions for adding a table to a wiki page https://www.mediawiki.org/wiki/Help:Tables
- Making spreadsheet led me to realize there is some data that is repeated (filedate in fee table when it is also located in patent table, and the fee table includes the patent
3/7/2017 - Continued working on spreadsheet, update project page with relevant links, developed following plan for documentation
- Most important thing for new documentation in my opinion is a clear flow, i.e., one central page on the database, which contains links to the following:
- Where data is (patent data) where it came from (Harvard Dataverse, etc), the scripts/SQL used to obtain the data, and where original xml files can be located
- Next, there should be an ER diagram that shows the general design, followed by short descriptions of each table is, what it's related to, and what it's use is
- Each table should contain a link to a separate wiki page on the table which will explain all the columns, where the data came from, provide a visual example of data in the table, list issues, and also contain a log of what has been done to the table over time
- there should also be instructions on the page for how, if someone is going to change the database (add a table, alter a column, etc.) which documentation they should update (the wikipage for the table, for example)
- Links to USPTO Assignee Database with an explanation of how it is joined with the Patent Database
- Lastly, everything should be dated. I know that each wiki-page lists when it was last edited, but the main page, next to table descriptions, should have the date the table was originally added. The ER diagram should be updated as needed and also dated so that readers know if it's the most current representation of the database.
3/9/2017 - Finished up first draft of spreadsheet, found ER diagram tools
- Old ER diagram (see Documentation Relevant to Current Databases) is difficult to read - it appears to show tables that don't exist and data that doesn't exist in the current schema
- psql allpatent does not work. psql allpatent_clone does for some reason
3/21/2017 - Working on determining "core" tablse that will always be in database and understanding work on assignees info
- http://postgresonline.com/special_feature.php?sf_name=postgresql83_psql_cheatsheet&outputformat=html
- helpful commands for interacting with psql
- Notes on determining "core" tables for database are saved under Projects/Redesigning Patent Database/Determining which tables should be core tables
3/23/2017 - Made spreadsheet with core tables and share it with other people working on patent data to get their approval
- This spreadsheet can be found under Projects/Redesigning Patent Database/Core Tables for New Patent Database
- Some tables that will later be deleted were included on the spreadsheet because their are currently being tables built to replace them
- May try to just move all the (twenty-something) "pto-" tables that have been created due to the "Restructuring Patent Data" project from "patent" to the new database
- Will work on understanding SQL for filling new database from this link next week Patent_Data_Processing_-_SQL_Steps and Patent_Data_Cleanup_(June_2016)
4/4/2017 - Found all the pages on extracting data and making tables and databases
- Do not need to pull Harvard Dataverse data again, it's saved in CSV files on the bulk drive
- Started looking through DTDs for USPTO patent and assignment data to determine if there is extra information that we should extract from USPTO data.