We would like to download and absorb data from this location on the USPTO website into our tables. The objective is to determine whether this dataset is better than the current version of our patent data (a combination of the data in the patent_2015 and patentdata databases; see [[Patent Data]]).
<section begin=bulk />The USPTO provides bulk data recording patent transactions, applications, properties, reassignments, and history through XML files to the general public. These files have been downloaded and the data has been compiled in tables using PostgreSQL. The objective of processing the bulk data is to enhance the McNair Center's historical datasets ([[Patent Data Processing - SQL Steps|patent_2015 and patentdata]]) and track the entirety of US patent activity, specifically concerning utility patents. <section end=bulk />
== Steps Followed to Extract the USPTO Assignees Data ==
===Extracting Data from XML Files ===
Here is the DTD specified by the USPTO, which indicates which fields are optional (marked with <code>?</code>):
<?xml version="1.0" encoding="utf-8"?><br>
<!DOCTYPE us-patent-assignments [<!ELEMENT us-patent-assignments (action-key-code, transaction-date, patent-assignments)><br>
<!ATTLIST us-patent-assignments dtd-version CDATA #IMPLIED<br>
date-produced CDATA #IMPLIED><br>
<!ELEMENT action-key-code (#PCDATA)><br>
<!ELEMENT transaction-date (date)><br>
<!ELEMENT patent-assignments (data-available-code | patent-assignment+)><br>
<!ELEMENT date (#PCDATA)><br>
<!ELEMENT data-available-code (#PCDATA)><br>
<!ELEMENT patent-assignment (assignment-record, patent-assignors, patent-assignees, patent-properties)><br>
<!ELEMENT assignment-record (reel-no, frame-no, last-update-date, purge-indicator, recorded-date, page-count?, correspondent, conveyance-text)><br>
<!ELEMENT patent-assignors (patent-assignor+)><br>
<!ELEMENT patent-assignees (patent-assignee+)><br>
<!ELEMENT patent-properties (patent-property+)><br>
<!ELEMENT reel-no (#PCDATA)><br>
<!ELEMENT frame-no (#PCDATA)><br>
<!ELEMENT last-update-date (date)><br>
<!ELEMENT purge-indicator (#PCDATA)><br>
<!ELEMENT recorded-date (date)><br>
<!ELEMENT page-count (#PCDATA)><br>
<!ELEMENT correspondent (name, address-1?, address-2?, address-3?, address-4?)><br>
<!ELEMENT conveyance-text (#PCDATA)><br>
<!ELEMENT patent-assignor (name, execution-date?, date-acknowledged?)><br>
<!ELEMENT patent-assignee (name, address-1?, address-2?, city?, state?, country-name?, postcode?)><br>
<!ELEMENT patent-property (document-id*, invention-title?)><br>
<!ELEMENT name (#PCDATA)><br>
<!ATTLIST name name-type (natural | legal) #IMPLIED><br>
<!ELEMENT address-1 (#PCDATA)><br>
<!ELEMENT address-2 (#PCDATA)><br>
<!ELEMENT address-3 (#PCDATA)><br>
<!ELEMENT address-4 (#PCDATA)><br>
<!ELEMENT execution-date (date)><br>
<!ELEMENT date-acknowledged (date)><br>
<!ELEMENT city (#PCDATA)><br>
<!ELEMENT state (#PCDATA)><br>
<!ELEMENT country-name (#PCDATA)><br>
<!ELEMENT postcode (#PCDATA)><br>
<!ELEMENT document-id (country, doc-number, kind?, name?, date?)><br>
<!ELEMENT invention-title (#PCDATA | b | i | u | sup | sub)*><br>
<!ATTLIST invention-title id ID #IMPLIED<br>
lang CDATA #REQUIRED><br>
<!ELEMENT country (#PCDATA)><br>
<!ELEMENT doc-number (#PCDATA)><br>
<!ELEMENT kind (#PCDATA)><br>
<!--bold formatting for text--><br>
<!ELEMENT b (#PCDATA | i | u | smallcaps)*><br>
<!--italic formatting for text--><br>
<!ELEMENT i (#PCDATA | b | u | smallcaps)*><br>
<!--underscore: style - single is default--><br>
<!ELEMENT u (#PCDATA | b | i | smallcaps)*><br>
<!ATTLIST u style (single | double | dash | dots ) 'single' ><br>
<!--superscripted text--><br>
<!ELEMENT sup (#PCDATA | b | u | i)*><br>
<!--subscripted text--><br>
<!ELEMENT sub (#PCDATA | b | u | i)*><br>
<!--small capitals--><br>
<!ELEMENT smallcaps (#PCDATA | b | u | i)*><br>
]><br>
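To illustrate how the DTD maps onto records, here is a minimal Python sketch that reads one <code>patent-assignment</code> element and treats the fields marked <code>?</code> in the DTD as optional. It is not the project's Perl parser: the function names <code>text_of</code> and <code>parse_assignment</code>, the output keys, and every value in the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample that follows the DTD above; none of the values are real data.
SAMPLE = """<us-patent-assignments>
  <action-key-code>NA</action-key-code>
  <transaction-date><date>20150101</date></transaction-date>
  <patent-assignments>
    <patent-assignment>
      <assignment-record>
        <reel-no>12345</reel-no>
        <frame-no>0678</frame-no>
        <last-update-date><date>20150101</date></last-update-date>
        <purge-indicator>N</purge-indicator>
        <recorded-date><date>20141231</date></recorded-date>
        <correspondent><name>EXAMPLE LAW FIRM</name></correspondent>
        <conveyance-text>ASSIGNMENT OF ASSIGNORS INTEREST</conveyance-text>
      </assignment-record>
      <patent-assignors>
        <patent-assignor><name>DOE, JANE</name></patent-assignor>
      </patent-assignors>
      <patent-assignees>
        <patent-assignee><name>EXAMPLE CORP</name><city>HOUSTON</city></patent-assignee>
      </patent-assignees>
      <patent-properties>
        <patent-property>
          <document-id><country>US</country><doc-number>1234567</doc-number></document-id>
        </patent-property>
      </patent-properties>
    </patent-assignment>
  </patent-assignments>
</us-patent-assignments>"""


def text_of(parent, path):
    """Return the stripped text at `path`, or None if the element is absent or empty."""
    node = parent.find(path)
    return node.text.strip() if node is not None and node.text else None


def parse_assignment(elem):
    """Flatten one <patent-assignment> element into a plain dictionary."""
    record = elem.find("assignment-record")
    return {
        "reel_no": text_of(record, "reel-no"),
        "frame_no": text_of(record, "frame-no"),
        "recorded_date": text_of(record, "recorded-date/date"),
        "page_count": text_of(record, "page-count"),  # optional in the DTD
        "conveyance_text": text_of(record, "conveyance-text"),
        "assignors": [
            text_of(a, "name")
            for a in elem.findall("patent-assignors/patent-assignor")
        ],
        "assignees": [
            {"name": text_of(a, "name"), "city": text_of(a, "city"),
             "state": text_of(a, "state")}  # city and state are optional
            for a in elem.findall("patent-assignees/patent-assignee")
        ],
        "doc_numbers": [
            text_of(d, "doc-number")
            for d in elem.findall("patent-properties/patent-property/document-id")
        ],
    }


root = ET.fromstring(SAMPLE)
assignments = [parse_assignment(a) for a in root.iter("patent-assignment")]
```

Returning <code>None</code> for absent optional elements keeps the missing values distinguishable from empty strings, which should map naturally onto SQL <code>NULL</code> when the records are loaded into tables.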
===Inserting Extracted Data into Tables ===
===Clean Up ===
== Scripts for processing data ==
The programs/scripts (see details below) are located on our [[Software Repository|Bonobo Git Server]].
repository: Patent_Data_Parser
branch: next
directory: /uspto_assignees_xml_parser
=== Downloading raw bulk data from USPTO ===
repository: Patent_Data_Parser
branch: next
directory: /uspto_assignees_xml_parser
file: USPTO_Assignee_Download.pl
The downloader script used to fetch the XML files is essentially the same as the one used for downloading USPTO patent data, with minor changes.
The current version of the downloader script downloads all files from the base URL: https://bulkdata.uspto.gov/data2/patent/assignment/
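The "download everything under the base URL" step can be sketched in Python as follows. This is not the Perl script itself; it is an illustration that assumes the base URL serves an HTML index page whose links point to <code>.zip</code> archives. The names <code>ZipLinkCollector</code>, <code>zip_urls</code>, and <code>download_all</code> are hypothetical.

```python
import os
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://bulkdata.uspto.gov/data2/patent/assignment/"


class ZipLinkCollector(HTMLParser):
    """Collect the href of every <a> tag that points to a .zip file."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".zip"):
            self.links.append(href)


def zip_urls(index_html, base_url=BASE_URL):
    """Return absolute URLs for all .zip links found in an index page."""
    collector = ZipLinkCollector()
    collector.feed(index_html)
    return [urljoin(base_url, href) for href in collector.links]


def download_all(index_html, dest_dir, base_url=BASE_URL):
    """Download each archive once, skipping files that already exist locally."""
    os.makedirs(dest_dir, exist_ok=True)
    for url in zip_urls(index_html, base_url):
        target = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        if not os.path.exists(target):
            urllib.request.urlretrieve(url, target)
```

Skipping files that already exist makes the download restartable, which is useful when a bulk transfer is interrupted partway through.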
=== Parsing the XML files ===
repository: Patent_Data_Parser
branch: next
directory: /uspto_assignees_xml_parser
file: uspto_assignees_XML_parser.plx
==== NAME ====
uspto_assignees_XML_parser.plx - Parses XML files and populates a database.
Specifically, it parses every file in a directory according to the DTD shown above.
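The overall flow (walk a directory, parse each XML file, populate a database) can be sketched in Python as follows. This is an illustration, not the Perl parser: it uses SQLite as a stand-in for PostgreSQL so that the example is self-contained, and the function name <code>load_directory</code>, the table name <code>assignee</code>, and its columns are hypothetical.

```python
import glob
import os
import sqlite3
import xml.etree.ElementTree as ET


def load_directory(xml_dir, conn):
    """Parse every *.xml file in xml_dir and insert one row per assignee."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS assignee "
        "(reel_no TEXT, frame_no TEXT, name TEXT, source_file TEXT)"
    )
    for path in sorted(glob.glob(os.path.join(xml_dir, "*.xml"))):
        rows = []
        # iterparse streams the file, so a large bulk file is never
        # held in memory as one complete tree.
        for _event, elem in ET.iterparse(path, events=("end",)):
            if elem.tag != "patent-assignment":
                continue
            reel = elem.findtext("assignment-record/reel-no")
            frame = elem.findtext("assignment-record/frame-no")
            for assignee in elem.iterfind("patent-assignees/patent-assignee"):
                rows.append(
                    (reel, frame, assignee.findtext("name"), os.path.basename(path))
                )
            elem.clear()  # release the children of the processed assignment
        conn.executemany("INSERT INTO assignee VALUES (?, ?, ?, ?)", rows)
    conn.commit()
```

A call such as <code>load_directory("xml_files", sqlite3.connect("assignees.db"))</code> would process a whole directory; both the directory and the database file name here are placeholders. The reel and frame numbers are stored with each row on the assumption that together they identify the assignment an assignee belongs to.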