Patent Data Extraction Scripts (Tool)
Utility patent grants fields
Patent
- patent number
- kind: http://www.uspto.gov/patents-application-process/patent-search/authority-files/uspto-kind-codes
- grantdate
For version 4.5:
<publication-reference> <document-id> <country>US</country> <doc-number>08925112</doc-number> <kind>B2</kind> <date>20150106</date> </document-id> </publication-reference>
- type
- applicationnumber
- filingdate
<application-reference appl-type="utility"> <document-id> <country>US</country> <doc-number>13824291</doc-number> <date>20110929</date> </document-id> </application-reference>
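The Patent fields above map directly onto the two snippets. As a minimal sketch (in Python rather than the Perl used by the scripts, and not the scripts' actual implementation), they could be pulled out as follows. The wrapper element name in the sample and the helper name `patent_fields` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Sample assembled from the two snippets above; the wrapper element is assumed.
SAMPLE = """<us-bibliographic-data-grant>
  <publication-reference><document-id><country>US</country>
    <doc-number>08925112</doc-number><kind>B2</kind><date>20150106</date>
  </document-id></publication-reference>
  <application-reference appl-type="utility"><document-id><country>US</country>
    <doc-number>13824291</doc-number><date>20110929</date>
  </document-id></application-reference>
</us-bibliographic-data-grant>"""


def patent_fields(root):
    """Collect the Patent fields into a plain dict (hypothetical helper)."""
    pub = root.find(".//publication-reference/document-id")
    app_ref = root.find(".//application-reference")
    app = app_ref.find("document-id")
    return {
        "patentnumber": pub.findtext("doc-number"),
        "country": pub.findtext("country"),
        "kind": pub.findtext("kind"),
        "grantdate": pub.findtext("date"),
        "type": app_ref.get("appl-type"),  # attribute, not a child element
        "applicationnumber": app.findtext("doc-number"),
        "filingdate": app.findtext("date"),
    }


fields = patent_fields(ET.fromstring(SAMPLE))
print(fields)
```

Dates are left as the raw YYYYMMDD strings here; converting them is a separate decision.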
For priority, if there is more than 1, we want sequence 01
- prioritydate
- prioritycountry (should use ISO country codes - may need a lookup table)
- prioritypatentnumber
- find 4.3 file with priority claim
<priority-claims> <priority-claim sequence="01" kind="national"> <country>GB</country> <doc-number>1016384.8</doc-number> <date>20100930</date> </priority-claim> </priority-claims>
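The "sequence 01" rule can be sketched as below. This is an illustrative Python sketch, not the scripts' code. The second claim in the sample is a made-up placeholder added only to show the selection, and falling back to the first claim when no sequence "01" exists is an assumption.

```python
import xml.etree.ElementTree as ET

# The sequence="01" claim is the snippet above; the sequence="02" claim is a
# made-up placeholder so that there is something to choose between.
SAMPLE = """<priority-claims>
  <priority-claim sequence="02" kind="national"><country>GB</country><doc-number>EXAMPLE-ONLY</doc-number><date>00000000</date></priority-claim>
  <priority-claim sequence="01" kind="national"><country>GB</country><doc-number>1016384.8</doc-number><date>20100930</date></priority-claim>
</priority-claims>"""


def first_priority(root):
    """Return the sequence-01 priority claim as a dict, or None if there is none."""
    claims = list(root.iter("priority-claim"))
    if not claims:
        return None
    # Assumption: if no claim is marked "01", take the first one in the file.
    chosen = next((c for c in claims if c.get("sequence") == "01"), claims[0])
    return {
        "prioritydate": chosen.findtext("date"),
        "prioritycountry": chosen.findtext("country"),
        "prioritypatentnumber": chosen.findtext("doc-number"),
    }


priority = first_priority(ET.fromstring(SAMPLE))
print(priority)
```

The ISO country-code lookup mentioned above is not attempted here; the country is returned exactly as it appears in the file.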
Classification IPC - we only need the first one: http://www.wipo.int/export/sites/www/classifications/ipc/en/guide/guide_ipc.pdf
- Section, Class, SubClass - Together these concord to US subclass: http://www.uspto.gov/web/patents/classification/international/ipc/ipc8/ipc_concordance/ipcsel.htm#a
- MainGroup, SubGroup
<classifications-ipcr> <classification-ipcr> <ipc-version-indicator> <date>20060101</date> </ipc-version-indicator> <classification-level>A</classification-level> <section>B</section> <class>64</class> <subclass>G</subclass> <main-group>6</main-group> <subgroup>00</subgroup> <symbol-position>F</symbol-position> <classification-value>I</classification-value> ... </classification-ipcr> ... </classifications-ipcr>
Classification CPC - we only need the main one
CPC (Cooperative Patent Classification) is a classification scheme set up jointly by the USPTO and the European Patent Office (EPO). The first classification codes were rolled out on November 9, 2012.[1] Full implementation of the CPC classification system occurred in January 2015, at the same time as the release of version 4.5 of the USPTO patent bulk data.[2]
- Section, Class, Subclass
- Main Group, Subgroup
- v 4.2, 4.3, and 4.4 do not have this
<classifications-cpc> <main-cpc> <classification-cpc> <cpc-version-indicator> <date>20130101</date> </cpc-version-indicator> <section>B</section> <class>64</class> <subclass>D</subclass> <main-group>10</main-group> <subgroup>00</subgroup> <symbol-position>F</symbol-position> <classification-value>I</classification-value> ... </classification-cpc> </main-cpc> </classifications-cpc>
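Taking "the first IPC" and "the main CPC" could look like the sketch below. It is written in Python for illustration, is not the scripts' implementation, and assumes the section letter (the "B" in the examples) is carried in a `<section>` element. The helper names are hypothetical. For versions without CPC data, the CPC lookup simply returns None.

```python
import xml.etree.ElementTree as ET

# Trimmed versions of the IPC and CPC snippets above, under an assumed wrapper.
SAMPLE = """<doc>
  <classifications-ipcr><classification-ipcr>
    <section>B</section><class>64</class><subclass>G</subclass>
    <main-group>6</main-group><subgroup>00</subgroup>
  </classification-ipcr></classifications-ipcr>
  <classifications-cpc><main-cpc><classification-cpc>
    <section>B</section><class>64</class><subclass>D</subclass>
    <main-group>10</main-group><subgroup>00</subgroup>
  </classification-cpc></main-cpc></classifications-cpc>
</doc>"""

FIELDS = ("section", "class", "subclass", "main-group", "subgroup")


def classification(elem):
    """Turn one classification element into a dict of the five wanted fields."""
    if elem is None:
        return None
    return {name: elem.findtext(name) for name in FIELDS}


def first_ipc(root):
    # find() returns the first match in document order, i.e. the first IPC entry.
    return classification(root.find(".//classifications-ipcr/classification-ipcr"))


def main_cpc(root):
    # Returns None for files that carry no CPC block.
    return classification(root.find(".//classifications-cpc/main-cpc/classification-cpc"))


root = ET.fromstring(SAMPLE)
print(first_ipc(root))
print(main_cpc(root))
```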
Classification National: Note that the one below comes out to 2/2.11 (http://www.google.com/patents/US8925112#classifications)
- Country
- Class
THIS IS NOT UNIQUE. What classifications are we searching for?
<classification-national> <country>US</country> <main-classification>2 211</main-classification> </classification-national>
Title of the patent:
<invention-title id="d2e61">Aircrew ensembles</invention-title>
Number of Claims:
<number-of-claims>12</number-of-claims>
Primary examiner:
- FirstName, LastName, Department
<examiners> <primary-examiner> <last-name>Patel</last-name> <first-name>Tejash</first-name> <department>3765</department> </primary-examiner> ... </examiners>
PCT/Regional Patent Number:
- PCTNumber (just the doc number - if it starts with PCT set a flag)
- not in all v 4.5
- not in v 4.2, 4.3, 4.4
- not all patents may be filed under PCT; we need code that searches all files for this keyword
<pct-or-regional-filing-data> <document-id> <country>WO</country> <doc-number>PCT/EP2011/067014</doc-number> <kind>00</kind> <date>20110929</date> </document-id> ... </pct-or-regional-filing-data>
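A small sketch of the PCT rule (keep the doc number, set a flag if it starts with "PCT"), again in Python and only for illustration. The function name and the returned keys are hypothetical. Patents without this element yield None.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the snippet above, under an assumed wrapper element.
SAMPLE = """<doc><pct-or-regional-filing-data><document-id>
  <country>WO</country><doc-number>PCT/EP2011/067014</doc-number>
  <kind>00</kind><date>20110929</date>
</document-id></pct-or-regional-filing-data></doc>"""


def pct_fields(root):
    """Return the PCT/regional doc number plus a PCT flag, or None if absent."""
    number = root.findtext(".//pct-or-regional-filing-data/document-id/doc-number")
    if number is None:
        return None
    number = number.strip()
    return {"pctnumber": number, "is_pct": number.startswith("PCT")}


pct = pct_fields(ET.fromstring(SAMPLE))
print(pct)
```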
Citations
Patent Citations (we need all of them):
- CitingPatentNumber (from the patent)
- CitingPatentCountry (from the patent)
<publication-reference> <document-id> <country>US</country> <doc-number>08925112</doc-number> <kind>B2</kind> <date>20150106</date> </document-id> </publication-reference>
- CitedPatentNumber
- CitedPatentCountry
- V 4.2 does not have <us-references-cited>
<us-references-cited> <us-citation> <patcit num="00001"> <document-id> <country>US</country> <doc-number>1105569</doc-number> <kind>A</kind> <name>Lacrotte</name> <date>19140700</date> </document-id> </patcit> <category>cited by examiner</category> <classification-national> <country>US</country> <main-classification>2 214</main-classification> </classification-national> </us-citation> ... </us-references-cited>
For non-patent references, we are just going to count them:
- NoNonPatRefs
<us-references-cited> ... <us-citation> <nplcit num="00020"> <othercit> European Search Report dated Jan. 20, 2011 as received in European Patent Application No. GB1016384.8. </othercit> </nplcit> <category>cited by applicant</category> </us-citation> </us-references-cited>
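The citation rules above (one row per cited patent, a simple count for non-patent references) could be sketched like this in Python. It is not the scripts' code, the helper name is hypothetical, and it only looks under `<us-references-cited>`, so v 4.2 files would need a different path.

```python
import xml.etree.ElementTree as ET

# Sample assembled from the three snippets above, under an assumed wrapper element.
SAMPLE = """<doc>
  <publication-reference><document-id><country>US</country>
    <doc-number>08925112</doc-number><kind>B2</kind><date>20150106</date>
  </document-id></publication-reference>
  <us-references-cited>
    <us-citation><patcit num="00001"><document-id><country>US</country>
      <doc-number>1105569</doc-number><kind>A</kind><name>Lacrotte</name><date>19140700</date>
    </document-id></patcit><category>cited by examiner</category></us-citation>
    <us-citation><nplcit num="00020"><othercit>European Search Report</othercit></nplcit>
      <category>cited by applicant</category></us-citation>
  </us-references-cited>
</doc>"""


def citations(root):
    """Return (list of patent-citation rows, number of non-patent references)."""
    citing = root.find(".//publication-reference/document-id")
    rows = []
    for cited in root.findall(".//us-references-cited/us-citation/patcit/document-id"):
        rows.append({
            "CitingPatentNumber": citing.findtext("doc-number"),
            "CitingPatentCountry": citing.findtext("country"),
            "CitedPatentNumber": cited.findtext("doc-number"),
            "CitedPatentCountry": cited.findtext("country"),
        })
    no_non_pat_refs = len(root.findall(".//us-references-cited/us-citation/nplcit"))
    return rows, no_non_pat_refs


rows, no_non_pat_refs = citations(ET.fromstring(SAMPLE))
print(rows, no_non_pat_refs)
```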
Inventors
- For v 4.3, 4.4, 4.5
- PatentNumber (and country) to build a key
- We need a standard name and address object for each inventor
<us-parties> <us-applicants> ... </us-applicants> <inventors> <inventor sequence="001" designation="us-only"> <addressbook> <last-name>Oliver</last-name> <first-name>Paul</first-name> <address> <city>Rhyl</city> <country>GB</country> </address> </addressbook> </inventor> ... </inventors> ... </us-parties>
- For v 4.2
<parties> <applicants> <applicant sequence="001" app-type="applicant-inventor" designation="us-only"> <addressbook> <last-name>Kamath</last-name> <first-name>Sandeep</first-name> <address> <city>Bangalore</city> <country>IN</country> </address> </addressbook> <nationality> <country>omitted</country> </nationality> <residence> <country>IN</country> </residence> </applicant> ... </applicants> ... </parties>
Assignees
- PatentNumber (and country) to build a key
- We need a "standard" name and address object for each assignee
<assignees> <assignee> <addressbook> <orgname>Survitec Group Limited</orgname> <role>03</role> <address> <city>Merseyside</city> <country>GB</country> </address> </addressbook> </assignee> </assignees>
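One way to get a "standard" name and address object for both inventors and assignees is a single record type filled from `<addressbook>`. The sketch below is in Python for illustration only; the `Party` class and function names are hypothetical, and it assumes every party has an `<addressbook>` child. Inventors are read from `<inventors>` when present and otherwise from v 4.2 style `<applicant app-type="applicant-inventor">` entries.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

V45 = """<doc><us-parties><inventors><inventor sequence="001" designation="us-only">
  <addressbook><last-name>Oliver</last-name><first-name>Paul</first-name>
  <address><city>Rhyl</city><country>GB</country></address></addressbook>
</inventor></inventors></us-parties>
<assignees><assignee><addressbook><orgname>Survitec Group Limited</orgname><role>03</role>
  <address><city>Merseyside</city><country>GB</country></address></addressbook>
</assignee></assignees></doc>"""

V42 = """<doc><parties><applicants>
<applicant sequence="001" app-type="applicant-inventor" designation="us-only">
  <addressbook><last-name>Kamath</last-name><first-name>Sandeep</first-name>
  <address><city>Bangalore</city><country>IN</country></address></addressbook>
</applicant></applicants></parties></doc>"""


@dataclass
class Party:
    role: str                  # "inventor" or "assignee"
    last_name: Optional[str]
    first_name: Optional[str]
    orgname: Optional[str]
    city: Optional[str]
    country: Optional[str]


def _party(role, elem):
    book = elem.find("addressbook")  # assumed to be present
    return Party(role, book.findtext("last-name"), book.findtext("first-name"),
                 book.findtext("orgname"), book.findtext("address/city"),
                 book.findtext("address/country"))


def parties(root):
    inventors = root.findall(".//inventors/inventor")
    if not inventors:  # v 4.2 layout: inventors are listed as applicants
        inventors = root.findall(".//applicants/applicant[@app-type='applicant-inventor']")
    assignees = root.findall(".//assignees/assignee")
    return ([_party("inventor", e) for e in inventors]
            + [_party("assignee", e) for e in assignees])


print(parties(ET.fromstring(V45)))
print(parties(ET.fromstring(V42)))
```

The key built from PatentNumber and country would be added to each record by the caller; it is left out here to keep the sketch short.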
Other things we might want
- Abstract
- Claims (other than their count)
Things we don't need
General:
Classification related:
- Level - This appears to be either core or advanced. Not sure it matters.
- SymbolPosition, ClassificationValue - we likely don't need them
- Classification status and data source - no idea what these do
About the scripts
The scripts to process the Patent Data are all located under /bulk/Software/Scripts/PatentData/ ("E:\Software\Scripts\PatentData\")
There are currently 5 .pm files: PatentApplication.pm, Inventor.pm, Claim.pm, Addressbook.pm, and Loader.pm.
Each of the first 4 represents an object type. The last one is a helper object that, given a schema file, extracts the wanted fields as a Perl object. Future work to support more schema files should be done in this file.
Example Usage:
perl PatentParser.pl -file=ipa150319.xml
This parses the XML file ipa150319.xml and extracts each patent (in this case, each PatentApplication) into a temporary XML file. A Loader object, given a schema file (in this case "us-patent-application-v44-2014-04-03.dtd"), then extracts each of the 4 object types from the patents. If an error occurs while parsing a file, that file is moved to a directory called "failed_files". A file that fails parsing is most likely not a utility patent.
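The split-then-parse flow described above can be illustrated with the following Python sketch. It is not how PatentParser.pl is written. It assumes the bulk file is a concatenation of XML documents, each starting with its own `<?xml` declaration on a new line, and it collects failures in a list where the real script moves files to "failed_files". All function names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def split_documents(lines):
    """Yield one string per XML document from an iterable of lines."""
    current = []
    for line in lines:
        # A new XML declaration marks the start of the next document.
        if line.startswith("<?xml") and current:
            yield "".join(current)
            current = []
        current.append(line)
    if current:
        yield "".join(current)


def parse_all(lines):
    """Parse every document; return (parsed roots, documents that failed)."""
    parsed, failed = [], []
    for doc in split_documents(lines):
        try:
            parsed.append(ET.fromstring(doc))
        except ET.ParseError:
            failed.append(doc)  # stand-in for moving the file to failed_files
    return parsed, failed


# Tiny stand-in for a bulk file: two well-formed documents and one broken one.
BULK = (
    '<?xml version="1.0" encoding="UTF-8"?>\n<a/>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n<b/>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n<c>\n'
)

parsed, failed = parse_all(io.StringIO(BULK))
print(len(parsed), "parsed,", len(failed), "failed")
```

With a real file, `open("ipa150319.xml", encoding="utf-8")` could be passed in place of the `io.StringIO` object, since both iterate line by line.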
About the Harvard Dataverse
The patents from 1975-2010 loaded as .sqlite3 and csv files can be found at
I have also downloaded all of them onto the database server, where they can be found by
cd /bulk/patent