Patent Data
This page explains how to obtain the USPTO patent data and how to use our database, and it documents the database.
ER diagram
Downloading the files
The files (in xml format) for granted patent data can be obtained at granted patent
The files for patent application data can be obtained at patent applications
The files for maintenance fees data can be obtained at maintenance
Scripts are available to perform a bulk download of all the above files
These scripts can also be found under /bulk/Software/download\ scripts ("E:\Software\download scripts") on McNair RDP:
Script to download patent application data from 2001-2004
Script to download patent application data from 2005-2015
Script to download patent grant data from 1976-2000
Script to download patent grant data from 2001-2004
Script to download patent grant data from 2005-2015
To use the scripts, save them as shell scripts, then either run one with sh:
$ sh Applications_download_2001-2004.sh
or first make the script executable and then run it directly:
$ chmod a+x Applications_download_2001-2004.sh
$ ./Applications_download_2001-2004.sh
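Note that several hundred .zip files of roughly 100 MB each will be downloaded, so the process may take a while. When all the files are downloaded, unzip all of them. The pattern must be quoted so that unzip expands it itself; with an unquoted *.zip the shell passes every file name to unzip, which then treats all but the first as members to extract from the first archive:
$ unzip '*.zip'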
XML Schema Notes
Tags we are using:
- CPC Classification: https://en.wikipedia.org/wiki/Cooperative_Patent_Classification
Tags we aren't using:
- Kind codes: http://www.uspto.gov/learning-and-resources/support-centers/electronic-business-center/kind-codes-included-uspto-patent
- Series codes: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/filingyr.htm
Parsing and Processing the XML files
The ParserSpliter.pl script first splits a large Patent Data XML file into smaller XML files, one per patent, and then parses and processes each of them.
Some of the files are malformed and are moved to a ./failed_files directory. Adding a character anywhere in one of these files appears to make it parseable by the script; the cause is not known.
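The splitting step can be sketched in Python as follows. This is only an illustration, not the ParserSpliter.pl implementation: it assumes that every document in the concatenated file begins with its own `<?xml` declaration, and the function name `split_concatenated_xml` is hypothetical.

```python
import re
from typing import Iterator


def split_concatenated_xml(text: str) -> Iterator[str]:
    """Yield one XML document per patent from a concatenated bulk file.

    Assumption: every document starts with its own '<?xml' declaration.
    """
    # Split in front of each declaration; the lookahead keeps the
    # declaration attached to the document that follows it.
    for chunk in re.split(r"(?=<\?xml\s)", text):
        chunk = chunk.strip()
        if chunk:
            yield chunk


# Tiny stand-in for a bulk file holding two documents.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n<us-patent-grant id="a"/>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n<us-patent-grant id="b"/>\n'
)
docs = list(split_concatenated_xml(sample))
print(len(docs))
```

This version reads the whole file into memory, which may be too much for the largest weekly files; a line-by-line variant would follow the same idea.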
To use this script, you need the Perl modules XML::Simple and Try::Tiny installed; the install commands below also add Switch.
Open up CPAN shell:
$ perl -e shell -MCPAN
Install:
cpan[0]> install XML::Simple
cpan[1]> install Try::Tiny
cpan[2]> install Switch
Once the packages have been installed, use the script like the following example:
perl PatentParser.pl -file=ipa150319_small.xml
Other Resources
Documentation for the XML files
New Notes
The source files have transitioned from here:
- https://www.google.com/googlebooks/uspto-patents-grants-text.html (No longer maintained)
To:
- https://bulkdata.uspto.gov/ (includes 2016 data)
The historic data is the same on both sites.
Each file contains, in order, sorted by document ID:
- Design patents (we will discard)
- Plant patents (we will discard)
- Reissues (we probably want them)
- Utility patents (we want them)
The classifications in the XML file are:
- IPC - these are good and we just need the main classification
- CPC - as above
- USPC - a single number that is not split into class and subclass, so it is ambiguous: is 22431 224/31 or 22/431?
Scripts
All the scripts related to the patent Data are at:
\\father\bulk\Software\Scripts\Patent
USPTO_Parser.pl parses the USPTO website and downloads the concatenated XML files to:
\\father\bulk\PatentData
It should be run as follows:
USPTO_Parser.pl year1 year2
This gets the data from year1 to year2.
Splitter.pl splits those concatenated XML files into individual XML files, written to:
\\father\bulk\PatentData\Processed
Fields of Interest
We only care about Utility patents (and maybe Reissue patents too)
Utility patent grants fields
Patent
- patent number
- kind: http://www.uspto.gov/patents-application-process/patent-search/authority-files/uspto-kind-codes
- grantdate
For version 4.5:
<publication-reference>
  <document-id>
    <country>US</country>
    <doc-number>08925112</doc-number>
    <kind>B2</kind>
    <date>20150106</date>
  </document-id>
</publication-reference>
- type
- applicationnumber
- filingdate
<application-reference appl-type="utility">
  <document-id>
    <country>US</country>
    <doc-number>13824291</doc-number>
    <date>20110929</date>
  </document-id>
</application-reference>
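As a rough sketch of how these fields could be pulled out of a single-patent XML file, the Python below uses the standard library's ElementTree. It is not the project's Perl code. The function name `patent_header` is hypothetical, the wrapper elements in the sample exist only to make it well-formed, and reading "type" from the `appl-type` attribute is an assumption.

```python
import xml.etree.ElementTree as ET

# Sample built from the two snippets above; the outer wrapper elements
# are only there to make the sample a single well-formed document.
SAMPLE = """
<us-patent-grant>
  <us-bibliographic-data-grant>
    <publication-reference>
      <document-id>
        <country>US</country>
        <doc-number>08925112</doc-number>
        <kind>B2</kind>
        <date>20150106</date>
      </document-id>
    </publication-reference>
    <application-reference appl-type="utility">
      <document-id>
        <country>US</country>
        <doc-number>13824291</doc-number>
        <date>20110929</date>
      </document-id>
    </application-reference>
  </us-bibliographic-data-grant>
</us-patent-grant>
"""


def patent_header(root):
    """Return the Patent fields listed above as a plain dict."""
    pub = root.find(".//publication-reference/document-id")
    app_ref = root.find(".//application-reference")
    app = app_ref.find("document-id")
    return {
        "patentnumber": pub.findtext("doc-number"),
        "kind": pub.findtext("kind"),
        "grantdate": pub.findtext("date"),
        # Assumption: "type" is the appl-type attribute.
        "type": app_ref.get("appl-type"),
        "applicationnumber": app.findtext("doc-number"),
        "filingdate": app.findtext("date"),
    }


header = patent_header(ET.fromstring(SAMPLE))
print(header)
```

Dates are left as the raw YYYYMMDD strings here; converting them is a separate decision.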
For priority, if there is more than one claim, we want the one with sequence 01
- prioritydate
- prioritycountry (should use ISO country codes - may need a lookup table)
- prioritypatentnumber
- find 4.3 file with priority claim
<priority-claims>
  <priority-claim sequence="01" kind="national">
    <country>GB</country>
    <doc-number>1016384.8</doc-number>
    <date>20100930</date>
  </priority-claim>
</priority-claims>
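The "sequence 01" rule might look like the following in Python. This is a sketch under assumptions: the function name `first_priority` is hypothetical, the second claim in the sample is invented purely to exercise the selection, and falling back to the first claim when no sequence 01 exists is a choice made for this example, not something the notes above specify.

```python
import xml.etree.ElementTree as ET

# The sequence="02" claim is made up for illustration; only the
# sequence="01" claim comes from the example above.
SAMPLE = """
<priority-claims>
  <priority-claim sequence="02" kind="national">
    <country>XX</country>
    <doc-number>0000000.0</doc-number>
    <date>20101015</date>
  </priority-claim>
  <priority-claim sequence="01" kind="national">
    <country>GB</country>
    <doc-number>1016384.8</doc-number>
    <date>20100930</date>
  </priority-claim>
</priority-claims>
"""


def first_priority(root):
    """Return the sequence-01 priority claim, or None if there are no claims."""
    claims = list(root.iter("priority-claim"))
    if not claims:
        return None
    # Prefer sequence 01; otherwise (assumption) take the first claim listed.
    chosen = next((c for c in claims if c.get("sequence") == "01"), claims[0])
    return {
        "prioritydate": chosen.findtext("date"),
        "prioritycountry": chosen.findtext("country"),
        "prioritypatentnumber": chosen.findtext("doc-number"),
    }


priority = first_priority(ET.fromstring(SAMPLE))
print(priority)
```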
Classification IPC - we only need the first one: http://www.wipo.int/export/sites/www/classifications/ipc/en/guide/guide_ipc.pdf
- Section, Class, SubClass - Together these concord to US subclass: http://www.uspto.gov/web/patents/classification/international/ipc/ipc8/ipc_concordance/ipcsel.htm#a
- MainGroup, SubGroup
<classifications-ipcr>
  <classification-ipcr>
    <ipc-version-indicator>
      <date>20060101</date>
    </ipc-version-indicator>
    <classification-level>A</classification-level>
    <section>B</section>
    <class>64</class>
    <subclass>G</subclass>
    <main-group>6</main-group>
    <subgroup>00</subgroup>
    <symbol-position>F</symbol-position>
    <classification-value>I</classification-value>
    ...
  </classification-ipcr>
  ...
</classifications-ipcr>
Classification CPC - we only need the main one
- Section, Class, Subclass
- Main Group, Subgroup
- v 4.2, 4.3, 4.4 do not have this
<classifications-cpc>
  <main-cpc>
    <classification-cpc>
      <cpc-version-indicator>
        <date>20130101</date>
      </cpc-version-indicator>
      <section>B</section>
      <class>64</class>
      <subclass>D</subclass>
      <main-group>10</main-group>
      <subgroup>00</subgroup>
      <symbol-position>F</symbol-position>
      <classification-value>I</classification-value>
      ...
    </classification-cpc>
  </main-cpc>
</classifications-cpc>
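A possible way to read the first IPC entry and the main CPC entry is sketched below in Python. It assumes the section letter is carried in a `<section>` element next to `<class>`, and that a missing CPC block (as in v 4.2 to 4.4) should simply yield None. The helper names `read_classification` and `as_symbol` are hypothetical, and the sample is trimmed to the five fields of interest.

```python
import xml.etree.ElementTree as ET

# Trimmed sample; the wrapper element only makes it well-formed.
SAMPLE = """
<us-bibliographic-data-grant>
  <classifications-ipcr>
    <classification-ipcr>
      <section>B</section><class>64</class><subclass>G</subclass>
      <main-group>6</main-group><subgroup>00</subgroup>
    </classification-ipcr>
  </classifications-ipcr>
  <classifications-cpc>
    <main-cpc>
      <classification-cpc>
        <section>B</section><class>64</class><subclass>D</subclass>
        <main-group>10</main-group><subgroup>00</subgroup>
      </classification-cpc>
    </main-cpc>
  </classifications-cpc>
</us-bibliographic-data-grant>
"""

FIELDS = ("section", "class", "subclass", "main-group", "subgroup")


def read_classification(node):
    """Return the five classification fields, or None if the block is absent."""
    if node is None:
        return None
    return {name: (node.findtext(name) or "").strip() for name in FIELDS}


def as_symbol(p):
    """Format the fields in the usual display form, e.g. 'B64G 6/00'."""
    return f"{p['section']}{p['class']}{p['subclass']} {p['main-group']}/{p['subgroup']}"


root = ET.fromstring(SAMPLE)
# find() returns the first match, which is the "first IPC" we want.
ipc = read_classification(root.find(".//classifications-ipcr/classification-ipcr"))
cpc = read_classification(root.find(".//classifications-cpc/main-cpc/classification-cpc"))
print(as_symbol(ipc), "|", as_symbol(cpc))
```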
Classification National: Note that the one below comes out to 2/2.11 (http://www.google.com/patents/US8925112#classifications)
- Country
- Class
THIS IS NOT UNIQUE. What classifications are we searching for?
<classification-national> <country>US</country> <main-classification>2 211</main-classification> </classification-national>
Title of the patent:
<invention-title id="d2e61">Aircrew ensembles</invention-title>
Number of Claims:
<number-of-claims>12</number-of-claims>
Primary examiner:
- FirstName, LastName, Department
<examiners>
  <primary-examiner>
    <last-name>Patel</last-name>
    <first-name>Tejash</first-name>
    <department>3765</department>
  </primary-examiner>
  ...
</examiners>
PCT/Regional Patent Number:
- PCTNumber (just the doc number - if it starts with PCT set a flag)
- not in all v 4.5
- not in v 4.2, 4.3, 4.4
<pct-or-regional-filing-data>
  <document-id>
    <country>WO</country>
    <doc-number>PCT/EP2011/067014</doc-number>
    <kind>00</kind>
    <date>20110929</date>
  </document-id>
  ...
</pct-or-regional-filing-data>
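The PCT flag described above could be derived as in this short Python sketch. The function name `pct_info` and the output keys are hypothetical, and returning None when the element is absent is an assumption that matches the note that not every file has it.

```python
import xml.etree.ElementTree as ET

# Wrapper element added only so the sample is well-formed.
SAMPLE = """
<us-bibliographic-data-grant>
  <pct-or-regional-filing-data>
    <document-id>
      <country>WO</country>
      <doc-number>PCT/EP2011/067014</doc-number>
      <kind>00</kind>
      <date>20110929</date>
    </document-id>
  </pct-or-regional-filing-data>
</us-bibliographic-data-grant>
"""


def pct_info(root):
    """Return the PCT/regional doc number and a flag, or None if absent."""
    number = root.findtext(".//pct-or-regional-filing-data/document-id/doc-number")
    if number is None:
        return None
    number = number.strip()
    # Flag is set when the doc number starts with "PCT".
    return {"pctnumber": number, "is_pct": number.startswith("PCT")}


info = pct_info(ET.fromstring(SAMPLE))
print(info)
```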
Citations
Patent Citations (we need all of them):
- CitingPatentNumber (from the patent)
- CitingPatentCountry (from the patent)
<publication-reference>
  <document-id>
    <country>US</country>
    <doc-number>08925112</doc-number>
    <kind>B2</kind>
    <date>20150106</date>
  </document-id>
</publication-reference>
- CitedPatentNumber
- CitedPatentCountry
- V 4.2 does not have <us-references-cited>
<us-references-cited>
  <us-citation>
    <patcit num="00001">
      <document-id>
        <country>US</country>
        <doc-number>1105569</doc-number>
        <kind>A</kind>
        <name>Lacrotte</name>
        <date>19140700</date>
      </document-id>
    </patcit>
    <category>cited by examiner</category>
    <classification-national>
      <country>US</country>
      <main-classification>2 214</main-classification>
    </classification-national>
  </us-citation>
  ...
</us-references-cited>
For non-patent references, we are just going to count them:
- NoNonPatRefs
<us-references-cited>
  ...
  <us-citation>
    <nplcit num="00020">
      <othercit> European Search Report dated Jan. 20, 2011 as received in European Patent Application No. GB1016384.8. </othercit>
    </nplcit>
    <category>cited by applicant</category>
  </us-citation>
</us-references-cited>
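Both citation tasks, one row per cited patent and a count of non-patent references, can be sketched in Python as below. The function names `citation_rows` and `count_non_patent_refs` are hypothetical, and the sample is cut down to one citation of each kind. Files without `<us-references-cited>` simply produce no rows and a count of 0 in this sketch.

```python
import xml.etree.ElementTree as ET

# Cut-down sample; the root element only makes it well-formed.
SAMPLE = """
<us-bibliographic-data-grant>
  <publication-reference>
    <document-id>
      <country>US</country>
      <doc-number>08925112</doc-number>
    </document-id>
  </publication-reference>
  <us-references-cited>
    <us-citation>
      <patcit num="00001">
        <document-id>
          <country>US</country>
          <doc-number>1105569</doc-number>
        </document-id>
      </patcit>
      <category>cited by examiner</category>
    </us-citation>
    <us-citation>
      <nplcit num="00020">
        <othercit>European Search Report dated Jan. 20, 2011</othercit>
      </nplcit>
      <category>cited by applicant</category>
    </us-citation>
  </us-references-cited>
</us-bibliographic-data-grant>
"""


def citation_rows(root):
    """Return one dict per cited patent, keyed with the citing patent."""
    citing = root.find(".//publication-reference/document-id")
    rows = []
    for cited in root.iterfind(".//us-references-cited/us-citation/patcit/document-id"):
        rows.append({
            "citingpatentnumber": citing.findtext("doc-number"),
            "citingpatentcountry": citing.findtext("country"),
            "citedpatentnumber": cited.findtext("doc-number"),
            "citedpatentcountry": cited.findtext("country"),
        })
    return rows


def count_non_patent_refs(root):
    """Count <nplcit> entries; this is the NoNonPatRefs value."""
    return len(root.findall(".//us-references-cited/us-citation/nplcit"))


root = ET.fromstring(SAMPLE)
rows = citation_rows(root)
no_non_pat_refs = count_non_patent_refs(root)
print(rows, no_non_pat_refs)
```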
Inventors
- Not in v 4.2
- PatentNumber (and country) to build a key
- We need a "standard" name and address object for each inventor
<us-parties>
  <us-applicants>
    ...
  </us-applicants>
  <inventors>
    <inventor sequence="001" designation="us-only">
      <addressbook>
        <last-name>Oliver</last-name>
        <first-name>Paul</first-name>
        <address>
          <city>Rhyl</city>
          <country>GB</country>
        </address>
      </addressbook>
    </inventor>
    ...
  </inventors>
  ...
</us-parties>
Assignees
- PatentNumber (and country) to build a key
- We need a "standard" name and address object for each assignee
<assignees>
  <assignee>
    <addressbook>
      <orgname>Survitec Group Limited</orgname>
      <role>03</role>
      <address>
        <city>Merseyside</city>
        <country>GB</country>
      </address>
    </addressbook>
  </assignee>
</assignees>
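One way to picture the "standard" name and address object shared by inventors and assignees is the Python sketch below. The `Party` class, its field names, and the `parties` function are all hypothetical; they are not the Inventor.pm or Addressbook.pm objects. The patent country and number are copied onto each record so that they can serve as the key.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

# Sample built from the inventor and assignee snippets above.
SAMPLE = """
<us-bibliographic-data-grant>
  <publication-reference>
    <document-id><country>US</country><doc-number>08925112</doc-number></document-id>
  </publication-reference>
  <us-parties>
    <inventors>
      <inventor sequence="001" designation="us-only">
        <addressbook>
          <last-name>Oliver</last-name>
          <first-name>Paul</first-name>
          <address><city>Rhyl</city><country>GB</country></address>
        </addressbook>
      </inventor>
    </inventors>
  </us-parties>
  <assignees>
    <assignee>
      <addressbook>
        <orgname>Survitec Group Limited</orgname>
        <role>03</role>
        <address><city>Merseyside</city><country>GB</country></address>
      </addressbook>
    </assignee>
  </assignees>
</us-bibliographic-data-grant>
"""


@dataclass
class Party:
    """Hypothetical shared record for an inventor or an assignee."""
    patent_country: str
    patent_number: str
    party_type: str              # "inventor" or "assignee"
    first_name: Optional[str]
    last_name: Optional[str]
    orgname: Optional[str]
    city: Optional[str]
    country: Optional[str]


def parties(root) -> List[Party]:
    """Collect inventors and assignees as Party records."""
    pub = root.find(".//publication-reference/document-id")
    sources = (
        ("inventor", ".//inventors/inventor/addressbook"),
        ("assignee", ".//assignees/assignee/addressbook"),
    )
    found = []
    for party_type, path in sources:
        for book in root.iterfind(path):
            found.append(Party(
                patent_country=pub.findtext("country"),
                patent_number=pub.findtext("doc-number"),
                party_type=party_type,
                first_name=book.findtext("first-name"),
                last_name=book.findtext("last-name"),
                orgname=book.findtext("orgname"),
                city=book.findtext("address/city"),
                country=book.findtext("address/country"),
            ))
    return found


result = parties(ET.fromstring(SAMPLE))
for p in result:
    print(p)
```

Fields that a party does not carry, such as an organisation name for an individual inventor, come back as None.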
Other things we might want
- Abstract
- Claims (other than their count)
Things we don't need
General:
Classification related:
- Level - This appears to be either core or advanced. Not sure it matters.
- SymbolPosition, ClassificationValue - we likely don't need them
- Classification status and data source - no idea what these do
About the scripts
The scripts to process the Patent Data are all located under /bulk/Software/Scripts/PatentData/ ("E:\Software\Scripts\PatentData\")
There are currently 5 .pm files available: PatentApplication.pm, Inventor.pm, Claim.pm, Addressbook.pm, and Loader.pm.
Each of the first 4 represents an object type. The last one is a helper object that, given a schema file, extracts the wanted fields as a Perl object. Future work to support more schema files should be done in this file.
Example Usage:
perl PatentParser.pl -file=ipa150319.xml
This parses the XML file named ipa150319.xml and extracts each patent (in this case, each patent application) into a temporary XML file. It then uses a Loader object with a specified schema file, in this case "us-patent-application-v44-2014-04-03.dtd", to extract each of the 4 object types from the patents. If an error occurs while parsing a file, that file is moved to a directory called "failed_files". A file that fails parsing is most likely not a Utility patent.
About the Harvard Dataverse
The patents from 1975-2010 loaded as .sqlite3 and csv files can be found at
I have also downloaded all of them onto the database server, where they can be found with:
cd /bulk/patent