Difference between revisions of "Patent Data"
==Other Resources==
[http://www.uspto.gov/learning-and-resources/xml-resources Documentation for the XML files]
Revision as of 16:28, 17 March 2016
The Patent Data page explains how to obtain the USPTO patent data and how to use our database, and it documents the database.
ER diagram
See ER Diagram
Downloading the files
The files (in XML format) for granted patent data can be obtained at granted patent
The files for patent application data can be obtained at patent applications
The files for maintenance fees data can be obtained at maintenance
Scripts are available to perform a bulk download of all the above files:
Script to download patent application data from 2001-2004
Script to download patent application data from 2005-2015
Script to download patent grant data from 1976-2000
Script to download patent grant data from 2001-2004
Script to download patent grant data from 2005-2015
To use a script, save it as a shell script, then either run it with sh
$ sh Applications_download_2001-2004.sh
or first make the script executable and then execute it
$ chmod a+x Applications_download_2001-2004.sh
$ ./Applications_download_2001-2004.sh
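To run several of the download scripts back to back, a small wrapper can help. The following is a minimal sketch, not part of the scripts themselves: the function name run_downloads is hypothetical, and it assumes each script has been saved locally under a name of your choosing. It stops at the first script that is missing or exits with an error.

```shell
# Run each given download script in turn with sh; stop at the first failure.
# run_downloads is a hypothetical helper, not one of the published scripts.
run_downloads() {
    for script in "$@"; do
        if [ ! -f "$script" ]; then
            echo "missing script: $script" >&2
            return 1
        fi
        echo "running $script"
        sh "$script" || { echo "failed: $script" >&2; return 1; }
    done
}

# Example (the second file name is an assumed local name):
# run_downloads Applications_download_2001-2004.sh Applications_download_2005-2015.sh
```

Stopping early keeps a failed download from being hidden by the output of later scripts.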
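Note that several hundred .zip files of roughly 100 MB each will be downloaded, so the process may take a while. When all the files are downloaded, unzip all of them. The pattern is quoted so that unzip expands it itself; with an unquoted *.zip the shell expands the pattern, and unzip treats every name after the first as a member to extract from the first archive.
$ unzip '*.zip'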
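With several hundred archives it can be useful to extract them one at a time and see which ones fail. The following is a minimal sketch under that assumption; the function name unzip_all is hypothetical, and it relies only on the standard unzip options -o (overwrite without prompting), -q (quiet) and -d (target directory).

```shell
# Extract every .zip in a directory, one archive per unzip call.
# Archives that fail to extract are listed on stderr instead of aborting the run.
unzip_all() {
    dir=${1:-.}
    failed=0
    for archive in "$dir"/*.zip; do
        [ -e "$archive" ] || continue   # the glob matched nothing
        if ! unzip -o -q "$archive" -d "$dir" >/dev/null; then
            echo "could not extract: $archive" >&2
            failed=$((failed + 1))
        fi
    done
    echo "archives that failed: $failed"
    [ "$failed" -eq 0 ]
}

# Example: unzip_all .
```

The function returns a non-zero status if any archive failed, so it can be chained with later steps.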
XML Schema Notes
Tags we are using:
- CPC Classification: https://en.wikipedia.org/wiki/Cooperative_Patent_Classification
Tags we aren't using:
- Kind codes: http://www.uspto.gov/learning-and-resources/support-centers/electronic-business-center/kind-codes-included-uspto-patent
- Series codes: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/filingyr.htm
Parsing and Processing the XML files
The ParserSpliter.pl script first splits a large Patent Data XML file into smaller XML files, one per patent. It then parses and processes each of these smaller files.
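To illustrate the splitting step only, here is a minimal shell sketch. It is not the ParserSpliter.pl implementation: it assumes that every patent in the bulk file begins with its own `<?xml` declaration at the start of a line, and the function name split_patents and the patent_NNNNNN.xml output names are hypothetical.

```shell
# Split a concatenated bulk XML file into one file per patent.
# Assumption: each patent starts with an "<?xml" declaration at the start of a line.
split_patents() {
    input=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    awk -v dir="$outdir" '
        /^<\?xml/ {
            # A new declaration starts a new output file; close the previous one.
            if (out != "") close(out)
            n++
            out = sprintf("%s/patent_%06d.xml", dir, n)
        }
        out != "" { print > out }
    ' "$input"
}

# Example: split_patents ipa150319_small.xml ./split
```

Closing each output file before opening the next keeps the number of open files constant, which matters when one bulk file holds thousands of patents.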
Some of the files are malformed in a way we have not identified, and the script moves them to a ./failed_files directory. If you add a character anywhere in one of these files, the script can then process it.
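The workaround above can be tried in bulk. The following is a minimal sketch, with several assumptions: the added character is a single trailing newline, the failed files carry an .xml extension, and the function name prepare_retries and the ./retry_files directory are hypothetical. Whether a trailing newline is enough for the parser to accept a given file has not been verified here.

```shell
# Copy each file from the failed directory into a retry directory with one
# newline appended, leaving the originals untouched.
prepare_retries() {
    src=${1:-./failed_files}
    dest=${2:-./retry_files}
    mkdir -p "$dest" || return 1
    count=0
    for f in "$src"/*.xml; do
        [ -e "$f" ] || continue   # the glob matched nothing
        # Write the original content followed by a single newline character.
        { cat "$f"; printf '\n'; } > "$dest/$(basename "$f")" || return 1
        count=$((count + 1))
    done
    echo "prepared $count file(s) in $dest"
}

# Example: prepare_retries ./failed_files ./retry_files
```

Working on copies means the original failed files stay available if the cause of the failures is investigated later.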
In order to use this script, you will need to have XML::Simple and Try::Tiny installed.
Open up CPAN shell:
$ perl -e shell -MCPAN
Install:
cpan[0]> install XML::Simple
cpan[1]> install Try::Tiny
cpan[2]> install Switch
Once the packages have been installed, use the script like the following example:
$ perl PatentParser.pl -file=ipa150319_small.xml