@@ Line 1: / Line 1: @@
-This page gives instructions on how to get the USPTO patent data and how to use the database, and it documents the database.
+#REDIRECT [[Patent Data Wiki Page]]
-== ER diagram ==
-[[Image:Patent_Data.png|1024x768px]]
-== Downloading the files ==
-The files (in xml format) for granted patent data can be obtained at [https://www.google.com/googlebooks/uspto-patents-grants-text.html granted patent]
-The files for patent application data can be obtained at [https://www.google.com/googlebooks/uspto-patents-applications-text.html patent applications]
-The files for maintenance fees data can be obtained at [https://www.google.com/googlebooks/uspto-patents-maintenance-fees.html maintenance]
-Scripts are available to perform a bulk download of all the above files
-These scripts can also be found under /bulk/Software/download\ scripts ("E:\Software\download scripts") on McNair RDP:
-[http://www.edegan.com/wiki/index.php/Image:Applications_download_2001-2004.sh Script to download patent application data from 2001-2004]
-[http://www.edegan.com/wiki/index.php/Image:Applications_download_2005-2015.sh Script to download patent application data from 2005-2015]
-[http://www.edegan.com/wiki/index.php/Image:Grant_download_1976-2000.sh Script to download patent grant data from 1976-2000]
-[http://www.edegan.com/wiki/index.php/Image:Grant_download_2001-2004.sh Script to download patent grant data from 2001-2004]
-[http://www.edegan.com/wiki/index.php/Image:Grant_download_2005-2015.sh Script to download patent grant data from 2005-2015]
-To use the scripts, save them as shell scripts, then either run one with sh
- $ sh Applications_download_2001-2004.sh
-or first make the script executable and then run it
- $ chmod a+x Applications_download_2001-2004.sh
- $ ./Applications_download_2001-2004.sh
-Note that several hundred .zip files of roughly 100 MB each will be downloaded, so the process may take a while.
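-When all the files are downloaded, unzip all of them. The quotes make unzip, rather than the shell, expand the wildcard, so that every archive is processed:
- $ unzip '*.zip'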
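The same bulk extraction can be sketched in Python for machines where unzip is not available. This is a minimal sketch, not one of the existing scripts: the function name `unzip_all` and both directory arguments are hypothetical.

```python
import zipfile
from pathlib import Path


def unzip_all(source_dir, target_dir):
    """Extract every .zip archive in source_dir into target_dir.

    Returns the sorted list of archive file names that were extracted.
    """
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(source.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(archive.name)
    return extracted
```

Calling `unzip_all(".", "unzipped")` would extract every archive in the current directory into a hypothetical `unzipped` folder.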
-==XML Schema Notes==
-Tags we are using:
-*CPC Classification: https://en.wikipedia.org/wiki/Cooperative_Patent_Classification
-Tags we aren't using:
-*Kind codes: http://www.uspto.gov/learning-and-resources/support-centers/electronic-business-center/kind-codes-included-uspto-patent
-*Series codes: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/filingyr.htm
-== Parsing and Processing the XML files ==
-The ParserSpliter.pl script first splits a large Patent Data XML file into smaller XML files, one per patent, and then parses and processes each of those files.
-Some of the files are malformed and will be moved to a ./failed_files directory. If you add a character anywhere in these files, the script can then process them, for reasons we have not determined.
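The splitting step can be illustrated with a short Python sketch. It is not the logic of ParserSpliter.pl. It assumes that every patent document inside the concatenated file starts with its own `<?xml` declaration, and the function name `split_concatenated_xml` is hypothetical.

```python
def split_concatenated_xml(text):
    """Split a concatenated bulk file into one string per XML document.

    Assumption: each document begins with its own '<?xml' declaration.
    """
    marker = "<?xml"
    documents = []
    start = text.find(marker)
    while start != -1:
        # The next declaration (if any) marks the end of this document.
        end = text.find(marker, start + len(marker))
        chunk = text[start:] if end == -1 else text[start:end]
        documents.append(chunk.strip())
        start = end
    return documents
```

Each returned string can then be written to its own file or handed to an XML parser, with parse failures routed to a separate directory as described above.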
-In order to use this script, you will need to have XML::Simple, Try::Tiny, and Switch installed.
-Open up CPAN shell:
- $ perl -e shell -MCPAN
-Install:
- cpan[0]> install XML::Simple
- cpan[1]> install Try::Tiny
- cpan[2]> install Switch
-Once the packages have been installed, use the script like the following example:
- perl PatentParser.pl -file=ipa150319_small.xml
-==Other Resources==
-The Harvard Dataverse page: [http://www.edegan.com/wiki/index.php/Harvard_Dataverse]
-[http://www.uspto.gov/learning-and-resources/xml-resources Documentation for the XML files]
-[http://www.uspto.gov/learning-and-resources/xml-resources/xml-resources-retrospective See Also]
-[https://www.w3.org/2000/04/schema_hack/ tool to convert dtd to xsd]
-[https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/15705 Harvard Dataverse]
-==New Notes==
-The source files have transitioned from here:
-*https://www.google.com/googlebooks/uspto-patents-grants-text.html (No longer maintained)
-To:
-*https://bulkdata.uspto.gov/ (includes 2016 data)
-The historic data is the same on both sites.
-Each file contains, in order, sorted by document ID:
-#Design patents (we will discard)
-#Plant patents (we will discard)
-#Reissues (we probably want them)
-#Utility patents (we want them)
-The classifications in the XML file are:
-*IPC - these are good and we just need the main classification
-*CPC - as above
-*USPC - a single numeric string with no separator between class and subclass, so it is ambiguous: 22431 could be 224/31 or 22/431, etc.
-==Scripts==
-All the scripts related to the patent Data are at:
- \\father\bulk\Software\Scripts\Patent
-USPTO_Parser.pl will parse the USPTO website and download the concatenated XMLs to:
- \\father\bulk\PatentData
-It should be run as follows
- USPTO_Parser.pl year1 year2
-This gets the data from year1 to year2.
-Splitter.pl will split those concatenated XMLs into individual XMLs, stored in:
- \\father\bulk\PatentData\Processed
- Note: The ByYear (2010-2016) folders are for convenience (the XMLs inside them are post-processed to deal with genome sequences)
-xmlparser_4.5_4.4_4.3.pl is the script that processes the XMLs, given the path where they are stored. This script is located at
- \\father\bulk\PatentData\Processed
-It should be run as
- xmlparser_4.5_4.4_4.3.pl '\\father\bulk\PatentData\Processed\2010'
-This will process all the XMLs present in the 2010 directory and store them in the database.
-The database connection string is currently hard coded inside the script. The database name is patentDB (located in the postgres installation of the RDP server). We then pg_dump the database and pg_restore it on the dbase server.
-==Fields of Interest==
-We only care about Utility patents (and maybe Reissue patents too)
-===Utility patent grants fields===
-====Patent====
-*patent number
-*kind: http://www.uspto.gov/patents-application-process/patent-search/authority-files/uspto-kind-codes
-*grantdate
-For version 4.5:
- <publication-reference>
-  <document-id>
-   <country>US</country>
-   <doc-number>08925112</doc-number>
-   <kind>B2</kind>
-   <date>20150106</date>
-  </document-id>
- </publication-reference>
-*type
-*applicationnumber
-*filingdate
- <application-reference appl-type="utility">
-  <document-id>
-   <country>US</country>
-   <doc-number>13824291</doc-number>
-   <date>20110929</date>
-  </document-id>
- </application-reference>
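As an illustration of how the fields listed above map onto the two fragments, here is a minimal Python sketch using the standard library parser. It is not the existing Perl parser. The `<doc>` wrapper element and the function name `patent_fields` are hypothetical, and the dictionary keys simply reuse the field names from the list.

```python
import xml.etree.ElementTree as ET

# Hypothetical <doc> wrapper around the two fragments shown above.
SAMPLE = """<doc>
 <publication-reference>
  <document-id>
   <country>US</country>
   <doc-number>08925112</doc-number>
   <kind>B2</kind>
   <date>20150106</date>
  </document-id>
 </publication-reference>
 <application-reference appl-type="utility">
  <document-id>
   <country>US</country>
   <doc-number>13824291</doc-number>
   <date>20110929</date>
  </document-id>
 </application-reference>
</doc>"""


def patent_fields(root):
    """Collect the patent-level fields from one parsed grant document."""
    pub = root.find(".//publication-reference/document-id")
    app_ref = root.find(".//application-reference")
    app = app_ref.find("document-id")
    return {
        "patentnumber": pub.findtext("doc-number"),
        "country": pub.findtext("country"),
        "kind": pub.findtext("kind"),
        "grantdate": pub.findtext("date"),
        "type": app_ref.get("appl-type"),
        "applicationnumber": app.findtext("doc-number"),
        "filingdate": app.findtext("date"),
    }


fields = patent_fields(ET.fromstring(SAMPLE))
```

Dates are kept as the raw YYYYMMDD strings here. Converting them to a date type is left to the loading step.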
-For priority, if there is more than 1, we want sequence 01
-*prioritydate
-*prioritycountry (should use ISO country codes - may need a lookup table)
-*prioritypatentnumber
-*'''find 4.3 file with priority claim'''
- <priority-claims>
-  <priority-claim sequence="01" kind="national">
-   <country>GB</country>
-   <doc-number>1016384.8</doc-number>
-   <date>20100930</date>
-  </priority-claim>
- </priority-claims>
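Picking the sequence 01 claim when several are present could look like the following Python sketch. The second claim in the sample is invented for illustration, and the function name `first_priority_claim` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The sequence="02" claim below is made up to show the selection.
SAMPLE = """<priority-claims>
 <priority-claim sequence="02" kind="national">
  <country>FR</country>
  <doc-number>0000000</doc-number>
  <date>20101015</date>
 </priority-claim>
 <priority-claim sequence="01" kind="national">
  <country>GB</country>
  <doc-number>1016384.8</doc-number>
  <date>20100930</date>
 </priority-claim>
</priority-claims>"""


def first_priority_claim(root):
    """Return the claim with sequence="01" as a dict, or None if absent."""
    for claim in root.iter("priority-claim"):
        if claim.get("sequence") == "01":
            return {
                "prioritydate": claim.findtext("date"),
                "prioritycountry": claim.findtext("country"),
                "prioritypatentnumber": claim.findtext("doc-number"),
            }
    return None


priority = first_priority_claim(ET.fromstring(SAMPLE))
```

Matching on the `sequence` attribute rather than on document order means the result does not depend on how the claims happen to be sorted in the file.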
-Classification IPC - we only need the first one: http://www.wipo.int/export/sites/www/classifications/ipc/en/guide/guide_ipc.pdf
-*Section, Class, SubClass - Together these concord to US subclass: http://www.uspto.gov/web/patents/classification/international/ipc/ipc8/ipc_concordance/ipcsel.htm#a
-*MainGroup, SubGroup
- <classifications-ipcr>
-  <classification-ipcr>
-   <ipc-version-indicator>
-    <date>20060101</date>
-   </ipc-version-indicator>
-   <classification-level>A</classification-level>
-   <section>B</section>
-   <class>64</class>
-   <subclass>G</subclass>
-   <main-group>6</main-group>
-   <subgroup>00</subgroup>
-   <symbol-position>F</symbol-position>
-   <classification-value>I</classification-value>
- ...
-  </classification-ipcr>
- ...
- </classifications-ipcr>
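A minimal Python sketch of taking only the first IPC classification and joining its parts into the usual printed symbol. The function name `first_ipc` and the `symbol` key are hypothetical, and the sample omits the fields we do not need.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<classifications-ipcr>
 <classification-ipcr>
  <section>B</section>
  <class>64</class>
  <subclass>G</subclass>
  <main-group>6</main-group>
  <subgroup>00</subgroup>
 </classification-ipcr>
 <classification-ipcr>
  <section>A</section>
  <class>41</class>
  <subclass>D</subclass>
  <main-group>13</main-group>
  <subgroup>00</subgroup>
 </classification-ipcr>
</classifications-ipcr>"""

IPC_PARTS = ("section", "class", "subclass", "main-group", "subgroup")


def first_ipc(root):
    """Return the parts of the first IPC classification, or None if absent."""
    ipc = next(root.iter("classification-ipcr"), None)
    if ipc is None:
        return None
    parts = {name: ipc.findtext(name) for name in IPC_PARTS}
    # Printed form: section, class, subclass, then main group / subgroup.
    parts["symbol"] = "{}{}{} {}/{}".format(*(parts[n] for n in IPC_PARTS))
    return parts


ipc = first_ipc(ET.fromstring(SAMPLE))
```

For the sample above the symbol is `B64G 6/00`. The second classification in the sample is invented to show that only the first one is used.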
-Classification CPC - we only need the main one
-CPC is a classification scheme set up by the USPTO and the European Patent Office (EPO). The first classification codes rolled out on November 9, 2012.[http://www.cooperativepatentclassification.org/cpcSchemeAndDefinitions.html] Full implementation of the CPC classification system occurred in January 2015, at the same time as version 4.5 of the USPTO patent bulk data.[http://www.uspto.gov/sites/default/files/about/advisory/ppac/120927-09a-international_cpc.pdf]
-*Section, Class, Subclass
-*Main Group, Subgroup
-*'''v 4.2, 4.3, 4.4 do not have this'''
- <classifications-cpc>
-  <main-cpc>
-   <classification-cpc>
-     <cpc-version-indicator>
-       <date>20130101</date>
-     </cpc-version-indicator>
-     <section>B</section>
-     <class>64</class>
-     <subclass>D</subclass>
-     <main-group>10</main-group>
-     <subgroup>00</subgroup>
-     <symbol-position>F</symbol-position>
-     <classification-value>I</classification-value>
-  ...
-    </classification-cpc>
-   </main-cpc>
- </classifications-cpc>
-Classification National: Note that the one below comes out to 2/2.11 (http://www.google.com/patents/US8925112#classifications)
-*Country
-*Class
-'''THIS IS NOT UNIQUE. What classifications are we searching for?'''
- <classification-national>
-  <country>US</country>
-   <main-classification>2 211</main-classification>
- </classification-national>
-Title of the patent:
- <invention-title id="d2e61">Aircrew ensembles</invention-title>
-Number of Claims:
- <number-of-claims>12</number-of-claims>
-Primary examiner:
-*FirstName, LastName, Department
- <examiners>
-  <primary-examiner>
-   <last-name>Patel</last-name>
-   <first-name>Tejash</first-name>
-   <department>3765</department>
-  </primary-examiner>
- ...
- </examiners>
-PCT/Regional Patent Number:
-*PCTNumber (just the doc number - if it starts with PCT set a flag)
-*'''not in all v 4.5'''
-*'''not in v 4.2, 4.3, 4.4'''
-*'''maybe not all patents are filed under PCT, need to use code to search all files for key word'''
- <pct-or-regional-filing-data>
-  <document-id>
-   <country>WO</country>
-   <doc-number>PCT/EP2011/067014</doc-number>
-   <kind>00</kind>
-   <date>20110929</date>
-  </document-id>
- ...
- </pct-or-regional-filing-data>
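Because this element is not present in every file, any extraction has to tolerate its absence. The Python sketch below returns the doc number with a flag when the element exists and `None` otherwise. The function name `pct_filing`, the `<doc>` wrapper, and the key names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical <doc> wrapper around the fragment shown above.
SAMPLE = """<doc>
 <pct-or-regional-filing-data>
  <document-id>
   <country>WO</country>
   <doc-number>PCT/EP2011/067014</doc-number>
   <kind>00</kind>
   <date>20110929</date>
  </document-id>
 </pct-or-regional-filing-data>
</doc>"""


def pct_filing(root):
    """Return the PCT/regional doc number and a PCT flag, or None."""
    doc = root.find(".//pct-or-regional-filing-data/document-id")
    if doc is None:
        return None
    number = doc.findtext("doc-number", default="")
    # The flag is set when the doc number starts with "PCT".
    return {"pctnumber": number, "ispct": number.startswith("PCT")}


pct = pct_filing(ET.fromstring(SAMPLE))
```

Running the same function over every file and counting the `None` results is one way to check how many patents lack PCT data.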
-====Citations====
-Patent Citations (we need all of them):
-*CitingPatentNumber (from the patent)
-*CitingPatentCountry (from the patent)
- <publication-reference>
-  <document-id>
-   <country>US</country>
-   <doc-number>08925112</doc-number>
-   <kind>B2</kind>
-   <date>20150106</date>
-  </document-id>
- </publication-reference>
-*CitedPatentNumber
-*CitedPatentCountry
-*'''V 4.2 does not have <us-references-cited>'''
- <us-references-cited>
-  <us-citation>
-   <patcit num="00001">
-    <document-id>
-     <country>US</country>
-     <doc-number>1105569</doc-number>
-     <kind>A</kind>
-     <name>Lacrotte</name>
-     <date>19140700</date>
-    </document-id>
-   </patcit>
-   <category>cited by examiner</category>
-   <classification-national>
-    <country>US</country>
-    <main-classification>2 214</main-classification>
-   </classification-national>
-  </us-citation>
- ...
- </us-references-cited>
-For non-patent references, we are just going to count them:
-*NoNonPatRefs
- <us-references-cited>
- ...
-  <us-citation>
-   <nplcit num="00020">
-    <othercit>
-     European Search Report dated Jan. 20, 2011 as received in European Patent Application No. GB1016384.8.
-    </othercit>
-   </nplcit>
-   <category>cited by applicant</category>
-  </us-citation>
- </us-references-cited>
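Collecting every cited patent while only counting the non-patent references might be sketched in Python as follows. This applies to files that contain `<us-references-cited>`. The function name `citations` is hypothetical, and the sample text of the non-patent citation is shortened.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<us-references-cited>
 <us-citation>
  <patcit num="00001">
   <document-id>
    <country>US</country>
    <doc-number>1105569</doc-number>
    <kind>A</kind>
    <name>Lacrotte</name>
    <date>19140700</date>
   </document-id>
  </patcit>
  <category>cited by examiner</category>
 </us-citation>
 <us-citation>
  <nplcit num="00020">
   <othercit>European Search Report dated Jan. 20, 2011.</othercit>
  </nplcit>
  <category>cited by applicant</category>
 </us-citation>
</us-references-cited>"""


def citations(root):
    """Return ([(country, doc_number), ...], non_patent_reference_count)."""
    cited = []
    non_patent_count = 0
    for citation in root.iter("us-citation"):
        doc = citation.find("patcit/document-id")
        if doc is not None:
            cited.append((doc.findtext("country"), doc.findtext("doc-number")))
        elif citation.find("nplcit") is not None:
            non_patent_count += 1
    return cited, non_patent_count


cited, no_non_pat_refs = citations(ET.fromstring(SAMPLE))
```

The citing patent number and country come from `<publication-reference>` and would be paired with each cited tuple when the rows are written to the database.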
-====Inventors====
-*'''For v 4.3, 4.4, 4.5'''
-*PatentNumber (and country) to build a key
-*We need a "standard" name and address object for each inventor
- <us-parties>
-  <us-applicants>
- ...
-  </us-applicants>
-  <inventors>
-   <inventor sequence="001" designation="us-only">
-    <addressbook>
-     <last-name>Oliver</last-name>
-     <first-name>Paul</first-name>
-     <address>
-      <city>Rhyl</city>
-      <country>GB</country>
-     </address>
-    </addressbook>
-   </inventor>
- ...
-  </inventors>
- ...
- </us-parties>
-*'''For v 4.2'''
- <parties>
-  <applicants>
-   <applicant sequence="001" app-type="applicant-inventor" designation="us-only">
-    <addressbook>
-     <last-name>Kamath</last-name>
-     <first-name>Sandeep</first-name>
-     <address>
-      <city>Bangalore</city>
-      <country>IN</country>
-     </address>
-    </addressbook>
-    <nationality>
-     <country>omitted</country>
-    </nationality>
-    <residence>
-     <country>IN</country>
-    </residence>
-   </applicant>
-  ...
-  </applicants>
-  ...
- </parties>
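One way to get a "standard" inventor object from both layouts is sketched below in Python. It looks for `<inventor>` elements first and falls back to `<applicant>` elements. Filtering the fallback on `app-type="applicant-inventor"` is an assumption based on the v 4.2 sample above, and the function name `inventors` and the key names are hypothetical.

```python
import xml.etree.ElementTree as ET

V45 = """<us-parties><inventors>
 <inventor sequence="001" designation="us-only"><addressbook>
  <last-name>Oliver</last-name><first-name>Paul</first-name>
  <address><city>Rhyl</city><country>GB</country></address>
 </addressbook></inventor>
</inventors></us-parties>"""

V42 = """<parties><applicants>
 <applicant sequence="001" app-type="applicant-inventor" designation="us-only">
  <addressbook>
   <last-name>Kamath</last-name><first-name>Sandeep</first-name>
   <address><city>Bangalore</city><country>IN</country></address>
  </addressbook>
 </applicant>
</applicants></parties>"""


def inventors(root):
    """Return one dict per inventor, whichever layout the file uses."""
    people = root.findall(".//inventors/inventor")
    if not people:
        # Assumed v 4.2 layout: inventors are applicants of this app-type.
        people = [a for a in root.findall(".//applicants/applicant")
                  if a.get("app-type") == "applicant-inventor"]
    result = []
    for person in people:
        book = person.find("addressbook")
        result.append({
            "sequence": person.get("sequence"),
            "firstname": book.findtext("first-name"),
            "lastname": book.findtext("last-name"),
            "city": book.findtext("address/city"),
            "country": book.findtext("address/country"),
        })
    return result


new_style = inventors(ET.fromstring(V45))
old_style = inventors(ET.fromstring(V42))
```

Both calls return the same dictionary shape, so the database loader does not need to know which schema version a file used.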
-====Assignees====
-*PatentNumber (and country) to build a key
-*We need a "standard" name and address object for each assignee
- <assignees>
-  <assignee>
-   <addressbook>
-    <orgname>Survitec Group Limited</orgname>
-    <role>03</role>
-    <address>
-     <city>Merseyside</city>
-     <country>GB</country>
-    </address>
-   </addressbook>
-  </assignee>
- </assignees>
-====Other things we might want====
-*Abstract
-*Claims (other than their count)
-====Things we don't need====
-General:
-*Series Code: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/filingyr.htm
-Classification related:
-*Level - This appears to be either core or advanced. Not sure it matters.
-*SymbolPosition, ClassificationValue - we likely don't need them
-*Classification status and data source - no idea what these do
-====About the scripts====
-The scripts to process the Patent Data are all located under /bulk/Software/Scripts/PatentData/ ("E:\Software\Scripts\PatentData\")
-There are currently 5 .pm files available: PatentApplication.pm, Inventor.pm, Claim.pm, Addressbook.pm, and Loader.pm.
-Each of the first 4 represents an object type. The last one is a helper object that, given a schema file, extracts the wanted fields as a Perl object.
-Future work to support more schema files should be done in Loader.pm.
-Example Usage:
- perl PatentParser.pl -file=ipa150319.xml
-This will parse the XML file named ipa150319.xml, extract all the Patents (in this case PatentApplications) each as a temporary XML file, and then use a Loader object with a specified schema file, in this case "us-patent-application-v44-2014-04-03.dtd", to extract each of the 4 object types from the Patents.
-If an error occurs while parsing a file, that file will be moved to a directory called "failed_files". A file that fails parsing is most likely not a Utility patent.
-====About the Harvard Dataverse====
-The patents from 1975-2010, loaded as .sqlite3 and csv files, can be found at
-[https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/15705 Harvard Dataverse]
-I have also downloaded all of them onto the database server, where they can be found with
- cd /bulk/patent

Difference between revisions of "Patent Data"

Revision as of 14:42, 8 July 2016
