Difference between revisions of "Patent Data Extraction Scripts (Tool)"
Revision as of 17:10, 7 June 2016
Utility patent grants fields
Patent
- patent number
- kind: http://www.uspto.gov/patents-application-process/patent-search/authority-files/uspto-kind-codes
- grantdate
For version 4.5:
<publication-reference>
  <document-id>
    <country>US</country>
    <doc-number>08925112</doc-number>
    <kind>B2</kind>
    <date>20150106</date>
  </document-id>
</publication-reference>
- type
- applicationnumber
- filingdate
<application-reference appl-type="utility">
  <document-id>
    <country>US</country>
    <doc-number>13824291</doc-number>
    <date>20110929</date>
  </document-id>
</application-reference>
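The two reference blocks above share the same <document-id> layout, so one reader can serve both. The following is a minimal Python sketch, not the wiki's Perl implementation; the function name parse_document_id and the output keys are hypothetical, and it assumes each block is available as a standalone XML fragment.

```python
import xml.etree.ElementTree as ET

PUBLICATION = """<publication-reference>
  <document-id>
    <country>US</country>
    <doc-number>08925112</doc-number>
    <kind>B2</kind>
    <date>20150106</date>
  </document-id>
</publication-reference>"""

APPLICATION = """<application-reference appl-type="utility">
  <document-id>
    <country>US</country>
    <doc-number>13824291</doc-number>
    <date>20110929</date>
  </document-id>
</application-reference>"""


def parse_document_id(xml_text):
    """Read the <document-id> child of a reference element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    doc = root.find("document-id")
    return {
        "type": root.get("appl-type"),        # only set on <application-reference>
        "country": doc.findtext("country"),
        "number": doc.findtext("doc-number"),  # kept as text so leading zeros survive
        "kind": doc.findtext("kind"),          # absent on the application reference
        "date": doc.findtext("date"),          # YYYYMMDD string, left unparsed
    }


grant = parse_document_id(PUBLICATION)
application = parse_document_id(APPLICATION)
```

Here grant supplies the patent number, kind, and grantdate, while application supplies type, applicationnumber, and filingdate.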
For priority, if there is more than 1, we want sequence 01
- prioritydate
- prioritycountry (should use ISO country codes - may need a lookup table)
- prioritypatentnumber
- find 4.3 file with priority claim
<priority-claims>
  <priority-claim sequence="01" kind="national">
    <country>GB</country>
    <doc-number>1016384.8</doc-number>
    <date>20100930</date>
  </priority-claim>
</priority-claims>
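Selecting the sequence 01 claim could look like the Python sketch below. It is an illustration only: the function name first_priority_claim is hypothetical, and falling back to the first listed claim when no sequence="01" exists is an assumption, not something the notes above specify.

```python
import xml.etree.ElementTree as ET

PRIORITY = """<priority-claims>
  <priority-claim sequence="01" kind="national">
    <country>GB</country>
    <doc-number>1016384.8</doc-number>
    <date>20100930</date>
  </priority-claim>
</priority-claims>"""


def first_priority_claim(xml_text):
    """Return the sequence 01 priority claim, or None if there are no claims."""
    claims = ET.fromstring(xml_text).findall("priority-claim")
    if not claims:
        return None
    # Assumption: if no claim is numbered "01", use the first one listed.
    chosen = next((c for c in claims if c.get("sequence") == "01"), claims[0])
    return {
        "prioritydate": chosen.findtext("date"),
        "prioritycountry": chosen.findtext("country"),
        "prioritypatentnumber": chosen.findtext("doc-number"),
    }


priority = first_priority_claim(PRIORITY)
```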
Classification IPC - we only need the first one: http://www.wipo.int/export/sites/www/classifications/ipc/en/guide/guide_ipc.pdf
- Section, Class, SubClass - Together these concord to US subclass: http://www.uspto.gov/web/patents/classification/international/ipc/ipc8/ipc_concordance/ipcsel.htm#a
- MainGroup, SubGroup
<classifications-ipcr>
  <classification-ipcr>
    <ipc-version-indicator>
      <date>20060101</date>
    </ipc-version-indicator>
    <classification-level>A</classification-level>
    <section>B</section>
    <class>64</class>
    <subclass>G</subclass>
    <main-group>6</main-group>
    <subgroup>00</subgroup>
    <symbol-position>F</symbol-position>
    <classification-value>I</classification-value>
    ...
  </classification-ipcr>
  ...
</classifications-ipcr>
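A minimal Python sketch of taking only the first IPC entry follows. It assumes the section letter is wrapped in a <section> element, and the function name first_ipc and the combined "symbol" string are hypothetical conveniences rather than fields the scripts are known to produce.

```python
import xml.etree.ElementTree as ET

IPC = """<classifications-ipcr>
  <classification-ipcr>
    <ipc-version-indicator><date>20060101</date></ipc-version-indicator>
    <classification-level>A</classification-level>
    <section>B</section>
    <class>64</class>
    <subclass>G</subclass>
    <main-group>6</main-group>
    <subgroup>00</subgroup>
    <symbol-position>F</symbol-position>
    <classification-value>I</classification-value>
  </classification-ipcr>
</classifications-ipcr>"""

IPC_FIELDS = ("section", "class", "subclass", "main-group", "subgroup")


def first_ipc(xml_text):
    """Return the fields of the first <classification-ipcr>, or None if absent."""
    first = ET.fromstring(xml_text).find("classification-ipcr")  # find() = first match
    if first is None:
        return None
    f = {name: (first.findtext(name) or "").strip() for name in IPC_FIELDS}
    # Conventional display form: section + class + subclass, then group/subgroup.
    f["symbol"] = f"{f['section']}{f['class']}{f['subclass']} {f['main-group']}/{f['subgroup']}"
    return f


ipc = first_ipc(IPC)
```

The first three fields (section, class, subclass) are the part that concords to the US subclass.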
Classification CPC - we only need the main one
CPC (Cooperative Patent Classification) is a classification scheme set up by the USPTO and the European Patent Office (EPO). The first classification codes rolled out on November 9, 2012.[1] Full implementation of the CPC classification system occurred in January 2015, at the same time as version 4.5 of the USPTO patent bulk data.[2]
- Section, Class, Subclass
- Main Group, Subgroup
- v 4.2, 4.3, 4.4 do not have this
<classifications-cpc>
  <main-cpc>
    <classification-cpc>
      <cpc-version-indicator>
        <date>20130101</date>
      </cpc-version-indicator>
      <section>B</section>
      <class>64</class>
      <subclass>D</subclass>
      <main-group>10</main-group>
      <subgroup>00</subgroup>
      <symbol-position>F</symbol-position>
      <classification-value>I</classification-value>
      ...
    </classification-cpc>
  </main-cpc>
</classifications-cpc>
Classification National: Note that the one below comes out to 2/2.11 (http://www.google.com/patents/US8925112#classifications)
- Country
- Class
THIS IS NOT UNIQUE. What classifications are we searching for?
<classification-national>
  <country>US</country>
  <main-classification>2 211</main-classification>
</classification-national>
Title of the patent:
<invention-title id="d2e61">Aircrew ensembles</invention-title>
Number of Claims:
<number-of-claims>12</number-of-claims>
Primary examiner:
- FirstName, LastName, Department
<examiners>
  <primary-examiner>
    <last-name>Patel</last-name>
    <first-name>Tejash</first-name>
    <department>3765</department>
  </primary-examiner>
  ...
</examiners>
PCT/Regional Patent Number:
- PCTNumber (just the doc number - if it starts with PCT set a flag)
- not present in every v 4.5 file
- not in v 4.2, 4.3, 4.4
- not all patents may be filed under PCT, so we need code that searches all files for the keyword
<pct-or-regional-filing-data>
  <document-id>
    <country>WO</country>
    <doc-number>PCT/EP2011/067014</doc-number>
    <kind>00</kind>
    <date>20110929</date>
  </document-id>
  ...
</pct-or-regional-filing-data>
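The "starts with PCT" flag could be sketched in Python as below. The function name parse_pct and the output keys are hypothetical; returning None when the element carries no doc number is an assumption made so that files without PCT data do not raise an error.

```python
import xml.etree.ElementTree as ET

PCT = """<pct-or-regional-filing-data>
  <document-id>
    <country>WO</country>
    <doc-number>PCT/EP2011/067014</doc-number>
    <kind>00</kind>
    <date>20110929</date>
  </document-id>
</pct-or-regional-filing-data>"""


def parse_pct(xml_text):
    """Return the PCT/regional doc number plus a flag, or None if it is missing."""
    number = ET.fromstring(xml_text).findtext("document-id/doc-number")
    if number is None:
        return None
    number = number.strip()
    return {
        "pctnumber": number,
        # Flag set when the doc number itself begins with "PCT".
        "is_pct": number.upper().startswith("PCT"),
    }


pct = parse_pct(PCT)
```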
Citations
Patent Citations (we need all of them):
- CitingPatentNumber (from the patent)
- CitingPatentCountry (from the patent)
<publication-reference>
  <document-id>
    <country>US</country>
    <doc-number>08925112</doc-number>
    <kind>B2</kind>
    <date>20150106</date>
  </document-id>
</publication-reference>
- CitedPatentNumber
- CitedPatentCountry
- V 4.2 does not have <us-references-cited>
<us-references-cited>
  <us-citation>
    <patcit num="00001">
      <document-id>
        <country>US</country>
        <doc-number>1105569</doc-number>
        <kind>A</kind>
        <name>Lacrotte</name>
        <date>19140700</date>
      </document-id>
    </patcit>
    <category>cited by examiner</category>
    <classification-national>
      <country>US</country>
      <main-classification>2 214</main-classification>
    </classification-national>
  </us-citation>
  ...
</us-references-cited>
For non-patent references, we are just going to count them:
- NoNonPatRefs
<us-references-cited>
  ...
  <us-citation>
    <nplcit num="00020">
      <othercit> European Search Report dated Jan. 20, 2011 as received in European Patent Application No. GB1016384.8. </othercit>
    </nplcit>
    <category>cited by applicant</category>
  </us-citation>
</us-references-cited>
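Collecting every cited patent while only counting the non-patent references could be sketched as follows. This is a Python illustration under the assumption that <us-references-cited> is present (so v 4.3 and later); the function name parse_citations and the tuple layout are hypothetical.

```python
import xml.etree.ElementTree as ET

REFERENCES = """<us-references-cited>
  <us-citation>
    <patcit num="00001">
      <document-id>
        <country>US</country>
        <doc-number>1105569</doc-number>
        <kind>A</kind>
        <name>Lacrotte</name>
        <date>19140700</date>
      </document-id>
    </patcit>
    <category>cited by examiner</category>
  </us-citation>
  <us-citation>
    <nplcit num="00020">
      <othercit>European Search Report dated Jan. 20, 2011 as received in
      European Patent Application No. GB1016384.8.</othercit>
    </nplcit>
    <category>cited by applicant</category>
  </us-citation>
</us-references-cited>"""


def parse_citations(xml_text):
    """Return ([(CitedPatentCountry, CitedPatentNumber), ...], NoNonPatRefs)."""
    root = ET.fromstring(xml_text)
    cited = [
        (doc.findtext("country"), doc.findtext("doc-number"))
        for doc in root.findall("us-citation/patcit/document-id")
    ]
    # Non-patent references are only counted, never stored.
    non_patent_count = len(root.findall("us-citation/nplcit"))
    return cited, non_patent_count


cited_patents, no_non_pat_refs = parse_citations(REFERENCES)
```

Each cited pair would then be joined with the citing patent's number and country from the <publication-reference> block to form one citation row.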
Inventors
- For v 4.3, 4.4, 4.5
- PatentNumber (and country) to build a key
- We need a "standard" name and address object for each inventor
<us-parties>
  <us-applicants>
    ...
  </us-applicants>
  <inventors>
    <inventor sequence="001" designation="us-only">
      <addressbook>
        <last-name>Oliver</last-name>
        <first-name>Paul</first-name>
        <address>
          <city>Rhyl</city>
          <country>GB</country>
        </address>
      </addressbook>
    </inventor>
    ...
  </inventors>
  ...
</us-parties>
- For v 4.2
<parties>
  <applicants>
    <applicant sequence="001" app-type="applicant-inventor" designation="us-only">
      <addressbook>
        <last-name>Kamath</last-name>
        <first-name>Sandeep</first-name>
        <address>
          <city>Bangalore</city>
          <country>IN</country>
        </address>
      </addressbook>
      <nationality>
        <country>omitted</country>
      </nationality>
      <residence>
        <country>IN</country>
      </residence>
    </applicant>
    ...
  </applicants>
  ...
</parties>
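One way to get a "standard" name and address object from both layouts is sketched below in Python. It is an assumption-laden illustration, not the Inventor.pm or Addressbook.pm logic: the Party class, the parse_inventors name, and the rule "try <inventors> first, then fall back to applicants marked applicant-inventor" are all hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

V45 = """<us-parties><inventors>
  <inventor sequence="001" designation="us-only"><addressbook>
    <last-name>Oliver</last-name><first-name>Paul</first-name>
    <address><city>Rhyl</city><country>GB</country></address>
  </addressbook></inventor>
</inventors></us-parties>"""

V42 = """<parties><applicants>
  <applicant sequence="001" app-type="applicant-inventor" designation="us-only">
    <addressbook>
      <last-name>Kamath</last-name><first-name>Sandeep</first-name>
      <address><city>Bangalore</city><country>IN</country></address>
    </addressbook>
    <residence><country>IN</country></residence>
  </applicant>
</applicants></parties>"""


@dataclass
class Party:
    """Hypothetical 'standard' name and address record."""
    last_name: Optional[str]
    first_name: Optional[str]
    city: Optional[str]
    country: Optional[str]


def _party_from(node) -> Party:
    book = node.find("addressbook")
    return Party(
        last_name=book.findtext("last-name"),
        first_name=book.findtext("first-name"),
        city=book.findtext("address/city"),
        country=book.findtext("address/country"),
    )


def parse_inventors(xml_text) -> List[Party]:
    root = ET.fromstring(xml_text)
    nodes = root.findall(".//inventors/inventor")  # v 4.3, 4.4, 4.5 layout
    if not nodes:
        # v 4.2 layout: inventors are applicants flagged as applicant-inventor.
        nodes = root.findall(".//applicants/applicant[@app-type='applicant-inventor']")
    return [_party_from(n) for n in nodes]


new_style = parse_inventors(V45)
old_style = parse_inventors(V42)
```

Both calls return the same record type, so downstream code does not need to know which schema version a patent came from.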
Assignees
- PatentNumber (and country) to build a key
- We need a "standard" name and address object for each assignee
<assignees>
  <assignee>
    <addressbook>
      <orgname>Survitec Group Limited</orgname>
      <role>03</role>
      <address>
        <city>Merseyside</city>
        <country>GB</country>
      </address>
    </addressbook>
  </assignee>
</assignees>
Other things we might want
- Abstract
- Claims (other than their count)
Things we don't need
General:
Classification related:
- Level - This appears to be either core or advanced. Not sure it matters.
- SymbolPosition, ClassificationValue - we likely don't need them
- Classification status and data source - no idea what these do
About the scripts
The scripts to process the Patent Data are all located under /bulk/Software/Scripts/PatentData/ ("E:\Software\Scripts\PatentData\")
There are currently 5 .pm files available: PatentApplication.pm, Inventor.pm, Claim.pm, Addressbook.pm, and Loader.pm.
Each of the first 4 represents an object type. The last one is a helper object that extracts the wanted fields as a Perl object given a schema file. Future work to support more schema files should be done in this file.
Example Usage:
perl PatentParser.pl -file=ipa150319.xml
This parses the XML file ipa150319.xml and extracts each patent (in this case, each patent application) into a temporary XML file. A Loader object with a specified schema file, in this case "us-patent-application-v44-2014-04-03.dtd", then extracts each of the 4 object types from those patents. If an error occurs while parsing a file, that file is moved to a directory called "failed_files". A file that fails parsing is most likely not a utility patent.
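The split-then-parse flow described above can be sketched in Python as follows. This is not PatentParser.pl: it assumes the bulk file is a concatenation of complete XML documents, each starting with its own <?xml declaration, and the function names split_documents and partition_parsed are hypothetical. The sketch keeps failed documents in a list where the real script moves them to "failed_files".

```python
import xml.etree.ElementTree as ET

# Two concatenated documents; the second is deliberately malformed.
BULK = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<us-patent-application><doc-number>1</doc-number></us-patent-application>\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<us-patent-application><doc-number>2</us-patent-application>\n"
)


def split_documents(bulk_text):
    """Split concatenated XML documents at each line starting with '<?xml'."""
    docs, current = [], []
    for line in bulk_text.splitlines(keepends=True):
        if line.startswith("<?xml") and current:
            docs.append("".join(current))
            current = []
        current.append(line)
    if current:
        docs.append("".join(current))
    return docs


def partition_parsed(docs):
    """Parse each document; return (parsed roots, texts that failed to parse)."""
    parsed, failed = [], []
    for text in docs:
        try:
            parsed.append(ET.fromstring(text.encode("utf-8")))
        except ET.ParseError:
            failed.append(text)  # the real script moves the file to failed_files
    return parsed, failed


documents = split_documents(BULK)
parsed, failed = partition_parsed(documents)
```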
About the Harvard Dataverse
The patents from 1975-2010 loaded as .sqlite3 and csv files can be found at
I have also downloaded all of them onto the database server, where they can be found with:
cd /bulk/patent