Latest revision as of 12:47, 21 September 2020
Patent Data Extraction Scripts (Tool) | |
---|---|
Project Information | |
Has title | Patent Data Extraction Scripts (Tool) |
Has owner | Marcela Interiano |
Has start date | |
Has deadline date | |
Has keywords | Tool |
Has project status | Subsume |
Subsumed by: | Patent Assignment Data Restructure |
Has sponsor | McNair Center |
Has project output | Tool |
Patent applications
Note that our application data appears to be ONLY utility patents, except for a few plant patents.
At the top level, in spec 4.0 (and presumably others) there are:
 <us-patent-application lang="EN" dtd-version="v4.0 2004-12-02" file="US20050000001A1-20050106.XML"
   status="PARALLEL-RUN" id="us-patent-application" country="US" date-produced="20041222" date-publ="20050106">
   <us-bibliographic-data-application lang="EN" country="US">
     ...
   </us-bibliographic-data-application>
   <abstract id="abstract">
   </abstract>
   <drawings id="DRAWINGS">
   </drawings>
   <description id="description">
     <?summary-of-invention description="Summary of Invention" end="lead"?>
     <?summary-of-invention description="Summary of Invention" end="tail"?>
     <?brief-description-of-drawings description="Brief Description of Drawings" end="lead"?>
     <?brief-description-of-drawings description="Brief Description of Drawings" end="tail"?>
     <?detailed-description description="Detailed Description" end="lead"?>
     <?detailed-description description="Detailed Description" end="tail"?>
   </description>
   <claims id="claims">
   </claims>
 </us-patent-application>
We are currently processing only:
 <us-bibliographic-data-application lang="EN" country="US">
   ...
 </us-bibliographic-data-application>
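As a rough illustration of processing only the bibliographic block, here is a minimal Python sketch (the project's own scripts are Perl; Python is used here only for illustration). The function name `extract_biblio` and the trimmed sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative application document (not a real record).
SAMPLE = """<us-patent-application lang="EN" dtd-version="v4.0 2004-12-02" country="US">
  <us-bibliographic-data-application lang="EN" country="US">
    <invention-title>Example title</invention-title>
  </us-bibliographic-data-application>
  <abstract id="abstract"/>
  <claims id="claims"/>
</us-patent-application>"""


def extract_biblio(xml_text):
    """Return the bibliographic element of one application, or None if absent."""
    root = ET.fromstring(xml_text)
    return root.find("us-bibliographic-data-application")


biblio = extract_biblio(SAMPLE)
print(biblio.tag, biblio.get("country"))
```

Everything outside the returned element (abstract, drawings, description, claims) is simply ignored.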
Utility patent grants fields
The XML files for patent data are available at:
- https://bulkdata.uspto.gov/
- http://patents.reedtech.com/patent-products.php
Patent data up to year 2015 can also be obtained from https://www.google.com/googlebooks/uspto-patents.html. This repository is no longer updated.
Each XML file contains, in order, sorted by document ID:
- Design patents
- Plant patents
- Reissues
- Utility patents
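One way to check which document types a file contains is to tally the `appl-type` attribute of `<application-reference>`. The sketch below is in Python for illustration only (the project's scripts are Perl). Only the value `utility` appears on this page; the other value used in the sample (`design`) and the helper name `grant_type` are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Illustrative, heavily trimmed grant documents (not real records).
DOCS = [
    '<us-patent-grant><us-bibliographic-data-grant>'
    '<application-reference appl-type="design"/>'
    '</us-bibliographic-data-grant></us-patent-grant>',
    '<us-patent-grant><us-bibliographic-data-grant>'
    '<application-reference appl-type="utility"/>'
    '</us-bibliographic-data-grant></us-patent-grant>',
    '<us-patent-grant><us-bibliographic-data-grant>'
    '<application-reference appl-type="utility"/>'
    '</us-bibliographic-data-grant></us-patent-grant>',
]


def grant_type(grant_xml):
    """Return the appl-type attribute of one grant, or 'unknown' if missing."""
    root = ET.fromstring(grant_xml)
    ref = root.find("./us-bibliographic-data-grant/application-reference")
    return ref.get("appl-type", "unknown") if ref is not None else "unknown"


counts = Counter(grant_type(doc) for doc in DOCS)
print(counts)
```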
Overview
DESIGN Patents:
 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v45-2014-04-03.dtd" [ ]>
 <us-patent-grant lang="EN" dtd-version="v4.5 2014-04-03" file="USD0774273-20161220.XML"
   status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20161205" date-publ="20161220">
   <us-bibliographic-data-grant>
   </us-bibliographic-data-grant>
   <drawings id="DRAWINGS">
   </drawings>
   <description id="description">
     <?brief-description-of-drawings description="Brief Description of Drawings" end="lead"?>
     <description-of-drawings>
     </description-of-drawings>
     <?brief-description-of-drawings description="Brief Description of Drawings" end="tail"?>
   </description>
   <us-claim-statement>CLAIM</us-claim-statement>
   <claims id="claims">
   </claims>
 </us-patent-grant>
Patent
- patent number
- kind: http://www.uspto.gov/patents-application-process/patent-search/authority-files/uspto-kind-codes
- grantdate
For version 4.5:
 <publication-reference>
   <document-id>
     <country>US</country>
     <doc-number>08925112</doc-number>
     <kind>B2</kind>
     <date>20150106</date>
   </document-id>
 </publication-reference>
- type
- applicationnumber
- filingdate
 <application-reference appl-type="utility">
   <document-id>
     <country>US</country>
     <doc-number>13824291</doc-number>
     <date>20110929</date>
   </document-id>
 </application-reference>
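The patent number, kind, grant date, type, application number and filing date listed above could be pulled out along the following lines. This is a Python sketch for illustration (the project's scripts are Perl); the function names `parse_date` and `parse_patent_header` and the output keys are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

SAMPLE = """<us-bibliographic-data-grant>
  <publication-reference>
    <document-id>
      <country>US</country>
      <doc-number>08925112</doc-number>
      <kind>B2</kind>
      <date>20150106</date>
    </document-id>
  </publication-reference>
  <application-reference appl-type="utility">
    <document-id>
      <country>US</country>
      <doc-number>13824291</doc-number>
      <date>20110929</date>
    </document-id>
  </application-reference>
</us-bibliographic-data-grant>"""


def parse_date(text):
    """Convert a YYYYMMDD string to a date; return None if missing or malformed."""
    try:
        return datetime.strptime(text or "", "%Y%m%d").date()
    except ValueError:
        return None


def parse_patent_header(biblio):
    """Collect the publication and application fields of one grant."""
    pub = biblio.find("./publication-reference/document-id")
    app_ref = biblio.find("./application-reference")
    app = app_ref.find("document-id")
    return {
        "patentnumber": pub.findtext("doc-number"),
        "country": pub.findtext("country"),
        "kind": pub.findtext("kind"),
        "grantdate": parse_date(pub.findtext("date")),
        "type": app_ref.get("appl-type"),
        "applicationnumber": app.findtext("doc-number"),
        "filingdate": parse_date(app.findtext("date")),
    }


header = parse_patent_header(ET.fromstring(SAMPLE))
print(header)
```

Returning `None` for malformed dates is a design choice for this sketch: some dates in the data, such as the cited-patent date `19140700` further down this page, have a day of `00` and are not valid calendar dates.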
For priority, if there is more than 1, we want sequence 01
- prioritydate
- prioritycountry (should use ISO country codes - may need a lookup table)
- prioritypatentnumber
- find 4.3 file with priority claim
 <priority-claims>
   <priority-claim sequence="01" kind="national">
     <country>GB</country>
     <doc-number>1016384.8</doc-number>
     <date>20100930</date>
   </priority-claim>
 </priority-claims>
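Selecting the sequence 01 claim when several are present might look like this. It is a Python sketch for illustration only (the project's scripts are Perl); `first_priority_claim` is a hypothetical name, and the second claim in the sample is invented to exercise the selection.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<us-bibliographic-data-grant>
  <priority-claims>
    <priority-claim sequence="02" kind="national">
      <country>XX</country>
      <doc-number>0000000.0</doc-number>
      <date>20101231</date>
    </priority-claim>
    <priority-claim sequence="01" kind="national">
      <country>GB</country>
      <doc-number>1016384.8</doc-number>
      <date>20100930</date>
    </priority-claim>
  </priority-claims>
</us-bibliographic-data-grant>"""


def first_priority_claim(biblio):
    """Return the sequence="01" priority claim as a dict, or None if there is none."""
    for claim in biblio.findall("./priority-claims/priority-claim"):
        if claim.get("sequence") == "01":
            return {
                "prioritycountry": claim.findtext("country"),
                "prioritypatentnumber": claim.findtext("doc-number"),
                "prioritydate": claim.findtext("date"),
            }
    return None


priority = first_priority_claim(ET.fromstring(SAMPLE))
print(priority)
```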
Classification IPC - we only need the first one: http://www.wipo.int/export/sites/www/classifications/ipc/en/guide/guide_ipc.pdf
- Section, Class, SubClass - Together these concord to US subclass: http://www.uspto.gov/web/patents/classification/international/ipc/ipc8/ipc_concordance/ipcsel.htm#a
- MainGroup, SubGroup
 <classifications-ipcr>
   <classification-ipcr>
     <ipc-version-indicator>
       <date>20060101</date>
     </ipc-version-indicator>
     <classification-level>A</classification-level>
     <section>B</section>
     <class>64</class>
     <subclass>G</subclass>
     <main-group>6</main-group>
     <subgroup>00</subgroup>
     <symbol-position>F</symbol-position>
     <classification-value>I</classification-value>
     ...
   </classification-ipcr>
   ...
 </classifications-ipcr>
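Taking only the first IPC classification and splitting it into the fields above could be sketched as follows. This is Python for illustration (the project's scripts are Perl). It assumes the section letter sits in a `<section>` element; `first_ipc` and the combined `symbol` string are hypothetical conveniences.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<us-bibliographic-data-grant>
  <classifications-ipcr>
    <classification-ipcr>
      <classification-level>A</classification-level>
      <section>B</section>
      <class>64</class>
      <subclass>G</subclass>
      <main-group>6</main-group>
      <subgroup>00</subgroup>
    </classification-ipcr>
    <classification-ipcr>
      <section>A</section>
      <class>41</class>
      <subclass>D</subclass>
      <main-group>13</main-group>
      <subgroup>00</subgroup>
    </classification-ipcr>
  </classifications-ipcr>
</us-bibliographic-data-grant>"""


def first_ipc(biblio):
    """Return the fields of the first classification-ipcr element, or None."""
    node = biblio.find("./classifications-ipcr/classification-ipcr")
    if node is None:
        return None

    def text(name):
        return (node.findtext(name) or "").strip()

    section, cls, subclass = text("section"), text("class"), text("subclass")
    main_group, subgroup = text("main-group"), text("subgroup")
    return {
        "section": section,
        "class": cls,
        "subclass": subclass,
        "maingroup": main_group,
        "subgroup": subgroup,
        # Convenience string in the usual "B64G 6/00" layout.
        "symbol": f"{section}{cls}{subclass} {main_group}/{subgroup}",
    }


ipc = first_ipc(ET.fromstring(SAMPLE))
print(ipc["symbol"])
```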
Classification CPC - we only need the main one
CPC is a classification scheme set up by the USPTO and the European Patent Office (EPO). The first classification codes rolled out on November 9, 2012.[1] Full implementation of the CPC classification system occurred in January 2015, at the same time as version 4.5 of the USPTO patent bulk data.[2]
- Section, Class, Subclass
- Main Group, Subgroup
- v 4.2, 4.3, 4.4 do not have this
 <classifications-cpc>
   <main-cpc>
     <classification-cpc>
       <cpc-version-indicator>
         <date>20130101</date>
       </cpc-version-indicator>
       <section>B</section>
       <class>64</class>
       <subclass>D</subclass>
       <main-group>10</main-group>
       <subgroup>00</subgroup>
       <symbol-position>F</symbol-position>
       <classification-value>I</classification-value>
       ...
     </classification-cpc>
   </main-cpc>
 </classifications-cpc>
Classification National: Note that the one below comes out to 2/2.11 (http://www.google.com/patents/US8925112#classifications)
- Country
- Class
THIS IS NOT UNIQUE. What classifications are we searching for?
 <classification-national>
   <country>US</country>
   <main-classification>2 211</main-classification>
 </classification-national>
Title of the patent:
<invention-title id="d2e61">Aircrew ensembles</invention-title>
Number of Claims:
<number-of-claims>12</number-of-claims>
Primary examiner:
- FirstName, LastName, Department
 <examiners>
   <primary-examiner>
     <last-name>Patel</last-name>
     <first-name>Tejash</first-name>
     <department>3765</department>
   </primary-examiner>
   ...
 </examiners>
PCT/Regional Patent Number:
- PCTNumber (just the doc number - if it starts with PCT set a flag)
- not in all v 4.5
- not in v 4.2, 4.3, 4.4
- maybe not all patents are filed under PCT, need to use code to search all files for key word
 <pct-or-regional-filing-data>
   <document-id>
     <country>WO</country>
     <doc-number>PCT/EP2011/067014</doc-number>
     <kind>00</kind>
     <date>20110929</date>
   </document-id>
   ...
 </pct-or-regional-filing-data>
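The "if it starts with PCT set a flag" rule, together with the fact that the element is often absent, can be sketched in Python as follows (illustration only; the project's scripts are Perl). `pct_filing` and the output keys are hypothetical names.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<us-bibliographic-data-grant>
  <pct-or-regional-filing-data>
    <document-id>
      <country>WO</country>
      <doc-number>PCT/EP2011/067014</doc-number>
      <kind>00</kind>
      <date>20110929</date>
    </document-id>
  </pct-or-regional-filing-data>
</us-bibliographic-data-grant>"""


def pct_filing(biblio):
    """Return the PCT/regional doc number and a PCT flag, or None if not present."""
    number = biblio.findtext("./pct-or-regional-filing-data/document-id/doc-number")
    if number is None:
        return None
    number = number.strip()
    return {"pctnumber": number, "is_pct": number.upper().startswith("PCT")}


pct = pct_filing(ET.fromstring(SAMPLE))
print(pct)
```

Patents without a `<pct-or-regional-filing-data>` element yield `None`, so the caller can tell a missing filing from a regional (non-PCT) one.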
Citations
Patent Citations (we need all of them):
- CitingPatentNumber (from the patent)
- CitingPatentCountry (from the patent)
 <publication-reference>
   <document-id>
     <country>US</country>
     <doc-number>08925112</doc-number>
     <kind>B2</kind>
     <date>20150106</date>
   </document-id>
 </publication-reference>
- CitedPatentNumber
- CitedPatentCountry
- V 4.2 does not have <us-references-cited>
 <us-references-cited>
   <us-citation>
     <patcit num="00001">
       <document-id>
         <country>US</country>
         <doc-number>1105569</doc-number>
         <kind>A</kind>
         <name>Lacrotte</name>
         <date>19140700</date>
       </document-id>
     </patcit>
     <category>cited by examiner</category>
     <classification-national>
       <country>US</country>
       <main-classification>2 214</main-classification>
     </classification-national>
   </us-citation>
   ...
 </us-references-cited>
For non-patent references, we are just going to count them:
- NoNonPatRefs
 <us-references-cited>
   ...
   <us-citation>
     <nplcit num="00020">
       <othercit> European Search Report dated Jan. 20, 2011 as received in European Patent Application No. GB1016384.8. </othercit>
     </nplcit>
     <category>cited by applicant</category>
   </us-citation>
 </us-references-cited>
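Collecting every cited patent while only counting non-patent references might be done as below. This is a Python sketch for illustration (the project's scripts are Perl); it covers only the `<us-references-cited>` layout shown here, and `parse_citations` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<us-bibliographic-data-grant>
  <us-references-cited>
    <us-citation>
      <patcit num="00001">
        <document-id>
          <country>US</country>
          <doc-number>1105569</doc-number>
          <kind>A</kind>
          <name>Lacrotte</name>
          <date>19140700</date>
        </document-id>
      </patcit>
      <category>cited by examiner</category>
    </us-citation>
    <us-citation>
      <nplcit num="00020">
        <othercit>European Search Report dated Jan. 20, 2011.</othercit>
      </nplcit>
      <category>cited by applicant</category>
    </us-citation>
  </us-references-cited>
</us-bibliographic-data-grant>"""


def parse_citations(biblio):
    """Return (list of (country, number) cited patents, count of non-patent refs)."""
    cited = []
    non_patent_refs = 0
    for citation in biblio.findall("./us-references-cited/us-citation"):
        doc = citation.find("./patcit/document-id")
        if doc is not None:
            cited.append((doc.findtext("country"), doc.findtext("doc-number")))
        elif citation.find("nplcit") is not None:
            non_patent_refs += 1
    return cited, non_patent_refs


cited, no_non_pat_refs = parse_citations(ET.fromstring(SAMPLE))
print(cited, no_non_pat_refs)
```

Pairing each `(country, number)` tuple with the citing patent's own number and country from `<publication-reference>` gives one row per citation.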
Inventors
- For v 4.3, 4.4, 4.5
- PatentNumber (and country) to build a key
- We need a standard name and address object for each inventor
 <us-parties>
   <us-applicants>
     ...
   </us-applicants>
   <inventors>
     <inventor sequence="001" designation="us-only">
       <addressbook>
         <last-name>Oliver</last-name>
         <first-name>Paul</first-name>
         <address>
           <city>Rhyl</city>
           <country>GB</country>
         </address>
       </addressbook>
     </inventor>
     ...
   </inventors>
   ...
 </us-parties>
- For v 4.2
 <parties>
   <applicants>
     <applicant sequence="001" app-type="applicant-inventor" designation="us-only">
       <addressbook>
         <last-name>Kamath</last-name>
         <first-name>Sandeep</first-name>
         <address>
           <city>Bangalore</city>
           <country>IN</country>
         </address>
       </addressbook>
       <nationality>
         <country>omitted</country>
       </nationality>
       <residence>
         <country>IN</country>
       </residence>
     </applicant>
     ...
   </applicants>
   ...
 </parties>
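A single "standard" inventor object that hides the difference between the two layouts could be sketched like this. It is Python for illustration only (the project's scripts are Perl). The `Inventor` dataclass and `parse_inventors` are hypothetical, and restricting v 4.2 applicants to `app-type="applicant-inventor"` is an assumption based on the sample above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

V45 = """<us-bibliographic-data-grant><us-parties><inventors>
  <inventor sequence="001" designation="us-only"><addressbook>
    <last-name>Oliver</last-name><first-name>Paul</first-name>
    <address><city>Rhyl</city><country>GB</country></address>
  </addressbook></inventor>
</inventors></us-parties></us-bibliographic-data-grant>"""

V42 = """<us-bibliographic-data-grant><parties><applicants>
  <applicant sequence="001" app-type="applicant-inventor" designation="us-only"><addressbook>
    <last-name>Kamath</last-name><first-name>Sandeep</first-name>
    <address><city>Bangalore</city><country>IN</country></address>
  </addressbook></applicant>
</applicants></parties></us-bibliographic-data-grant>"""


@dataclass
class Inventor:
    last_name: Optional[str]
    first_name: Optional[str]
    city: Optional[str]
    country: Optional[str]


def parse_inventors(biblio) -> List[Inventor]:
    """Read inventors from the v 4.3+ layout, falling back to the v 4.2 layout."""
    books = biblio.findall("./us-parties/inventors/inventor/addressbook")
    if not books:
        books = biblio.findall(
            "./parties/applicants/applicant[@app-type='applicant-inventor']/addressbook"
        )
    return [
        Inventor(
            last_name=book.findtext("last-name"),
            first_name=book.findtext("first-name"),
            city=book.findtext("address/city"),
            country=book.findtext("address/country"),
        )
        for book in books
    ]


new_style = parse_inventors(ET.fromstring(V45))
old_style = parse_inventors(ET.fromstring(V42))
print(new_style, old_style)
```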
Assignees
- PatentNumber (and country) to build a key
- We need a "standard" name and address object for each assignee
 <assignees>
   <assignee>
     <addressbook>
       <orgname>Survitec Group Limited</orgname>
       <role>03</role>
       <address>
         <city>Merseyside</city>
         <country>GB</country>
       </address>
     </addressbook>
   </assignee>
 </assignees>
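Keying each assignee record on the patent number and country might look like the following. This is a Python sketch for illustration (the project's scripts are Perl); `parse_assignees` and the record layout are hypothetical, and the fallback to a personal name when `<orgname>` is absent is an assumption not shown in the sample above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<us-bibliographic-data-grant>
  <publication-reference><document-id>
    <country>US</country><doc-number>08925112</doc-number>
  </document-id></publication-reference>
  <assignees>
    <assignee><addressbook>
      <orgname>Survitec Group Limited</orgname>
      <role>03</role>
      <address><city>Merseyside</city><country>GB</country></address>
    </addressbook></assignee>
  </assignees>
</us-bibliographic-data-grant>"""


def parse_assignees(biblio):
    """Return one record per assignee, keyed by (patent country, patent number)."""
    pub = biblio.find("./publication-reference/document-id")
    key = (pub.findtext("country"), pub.findtext("doc-number"))
    records = []
    for book in biblio.findall("./assignees/assignee/addressbook"):
        name = book.findtext("orgname")
        if name is None:
            # Assumed fallback for assignees recorded as individuals.
            parts = [book.findtext("last-name"), book.findtext("first-name")]
            name = ", ".join(p for p in parts if p) or None
        records.append({
            "patent_key": key,
            "name": name,
            "role": book.findtext("role"),
            "city": book.findtext("address/city"),
            "country": book.findtext("address/country"),
        })
    return records


assignees = parse_assignees(ET.fromstring(SAMPLE))
print(assignees)
```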
For further information on Assignee data from the USPTO, see USPTO Assignees Data.
Fields with Potential
- Abstract
- Claims (other than their count)
Things we don't need
General:
Classification related:
- Level - This appears to be either core or advanced. Not sure it matters.
- SymbolPosition, ClassificationValue - we likely don't need them
- Classification status and data source - no idea what these do
About the scripts
The scripts to process the Patent Data are all located under /bulk/Software/Scripts/PatentData/ ("E:\Software\Scripts\PatentData\")
There are currently 5 .pm files: PatentApplication.pm, Inventor.pm, Claim.pm, Addressbook.pm, and Loader.pm.
Each of the first 4 represents an object type. The last is a helper object that, given a schema file, extracts the wanted fields as a Perl object. Future work to support more schema files should be done in this file.
Example Usage:
perl PatentParser.pl -file=ipa150319.xml
This parses the XML file ipa150319.xml and writes each patent (in this case, each patent application) to a temporary XML file. A Loader object with a specified schema file (in this case "us-patent-application-v44-2014-04-03.dtd") then extracts each of the 4 object types from each patent. If an error occurs while parsing a file, that file is moved to a directory called "failed_files". A file that fails parsing is most likely not a utility patent.
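The first step, splitting the bulk file into one document per patent, can be sketched as follows. This is a Python illustration, not the Perl code in PatentParser.pl. It assumes every document starts with its own `<?xml ...?>` declaration at the beginning of a line, as in the design patent example above; `split_documents` is a hypothetical name.

```python
def split_documents(lines):
    """Yield each XML document from an iterable of lines of a concatenated bulk file."""
    current = []
    for line in lines:
        # A new XML declaration marks the start of the next document.
        if line.lstrip().startswith("<?xml") and current:
            yield "".join(current)
            current = []
        current.append(line)
    if current:
        yield "".join(current)


# Tiny stand-in for a bulk file: two concatenated documents.
BULK = [
    '<?xml version="1.0" encoding="UTF-8"?>\n',
    '<us-patent-application id="a"></us-patent-application>\n',
    '<?xml version="1.0" encoding="UTF-8"?>\n',
    '<us-patent-application id="b"></us-patent-application>\n',
]

docs = list(split_documents(BULK))
print(len(docs))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so the whole bulk file never has to be held in memory at once.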
About the Harvard Dataverse
The patents from 1975-2010 loaded as .sqlite3 and csv files can be found at
I have also downloaded all of them onto the database server, where they can be found with:
cd /bulk/patent