Patent Data Extraction Scripts (Tool)

From edegan.com

Utility patent grants fields

Patent

For version 4.5:

<publication-reference>
 <document-id>
  <country>US</country>
  <doc-number>08925112</doc-number>
  <kind>B2</kind>
  <date>20150106</date>
 </document-id>
</publication-reference>
  • type
  • applicationnumber
  • filingdate
<application-reference appl-type="utility">
 <document-id>
  <country>US</country>
  <doc-number>13824291</doc-number>
  <date>20110929</date>
 </document-id>
</application-reference>
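
As a minimal sketch of how these fields could be read, the Python snippet below uses the standard-library `xml.etree.ElementTree`. The function name `extract_patent_ids`, the output keys, and the `us-bibliographic-data-grant` wrapper around the sample are illustrative assumptions; this is not how the existing Perl scripts do it.

```python
import xml.etree.ElementTree as ET

# Sample built from the two snippets above; the wrapper element is assumed.
SAMPLE = """<us-bibliographic-data-grant>
 <publication-reference>
  <document-id>
   <country>US</country>
   <doc-number>08925112</doc-number>
   <kind>B2</kind>
   <date>20150106</date>
  </document-id>
 </publication-reference>
 <application-reference appl-type="utility">
  <document-id>
   <country>US</country>
   <doc-number>13824291</doc-number>
   <date>20110929</date>
  </document-id>
 </application-reference>
</us-bibliographic-data-grant>"""


def extract_patent_ids(root):
    """Return patent and application identifiers (hypothetical key names)."""
    pub = root.find(".//publication-reference/document-id")
    app_ref = root.find(".//application-reference")
    app = app_ref.find("document-id")
    return {
        "patentnumber": pub.findtext("doc-number"),
        "patentcountry": pub.findtext("country"),
        "kind": pub.findtext("kind"),
        "grantdate": pub.findtext("date"),
        "type": app_ref.get("appl-type"),  # attribute, not a child element
        "applicationnumber": app.findtext("doc-number"),
        "filingdate": app.findtext("date"),
    }


ids = extract_patent_ids(ET.fromstring(SAMPLE))
```

The patent number is kept as a string so that the leading zero in `08925112` is not lost.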


For priority, if there is more than one claim, we want the one with sequence="01"

  • prioritydate
  • prioritycountry (should use ISO country codes - may need a lookup table)
  • prioritypatentnumber
  • find 4.3 file with priority claim
<priority-claims>
 <priority-claim sequence="01" kind="national">
  <country>GB</country>
  <doc-number>1016384.8</doc-number>
  <date>20100930</date>
 </priority-claim>
</priority-claims>
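
A possible way to pick the sequence 01 claim, sketched in Python. The function name `first_priority` and the output keys are assumptions, and the second claim in the sample is made up purely to show the selection.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<priority-claims>
 <priority-claim sequence="01" kind="national">
  <country>GB</country>
  <doc-number>1016384.8</doc-number>
  <date>20100930</date>
 </priority-claim>
 <priority-claim sequence="02" kind="national">
  <!-- made-up second claim, for illustration only -->
  <country>GB</country>
  <doc-number>0000000.0</doc-number>
  <date>20101231</date>
 </priority-claim>
</priority-claims>"""


def first_priority(root):
    """Return the priority claim with sequence="01", or None if there is none."""
    for claim in root.iter("priority-claim"):
        if claim.get("sequence") == "01":
            return {
                "prioritydate": claim.findtext("date"),
                "prioritycountry": claim.findtext("country"),
                "prioritypatentnumber": claim.findtext("doc-number"),
            }
    return None


claim = first_priority(ET.fromstring(SAMPLE))
```

Mapping `prioritycountry` to ISO country codes would still need the lookup table mentioned above; the sketch returns the code as found in the file.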

Classification IPC - we only need the first one: http://www.wipo.int/export/sites/www/classifications/ipc/en/guide/guide_ipc.pdf

<classifications-ipcr>
 <classification-ipcr>
  <ipc-version-indicator>
   <date>20060101</date>
  </ipc-version-indicator>
  <classification-level>A</classification-level>
  <section>B</section>
  <class>64</class>
  <subclass>G</subclass>
  <main-group>6</main-group>
  <subgroup>00</subgroup>
  <symbol-position>F</symbol-position>
  <classification-value>I</classification-value>
  ...
 </classification-ipcr>
 ...
</classifications-ipcr>
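
Since only the first IPC classification is needed, a sketch like the following could take the first `classification-ipcr` element and join its parts into the usual symbol form. The function name `first_ipc` and the `symbol` key are assumptions.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<classifications-ipcr>
 <classification-ipcr>
  <ipc-version-indicator><date>20060101</date></ipc-version-indicator>
  <classification-level>A</classification-level>
  <section>B</section>
  <class>64</class>
  <subclass>G</subclass>
  <main-group>6</main-group>
  <subgroup>00</subgroup>
  <symbol-position>F</symbol-position>
  <classification-value>I</classification-value>
 </classification-ipcr>
</classifications-ipcr>"""

IPC_TAGS = ("section", "class", "subclass", "main-group", "subgroup")


def first_ipc(root):
    """Return the first IPC classification as a dict, or None if absent."""
    node = next(root.iter("classification-ipcr"), None)
    if node is None:
        return None
    parts = {tag: (node.findtext(tag) or "").strip() for tag in IPC_TAGS}
    # Join into the conventional "B64G 6/00" layout.
    parts["symbol"] = (
        parts["section"] + parts["class"] + parts["subclass"]
        + " " + parts["main-group"] + "/" + parts["subgroup"]
    )
    return parts


ipc = first_ipc(ET.fromstring(SAMPLE))
```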

Classification CPC - we only need the main one

CPC is a classification scheme set up jointly by the USPTO and the European Patent Office (EPO). The first classification codes rolled out on November 9, 2012.[1] Full implementation of the CPC classification system occurred in January 2015, at the same time as version 4.5 of the USPTO patent bulk data.[2]

  • Section, Class, Subclass
  • Main Group, Subgroup
  • v 4.2, 4.3, 4.4 do not have this
<classifications-cpc>
 <main-cpc>
  <classification-cpc>
    <cpc-version-indicator>
      <date>20130101</date>
    </cpc-version-indicator>
    <section>B</section>
    <class>64</class>
    <subclass>D</subclass>
    <main-group>10</main-group>
    <subgroup>00</subgroup>
    <symbol-position>F</symbol-position>
    <classification-value>I</classification-value>
    ...
  </classification-cpc>
 </main-cpc>
</classifications-cpc>
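
Because older versions do not carry CPC data, any extraction code has to tolerate a missing `<classifications-cpc>` block. The sketch below reads only the main CPC and returns `None` when it is absent; the function name `main_cpc` and the output keys are assumptions.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<classifications-cpc>
 <main-cpc>
  <classification-cpc>
   <cpc-version-indicator><date>20130101</date></cpc-version-indicator>
   <section>B</section>
   <class>64</class>
   <subclass>D</subclass>
   <main-group>10</main-group>
   <subgroup>00</subgroup>
   <symbol-position>F</symbol-position>
   <classification-value>I</classification-value>
  </classification-cpc>
 </main-cpc>
</classifications-cpc>"""


def main_cpc(root):
    """Return the main CPC classification, or None (e.g. for v4.2-4.4 files)."""
    main = next(root.iter("main-cpc"), None)
    node = main.find("classification-cpc") if main is not None else None
    if node is None:
        return None

    def text(tag):
        return (node.findtext(tag) or "").strip()

    return {
        "section": text("section"),
        "class": text("class"),
        "subclass": text("subclass"),
        "maingroup": text("main-group"),
        "subgroup": text("subgroup"),
        "symbol": f"{text('section')}{text('class')}{text('subclass')} "
                  f"{text('main-group')}/{text('subgroup')}",
    }


cpc = main_cpc(ET.fromstring(SAMPLE))
no_cpc = main_cpc(ET.fromstring("<us-bibliographic-data-grant/>"))
```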

Classification National: Note that the one below comes out to 2/2.11 (http://www.google.com/patents/US8925112#classifications)

  • Country
  • Class

THIS IS NOT UNIQUE. What classifications are we searching for?

<classification-national>
 <country>US</country>
  <main-classification>2 211</main-classification>
</classification-national>

Title of the patent:

<invention-title id="d2e61">Aircrew ensembles</invention-title>

Number of Claims:

<number-of-claims>12</number-of-claims>

Primary examiner:

  • FirstName, LastName, Department
<examiners>
 <primary-examiner>
  <last-name>Patel</last-name>
  <first-name>Tejash</first-name>
  <department>3765</department>
 </primary-examiner>
...
</examiners>
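
The title, claim count, and primary examiner are all simple scalar fields, so one small function can cover them. This is a sketch only: `extract_summary`, its key names, and the `us-patent-grant` wrapper around the sample are assumptions.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<us-patent-grant>
 <invention-title id="d2e61">Aircrew ensembles</invention-title>
 <number-of-claims>12</number-of-claims>
 <examiners>
  <primary-examiner>
   <last-name>Patel</last-name>
   <first-name>Tejash</first-name>
   <department>3765</department>
  </primary-examiner>
 </examiners>
</us-patent-grant>"""


def extract_summary(root):
    """Return title, claim count and primary examiner (hypothetical keys)."""
    title = root.find(".//invention-title")
    claims = (root.findtext(".//number-of-claims") or "").strip()
    examiner = root.find(".//examiners/primary-examiner")

    def ex(tag):
        return examiner.findtext(tag) if examiner is not None else None

    return {
        # itertext() also collects text nested inside inline markup, if any.
        "title": "".join(title.itertext()).strip() if title is not None else None,
        "numberofclaims": int(claims) if claims.isdigit() else None,
        "examinerfirstname": ex("first-name"),
        "examinerlastname": ex("last-name"),
        "examinerdepartment": ex("department"),
    }


summary = extract_summary(ET.fromstring(SAMPLE))
```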

PCT/Regional Patent Number:

  • PCTNumber (just the doc number - if it starts with PCT set a flag)
  • not in all v 4.5
  • not in v 4.2, 4.3, 4.4
  • not all patents are filed under the PCT, so we need code that searches all files for this element
<pct-or-regional-filing-data>
 <document-id>
  <country>WO</country>
  <doc-number>PCT/EP2011/067014</doc-number>
  <kind>00</kind>
  <date>20110929</date>
 </document-id>
...
</pct-or-regional-filing-data>
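
The PCT flag described above could be derived as follows. The function name `extract_pct` and the keys `pctnumber` and `ispct` are assumptions; the function returns `None` for documents that have no PCT or regional filing data.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<us-bibliographic-data-grant>
 <pct-or-regional-filing-data>
  <document-id>
   <country>WO</country>
   <doc-number>PCT/EP2011/067014</doc-number>
   <kind>00</kind>
   <date>20110929</date>
  </document-id>
 </pct-or-regional-filing-data>
</us-bibliographic-data-grant>"""


def extract_pct(root):
    """Return the PCT/regional doc number plus a flag, or None if absent."""
    node = next(root.iter("pct-or-regional-filing-data"), None)
    doc = node.find("document-id") if node is not None else None
    if doc is None:
        return None
    number = (doc.findtext("doc-number") or "").strip()
    return {
        "pctnumber": number,
        "ispct": number.upper().startswith("PCT"),  # flag set when it starts with PCT
    }


pct = extract_pct(ET.fromstring(SAMPLE))
```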

Citations

Patent Citations (we need all of them):

  • CitingPatentNumber (from the patent)
  • CitingPatentCountry (from the patent)
<publication-reference>
 <document-id>
  <country>US</country>
  <doc-number>08925112</doc-number>
  <kind>B2</kind>
  <date>20150106</date>
 </document-id>
</publication-reference>
  • CitedPatentNumber
  • CitedPatentCountry
  • V 4.2 does not have <us-references-cited>
<us-references-cited>
 <us-citation>
  <patcit num="00001">
   <document-id>
    <country>US</country>
    <doc-number>1105569</doc-number>
    <kind>A</kind>
    <name>Lacrotte</name>
    <date>19140700</date>
   </document-id>
  </patcit>
  <category>cited by examiner</category>
  <classification-national>
   <country>US</country>
   <main-classification>2 214</main-classification>
  </classification-national>
 </us-citation>
...
</us-references-cited>
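
One way to build the citing/cited pairs is sketched below. It iterates over `patcit` elements directly, so it does not depend on the name of the wrapper element, which matters because v 4.2 has no `<us-references-cited>`. The function name `cited_patents` and the row keys are assumptions.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<us-references-cited>
 <us-citation>
  <patcit num="00001">
   <document-id>
    <country>US</country>
    <doc-number>1105569</doc-number>
    <kind>A</kind>
    <name>Lacrotte</name>
    <date>19140700</date>
   </document-id>
  </patcit>
  <category>cited by examiner</category>
 </us-citation>
 <us-citation>
  <nplcit num="00020"><othercit>European Search Report</othercit></nplcit>
  <category>cited by applicant</category>
 </us-citation>
</us-references-cited>"""


def cited_patents(root, citing_number, citing_country):
    """Return one row per patent citation; non-patent citations are skipped."""
    rows = []
    for patcit in root.iter("patcit"):
        doc = patcit.find("document-id")
        if doc is None:
            continue
        rows.append({
            "citingpatentnumber": citing_number,
            "citingpatentcountry": citing_country,
            "citedpatentnumber": doc.findtext("doc-number"),
            "citedpatentcountry": doc.findtext("country"),
        })
    return rows


rows = cited_patents(ET.fromstring(SAMPLE), "08925112", "US")
```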

For non-patent references, we are just going to count them:

  • NoNonPatRefs
<us-references-cited>
...
 <us-citation>
  <nplcit num="00020">
   <othercit>
    European Search Report dated Jan. 20, 2011 as received in European Patent Application No. GB1016384.8.
   </othercit>
  </nplcit>
  <category>cited by applicant</category>
 </us-citation>
</us-references-cited>
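
Counting the non-patent references then reduces to counting `nplcit` elements, as in this short sketch. The function name `count_non_patent_refs` is an assumption, and the citation text in the sample is shortened.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<us-references-cited>
 <us-citation>
  <patcit num="00001">
   <document-id><country>US</country><doc-number>1105569</doc-number></document-id>
  </patcit>
  <category>cited by examiner</category>
 </us-citation>
 <us-citation>
  <nplcit num="00020">
   <othercit>European Search Report dated Jan. 20, 2011</othercit>
  </nplcit>
  <category>cited by applicant</category>
 </us-citation>
</us-references-cited>"""


def count_non_patent_refs(root):
    """Return NoNonPatRefs: the number of <nplcit> elements in the document."""
    return sum(1 for _ in root.iter("nplcit"))


no_non_pat_refs = count_non_patent_refs(ET.fromstring(SAMPLE))
```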

Inventors

  • For v 4.3, 4.4, 4.5
  • PatentNumber (and country) to build a key
  • We need a standard name and address object for each inventor


<us-parties>
 <us-applicants>
...
 </us-applicants>
 <inventors>
  <inventor sequence="001" designation="us-only">
   <addressbook>
    <last-name>Oliver</last-name>
    <first-name>Paul</first-name>
    <address>
     <city>Rhyl</city>
     <country>GB</country>
    </address>
   </addressbook>
  </inventor>
...
 </inventors>
...
</us-parties>


  • For v 4.2
<parties>
 <applicants>
  <applicant sequence="001" app-type="applicant-inventor" designation="us-only">
   <addressbook>
    <last-name>Kamath</last-name>
    <first-name>Sandeep</first-name>
    <address>
     <city>Bangalore</city>
     <country>IN</country>
    </address>
   </addressbook>
   <nationality>
    <country>omitted</country>
   </nationality>
   <residence>
    <country>IN</country>
   </residence>
  </applicant>
 ...
 </applicants>
 ...
</parties>
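
A "standard" inventor object that covers both layouts might look like the sketch below. It looks for `<inventor>` elements first and falls back to applicants marked `app-type="applicant-inventor"`, as in the v 4.2 sample. The `Inventor` dataclass, its field names, and the placeholder patent number used for the v 4.2 sample are assumptions.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

V45 = """<us-parties><inventors>
 <inventor sequence="001" designation="us-only"><addressbook>
  <last-name>Oliver</last-name><first-name>Paul</first-name>
  <address><city>Rhyl</city><country>GB</country></address>
 </addressbook></inventor>
</inventors></us-parties>"""

V42 = """<parties><applicants>
 <applicant sequence="001" app-type="applicant-inventor" designation="us-only">
  <addressbook>
   <last-name>Kamath</last-name><first-name>Sandeep</first-name>
   <address><city>Bangalore</city><country>IN</country></address>
  </addressbook>
 </applicant>
</applicants></parties>"""


@dataclass
class Inventor:
    patentnumber: str            # with patentcountry, forms the key
    patentcountry: str
    sequence: Optional[str]
    firstname: Optional[str]
    lastname: Optional[str]
    city: Optional[str]
    country: Optional[str]


def extract_inventors(root, patentnumber, patentcountry="US") -> List[Inventor]:
    nodes = list(root.iter("inventor"))          # v 4.3, 4.4, 4.5
    if not nodes:                                # v 4.2 fallback
        nodes = [a for a in root.iter("applicant")
                 if a.get("app-type") == "applicant-inventor"]
    result = []
    for node in nodes:
        book = node.find("addressbook")
        if book is None:
            continue
        result.append(Inventor(
            patentnumber, patentcountry, node.get("sequence"),
            book.findtext("first-name"), book.findtext("last-name"),
            book.findtext("address/city"), book.findtext("address/country"),
        ))
    return result


new_style = extract_inventors(ET.fromstring(V45), "08925112")
old_style = extract_inventors(ET.fromstring(V42), "0000000")  # placeholder number
```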

Assignees

  • PatentNumber (and country) to build a key
  • We need a "standard" name and address object for each assignee
<assignees>
 <assignee>
  <addressbook>
   <orgname>Survitec Group Limited</orgname>
   <role>03</role>
   <address>
    <city>Merseyside</city>
    <country>GB</country>
   </address>
  </addressbook>
 </assignee>
</assignees>
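
Assignees can be flattened in the same spirit. In this sketch the name is taken from `orgname` when present; falling back to first and last name for assignees without an `orgname` is an assumption, as are the function name `extract_assignees` and the key names.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<assignees>
 <assignee>
  <addressbook>
   <orgname>Survitec Group Limited</orgname>
   <role>03</role>
   <address>
    <city>Merseyside</city>
    <country>GB</country>
   </address>
  </addressbook>
 </assignee>
</assignees>"""


def extract_assignees(root, patentnumber, patentcountry="US"):
    """Return one dict per assignee, keyed by patent number and country."""
    rows = []
    for node in root.iter("assignee"):
        book = node.find("addressbook")
        if book is None:
            continue
        name = book.findtext("orgname")
        if name is None:
            # Assumed fallback for assignees listed by personal name.
            first = book.findtext("first-name") or ""
            last = book.findtext("last-name") or ""
            name = (first + " " + last).strip() or None
        rows.append({
            "patentnumber": patentnumber,
            "patentcountry": patentcountry,
            "name": name,
            "role": book.findtext("role"),
            "city": book.findtext("address/city"),
            "country": book.findtext("address/country"),
        })
    return rows


assignees = extract_assignees(ET.fromstring(SAMPLE), "08925112")
```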


For further information on Assignee data from the USPTO, see USPTO Assignees Data.

Fields with Potential

  • Abstract
  • Claims (other than their count)

Things we don't need

General:

Classification related:

  • Level - This appears to be either core or advanced. Not sure it matters.
  • SymbolPosition, ClassificationValue - we likely don't need them
  • Classification status and data source - no idea what these do

About the scripts

The scripts to process the Patent Data are all located under /bulk/Software/Scripts/PatentData/ ("E:\Software\Scripts\PatentData\")

There are currently 5 .pm files available: PatentApplication.pm, Inventor.pm, Claim.pm, Addressbook.pm, and Loader.pm.

Each of the first 4 represents an object type. The last one, Loader.pm, is a helper object that, given a schema file, extracts the wanted fields as a Perl object. Future work to support more schema files should be done in this file.

Example Usage:

perl PatentParser.pl -file=ipa150319.xml

This parses the XML file ipa150319.xml and extracts each Patent (in this case, each PatentApplication) into a temporary XML file. A Loader object with a specified schema file, in this case "us-patent-application-v44-2014-04-03.dtd", then extracts each of the 4 object types from each Patent. If an error occurs while parsing a file, that file is moved to a directory called "failed_files". A file that fails parsing is most likely not a Utility patent.
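
The split-then-parse flow can be illustrated with a short Python sketch. It is not the Perl implementation: it assumes the bulk file is a concatenation of XML documents, each starting with its own `<?xml` declaration, and it collects failures in a list instead of moving files to "failed_files". The names `split_documents` and `parse_all` are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def split_documents(lines):
    """Yield each XML document of a concatenated bulk file as one string."""
    buffer = []
    for line in lines:
        # A new XML declaration marks the start of the next document.
        if line.lstrip().startswith("<?xml") and buffer:
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield "".join(buffer)


def parse_all(lines):
    """Parse every document; return (parsed roots, raw text of failures)."""
    parsed, failed = [], []
    for doc in split_documents(lines):
        try:
            parsed.append(ET.fromstring(doc.encode("utf-8")))
        except ET.ParseError:
            failed.append(doc)  # the Perl scripts move such files aside instead
    return parsed, failed


# Two tiny stand-in documents; the second is deliberately malformed.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<us-patent-application><invention-title>A</invention-title></us-patent-application>\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<us-patent-application><invention-title>B</us-patent-application>\n"
)

parsed, failed = parse_all(io.StringIO(SAMPLE))
```

For a real file, the `io.StringIO` object would be replaced by an open file handle, which is also iterated line by line, so the whole bulk file never has to be held in memory at once.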

About the Harvard Dataverse

The patents from 1975-2010, loaded as .sqlite3 and .csv files, can be found at:

Harvard Dataverse

I have also downloaded all of them onto the database server, where they can be found with:

cd /bulk/patent