Patent Data Extraction Scripts (Tool)

From edegan.com

Revision as of 18:03, 7 June 2016

Utility patent grants fields


For version 4.5:

  • type
  • applicationnumber
  • filingdate
<application-reference appl-type="utility">
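The three application fields above could be pulled out along these lines. This is a minimal Python sketch (the actual scripts are Perl), and it assumes the number and date sit in a <document-id> child of <application-reference>; the sample values are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed sample. The <document-id> layout is an assumption.
SAMPLE = """
<us-bibliographic-data-grant>
  <application-reference appl-type="utility">
    <document-id>
      <country>US</country>
      <doc-number>12345678</doc-number>
      <date>20100930</date>
    </document-id>
  </application-reference>
</us-bibliographic-data-grant>
"""

def application_fields(root):
    """Return type, applicationnumber and filingdate, or None if absent."""
    ref = root.find(".//application-reference")
    if ref is None:
        return None
    return {
        "type": ref.get("appl-type"),
        "applicationnumber": ref.findtext("document-id/doc-number"),
        "filingdate": ref.findtext("document-id/date"),
    }

fields = application_fields(ET.fromstring(SAMPLE))
print(fields)
```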

For priority, if there is more than one claim, we want the one with sequence 01

  • prioritydate
  • prioritycountry (should use ISO country codes - may need a lookup table)
  • prioritypatentnumber
  • TODO: find a v 4.3 file with a priority claim
 <priority-claim sequence="01" kind="national">
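Picking the sequence 01 claim might look like the sketch below. It is Python rather than the Perl used by the scripts, the <country>, <doc-number> and <date> children are assumed, and all sample values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with two claims, deliberately out of order.
SAMPLE = """
<priority-claims>
  <priority-claim sequence="02" kind="national">
    <country>GB</country><doc-number>0000002</doc-number><date>20100101</date>
  </priority-claim>
  <priority-claim sequence="01" kind="national">
    <country>GB</country><doc-number>0000001</doc-number><date>20091001</date>
  </priority-claim>
</priority-claims>
"""

def first_priority(root):
    """Return the fields of the claim with sequence="01", or None."""
    for claim in root.iter("priority-claim"):
        if claim.get("sequence") == "01":
            return {
                "prioritydate": claim.findtext("date"),
                "prioritycountry": claim.findtext("country"),
                "prioritypatentnumber": claim.findtext("doc-number"),
            }
    return None

priority = first_priority(ET.fromstring(SAMPLE))
print(priority)
```

Matching on the sequence attribute instead of taking the first element in document order avoids depending on how the claims happen to be ordered in the file.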

Classification IPC - we only need the first one: http://www.wipo.int/export/sites/www/classifications/ipc/en/guide/guide_ipc.pdf

  <class>64</class>
  <subclass>G</subclass>
  <main-group>6</main-group>
  <subgroup>00</subgroup>
  <symbol-position>F</symbol-position>
  <classification-value>I</classification-value>
  ...
  </classification-ipcr>
  ...
  </classifications-ipcr>

Classification CPC - we only need the main one

The Cooperative Patent Classification (CPC) is a classification scheme set up jointly by the USPTO and the European Patent Office (EPO). The first classification codes were rolled out on November 9, 2012.[1] Full implementation of the CPC system occurred in January 2015, coinciding with version 4.5 of the USPTO patent bulk data.[2]

  • Section, Class, Subclass
  • Main Group, Subgroup
  • v 4.2, 4.3, and 4.4 do not have this
  <class>64</class>
  <subclass>D</subclass>
  <main-group>10</main-group>
  <subgroup>00</subgroup>
  <symbol-position>F</symbol-position>
  <classification-value>I</classification-value>
  ...
  </classification-cpc>
  </main-cpc>
  </classifications-cpc>
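Both the IPC and CPC fragments can be turned into a conventional symbol such as "B64D 10/00" with one helper. The Python sketch below (not the Perl scripts themselves) assumes a <section> element precedes <class>, since the fragments above elide it, and the section value "B" in the sample is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. <section> is not shown in the notes and is assumed.
SAMPLE = """
<doc>
  <classifications-ipcr>
    <classification-ipcr>
      <section>B</section><class>64</class><subclass>G</subclass>
      <main-group>6</main-group><subgroup>00</subgroup>
    </classification-ipcr>
  </classifications-ipcr>
  <classifications-cpc>
    <main-cpc>
      <classification-cpc>
        <section>B</section><class>64</class><subclass>D</subclass>
        <main-group>10</main-group><subgroup>00</subgroup>
      </classification-cpc>
    </main-cpc>
  </classifications-cpc>
</doc>
"""

def classification_symbol(elem):
    """Join section, class, subclass, main group and subgroup into one symbol."""
    def part(name):
        return (elem.findtext(name) or "").strip()
    return "{}{}{} {}/{}".format(
        part("section"), part("class"), part("subclass"),
        part("main-group"), part("subgroup"))

def first_ipc(root):
    """Symbol of the first IPC classification, or None (we only need the first)."""
    elem = root.find(".//classification-ipcr")
    return classification_symbol(elem) if elem is not None else None

def main_cpc(root):
    """Symbol of the main CPC classification, or None (absent before v 4.5)."""
    elem = root.find(".//main-cpc/classification-cpc")
    return classification_symbol(elem) if elem is not None else None

root = ET.fromstring(SAMPLE)
print(first_ipc(root), "|", main_cpc(root))
```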

Classification National: Note that the one below comes out to 2/2.11 (http://www.google.com/patents/US8925112#classifications)

  • Country
  • Class

THIS IS NOT UNIQUE. What classifications are we searching for?

  <main-classification>2 211</main-classification>

Title of the patent:

<invention-title id="d2e61">Aircrew ensembles</invention-title>

Number of Claims:


Primary examiner:

  • FirstName, LastName, Department

PCT/Regional Patent Number:

  • PCTNumber (just the doc number; if it starts with "PCT", set a flag)
  • not in all v 4.5
  • not in v 4.2, 4.3, 4.4
  • not all patents may be filed under the PCT; we need code to search all files for the keyword
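The flag described above could be set as in this sketch. It is Python rather than Perl, the <pct-or-regional-filing-data> element name and its <document-id> child are assumptions about where the number lives, and the sample number is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. Element names and the doc number are assumptions.
SAMPLE = """
<us-bibliographic-data-grant>
  <pct-or-regional-filing-data>
    <document-id>
      <country>WO</country>
      <doc-number>PCT/GB2010/000001</doc-number>
    </document-id>
  </pct-or-regional-filing-data>
</us-bibliographic-data-grant>
"""

def pct_fields(root):
    """Return PCTNumber plus a flag, or None when no PCT/regional data exists."""
    number = root.findtext(".//pct-or-regional-filing-data/document-id/doc-number")
    if number is None:
        return None
    number = number.strip()
    return {"PCTNumber": number, "is_pct": number.upper().startswith("PCT")}

pct = pct_fields(ET.fromstring(SAMPLE))
print(pct)
```

Returning None for a missing element keeps "no PCT data in this patent" distinct from "has a regional number that is not a PCT number".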


Patent Citations (we need all of them):

  • CitingPatentNumber (from the patent)
  • CitingPatentCountry (from the patent)
  • CitedPatentNumber
  • CitedPatentCountry
  • V 4.2 does not have <us-references-cited>
  <patcit num="00001">
  <category>cited by examiner</category>
   <main-classification>2 214</main-classification>
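One way to produce the four citation columns is sketched below in Python (the scripts themselves are Perl). It assumes the citing patent is identified by <publication-reference>/<document-id> and that each <patcit> carries its own <document-id>. Iterating over <patcit> directly, rather than over <us-references-cited>, is an attempt to tolerate v 4.2, on the assumption that <patcit> exists there under a different wrapper. Sample numbers are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed sample. The document-id layout is assumed.
SAMPLE = """
<us-patent-grant>
  <us-bibliographic-data-grant>
    <publication-reference>
      <document-id><country>US</country><doc-number>8925112</doc-number></document-id>
    </publication-reference>
    <us-references-cited>
      <us-citation>
        <patcit num="00001">
          <document-id><country>US</country><doc-number>1234567</doc-number></document-id>
        </patcit>
        <category>cited by examiner</category>
      </us-citation>
      <us-citation>
        <patcit num="00002">
          <document-id><country>GB</country><doc-number>7654321</doc-number></document-id>
        </patcit>
        <category>cited by applicant</category>
      </us-citation>
    </us-references-cited>
  </us-bibliographic-data-grant>
</us-patent-grant>
"""

def citation_rows(root):
    """Return one dict per cited patent, keyed by the four wanted columns."""
    citing = root.find(".//publication-reference/document-id")
    if citing is None:
        return []
    rows = []
    for patcit in root.iter("patcit"):
        rows.append({
            "CitingPatentNumber": citing.findtext("doc-number"),
            "CitingPatentCountry": citing.findtext("country"),
            "CitedPatentNumber": patcit.findtext("document-id/doc-number"),
            "CitedPatentCountry": patcit.findtext("document-id/country"),
        })
    return rows

rows = citation_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```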

For non-patent references, we are just going to count them:

  • NoNonPatRefs
  <nplcit num="00020">
    European Search Report dated Jan. 20, 2011 as received in European Patent Application No. GB1016384.8.
  <category>cited by applicant</category>
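Since only the count is needed, NoNonPatRefs can be computed by counting <nplcit> elements. A minimal Python sketch, with a made-up sample wrapper:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two non-patent citations and one patent citation.
SAMPLE = """
<us-references-cited>
  <us-citation><patcit num="00001"/></us-citation>
  <us-citation><nplcit num="00002">A search report.</nplcit></us-citation>
  <us-citation><nplcit num="00003">A journal article.</nplcit></us-citation>
</us-references-cited>
"""

def count_non_patent_refs(root):
    """Count every <nplcit> element anywhere below root."""
    return sum(1 for _ in root.iter("nplcit"))

no_non_pat_refs = count_non_patent_refs(ET.fromstring(SAMPLE))
print(no_non_pat_refs)
```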


Inventors:

  • For v 4.3, 4.4, 4.5
  • PatentNumber (and country) to build a key
  • We need a "standard" name and address object for each inventor
   <inventor sequence="001" designation="us-only">

  • For v 4.2
  <applicant sequence="001" app-type="applicant-inventor" designation="us-only">
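A "standard" inventor object that hides the version difference might be sketched as follows. This is Python, not the Perl Inventor.pm, and the Party class, its field names, and the <addressbook> child layout are all assumptions made for illustration; the names and places in the samples are invented.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

@dataclass
class Party:
    """Hypothetical standard name-and-address record."""
    patent_key: str
    sequence: str
    first_name: str
    last_name: str
    city: str
    country: str

V45 = """<us-parties><inventors>
  <inventor sequence="001" designation="us-only"><addressbook>
    <last-name>Doe</last-name><first-name>Jane</first-name>
    <address><city>London</city><country>GB</country></address>
  </addressbook></inventor>
</inventors></us-parties>"""

V42 = """<parties><applicants>
  <applicant sequence="001" app-type="applicant-inventor" designation="us-only">
    <addressbook>
      <last-name>Doe</last-name><first-name>Jane</first-name>
      <address><city>London</city><country>GB</country></address>
    </addressbook>
  </applicant>
</applicants></parties>"""

def inventors(root, patent_key):
    """Use <inventor> when present, else fall back to applicant-inventors."""
    elems = list(root.iter("inventor"))
    if not elems:
        elems = [a for a in root.iter("applicant")
                 if a.get("app-type") == "applicant-inventor"]
    result = []
    for elem in elems:
        def text(path):
            return (elem.findtext("addressbook/" + path) or "").strip()
        result.append(Party(patent_key, elem.get("sequence", ""),
                            text("first-name"), text("last-name"),
                            text("address/city"), text("address/country")))
    return result

new_style = inventors(ET.fromstring(V45), "US8925112")
old_style = inventors(ET.fromstring(V42), "US8925112")
print(new_style == old_style)
```

Both layouts produce the same Party record, so downstream code only needs to deal with one shape.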


Assignees:

  • PatentNumber (and country) to build a key
  • We need a "standard" name and address object for each assignee
    <orgname>Survitec Group Limited</orgname>

Other things we might want

  • Abstract
  • Claims (other than their count)

Things we don't need


Classification related:

  • Level - This appears to be either core or advanced. Not sure it matters.
  • SymbolPosition, ClassificationValue - we likely don't need them
  • Classification status and data source - no idea what these do

About the scripts

The scripts to process the Patent Data are all located under /bulk/Software/Scripts/PatentData/ ("E:\Software\Scripts\PatentData\")

There are currently five .pm files: PatentApplication.pm, Inventor.pm, Claim.pm, Addressbook.pm, and Loader.pm.

Each of the first four represents an object type. The last, Loader.pm, is a helper that, given a schema file, extracts the wanted fields as a Perl object. Future work to support more schema files should be done in Loader.pm.

Example Usage:

perl PatentParser.pl -file=ipa150319.xml

This parses the XML file ipa150319.xml and extracts each patent (in this case, each patent application) into a temporary XML file. A Loader object, configured with a schema file (in this case "us-patent-application-v44-2014-04-03.dtd"), then extracts each of the four object types from every patent. If an error occurs while parsing a file, that file is moved to a directory called "failed_files". A file that fails parsing is most likely not a utility patent.
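The split-then-parse flow can be illustrated with a short Python sketch. It is not the Perl PatentParser.pl and does not reproduce its behaviour. It assumes the bulk file is a concatenation of XML documents, each starting with its own <?xml ...?> declaration at the beginning of a line, and it only records the index of a failed document instead of moving files to "failed_files".

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical two-document bulk file. The second document is truncated.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-application><n>1</n></us-patent-application>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-application><n>2</n>\n'
)

def split_documents(lines):
    """Yield each document of a concatenated bulk file as one string."""
    current = []
    for line in lines:
        # A new XML declaration marks the start of the next document.
        if line.startswith("<?xml") and current:
            yield "".join(current)
            current = []
        current.append(line)
    if current:
        yield "".join(current)

docs = list(split_documents(io.StringIO(SAMPLE)))
parsed, failed = [], []
for index, doc in enumerate(docs):
    try:
        parsed.append(ET.fromstring(doc.encode("utf-8")))
    except ET.ParseError:
        failed.append(index)  # stand-in for moving the file to failed_files

print(len(docs), len(parsed), failed)
```

Streaming line by line keeps memory use bounded by the size of a single document, which matters because the weekly bulk files are large.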

About the Harvard Dataverse

The patents from 1975-2010, loaded as .sqlite3 and CSV files, can be found at

Harvard Dataverse

I have also downloaded all of them onto the database server, where they can be found with:

cd /bulk/patent