Patent Data Extraction Scripts (Tool)

Project Information
Has title	Patent Data Extraction Scripts (Tool)
Has owner	Marcela Interiano
Has start date
Has deadline date
Has keywords	Tool
Has project status	Subsume
Subsumed by:	Patent Assignment Data Restructure
Has sponsor	McNair Center
	Copyright © 2019 edegan.com. All Rights Reserved.

Patent applications

<us-patent-application lang="EN" dtd-version="v4.0 2004-12-02" file="US20050000001A1-20050106.XML"
status="PARALLEL-RUN" id="us-patent-application" country="US" date-produced="20041222" date-publ="20050106">
  <us-bibliographic-data-application lang="EN" country="US">
     ...
  </us-bibliographic-data-application>
  <abstract id="abstract">
  </abstract>
  <drawings id="DRAWINGS">
  </drawings>
  <description id="description">
     <?summary-of-invention description="Summary of Invention" end="lead"?>
     <?summary-of-invention description="Summary of Invention" end="tail"?>
     <?brief-description-of-drawings description="Brief Description of Drawings" end="lead"?>
     <?brief-description-of-drawings description="Brief Description of Drawings" end="tail"?>
     <?detailed-description description="Detailed Description" end="lead"?>
     <?detailed-description description="Detailed Description" end="tail"?>
  </description>
  <claims id="claims">
  </claims>
</us-patent-application>

We are currently processing only:

<us-bibliographic-data-application lang="EN" country="US">
   ...
</us-bibliographic-data-application>

Utility patent grants fields

The XML files for patent data are available at

Patent data through 2015 can also be obtained from https://www.google.com/googlebooks/uspto-patents.html. That repository is no longer updated.

Each XML file contains, in order, sorted by document ID:

Design patents
Plant patents
Reissues
Utility patents

Overview

DESIGN Patents:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v45-2014-04-03.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.5 2014-04-03" file="USD0774273-20161220.XML" 
status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20161205" date-publ="20161220">
 <us-bibliographic-data-grant>
 </us-bibliographic-data-grant>
 <drawings id="DRAWINGS">
 </drawings>
 <description id="description">
  <?brief-description-of-drawings description="Brief Description of Drawings" end="lead"?>
  <description-of-drawings>
  </description-of-drawings>
  <?brief-description-of-drawings description="Brief Description of Drawings" end="tail"?>
 </description>
 <us-claim-statement>CLAIM</us-claim-statement>
 <claims id="claims">
 </claims>
</us-patent-grant>

Patent

patent number
kind: http://www.uspto.gov/patents-application-process/patent-search/authority-files/uspto-kind-codes
grantdate

For version 4.5:

<publication-reference>
 <document-id>
  <country>US</country>
  <doc-number>08925112</doc-number>
  <kind>B2</kind>
  <date>20150106</date>
 </document-id>
</publication-reference>

type
applicationnumber
filingdate

<application-reference appl-type="utility">
 <document-id>
  <country>US</country>
  <doc-number>13824291</doc-number>
  <date>20110929</date>
 </document-id>
</application-reference>
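
As a rough illustration of how the six fields above map onto the two elements, here is a minimal Python sketch. It is not the project's Perl code; the function name `patent_fields` and the output key names are hypothetical, and it assumes both elements are direct children of the bibliographic-data element.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<us-bibliographic-data-grant>
 <publication-reference>
  <document-id>
   <country>US</country>
   <doc-number>08925112</doc-number>
   <kind>B2</kind>
   <date>20150106</date>
  </document-id>
 </publication-reference>
 <application-reference appl-type="utility">
  <document-id>
   <country>US</country>
   <doc-number>13824291</doc-number>
   <date>20110929</date>
  </document-id>
 </application-reference>
</us-bibliographic-data-grant>"""


def patent_fields(biblio):
    """Return the identifying fields of one grant as a flat dict (hypothetical layout)."""
    pub = biblio.find("publication-reference/document-id")
    app_ref = biblio.find("application-reference")
    app = app_ref.find("document-id")
    return {
        "patentnumber": pub.findtext("doc-number"),
        "country": pub.findtext("country"),
        "kind": pub.findtext("kind"),
        "grantdate": pub.findtext("date"),
        # The application type is an attribute, not a child element.
        "type": app_ref.get("appl-type"),
        "applicationnumber": app.findtext("doc-number"),
        "filingdate": app.findtext("date"),
    }


fields = patent_fields(ET.fromstring(SAMPLE))
print(fields)
```

The values are kept as strings on purpose: the doc-number carries a leading zero that would be lost if it were converted to an integer.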

For priority claims, if there is more than one, we want the one with sequence 01.

prioritydate
prioritycountry (should use ISO country codes - may need a lookup table)
prioritypatentnumber

TODO: find a v4.3 file with a priority claim.

<priority-claims>
 <priority-claim sequence="01" kind="national">
  <country>GB</country>
  <doc-number>1016384.8</doc-number>
  <date>20100930</date>
 </priority-claim>
</priority-claims>
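
The "sequence 01" rule could be sketched as follows in Python. The helper name `first_priority` is hypothetical, and the country code is returned as found; mapping it to ISO codes through a lookup table is left out.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<priority-claims>
 <priority-claim sequence="01" kind="national">
  <country>GB</country>
  <doc-number>1016384.8</doc-number>
  <date>20100930</date>
 </priority-claim>
</priority-claims>"""


def first_priority(claims):
    """Return the sequence="01" priority claim as a dict, or None if there is none."""
    if claims is None:  # the patent has no <priority-claims> element
        return None
    for claim in claims.findall("priority-claim"):
        if claim.get("sequence") == "01":
            return {
                "prioritydate": claim.findtext("date"),
                "prioritycountry": claim.findtext("country"),
                "prioritypatentnumber": claim.findtext("doc-number"),
            }
    return None


priority = first_priority(ET.fromstring(SAMPLE))
print(priority)
```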

Classification IPC - we only need the first one: http://www.wipo.int/export/sites/www/classifications/ipc/en/guide/guide_ipc.pdf

Section, Class, SubClass - Together these concord to US subclass: http://www.uspto.gov/web/patents/classification/international/ipc/ipc8/ipc_concordance/ipcsel.htm#a
MainGroup, SubGroup

<classifications-ipcr>
 <classification-ipcr>
  <ipc-version-indicator>
   <date>20060101</date>
  </ipc-version-indicator>
  <classification-level>A</classification-level>
  <section>B</section>
  <class>64</class>
  <subclass>G</subclass>
  <main-group>6</main-group>
  <subgroup>00</subgroup>
  <symbol-position>F</symbol-position>
  <classification-value>I</classification-value>
...
 </classification-ipcr>
...
</classifications-ipcr>
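
A minimal Python sketch of taking only the first IPC classification. It assumes the section letter sits in a `<section>` element, as in the USPTO grant DTD; the function name `first_ipc` and the combined `symbol` field are hypothetical additions for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<classifications-ipcr>
 <classification-ipcr>
  <ipc-version-indicator><date>20060101</date></ipc-version-indicator>
  <classification-level>A</classification-level>
  <section>B</section>
  <class>64</class>
  <subclass>G</subclass>
  <main-group>6</main-group>
  <subgroup>00</subgroup>
  <symbol-position>F</symbol-position>
  <classification-value>I</classification-value>
 </classification-ipcr>
</classifications-ipcr>"""

IPC_PARTS = ("section", "class", "subclass", "main-group", "subgroup")


def first_ipc(classifications):
    """Return the first IPC classification as a dict, or None if there is none."""
    if classifications is None:
        return None
    first = classifications.find("classification-ipcr")  # find() returns the first match only
    if first is None:
        return None
    parts = {name: first.findtext(name) for name in IPC_PARTS}
    # Conventional display form of an IPC symbol, built from the five parts.
    parts["symbol"] = (
        f'{parts["section"]}{parts["class"]}{parts["subclass"]} '
        f'{parts["main-group"]}/{parts["subgroup"]}'
    )
    return parts


ipc = first_ipc(ET.fromstring(SAMPLE))
print(ipc["symbol"])
```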

Classification CPC - we only need the main one

CPC (Cooperative Patent Classification) is a classification scheme set up jointly by the USPTO and the European Patent Office (EPO). The first classification codes rolled out on November 9, 2012.[1] Full implementation of the CPC system occurred in January 2015, at the same time as version 4.5 of the USPTO patent bulk data.[2]

Section, Class, Subclass
Main Group, Subgroup

Versions 4.2, 4.3, and 4.4 do not have this element.

<classifications-cpc>
 <main-cpc>
  <classification-cpc>
    <cpc-version-indicator>
      <date>20130101</date>
    </cpc-version-indicator>
    <section>B</section>
    <class>64</class>
    <subclass>D</subclass>
    <main-group>10</main-group>
    <subgroup>00</subgroup>
    <symbol-position>F</symbol-position>
    <classification-value>I</classification-value>
 ... 
   </classification-cpc>
  </main-cpc>
</classifications-cpc>
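
Because the CPC block is missing from older files, any extraction has to tolerate its absence. A hedged Python sketch, assuming `<classifications-cpc>` is a direct child of the bibliographic-data element and the section letter is in a `<section>` element; `main_cpc` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

WITH_CPC = """<us-bibliographic-data-grant>
 <classifications-cpc>
  <main-cpc>
   <classification-cpc>
    <cpc-version-indicator><date>20130101</date></cpc-version-indicator>
    <section>B</section>
    <class>64</class>
    <subclass>D</subclass>
    <main-group>10</main-group>
    <subgroup>00</subgroup>
   </classification-cpc>
  </main-cpc>
 </classifications-cpc>
</us-bibliographic-data-grant>"""

# Stand-in for a v4.2-4.4 record, which has no CPC block at all.
WITHOUT_CPC = "<us-bibliographic-data-grant></us-bibliographic-data-grant>"


def main_cpc(biblio):
    """Return (section, class, subclass, main group, subgroup) of the main CPC, or None."""
    node = biblio.find("classifications-cpc/main-cpc/classification-cpc")
    if node is None:
        return None
    return tuple(
        node.findtext(name)
        for name in ("section", "class", "subclass", "main-group", "subgroup")
    )


cpc_new = main_cpc(ET.fromstring(WITH_CPC))
cpc_old = main_cpc(ET.fromstring(WITHOUT_CPC))
print(cpc_new, cpc_old)
```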

Classification National: Note that the one below comes out to 2/2.11 (http://www.google.com/patents/US8925112#classifications)

Country
Class

THIS IS NOT UNIQUE. What classifications are we searching for?

<classification-national>
 <country>US</country>
  <main-classification>2 211</main-classification>
</classification-national>

Title of the patent:

<invention-title id="d2e61">Aircrew ensembles</invention-title>

Number of Claims:

<number-of-claims>12</number-of-claims>

Primary examiner:

FirstName, LastName, Department

<examiners>
 <primary-examiner>
  <last-name>Patel</last-name>
  <first-name>Tejash</first-name>
  <department>3765</department>
 </primary-examiner>
...
</examiners>
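
The title, claim count, and primary examiner are all single-valued, so they can be read in one pass. The Python sketch below assumes the three elements are direct children of the bibliographic-data element; `simple_fields` and its key names are hypothetical. Titles are read with `itertext()` on the assumption that they may contain inline markup.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<us-bibliographic-data-grant>
 <invention-title id="d2e61">Aircrew ensembles</invention-title>
 <number-of-claims>12</number-of-claims>
 <examiners>
  <primary-examiner>
   <last-name>Patel</last-name>
   <first-name>Tejash</first-name>
   <department>3765</department>
  </primary-examiner>
 </examiners>
</us-bibliographic-data-grant>"""


def simple_fields(biblio):
    """Collect the single-valued fields; anything missing comes back as None."""
    title = biblio.find("invention-title")
    claims = (biblio.findtext("number-of-claims") or "").strip()
    examiner = biblio.find("examiners/primary-examiner")

    def examiner_part(tag):
        return examiner.findtext(tag) if examiner is not None else None

    return {
        # itertext() also picks up text inside nested tags within the title.
        "title": "".join(title.itertext()).strip() if title is not None else None,
        "numberofclaims": int(claims) if claims.isdigit() else None,
        "examinerfirstname": examiner_part("first-name"),
        "examinerlastname": examiner_part("last-name"),
        "examinerdepartment": examiner_part("department"),
    }


simple = simple_fields(ET.fromstring(SAMPLE))
print(simple)
```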

PCT/Regional Patent Number:

PCTNumber (just the doc number; if it starts with PCT, set a flag)
Not present in every v4.5 record
Not present in v4.2, 4.3, or 4.4
Not all patents are filed under the PCT, so we need code that searches all files for the keyword

<pct-or-regional-filing-data>
 <document-id>
  <country>WO</country>
  <doc-number>PCT/EP2011/067014</doc-number>
  <kind>00</kind>
  <date>20110929</date>
 </document-id>
...
</pct-or-regional-filing-data>
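
A small Python sketch of the "set a flag if it starts with PCT" rule, which also returns None when the element is missing. The function name `pct_fields` and the key names are hypothetical, and the element is assumed to be a direct child of the bibliographic-data element.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<us-bibliographic-data-grant>
 <pct-or-regional-filing-data>
  <document-id>
   <country>WO</country>
   <doc-number>PCT/EP2011/067014</doc-number>
   <kind>00</kind>
   <date>20110929</date>
  </document-id>
 </pct-or-regional-filing-data>
</us-bibliographic-data-grant>"""


def pct_fields(biblio):
    """Return the PCT/regional doc number plus a PCT flag, or None if not filed that way."""
    doc = biblio.find("pct-or-regional-filing-data/document-id")
    if doc is None:
        return None
    number = (doc.findtext("doc-number") or "").strip()
    return {"pctnumber": number, "ispct": number.startswith("PCT")}


pct = pct_fields(ET.fromstring(SAMPLE))
no_pct = pct_fields(ET.fromstring("<us-bibliographic-data-grant/>"))
print(pct, no_pct)
```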

Citations

Patent Citations (we need all of them):

CitingPatentNumber (from the patent)
CitingPatentCountry (from the patent)

<publication-reference>
 <document-id>
  <country>US</country>
  <doc-number>08925112</doc-number>
  <kind>B2</kind>
  <date>20150106</date>
 </document-id>
</publication-reference>

CitedPatentNumber
CitedPatentCountry

V 4.2 does not have <us-references-cited>

<us-references-cited>
 <us-citation>
  <patcit num="00001">
   <document-id>
    <country>US</country>
    <doc-number>1105569</doc-number>
    <kind>A</kind>
    <name>Lacrotte</name>
    <date>19140700</date>
   </document-id>
  </patcit>
  <category>cited by examiner</category>
  <classification-national>
   <country>US</country>
   <main-classification>2 214</main-classification>
  </classification-national>
 </us-citation>
...
</us-references-cited>
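
One way to turn the citing and cited identifiers into rows is sketched below in Python. The row layout (citing country, citing number, cited country, cited number) and the name `citation_rows` are hypothetical; non-patent citations are skipped because they have no `<patcit>` element.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<us-bibliographic-data-grant>
 <publication-reference>
  <document-id>
   <country>US</country>
   <doc-number>08925112</doc-number>
   <kind>B2</kind>
   <date>20150106</date>
  </document-id>
 </publication-reference>
 <us-references-cited>
  <us-citation>
   <patcit num="00001">
    <document-id>
     <country>US</country>
     <doc-number>1105569</doc-number>
     <kind>A</kind>
     <name>Lacrotte</name>
     <date>19140700</date>
    </document-id>
   </patcit>
   <category>cited by examiner</category>
  </us-citation>
 </us-references-cited>
</us-bibliographic-data-grant>"""


def citation_rows(biblio):
    """Return one (citing country, citing number, cited country, cited number) tuple per patent citation."""
    citing = biblio.find("publication-reference/document-id")
    key = (citing.findtext("country"), citing.findtext("doc-number"))
    rows = []
    # Only <patcit> entries are patent citations; <nplcit> entries do not match this path.
    for doc in biblio.findall("us-references-cited/us-citation/patcit/document-id"):
        rows.append(key + (doc.findtext("country"), doc.findtext("doc-number")))
    return rows


rows = citation_rows(ET.fromstring(SAMPLE))
print(rows)
```

For a file without `<us-references-cited>`, `findall` finds nothing and the function returns an empty list.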

For non-patent references, we are just going to count them:

NoNonPatRefs

<us-references-cited>
...
 <us-citation>
  <nplcit num="00020">
   <othercit>
    European Search Report dated Jan. 20, 2011 as received in European Patent Application No. GB1016384.8.
   </othercit>
  </nplcit>
  <category>cited by applicant</category>
 </us-citation>
</us-references-cited>
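
Counting the non-patent references comes down to counting `<nplcit>` elements. A minimal Python sketch, with the hypothetical name `count_non_patent_refs`:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<us-bibliographic-data-grant>
 <us-references-cited>
  <us-citation>
   <patcit num="00001">
    <document-id><country>US</country><doc-number>1105569</doc-number></document-id>
   </patcit>
   <category>cited by examiner</category>
  </us-citation>
  <us-citation>
   <nplcit num="00020">
    <othercit>
     European Search Report dated Jan. 20, 2011 as received in European Patent Application No. GB1016384.8.
    </othercit>
   </nplcit>
   <category>cited by applicant</category>
  </us-citation>
 </us-references-cited>
</us-bibliographic-data-grant>"""


def count_non_patent_refs(biblio):
    """Return NoNonPatRefs: the number of <nplcit> citations (0 if there are none)."""
    return len(biblio.findall("us-references-cited/us-citation/nplcit"))


no_non_pat_refs = count_non_patent_refs(ET.fromstring(SAMPLE))
print(no_non_pat_refs)
```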

Inventors

For v 4.3, 4.4, 4.5

PatentNumber (and country) to build a key
We need a standard name and address object for each inventor

<us-parties>
 <us-applicants>
...
 </us-applicants>
 <inventors>
  <inventor sequence="001" designation="us-only">
   <addressbook>
    <last-name>Oliver</last-name>
    <first-name>Paul</first-name>
    <address>
     <city>Rhyl</city>
     <country>GB</country>
    </address>
   </addressbook>
  </inventor>
...
 </inventors>
...
</us-parties>

For v 4.2

<parties>
 <applicants>
  <applicant sequence="001" app-type="applicant-inventor" designation="us-only">
   <addressbook>
    <last-name>Kamath</last-name>
    <first-name>Sandeep</first-name>
    <address>
     <city>Bangalore</city>
     <country>IN</country>
    </address>
   </addressbook>
   <nationality>
    <country>omitted</country>
   </nationality>
   <residence>
    <country>IN</country>
   </residence>
  </applicant>
 ...
 </applicants>
 ...
</parties>
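
The two layouts can be hidden behind one "standard" name and address object. The Python sketch below is an illustration only: the `Party` dataclass and `inventors` function are hypothetical names, and treating only `app-type="applicant-inventor"` applicants as inventors in v4.2 is an assumption based on the sample above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

V45 = """<us-bibliographic-data-grant>
 <us-parties>
  <inventors>
   <inventor sequence="001" designation="us-only">
    <addressbook>
     <last-name>Oliver</last-name>
     <first-name>Paul</first-name>
     <address><city>Rhyl</city><country>GB</country></address>
    </addressbook>
   </inventor>
  </inventors>
 </us-parties>
</us-bibliographic-data-grant>"""

V42 = """<us-bibliographic-data-grant>
 <parties>
  <applicants>
   <applicant sequence="001" app-type="applicant-inventor" designation="us-only">
    <addressbook>
     <last-name>Kamath</last-name>
     <first-name>Sandeep</first-name>
     <address><city>Bangalore</city><country>IN</country></address>
    </addressbook>
   </applicant>
  </applicants>
 </parties>
</us-bibliographic-data-grant>"""


@dataclass
class Party:
    """Hypothetical standard name and address object."""
    last_name: Optional[str]
    first_name: Optional[str]
    city: Optional[str]
    country: Optional[str]


def inventors(biblio) -> List[Party]:
    """Return the inventors of one patent, whichever layout the file uses."""
    nodes = biblio.findall("us-parties/inventors/inventor")
    if not nodes:
        # Older layout: inventors are listed as applicants (assumed filter on app-type).
        nodes = [
            a for a in biblio.findall("parties/applicants/applicant")
            if a.get("app-type") == "applicant-inventor"
        ]
    return [
        Party(
            last_name=n.findtext("addressbook/last-name"),
            first_name=n.findtext("addressbook/first-name"),
            city=n.findtext("addressbook/address/city"),
            country=n.findtext("addressbook/address/country"),
        )
        for n in nodes
    ]


new_style = inventors(ET.fromstring(V45))
old_style = inventors(ET.fromstring(V42))
print(new_style, old_style)
```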

Assignees

PatentNumber (and country) to build a key
We need a "standard" name and address object for each assignee

<assignees>
 <assignee>
  <addressbook>
   <orgname>Survitec Group Limited</orgname>
   <role>03</role>
   <address>
    <city>Merseyside</city>
    <country>GB</country>
   </address>
  </addressbook>
 </assignee>
</assignees>
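
A sketch of building assignee rows keyed by patent country and number, in Python. The names `assignee_rows` and the row keys are hypothetical, and `<assignees>` is assumed to be a direct child of the bibliographic-data element.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<us-bibliographic-data-grant>
 <publication-reference>
  <document-id>
   <country>US</country>
   <doc-number>08925112</doc-number>
  </document-id>
 </publication-reference>
 <assignees>
  <assignee>
   <addressbook>
    <orgname>Survitec Group Limited</orgname>
    <role>03</role>
    <address>
     <city>Merseyside</city>
     <country>GB</country>
    </address>
   </addressbook>
  </assignee>
 </assignees>
</us-bibliographic-data-grant>"""


def assignee_rows(biblio):
    """Return one dict per assignee, each carrying the (country, patent number) key."""
    pub = biblio.find("publication-reference/document-id")
    key = (pub.findtext("country"), pub.findtext("doc-number"))
    rows = []
    for book in biblio.findall("assignees/assignee/addressbook"):
        rows.append({
            "key": key,
            "orgname": book.findtext("orgname"),
            "role": book.findtext("role"),
            "city": book.findtext("address/city"),
            "country": book.findtext("address/country"),
        })
    return rows


assignees = assignee_rows(ET.fromstring(SAMPLE))
print(assignees)
```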

For further information on Assignee data from the USPTO, see USPTO Assignees Data.

Fields with Potential

Abstract
Claims (other than their count)

Things we don't need

General:

Series Code: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/filingyr.htm

Classification related:

Level - This appears to be either core or advanced. Not sure it matters.
SymbolPosition, ClassificationValue - we likely don't need them
Classification status and data source - no idea what these do

About the scripts

The scripts to process the Patent Data are all located under /bulk/Software/Scripts/PatentData/ ("E:\Software\Scripts\PatentData\")

There are currently 5 .pm files: PatentApplication.pm, Inventor.pm, Claim.pm, Addressbook.pm, and Loader.pm.

Each of the first four represents an object type. The last one, Loader.pm, is a helper object that, given a schema file, extracts the wanted fields as a Perl object. Future work to support more schema files should be done in this file.

Example Usage:

perl PatentParser.pl -file=ipa150319.xml

This parses the XML file ipa150319.xml and extracts each patent (in this case, each patent application) into a temporary XML file. It then uses a Loader object with a specified schema file, in this case "us-patent-application-v44-2014-04-03.dtd", to extract each of the four object types from each patent. If an error occurs while parsing a file, that file is moved to a directory called "failed_files". A file that fails parsing is most likely not a utility patent.
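
The split-then-parse flow can be illustrated with a short Python sketch. This is not the Perl implementation: the function names are hypothetical, it assumes every record in the bulk file starts with its own `<?xml ...?>` declaration line, and it collects failures in a list instead of moving files to "failed_files".

```python
import xml.etree.ElementTree as ET


def split_documents(lines):
    """Yield each XML document of a concatenated bulk file as one string."""
    current = []
    for line in lines:
        # A new XML declaration marks the start of the next record.
        if line.startswith("<?xml ") and current:
            yield "".join(current)
            current = []
        current.append(line)
    if current:
        yield "".join(current)


def parse_all(lines):
    """Parse every record; return (parsed roots, raw text of records that failed)."""
    parsed, failed = [], []
    for text in split_documents(lines):
        try:
            parsed.append(ET.fromstring(text))
        except ET.ParseError:
            failed.append(text)
    return parsed, failed


# Tiny stand-in for a bulk file: one good record and one truncated record.
BULK = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant file="USD0774273-20161220.XML"></us-patent-grant>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant file="broken">\n'
)

parsed, failed = parse_all(BULK.splitlines(keepends=True))
print(len(parsed), len(failed))
```

With a real file, an open text file handle can be passed in place of the list of lines, since iterating a file handle also yields lines.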

About the Harvard Dataverse

The patents from 1975-2010, loaded as .sqlite3 and CSV files, can be found at the Harvard Dataverse.

I have also downloaded all of them onto the database server, where they can be found with:

cd /bulk/patent
