, 17:03, 7 June 2016
===Utility patent grants fields===
====Patent====
*patent number
*kind: http://www.uspto.gov/patents-application-process/patent-search/authority-files/uspto-kind-codes
*grantdate
For version 4.5:
<publication-reference>
<document-id>
<country>US</country>
<doc-number>08925112</doc-number>
<kind>B2</kind>
<date>20150106</date>
</document-id>
</publication-reference>
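The three fields above can be read from the snippet roughly as follows. This is a minimal Python sketch, not the project's Perl code. The `<patent>` wrapper element and the names `parse_yyyymmdd` and `parse_publication_reference` are hypothetical, introduced only for illustration. The doc-number is returned exactly as it appears in the file, leading zero included.

```python
import xml.etree.ElementTree as ET
from datetime import date, datetime

# Sample from this page, wrapped in a hypothetical <patent> root element.
SAMPLE = """<patent>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>08925112</doc-number>
<kind>B2</kind>
<date>20150106</date>
</document-id>
</publication-reference>
</patent>"""


def parse_yyyymmdd(text):
    """Convert a YYYYMMDD string to a date; return None if missing or not a valid date."""
    try:
        return datetime.strptime(text.strip(), "%Y%m%d").date()
    except (AttributeError, ValueError):
        return None


def parse_publication_reference(root):
    """Return patent number, country, kind and grant date, or None if absent."""
    doc = root.find(".//publication-reference/document-id")
    if doc is None:
        return None
    return {
        "patentnumber": doc.findtext("doc-number"),
        "country": doc.findtext("country"),
        "kind": doc.findtext("kind"),
        "grantdate": parse_yyyymmdd(doc.findtext("date")),
    }


if __name__ == "__main__":
    print(parse_publication_reference(ET.fromstring(SAMPLE)))
```

Returning None for an unparseable date, rather than raising, keeps one bad record from stopping a whole bulk file.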
*type
*applicationnumber
*filingdate
<application-reference appl-type="utility">
<document-id>
<country>US</country>
<doc-number>13824291</doc-number>
<date>20110929</date>
</document-id>
</application-reference>
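The type comes from the appl-type attribute rather than from a child element. A small Python sketch of reading it together with the application number and filing date is given here. The `<patent>` wrapper and the function name `parse_application_reference` are hypothetical, and the date is left as the raw YYYYMMDD string.

```python
import xml.etree.ElementTree as ET

# Sample from this page, wrapped in a hypothetical <patent> root element.
SAMPLE = """<patent>
<application-reference appl-type="utility">
<document-id>
<country>US</country>
<doc-number>13824291</doc-number>
<date>20110929</date>
</document-id>
</application-reference>
</patent>"""


def parse_application_reference(root):
    """Return type, application number and filing date, or None if absent."""
    ref = root.find(".//application-reference")
    if ref is None:
        return None
    return {
        "type": ref.get("appl-type"),  # attribute, not a child element
        "applicationnumber": ref.findtext("document-id/doc-number"),
        "filingdate": ref.findtext("document-id/date"),
    }


if __name__ == "__main__":
    print(parse_application_reference(ET.fromstring(SAMPLE)))
```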
For priority claims, if there is more than one, we want the one with sequence="01".
*prioritydate
*prioritycountry (should use ISO country codes - may need a lookup table)
*prioritypatentnumber
*'''find 4.3 file with priority claim'''
<priority-claims>
<priority-claim sequence="01" kind="national">
<country>GB</country>
<doc-number>1016384.8</doc-number>
<date>20100930</date>
</priority-claim>
</priority-claims>
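Picking the sequence 01 claim could look like the Python sketch below. The second claim in the sample is made up to show the selection and does not come from a real patent. The `<patent>` wrapper and the name `select_priority_claim` are also hypothetical. The country code is returned as found; any ISO lookup would be a separate step.

```python
import xml.etree.ElementTree as ET

# First claim is hypothetical (illustration only); second is the sample from this page.
SAMPLE = """<patent>
<priority-claims>
<priority-claim sequence="02" kind="national">
<country>EP</country>
<doc-number>00000000.0</doc-number>
<date>20101001</date>
</priority-claim>
<priority-claim sequence="01" kind="national">
<country>GB</country>
<doc-number>1016384.8</doc-number>
<date>20100930</date>
</priority-claim>
</priority-claims>
</patent>"""


def select_priority_claim(root):
    """Return the single claim, or the sequence="01" claim when there are several."""
    claims = root.findall(".//priority-claims/priority-claim")
    if not claims:
        return None
    if len(claims) == 1:
        chosen = claims[0]
    else:
        chosen = next((c for c in claims if c.get("sequence") == "01"), None)
        if chosen is None:
            return None
    return {
        "prioritydate": chosen.findtext("date"),
        "prioritycountry": chosen.findtext("country"),
        "prioritypatentnumber": chosen.findtext("doc-number"),
    }


if __name__ == "__main__":
    print(select_priority_claim(ET.fromstring(SAMPLE)))
```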
Classification IPC - we only need the first one: http://www.wipo.int/export/sites/www/classifications/ipc/en/guide/guide_ipc.pdf
*Section, Class, SubClass - Together these concord to US subclass: http://www.uspto.gov/web/patents/classification/international/ipc/ipc8/ipc_concordance/ipcsel.htm#a
*MainGroup, SubGroup
<classifications-ipcr>
<classification-ipcr>
<ipc-version-indicator>
<date>20060101</date>
</ipc-version-indicator>
<classification-level>A</classification-level>
<section>B</section>
<class>64</class>
<subclass>G</subclass>
<main-group>6</main-group>
<subgroup>00</subgroup>
<symbol-position>F</symbol-position>
<classification-value>I</classification-value>
...
</classification-ipcr>
...
</classifications-ipcr>
Classification CPC - we only need the main one
CPC (Cooperative Patent Classification) is a classification scheme set up jointly by the USPTO and the European Patent Office (EPO). The first classification codes rolled out on November 9, 2012.[http://www.cooperativepatentclassification.org/cpcSchemeAndDefinitions.html] Full implementation of the CPC classification system occurred in January 2015, at the same time as version 4.5 of the USPTO patent bulk data.[http://www.uspto.gov/sites/default/files/about/advisory/ppac/120927-09a-international_cpc.pdf]
*Section, Class, Subclass
*Main Group, Subgroup
*'''v 4.2, 4.3, 4.4 do not have this'''
<classifications-cpc>
<main-cpc>
<classification-cpc>
<cpc-version-indicator>
<date>20130101</date>
</cpc-version-indicator>
<section>B</section>
<class>64</class>
<subclass>D</subclass>
<main-group>10</main-group>
<subgroup>00</subgroup>
<symbol-position>F</symbol-position>
<classification-value>I</classification-value>
...
</classification-cpc>
</main-cpc>
</classifications-cpc>
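The IPC and CPC blocks share the same five fields, so one helper can serve both. The Python sketch below takes the first classification-ipcr and the classification-cpc inside main-cpc, and returns None when a block is missing, as with CPC in v 4.2 to 4.4. The `<patent>` wrapper, the function names and the combined "symbol" string are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Samples from this page (trimmed), wrapped in a hypothetical <patent> root element.
SAMPLE = """<patent>
<classifications-ipcr>
<classification-ipcr>
<section>B</section><class>64</class><subclass>G</subclass>
<main-group>6</main-group><subgroup>00</subgroup>
</classification-ipcr>
</classifications-ipcr>
<classifications-cpc>
<main-cpc>
<classification-cpc>
<section>B</section><class>64</class><subclass>D</subclass>
<main-group>10</main-group><subgroup>00</subgroup>
</classification-cpc>
</main-cpc>
</classifications-cpc>
</patent>"""


def classification_fields(node):
    """Read the five shared fields and build a display symbol such as 'B64G 6/00'."""
    def get(name):
        return (node.findtext(name) or "").strip()

    section, cls, subclass = get("section"), get("class"), get("subclass")
    main_group, subgroup = get("main-group"), get("subgroup")
    return {
        "section": section,
        "class": cls,
        "subclass": subclass,
        "maingroup": main_group,
        "subgroup": subgroup,
        "symbol": f"{section}{cls}{subclass} {main_group}/{subgroup}",
    }


def first_ipc(root):
    """Return the first IPC classification in document order, or None."""
    node = root.find(".//classifications-ipcr/classification-ipcr")
    return classification_fields(node) if node is not None else None


def main_cpc(root):
    """Return the main CPC classification, or None when the file has no CPC block."""
    node = root.find(".//classifications-cpc/main-cpc/classification-cpc")
    return classification_fields(node) if node is not None else None


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(first_ipc(root)["symbol"], main_cpc(root)["symbol"])
```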
Classification National: Note that the one below comes out to 2/2.11 (http://www.google.com/patents/US8925112#classifications)
*Country
*Class
'''THIS IS NOT UNIQUE. What classifications are we searching for?'''
<classification-national>
<country>US</country>
<main-classification>2 211</main-classification>
</classification-national>
Title of the patent:
<invention-title id="d2e61">Aircrew ensembles</invention-title>
Number of Claims:
<number-of-claims>12</number-of-claims>
Primary examiner:
*FirstName, LastName, Department
<examiners>
<primary-examiner>
<last-name>Patel</last-name>
<first-name>Tejash</first-name>
<department>3765</department>
</primary-examiner>
...
</examiners>
PCT/Regional Patent Number:
*PCTNumber (just the doc number - if it starts with PCT set a flag)
*'''not in all v 4.5'''
*'''not in v 4.2, 4.3, 4.4'''
*'''possibly not all patents are filed under PCT; we need code that searches all files for the keyword'''
<pct-or-regional-filing-data>
<document-id>
<country>WO</country>
<doc-number>PCT/EP2011/067014</doc-number>
<kind>00</kind>
<date>20110929</date>
</document-id>
...
</pct-or-regional-filing-data>
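The PCT flag described above can be derived from the doc-number prefix. A short Python sketch follows; the `<patent>` wrapper, the function name `parse_pct` and the output key `is_pct` are hypothetical. A patent without the element yields None, which covers the versions and files where it is absent.

```python
import xml.etree.ElementTree as ET

# Sample from this page, wrapped in a hypothetical <patent> root element.
SAMPLE = """<patent>
<pct-or-regional-filing-data>
<document-id>
<country>WO</country>
<doc-number>PCT/EP2011/067014</doc-number>
<kind>00</kind>
<date>20110929</date>
</document-id>
</pct-or-regional-filing-data>
</patent>"""


def parse_pct(root):
    """Return the PCT/regional doc number and a flag, or None if the element is absent."""
    doc = root.find(".//pct-or-regional-filing-data/document-id")
    if doc is None:
        return None
    number = (doc.findtext("doc-number") or "").strip()
    return {
        "pctnumber": number,
        "is_pct": number.upper().startswith("PCT"),  # flag set when it starts with PCT
    }


if __name__ == "__main__":
    print(parse_pct(ET.fromstring(SAMPLE)))
```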
====Citations====
Patent Citations (we need all of them):
*CitingPatentNumber (from the patent)
*CitingPatentCountry (from the patent)
<publication-reference>
<document-id>
<country>US</country>
<doc-number>08925112</doc-number>
<kind>B2</kind>
<date>20150106</date>
</document-id>
</publication-reference>
*CitedPatentNumber
*CitedPatentCountry
*'''v 4.2 does not have <us-references-cited>'''
<us-references-cited>
<us-citation>
<patcit num="00001">
<document-id>
<country>US</country>
<doc-number>1105569</doc-number>
<kind>A</kind>
<name>Lacrotte</name>
<date>19140700</date>
</document-id>
</patcit>
<category>cited by examiner</category>
<classification-national>
<country>US</country>
<main-classification>2 214</main-classification>
</classification-national>
</us-citation>
...
</us-references-cited>
For non-patent references, we are just going to count them:
*NoNonPatRefs
<us-references-cited>
...
<us-citation>
<nplcit num="00020">
<othercit>
European Search Report dated Jan. 20, 2011 as received in European Patent Application No. GB1016384.8.
</othercit>
</nplcit>
<category>cited by applicant</category>
</us-citation>
</us-references-cited>
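Both citation requirements, all patent citations plus a count of non-patent references, can be met in one pass over us-citation. The Python sketch below assumes the v 4.3 to 4.5 layout shown above; v 4.2 files would need separate handling and simply return no rows here. The `<patent>` wrapper and the name `parse_citations` are hypothetical. Cited dates are not parsed, since values such as 19140700 are not complete dates.

```python
import xml.etree.ElementTree as ET

# Samples from this page (trimmed), wrapped in a hypothetical <patent> root element.
SAMPLE = """<patent>
<publication-reference>
<document-id><country>US</country><doc-number>08925112</doc-number></document-id>
</publication-reference>
<us-references-cited>
<us-citation>
<patcit num="00001">
<document-id>
<country>US</country><doc-number>1105569</doc-number><kind>A</kind>
<name>Lacrotte</name><date>19140700</date>
</document-id>
</patcit>
<category>cited by examiner</category>
</us-citation>
<us-citation>
<nplcit num="00020">
<othercit>European Search Report dated Jan. 20, 2011.</othercit>
</nplcit>
<category>cited by applicant</category>
</us-citation>
</us-references-cited>
</patent>"""


def parse_citations(root):
    """Return (list of patent citation rows, number of non-patent references)."""
    pub = root.find(".//publication-reference/document-id")
    citing_country = pub.findtext("country") if pub is not None else None
    citing_number = pub.findtext("doc-number") if pub is not None else None

    rows, no_non_pat_refs = [], 0
    for citation in root.findall(".//us-references-cited/us-citation"):
        cited = citation.find("patcit/document-id")
        if cited is not None:
            rows.append({
                "citingpatentnumber": citing_number,
                "citingpatentcountry": citing_country,
                "citedpatentnumber": cited.findtext("doc-number"),
                "citedpatentcountry": cited.findtext("country"),
            })
        elif citation.find("nplcit") is not None:
            no_non_pat_refs += 1
    return rows, no_non_pat_refs


if __name__ == "__main__":
    print(parse_citations(ET.fromstring(SAMPLE)))
```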
====Inventors====
*'''For v 4.3, 4.4, 4.5'''
*PatentNumber (and country) to build a key
*We need a "standard" name and address object for each inventor
<us-parties>
<us-applicants>
...
</us-applicants>
<inventors>
<inventor sequence="001" designation="us-only">
<addressbook>
<last-name>Oliver</last-name>
<first-name>Paul</first-name>
<address>
<city>Rhyl</city>
<country>GB</country>
</address>
</addressbook>
</inventor>
...
</inventors>
...
</us-parties>
*'''For v 4.2'''
<parties>
<applicants>
<applicant sequence="001" app-type="applicant-inventor" designation="us-only">
<addressbook>
<last-name>Kamath</last-name>
<first-name>Sandeep</first-name>
<address>
<city>Bangalore</city>
<country>IN</country>
</address>
</addressbook>
<nationality>
<country>omitted</country>
</nationality>
<residence>
<country>IN</country>
</residence>
</applicant>
...
</applicants>
...
</parties>
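One possible shape for the "standard" name and address object, covering both layouts, is sketched below in Python. The `Party` dataclass, the `<patent>` wrapper and the function names are hypothetical. Treating a v 4.2 applicant as an inventor when its app-type contains "inventor" is an assumption based only on the sample above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

# Samples from this page (trimmed), each wrapped in a hypothetical <patent> root.
SAMPLE_V45 = """<patent><us-parties><inventors>
<inventor sequence="001" designation="us-only"><addressbook>
<last-name>Oliver</last-name><first-name>Paul</first-name>
<address><city>Rhyl</city><country>GB</country></address>
</addressbook></inventor>
</inventors></us-parties></patent>"""

SAMPLE_V42 = """<patent><parties><applicants>
<applicant sequence="001" app-type="applicant-inventor" designation="us-only"><addressbook>
<last-name>Kamath</last-name><first-name>Sandeep</first-name>
<address><city>Bangalore</city><country>IN</country></address>
</addressbook></applicant>
</applicants></parties></patent>"""


@dataclass
class Party:
    sequence: Optional[str]
    first_name: Optional[str]
    last_name: Optional[str]
    city: Optional[str]
    country: Optional[str]


def _party(node):
    """Build a Party from an <inventor> or <applicant> element."""
    book = node.find("addressbook")

    def text(path):
        return book.findtext(path) if book is not None else None

    return Party(node.get("sequence"), text("first-name"), text("last-name"),
                 text("address/city"), text("address/country"))


def parse_inventors(root) -> List[Party]:
    """Use us-parties/inventors when present, else fall back to the v 4.2 layout."""
    nodes = root.findall(".//us-parties/inventors/inventor")
    if not nodes:
        # Assumption: v 4.2 marks inventors through the app-type attribute.
        nodes = [a for a in root.findall(".//parties/applicants/applicant")
                 if "inventor" in (a.get("app-type") or "")]
    return [_party(n) for n in nodes]


if __name__ == "__main__":
    print(parse_inventors(ET.fromstring(SAMPLE_V45)))
    print(parse_inventors(ET.fromstring(SAMPLE_V42)))
```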
====Assignees====
*PatentNumber (and country) to build a key
*We need a "standard" name and address object for each assignee
<assignees>
<assignee>
<addressbook>
<orgname>Survitec Group Limited</orgname>
<role>03</role>
<address>
<city>Merseyside</city>
<country>GB</country>
</address>
</addressbook>
</assignee>
</assignees>
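Assignee rows could be built as in the Python sketch below, with a (country, patent number) tuple as the key. The key format, the `<patent>` wrapper and the name `parse_assignees` are assumptions. The name fields are read alongside orgname on the assumption that an assignee may be a person rather than an organization.

```python
import xml.etree.ElementTree as ET

# Sample from this page, wrapped in a hypothetical <patent> root element.
SAMPLE = """<patent>
<assignees>
<assignee>
<addressbook>
<orgname>Survitec Group Limited</orgname>
<role>03</role>
<address>
<city>Merseyside</city>
<country>GB</country>
</address>
</addressbook>
</assignee>
</assignees>
</patent>"""


def parse_assignees(root, patent_country, patent_number):
    """Return one row per assignee, keyed by (patent country, patent number)."""
    rows = []
    for book in root.findall(".//assignees/assignee/addressbook"):
        rows.append({
            "key": (patent_country, patent_number),
            "orgname": book.findtext("orgname"),
            "first_name": book.findtext("first-name"),  # None for organizations
            "last_name": book.findtext("last-name"),    # None for organizations
            "role": book.findtext("role"),
            "city": book.findtext("address/city"),
            "country": book.findtext("address/country"),
        })
    return rows


if __name__ == "__main__":
    print(parse_assignees(ET.fromstring(SAMPLE), "US", "08925112"))
```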
====Other things we might want====
*Abstract
*Claims (other than their count)
====Things we don't need====
General:
*Series Code: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/filingyr.htm
Classification related:
*Level - This appears to be either core or advanced. Not sure it matters.
*SymbolPosition, ClassificationValue - we likely don't need them
*Classification status and data source - no idea what these do
====About the scripts====
The scripts to process the Patent Data are all located under /bulk/Software/Scripts/PatentData/ ("E:\Software\Scripts\PatentData\")
There are currently 5 .pm files: PatentApplication.pm, Inventor.pm, Claim.pm, Addressbook.pm, and Loader.pm.
Each of the first 4 represents an object type. The last one, Loader.pm, is a helper object that, given a schema file, extracts the wanted fields as a Perl object.
Future work to support more schema files should be done in Loader.pm.
Example Usage:
perl PatentParser.pl -file=ipa150319.xml
This parses the XML file ipa150319.xml and extracts each patent (in this case, each PatentApplication) into a temporary XML file. A Loader object with a specified schema file, in this case "us-patent-application-v44-2014-04-03.dtd", then extracts each of the 4 object types from every patent.
If an error occurs while parsing a file, that file is moved to a directory called "failed_files". A file that fails parsing is most likely not a utility patent.
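The splitting step can be illustrated in Python. This is a sketch of the general technique only, not a description of what PatentParser.pl does internally. It assumes that every patent in the bulk file begins with an XML declaration at the start of a line; the name `split_documents` is hypothetical.

```python
import xml.etree.ElementTree as ET


def split_documents(lines):
    """Yield one string per patent from a bulk file of concatenated XML documents.

    Assumption: each document starts with a line beginning with '<?xml'.
    """
    current = []
    for line in lines:
        if line.startswith("<?xml") and current:
            yield "".join(current)
            current = []
        current.append(line)
    if current:
        yield "".join(current)


def parse_bulk(path):
    """Return (parsed roots, documents that failed to parse) for one bulk file."""
    parsed, failed = [], []
    with open(path, encoding="utf-8") as handle:
        for document in split_documents(handle):
            try:
                parsed.append(ET.fromstring(document))
            except ET.ParseError:
                failed.append(document)  # candidates for a failed_files directory
    return parsed, failed
```

Collecting failures instead of stopping mirrors the "failed_files" behaviour described above, so one malformed patent does not abort the run.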
====About the Harvard Dataverse====
The patents from 1975-2010 loaded as .sqlite3 and csv files can be found at
[https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/15705 Harvard Dataverse]
I have also downloaded all of them onto the database server, where they can be found with:
cd /bulk/patent