{{Project
|Has project output=Data,Tool,Content,Guide
|Has sponsor=McNair Center
|Has title=Patent Data
|Has owner=Marcela Interiano
|Has start date=Spring 2016
|Has keywords=Patent, Database, Data
|Has project status=Subsume
|Due Date=NA
}}
This project maintains and updates patent data so that McNair Center staff can extract meaningful data for academic papers and reports. There are currently two primary sources for this data: the US Patent and Trademark Office and the Harvard Dataverse. Data from the LexMachina online database has been added to provide information on [[Guide to Patent Litigation (Wiki Page) | patent litigation]]. All the acquired data is stored in normalized tables that are accessed and modified using PostgreSQL. The patent data is separated into multiple databases by data source or subject matter. Each database consists of several tables, for which the known [[Patent Data Issues | issues]] have been recorded. This page also documents how to get the USPTO patent data and how to use the database.

==Data Sources==
Data has been extracted from the '''[[USPTO Bulk Data]]''', the '''[[Harvard Dataverse]]''', and '''[[Lex Machina]]''', an online patent litigation database. {{#section:USPTO_Bulk_Data_Processing|bulk}}{{#section:Harvard_Dataverse|dataverse}} The sources used were intended to follow the overall [[Data Model]] established by the McNair Center.

==ER Diagram==
[[Image:Patent_Data.png|1024x768px]]

==Database Specifics==
The [[Patent|Patent Database]] contains the datasets from the USPTO bulk data and the Harvard Dataverse, merged using PostgreSQL. Specifics on how the datasets were merged are given in [[Patent Data Processing - SQL Steps]]. The Patent Database focuses on patents, patent litigation, patent maintenance, patent assignment, and other details on patent owners.

The [[USPTOAssigneesData|USPTO Assignees Database (version 2)]] focuses on patent assignments: transactions between one or more patent owners and one or more other parties in which ownership of, or interest in, one or more patents is assigned or shared. It consists of historical assignment data provided by the USPTO in XML files. Specifics on how the database was built are given on the [[USPTO Assignees Data Processing]] page.

==Academic Projects==

===[[Little Guy Academic Paper|'Little Guy' Academic Paper]]===
The first application of the refined database will be the [[Little Guy Academic Paper]]. {{#section:Little_Guy_Academic_Paper|Little Guy}}

===Patent Trolls===
Academic Paper: The patent database will also be used to explore the existence of patent trolls and their characteristic litigation activity. An academic paper may be developed defining patent trolls and other entities often confused with patent trolls. The data from Lex Machina will be used to track troll behavior and associated outcomes, as well as the impact of other patent intermediary and assertion bodies.

Issue Brief: Based on an analysis of the litigation data from Lex Machina, an issue brief, tentatively titled [[The Truth Behind Patent Trolls Issue Brief| The Truth Behind Patent Trolls]], may be written to report on patent troll activity and on how best to curb abuses through [[Innovation Policy| innovation policy]] and reform.

== Downloading the files ==
The files (in XML format) for granted patent data can be obtained at [https://www.google.com/googlebooks/uspto-patents-grants-text.html granted patents].

The files for patent application data can be obtained at [https://www.google.com/googlebooks/uspto-patents-applications-text.html patent applications].

The files for maintenance fee data can be obtained at [https://www.google.com/googlebooks/uspto-patents-maintenance-fees.html maintenance fees].

Scripts are available to perform a bulk download of all of the above files. These scripts can also be found under /bulk/Software/download\ scripts ("E:\Software\download scripts") on the McNair RDP:

[http://www.edegan.com/wiki/index.php/Image:Applications_download_2001-2004.sh Script to download patent application data from 2001-2004]

[http://www.edegan.com/wiki/index.php/Image:Applications_download_2005-2015.sh Script to download patent application data from 2005-2015]
[http://www.edegan.com/wiki/index.php/Image:Grant_download_1976-2000.sh Script to download patent grant data from 1976-2000]

[http://www.edegan.com/wiki/index.php/Image:Grant_download_2001-2004.sh Script to download patent grant data from 2001-2004]

[http://www.edegan.com/wiki/index.php/Image:Grant_download_2005-2015.sh Script to download patent grant data from 2005-2015]

To use the scripts, save them as shell scripts, then either run
 $ sh Applications_download_2001-2004.sh
or first make the script executable and execute it:
 $ chmod a+x Applications_download_2001-2004.sh
 $ ./Applications_download_2001-2004.sh
Note that several hundred .zip files of roughly 100 MB each will be downloaded, so the process may take a while. When all the files have been downloaded, unzip them all using
 $ unzip '*.zip'
(The quotes matter: without them the shell expands the glob and unzip treats the extra names as members to extract from the first archive.)
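Alternatively, a per-file loop sidesteps the quoting question entirely (a sketch; the -o flag simply overwrites files left by any earlier partial extraction):

```shell
# Unzip every downloaded archive one at a time. Quoting "$z" means each
# archive name is passed to unzip individually, so no glob surprises.
unzip_all() {
  for z in *.zip; do
    unzip -o "$z"
  done
}
```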

==XML Schema Notes==

Tags we are using:
*CPC classification: https://en.wikipedia.org/wiki/Cooperative_Patent_Classification

Tags we aren't using:
*Kind codes: http://www.uspto.gov/learning-and-resources/support-centers/electronic-business-center/kind-codes-included-uspto-patent
*Series codes: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/filingyr.htm
== Parsing and Processing the XML files ==

The ParserSpliter.pl script first splits a large patent-data XML file into smaller XML files, one per patent, and then parses and processes each of those files.

Some of the files are malformed and will be moved to a ./failed_files directory. (Oddly, adding a character anywhere in these files makes them parse without complaint.)

To use this script, you will need XML::Simple and Try::Tiny installed.

Open up a CPAN shell:
 $ perl -e shell -MCPAN

Install:
 cpan[0]> install XML::Simple
 cpan[1]> install Try::Tiny
 cpan[2]> install Switch

Once the packages have been installed, use the script as in the following example:
 perl PatentParser.pl -file=ipa150319_small.xml
==Other Resources==

*The [[Harvard Dataverse]] page: http://www.edegan.com/wiki/index.php/Harvard_Dataverse
*[http://www.uspto.gov/learning-and-resources/xml-resources Documentation for the XML files]
*[http://www.uspto.gov/learning-and-resources/xml-resources/xml-resources-retrospective Retrospective XML resources (see also)]
*[https://www.w3.org/2000/04/schema_hack/ Tool to convert DTD to XSD]
*[https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/15705 Harvard Dataverse]
==New Notes==

The source files have transitioned from:
*https://www.google.com/googlebooks/uspto-patents-grants-text.html (no longer maintained)
to:
*https://bulkdata.uspto.gov/ (includes 2016 data)

The historic data is the same on both sides.

Each file contains, in order, sorted by document ID:
#Design patents (we will discard these)
#Plant patents (we will discard these)
#Reissues (we probably want them)
#Utility patents (we want them)

The classifications in the XML file are:
*IPC - these are good, and we just need the main classification
*CPC - as above
*USPC - just a number, but not split: is 22431 224/31 or 22/431, etc.?
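The USPC ambiguity can be made concrete: splitting the packed number at each known class prefix yields more than one valid candidate (a sketch; the class list passed in is illustrative, not the real USPC class roll):

```shell
# Enumerate candidate class/subclass splits of a packed USPC number,
# given a list of known class numbers (the list here is illustrative).
uspc_splits() {
  code=$1; shift
  for cls in "$@"; do
    case $code in
      "$cls"?*) echo "$cls/${code#$cls}" ;;
    esac
  done
}
```

For example, `uspc_splits 22431 22 224` prints both 22/431 and 224/31, which is why a full class lookup table is needed to disambiguate.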
==Scripts==
All the scripts related to the patent data are at:
 \\father\bulk\Software\Scripts\Patent
USPTO_Parser.pl crawls the USPTO website and downloads the concatenated XMLs to:
 \\father\bulk\PatentData
It should be run as follows:
 USPTO_Parser.pl year1 year2
which gets the data from year1 to year2.

Splitter.pl splits those concatenated XMLs into individual XMLs in:
 \\father\bulk\PatentData\Processed
Note: the ByYear (2010-2016) folders are for convenience (the XMLs inside them are post-processed to deal with genome sequences).
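A minimal version of that split can be sketched in shell, assuming (as in the weekly bulk files) that each patent document begins with its own <?xml declaration:

```shell
# Split a concatenated bulk-download file into one XML file per document.
# Assumes every patent document starts with an <?xml declaration line.
split_patents() {
  awk '/<\?xml/{n++} {print > ("doc_" n ".xml")}' "$1"
}
```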
xmlparser_4.5_4.4_4.3.pl processes the XMLs given the path where they are stored. This script is located at:
 \\father\bulk\PatentData\Processed
It should be run as:
 xmlparser_4.5_4.4_4.3.pl '\\father\bulk\PatentData\Processed\2010'
This will process all the XMLs present in the 2010 directory and store them in the database. The database connection string is hard-coded inside the script for now. The database name is patentDB (located in the postgres installation of the RDP server). We then pg_dump the data and pg_restore it on the dbase server.
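The dump-and-restore step might look like the following sketch. Only the database name (patentDB) comes from this page; the -Fc custom archive format and the .dump file naming are assumptions:

```shell
# Command lines for moving a database between servers with pg_dump/pg_restore.
# -Fc writes the custom archive format that pg_restore reads back in.
dump_cmd()    { echo "pg_dump -Fc -d $1 -f $1.dump"; }
restore_cmd() { echo "pg_restore -d $1 $1.dump"; }
```

On the RDP server run `$(dump_cmd patentDB)`, copy patentDB.dump across, then run `$(restore_cmd patentDB)` on the dbase server (after creating an empty patentDB there).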
==Fields of Interest==

We only care about Utility patents (and maybe Reissue patents too).

===Utility patent grants fields===

====Patent====

*patent number
*kind: http://www.uspto.gov/patents-application-process/patent-search/authority-files/uspto-kind-codes
*grantdate
For version 4.5:
 <publication-reference>
  <document-id>
   <country>US</country>
   <doc-number>08925112</doc-number>
   <kind>B2</kind>
   <date>20150106</date>
  </document-id>
 </publication-reference>
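For quick inspection of these one-tag-per-line samples, a sed one-liner is enough (a sketch; real parsing is done by the Perl scripts described elsewhere on this page):

```shell
# Grab the first value of a simple tag from a one-tag-per-line XML file.
tagval() {
  sed -n "s:.*<$1>\(.*\)</$1>.*:\1:p" "$2" | head -1
}
```

For example, `tagval doc-number pub.xml` pulls the patent number out of a publication-reference block.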
*type
*applicationnumber
*filingdate
 <application-reference appl-type="utility">
  <document-id>
   <country>US</country>
   <doc-number>13824291</doc-number>
   <date>20110929</date>
  </document-id>
 </application-reference>
For priority, if there is more than one, we want sequence 01:
*prioritydate
*prioritycountry (should use ISO country codes - may need a lookup table)
*prioritypatentnumber
*'''find a 4.3 file with a priority claim'''

 <priority-claims>
  <priority-claim sequence="01" kind="national">
   <country>GB</country>
   <doc-number>1016384.8</doc-number>
   <date>20100930</date>
  </priority-claim>
 </priority-claims>
Classification IPC - we only need the first one: http://www.wipo.int/export/sites/www/classifications/ipc/en/guide/guide_ipc.pdf
*Section, Class, SubClass - together these concord to the US subclass: http://www.uspto.gov/web/patents/classification/international/ipc/ipc8/ipc_concordance/ipcsel.htm#a
*MainGroup, SubGroup

 <classifications-ipcr>
  <classification-ipcr>
   <ipc-version-indicator>
    <date>20060101</date>
   </ipc-version-indicator>
   <classification-level>A</classification-level>
   <section>B</section>
   <class>64</class>
   <subclass>G</subclass>
   <main-group>6</main-group>
   <subgroup>00</subgroup>
   <symbol-position>F</symbol-position>
   <classification-value>I</classification-value>
   ...
  </classification-ipcr>
  ...
 </classifications-ipcr>
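The main IPC symbol can be assembled from the first occurrence of each field with a small awk sketch (assumes the one-tag-per-line layout shown above):

```shell
# Assemble the main IPC symbol (e.g. B64G 6/00) from the first occurrence
# of each field in a one-tag-per-line classification block.
ipc_symbol() {
  awk -F'[<>]' '
    /<section>/    && !a {a=1; s=$3}
    /<class>/      && !b {b=1; c=$3}
    /<subclass>/   && !d {d=1; sc=$3}
    /<main-group>/ && !e {e=1; m=$3}
    /<subgroup>/   && !f {f=1; g=$3}
    END { printf "%s%s%s %s/%s\n", s, c, sc, m, g }' "$1"
}
```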
Classification CPC - we only need the main one.

CPC is a classification scheme set up by the USPTO and the European Patent Office (EPO). The first classification codes rolled out on November 9, 2012.[http://www.cooperativepatentclassification.org/cpcSchemeAndDefinitions.html] Full implementation of the CPC classification system occurred in January 2015, at the same time as version 4.5 of the USPTO patent bulk data.[http://www.uspto.gov/sites/default/files/about/advisory/ppac/120927-09a-international_cpc.pdf]

*Section, Class, Subclass
*Main Group, Subgroup
*'''v 4.2, 4.3, and 4.4 do not have this'''
 <classifications-cpc>
  <main-cpc>
   <classification-cpc>
    <cpc-version-indicator>
     <date>20130101</date>
    </cpc-version-indicator>
    <section>B</section>
    <class>64</class>
    <subclass>D</subclass>
    <main-group>10</main-group>
    <subgroup>00</subgroup>
    <symbol-position>F</symbol-position>
    <classification-value>I</classification-value>
    ...
   </classification-cpc>
  </main-cpc>
 </classifications-cpc>
Classification National: note that the one below comes out to 2/2.11 (http://www.google.com/patents/US8925112#classifications)
*Country
*Class

'''THIS IS NOT UNIQUE. What classifications are we searching for?'''
 <classification-national>
  <country>US</country>
  <main-classification>2 211</main-classification>
 </classification-national>
Title of the patent:
 <invention-title id="d2e61">Aircrew ensembles</invention-title>

Number of claims:
 <number-of-claims>12</number-of-claims>
Primary examiner:
*FirstName, LastName, Department

 <examiners>
  <primary-examiner>
   <last-name>Patel</last-name>
   <first-name>Tejash</first-name>
   <department>3765</department>
  </primary-examiner>
  ...
 </examiners>
PCT/Regional Patent Number:
*PCTNumber (just the doc number - if it starts with PCT, set a flag)
*'''not in all v 4.5 files'''
*'''not in v 4.2, 4.3, 4.4'''
*'''not all patents are filed under PCT, so we may need code to search all files for the keyword'''

 <pct-or-regional-filing-data>
  <document-id>
   <country>WO</country>
   <doc-number>PCT/EP2011/067014</doc-number>
   <kind>00</kind>
   <date>20110929</date>
  </document-id>
  ...
 </pct-or-regional-filing-data>
====Citations====

Patent citations (we need all of them):
*CitingPatentNumber (from the patent)
*CitingPatentCountry (from the patent)

 <publication-reference>
  <document-id>
   <country>US</country>
   <doc-number>08925112</doc-number>
   <kind>B2</kind>
   <date>20150106</date>
  </document-id>
 </publication-reference>

*CitedPatentNumber
*CitedPatentCountry
*'''v 4.2 does not have <us-references-cited>'''
 <us-references-cited>
  <us-citation>
   <patcit num="00001">
    <document-id>
     <country>US</country>
     <doc-number>1105569</doc-number>
     <kind>A</kind>
     <name>Lacrotte</name>
     <date>19140700</date>
    </document-id>
   </patcit>
   <category>cited by examiner</category>
   <classification-national>
    <country>US</country>
    <main-classification>2 214</main-classification>
   </classification-national>
  </us-citation>
  ...
 </us-references-cited>
For non-patent references, we are just going to count them:
*NoNonPatRefs

 <us-references-cited>
  ...
  <us-citation>
   <nplcit num="00020">
    <othercit>
     European Search Report dated Jan. 20, 2011 as received in European Patent Application No. GB1016384.8.
    </othercit>
   </nplcit>
   <category>cited by applicant</category>
  </us-citation>
 </us-references-cited>
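Since we only want the count, a grep sketch does the job (assumes one element start tag per line, as in the samples above):

```shell
# Count non-patent references (NoNonPatRefs) in a patent file by counting
# lines containing an <nplcit> start tag.
count_npl_refs() {
  grep -c '<nplcit' "$1"
}
```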
====Inventors====

*'''For v 4.3, 4.4, 4.5'''
*PatentNumber (and country) to build a key
*We need a "standard" name and address object for each inventor

 <us-parties>
  <us-applicants>
   ...
  </us-applicants>
  <inventors>
   <inventor sequence="001" designation="us-only">
    <addressbook>
     <last-name>Oliver</last-name>
     <first-name>Paul</first-name>
     <address>
      <city>Rhyl</city>
      <country>GB</country>
     </address>
    </addressbook>
   </inventor>
   ...
  </inventors>
  ...
 </us-parties>
*'''For v 4.2'''

 <parties>
  <applicants>
   <applicant sequence="001" app-type="applicant-inventor" designation="us-only">
    <addressbook>
     <last-name>Kamath</last-name>
     <first-name>Sandeep</first-name>
     <address>
      <city>Bangalore</city>
      <country>IN</country>
     </address>
    </addressbook>
    <nationality>
     <country>omitted</country>
    </nationality>
    <residence>
     <country>IN</country>
    </residence>
   </applicant>
   ...
  </applicants>
  ...
 </parties>
====Assignees====

*PatentNumber (and country) to build a key
*We need a "standard" name and address object for each assignee

 <assignees>
  <assignee>
   <addressbook>
    <orgname>Survitec Group Limited</orgname>
    <role>03</role>
    <address>
     <city>Merseyside</city>
     <country>GB</country>
    </address>
   </addressbook>
  </assignee>
 </assignees>
====Other things we might want====

*Abstract
*Claims (other than their count)
− | | |
− | ====Things we don't need====
| |
− | | |
− | General:
| |
− | *Series Code: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/filingyr.htm
| |
− | | |
− | Classification related:
| |
− | *Level - This appears to be either core or advanced. Not sure it matters.
| |
− | *SymbolPosition, ClassificationValue - we likely don't need them
| |
− | *Classification status and data source - no idea what these do
| |
====About the scripts====

The scripts to process the patent data are all located under /bulk/Software/Scripts/PatentData/ ("E:\Software\Scripts\PatentData\").

There are currently five .pm files: PatentApplication.pm, Inventor.pm, Claim.pm, Addressbook.pm, and Loader.pm. Each of the first four represents an object type. The last is a helper object that can extract the wanted fields as a Perl object, given a schema file. Future work should be done in this file to support more schema files.

Example usage:
 perl PatentParser.pl -file=ipa150319.xml
This will parse the XML file ipa150319.xml, extract all the patents (in this case PatentApplications), each as a temporary XML file, and then use a Loader object with a specified schema file (in this case "us-patent-application-v44-2014-04-03.dtd") to extract each of the four object types from the patents. If an error occurs while parsing a file, that file is moved to a directory called "failed_files"; a file that fails to parse is most likely not a Utility patent.
====About the Harvard Dataverse====

The patents from 1975-2010, loaded as .sqlite3 and csv files, can be found at:

[https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/15705 Harvard Dataverse]

I have also downloaded all of them onto the database server; they can be found under:
 cd /bulk/patent