Changes

Patent Data (view source)

Revision as of 13:42, 8 July 2016

13,944 bytes removed, 13:42, 8 July 2016

MarcelaInteriano moved page Patent Data (Wiki Page) to Patent Data Wiki Page

~~The Patent Data page is for instructions on how to get the USPTO patent data, how to use the database, and for the documentation of our database.~~

~~== ER diagram ==~~
~~[[Image:Patent_Data.png|1024x768px]]~~

~~== Downloading the files ==~~
~~The files (in xml format) for granted patent data can be obtained at~~
#REDIRECT [~~https://www.google.com/googlebooks/uspto-patents-grants-text.html granted patent]~~

~~The files for patent application data can be obtained at~~
[~~https://www.google.com/googlebooks/uspto-patents-applications-text.html patent applications]~~

~~The files for maintenance fees data can be obtained at [https://www.google.com/googlebooks/uspto-patents-maintenance-fees.html maintenance]~~

~~Scripts are available to perform a bulk download of all the above files~~

~~These scripts can also be found under /bulk/Software/download\ scripts ("E:\Software\download scripts") on McNair RDP:~~

~~[http://www.edegan.com/wiki/index.php/Image:Applications_download_2001-2004.sh Script to download patent application data from 2001-2004]~~
~~[http://www.edegan.com/wiki/index.php/Image:Applications_download_2005-2015.sh Script to download patent application data from 2005-2015]~~
~~[http://www.edegan.com/wiki/index.php/Image:Grant_download_1976-2000.sh Script to download patent grant data from 1976-2000]~~
~~[http://www.edegan.com/wiki/index.php/Image:Grant_download_2001-2004.sh Script to download patent grant data from 2001-2004]~~
~~[http://www.edegan.com/wiki/index.php/Image:Grant_download_2005-2015.sh Script to download patent grant data from 2005-2015]~~

~~To use the scripts, save the scripts as shell scripts, then either~~
~~$ sh Applications_download_2001-2004.sh~~
~~or first change the script to an executable and execute it~~
~~$ chmod a+x Applications_download_2001-2004.sh~~
~~$ ./Applications_download_2001-2004.sh~~

~~Note that several hundred .zip files of about 100 MB each will be downloaded, so the process might take a while. When all the files are downloaded, unzip all of them using~~
~~$ unzip '*.zip'~~

~~==XML Schema Notes==~~

~~Tags we are using:~~
*CPC Classification: https://en.wikipedia.org/wiki/Cooperative_Patent_Classification

~~Tags we aren't using:~~
*Kind codes: http://www.uspto.gov/learning-and-resources/support-centers/electronic-business-center/kind-codes-included-uspto-patent
*Series codes: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/filingyr.htm

~~== Parsing and Processing the XML files ==~~

~~The ParserSpliter.pl script first splits a large Patent Data XML file into smaller XML files, one per patent. It then parses and processes each~~ Patent Data ~~XML file.~~

~~Some of the files are malformed and will be moved to a ./failed_files directory. If you add a character anywhere in these files, the script can somehow process them.~~

~~In order to use this script, you will need to have XML::Simple, Try::Tiny, and Switch installed.~~

~~Open up CPAN shell:~~
~~$ perl -e shell -MCPAN~~
~~Install:~~
~~cpan[0]> install XML::Simple~~
~~cpan[1]> install Try::Tiny~~
~~cpan[2]> install Switch~~

~~Once the packages have been installed, use the script like the following example:~~
~~perl PatentParser.pl -file=ipa150319_small.xml~~

~~==Other Resources==~~

~~The Harvard Dataverse page: [http://www.edegan.com/wiki/index.php/Harvard_Dataverse]~~
~~[http://www.uspto.gov/learning-and-resources/xml-resources Documentation for the XML files]~~
~~[http://www.uspto.gov/learning-and-resources/xml-resources/xml-resources-retrospective See Also]~~
~~[https://www.w3.org/2000/04/schema_hack/ tool to convert dtd to xsd]~~
~~[https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/15705 Harvard Dataverse]~~

~~==New Notes==~~

~~The source files have transitioned from here:~~
*https://www.google.com/googlebooks/uspto-patents-grants-text.html (No longer maintained)
~~To:~~
*https://bulkdata.uspto.gov/ (includes 2016 data)

~~The historic data is the same on both sides.~~

~~Each file contains, in order, sorted by document ID:~~
~~#Design patents (we will discard)~~
~~#Plant patents (we will discard)~~
~~#Reissues (we probably want them)~~
~~#Utility patents (we want them)~~

~~The classifications in the XML file are:~~
*IPC - these are good and we just need the main classification
*CPC - as above
*USPC - just a number that is not split, so it is ambiguous: is 22431 224/31 or 22/431?

~~==Scripts==~~
~~All the scripts related to the patent Data are at:~~
~~\\father\bulk\Software\Scripts\Patent~~

~~USPTO_Parser.pl parses the USPTO website and downloads the concatenated XMLs to:~~
~~\\father\bulk\PatentData~~
~~It should be run as follows~~
~~USPTO_Parser.pl year1 year2~~
~~Gets the data from year1 to year2~~

~~Splitter.pl splits those concatenated XMLs into individual XMLs in:~~
~~\\father\bulk\PatentData\Processed~~
~~Note: The ByYear (2010-2016) folders are for convenience (the XMLs inside them are post-processed to deal with genome sequences)~~

~~xmlparser_4.5_4.4_4.3.pl is the script that processes the XMLs given the path where the XMLs are stored. This script is located at~~
~~\\father\bulk\PatentData\Processed~~
~~It should be run as~~
~~xmlparser_4.5_4.4_4.3.pl '\\father\bulk\PatentData\Processed\2010'~~
~~This will process all the XMLs present in the 2010 directory and store them in the database.~~
The database connection string is hard coded for now inside the script. The database name is patentDB (located in the postgres installation of the RDP server). We then pg_dump them and pg_restore on the dbase server.
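The splitting step described for Splitter.pl can be sketched roughly as follows. This is an illustrative Python sketch, not the wiki's Perl script: it assumes that each patent document in a concatenated bulk file begins with its own `<?xml` declaration, and the function name `split_concatenated_xml` is hypothetical.

```python
from typing import Iterable, Iterator, List

# Assumption: every document in the concatenated file starts with this marker.
XML_DECLARATION = "<?xml"


def split_concatenated_xml(lines: Iterable[str]) -> Iterator[str]:
    """Yield each XML document found in a concatenated stream of lines."""
    current: List[str] = []
    for line in lines:
        # A new declaration closes the document collected so far.
        if line.lstrip().startswith(XML_DECLARATION) and current:
            document = "".join(current)
            if document.strip():
                yield document
            current = []
        current.append(line)
    # Emit the final document, skipping whitespace-only leftovers.
    document = "".join(current)
    if document.strip():
        yield document


if __name__ == "__main__":
    # Tiny stand-in for a concatenated bulk file (made-up content).
    sample = [
        '<?xml version="1.0" encoding="UTF-8"?>\n',
        "<us-patent-grant>first</us-patent-grant>\n",
        '<?xml version="1.0" encoding="UTF-8"?>\n',
        "<us-patent-grant>second</us-patent-grant>\n",
    ]
    for index, document in enumerate(split_concatenated_xml(sample), start=1):
        print(index, len(document))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so a multi-hundred-megabyte file does not need to be read into memory at once.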
~~==Fields of Interest==~~

~~We only care about Utility patents (and maybe Reissue patents too)~~

~~===Utility patent grants fields===~~

~~====Patent====~~

*patent number
*kind: http://www.uspto.gov/patents-application-process/patent-search/authority-files/uspto-kind-codes
*grantdate
~~For version 4.5:~~
~~<publication-reference>~~
~~<document-id>~~
~~<country>US</country>~~
~~<doc-number>08925112</doc-number>~~
~~<kind>B2</kind>~~
~~<date>20150106</date>~~
~~</document-id>~~
~~</publication-reference>~~

*type
*applicationnumber
*filingdate
~~<application-reference appl-type="utility">~~
~~<document-id>~~
~~<country>US</country>~~
~~<doc-number>13824291</doc-number>~~
~~<date>20110929</date>~~
~~</document-id>~~
~~</application-reference>~~

~~For priority, if there is more than 1, we want sequence 01~~
*prioritydate
*prioritycountry (should use ISO country codes - may need a lookup table)
*prioritypatentnumber
*'''find 4.3 file with priority claim'''
~~<priority-claims>~~
~~<priority-claim sequence="01" kind="national">~~
~~<country>GB</country>~~
~~<doc-number>1016384.8</doc-number>~~
~~<date>20100930</date>~~
~~</priority-claim>~~
~~</priority-claims>~~

~~Classification IPC - we only need the first one: http://www.wipo.int/export/sites/www/classifications/ipc/en/guide/guide_ipc.pdf~~
*Section, Class, SubClass - Together these concord to US subclass: http://www.uspto.gov/web/patents/classification/international/ipc/ipc8/ipc_concordance/ipcsel.htm#a
*MainGroup, SubGroup
~~<classifications-ipcr>~~
~~<classification-ipcr>~~
~~<ipc-version-indicator>~~
~~<date>20060101</date>~~
~~</ipc-version-indicator>~~
~~<classification-level>A</classification-level>~~
~~<section>B</section>~~
~~<class>64</class>~~
~~<subclass>G</subclass>~~
~~<main-group>6</main-group>~~
~~<subgroup>00</subgroup>~~
~~<symbol-position>F</symbol-position>~~
~~<classification-value>I</classification-value>~~
~~...~~
~~</classification-ipcr>~~
~~...~~
~~</classifications-ipcr>~~

~~Classification CPC - we only need the main one~~
CPC is a classification scheme set up by the USPTO and the European Patent Office (EPO). The first classification codes rolled out on November 9, 2012.[http://www.cooperativepatentclassification.org/cpcSchemeAndDefinitions.html] Full implementation of the CPC classification system occurred in January 2015, at the same time as version 4.5 of the USPTO patent bulk data.[http://www.uspto.gov/sites/default/files/about/advisory/ppac/120927-09a-international_cpc.pdf Wiki Page]
*Section, Class, Subclass
*Main Group, Subgroup
*'''v 4.2, 4.3, 4.4 do not have this'''
~~<classifications-cpc>~~
~~<main-cpc>~~
~~<classification-cpc>~~
~~<cpc-version-indicator>~~
~~<date>20130101</date>~~
~~</cpc-version-indicator>~~
~~<section>B</section>~~
~~<class>64</class>~~
~~<subclass>D</subclass>~~
~~<main-group>10</main-group>~~
~~<subgroup>00</subgroup>~~
~~<symbol-position>F</symbol-position>~~
~~<classification-value>I</classification-value>~~
~~...~~
~~</classification-cpc>~~
~~</main-cpc>~~
~~</classifications-cpc>~~

~~Classification National: Note that the one below comes out to 2/2.11 (http://www.google.com/patents/US8925112#classifications)~~
*Country
*Class
~~'''THIS IS NOT UNIQUE. What classifications are we searching for?'''~~
~~<classification-national>~~
~~<country>US</country>~~
~~<main-classification>2 211</main-classification>~~
~~</classification-national>~~

~~Title of the patent:~~
~~<invention-title id="d2e61">Aircrew ensembles</invention-title>~~

~~Number of Claims:~~
~~<number-of-claims>12</number-of-claims>~~

~~Primary examiner:~~
*FirstName, LastName, Department
~~<examiners>~~
~~<primary-examiner>~~
~~<last-name>Patel</last-name>~~
~~<first-name>Tejash</first-name>~~
~~<department>3765</department>~~
~~</primary-examiner>~~
~~...~~
~~</examiners>~~

~~PCT/Regional Patent Number:~~
*PCTNumber (just the doc number - if it starts with PCT set a flag)
*'''not in all v 4.5'''
*'''not in v 4.2, 4.3, 4.4'''
*'''maybe not all patents are filed under PCT, need to use code to search all files for key word'''
~~<pct-or-regional-filing-data>~~
~~<document-id>~~
~~<country>WO</country>~~
~~<doc-number>PCT/EP2011/067014</doc-number>~~
~~<kind>00</kind>~~
~~<date>20110929</date>~~
~~</document-id>~~
~~...~~
~~</pct-or-regional-filing-data>~~

~~====Citations====~~

~~Patent Citations (we need all of them):~~
*CitingPatentNumber (from the patent)
*CitingPatentCountry (from the patent)
~~<publication-reference>~~
~~<document-id>~~
~~<country>US</country>~~
~~<doc-number>08925112</doc-number>~~
~~<kind>B2</kind>~~
~~<date>20150106</date>~~
~~</document-id>~~
~~</publication-reference>~~

*CitedPatentNumber
*CitedPatentCountry
*'''V 4.2 does not have <us-references-cited>'''
~~<us-references-cited>~~
~~<us-citation>~~
~~<patcit num="00001">~~
~~<document-id>~~
~~<country>US</country>~~
~~<doc-number>1105569</doc-number>~~
~~<kind>A</kind>~~
~~<name>Lacrotte</name>~~
~~<date>19140700</date>~~
~~</document-id>~~
~~</patcit>~~
~~<category>cited by examiner</category>~~
~~<classification-national>~~
~~<country>US</country>~~
~~<main-classification>2 214</main-classification>~~
~~</classification-national>~~
~~</us-citation>~~
~~...~~
~~</us-references-cited>~~

~~For non-patent references, we are just going to count them:~~
*NoNonPatRefs
~~<us-references-cited>~~
~~...~~
~~<us-citation>~~
~~<nplcit num="00020">~~
~~<othercit>~~
~~European Search Report dated Jan. 20, 2011 as received in European Patent Application No. GB1016384.8.~~
~~</othercit>~~
~~</nplcit>~~
~~<category>cited by applicant</category>~~
~~</us-citation>~~
~~</us-references-cited>~~

~~====Inventors====~~

*'''For v 4.3, 4.4, 4.5'''
*PatentNumber (and country) to build a key
*We need a "standard" name and address object for each inventor
~~<us-parties>~~
~~<us-applicants>~~
~~...~~
~~</us-applicants>~~
~~<inventors>~~
~~<inventor sequence="001" designation="us-only">~~
~~<addressbook>~~
~~<last-name>Oliver</last-name>~~
~~<first-name>Paul</first-name>~~
~~<address>~~
~~<city>Rhyl</city>~~
~~<country>GB</country>~~
~~</address>~~
~~</addressbook>~~
~~</inventor>~~
~~...~~
~~</inventors>~~
~~...~~
~~</us-parties>~~

*'''For v 4.2'''
~~<parties>~~
~~<applicants>~~
~~<applicant sequence="001" app-type="applicant-inventor" designation="us-only">~~
~~<addressbook>~~
~~<last-name>Kamath</last-name>~~
~~<first-name>Sandeep</first-name>~~
~~<address>~~
~~<city>Bangalore</city>~~
~~<country>IN</country>~~
~~</address>~~
~~</addressbook>~~
~~<nationality>~~
~~<country>omitted</country>~~
~~</nationality>~~
~~<residence>~~
~~<country>IN</country>~~
~~</residence>~~
~~</applicant>~~
~~...~~
~~</applicants>~~
~~...~~
~~</parties>~~

~~====Assignees====~~

*PatentNumber (and country) to build a key
*We need a "standard" name and address object for each assignee
~~<assignees>~~
~~<assignee>~~
~~<addressbook>~~
~~<orgname>Survitec Group Limited</orgname>~~
~~<role>03</role>~~
~~<address>~~
~~<city>Merseyside</city>~~
~~<country>GB</country>~~
~~</address>~~
~~</addressbook>~~
~~</assignee>~~
~~</assignees>~~

~~====Other things we might want====~~

*Abstract
*Claims (other than their count)

~~====Things we don't need====~~

~~General:~~
*Series Code: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/filingyr.htm

~~Classification related:~~
*Level - This appears to be either core or advanced. Not sure it matters.
*SymbolPosition, ClassificationValue - we likely don't need them
*Classification status and data source - no idea what these do

~~====About the scripts====~~

~~The scripts to process the Patent Data are all located under /bulk/Software/Scripts/PatentData/ ("E:\Software\Scripts\PatentData\")~~

~~There are currently 5 .pm files: PatentApplication.pm, Inventor.pm, Claim.pm, Addressbook.pm, and Loader.pm.~~

~~Each of the first 4 represents an Object type. The last one is a helper object that is able to extract the wanted fields as a perl object given a schema file.~~
~~Future work should be done in this file to support more schema files.~~

~~Example Usage:~~
~~perl PatentParser.pl -file=ipa150319.xml~~
~~This parses the XML file named ipa150319.xml, extracts each patent (in this case, each PatentApplication) into a temporary XML file, and then uses a Loader object with a specified schema file (in this case "us-patent-application-v44-2014-04-03.dtd") to extract each of the 4 object types from the patents.~~
~~If an error occurs while parsing a file, that file is moved to a directory called "failed_files". A file that fails parsing is most likely not a Utility patent.~~

~~====About the Harvard Dataverse====~~
~~The patents from 1975-2010 loaded as .sqlite3 and csv files can be found at~~
~~[https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/15705 Harvard Dataverse~~]

~~I have also downloaded all of them onto the database server, where they can be found by~~
~~cd /bulk/patent~~
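As a rough illustration of how the patent-level fields listed under Fields of Interest map onto the v4.5 XML fragments quoted above, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. It is not the wiki's Perl parser. The wrapper element used as the root of `SAMPLE`, the helper names `text_of` and `extract_patent_fields`, and the output key names are assumptions made for this example; the element values are copied from the fragments above.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Fragments from the page, wrapped in an assumed root element so they parse.
SAMPLE = """<us-bibliographic-data-grant>
  <publication-reference>
    <document-id>
      <country>US</country>
      <doc-number>08925112</doc-number>
      <kind>B2</kind>
      <date>20150106</date>
    </document-id>
  </publication-reference>
  <application-reference appl-type="utility">
    <document-id>
      <country>US</country>
      <doc-number>13824291</doc-number>
      <date>20110929</date>
    </document-id>
  </application-reference>
  <priority-claims>
    <priority-claim sequence="01" kind="national">
      <country>GB</country>
      <doc-number>1016384.8</doc-number>
      <date>20100930</date>
    </priority-claim>
  </priority-claims>
  <invention-title id="d2e61">Aircrew ensembles</invention-title>
  <number-of-claims>12</number-of-claims>
</us-bibliographic-data-grant>"""


def text_of(root: ET.Element, path: str) -> Optional[str]:
    """Return the text of the first element matching path, or None."""
    node = root.find(path)
    return node.text if node is not None else None


def extract_patent_fields(xml_text: str) -> Dict[str, Optional[str]]:
    """Pull the patent-level fields of interest out of one document."""
    root = ET.fromstring(xml_text)
    app_ref = root.find(".//application-reference")
    # Only the priority claim with sequence 01 is wanted.
    priority = ".//priority-claim[@sequence='01']"
    return {
        "patentnumber": text_of(root, ".//publication-reference/document-id/doc-number"),
        "kind": text_of(root, ".//publication-reference/document-id/kind"),
        "grantdate": text_of(root, ".//publication-reference/document-id/date"),
        "type": app_ref.get("appl-type") if app_ref is not None else None,
        "applicationnumber": text_of(root, ".//application-reference/document-id/doc-number"),
        "filingdate": text_of(root, ".//application-reference/document-id/date"),
        "prioritycountry": text_of(root, priority + "/country"),
        "prioritypatentnumber": text_of(root, priority + "/doc-number"),
        "prioritydate": text_of(root, priority + "/date"),
        "title": text_of(root, ".//invention-title"),
        "numberofclaims": text_of(root, ".//number-of-claims"),
    }


if __name__ == "__main__":
    for name, value in extract_patent_fields(SAMPLE).items():
        print(name, value)
```

Fields missing from a given schema version come back as `None` rather than raising an error, which suits the notes above that some elements are absent in v 4.2 to 4.4.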

MarcelaInteriano

Bots, Bureaucrats, Administrators (Semantic MediaWiki), Administrators

1,181 edits
