Difference between revisions of "Patent Data Processing - SQL Steps"
(27 intermediate revisions by 5 users not shown) | |||
Line 1: | Line 1: | ||
− | + | {{Project | |
− | [[ | + | |Has project output= |
+ | |Has sponsor=McNair Center | ||
+ | |Has title=Patent Data Processing - SQL Steps | ||
+ | |Has owner=Shelby Bice, | ||
+ | |Has project status=Subsume | ||
+ | |Has keywords=Tool | ||
+ | }} | ||
+ | Return to [[Patent Data (Wiki Page)]]. | ||
== Objective == | == Objective == | ||
Line 11: | Line 18: | ||
The schema for the assignees table in '''patentdata''' database is: | The schema for the assignees table in '''patentdata''' database is: | ||
− | + | Column | Type | Modifiers | |
− | -------------+-------------------+----------- | + | -------------+-------------------+----------- |
− | + | patent | integer | | |
− | + | asgtype | integer | | |
− | + | assignee | character varying | | |
− | + | city | character varying | | |
− | + | state | character varying | | |
− | + | country | character varying | | |
− | + | nationality | character varying | | |
− | + | residence | character varying | | |
− | + | asgseq | integer | | |
The schema for the assignees table in patent_2015 is : | The schema for the assignees table in patent_2015 is : | ||
Column | Type | Modifiers | Column | Type | Modifiers | ||
− | + | --------------+---------+----------- | |
lastname | text | | lastname | text | | ||
firstname | text | | firstname | text | | ||
Line 42: | Line 49: | ||
'''Overlapping Columns''' | '''Overlapping Columns''' | ||
− | + | patent_2015 | patentdata | |
− | --------------+-------------- | + | --------------+-------------- |
− | + | orgname | assignee | |
− | + | city | city | |
− | + | country | country | |
− | + | patentnumber | patent | |
− | + | state | state | |
These columns will have entries for most rows in the table, because they exist in both tables. The rest of the columns will be populated based on which table the row is coming from. | These columns will have entries for most rows in the table, because they exist in both tables. The rest of the columns will be populated based on which table the row is coming from. | ||
'''Final Schema''' | '''Final Schema''' | ||
+ | |||
Table "public.assignees" | Table "public.assignees" | ||
− | + | Column | Type | Modifiers | |
− | ---------------+-------------------+----------- | + | ---------------+-------------------+----------- |
− | + | lastname | character varying | | |
− | + | firstname | character varying | | |
− | + | address | character varying | | |
− | + | postcode | character varying | | |
− | + | orgname | character varying | | |
− | + | city | character varying | | |
− | + | country | character varying | | |
− | + | patentnumber | integer | | |
− | + | state | character varying | | |
− | + | patentcountry | character varying | | |
− | + | nationality2 | character varying | | |
− | + | residence | character varying | | |
− | + | asgseq | integer | | |
− | + | asgtype | integer | | |
'''Non-overlapping Columns''' | '''Non-overlapping Columns''' | ||
Line 78: | Line 86: | ||
Therefore, if a row has appropriate values for orgname, state, city ,etc, but 'null' values for lastname, firstname, address and postcode, the row has come from the patentdata table. | Therefore, if a row has appropriate values for orgname, state, city ,etc, but 'null' values for lastname, firstname, address and postcode, the row has come from the patentdata table. | ||
+ | |||
+ | ==== Index ==== | ||
+ | Since the table is relatively large, and is likely to be searched often, an index has been imposed on the table. | ||
+ | |||
+ | allpatent=# CREATE INDEX ON assignees (orgname); | ||
+ | CREATE INDEX | ||
====Sample insert and copy commands ==== | ====Sample insert and copy commands ==== | ||
Line 133: | Line 147: | ||
==Patents == | ==Patents == | ||
− | + | '''Patentdata Schema:''' | |
− | |||
− | |||
− | |||
− | |||
− | + | Column | Type | Modifiers | |
− | + | --------+-------------------+----------- | |
− | |||
− | |||
− | Column | Type | Modifiers | ||
− | |||
patent | integer | | patent | integer | | ||
kind | character varying | | kind | character varying | | ||
Line 155: | Line 161: | ||
appyear | integer | | appyear | integer | | ||
− | + | ||
+ | '''Patent_2015 Schema:''' | ||
+ | |||
Column | Type | Modifiers | Column | Type | Modifiers | ||
-------------------------------+---------+----------- | -------------------------------+---------+----------- | ||
Line 183: | Line 191: | ||
filename | varchar | | filename | varchar | | ||
+ | ''' Overlapping Columns ''' | ||
+ | patent_data patent_2015 | ||
+ | --------------+------------- | ||
+ | patent | patentnumber | ||
+ | kind | kind | ||
+ | claims | numberofclaims | ||
+ | apptype | type | ||
+ | appnum | applicationnumber | ||
+ | gdate | grantdate | ||
+ | appdate | filingdate | ||
+ | |||
− | Combined Schema: | + | '''Combined Schema:''' |
− | + | The final schema of the patents table is : | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
+ | Column | Type | Modifiers | ||
+ | ----------------------+-------------------+----------- | ||
+ | patent | integer | not null | ||
+ | grantdate | date | | ||
+ | prioritydate | date | | ||
+ | prioritycountry | character varying | | ||
+ | prioritypatentnumber | character varying | | ||
+ | cpcsubgroup | character varying | | ||
+ | pctpatentnumber | character varying | | ||
+ | claims | integer | | ||
+ | appnum | integer | | ||
+ | gyear | integer | | ||
+ | appdate | date | | ||
+ | appyear | integer | | ||
+ | nber | integer | | ||
+ | uspc | character varying | | ||
+ | uspc_sub | character varying | | ||
+ | |||
+ | From the total list of columns belonging to both the tables (patentdata and patent_2015), a few columns, most of them related to classification of patents, have been dropped since the data in the tables was not clean. | ||
− | + | Additionally, three columns - nber, uspc, uspc_sub have been added from the historicalpatentdata, a table built from data downloaded from the USPTO Bulk Data Storage. The join was executed on the patent number. | |
− | + | ||
− | + | Note : The addition, deletion of columns as through separate [[Patent Data Cleanup - June 2016 |scripts]], therefore the scripts below will be slightly discrepant. | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | ==== Index and Key Creation ==== | |
+ | Patent numbers are distinct in this table, and are central to the rest of the fields in the table. A primary key can therefore be imposed on the column. Also, since a number of searches are likely to be conducted on this table, an index has been imposed as well. | ||
+ | |||
+ | Code: | ||
+ | ALTER TABLE patents ADD PRIMARY KEY (patentnumber); | ||
+ | -- RESULT : ALTER TABLE | ||
+ | allpatent=# CREATE UNIQUE INDEX patent_idx ON patents (patentnumber); | ||
+ | |||
+ | |||
+ | |||
+ | ====Sample Insert and Copy Statements==== | ||
patentdata: | patentdata: | ||
INSERT INTO patents_merged | INSERT INTO patents_merged | ||
( | ( | ||
− | + | SELECT | |
− | + | patent, | |
− | + | kind, | |
− | + | gdate, | |
− | + | 'NULL', | |
− | + | 'NULL', | |
− | + | NULL, | |
− | + | NULL, | |
− | + | 'NULL', | |
− | + | 'NULL', | |
− | + | 'NULL', | |
− | + | 'NULL', | |
− | + | 'NULL', | |
− | + | 'NULL', | |
− | + | 'NULL', | |
− | + | 'NULL', | |
− | + | 'NULL', | |
− | + | 'NULL', | |
− | + | 'NULL', | |
− | + | -1, | |
− | + | 'NULL', | |
− | + | 'NULL', | |
− | + | 'NULL', | |
− | + | 'NULL', | |
− | + | 'NULL', | |
− | + | claims, | |
− | + | apptype, | |
− | + | appnum, | |
− | + | gyear, | |
− | + | appdate, | |
− | + | appyear | |
− | + | FROM patents | |
− | ); | + | ); |
− | -- RESULT : INSERT 0 3984771 | + | -- RESULT : INSERT 0 3984771 |
patent_2015: | patent_2015: | ||
Line 344: | Line 320: | ||
COPY SCRIPTS: | COPY SCRIPTS: | ||
patentdata: | patentdata: | ||
− | \COPY patents_merged TO '/tmp/merged_patents_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; | + | \COPY patents_merged TO '/tmp/merged_patents_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; |
− | --COPY 3984771 | + | --COPY 3984771 |
patent_2015: | patent_2015: | ||
− | \COPY patents_merged TO '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; | + | \COPY patents_merged TO '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; |
− | --COPY 1646225 | + | --COPY 1646225 |
PATENTS TABLE | PATENTS TABLE | ||
− | \COPY patents FROM '/tmp/merged_patents_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; | + | \COPY patents FROM '/tmp/merged_patents_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; |
− | -- RESULT : COPY 3984771 | + | -- RESULT : COPY 3984771 |
− | \COPY patents FROM '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; | + | \COPY patents FROM '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; |
− | -- RESULT : COPY 1646225 | + | -- RESULT : COPY 1646225 |
− | |||
====TESTING ==== | ====TESTING ==== | ||
− | select count(*) FROM (SELECT DISTINCT patentnumber FROM patents) AS t; | + | select count(*) FROM (SELECT DISTINCT patentnumber FROM patents) AS t; |
− | --RESULT: 5411151 | + | --RESULT: 5411151 |
− | EXPECTED: 5426566 | + | EXPECTED: 5426566 |
We found some copies of a few rows, where both the patent_2015 and patentdata | We found some copies of a few rows, where both the patent_2015 and patentdata | ||
− | SELECT COUNT(*), * | + | SELECT COUNT(*), * |
− | FROM patents | + | FROM patents |
− | GROUP BY | + | GROUP BY |
− | + | patentnumber, | |
kind, | kind, | ||
grantdate, | grantdate, | ||
Line 398: | Line 373: | ||
appdate, | appdate, | ||
appyear | appyear | ||
− | HAVING COUNT(*) > 1; | + | HAVING COUNT(*) > 1; |
− | SELECT patentnumber, count(*) | + | SELECT patentnumber, count(*) |
− | FROM patents | + | FROM patents |
− | GROUP BY patentnumber | + | GROUP BY patentnumber |
− | HAVING count(*)>1; | + | HAVING count(*)>1; |
− | --7640598 | + | --7640598 |
− | SELECT * | + | SELECT * |
− | FROM patents op | + | FROM patents op |
− | WHERE op.patentnumber IN | + | WHERE op.patentnumber IN |
( | ( | ||
SELECT ip.patentnumber | SELECT ip.patentnumber | ||
Line 416: | Line 391: | ||
HAVING COUNT(*)>1 | HAVING COUNT(*)>1 | ||
) | ) | ||
− | ORDER BY op.patentnumber; | + | ORDER BY op.patentnumber; |
− | ( | + | ( |
− | SELECT * | + | SELECT * |
− | INTO patentsCleaned | + | INTO patentsCleaned |
− | FROM patents op | + | FROM patents op |
− | WHERE op.patentnumber IN | + | WHERE op.patentnumber IN |
( | ( | ||
SELECT ip.patentnumber | SELECT ip.patentnumber | ||
Line 429: | Line 404: | ||
HAVING COUNT(*)=1 | HAVING COUNT(*)=1 | ||
) | ) | ||
− | ORDER BY op.patentnumber | + | ORDER BY op.patentnumber |
− | ) | + | ) |
− | --SELECT 5191306 | + | --SELECT 5191306 |
− | INSERT INTO patentsCleaned( | + | INSERT INTO patentsCleaned( |
− | SELECT * | + | SELECT * |
− | FROM patents op | + | FROM patents op |
− | WHERE op.patentnumber IN | + | WHERE op.patentnumber IN |
( | ( | ||
SELECT ip.patentnumber | SELECT ip.patentnumber | ||
Line 444: | Line 419: | ||
HAVING COUNT(*)>1 | HAVING COUNT(*)>1 | ||
) | ) | ||
− | AND op.applicationnumber NOT LIKE 'NULL' | + | AND op.applicationnumber NOT LIKE 'NULL' |
− | ORDER BY op.patentnumber | + | ORDER BY op.patentnumber |
− | ); | + | ); |
− | --219845 | + | --219845 |
− | TESTING: | + | ====TESTING:==== |
− | allpatent=# select count(*) from patentsCleaned; | + | allpatent=# select count(*) from patentsCleaned; |
− | + | count | |
− | --------- | + | --------- |
− | + | 5411151 | |
− | (1 row) | + | (1 row) |
− | allpatent=# select count(*), patentnumber FROM patentsCleaned group by patentnumber having count(*) > 1; | + | allpatent=# select count(*), patentnumber FROM patentsCleaned group by patentnumber having count(*) > 1; |
− | + | count | patentnumber | |
− | -------+-------------- | + | -------+-------------- |
− | (0 rows) | + | (0 rows) |
+ | == Citations== | ||
− | + | In the citations table, we needed to define another function that would convert a textual patent number into a number (big int, since the patents number were exceeding the range of regular integers.) | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
To Extract Patents with Numbers Only and to Ignore Other RegExes | To Extract Patents with Numbers Only and to Ignore Other RegExes | ||
− | CREATE OR REPLACE FUNCTION cleanpatno (text) RETURNS bigint AS $$ | + | CREATE OR REPLACE FUNCTION cleanpatno (text) RETURNS bigint AS $$ |
if ($_[0]) { | if ($_[0]) { | ||
my $var=$_[0]; | my $var=$_[0]; | ||
Line 484: | Line 451: | ||
$$ LANGUAGE plperl; | $$ LANGUAGE plperl; | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | patent_2015: | + | '''patentdata schema:''' |
− | + | ||
− | ---------------------+---------+----------- | + | Column | Type | Modifiers |
− | + | ------------+-------------------+----------- | |
− | + | patent | integer | | |
− | + | cit_date | date | | |
− | + | cit_name | character varying | | |
+ | cit_kind | character varying | | ||
+ | cit_country | character varying | | ||
+ | citation | integer | | ||
+ | category | character varying | | ||
+ | citseq | integer | | ||
+ | |||
+ | SELECT patent as citingpatentnumber, citation AS citedpatentnumber | ||
+ | INTO citations_merged | ||
+ | FROM citations; | ||
+ | --SELECT 38452957 | ||
+ | |||
+ | '''patent_2015 schema:''' | ||
+ | Column | Type | Modifiers | ||
+ | ---------------------+---------+----------- | ||
+ | citingpatentnumber | integer | | ||
+ | citingpatentcountry | text | | ||
+ | citedpatentnumber | text | | ||
+ | citedpatentcountry | text | | ||
+ | |||
+ | SELECT CAST(citingpatentnumber AS bigint), CAST(cleanpatno( citedpatentnumber) AS bigint) as citedpatentnumber | ||
+ | INTO citations_merged | ||
+ | FROM citations; | ||
+ | -- RESULT : SELECT 59227881 | ||
− | + | '''Overlapping Columns''' | |
− | |||
− | |||
− | |||
− | + | patent_2015 | patentdata | | |
+ | ---------------------+---------------+ | ||
+ | citingpatentnumber | patent | | ||
+ | citedpatentnumber | citation | | ||
+ | |||
+ | ''' Combined Schema:''' | ||
− | + | Column | Type | Modifiers | |
− | + | --------------------+--------+----------- | |
− | citingpatentnumber bigint | + | citingpatentnumber | bigint | |
− | citedpatentnumber bigint | + | citedpatentnumber | bigint | |
− | |||
Line 526: | Line 502: | ||
patentdata: | patentdata: | ||
− | \COPY citations_merged TO '/tmp/merged_citations_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; | + | \COPY citations_merged TO '/tmp/merged_citations_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; |
− | --COPY 38452957 | + | --COPY 38452957 |
patent_2015: | patent_2015: | ||
− | \COPY citations_merged TO '/tmp/merged_citations_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; | + | \COPY citations_merged TO '/tmp/merged_citations_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; |
− | -- RESULT : COPY 59227881 | + | -- RESULT : COPY 59227881 |
allpatent: | allpatent: | ||
− | \COPY citations FROM '/tmp/merged_citations_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; | + | \COPY citations FROM '/tmp/merged_citations_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; |
− | --RESULT : COPY 59227881 | + | --RESULT : COPY 59227881 |
− | \COPY citations FROM '/tmp/merged_citations_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; | + | \COPY citations FROM '/tmp/merged_citations_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; |
− | --RESULT: COPY 38452957 | + | --RESULT: COPY 38452957 |
Line 546: | Line 522: | ||
== USPTO Consolidated Patent Data == | == USPTO Consolidated Patent Data == | ||
− | + | The USPTO has a repository of patent data on their Bulk Data Storage system. We have this data downloaded and loaded into a table on the patent database. Here are the steps followed: | |
+ | * Download file from the BDS system - we have access to CSV files. | ||
+ | * Create table with required specs | ||
+ | * use the \COPY command to copy the data from the file into the table. | ||
− | + | Script follows. | |
− | + | '''Script:''' | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | /* creating patent data tables from : https://bulkdata.uspto.gov/data2/patent/maintenancefee/*/ | |
− | |||
− | + | CREATE TABLE PatentMaintenanceFee( | |
+ | patentnumber varchar, | ||
+ | applicationnumber int, | ||
+ | smallentity varchar, | ||
+ | filingdate date, | ||
+ | grantissuedate date, | ||
+ | maintenancefeedate date, | ||
+ | maintenancefeecode varchar | ||
+ | ); | ||
− | + | \COPY PatentMaintenanceFee FROM '/bulk/USPTO_Consolidated/MaintFeeEvents_20160613.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; | |
− | + | -- RESULT : COPY 14042059 | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | /* creating tables for historical patent data - USPTO */ | |
− | --COPY 11191813 | + | CREATE TABLE HistoricalPatentData( |
+ | applicationid int, | ||
+ | pubno varchar, | ||
+ | patentnumber varchar, | ||
+ | NBER int, | ||
+ | USPC varchar, | ||
+ | USPC_sub varchar, | ||
+ | applicationdate date, | ||
+ | prioritydate date, | ||
+ | pubdate date, | ||
+ | displaydate date, | ||
+ | disptype varchar, | ||
+ | exp_dt date, | ||
+ | exp_dt_max date, | ||
+ | pta int | ||
+ | ); | ||
+ | |||
+ | \COPY historicalpatentdata FROM '/bulk/USPTO_Consolidated/HistoricalFiles/historical_masterfile.csv' DELIMITER AS ',' HEADER NULL AS '' CSV; | ||
+ | |||
+ | --COPY 11191813 | ||
+ | |||
+ | [[Category:Internal]] | ||
+ | [[Internal Classification::Legacy| ]] | ||
+ | [[Category:Patent]] |
Latest revision as of 12:34, 21 September 2020
Patent Data Processing - SQL Steps | |
---|---|
Project Information | |
Has title | Patent Data Processing - SQL Steps |
Has owner | Shelby Bice |
Has start date | |
Has deadline date | |
Has keywords | Tool |
Has project status | Subsume |
Has sponsor | McNair Center |
Copyright © 2019 edegan.com. All Rights Reserved. |
Return to Patent Data (Wiki Page).
Contents
Objective
The McNair Center owns two sets of patent data - one set that is inherited from Harvard, the Harvard dataverse, which is stored in the database patentdata and another that is generated by crawlers pulling data from the USPTO website, which is stored in the database patent_2015.
We are now merging and cleaning the two data sets, and storing them in a schema that is amalgamation of the two underlying schema for the citations tables, assignees tables, and patents tables. The destination schema is allpatent.
Assignees Data
The schema for the assignees table in patentdata database is:
Column | Type | Modifiers -------------+-------------------+----------- patent | integer | asgtype | integer | assignee | character varying | city | character varying | state | character varying | country | character varying | nationality | character varying | residence | character varying | asgseq | integer |
The schema for the assignees table in patent_2015 is :
Column | Type | Modifiers --------------+---------+----------- lastname | text | firstname | text | orgname | text | city | text | country | text | patentcountry | text | patentnumber | integer | state | text | address | text | postcode | text |
To merge both schemas, we have some columns that overlap, and some columns that don't.
Overlapping Columns
patent_2015 | patentdata --------------+-------------- orgname | assignee city | city country | country patentnumber | patent state | state
These columns will have entries for most rows in the table, because they exist in both tables. The rest of the columns will be populated based on which table the row is coming from.
Final Schema
Table "public.assignees" Column | Type | Modifiers ---------------+-------------------+----------- lastname | character varying | firstname | character varying | address | character varying | postcode | character varying | orgname | character varying | city | character varying | country | character varying | patentnumber | integer | state | character varying | patentcountry | character varying | nationality2 | character varying | residence | character varying | asgseq | integer | asgtype | integer |
Non-overlapping Columns These are the columns that belong to either one of the assignees tables, and not to both. For these cases, to help users understand where the row is coming from, the following insert rules have been followed:
- For columns of type int, insert -1
- For columns of type string (character varying), the string 'null' has been inserted.
Therefore, if a row has appropriate values for orgname, state, city ,etc, but 'null' values for lastname, firstname, address and postcode, the row has come from the patentdata table.
Index
Since the table is relatively large, and is likely to be searched often, an index has been imposed on the table.
allpatent=# CREATE INDEX ON assignees (orgname); CREATE INDEX
Sample insert and copy commands
INSERT INTO assignees_merge ( SELECT 'null', 'null', 'null', 'null', a.assignee, a.city, a.country, a.patent, a.state, 'null', a.nationality, a.residence, a.asgseq, a.asgtype FROM assignees a );
INSERT INTO assignees_merge ( SELECT assignees.lastname, assignees.firstname, assignees.address, assignees.postcode, assignees.orgname, assignees.city, assignees.country, assignees.patentnumber, assignees.state, assignees.patentcountry, 'null', 'null', -1, -1 FROM assignees );
\COPY assignees_merge TO '/tmp/assignees_merge_export.txt' DELIMITER AS E'\t' HEADER NULL AS CSV;
\COPY assignees FROM '/tmp/assignees_merge_export.txt' DELIMITER AS E'\t' HEADER NULL AS CSV;
--1607724
\COPY assignees_merge TO '/tmp/assignees_merge_export1.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; \COPY assignees FROM '/tmp/assignees_merge_export1.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; --3818842
Note : The assignees table was updated on 6/23 to remove the 'null' string and the '-1' values.
Patents
Patentdata Schema:
Column | Type | Modifiers --------+-------------------+----------- patent | integer | kind | character varying | claims | integer | apptype | integer | appnum | integer | gdate | date | gyear | integer | appdate | date | appyear | integer |
Patent_2015 Schema:
Column | Type | Modifiers -------------------------------+---------+----------- patentnumber | int | not null kind | varchar | grantdate | date | type | varchar | applicationnumber | varchar | filingdate | date | prioritydate | date | prioritycountry | varchar | prioritypatentnumber | varchar | ussubclass | varchar | maingroup | varchar | subgroup | varchar | cpcsubclass | varchar | cpcmaingroup | varchar | cpcsubgroup | varchar | classificationnationalcountry | varchar | classificationnationalclass | varchar | title | varchar | numberofclaims | int | primaryexaminerfirstname | varchar | primaryexaminerlastname | varchar | primaryexaminerdepartment | varchar | pctpatentnumber | varchar | filename | varchar |
Overlapping Columns
patent_data patent_2015 --------------+------------- patent | patentnumber kind | kind claims | numberofclaims apptype | type appnum | applicationnumber gdate | grantdate appdate | filingdate
Combined Schema:
The final schema of the patents table is :
Column | Type | Modifiers ----------------------+-------------------+----------- patent | integer | not null grantdate | date | prioritydate | date | prioritycountry | character varying | prioritypatentnumber | character varying | cpcsubgroup | character varying | pctpatentnumber | character varying | claims | integer | appnum | integer | gyear | integer | appdate | date | appyear | integer | nber | integer | uspc | character varying | uspc_sub | character varying |
From the total list of columns belonging to both the tables (patentdata and patent_2015), a few columns, most of them related to classification of patents, have been dropped since the data in the tables was not clean.
Additionally, three columns - nber, uspc, uspc_sub have been added from the historicalpatentdata, a table built from data downloaded from the USPTO Bulk Data Storage. The join was executed on the patent number.
Note : The addition, deletion of columns as through separate scripts, therefore the scripts below will be slightly discrepant.
Index and Key Creation
Patent numbers are distinct in this table, and are central to the rest of the fields in the table. A primary key can therefore be imposed on the column. Also, since a number of searches are likely to be conducted on this table, an index has been imposed as well.
Code:
ALTER TABLE patents ADD PRIMARY KEY (patentnumber); -- RESULT : ALTER TABLE allpatent=# CREATE UNIQUE INDEX patent_idx ON patents (patentnumber);
Sample Insert and Copy Statements
patentdata: INSERT INTO patents_merged ( SELECT patent,
kind, gdate, 'NULL', 'NULL', NULL, NULL, 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', -1, 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', claims, apptype, appnum, gyear, appdate,
appyear FROM patents ); -- RESULT : INSERT 0 3984771 patent_2015: INSERT INTO patents_merged ( SELECT patentnumber,
kind, grantdate, type, applicationnumber, filingdate, prioritydate, prioritycountry, prioritypatentnumber, ussubclass, maingroup, subgroup, cpcsubclass, cpcmaingroup, cpcsubgroup, classificationnationalcountry, classificationnationalclass, title, numberofclaims, primaryexaminerfirstname, primaryexaminerlastname, primaryexaminerdepartment, pctpatentnumber, filename, -1, -1, -1, -1, NULL,
-1 FROM patents ); -- RESULT : INSERT 0 1646225
COPY SCRIPTS: patentdata:
\COPY patents_merged TO '/tmp/merged_patents_export1.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; --COPY 3984771
patent_2015:
\COPY patents_merged TO '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; --COPY 1646225
PATENTS TABLE
\COPY patents FROM '/tmp/merged_patents_export1.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; -- RESULT : COPY 3984771 \COPY patents FROM '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; -- RESULT : COPY 1646225
TESTING
select count(*) FROM (SELECT DISTINCT patentnumber FROM patents) AS t; --RESULT: 5411151 EXPECTED: 5426566
We found some copies of a few rows, where both the patent_2015 and patentdata
SELECT COUNT(*), * FROM patents GROUP BY patentnumber,
kind, grantdate, type, applicationnumber, filingdate, prioritydate, prioritycountry, prioritypatentnumber, ussubclass, maingroup, subgroup, cpcsubclass, cpcmaingroup, cpcsubgroup, classificationnationalcountry, classificationnationalclass, title, numberofclaims, primaryexaminerfirstname, primaryexaminerlastname, primaryexaminerdepartment, pctpatentnumber, filename, claims, apptype, appnum, gyear, appdate, appyear
HAVING COUNT(*) > 1;
SELECT patentnumber, count(*) FROM patents GROUP BY patentnumber HAVING count(*)>1; --7640598
SELECT * FROM patents op WHERE op.patentnumber IN
( SELECT ip.patentnumber FROM patents ip GROUP BY ip.patentnumber HAVING COUNT(*)>1 )
ORDER BY op.patentnumber;
( SELECT * INTO patentsCleaned FROM patents op WHERE op.patentnumber IN
( SELECT ip.patentnumber FROM patents ip GROUP BY ip.patentnumber HAVING COUNT(*)=1 )
ORDER BY op.patentnumber ) --SELECT 5191306
INSERT INTO patentsCleaned( SELECT * FROM patents op WHERE op.patentnumber IN
( SELECT ip.patentnumber FROM patents ip GROUP BY ip.patentnumber HAVING COUNT(*)>1 )
AND op.applicationnumber NOT LIKE 'NULL' ORDER BY op.patentnumber );
--219845
TESTING:
allpatent=# select count(*) from patentsCleaned; count --------- 5411151 (1 row)
allpatent=# select count(*), patentnumber FROM patentsCleaned group by patentnumber having count(*) > 1; count | patentnumber -------+-------------- (0 rows)
Citations
In the citations table, we needed to define another function that would convert a textual patent number into a number (big int, since the patents number were exceeding the range of regular integers.)
To Extract Patents with Numbers Only and to Ignore Other RegExes
CREATE OR REPLACE FUNCTION cleanpatno (text) RETURNS bigint AS $$
if ($_[0]) { my $var=$_[0]; if ($var=~/^\d*$/) {return $var;} return undef; } return undef; $$ LANGUAGE plperl;
patentdata schema:
Column | Type | Modifiers ------------+-------------------+----------- patent | integer | cit_date | date | cit_name | character varying | cit_kind | character varying | cit_country | character varying | citation | integer | category | character varying | citseq | integer |
SELECT patent as citingpatentnumber, citation AS citedpatentnumber INTO citations_merged FROM citations; --SELECT 38452957
patent_2015 schema:
Column | Type | Modifiers ---------------------+---------+----------- citingpatentnumber | integer | citingpatentcountry | text | citedpatentnumber | text | citedpatentcountry | text |
SELECT CAST(citingpatentnumber AS bigint), CAST(cleanpatno( citedpatentnumber) AS bigint) as citedpatentnumber INTO citations_merged FROM citations; -- RESULT : SELECT 59227881
Overlapping Columns
patent_2015 | patentdata | ---------------------+---------------+ citingpatentnumber | patent | citedpatentnumber | citation |
Combined Schema:
Column | Type | Modifiers
+--------+-----------
citingpatentnumber | bigint | citedpatentnumber | bigint |
Copy Statements:
patentdata:
\COPY citations_merged TO '/tmp/merged_citations_export1.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; --COPY 38452957
patent_2015:
\COPY citations_merged TO '/tmp/merged_citations_export.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; -- RESULT : COPY 59227881
allpatent:
\COPY citations FROM '/tmp/merged_citations_export.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; --RESULT : COPY 59227881
\COPY citations FROM '/tmp/merged_citations_export1.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; --RESULT: COPY 38452957
CLONING:
CREATE DATABASE allpatentsProcessed WITH TEMPLATE allpatent OWNER researcher;
USPTO Consolidated Patent Data
The USPTO has a repository of patent data on their Bulk Data Storage system. We have this data downloaded and loaded into a table on the patent database. Here are the steps followed:
- Download file from the BDS system - we have access to CSV files.
- Create table with required specs
- use the \COPY command to copy the data from the file into the table.
Script follows.
Script:
/* creating patent data tables from : https://bulkdata.uspto.gov/data2/patent/maintenancefee/*/
CREATE TABLE PatentMaintenanceFee( patentnumber varchar, applicationnumber int, smallentity varchar, filingdate date, grantissuedate date, maintenancefeedate date, maintenancefeecode varchar );
\COPY PatentMaintenanceFee FROM '/bulk/USPTO_Consolidated/MaintFeeEvents_20160613.txt' DELIMITER AS E'\t' HEADER NULL AS CSV; -- RESULT : COPY 14042059
/* creating tables for historical patent data - USPTO */
CREATE TABLE HistoricalPatentData( applicationid int, pubno varchar, patentnumber varchar, NBER int, USPC varchar, USPC_sub varchar, applicationdate date, prioritydate date, pubdate date, displaydate date, disptype varchar, exp_dt date, exp_dt_max date, pta int );
\COPY historicalpatentdata FROM '/bulk/USPTO_Consolidated/HistoricalFiles/historical_masterfile.csv' DELIMITER AS ',' HEADER NULL AS CSV;
--COPY 11191813