Difference between revisions of "Harvard Dataverse"
imported>Leo |
imported>Leo |
||
Line 116: | Line 116: | ||
The following is the schema for the 10 tables in this database: | The following is the schema for the 10 tables in this database: | ||
+ | |||
+ | I have cleaned up and fixed the types for the the following tables: | ||
CREATE TABLE patents ( | CREATE TABLE patents ( | ||
− | Patent | + | Patent integer, |
− | Kind varchar, #only types NULL, A1, B1, B2 | + | Kind varchar, #only types NULL, A1, B1, B2 |
Claims integer, | Claims integer, | ||
− | AppType | + | AppType integer, #int (http://www.uspto.gov/web/offices/ac/ido/oeip/taf/filingyr.htm) 29 is design |
− | AppNum | + | AppNum integer, |
GDate date, | GDate date, | ||
GYear integer, | GYear integer, | ||
− | AppDate | + | AppDate date, |
− | AppYear | + | AppYear integer |
); | ); | ||
CREATE TABLE classes ( | CREATE TABLE classes ( | ||
− | Patent | + | Patent integer |
Prim integer, | Prim integer, | ||
Class varchar, | Class varchar, | ||
Line 137: | Line 139: | ||
CREATE TABLE assignees ( | CREATE TABLE assignees ( | ||
− | Patent | + | Patent integer, |
− | AsgType | + | AsgType integer, |
Assignee varchar, | Assignee varchar, | ||
City varchar, | City varchar, | ||
Line 149: | Line 151: | ||
CREATE TABLE citations ( | CREATE TABLE citations ( | ||
− | Patent | + | Patent integer, |
− | Cit_Date | + | Cit_Date date, |
Cit_Name varchar, | Cit_Name varchar, | ||
Cit_Kind varchar, | Cit_Kind varchar, | ||
Line 160: | Line 162: | ||
CREATE TABLE inventors ( | CREATE TABLE inventors ( | ||
− | Patent | + | Patent integer, |
Firstname varchar, | Firstname varchar, | ||
Lastname varchar, | Lastname varchar, | ||
Line 167: | Line 169: | ||
State varchar, | State varchar, | ||
Country varchar, | Country varchar, | ||
− | Zipcode | + | Zipcode varchar, #?? |
Nationality varchar, | Nationality varchar, | ||
InvSeq integer | InvSeq integer | ||
); | ); | ||
+ | |||
+ | I have not processed the following tables: | ||
CREATE TABLE patdesc ( | CREATE TABLE patdesc ( |
Revision as of 14:20, 6 April 2016
This page records how to load/use the Harvard Dataverse.
Contents
Getting the data
Where we get the Harvard Dataverse data: Harvard Dataverse
For more information about the patent data: Patent Data
Loading the tables into the database
COPY classes FROM '/tmp/class.csv' DELIMITER ',' NULL CSV; COPY assignees FROM '/tmp/assignee.csv' DELIMITER ',' NULL CSV; COPY citations FROM '/tmp/citation.csv' DELIMITER ',' NULL CSV; COPY inventors FROM '/tmp/inventor.csv' DELIMITER ',' NULL CSV; COPY patents FROM '/tmp/patent.csv' DELIMITER ',' NULL CSV; COPY patdescs FROM '/tmp/patdesc.csv' DELIMITER ',' NULL CSV; COPY lawyers FROM '/tmp/lawyer.csv' DELIMITER ',' NULL CSV; COPY scirefs FROM '/tmp/sciref.csv' DELIMITER ',' NULL CSV; COPY usreldocs FROM '/tmp/usreldoc.csv' DELIMITER ',' NULL CSV; COPY invpats FROM '/tmp/invpat.csv' DELIMITER ',' NULL CSV;
Accessing the dbase
On the database server, the entire Harvard Dataverse for Patent Data is downloaded in directory
/bulk/patent
The files have been processed to be used for the create table scripts I have created to load the Harvard Dataverse database into our own psql database.
These preprocessed files are available under
/bulk/patent/preprocessed
The database I have created for the Patent Data from 1975-2010 is called patent data. After logging in to researcher@128.42.44.181, you can get access to the database by
psql patentdata
You can see a list of the tables inside the patent database by
patentdata=# \dt List of relations Schema | Name | Type | Owner --------+-----------+-------+------------ public | assignees | table | researcher public | citations | table | researcher public | classes | table | researcher public | inventors | table | researcher public | invpat | table | researcher public | lawyers | table | researcher public | patdesc | table | researcher public | patents | table | researcher public | scirefs | table | researcher public | usreldocs | table | researcher (10 rows)
You can select from the database, for example, by
patentdata=# SELECT * FROM CLASSES WHERE patent='03930270'; patent | prim | class | subclass ----------+------+-------+---------- 03930270 | 1 | 360 | 130.24 03930270 | 0 | 360 | 84 03930270 | 0 | 360 | 271.8 03930270 | 0 | G9B | 15.08 (4 rows)
Or, do a join:
patentdata=# SELECT * FROM inventors AS i, classes AS c WHERE i.Patent=c.Patent AND i.Patent='03930270';
patent | firstname | lastname | street | city | state | country | zipcode | nationality | invseq | patent | prim | class | subclass ----------+-----------+----------+--------+-----------+-------+---------+---------+-------------+--------+----------+------+-------+---------- 03930270 | YOSHINORI | YAMAZOE | | OSAKA | | JP | | | 3 | 03930270 | 1 | 360 | 130.24 03930270 | YOSHINORI | YAMAZOE | | OSAKA | | JP | | | 3 | 03930270 | 0 | 360 | 84 03930270 | YOSHINORI | YAMAZOE | | OSAKA | | JP | | | 3 | 03930270 | 0 | 360 | 271.8 03930270 | YOSHINORI | YAMAZOE | | OSAKA | | JP | | | 3 | 03930270 | 0 | G9B | 15.08 03930270 | CHIHIRO | NAKAMURA | | MORIGUCHI | | JP | | | 4 | 03930270 | 1 | 360 | 130.24 03930270 | CHIHIRO | NAKAMURA | | MORIGUCHI | | JP | | | 4 | 03930270 | 0 | 360 | 84 03930270 | CHIHIRO | NAKAMURA | | MORIGUCHI | | JP | | | 4 | 03930270 | 0 | 360 | 271.8 03930270 | CHIHIRO | NAKAMURA | | MORIGUCHI | | JP | | | 4 | 03930270 | 0 | G9B | 15.08 03930270 | SATOSHI | KIKUYA | | KADOMA | | JP | | | 2 | 03930270 | 1 | 360 | 130.24 03930270 | SATOSHI | KIKUYA | | KADOMA | | JP | | | 2 | 03930270 | 0 | 360 | 84 03930270 | SATOSHI | KIKUYA | | KADOMA | | JP | | | 2 | 03930270 | 0 | 360 | 271.8 03930270 | SATOSHI | KIKUYA | | KADOMA | | JP | | | 2 | 03930270 | 0 | G9B | 15.08 03930270 | SHIZUHIKO | TANIGAWA | | MORIGUCHI | | JP | | | 1 | 03930270 | 1 | 360 | 130.24 03930270 | SHIZUHIKO | TANIGAWA | | MORIGUCHI | | JP | | | 1 | 03930270 | 0 | 360 | 84 03930270 | SHIZUHIKO | TANIGAWA | | MORIGUCHI | | JP | | | 1 | 03930270 | 0 | 360 | 271.8 03930270 | SHIZUHIKO | TANIGAWA | | MORIGUCHI | | JP | | | 1 | 03930270 | 0 | G9B | 15.08 (16 rows)
The schema I used for this database is documented at the bottom of this page.
And this is also saved as a .sql script under
/bulk/Software/Database\ Scripts/createtables.sql
A corresponding Drop table scripts is also available at
/bulk/Software/Database\ Scripts/droptables.sql
Converting DataTypes
Method 1: Explicit recast
CAST function
Method 2: implicit recast
SELECT varname::type
Method 3: through a language
CREATE OR REPLACE FUNCTION chartoint (text) RETURNS int AS $$ if ($_[0]) { return $_[0]; } return undef; $$ LANGUAGE plperl;
The scripts for cleaning up the tables are available under /Database\ Scripts/cleaning\ db.sql
Creating the tables
The following is the schema for the 10 tables in this database:
I have cleaned up and fixed the types for the the following tables:
CREATE TABLE patents ( Patent integer, Kind varchar, #only types NULL, A1, B1, B2 Claims integer, AppType integer, #int (http://www.uspto.gov/web/offices/ac/ido/oeip/taf/filingyr.htm) 29 is design AppNum integer, GDate date, GYear integer, AppDate date, AppYear integer );
CREATE TABLE classes ( Patent integer Prim integer, Class varchar, Subclass varchar );
CREATE TABLE assignees ( Patent integer, AsgType integer, Assignee varchar, City varchar, State varchar, Country varchar, Nationality varchar, Residence varchar, AsgSeq integer );
CREATE TABLE citations ( Patent integer, Cit_Date date, Cit_Name varchar, Cit_Kind varchar, Cit_Country varchar, Citation varchar, Category varchar, CitSeq integer );
CREATE TABLE inventors ( Patent integer, Firstname varchar, Lastname varchar, Street varchar, City varchar, State varchar, Country varchar, Zipcode varchar, #?? Nationality varchar, InvSeq integer );
I have not processed the following tables:
CREATE TABLE patdesc ( Patent varchar, Abstract varchar, Title varchar );
CREATE TABLE lawyers ( Patent varchar, Firstname varchar, Lastname varchar, LawCountry varchar, OrgName varchar, LawSeq integer );
CREATE TABLE scirefs ( Patent varchar, Descrip varchar, CitSeq integer );
CREATE TABLE usreldocs ( Patent varchar, DocType varchar, OrderSeq integer, Country varchar, RelPatent varchar, Kind varchar, RelDate varchar, Status varchar );
CREATE TABLE invpat ( Firstname varchar, Lastname varchar, Street varchar, City varchar, State varchar, Country varchar, Zipcode integer, Lat double precision, Lon double precision, InvSeq integer, Patent varchar, GYear integer, AppYear varchar, AppDate varchar, Assignee varchar, AsgNum varchar, Class varchar, InvNum varchar, Low varchar, Up varchar );