Difference between revisions of "USPTO Patent Assignment Dataset"

McNair Project
USPTO Patent Assignment Dataset
Project Information
Project Title	USPTO Patent Assignment Dataset
Owner	Ed Egan
Start Date
Deadline
Keywords	Data
Primary Billing
Notes
Has project status	Active

Revision as of 15:17, 16 November 2017

This project describes the build-out and basic use of the USPTO Patent Assignment Dataset.

The data, scripts, etc. are in:

E:\McNair\Projects\USPTO Patent Assignment Dataset

The data is described in a USPTO Economic Working Paper by Marco, Myers, Graham and others: https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf

The data is large. We don't have space on the main dbase server for it.

df -h
/dev/nvme1n1p2  235G  208G   15G  94% /var/postgresql

Note: To check dbase space usage on the dbase server see Posgres_Server_Configuration#Size.2C_Backup_.26_Restore.

The postgres dbase on the RDP, however, currently has more than 300 GB free and is on a solid state drive, so its performance should be acceptable.
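The space check above can be scripted so that a load is not started on a nearly full disk. This is a minimal sketch, not part of the project's scripts: the function names free_kb and has_free_space are hypothetical, and the 100 GB threshold in the example is an assumed figure, not a measured requirement.

```shell
# Print the free space, in kilobytes, on the filesystem that holds a path.
# df -P prints one POSIX-format line per filesystem; column 4 is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if the filesystem holding $1 has at least $2 kilobytes free.
has_free_space() {
    avail=$(free_kb "$1")
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}

# Example: require 100 GB free (an assumed figure) before loading.
# has_free_space /var/postgresql $((100 * 1024 * 1024)) || echo "not enough space"
```

Using df -P rather than df -h keeps the output machine-readable, since -h rounds sizes and changes units from line to line.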

The data is available pre-processed (see the working paper) from https://bulkdata.uspto.gov/#addt. Specifically, download csv.zip (listed as 1284462233 bytes, dated 2017-03-28 15:47) from https://bulkdata.uspto.gov/data/patent/assignment/economics/2016/
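Because the archive is over a gigabyte, it may be worth confirming that the download completed before unzipping it. The sketch below compares a file's length with an expected byte count. The function name size_matches is hypothetical, and it assumes that the 1284462233 shown in the listing is the file size in bytes.

```shell
# Succeed if FILE ($1) is exactly EXPECTED ($2) bytes long.
# wc -c counts bytes; tr strips the padding some wc versions add.
size_matches() {
    actual=$(wc -c < "$1" | tr -d ' ')
    [ "$actual" = "$2" ]
}

# Example, assuming csv.zip is in the current directory:
# size_matches csv.zip 1284462233 || echo "csv.zip looks incomplete; download it again"
```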

The load script is:

LoadUSPTOPAD.sql

To get the data into UTF-8, move it to the dbase server, then check the current encoding of each file (Car.java below is a placeholder for the file name):

file -i Car.java

Convert it to UTF-8 using iconv, where oldformat is the encoding reported by file -i (the //TRANSLIT suffix approximates characters that can't be directly encoded):

iconv -f oldformat -t UTF-8//TRANSLIT file -o outfile

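The -sc options force iconv to ignore bad characters and move on (-c drops characters that cannot be converted, -s suppresses the warnings):

iconv -sc -f oldformat -t UTF-8//TRANSLIT file -o outfile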
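The two steps above (detect the encoding, then convert) can be wrapped in small helper functions. This is a sketch that assumes GNU file and GNU iconv. The function names detect_encoding and convert_to_utf8, and the file name assignee.csv in the example, are hypothetical.

```shell
# Report the character encoding that file(1) guesses for a file,
# for example "iso-8859-1" or "utf-8".
detect_encoding() {
    file -b --mime-encoding "$1"
}

# Convert INFILE ($2) from SRC_ENCODING ($1) to UTF-8 and write OUTFILE ($3).
# -c drops characters that cannot be converted, -s suppresses warnings,
# and //TRANSLIT approximates characters with no direct equivalent.
convert_to_utf8() {
    iconv -sc -f "$1" -t UTF-8//TRANSLIT -o "$3" "$2"
}

# Example (assignee.csv is a hypothetical file name):
# convert_to_utf8 "$(detect_encoding assignee.csv)" assignee.csv assignee_utf8.csv
```

Note that file only guesses the encoding from the bytes it sees. If it reports something like unknown-8bit or binary, iconv will not accept that name, and the source encoding has to be supplied by hand.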

Bash scripts that convert all of the CSVs are in Z:\USPTO_assigneesdata; make them executable and then run whichever you need:

chmod  +x  encoding.sh
./encoding.sh
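The contents of encoding.sh are not reproduced here. As an illustration only, a batch conversion of this kind might look like the sketch below. The function name convert_all_csvs, its arguments, and the output directory layout are all assumptions, and the source encoding must be supplied by the caller.

```shell
# Convert every .csv in SRC_DIR ($2) from SRC_ENCODING ($1) to UTF-8,
# writing the results under OUT_DIR ($3) with the same file names.
# The original files are left untouched.
convert_all_csvs() {
    src_encoding=$1
    src_dir=$2
    out_dir=$3
    mkdir -p "$out_dir" || return 1
    for f in "$src_dir"/*.csv; do
        [ -e "$f" ] || continue   # skip the literal pattern if no CSVs match
        out="$out_dir/$(basename "$f")"
        if iconv -sc -f "$src_encoding" -t UTF-8//TRANSLIT -o "$out" "$f"; then
            echo "converted $f"
        else
            # iconv returned non-zero; the output may be missing or incomplete
            echo "check $f: iconv reported a problem" >&2
        fi
    done
}

# Example (paths are hypothetical):
# convert_all_csvs ISO-8859-1 ./csv ./csv_utf8
```

Writing to a separate directory rather than converting in place means a failed or lossy conversion can be rerun from the untouched originals.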
