Latest revision as of 12:41, 21 September 2020
| USPTO Patent Assignment Dataset | |
|---|---|
| Project Information | |
| Has title | USPTO Patent Assignment Dataset |
| Has owner | Ed Egan |
| Has start date | |
| Has deadline date | |
| Has keywords | Data |
| Has project status | Active |
| Has sponsor | McNair Center |
| Has project output | Data |

Copyright © 2019 edegan.com. All Rights Reserved.
This project describes the build out and basic use of the USPTO Assignment Dataset.
The data, scripts, etc. are in:
E:\McNair\Projects\USPTO Patent Assignment Dataset
The data is described in a USPTO Economic Working Paper by Marco, Myers, Graham and others: https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf
Pre-load checks
The data is large. We don't have space on the main dbase server for it.
df -h /dev/nvme1n1p2
/dev/nvme1n1p2  235G  208G  15G  94%  /var/postgresql
Note: To check dbase space usage on the dbase server see Postgres_Server_Configuration#Size.2C_Backup_.26_Restore.
The postgres dbase on the RDP, however, currently has more than 300 GB free and is on a solid-state drive, so its performance should be acceptable.
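Before loading, free space on the target volume can be checked with a short script. This is a sketch only: the path argument is a placeholder for the actual Postgres data directory.

```shell
#!/bin/bash
# Sketch of a pre-load space check. The path passed in is an assumption;
# substitute the mount point of the Postgres data directory.
check_free_gb() {
  # df -Pk gives POSIX-format output in 1K blocks; column 4 is available space.
  df -Pk "$1" | awk 'NR==2 { printf "%d\n", $4 / 1048576 }'
}

check_free_gb /
```

Compare the printed figure against the expected size of the dataset before starting the load.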
Getting the data
The data is available pre-processed (see the working paper) from https://bulkdata.uspto.gov/#addt. Specifically, download csv.zip (1,284,462,233 bytes, dated 2017-03-28 15:47) from https://bulkdata.uspto.gov/data/patent/assignment/economics/2016/
The load script is:
LoadUSPTOPAD.sql
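The script itself is not reproduced on this page. A minimal sketch of what a Postgres load of one of the CSVs might look like is below; the table name, column names, and file path are illustrative assumptions, not the actual schema from LoadUSPTOPAD.sql.

```sql
-- Hypothetical sketch only: table, columns, and path are assumptions.
CREATE TABLE assignment (
    rf_id       integer,
    file_id     varchar(32),
    record_dt   date,
    conveyance  text
);

-- Run from within psql; \COPY reads the file client-side.
\COPY assignment FROM 'assignment.csv' WITH (FORMAT csv, HEADER true);
```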
To get the data into UTF-8 or ASCII, move it to the dbase server, then:
- Check its encoding using:
file -i Car.java
- Convert it to UTF-8 (the TRANSLIT option approximates characters that can't be directly encoded):
iconv -f oldformat -t UTF-8//TRANSLIT file -o outfile
- The -s and -c options force iconv to skip characters it can't convert and move on:
iconv -sc -f oldformat -t UTF-8//TRANSLIT file -o outfile
- Bash scripts to convert all of the CSVs are in Z:\USPTO_assigneesdata; make them executable and then run whichever you need:
chmod +x encoding.sh
./encoding.sh
- Note that the final source encoding was Windows-1252 (Win1252) and the final target encoding was ASCII
- All bar three of the files had to be manually fixed to remove errors. Final files are in E:\McNair\Projects\USPTO Patent Assignment Dataset
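The conversion loop in encoding.sh presumably looks something like the sketch below. The source and target encodings match the note above, but the output directory and the function name are assumptions.

```shell
#!/bin/bash
# Hypothetical sketch of encoding.sh: convert every CSV in the current
# directory from Windows-1252 to ASCII. The "converted" output directory
# is an assumption, not part of the original script.
convert_csvs() {
  shopt -s nullglob   # so the loop is a no-op when no CSVs are present
  mkdir -p converted
  for f in *.csv; do
    # -s: silent, -c: drop characters that cannot be converted;
    # //TRANSLIT approximates characters where the locale allows it.
    iconv -sc -f WINDOWS-1252 -t ASCII//TRANSLIT -o "converted/$f" "$f"
  done
}

convert_csvs
```

Each converted file lands under converted/ with its original name, leaving the source files untouched in case a conversion needs to be rerun.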