Difference between revisions of "USITC"


Latest revision as of 12:44, 21 September 2020


Project
USITC
Project Information
Has title USITC Data
Has owner Harrison Brown
Has start date 9/11/2017
Has deadline date
Has project status Active
Has sponsor McNair Center
Has project output Data, Tool

Files

My files are in two different places. Development work was done here:

E:\McNair\Projects\USITC

I had to use the Postgres SQL server to download the PDFs. The Postgres SQL server location is:

128.42.44.182/bulk/USITC

Additional Information

The USITC provides more information besides 337 notices.

Here is information and a database on Section 701/731:

https://www.usitc.gov/trade_remedy/trade_research_tools
https://pubapps2.usitc.gov/sunset/


New Work

USITC 337 Cases Tab Delimited Text

USITC patent information was gathered from the investigations.json file downloaded from the USITC website (https://pubapps2.usitc.gov/337external/, click on 'Cases Instituted After 2008'). This file contains information on 337 cases, their respondents and complainants, and the patents that were part of each case. The code and results for this program are here:

Projects/USITC/JSON_scraping_python

The program reads the information, places it into lists of lists in Python, and then writes it to the files listed below. The files do not have headers, and null values are written as empty strings. To create the tab-delimited text files, run code.py in the JSON_scraping_python directory. All of the file names are hard-coded in this script. It creates the following files:
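The write step described above can be sketched roughly as follows. This is a minimal illustration and not the contents of code.py: the function name `write_tab_delimited` and the sample row are made up for the example (the row reuses values from the XML examples on this page, with an invented id and an empty docket number).

```python
import csv

def write_tab_delimited(rows, path):
    """Write a list of lists to a headerless, tab-delimited text file.

    None values are written as empty strings, matching the convention
    described on this page.
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for row in rows:
            writer.writerow(["" if value is None else value for value in row])

# Hypothetical sample row; real rows would be built from investigations.json.
sample_rows = [
    [1, "Silicon-on-Insulator Wafers", "966", "Violation", None, "2015-09-24"],
]
```

Calling `write_tab_delimited(sample_rows, "investigation_info.txt")` would produce one line with six tab-separated fields, the fifth of which is empty.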

investigation_info.txt
Schema: id, title, investigation number, investigation type, docket number, date of publication notice
complainant_info.txt
Schema: investigation id, investigation number, complainant name, complainant outside party ID, complainant city, complainant country
respondent_info.txt
Schema: investigation id, investigation number, respondent outside party ID, respondent name, respondent city, respondent country
patent_info.txt
Schema: investigation number, patent ID, patent number, active date, inactive date
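Because the files have no headers, anyone loading them has to supply the column names. The sketch below shows one way to do that, assuming the files are UTF-8 encoded. The column order follows the schemas listed above; the snake_case names, the `SCHEMAS` dictionary, and `read_table` are my own and are not part of code.py.

```python
import csv

# Column order follows the schemas above; the snake_case names are assumptions.
SCHEMAS = {
    "investigation_info.txt": [
        "id", "title", "investigation_number", "investigation_type",
        "docket_number", "date_of_publication_notice",
    ],
    "complainant_info.txt": [
        "investigation_id", "investigation_number", "complainant_name",
        "complainant_outside_party_id", "complainant_city", "complainant_country",
    ],
    "respondent_info.txt": [
        "investigation_id", "investigation_number", "respondent_outside_party_id",
        "respondent_name", "respondent_city", "respondent_country",
    ],
    "patent_info.txt": [
        "investigation_number", "patent_id", "patent_number",
        "active_date", "inactive_date",
    ],
}

def read_table(path, columns):
    """Read a headerless tab-delimited file into a list of dicts.

    Empty strings are mapped back to None, since nulls were written as "".
    Fields beyond the named columns are ignored.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        return [
            {name: (value if value != "" else None)
             for name, value in zip(columns, row)}
            for row in reader
        ]
```

For example, `read_table("patent_info.txt", SCHEMAS["patent_info.txt"])` would return one dictionary per patent row.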

XML Information

UPDATE: The JSON file was used instead to convert the data to tab-delimited text. There is also an XML file that contains information on investigations. To get it, go to the link below and click the 'xml' link under the 'Cases instituted after October 2008' tab.

https://pubapps2.usitc.gov/337external/

The information in this file can be extracted with an XML parser. For each investigation, we can find:

  • Investigation Number, e.g. <entry key="investigationNo">966</entry>
  • Date of Publication Notice, e.g. <entry key="dateOfPublicationFrNotice">2015-09-24T04:00:00.000Z</entry>
  • Title, e.g. <entry key="title">Silicon-on-Insulator Wafers</entry>
  • Patent Numbers, e.g. <entry key="patentNumbers">
  • Investigation Type, e.g. <entry key="investigationType">Violation</entry>
  • Respondents, found under <entry key="respondent">
  • Complainants, found under <entry key="complainant">

Additional information can also be gathered from the XML document.
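As a rough illustration of the XML parsing described above, the sketch below pulls a few keyed entries out of a small fragment using Python's standard library. Only the `<entry key="...">` elements are taken from the examples on this page; the `<investigations>` and `<investigation>` wrapper elements are assumptions, so the element names would need to be adjusted to the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the <entry> lines come from the examples on this page,
# but the wrapper element names are assumed and may differ in the real file.
SAMPLE_XML = """
<investigations>
  <investigation>
    <entry key="investigationNo">966</entry>
    <entry key="dateOfPublicationFrNotice">2015-09-24T04:00:00.000Z</entry>
    <entry key="title">Silicon-on-Insulator Wafers</entry>
    <entry key="investigationType">Violation</entry>
  </investigation>
</investigations>
"""

WANTED_KEYS = (
    "investigationNo",
    "dateOfPublicationFrNotice",
    "title",
    "investigationType",
)

def parse_investigations(xml_text):
    """Return one dict per investigation, keeping only the wanted entry keys."""
    root = ET.fromstring(xml_text)
    records = []
    for investigation in root.iter("investigation"):
        record = {}
        for entry in investigation.findall("entry"):
            key = entry.get("key")
            if key in WANTED_KEYS:
                record[key] = (entry.text or "").strip()
        records.append(record)
    return records

records = parse_investigations(SAMPLE_XML)
```

Nested entries such as respondents and complainants are left out here because their internal structure is not shown on this page.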

To find information on cases instituted prior to October 2008, go to the link above and click on 'Looking for cases instituted prior to October 2008?'. This downloads a CSV file.

  • The Investigation Number, Title, Unfair Act Alleged, Patent Numbers, Complainants, and Respondents can be read directly from the CSV.
  • The Target Date and the Beginning and Ending Dates contain notes (some cases are extended and their dates changed), so some text processing may be needed to extract this information.
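The text processing mentioned for the date fields could start from something like the sketch below. It assumes dates are written as M/D/YYYY somewhere inside the free-text note, which has not been checked against the actual CSV; the pattern, the function name, and the sample string are all hypothetical.

```python
import re
from datetime import datetime

# Assumption: dates appear as M/D/YYYY inside the note text.
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def extract_dates(field):
    """Return every valid M/D/YYYY date found in a free-text field, in order."""
    dates = []
    for month, day, year in DATE_PATTERN.findall(field):
        try:
            dates.append(datetime(int(year), int(month), int(day)).date())
        except ValueError:
            continue  # skip impossible dates such as 13/45/2001
    return dates

# Hypothetical field value, not copied from the CSV.
example_field = "6/15/2005 (extended to 9/30/2005)"
example_dates = extract_dates(example_field)
```

Keeping every date found, rather than only the first, leaves it to later code to decide whether the original or the extended date is the one of interest.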

Old Work

This is the work that was done before I knew about the XML files provided by the USITC (see the Alternative Solutions section below).

Here is a CSV of the data that I was able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm. The results.csv file is found here:

E:\McNair\Projects\USITC

For every notice, there is a line in the CSV file that contains the Investigation Title, the Investigation No., a link to the PDF on the website, the Notice description, and the date the notice was issued.

I have also downloaded the PDFs from the website. These are the PDFs from the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:

E:\McNair\Projects\USITC\pdfs_copy

These files were downloaded using a script on the Postgres server. For some reason there are issues downloading PDFs onto the remote Windows machine, so you must download the PDFs on the Postgres server and transfer them to the RDP. The script to download the PDFs is here:

128.42.44.182\bulk\USITC\download
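For readers without access to that script, a bulk download along these lines might look like the sketch below. This is not the script on the Postgres server; the function names are made up, and the list of URLs is assumed to come from the link column of results.csv. Failed downloads are collected rather than raised, since some of the PDFs could not be downloaded.

```python
import os
import posixpath
import urllib.request
from urllib.parse import urlparse

def local_name(url):
    """Return the file name portion of a PDF URL."""
    return posixpath.basename(urlparse(url).path)

def download_pdfs(urls, out_dir, fetch=urllib.request.urlretrieve):
    """Download each URL into out_dir and return (saved_paths, failed_urls).

    `fetch` takes (url, target_path); it defaults to urlretrieve and can be
    swapped out for testing.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved, failed = [], []
    for url in urls:
        target = os.path.join(out_dir, local_name(url))
        try:
            fetch(url, target)
            saved.append(target)
        except OSError:  # urllib's URLError and HTTPError are OSError subclasses
            failed.append(url)
    return saved, failed
```

The returned list of failed URLs gives a record of which notices still need to be fetched by hand.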

You can scrape the PDFs using the PDF scraper from a previous project, found here:

E:\McNair\software\utilities\PDF_RIPPER

This scraper was modified to process all of the PDFs in the pdfs folder. The modified code is in the McNair/Project/USITC/ directory and is called pdf_to_text_bulk.py.

An example of a PDF that parses correctly is https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf. The parsed text is here:

E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt

For some PDFs the parsing does not work completely and the text is scrambled.

I have started using NLTK to gather information about the filings. See the code in Projects/USITC/ProcessingTexts. I am working on extracting the respondents from these documents.

Alternative Solutions

Here is where you can get data on the USITC 337 notices instead of extracting this information from PDFs:

337Info - Unfair Import Investigations Information System

https://pubapps2.usitc.gov/337external/

This page links to complaints. Most of the data that can be gathered is there, in a clean format that could easily be web scraped if needed. There is a query builder (link below), and there is also a link that downloads all of the raw data in JSON or XML.

Here are links to various statistics we could use:

Section 337 Statistics: Settlement Rate Data

https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm

Section 337 Statistics

https://www.usitc.gov/press_room/337_stats.htm

Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly)

https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm

Section 337 Statistics (this page contains links to various other pages with statistics)

https://www.usitc.gov/intellectual_property/337_statistics.htm

Here is a dictionary of terms used in these documents:

https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf


There are FAQs listed here:

https://www.usitc.gov/documents/337Info_FAQ.pdf

There is a query builder for Section 337 notices here:

https://pubapps2.usitc.gov/337external/advanced

To use it, you must select fields from the GUI at the bottom of the page.