Difference between revisions of "USITC"


Revision as of 14:49, 25 October 2017


McNair Project
USITC
Project Information
Project Title: USITC Data
Owner: Harrison Brown
Start Date: 9/11/2017
Deadline:
Primary Billing:
Notes: In Progress
Has project status: Active


Files

My files are in two places:

E:\McNair\Projects\USITC

The Postgres SQL Server:

128.42.44.182/bulk/USITC

Work

The results.csv file, found here:

E:\McNair\Projects\USITC

contains the data that I have been able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm

For every notice, the CSV file has one line containing the Investigation Title, the Investigation No., a link to the PDF on the website, the Notice description, and the date the notice was issued.
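As an illustration of the layout described above, here is a minimal sketch for loading results.csv into named records. The column order, the absence of a header row, and the names Notice and read_notices are assumptions made for this example; they are not taken from the actual scraper code.

```python
import csv
from typing import Iterator, NamedTuple


class Notice(NamedTuple):
    # Field order is an assumption based on the description of results.csv.
    title: str
    investigation_no: str
    pdf_link: str
    description: str
    date_issued: str


def read_notices(path: str) -> Iterator[Notice]:
    """Yield one Notice per well-formed line of the CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            # Assumption: every valid line has exactly five fields and
            # there is no header row. Other lines are skipped.
            if len(row) != 5:
                continue
            yield Notice(*(field.strip() for field in row))
```

Dates are kept as strings here because the date format used in the file is not stated.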

I have also downloaded the PDFs linked in the CSV file from the website. Some of the PDFs could not be downloaded. The PDFs are here:

E:\McNair\Projects\USITC\pdfs_copy

These files were downloaded using the script on the Postgres Server. Downloading PDFs directly onto the remote Windows machine fails, for reasons I have not yet determined, so you must download the PDFs on the Postgres Server and then transfer them to the RDP machine. The script to download the PDFs is:

128.42.44.182\bulk\USITC\download
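The download script itself is not reproduced on this page. As a rough sketch of what such a downloader could look like (not the actual script), the function below fetches each link and records failures instead of stopping. The name download_pdfs and the injectable fetch parameter are assumptions for illustration only.

```python
import os
import posixpath
import urllib.request
from typing import Callable, Iterable, List, Tuple
from urllib.parse import urlparse


def download_pdfs(
    links: Iterable[str],
    dest_dir: str,
    fetch: Callable[[str, str], object] = urllib.request.urlretrieve,
) -> List[Tuple[str, str]]:
    """Download each link into dest_dir and return (link, error) pairs."""
    os.makedirs(dest_dir, exist_ok=True)
    failures = []
    for link in links:
        # Assumption: the local file name is the last segment of the URL path.
        name = posixpath.basename(urlparse(link).path)
        target = os.path.join(dest_dir, name)
        if os.path.exists(target):
            continue  # already downloaded on an earlier run
        try:
            fetch(link, target)
        except OSError as error:
            # urllib's URLError and HTTPError are subclasses of OSError.
            failures.append((link, str(error)))
    return failures
```

Returning the failures with their error messages makes it possible to see afterwards which PDFs could not be downloaded and why.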

You can scrape the PDFs using the PDF scraper from a previous project, found here:

E:\McNair\software\utilities\PDF_RIPPER

The scraper was modified to process all of the PDFs in the pdfs folder. The modified code is in the McNair/Projects/USITC/ directory and is called pdf_to_text_bulk.py

An example of a PDF that parses correctly is https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf. The parsed text is here:

E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt

For some PDFs the parsing does not work completely and the extracted text is scrambled.
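One possible way to flag scrambled output automatically is to measure how many tokens in a parsed text file are very common English words. This is only a heuristic sketch: the word list, the 0.15 threshold, and the function names are assumptions chosen for illustration and have not been tuned on the actual parsed texts.

```python
import re

# Assumption: a small set of very common English words is enough to tell
# readable notice text apart from badly scrambled extraction output.
COMMON_WORDS = {
    "the", "of", "and", "to", "in", "a", "that", "is",
    "for", "by", "on", "this", "with", "as", "be", "or",
}


def common_word_ratio(text: str) -> float:
    """Return the share of alphabetic tokens that are common English words."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in COMMON_WORDS)
    return hits / len(tokens)


def looks_scrambled(text: str, threshold: float = 0.15) -> bool:
    """Flag text whose common-word ratio falls below the (arbitrary) threshold."""
    return common_word_ratio(text) < threshold
```

Files flagged this way would still need a manual check, since a short or unusual notice could fall below the threshold without being scrambled.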

I have started using NLTK to gather information about the filings. See the code in Projects/USITC/ProcessingTexts. I am working on extracting the respondents from these documents.

Alternative Solutions

337Info - Unfair Import Investigations Information System

https://pubapps2.usitc.gov/337external/

This site links to the Complaints. Most of the data that can be gathered is there, in a clean format that could easily be web scraped. There is a query builder, but I believe it may have errors. I will try to get it to work.

There are various statistics that are publicly available that we could use.


Section 337 Statistics: Settlement Rate Data https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm

Section 337 Statistics https://www.usitc.gov/press_room/337_stats.htm

Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly) https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm

Section 337 Statistics (contains links to various other pages with statistics) https://www.usitc.gov/intellectual_property/337_statistics.htm

Here is a dictionary of terms used in these documents: https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf


There are FAQs listed here: https://www.usitc.gov/documents/337Info_FAQ.pdf

There is a query builder for Section 337 Notices here: https://pubapps2.usitc.gov/337external/advanced

To use it, you must select fields from the GUI at the bottom of the page.


Status

Currently, the web scraper gathers all of the data that is available in the HTML. In a few cases the Investigation Number is not listed; I need to test for those cases and handle them in the code.
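A minimal sketch of one way to test for a missing Investigation Number is shown below. It assumes that investigation numbers look like 337-TA-959 and that the PDF file name encodes the same number (as in 337_959_notice02062017sgl.pdf). Both patterns, and the name resolve_investigation_no, are assumptions for illustration and are not part of the existing scraper.

```python
import re
from typing import Optional

# Assumption: investigation numbers have the form 337-TA-<digits>.
INVESTIGATION_NO = re.compile(r"337-TA-\d+")
# Assumption: PDF file names start with 337_<digits>_.
PDF_NAME = re.compile(r"337_(\d+)_")


def resolve_investigation_no(listed: str, pdf_link: str) -> Optional[str]:
    """Return the listed number, else one inferred from the PDF link, else None."""
    match = INVESTIGATION_NO.search(listed or "")
    if match:
        return match.group(0)
    # Fallback: infer the number from the PDF file name. Treat this value
    # as a guess that should be verified against the notice itself.
    fallback = PDF_NAME.search(pdf_link or "")
    if fallback:
        return "337-TA-" + fallback.group(1)
    return None
```

Rows for which this returns None are the ones that would need to be checked by hand.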

Downloaded most of the PDFs. There were errors downloading some of the files. I need to determine which PDFs could not be downloaded and why. I am also investigating other ways we can gather the information.
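To work out which PDFs were not downloaded, one option is to compare the links in the CSV file against the files present in the PDF folder. The sketch below assumes each downloaded file keeps the last segment of its URL as its file name; the function name missing_pdfs is hypothetical.

```python
import os
import posixpath
from typing import Iterable, List
from urllib.parse import urlparse


def missing_pdfs(links: Iterable[str], pdf_dir: str) -> List[str]:
    """Return the links whose file name does not appear in pdf_dir."""
    # Compare names case-insensitively, since Windows file systems
    # usually ignore case.
    on_disk = {name.lower() for name in os.listdir(pdf_dir)}
    missing = []
    for link in links:
        # Assumption: the local file name equals the last URL path segment.
        name = posixpath.basename(urlparse(link).path)
        if name.lower() not in on_disk:
            missing.append(link)
    return missing
```

This only answers which files are missing. Finding out why would require re-requesting those links and recording the error for each one.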