Revision as of 13:51, 28 September 2017
USITC | |
---|---|
Project Information | |
Project Title | USITC Data |
Owner | Harrison Brown |
Start Date | 9/11/2017 |
Deadline | |
Primary Billing | |
Notes | In Progress |
Has project status | Active |
Files
My files are in 2 different places:
E:\McNair\Projects\USITC
The Postgres SQL Server:
128.42.44.182/bulk/USITC
Work
The results.csv file found here,
E:\McNair\Projects\USITC
is a csv of the data that I have been able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm
For every notice, there is a line in the CSV file that contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and the date the notice was issued.
I have also downloaded the PDFs from the website. These are the PDFs that are linked in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:
E:\McNair\Projects\USITC\pdfs_copy
These files were downloaded using the script on the Postgres server. For some reason, there are issues downloading PDFs directly onto the remote Windows machine, so you must download the PDFs on the Postgres server and transfer them to the RDP machine. The script to download the PDFs is:
128.42.44.182\bulk\USITC\download
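The download step described above can be sketched roughly as follows. This is not the actual script on the Postgres server; it is a minimal illustration that assumes the CSV has a header row and that the PDF link lives in a column named pdf_link (a hypothetical name). It records the URLs that fail instead of stopping, which makes it easier to see which PDFs could not be downloaded.

```python
import csv
import os
import urllib.request


def download_pdfs(csv_path, out_dir, url_column="pdf_link", fetch=None):
    """Download every PDF linked in the CSV; return a list of (url, error) failures.

    url_column is an assumed column name, not necessarily the one in results.csv.
    fetch is an optional callable(url) -> bytes, mainly useful for testing.
    """
    if fetch is None:
        # Default fetcher: plain HTTP GET with a timeout.
        def fetch(url):
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()

    os.makedirs(out_dir, exist_ok=True)
    failed = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            url = (row.get(url_column) or "").strip()
            if not url:
                continue  # no link scraped for this notice
            target = os.path.join(out_dir, url.rsplit("/", 1)[-1])
            if os.path.exists(target):
                continue  # already downloaded on an earlier run
            try:
                data = fetch(url)
            except Exception as exc:  # network errors, HTTP errors, timeouts
                failed.append((url, str(exc)))
                continue
            with open(target, "wb") as out:
                out.write(data)
    return failed
```

Because files that already exist are skipped, the function can be rerun to retry only the PDFs that failed earlier.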
You can scrape the PDFs using the PDF scraper from a previous project, found here:
E:\McNair\software\utilities\PDF_RIPPER
This script was modified to scrape all of the PDFs in the pdfs folder. The modified code is in the McNair/Project/USITC/ directory and is called pdf_to_text_bulk.py
An example of a PDF that parses correctly is https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf and its parsed text is here:
E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt
There are PDFs where the parsing does not work completely and the extracted text is scrambled.
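One possible way to flag scrambled output automatically is a rough heuristic like the one below. It is not part of the existing scraper: the function names and the 0.5 threshold are assumptions chosen for illustration, and the idea is simply that cleanly parsed English text consists mostly of alphabetic words, while scrambled text tends to break into single characters and symbol runs.

```python
import re

# A token counts as "word-like" if it is two or more ASCII letters.
WORD = re.compile(r"^[A-Za-z]{2,}$")


def wordlike_ratio(text):
    """Return the fraction of whitespace-separated tokens that look like words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    # Strip common surrounding punctuation before testing each token.
    cleaned = (t.strip(".,;:()[]\"'") for t in tokens)
    return sum(1 for t in cleaned if WORD.match(t)) / len(tokens)


def looks_scrambled(text, threshold=0.5):
    """Flag text whose word-like ratio falls below the (assumed) threshold."""
    return wordlike_ratio(text) < threshold
```

Running looks_scrambled over each file in Parsed_Texts would give a list of candidates to inspect by hand; the threshold would need tuning against real notices, which contain many numbers and citations.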
Status
Currently the web scraper is able to gather all of the data that I can gather from the HTML. There are a few cases where the Investigation Number is not listed and I need to test for those and fix that in the code.
Most of the PDFs have been downloaded. There were errors downloading some of the files.
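To test for the notices where the Investigation Number is not listed, a check along these lines could be run over results.csv. This is only a sketch: it assumes the CSV has a header row, and the column name investigation_no is hypothetical and would need to match the real header.

```python
import csv


def rows_missing_field(csv_path, field="investigation_no"):
    """Return (row_index, row) pairs where the given field is absent or blank.

    row_index is the 1-based position among the data rows (header excluded).
    field is an assumed column name, not necessarily the one in results.csv.
    """
    missing = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for index, row in enumerate(csv.DictReader(f), start=1):
            if not (row.get(field) or "").strip():
                missing.append((index, row))
    return missing
```

The returned rows still carry the title and PDF link, so each one can be traced back to its notice on the website.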