Difference between revisions of "USITC"
Revision as of 14:34, 27 September 2017
USITC | |
---|---|
Project Information | |
Project Title | USITC Data |
Owner | Harrison Brown |
Start Date | 9/11/2017 |
Deadline | |
Primary Billing | |
Notes | In Progress |
Has project status | Active |
Copyright © 2016 edegan.com. All Rights Reserved. |
Files
The files are in two places:
On the RDP machine: E:\McNair\Projects\USITC
On the Postgres SQL Server: 128.42.44.182/bulk/USITC
The results.csv file contains the data that I have been able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm
Each notice has one line in the CSV file, containing the Investigation Title, the Investigation No., a link to the PDF on the website, the Notice description, and the date the notice was issued.
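As a rough illustration of the layout described above, here is a minimal sketch of reading results.csv in Python. The column names and their order are assumptions taken from the description (the real file may have a header row or a different order), and the sample row is made up.

```python
import csv
import io

# Assumed column order, based on the description of results.csv above.
# These field names are hypothetical, not taken from the actual file.
FIELDS = ["title", "investigation_no", "pdf_link", "description", "date_issued"]

def load_notices(handle):
    """Read notice rows from an open CSV file into a list of dicts."""
    reader = csv.DictReader(handle, fieldnames=FIELDS)
    return [row for row in reader]

# Made-up sample row, for illustration only.
sample = io.StringIO(
    "Certain Widgets,337-TA-0000,https://example.com/notice.pdf,"
    "Notice of Institution,2017-09-01\n"
)
notices = load_notices(sample)
print(notices[0]["investigation_no"])
```

To read the real file, pass an open file handle instead of the sample, for example `open("results.csv", newline="")`.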
I have also downloaded the PDFs from the website. They are here:
E:\McNair\Projects\USITC\pdf_copy
These files were downloaded using the script on the Postgres Server. Downloading PDFs directly onto the remote Windows machine fails for a reason I have not yet identified, so you must download the PDFs on the Postgres Server and transfer them to the RDP machine. The script to download the PDFs is 128.42.44.182/bulk/USITC/download
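The download step could look roughly like the sketch below. This is not the script at the path above; it is a Python stand-in, and the function names `local_name` and `download_pdfs` are hypothetical. It assumes the PDF links come from the link column of results.csv and that each link ends in a usable file name.

```python
import os
import urllib.request
from urllib.parse import urlparse

def local_name(pdf_url, out_dir):
    """Map a PDF link to a path in out_dir, using the last URL path segment."""
    name = os.path.basename(urlparse(pdf_url).path)
    return os.path.join(out_dir, name)

def download_pdfs(pdf_urls, out_dir):
    """Download each link, skipping files that already exist in out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for url in pdf_urls:
        target = local_name(url, out_dir)
        if not os.path.exists(target):
            urllib.request.urlretrieve(url, target)
        saved.append(target)
    return saved
```

Skipping files that already exist means the script can be rerun after an interruption without downloading everything again.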
I am using the PDF scraper from a previous project, found here: E/McNair/software/utilities/PDF_RIPPER
Status
Check my work log to see what I have done on a day-to-day basis.
The web scraper currently gathers all of the data that is available in the HTML. In a few cases the Investigation Number is not listed; I need to test for those cases and handle them in the code.
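One way to test for the missing Investigation Number cases is sketched below. It assumes the numbers follow the usual Section 337 form such as 337-TA-1065, and that each scraped row is a dict with a hypothetical `investigation_no` key; neither assumption is confirmed against the actual scraper.

```python
import re

# Assumed format of a Section 337 investigation number, e.g. "337-TA-1065".
INV_NO = re.compile(r"337-TA-\d+")

def find_investigation_no(text):
    """Return the first investigation number found in text, or None."""
    match = INV_NO.search(text or "")
    return match.group(0) if match else None

def rows_missing_number(rows):
    """Return the indexes of rows whose investigation number is absent."""
    return [
        i for i, row in enumerate(rows)
        if find_investigation_no(row.get("investigation_no")) is None
    ]
```

The indexes returned by `rows_missing_number` identify the notices that need manual review or a fallback, such as searching the notice title or description for the same pattern.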
The next step is to parse the PDFs. I am currently running a script to convert them to text.
I am also running a shell script to download the PDFs and will update this page when that is completed.