contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued.
I have also downloaded the PDFs from the website. These are the PDFs from the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:
E:\McNair\Projects\USITC\pdfs_copy
These files were downloaded using the script on the Postgres Server. For some reason, there are issues downloading PDFs onto the remote Windows machine.
You must download the PDFs on the Postgres Server and transfer them to the RDP. The script to download the PDFs is in
128.42.44.182\bulk\USITC\download
and is called pdf_to_text_bulk.py.
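For anyone who needs to redo the download step, here is a minimal sketch of fetching every PDF linked in the CSV file. It is not the contents of pdf_to_text_bulk.py. The function names and the column name `link` are assumptions for illustration, so adjust them to the actual CSV header. The function returns the URLs that failed, since some PDFs could not be downloaded.

```python
import csv
import os
import urllib.request
from urllib.parse import urlparse


def pdf_filename(url):
    """Return the file name at the end of a PDF URL."""
    return os.path.basename(urlparse(url).path)


def download_pdfs(csv_path, out_dir, link_column="link"):
    """Download every PDF linked in the CSV; return the (url, error) pairs that failed.

    link_column is a hypothetical column name; use the real header from the CSV.
    """
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            url = (row.get(link_column) or "").strip()
            if not url:
                continue  # skip rows without a link
            target = os.path.join(out_dir, pdf_filename(url))
            try:
                urllib.request.urlretrieve(url, target)
            except (OSError, ValueError) as err:
                # Network and HTTP errors are OSError subclasses; malformed URLs raise ValueError.
                failed.append((url, str(err)))
    return failed
```

Calling `download_pdfs("notices.csv", "pdfs")` (both names are placeholders) would save the files into `pdfs` and return a list of the links that need to be retried.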
An example of a PDF that parses correctly is: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf
The parsed text is here:
E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt
There are PDFs where the parsing does not work completely and the text is scrambled.
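One possible way to find the scrambled files without opening each one is a rough heuristic over the parsed text. This is only a sketch, not something the project currently does. The function names, the definition of a "word-like" token, and the 0.5 threshold are all assumptions that would need tuning against real parsed files.

```python
import re

# A token counts as "word-like" if it is two or more ASCII letters,
# optionally followed by one punctuation mark. This is an assumed definition.
WORDLIKE = re.compile(r"[A-Za-z]{2,}[.,;:]?")


def wordlike_ratio(text):
    """Return the fraction of whitespace-separated tokens that look like words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    matches = sum(1 for t in tokens if WORDLIKE.fullmatch(t))
    return matches / len(tokens)


def looks_scrambled(text, threshold=0.5):
    """Flag text whose share of word-like tokens falls below the (assumed) threshold."""
    return wordlike_ratio(text) < threshold
```

Running `looks_scrambled` on each file in the Parsed_Texts folder would give a candidate list of PDFs to re-parse or check by hand.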
==Status==