Revision as of 14:46, 25 July 2018
Patent Thicket | |
---|---|
Project Information | |
Project Title | Patent Thicket |
Owner | Grace Tan |
Start Date | Summer 2018 |
Deadline | |
Primary Billing | |
Notes | |
Has project status | Active |
Is dependent on | Google Scholar Crawler, PDF Downloader, PDF to Text Converter |
Location of Files
E://McNair/Software/Patent_Thicket
Downloaded PDFs:
E://McNair/Projects/Software/Patent_Thicket/AllPDFs/successful_downloads
Converted PDFs to txt files:
E://McNair/Projects/Software/Patent_Thicket/Parsed_Texts
Google Scholar Crawler
Used Google Scholar Crawler.
I ran it on the selenium box and switched among the Rice Visitor, Rice Owls, and eduroam networks to prevent Google Scholar from blocking me.
I downloaded 613 PDF URLs and 958 BibTeX files from 100 pages of Google Scholar results for the search "patent thicket."
Downloading PDFs
Used PDF Downloader
I tweaked the code to handle repeated file names. Five of the PDF URLs were not downloadable, so I ended up with 608 working PDFs.
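The repeated-file-name handling mentioned above could look roughly like the following. This is a minimal sketch, not the actual change made to PDF Downloader: the function name `unique_path` and the `_1`, `_2`, ... suffix scheme are assumptions chosen for illustration.

```python
from pathlib import Path


def unique_path(directory: Path, filename: str) -> Path:
    """Return a path in `directory` that does not collide with an existing file.

    Hypothetical helper: if `filename` is already taken, a counter
    (_1, _2, ...) is inserted before the extension until the name is free.
    """
    candidate = directory / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        candidate = directory / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate
```

Calling `unique_path(download_dir, name)` just before saving each PDF would keep two papers that share a file name from overwriting each other, which matters when the count of working PDFs has to match the count of downloadable URLs.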
pdf_to_txt_bulk_PTLR.py
See PDF to Text Converter.
The code must be run from E because the libraries it uses are not installed on Z. I reinstalled pdfminer, which might cause problems in the future if the libraries change.
This program converts all PDFs to txt files. It also generates two files, _LOG_ERR.txt and _LOG_RUN.txt, which list the names of the PDFs that could not be converted and those that were converted successfully, respectively. Some of the files that were successfully converted, especially the very small ones, do not contain the text of the paper.
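The conversion-and-logging behavior described above can be sketched as follows. This is not the code of pdf_to_txt_bulk_PTLR.py: the function names `convert_all` and `nearly_empty`, the injected `convert` callable, and the 500-character threshold are all assumptions made for illustration. In practice `convert` could be a thin wrapper around a pdfminer call; it is passed in here so the sketch does not depend on any particular pdfminer version.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def convert_all(pdf_dir: Path, out_dir: Path,
                convert: Callable[[Path], str]) -> Tuple[List[str], List[str]]:
    """Convert every PDF in pdf_dir to a .txt file in out_dir.

    Returns (converted, failed) lists of PDF names and writes them to
    _LOG_RUN.txt and _LOG_ERR.txt in out_dir.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    converted, failed = [], []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        try:
            text = convert(pdf)
        except Exception:
            # Any parser failure is recorded and the run continues.
            failed.append(pdf.name)
            continue
        (out_dir / (pdf.stem + ".txt")).write_text(text, encoding="utf-8")
        converted.append(pdf.name)
    (out_dir / "_LOG_RUN.txt").write_text("\n".join(converted), encoding="utf-8")
    (out_dir / "_LOG_ERR.txt").write_text("\n".join(failed), encoding="utf-8")
    return converted, failed


def nearly_empty(out_dir: Path, min_chars: int = 500) -> List[str]:
    """List .txt outputs with fewer than min_chars non-whitespace characters.

    min_chars is an arbitrary illustrative threshold, not a tested value.
    """
    flagged = []
    for txt in sorted(out_dir.glob("*.txt")):
        if txt.name.startswith("_LOG_"):
            continue  # skip the log files themselves
        content = txt.read_text(encoding="utf-8", errors="ignore")
        if len("".join(content.split())) < min_chars:
            flagged.append(txt.name)
    return flagged
```

Running `nearly_empty` on the output directory after a conversion pass gives a list of files that were logged as successful but probably lack the paper's text, so they can be checked by hand or excluded from later analysis.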