Difference between revisions of "Patent Thicket"

Latest revision as of 12:47, 21 September 2020


Project
Patent Thicket
Project Information
Has title Patent Thicket
Has owner Grace Tan
Has start date Summer 2018
Has deadline date
Has project status Active
Is dependent on Google Scholar Crawler, PDF Downloader, PDF to Text Converter
Has sponsor McNair Center
Has project output Tool
Copyright © 2019 edegan.com. All Rights Reserved.


Location of Files

E://McNair/Software/Patent_Thicket

Downloaded PDFs:

E://McNair/Projects/Software/Patent_Thicket/AllPDFs/successful_downloads

Converted PDFs to txt files:

E://McNair/Projects/Software/Patent_Thicket/Parsed_Texts

Google Scholar Crawler

I used the Google Scholar Crawler.

I ran it on the selenium box and switched between the Rice Visitor, Rice Owls, and eduroam networks to prevent Google Scholar from blocking me.

I downloaded 613 PDF URLs and 958 BibTeX files from 100 pages of Google Scholar results for the search "patent thicket."

Downloading PDFs

I used the PDF Downloader.

I tweaked the code to handle repeated file names. 5 of the PDF URLs were not downloadable, so I ended up with 608 working PDFs.
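
The tweak for repeated file names is not reproduced on this page. As a minimal sketch of one way to handle the problem, the helper below appends a counter to a file name until it no longer collides with an existing file. The function name `unique_path` and the `_1`, `_2` suffix scheme are assumptions made for illustration, not the PDF Downloader's actual code.

```python
from pathlib import Path


def unique_path(directory, filename):
    """Return a path inside `directory` that does not collide with an existing file.

    Hypothetical helper: if "paper.pdf" already exists, try "paper_1.pdf",
    then "paper_2.pdf", and so on.
    """
    directory = Path(directory)
    candidate = directory / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        candidate = directory / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate
```

Calling `unique_path(download_dir, name)` just before writing each download keeps two papers with the same file name from overwriting each other.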

pdf_to_txt_bulk_PTLR.py

See PDF to Text Converter

The code must be run from E because the libraries it uses are not in Z. I reinstalled pdfminer, which might be a problem in the future if the libraries change.

This program converts all PDFs to txt files. It also generates two log files, _LOG_ERR.txt and _LOG_RUN.txt, which list the names of the PDFs that could not be converted and the PDFs that were converted successfully, respectively. Some of the files that were successfully converted, especially the very small ones, don't contain the text of the paper.
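
Because some conversions succeed but produce little or no text, it may help to flag suspiciously small txt files for manual review. The sketch below is not part of the converter. The function name `find_suspect_texts` and the 500-character threshold are assumptions chosen for illustration, and the threshold should be tuned against real output.

```python
from pathlib import Path


def find_suspect_texts(text_dir, min_chars=500):
    """List converted txt files whose content is shorter than `min_chars`.

    The threshold is an arbitrary assumption; the converter's own log files
    (names starting with "_LOG_") are skipped.
    """
    suspects = []
    for txt in sorted(Path(text_dir).glob("*.txt")):
        if txt.name.startswith("_LOG_"):
            continue
        content = txt.read_text(encoding="utf-8", errors="ignore")
        if len(content.strip()) < min_chars:
            suspects.append(txt.name)
    return suspects
```

Running it on the Parsed_Texts folder would return the names of files that probably need a second look or a different extraction method.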

There were 573 successful txt files and 36 files that failed to convert. That adds up to 609 rather than 608, and I'm not sure why.
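
One way to track down the extra file is to compare the names in the PDF folder against the names in the two logs. The sketch below assumes the three lists of names have already been read and normalized to the same form (for example, bare file names). The function name `reconcile` and its result keys are hypothetical, and the format of the log files is not assumed here.

```python
def reconcile(pdf_names, run_names, err_names):
    """Compare the PDF folder listing with the success and error logs.

    All three arguments are iterables of file names in the same form.
    Returns the names that could explain a mismatch in the counts.
    """
    pdfs, ok, failed = set(pdf_names), set(run_names), set(err_names)
    return {
        # Counted twice: logged as both converted and failed.
        "in_both_logs": sorted(ok & failed),
        # Never logged at all.
        "missing_from_logs": sorted(pdfs - ok - failed),
        # Logged, but not among the downloaded PDFs.
        "not_in_pdf_folder": sorted((ok | failed) - pdfs),
    }
```

A non-empty `in_both_logs` or `not_in_pdf_folder` entry would account for a total that is one higher than the number of downloaded PDFs.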