Difference between revisions of "Patent Thicket"

McNair Project
Patent Thicket
Project Information
Project Title	Patent Thicket
Owner	Grace Tan
Start Date	Summer 2018
Deadline
Primary Billing
Notes
Has project status	Active
Is dependent on	Google Scholar Crawler, PDF Downloader, PDF to Text Converter

Revision as of 14:46, 25 July 2018


@@ Line 18: / Line 18: @@
 ===Google Scholar Crawler===
-used [[]]
+used [[Google Scholar Crawler]]
 I used the selenium box and switched between the Rice Visitor, Rice Owls, and eduroam networks to prevent Google Scholar from blocking me.
@@ Line 24: / Line 24: @@
 ===Downloading PDFs===
-Used pdfdownloader.py
+Used [[PDF Downloader]]
 I tweaked the code to handle repeated file names.
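
A minimal sketch of one way to handle repeated file names. This is not the actual change made to PDF Downloader; the helper name `unique_path` and the `_1`, `_2` suffix scheme are assumptions chosen for illustration.

```python
from pathlib import Path


def unique_path(directory, filename):
    """Return a path inside `directory` that does not clash with an existing file.

    Hypothetical helper: if `paper.pdf` already exists, try `paper_1.pdf`,
    then `paper_2.pdf`, and so on until an unused name is found.
    """
    directory = Path(directory)
    candidate = directory / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        candidate = directory / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate
```

A downloader could call `unique_path(out_dir, name)` just before writing each PDF, so that two papers with the same file name do not overwrite each other.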
@@ Line 30: / Line 30: @@
 ===pdf_to_txt_bulk_PTLR.py===
+See [[PDF to Text Converter]]
 The code must be run in E because the libraries it uses are not in Z.
 I reinstalled pdfminer, which might be a problem in the future if the libraries change.
 This program converts all PDFs to txt files. It also generates two files, _LOG_ERR.txt and _LOG_RUN.txt, which list the names of the PDFs that could not be converted and those that were converted successfully, respectively. Some of the files that were successfully converted, especially the very small ones, don't contain the text of the paper.
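
The bulk conversion and logging described above could look roughly like the sketch below. It is not the code of pdf_to_txt_bulk_PTLR.py. The function names, the `min_chars` threshold, and the choice to write the logs into the output folder are all assumptions. The default converter assumes the pdfminer.six package and its `extract_text` helper; the original script may use a different pdfminer API.

```python
from pathlib import Path


def default_converter(pdf_path):
    # Assumption: pdfminer.six is installed. Imported lazily so the rest of
    # the sketch can be used with any other converter function.
    from pdfminer.high_level import extract_text
    return extract_text(str(pdf_path))


def convert_folder(pdf_dir, txt_dir, convert=default_converter, min_chars=200):
    """Convert every PDF in pdf_dir to a .txt file in txt_dir and write logs.

    Returns (converted, failed, suspicious), where `suspicious` lists files
    that converted without error but produced fewer than `min_chars`
    characters (a hypothetical threshold for "no real text extracted").
    """
    pdf_dir, txt_dir = Path(pdf_dir), Path(txt_dir)
    txt_dir.mkdir(parents=True, exist_ok=True)
    converted, failed, suspicious = [], [], []

    for pdf in sorted(pdf_dir.glob("*.pdf")):
        try:
            text = convert(pdf)
        except Exception:
            # Any conversion error is recorded; the run continues.
            failed.append(pdf.name)
            continue
        (txt_dir / (pdf.stem + ".txt")).write_text(text, encoding="utf-8")
        converted.append(pdf.name)
        if len(text.strip()) < min_chars:
            suspicious.append(pdf.name)

    # One file name per line, mirroring the two logs named in the notes.
    (txt_dir / "_LOG_RUN.txt").write_text("\n".join(converted), encoding="utf-8")
    (txt_dir / "_LOG_ERR.txt").write_text("\n".join(failed), encoding="utf-8")
    return converted, failed, suspicious
```

Checking the `suspicious` list after a run is one way to find the small output files that were logged as successful but contain little or none of the paper's text.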

Contents

Location of Files

Google Scholar Crawler

Downloading PDFs

pdf_to_txt_bulk_PTLR.py
