2) Classifying papers based on the matrix of term appearances that the current program builds.
 
 
'''10/02'''
 
Program finally outputting something useful, YAY. In FindKeyTerms.py (under McNair/Software/Google_Scholar_Crawler) I can input the path of a folder of txt files and it will scan all of them and search for the key words. It puts a report for every file in a new folder called KeyTerms, which appears in the input folder once the program terminates. An example file will be emailed to Lauren for corrections and adjustment. Each report currently lists every category on the codification page and gives 1) how many terms in that category appeared and 2) how many times each of those terms appeared. At the bottom, it suggests potential definitions for patent thicket in the paper, but this part is pretty poor for now and needs adjustment. On the bright side, the program executes absurdly quickly and we can get through hundreds of files in less than a minute. In addition, while the program is running I output a bag-of-words vector for each file into a folder called WordBags in the input folder, for later use by a neural net to classify the papers. We still need a relatively large training dataset.
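As a rough illustration of the scan (not the actual FindKeyTerms.py), here is a minimal sketch: the CATEGORIES dictionary is a stand-in for the codification page, and process_folder is a hypothetical helper name.

<pre>
# Sketch of the term-counting pass: scan every .txt file in a folder, count
# appearances of key terms by category, write one report per file to a
# KeyTerms/ subfolder, and dump a bag-of-words vector to WordBags/.
# CATEGORIES is a placeholder; the real term lists come from the codification page.
import os
import re
from collections import Counter

CATEGORIES = {
    "thicket_language": ["patent thicket", "overlapping patents"],
    "licensing": ["cross-license", "royalty stacking"],
}

def process_folder(folder):
    report_dir = os.path.join(folder, "KeyTerms")
    bag_dir = os.path.join(folder, "WordBags")
    os.makedirs(report_dir, exist_ok=True)
    os.makedirs(bag_dir, exist_ok=True)

    for name in os.listdir(folder):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(folder, name), encoding="utf-8", errors="ignore") as f:
            text = f.read().lower()

        # Per-category report: how many distinct terms matched and how often.
        with open(os.path.join(report_dir, name), "w", encoding="utf-8") as report:
            for category, terms in CATEGORIES.items():
                hits = {t: text.count(t) for t in terms if text.count(t) > 0}
                report.write(f"{category}: {len(hits)} terms appeared\n")
                for term, count in hits.items():
                    report.write(f"  {term}: {count}\n")

        # Bag-of-words vector (word -> count) for later classification.
        words = Counter(re.findall(r"[a-z']+", text))
        with open(os.path.join(bag_dir, name), "w", encoding="utf-8") as bag:
            for word, count in words.most_common():
                bag.write(f"{word} {count}\n")

process_folder("papers")  # path to the folder of txt files
</pre>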
 
Stuff to work on:
 
1) Neural net classification (computer suggesting which kind of paper it is; see the sketch after this list)
 
2) Improving patent thicket definition finding
 
3) Finding the authors and including them as a contributing feature in the vectors
 
4) Potentially going back to the Google Scholar problem to try to find the PDFs automatically.
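
For item 1, one possible starting point (not part of the current program) is training a small neural net on the WordBags output with scikit-learn. The labels.csv file, load_bag, and train below are hypothetical, since no labeled training set exists yet.

<pre>
# Hypothetical classification step: fit a small neural net on labeled
# bag-of-words vectors and use it to suggest a paper type for new papers.
# labels.csv is assumed to contain "filename,label" rows with no header.
import csv
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

def load_bag(path):
    # WordBags files are written as "word count" per line; rebuild a pseudo-text
    # by repeating each word so CountVectorizer can consume it.
    with open(path, encoding="utf-8") as f:
        return " ".join(
            " ".join([word] * int(count))
            for word, count in (line.split() for line in f if line.strip())
        )

def train(bag_dir, label_file):
    with open(label_file) as f:
        labels = dict(csv.reader(f))  # filename -> paper type
    files = [n for n in os.listdir(bag_dir) if n in labels]
    texts = [load_bag(os.path.join(bag_dir, n)) for n in files]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)
    clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500)
    clf.fit(X, [labels[n] for n in files])
    return vectorizer, clf

# vectorizer, clf = train("papers/WordBags", "labels.csv")
# clf.predict(vectorizer.transform([load_bag("papers/WordBags/new_paper.txt")]))
</pre>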
 
 