Grace Tan (Work Log)
Grace Tan Work Logs (log page)
Staff Information | |
---|---|
Status | Active |
2018-07-23: Found my missing pdfs. Seven could not be downloaded through the script for some reason, so I manually downloaded 6 of them and could not find the pdf for the remaining 1. I now have 612 pdfs on file and am ready to start converting them to txt tomorrow.
2018-07-20: Reinstalled pdfminer, and the test that GitHub provides works, so it should be installed correctly. Ran all the pdf urls through pdfdownloader.py but came up short with 602/613 pdfs. I found 8 papers that were not downloaded successfully, but that still leaves 3 mystery papers and I'm not sure which ones they are. Tried some SQL to figure it out, but that hasn't worked yet.
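One way to pin down the missing papers without SQL is a set difference between the urls fed to the downloader and the files actually on disk. This is a minimal sketch, not what pdfdownloader.py does: it assumes (hypothetically) that each pdf is saved under the last path segment of its url, and the helper names are made up for illustration. If two urls map to the same filename, one download would silently overwrite the other, which could account for "mystery" papers.

```python
import os
from collections import Counter
from urllib.parse import urlparse


def expected_filename(url):
    """Hypothetical naming rule: the last path segment of the url."""
    return os.path.basename(urlparse(url).path)


def find_missing(urls, downloaded_filenames):
    """Return the urls whose expected file is not among the downloaded files."""
    have = set(downloaded_filenames)
    return [u for u in urls if expected_filename(u) not in have]


def find_collisions(urls):
    """Return filenames that more than one url would be saved under."""
    counts = Counter(expected_filename(u) for u in urls)
    return sorted(name for name, n in counts.items() if n > 1)
```

In practice the second argument could come from `os.listdir` on the pdf folder.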
2018-07-19: Started on converting pdfs to txt files. I found pdf_to_txt_bulk_PTLR.py in E:/McNair/Software/Google_Scholar_Crawler. I copied this and moved my data to E:/McNair/Software/Patent_Thicket. When I first tried to run the program from Z, it gave me an error when importing pdfminer, so I installed pdfminer.six, and the import no longer complained under python3.6. It then gave me a different error about a missing package, which Wei said we shouldn't touch, so I moved everything back to E. It now runs but cannot convert any pdfs. I have no idea how to fix this.
2018-07-18: Finished running the rest of the 100 pages. It took quite a long time because google scholar was catching me after 2-5 pages rather than 5-10. It helped to switch between the different wifi networks (rice visitor, rice owls, eduroam). Altogether this resulted in 958 bibtex files and 613 pdfs from 1000 entries. There might be more entries but I'm not sure where to find them. I saved the data and code onto the rdp by connecting to it from the selenium box.
2018-07-17: Ran google scholar crawler. When google scholar blocks me with a 403 error code, I exit the program and rerun it at the page that it last looked at by clicking on the correct page number before crawling. I finished running through 68/100 pages of google scholar.
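The manual exit-and-rerun routine above could be automated with a small checkpoint file. The sketch below is an assumption-laden illustration rather than the crawler's actual code: `crawl_page`, the `Blocked` exception, and the checkpoint path are all hypothetical stand-ins for whatever the real crawler does when it sees a 403.

```python
import os


class Blocked(Exception):
    """Raised by the (hypothetical) page crawler when Google Scholar returns a 403."""


def load_checkpoint(path):
    """Return the last fully crawled page number, or 0 if there is no checkpoint."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        text = f.read().strip()
    return int(text) if text else 0


def save_checkpoint(path, page):
    """Record the last fully crawled page number."""
    with open(path, "w") as f:
        f.write(str(page))


def crawl_from_checkpoint(crawl_page, total_pages, checkpoint_path):
    """Crawl the pages after the checkpoint and stop at the first block.

    Returns the last page that was crawled successfully.
    """
    page = load_checkpoint(checkpoint_path)
    while page < total_pages:
        try:
            crawl_page(page + 1)
        except Blocked:
            break  # rerun later, e.g. after switching networks
        page += 1
        save_checkpoint(checkpoint_path, page)
    return page
```

Rerunning the same call after a block picks up at the page that failed, so no page is skipped.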
2018-07-16: Ran through 10 pages of google scholar first thing without a problem. Tried running through all 100 pages but kept on getting caught. Helped Augi with discrepancies in data and will try google scholar crawler again tomorrow.
2018-07-13: Fixed problem where pdf urls were not saving to txt file. Created another txt file to save urls that are not pdfs. Didn't run into a single recaptcha all morning. Towards the end, it started catching me at the 7th query and forced the program to restart. For some reason, selecting the css element triggered google scholar to find me. I changed the css element tag for the "next" button to the path and I was able to get through the 4th page. It is still not able to click on the actual link but I'm not sure if that's supposed to do anything.
2018-07-12: Figured out how to save BibTeX files to computer. Still had to do recaptcha tests. After giving it some time, I was able to run it completely once but I only got 49 BibTeX files and about half as many pdf links. When I tried to work on it further the recaptcha wasn't loading and gave me the error - "Cannot contact reCAPTCHA. Check your connection and try again." I ended up moving to the selenium computer and spent the rest of the day converting the code to python3 and messing with regex because for some reason it wasn't matching the text correctly.
2018-07-11: Started on Google Scholar Crawler for Patent Thicket Project. I'm not sure what the problem is. The code seems to work except that Google constantly blocks me and makes me do reCaptcha tests. I am also not sure if the crawler is saving any data to txt files and, if so, where those files are located.
2018-07-10: Finished LinkedIn Crawler! When no search results are found, the crawler now skips looking for the href and instead adds the founder and company name to a txt file. Spent way too much time doing reCaptcha tests and logging out and logging in again because firefox and the wifi were being slow. Cleaned up code and put it on the rdp as well as fixed the wiki page - Crunchbase Accelerator Founders.
2018-07-09: Continued working on the LinkedIn Crawler. I figured out how to get the web element of the first search result using path. You can achieve the same result by looking for the css element, which probably would have been easier. I then used get_attribute('href') on the web element of the first result to get the url of the founder's profile. Note that previously, there was another function to click on the name on the screen, which would open another window with the profile, but I found it easier to just extract the url. Next, I will run the (hopefully) working crawler on all the founder data. Update - ran into an error when there are no search results found.
2018-06-29: Spent a large part of today clicking on road signs and cars to prove to LinkedIn that I am not a robot. Figured out how to find search box with css element instead of xpath. Now trying to get information from search results.
2018-06-28: Tried to figure out the xpath for the search box but came up with no solutions. Halfway through, LinkedIn discovered that I was a bot, so I moved to the selenium computer and used the Rice Visitor wifi. LinkedIn still wouldn't let me in, so we made another test account (see project page for details). It finally let me in at the end of the day, but I set one of the delays to be 3-4 min so that I have time to do the tests that LinkedIn gives to ensure that you are not a bot. I still have no idea how to find the search box xpath.
2018-06-27: Took the dictionary of accelerator to founder UUIDs and formed a table with the UUIDs combined with names of founders, gender, and linkedin_url. The file is in Z:\crunchbase2\FounderAccInfo.txt. Started looking at the linkedin crawler documentation and got the crawler to get information from profiles with known urls. It crashed when it tried searching for founder names and accelerators, so I will work on that tomorrow.
2018-06-26: Ended up finding all founders manually. Then talked to Ed and figured out how to get the founder data off the crunchbase API with a link (see project page). Created a python script that goes through all the API pages for each accelerator API and returns a dictionary of accelerator UUIDs mapped to founder UUIDs. I found 209 founders on the API but 224 manually so we'll look at the discrepancy tomorrow.
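The page-by-page collection described above can be sketched as a loop that follows a "next page" pointer for each accelerator. The response shape here (`founder_uuids` and `next_page` keys) and the injected `fetch_page` function are hypothetical placeholders, not the real Crunchbase API format, which is documented on the project page.

```python
def collect_founders(accelerator_uuids, fetch_page):
    """Map each accelerator UUID to the list of founder UUIDs found on its API pages.

    fetch_page(accelerator_uuid, page_number) is a hypothetical callable that
    returns {"founder_uuids": [...], "next_page": int or None}.
    """
    founders = {}
    for acc in accelerator_uuids:
        uuids = []
        page = 1
        while page is not None:
            result = fetch_page(acc, page)
            uuids.extend(result["founder_uuids"])
            page = result["next_page"]  # None once the last page is reached
        founders[acc] = uuids
    return founders


def discrepancy(api_founders, manual_founders):
    """Return founders found manually but absent from the API results."""
    from_api = {f for uuids in api_founders.values() for f in uuids}
    return sorted(set(manual_founders) - from_api)
```

A helper like `discrepancy` is one way to list which manually found founders the API pass missed.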
2018-06-25: We took the 157 Accelerator UUIDs we found and created a new table called AccAllInfo that includes all the attributes of the accelerator that we want from organizations.csv. Maxine and I then split into our respective projects. I tried joining people to the companies they are linked to in order to find the founders of each accelerator. I found about 90 matches, but there are still a lot of holes since some accelerators have no founders and others have multiple founders. Still unsure of how to fix this.
2018-06-22: Matched Connor's master list of accelerators with organizations.csv based on homepage_url and company_name. Found 90 that matched along with 76 blanks. Then tried matching with homepage_url or company_name and manually found about 30 more that had slight variations in url or name that we should keep. Using ILIKE we found ~25 more company UUIDs that match with accelerators on the list.
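Many of the near-misses on homepage_url come down to scheme, "www.", case, or a trailing slash. The sketch below shows one possible normalization applied before matching. It is an illustration in Python rather than the SQL we actually used, the function names are hypothetical, and the inputs are assumed to be plain strings (blank urls simply never match).

```python
from urllib.parse import urlparse


def normalize_url(url):
    """Reduce a homepage url to lowercase host plus path, without scheme, www., or trailing slash."""
    url = url.strip().lower()
    if "://" not in url:
        url = "http://" + url  # let urlparse see a host even when the scheme is missing
    parsed = urlparse(url)
    host = parsed.netloc
    if host.startswith("www."):
        host = host[4:]
    return host + parsed.path.rstrip("/")


def match_by_url(accelerators, organizations):
    """Match (name, url) accelerators to (uuid, url) organizations on normalized url.

    Returns {accelerator name: organization uuid}.
    """
    index = {}
    for uuid, url in organizations:
        key = normalize_url(url)
        if key:
            index.setdefault(key, uuid)  # keep the first organization seen per url
    matches = {}
    for name, url in accelerators:
        key = normalize_url(url)
        if key in index:
            matches[name] = index[key]
    return matches
```

Names with slight variations would still need ILIKE-style or manual matching.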
2018-06-21: Downloaded all 17 v3.1 csv tables and updated LoadTables.sql to match our data. We did this by manually updating the name and size of the fields. To solve the problem of "" from yesterday, we used regular expressions to change the empty string to nothing (see project page). We then worked with Connor to start extracting the accelerators from the organizations in the Crunchbase data. We found a lot of null matches based on company_name and a few that have the same name but are actually different companies. Maybe try matching with homepage_url tomorrow.
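The exact regular expression we used is on the project page. As a rough illustration of the idea, the Python sketch below blanks out any csv field that consists solely of `""`, so that a date column sees an empty field instead of an empty string. The pattern and function name are assumptions for this example, and it assumes comma-delimited files where `,"",` does not occur inside a quoted field.

```python
import re

# A field that is exactly "" : preceded by start-of-text, comma, or newline,
# and followed by end-of-text, comma, or newline.
EMPTY_FIELD = re.compile(r'(?<![^,\r\n])""(?![^,\r\n])')


def blank_empty_strings(csv_text):
    """Replace fields that are exactly "" with nothing, leaving escaped quotes alone."""
    return EMPTY_FIELD.sub("", csv_text)
```

Doubled quotes inside a quoted field (the csv escape for a literal quote) are left untouched because they are not bounded by delimiters on both sides.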
2018-06-20: Learned more SQL. Started working on Crunchbase Data project with Maxine. Old code contained 22 csv tables but new Crunchbase data only has 17 csv tables. We will be using the new Crunchbase API v3.1 (not v3) with only 17 csv tables as data. We then started updating the old SQL tables to align with the 17 tables we have. We ran into a problem where the data had a field of "" for a date type, and SQL did not like that. Ed was helping us with this but we have not found a solution yet.
2018-06-19: Set up monitors and continued learning SQL. We were also introduced to our projects. I will be continuing Christy's work on the Google Scholar Crawler as well as working with Maxine to update the Crunchbase data and then use that data to crawl Linkedin to find data on startup founders that go through accelerators.
2018-06-18: Introduced to the wiki, connected to RDP, and learned SQL.