Jeemin Sim (Work Log)
2/6/2017 14:00-18:00 Set up wikiPage & remote desktop. Started working on a Python version of the web crawler. So far it successfully prints out a catchphrase/description for one website. Still to be worked on. The Python file can be found in: E:\McNair\Projects\Accelerators\Python WebCrawler\webcrawlerpython.py
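A minimal sketch of how a description could be pulled from one site. This is not the contents of webcrawlerpython.py. It assumes the catchphrase/description lives in the page's <meta name="description"> tag, which the log does not confirm, and the names DescriptionParser, extract_description, and fetch_description are hypothetical.

```python
import urllib.request
from html.parser import HTMLParser


class DescriptionParser(HTMLParser):
    """Collect the content of the first non-empty <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta" or self.description is not None:
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "description":
            self.description = attr_map.get("content", "").strip() or None


def extract_description(html_text):
    """Return the meta description of an HTML document, or None if absent."""
    parser = DescriptionParser()
    parser.feed(html_text)
    return parser.description


def fetch_description(url):
    """Download one page and return its meta description (assumed location)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        html_text = response.read().decode("utf-8", errors="replace")
    return extract_description(html_text)
```

Sites that do not fill in the meta tag would return None here, which is where falling back to text from an About page could come in.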
2/8/2017 9AM-11AM Attempted to come up with possible cases for locating the description of accelerators. Next step: pick up from extracting bodies of text from the About page (given that it exists).
2/13/2017 MONDAY 2PM-6PM Goals (for trials): 1) Build ER Diagram 2) For each entity, get XML snippet 3) Build a parser/ripper for a single file; the Python parser can be found at: E:\McNair\Projects\FDA Trials\Jeemin_Project Trial Data Project
2/15/2017 WEDNESDAY 9AM-11AM Discussed with Catherine what to do with the FDA Trial data and decided to build a dictionary with zip codes as keys and the number of trials that occurred in each zip code as values. Was still attempting to loop through the files without the code having to exist in the same directory as the XML files. Plan to write to Excel via TSV, with zip code as one column and number of occurrences as the other.
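A rough sketch of the plan above, not the actual project code. It assumes one trial per XML file and that zip codes sit in elements named "zip"; both are assumptions, as are the function names and the column headers.

```python
import csv
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def count_zip_codes(xml_dir):
    """Count, for each zip code, how many files (trials) mention it."""
    counts = Counter()
    # Path.glob takes any directory, so the script need not sit next to the data.
    for xml_path in sorted(Path(xml_dir).glob("*.xml")):
        try:
            root = ET.parse(xml_path).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        # "zip" is an assumed tag name; a set counts each trial once per zip code.
        zips_in_trial = {
            node.text.strip()
            for node in root.iter("zip")
            if node.text and node.text.strip()
        }
        counts.update(zips_in_trial)
    return counts


def write_counts_tsv(counts, out_path):
    """Write zip code and occurrence count as two tab-separated columns."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["zip_code", "occurrences"])
        for zip_code, n in sorted(counts.items()):
            writer.writerow([zip_code, n])
```

The resulting .tsv file opens directly in Excel. Zip codes are kept as strings so that leading zeros survive in the file itself.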
2/17/2017 FRIDAY 2PM-6PM Completed code for counting the number of occurrences of each unique zip code (currently titled & located: E:\McNair\Projects\FDA Trials\Jeemin_Project\Jeemin_Running_File.py). It has been running for 20+ minutes because of the large volume of XML data files. Meanwhile, started coding a dictionary whose keys are the unique trial IDs, each mapped to all other information for that trial (location, sponsors, phase, drugs, etc.) (currently titled & located: E:\McNair\Projects\FDA Trials\Jeemin_Project\Jeemin_FDATrial_as_key_data_ripping.py).
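A small sketch of the trial-ID-keyed dictionary described above, not the contents of Jeemin_FDATrial_as_key_data_ripping.py. The tag names in ID_TAG and FIELD_TAGS are assumptions about the XML layout, and rip_trial and build_trial_index are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Assumed tag names; adjust to whatever the trial XML files actually use.
ID_TAG = "nct_id"
FIELD_TAGS = ("phase", "agency", "intervention_name", "city", "zip")


def rip_trial(xml_text):
    """Return (trial_id, record) for one trial, or (None, {}) if it has no ID."""
    root = ET.fromstring(xml_text)
    trial_id = (root.findtext(f".//{ID_TAG}") or "").strip()
    if not trial_id:
        return None, {}
    record = {}
    for tag in FIELD_TAGS:
        # A tag can repeat (several locations, for example), so keep a list.
        record[tag] = [
            node.text.strip()
            for node in root.iter(tag)
            if node.text and node.text.strip()
        ]
    return trial_id, record


def build_trial_index(xml_texts):
    """Map each trial ID to its record of extracted fields."""
    trials = {}
    for xml_text in xml_texts:
        trial_id, record = rip_trial(xml_text)
        if trial_id is not None:
            trials[trial_id] = record
    return trials
```

Keeping every field as a list makes it straightforward to later split the dictionary into separate tables, one row per (trial ID, value) pair.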
2/20/2017 MONDAY 2PM-4:30PM Continued working on Jeemin_FDATrial_as_key_data_ripping.py to find tags and place all of that information in a list. The other zip code file did not finish executing after 2+ hours of running - considering splitting the record file into smaller pieces, or running the processing on a faster machine.
2/22/2017 WEDNESDAY 9AM-12:30PM Finished Jeemin_FDATrial_as_key_data_ripping.py (E:\McNair\Projects\FDA Trials\Jeemin_Project\Jeemin_FDATrial_as_key_data_ripping.py), which outputs to E:\McNair\Projects\FDA Trials\Jeemin_Project\general_data_ripping_output.txt; TODO: output four different tables & replace the write in the same for-loop that goes through each file
2/24/2017 FRIDAY 2:30PM-6:30PM Continued working on producing multiple tables - first two are done. Was working on location, as there are multiple location tags per location.
2/27/2017 MONDAY 2PM-6PM Finished producing tables from Jeemin_FDATrial_as_key_data_ripping.py. Talked to Julia about LinkedIn data extraction - to be discussed further with Julia & Peter. Started a web crawler for Wikipedia - it currently pulls the Endowment, Academic staff, Students, Undergraduates, and Postgraduates info found on the Rice Wikipedia page. Can be found in: E:\McNair\Projects\University Patents\Jeemin_University_wikipedia_crawler.py
3/1/2017 WEDNESDAY 9AM-12PM Started re-running Jeemin_FDATrial_as_key_data_ripping.py
3/3/2017 FRIDAY 2PM-5PM Attempted to output SQL tables
3/6/2017 MONDAY 2PM-6PM Installing python in a database PostgreSQL Instructions