Difference between revisions of "Jeemin Sim (Work Log)"

From edegan.com

Revision as of 13:56, 14 March 2017

2/6/2017 MONDAY 2PM-6PM

  • Set up wikiPage & remote desktop.
  • Started working on python version of web crawler. So far it successfully prints out a catchphrase/description for one website. Still a work in progress. The python file can be found in: E:\McNair\Projects\Accelerators\Python WebCrawler\webcrawlerpython.py
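A minimal sketch of pulling a catchphrase/description out of one page, assuming the description lives in the page's meta description tag with the title as a fallback. This is not the code in webcrawlerpython.py; the class name, function name, and sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class DescriptionParser(HTMLParser):
    """Collects the <title> text and the content of <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_description(html):
    """Return the meta description if present, otherwise the page title."""
    parser = DescriptionParser()
    parser.feed(html)
    return (parser.description or parser.title).strip()


# Hypothetical page source; a real crawler would first download the page.
sample_html = """<html><head><title>Example Accelerator</title>
<meta name="description" content="We help founders launch."></head>
<body></body></html>"""
print(extract_description(sample_html))
```

Pages without a meta description fall back to the title, which is one of the "possible cases" a crawler like this has to handle.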

2/8/2017 WEDNESDAY 9AM-11AM

  • Attempted to come up with possible cases for locating the description of accelerators. Next step: pick up from extracting bodies of text from the About page (if one exists).

2/13/2017 MONDAY 2PM-6PM

  • Goals (for trials): 1) Build ER Diagram 2) For each entity, get XML snippet 3) Build a parser/ripper for single file; the python parser can be found at: E:\McNair\Projects\FDA Trials\Jeemin_Project

Trial Data Project

2/15/2017 WEDNESDAY 9AM-11AM

  • Discussed with Catherine what to do with FDA Trial data and decided to build a dictionary with zip codes as keys and the number of trials that occurred in each zip code as values. Was still attempting to loop through the XML files without the code having to live in the same directory as them. Plan to write the result to Excel via a TSV file, with zip code as one column and number of occurrences as the other.
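The plan above could be sketched roughly as follows. The assumptions here are not from the log: one trial per XML file, zip codes stored in a tag named zip, and each trial counted once per zip code. The directory is passed in as an argument, so the script does not have to sit next to the XML files.

```python
import csv
import os
import xml.etree.ElementTree as ET
from collections import Counter


def count_zipcodes(xml_dir):
    """Count trials per zip code across every .xml file in xml_dir."""
    counts = Counter()
    for name in sorted(os.listdir(xml_dir)):
        if not name.lower().endswith(".xml"):
            continue
        tree = ET.parse(os.path.join(xml_dir, name))
        # A set, so a trial with several sites in one zip code counts once.
        # The tag name "zip" is an assumption about the XML layout.
        zips = {el.text.strip() for el in tree.iter("zip")
                if el.text and el.text.strip()}
        counts.update(zips)
    return counts


def write_counts_tsv(counts, out_path):
    """Write a two-column, tab-separated file that Excel can open."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["zipcode", "trial_count"])
        for zipcode, n in sorted(counts.items()):
            writer.writerow([zipcode, n])
```

Calling count_zipcodes with the folder that holds the XML files and passing the result to write_counts_tsv gives the two-column file described in the entry.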

2/17/2017 FRIDAY 2PM-6PM

  • Completed code for counting the number of occurrences of each unique zipcode (currently titled & located: E:\McNair\Projects\FDA Trials\Jeemin_Project\Jeemin_Running_File.py). It has been running for 20+ min because the XML data files are so extensive. Meanwhile, started coding a dictionary keyed by each unique trial ID and mapped to all other information (location, sponsors, phase, drugs, etc.) (currently titled & located: E:\McNair\Projects\FDA Trials\Jeemin_Project\Jeemin_FDATrial_as_key_data_ripping.py).
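A sketch of what a dictionary keyed by trial ID might look like. The tag names (nct_id, phase, sponsor, location, city, zip) and the sample record are assumptions for illustration; the real XML layout and the code in Jeemin_FDATrial_as_key_data_ripping.py may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the tag names and values are placeholders.
SAMPLE = """<clinical_study>
  <nct_id>NCT00000000</nct_id>
  <phase>Phase 2</phase>
  <sponsor>Example Sponsor</sponsor>
  <location><city>Houston</city><zip>77005</zip></location>
  <location><city>Austin</city><zip>78701</zip></location>
</clinical_study>"""


def rip_trial(xml_text):
    """Return (trial_id, info) for one trial record."""
    root = ET.fromstring(xml_text)
    trial_id = root.findtext("nct_id")
    info = {
        "phase": root.findtext("phase"),
        "sponsors": [s.text for s in root.findall("sponsor")],
        # One trial can list several locations, so keep them as a list.
        "locations": [
            {"city": loc.findtext("city"), "zip": loc.findtext("zip")}
            for loc in root.findall("location")
        ],
    }
    return trial_id, info


trials = {}
trial_id, info = rip_trial(SAMPLE)
trials[trial_id] = info
print(trials)
```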

2/20/2017 MONDAY 2PM-4:30PM

  • Continued working on Jeemin_FDATrial_as_key_data_ripping.py to find tags and place all of that information in a list. The other (zipcode) script did not finish executing after 2+ hours of running - considering the possibility of splitting the record file into smaller pieces, or running the processing on a faster machine.
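One way to split the work into smaller pieces, sketched under the assumption that the input can be listed as separate items (for example file names): process a fixed-size batch at a time, so partial results can be saved between batches. The helper name is hypothetical.

```python
def batches(items, size):
    """Yield consecutive slices of items, each with at most size elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical file names, for illustration only.
files = ["trial_%d.xml" % i for i in range(5)]
for batch in batches(files, 2):
    print(batch)
```

Whether this actually shortens the run depends on where the time goes, which the log does not say.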

2/22/2017 WEDNESDAY 9AM-12:30PM

  • Finished Jeemin_FDATrial_as_key_data_ripping.py (E:\McNair\Projects\FDA Trials\Jeemin_Project\Jeemin_FDATrial_as_key_data_ripping.py), which outputs to E:\McNair\Projects\FDA Trials\Jeemin_Project\general_data_ripping_output.txt; TODO: output four different tables & replace the write in the same for-loop as going through each file

2/24/2017 FRIDAY 2:30PM-6:30PM

  • Continued working on producing multiple tables - first two are done. Was working on location, as there are multiple location tags per location.
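Because a trial can carry several location tags, locations fit more naturally in their own table, with one row per location and the trial ID repeated as a key. A rough sketch, assuming the ripped data is a dictionary keyed by trial ID; the field names and sample values are placeholders, not the actual table layout.

```python
def build_tables(trials):
    """Split a trial dictionary into a trial table and a location table."""
    trial_rows = []
    location_rows = []
    for trial_id, info in sorted(trials.items()):
        trial_rows.append((trial_id, info.get("phase", "")))
        # One row per location, each pointing back to its trial.
        for loc in info.get("locations", []):
            location_rows.append((trial_id, loc.get("city", ""), loc.get("zip", "")))
    return trial_rows, location_rows


def to_tsv(rows):
    """Render rows as tab-separated text, one row per line."""
    return "\n".join("\t".join(str(v) for v in row) for row in rows)


# Hypothetical data for illustration.
sample_trials = {
    "NCT00000000": {
        "phase": "Phase 2",
        "locations": [
            {"city": "Houston", "zip": "77005"},
            {"city": "Austin", "zip": "78701"},
        ],
    }
}
trial_rows, location_rows = build_tables(sample_trials)
print(to_tsv(location_rows))
```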

2/27/2017 MONDAY 2PM-6PM

  • Finished producing tables from Jeemin_FDATrial_as_key_data_ripping.py
  • Talked to Julia about LinkedIn data extracting - to be discussed further with Julia & Peter.
  • Started web crawler for Wikipedia - currently pulls Endowment, Academic staff, students, undergraduates, and postgraduates info found on Rice Wikipedia page. Can be found in: E:\McNair\Projects\University Patents\Jeemin_University_wikipedia_crawler.py
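A sketch of how label/value pairs such as Endowment or Academic staff could be pulled from an infobox-style table, assuming each field is a table row with the label in a th cell and the value in a td cell. The sample HTML and its numbers are placeholders, not real Rice data, and this is not the code in Jeemin_University_wikipedia_crawler.py.

```python
from html.parser import HTMLParser


class InfoboxParser(HTMLParser):
    """Collect label/value pairs from the <th>/<td> cells of table rows."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._cell = None  # "th" or "td" while inside such a cell
        self._label = []
        self._value = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._label, self._value = [], []
        elif tag in ("th", "td"):
            self._cell = tag

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._cell = None
        elif tag == "tr":
            # Collapse runs of whitespace before storing the pair.
            label = " ".join("".join(self._label).split())
            value = " ".join("".join(self._value).split())
            if label and value:
                self.fields[label] = value

    def handle_data(self, data):
        if self._cell == "th":
            self._label.append(data)
        elif self._cell == "td":
            self._value.append(data)


# Placeholder markup and values, for illustration only.
SAMPLE = """<table class="infobox">
<tr><th>Endowment</th><td>$1.0 billion</td></tr>
<tr><th>Academic staff</th><td><a href="#">500</a></td></tr>
<tr><th>Students</th><td>6,000</td></tr>
</table>"""

WANTED = ["Endowment", "Academic staff", "Students", "Undergraduates", "Postgraduates"]
parser = InfoboxParser()
parser.feed(SAMPLE)
pulled = {key: parser.fields.get(key) for key in WANTED}
print(pulled)
```

Fields missing from a page come back as None, which makes gaps easy to spot when the crawler is pointed at other universities.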

3/1/2017 WEDNESDAY 9AM-12PM

  • Started re-running Jeemin_FDATrial_as_key_data_ripping.py

3/3/2017 FRIDAY 2PM-5PM

  • Attempted to output sql tables

3/6/2017 MONDAY 2PM-6PM

  • Installing python in a database
  • Added building Python function section to Working with PostgreSQL at the bottom of the page.
  • Ran FDA Trial data ripping again, as the text output files were wiped.
  • Plan on discussing with Julia and Meghana again about pulling universities and other relevant institutions from the Assignee List USA.
  • Talked to Sonia about pulling city, state, and zipcode information, which is why python was installed in a database. Will work with Sonia on Wednesday afternoon to see how best a regex function could be implemented.
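A possible starting point for the regex function, written as plain Python so it can be tested outside the database first. It assumes US-style text ending in "City, ST 12345" (optionally with a ZIP+4 suffix); the pattern and function name are illustrative, not an agreed design.

```python
import re

# Assumed shape: "<city>, <two-letter state> <5-digit zip>[-<4 digits>]" at the end.
CITY_STATE_ZIP = re.compile(
    r"(?P<city>[A-Za-z .'-]+?),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)\s*$"
)


def parse_city_state_zip(text):
    """Return (city, state, zip) if text ends with that pattern, else None."""
    match = CITY_STATE_ZIP.search(text)
    if match is None:
        return None
    return match.group("city").strip(), match.group("state"), match.group("zip")


print(parse_city_state_zip("6100 Main St, Houston, TX 77005"))
print(parse_city_state_zip("no address here"))
```

Addresses in other layouts (no comma, spelled-out state names, foreign postal codes) would need extra cases.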

3/8/2017 WEDNESDAY 9AM-12PM

  • Output sql tables from finished run of Jeemin_FDATrial_as_key_data_ripping.py
  • Ran through assigneelist_USA.txt to see how many different ways UNIVERSITY could be spelled wrong. There were many.
  • Tried to reason through creating a pattern that could catch all the different versions of UNIVERSITY. To discuss further: whether UNIVERSITIES, and names that include UNIVERSITIES but end in INC, should be pulled as relevant information.
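One way such a pattern could be built is to keep the consonant skeleton of the word (UN, V, S, T) fixed and allow a few arbitrary letters in between, so dropped, doubled, or swapped letters still match. This is only a sketch with a made-up pattern, not the regex actually used on assigneelist_USA.txt; it assumes upper-case names and does not catch abbreviations such as UNIV.

```python
import re

# Skeleton U-N ... V ... S ... T ... with a few free letters between the anchors.
UNIVERSITY_RE = re.compile(r"\bUN[A-Z]{0,3}V[A-Z]{0,4}S[A-Z]{0,3}T[A-Z]{0,3}\b")


def looks_like_university(name):
    """True if name contains UNIVERSITY, UNIVERSITIES, or a near misspelling."""
    return UNIVERSITY_RE.search(name) is not None


# Made-up assignee names for illustration.
for name in ["RICE UNIVERSITY", "RICE UNIVERISTY", "STATE UNVERSITY",
             "UNIVERSAL ELECTRONICS INC", "UNITED TECHNOLOGIES"]:
    print(name, looks_like_university(name))
```

A loose pattern like this trades precision for recall, so the hits would still need a manual check for false positives.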

3/8/2017 WEDNESDAY 2PM-5PM

  • Wrote regex pattern that identifies all "university" matches - the output file can be found at E:\McNair\Projects\University Patents\university_pulled_from_assignee_list_USA
  • Talked to Sonia, but didn't come to a solid conclusion on how to identify, by running a python function, whether key words are associated with a city or a country.

3/13/2017 MONDAY 12PM-2PM

  • For University Patent Data Matching - matched SCHOOL (output: E:\McNair\Projects\University Patents\school_pulled_from_assignee_list_USA) and matched INSTITUTE (output: E:\McNair\Projects\University Patents\institute_pulled_from_assignee_list_USA).
  • University Patent Matching
  • To be worked on later: Grant XML parsing & general name matcher

3/14/2017 TUESDAY 12PM-2PM

  • Started pulling academy cases but there are too many cases to worry about, in terms of institution of interest. A document is located in E:\McNair\Projects\University Patents\academies_verify_cases.txt
  • Need Julia/Meghana to look through the hits and see which are relevant & extract pattern from there.
  • Having trouble outputting txt file without double quotes around every line.
  • Thinking that one text file should be output for all keywords instead of one per keyword, to avoid overlap. For example, COLLEGE and UNIVERSITY are both keywords, so ALBERT EINSTEIN COLLEGE OF YESHIVA UNIVERSITY would be hit twice if the keywords were handled separately, once for COLLEGE and once for UNIVERSITY. Could be done either with if-elseif statements or with one big regex check.
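The "one big regex check" idea could look something like the sketch below: a single alternation over all keywords, so each name is tested once and written once no matter how many keywords it contains. The keyword list is assembled from the entries above, and the function names are hypothetical. Lines are written with plain file writes, which add no quoting; whether that removes the double quotes depends on what was producing them, which the log does not say.

```python
import re

# Keywords mentioned in the log; \b keeps e.g. SCHOOLS or COLLEGES out.
KEYWORDS = ["UNIVERSITY", "COLLEGE", "SCHOOL", "INSTITUTE", "ACADEMY"]
KEYWORD_RE = re.compile(r"\b(?:" + "|".join(KEYWORDS) + r")\b")


def pull_institutions(names):
    """Return each name containing at least one keyword, once, in input order."""
    seen = set()
    hits = []
    for name in names:
        if KEYWORD_RE.search(name) and name not in seen:
            seen.add(name)
            hits.append(name)
    return hits


def write_lines(lines, out_path):
    """Write one name per line, exactly as given, with no added quoting."""
    with open(out_path, "w") as f:
        for line in lines:
            f.write(line + "\n")


names = [
    "ALBERT EINSTEIN COLLEGE OF YESHIVA UNIVERSITY",
    "ACME INC",  # made-up non-matching name
]
print(pull_institutions(names))
```

With separate per-keyword passes the first name would appear in both the COLLEGE and the UNIVERSITY output; here it appears once.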