===Fall 2017===
<onlyinclude>
[[Peter Jalbert]] [[Work Logs]] [[Peter Jalbert (Work Log)|(log page)]]
2017-11-07: Created file with 0s and 1s detailing whether crunchbase has the founder information for an accelerator. Details posted as a TODO on [http://mcnair.bakerinstitute.org/wiki/Accelerator_Seed_List_(Data) Accelerator Seed List] page. Still waiting for feedback on the PostGIS installation from [http://mcnair.bakerinstitute.org/wiki/Tiger_Geocoder Tiger Geocoder]. Continued working on Accelerator Google Crawler.
2017-11-06: Contacted Geography Center for the US Census Bureau, [https://www.census.gov/geo/about/contact.html here], and began email exchange on PostGIS installation problems. Began working on the [http://mcnair.bakerinstitute.org/wiki/Selenium_Documentation Selenium Documentation]. Also began working on an Accelerator Google Crawler that will be used with Yang and ML to find Demo Days for cohort companies.
2017-11-01: Attempted to continue downloading, however ran into HTTP Forbidden errors. Listed the errors on the [http://mcnair.bakerinstitute.org/wiki/Tiger_Geocoder Tiger Geocoder Page].
2017-10-31: Began downloading blocks of data for individual states for the [http://mcnair.bakerinstitute.org/wiki/Tiger_Geocoder Tiger Geocoder] project. Wrote out the new wiki page for installation, and began writing documentation on usage.
2017-10-30: With Ed's help, was able to get the national data from Tiger installed onto a database server. The process required much jumping around and changing users, and all the things we learned are outlined in [http://mcnair.bakerinstitute.org/wiki/Database_Server_Documentation#Editing_Users the database server documentation] under "Editing Users".
2017-10-25: Continued working on the [http://mcnair.bakerinstitute.org/wiki/PostGIS_Installation TigerCoder Installation].
2017-10-24: Throw some addresses into a database, use address normalizer and geocoder. May need to install things. Details on the installation process can be found on the [http://mcnair.bakerinstitute.org/wiki/PostGIS_Installation PostGIS Installation page].
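The address-to-coordinates step above can be sketched with the PostGIS Tiger geocoder's `normalize_address()` and `geocode()` functions. This is a sketch only: the connection details are placeholders, and the chosen output columns are one reasonable selection, not the project's actual setup.

```python
# Sketch of the address -> coordinates step with the PostGIS Tiger geocoder.
# normalize_address() and geocode() are Tiger extension functions; the
# connection details below are placeholders, not the project's real setup.
NORMALIZE_SQL = "SELECT pprint_addy(normalize_address(%s));"
GEOCODE_SQL = (
    "SELECT g.rating, pprint_addy(g.addy), "
    "ST_Y(g.geomout) AS lat, ST_X(g.geomout) AS lon "
    "FROM geocode(%s, 1) AS g;"
)

def geocode_one(cur, address):
    # Returns (rating, normalized_address, lat, lon) for the best match.
    cur.execute(GEOCODE_SQL, (address,))
    return cur.fetchone()

# Usage (requires psycopg2 and a database with the tiger extension):
# import psycopg2
# cur = psycopg2.connect(dbname="...").cursor()  # placeholder dbname
# print(geocode_one(cur, "6100 Main St, Houston, TX 77005"))
```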
2017-10-23: Finished Yelp crawler for [http://mcnair.bakerinstitute.org/wiki/Houston_Innovation_District Houston Innovation District Project].
2017-10-19: Continued work on Yelp crawler for [http://mcnair.bakerinstitute.org/wiki/Houston_Innovation_District Houston Innovation District Project].
2017-10-18: Continued work on Yelp crawler for [http://mcnair.bakerinstitute.org/wiki/Houston_Innovation_District Houston Innovation District Project].
2017-10-17: Constructed ArcGIS maps for the agglomeration project. Finished maps of points for every year in the state of California. Finished maps of Route 128. Began working on selenium Yelp crawler to get cafe locations within the 610-loop.
2017-10-16: Assisted Harrison on the USITC project. Looked for natural language processing tools to extract complainants and defendants along with their location from case files. Experimented with pulling based on parts of speech tags, as well as using geotext or geograpy to pull locations from a case segment.
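The geotext/geograpy idea above is essentially gazetteer matching. A minimal sketch: the four-city list here is a toy stand-in for the GeoNames data those libraries actually ship with, so this only illustrates the matching step, not the real coverage.

```python
# Toy gazetteer matcher in the spirit of geotext/geograpy. GAZETTEER is a
# tiny sample, not real GeoNames data.
import re

GAZETTEER = {"Houston", "Boston", "San Jose", "Washington"}

def find_locations(segment):
    # Longest names first so multi-word names like "San Jose" match whole.
    found = []
    for city in sorted(GAZETTEER, key=len, reverse=True):
        if re.search(r"\b%s\b" % re.escape(city), segment):
            found.append(city)
    return found
```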
2017-10-13: Updated various project wiki pages.
2017-10-12: Continued work on Patent Thicket project, awaiting further project specs.
2017-10-05: Emergency ArcGIS creation for Agglomeration project.
2017-10-04: Emergency ArcGIS creation for Agglomeration project.
2017-10-02: Worked on ArcGIS data. See Harrison's Work Log for details.
2017-09-28: Added collaborative editing feature to PyCharm.
2017-09-27: Worked on the big database file.
2017-09-25: New task -- Create text file with company, description, and company type.
#[http://mcnair.bakerinstitute.org/wiki/VC_Database_Rebuild VC Database Rebuild]
#psql vcdb2
#table name, sdccompanybasecore2
#Combine with Crunchbasebulk
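The steps above (query `sdccompanybasecore2` in `vcdb2`, write a company/description/type text file) can be sketched as below. The table name comes from the notes; the column names are assumptions about the schema, not verified.

```python
# Sketch of the company/description/type pull from vcdb2. Column names
# (companyname, description, companytype) are assumptions.
QUERY = "SELECT companyname, description, companytype FROM sdccompanybasecore2;"

def format_row(row):
    # One tab-delimited line per company; None becomes an empty field.
    return "\t".join("" if v is None else str(v) for v in row)

def write_company_file(cur, path):
    cur.execute(QUERY)
    with open(path, "w") as f:
        for row in cur.fetchall():
            f.write(format_row(row) + "\n")

# Usage (requires psycopg2):
# import psycopg2
# cur = psycopg2.connect(dbname="vcdb2").cursor()
# write_company_file(cur, "companies.txt")
```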
#TODO: Write wiki on linkedin crawler, write wiki on creating accounts.
2017-09-21: Wrote wiki on Linkedin crawler, met with Laura about patents project.
2017-09-20: Finished running linkedin crawler. Transferred data to RDP. Will write wikis next.
2017-09-19: Began running linkedin crawler. Helped Yang create RDP account, get permissions, and get wiki setup.
2017-09-18: Finished implementation of Experience Crawler, continued working on Education Crawler for LinkedIn.
2017-09-14: Continued implementing LinkedIn Crawler for profiles.
2017-09-13: Implemented LinkedIn Crawler for main portion of profiles. Began working on crawling Experience section of profiles.
2017-09-12: Continued working on the [http://mcnair.bakerinstitute.org/wiki/LinkedIn_Crawler_(Python) LinkedIn Crawler for Accelerator Founders Data]. Added to the wiki on this topic.
2017-09-11: Continued working on the [http://mcnair.bakerinstitute.org/wiki/LinkedIn_Crawler_(Python) LinkedIn Crawler for Accelerator Founders Data].
2017-09-06: Combined founders data retrieved with the Crunchbase API with the crunchbasebulk data to get linkedin urls for different accelerator founders. For more information, see [http://mcnair.bakerinstitute.org/wiki/Crunchbase_Data here].
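The API-to-bulk join above can be sketched as a dictionary lookup on normalized names. The lowercased-name key convention and field shapes are assumptions for illustration, not the project's actual matching logic.

```python
# Sketch of joining Crunchbase-API founder names to a name -> LinkedIn-URL
# mapping derived from crunchbasebulk. Key normalization (strip + lower)
# is an assumption for illustration.
def attach_linkedin_urls(api_founders, bulk_people):
    """api_founders: list of names; bulk_people: dict lowercased name -> url."""
    matched = {}
    for name in api_founders:
        url = bulk_people.get(name.strip().lower())
        if url:
            matched[name] = url
    return matched
```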
2017-09-05: Post Harvey. Finished retrieving names from the Crunchbase API on founders. Next step is to query crunchbase bulk database to get linkedin urls. For more information, see [http://mcnair.bakerinstitute.org/wiki/Crunchbase_Data here].
2017-08-24: Began using the Crunchbase API to retrieve founder information for accelerators. Halfway through compiling a dictionary that translates accelerator names into proper Crunchbase API URLs.
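The name-to-URL dictionary above can be sketched as below. The v3.1 endpoint shape and the permalink convention (lowercase, hyphen-separated) are assumptions about the 2017 Crunchbase API, and the key is a placeholder.

```python
# Sketch of the accelerator-name -> Crunchbase API URL dictionary. The
# endpoint shape and permalink convention are assumptions; USER_KEY is a
# placeholder.
import re

def to_permalink(name):
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

def api_url(name, user_key="USER_KEY"):
    return ("https://api.crunchbase.com/v3.1/organizations/"
            + to_permalink(name) + "?user_key=" + user_key)

accelerators = ["Y Combinator", "500 Startups"]
url_dict = {name: api_url(name) for name in accelerators}
```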
2017-08-23: Decided with Ed to abandon LinkedIn crawling to retrieve accelerator founder data, and instead use crunchbase. Spent the day navigating the crunchbasebulk database, and seeing what useful information was contained in it.
2017-08-22: Discovered that LinkedIn Profiles cannot be viewed through LinkedIn if the target is 3rd degree or further. However, if entering LinkedIn through a Google search, the profile can still be viewed if the user has previously logged into LinkedIn. Devising a workaround crawler that utilizes Google search. Continued blog post [http://mcnair.bakerinstitute.org/wiki/LinkedIn_Crawler_(Python) here] under Section 4.
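A minimal sketch of the Google-entry workaround described above: build a Google query restricted to LinkedIn profile pages and open the first hit from there, rather than navigating inside LinkedIn. The query shape is an assumption, and the names below are hypothetical.

```python
# Sketch of entering LinkedIn via Google search. Query shape is an
# assumption; "Jane Doe" / "Techstars" are hypothetical examples.
from urllib.parse import quote_plus

def google_profile_search_url(person, company):
    query = 'site:linkedin.com/in "%s" "%s"' % (person, company)
    return "https://www.google.com/search?q=" + quote_plus(query)

# Driving it (requires selenium and a Chrome profile already logged in to
# LinkedIn, per the observation above):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get(google_profile_search_url("Jane Doe", "Techstars"))
# driver.find_element_by_partial_link_text("LinkedIn").click()
```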
2017-08-21: Began work on extracting founders for accelerators through LinkedIn Crawler. Discovered that Python3 is not installed on RDP, so the virtual environment for the project cannot be fired up. Continued working on Ubuntu machine.
</onlyinclude>
===Spring 2017===
1/10/2017 14:30-17:15: Continued making text files for the [http://mcnair.bakerinstitute.org/wiki/Accelerator_Seed_List_(Data) Accelerator Seed List project]. Downloaded pdfs in the background for the [http://mcnair.bakerinstitute.org/wiki/Moroccan_Parliament_Web_Crawler Moroccan Government Crawler Project].
5/1/2017 13:00-17:00: Continued work on HTML Parser. Uploaded all semester projects to git server.
===Fall 2016===
09/27/2016 15:00-18:00: Set up Staff wiki page, work log page; registered for Slack, Microsoft Remote Desktop; downloaded Selenium on personal computer, read Selenium docs. Created wiki page for Moroccan Web Driver Project.
09/29/2016 15:00-18:00: Re-enrolled in Microsoft Remote Desktop with proper authentication, set up Selenium environment and Komodo IDE on Remote Desktop, wrote program using Selenium that goes to a link and opens up the print dialog box. Developed computational recipe for a different approach to the problem.
09/30/2016 12:00-14:00: Selenium program selects view pdf option from the website, and goes to the pdf webpage. Program then switches handle to the new page. CTRL S is sent to the page to launch save dialog window. Text cannot be sent to this window. Brainstorming ways around this issue. Explored Chrome Options for saving automatically without a dialog window. Looking into other libraries besides selenium that may help.
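One Chrome Options route around the save dialog: profile preferences that make Chrome download PDFs directly with no prompt. A sketch under stated assumptions: the download directory is a placeholder, and the selenium 3.x wiring is shown in comments.

```python
# Chrome preferences that skip the save dialog and download PDFs directly
# instead of opening the viewer. DOWNLOAD_DIR is a placeholder path.
DOWNLOAD_DIR = r"E:\McNair\Projects\downloads"  # placeholder

chrome_prefs = {
    "download.default_directory": DOWNLOAD_DIR,
    "download.prompt_for_download": False,       # no save dialog
    "plugins.always_open_pdf_externally": True,  # download PDFs, don't preview
}

# Wiring it up (requires selenium + chromedriver):
# from selenium import webdriver
# opts = webdriver.ChromeOptions()
# opts.add_experimental_option("prefs", chrome_prefs)
# driver = webdriver.Chrome(chrome_options=opts)  # selenium 3.x signature
```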
10/3/2016 13:00 - 16:00: Moroccan Web Driver projects completed for driving of the Monarchy proposed bills, the House of Representatives proposed bills, and the Ratified bills sites. Begun process of devising a naming system for the files that does not require scraping. Tinkered with naming through regular expression parsing of the URL. Structure for the oral questions and written questions drivers is set up, but need fixes due to the differences in the sites. Fixed bug on McNair wiki for women's biz team where email was plain text instead of an email link. Took a glimpse at Kuwait Parliament website, and it appears to be very different from the Moroccan setup.
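The regular-expression naming idea above can be sketched as: take the last URL path segment and sanitize it into a filename. The URL shapes in the comments are hypothetical, not the actual parliament site's structure.

```python
# Sketch of URL-based file naming without scraping. The URL patterns are
# hypothetical; the real sites' URL structure may differ.
import re

def filename_from_url(url, ext=".pdf"):
    # Last path segment, with non filename-safe characters collapsed to "_".
    stem = url.rstrip("/").rsplit("/", 1)[-1]
    stem = re.sub(r"[^\w\-]+", "_", stem)
    return stem + ext
```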
10/6/2016 13:30 - 18:00: Discussed with Dr. Elbadawy about the desired file names for Moroccan data download. The consensus was that the bill programs are ready to launch once the files can be named properly, and the questions data must be retrieved using a web crawler which I need to learn how to implement. The naming of files is currently drawing errors in going from arabic, to url, to download, to filename. Debugging in process. Also built a demo selenium program for Dr. Egan that drives the McNair blog site on an infinite loop.
10/7/2016 12:00-14:00: Learned unicode and utf8 encoding and decoding in arabic. Still working on transforming an ascii url into printable unicode.
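The ascii-url-to-Arabic transformation above is essentially percent-decoding: the URL carries the title as percent-encoded UTF-8 bytes, and `unquote()` restores the readable text. A sketch (Python 3; the example URL is hypothetical):

```python
# Decode a percent-encoded URL segment back into readable (Arabic) text.
from urllib.parse import unquote  # Python 3

def url_to_title(url):
    # Decode the last path segment of a bill/question URL.
    return unquote(url.rstrip("/").rsplit("/", 1)[-1])
```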
10/11/2016 15:00-18:00: Fixed arabic bug, files can now be saved with arabic titles. Monarchy bills downloaded and ready for shipment. House of Representatives Bill mostly downloaded, ratified bills prepared for download. Started learning scrapy library in python for web scraping. Discussed idea of screenshot-ing questions instead of scraping.
10/13/2016 13:00-18:00: Completed download of Moroccan Bills. Working on either a web driver screenshot approach or a webcrawler approach to download the Moroccan oral and written questions data. Began building Web Crawler for Oral and Written Questions site. Edited Moroccan Web Driver/Crawler wiki page. [http://mcnair.bakerinstitute.org/wiki/Moroccan_Parliament_Web_Crawler Moroccan Web Driver]
10/14/2016 12:00-14:00: Finished Oral Questions crawler. Finished Written Questions crawler. Waiting for further details on whether that data needs to be tweaked in any way. Updated the Moroccan Web Driver/Web Crawler wiki page. [http://mcnair.bakerinstitute.org/wiki/Moroccan_Parliament_Web_Crawler Moroccan Web Driver]
10/18/2016 15:00-18:30: Finished code for Oral Questions web driver and Written Questions web driver using selenium. Now, the data for the dates of questions can be found using the crawler, and the pdfs of the questions will be downloaded using selenium. [http://mcnair.bakerinstitute.org/wiki/Moroccan_Parliament_Web_Crawler Moroccan Web Driver]
10/20/2016 13:00-18:00: Continued to download data for the Moroccan Parliament Written and Oral Questions. Updated Wiki page. Started working on Twitter project with Christy. [http://mcnair.bakerinstitute.org/wiki/Moroccan_Parliament_Web_Crawler Moroccan Web Driver]
10/21/2016 12:00-14:00: Continued to download data for the Moroccan Parliament Written and Oral Questions. Looked over [http://mcnair.bakerinstitute.org/wiki/Christy_Warden_(Twitter_Crawler_Application_1) Christy's Twitter Crawler] to see how I can be helpful. Dr. Egan asked me to think about how to potentially make multiple tools to get cohorts and other sorts of data from accelerator sites. See [http://mcnair.bakerinstitute.org/wiki/Accelerator_Seed_List_(Data) Accelerator List]. He also asked me to look at the [http://mcnair.bakerinstitute.org/wiki/Govtrack_Webcrawler_(Wiki_Page) GovTrack Web Crawler] for potential ideas on how to bring this project to fruition.
11/1/2016 15:00-18:00: Continued to download Moroccan data in the background. Went over code for GovTracker Web Crawler, continued learning Perl. [http://mcnair.bakerinstitute.org/wiki/Govtrack_Webcrawler_(Wiki_Page) GovTrack Web Crawler] Began Kuwait Web Crawler/Driver.
11/3/2016 13:00-18:00: Continued to download Moroccan data in the background. Dr. Egan fixed systems requirements to run the GovTrack Web Crawler. Made significant progress on the Kuwait Web Crawler/Driver for the Middle East Studies Department.
11/4/2016 12:00-14:00: Continued to download Moroccan data in the background. Finished writing initial Kuwait Web Crawler/Driver for the Middle East Studies Department. Middle East Studies Department asked for additional embedded files in the Kuwait website. [http://mcnair.bakerinstitute.org/wiki/Moroccan_Parliament_Web_Crawler Moroccan Web Driver]
11/8/2016 15:00-18:00: Continued to download Moroccan data in the background. Finished writing code for the embedded files on the Kuwait Site. Spent time debugging the frame errors due to the dynamically generated content. Never found an answer to the bug, and instead found a workaround that sacrificed run time for the ability to work. [http://mcnair.bakerinstitute.org/wiki/Moroccan_Parliament_Web_Crawler Moroccan Web Driver]
11/10/2016 13:00-18:00: Continued to download Moroccan data and Kuwait data in the background. Began work on [http://mcnair.bakerinstitute.org/wiki/Google_Scholar_Crawler Google Scholar Crawler]. Wrote a crawler for the [http://mcnair.bakerinstitute.org/wiki/Accelerator_Seed_List_(Data) Accelerator Project] to get the HTML files of hundreds of accelerators. The crawler ended up failing; it appears to have been due to HTTPS.
11/11/2016 12:00-2:00: Continued to download Moroccan data in the background. Attempted to find bug fixes for the [http://mcnair.bakerinstitute.org/wiki/Accelerator_Seed_List_(Data) Accelerator Project] crawler.
11/15/2016 15:00-18:00: Finished download of Moroccan Written Question pdfs. Wrote a parser with Christy to be used for parsing bills from Congress and eventually executive orders. Found bug in the system Python that was worked out and rebooted.
11/17/2016 13:00-18:00: Wrote a crawler to retrieve information about executive orders, and their corresponding pdfs. They can be found [http://mcnair.bakerinstitute.org/wiki/E%26I_Governance_Policy_Report here.] Next step is to run code to convert the pdfs to text files, then use the parser fixed by Christy.
11/18/2016 12:00-2:00: Converted Executive Order PDFs to text files using adobe acrobat DC. See [http://mcnair.bakerinstitute.org/wiki/E%26I_Governance_Policy_Report Wikipage] for details.
11/22/2016 15:00-18:00: Transferred downloaded Morocco Written Bills to provided SeaGate Drive. Made a "gentle" F6S crawler to retrieve HTMLs of possible accelerator pages documented [http://mcnair.bakerinstitute.org/wiki/Accelerator_Seed_List_(Data) here].
11/29/2016 15:00-18:00: Began pulling data from the accelerators listed [http://mcnair.bakerinstitute.org/wiki/Accelerator_Seed_List_(Data) here]. Made text files for about 18 accelerators.
12/1/2016 13:00-18:00: Continued making text files for the [http://mcnair.bakerinstitute.org/wiki/Accelerator_Seed_List_(Data) Accelerator Seed List project]. Built tool for the [http://mcnair.bakerinstitute.org/wiki/E%26I_Governance_Policy_Report E&I Governance Report Project] with Christy. Adds a column of data that shows whether or not the bill has been passed.
12/2/2016 12:00-14:00: Built and ran web crawler for Center for Middle East Studies on Kuwait. Continued making text files for the [http://mcnair.bakerinstitute.org/wiki/Accelerator_Seed_List_(Data) Accelerator Seed List project].
12/6/2016 15:00-18:00: Learned how to use git. Committed software projects from the semester to the McNair git repository. Projects can be found at: [http://mcnair.bakerinstitute.org/wiki/E%26I_Governance_Policy_Report Executive Order Crawler], [http://mcnair.bakerinstitute.org/wiki/Moroccan_Parliament_Web_Crawler Foreign Government Web Crawlers], [http://mcnair.bakerinstitute.org/wiki/Accelerator_Seed_List_(Data) F6S Crawler and Parser].
12/7/2016 15:00-18:00: Continued making text files for the [http://mcnair.bakerinstitute.org/wiki/Accelerator_Seed_List_(Data) Accelerator Seed List project].
12/8/2016 14:00-18:00: Continued making text files for the [http://mcnair.bakerinstitute.org/wiki/Accelerator_Seed_List_(Data) Accelerator Seed List project].
=='''Notes'''==
*Ed moved the Morocco Data to E:\McNair\Projects from C:\Users\PeterJ\Documents