Work logs are grouped by the two divisions of the McNair Center: Academic Papers, which produces long-term deliverables, and Research, which produces short-term deliverables of general content. Individuals are listed under the division they work in; anyone who works in both divisions is listed in both.

Academic Papers

This division of the McNair Center pursues longer-term projects, such as peer-reviewed academic papers.

Jake Silberman

Jake Silberman Work Logs (log page)

Will Cleland

Will Cleland Work Logs (log page)

Todd Rachowin

Todd Rachowin Work Logs (log page)

Amir Kazempour

Amir Kazempour Work Logs (log page)

Marcos Ki Hyung Lee

Summer 2018

Notes from Ed

Please build and link to a project page for the STATA analysis!

Also, if/when you make changes to a sql file, please:

Run them through or make it clear that you haven't with comments
Let me know by posting it on a project page and linking to it in your work log.

Otherwise, we are both going to be making conflicting changes to the same files.

By Date

Project Page: VC Startup Matching Stata Work

2018-07-23 until 07-27:

This week was dedicated to refining the Linear Probability Model and the reduced form evidence that I did for Jeremy.

I added extensive notes to the interpretation of the coefficients of each model that was estimated as requested by Jeremy. I adjusted for small technicalities in each model.

Additionally, after a call with Ed, I checked the distribution of market size when using either year-code100 or year-code20 as market definition. See Project Page for more on this.

2018-07-11 until 07-20:

Basically spent this entire week working out the code to build the LPM dataset. Detailed description of code and dataset on project page.

2018-07-11:

Created a new SQL code to build the LPM dataset. Sent it to Ed Egan to check.

2018-07-12:

Sick day.

2018-07-11:

Skype meeting with Ed Egan to discuss new dataset. Need to build a broader dataset for running the LPM model. Started studying the SQL code and thinking about the necessary changes to get the desired dataset.

2018-07-10:

Investigated why the LPM is not giving the expected results.

2018-07-06:

Received new suggestions from Fox.

Started rewriting the report following comments from Fox.

2018-07-05:

Wrote a report with all the results so far by request from Jeremy Fox and sent it to him.

Also added a LPM model to the analysis, by suggestion from Jeremy Fox, although I suspect there is something wrong with the way I built the dataset needed for it. Emailed my worries to Fox.

After meeting with Ed Egan, changed the SQL code that builds the history variables for VCs. We no longer subtract 1 from the sum of all portcos that worked with the VC, and the LEFT JOIN now uses a strict inequality instead of a weak one.
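The effect of that change can be sketched in Python rather than SQL. This is only an illustration under assumptions: the record layout, the sample rows, and both function names are hypothetical and are not taken from the project's SQL files.

```python
from datetime import date

# Hypothetical investment records: (vc, portco, industry code, investment date).
investments = [
    ("vc1", "a", "1100", date(2010, 1, 5)),
    ("vc1", "b", "1100", date(2011, 3, 1)),
    ("vc1", "c", "1100", date(2011, 3, 1)),  # same day and same code as "b"
    ("vc1", "d", "2200", date(2012, 6, 9)),
]


def prev_same_industry(vc, portco, records):
    """Count the VC's earlier portcos that share the focal portco's industry code."""
    _, _, code, when = next(r for r in records if r[0] == vc and r[1] == portco)
    # Strict inequality: only investments dated before the focal one count,
    # so the focal portco never matches itself and nothing is subtracted.
    return sum(1 for v, _, c, d in records if v == vc and c == code and d < when)


def prev_same_industry_weak(vc, portco, records):
    """Older rule for comparison: weak inequality, then subtract 1 for the self-match."""
    _, _, code, when = next(r for r in records if r[0] == vc and r[1] == portco)
    return sum(1 for v, _, c, d in records if v == vc and c == code and d <= when) - 1
```

For portco "b", the strict version counts only "a", while the weak version also counts "c", which was met on the same day. That is the double counting described in the 2018-06-25 entry.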

2018-07-04:

Rework data analysis following suggestions from Egan and Fox.

2018-07-03:

Worked with regressions and made log files with summary statistics and outputs.

Had Skype meetings with Ed Egan and Jeremy Fox.

2018-06-26:

Picked relevant variables and started thinking of some regression specifications.

2018-06-25:

Studied the SQL code that creates the variables synsumprevsameindu100 synsumprevsameindu20 synsumprevsameindu synsumprevsamesector synnumprevportcos syntotsameindu100 syntotsameindu20 syntotsameindu syntotsamesector (and also the non synthetic ones).

Found some problems with the nonsynthetic ones: there is some double counting when VCs met with more than one portco on the same day AND the portcos had the same industry code. The code subtracts 1 from the sum of dummies that indicate same industry code, but in these instances it should be subtracting more.

Also, the synthetic counterparts have weird values. The historical ones (prior to meeting the portco) are mostly 0 or -1, while the all-time ones have many missing values. Initially I thought it was an error in the code, but after thinking about this, I think it is a feature of the randomization. To correct the negative numbers, I think we should not subtract 1 from the sum of dummies. We did that to account for the repeated portcos that showed up in the blowout table, but now these repetitions don't happen, since we are joining a table with synthetic matches with real matches.

2018-06-22:

Plans for today: try to fix dataset.

Found more errors in matched dataset. Synthetic firms variables seem to be wrong, as there are negative numbers and lots of missings.

Also, variables from matched firms, like number of people, doctors, etc., and city name, are missing seemingly at random.

Egan walked me through the SQL code that generates the matched dataset. We made a more precise count of coinvestors. Before, we were double counting funds. Now, if a PortCo had only one VC fund investment, numcoinvestor == 0.
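A minimal sketch of the corrected count, assuming the input is one row per (PortCo, fund, round), so the same fund can appear several times for one PortCo. The row layout and the function name are hypothetical; the actual SQL may define numcoinvestor at a different level.

```python
from collections import defaultdict

# Hypothetical round-level rows: (portco, fund). Fund "f1" invested in two rounds of "p1".
rounds = [("p1", "f1"), ("p1", "f1"), ("p1", "f2"), ("p2", "f3")]


def num_coinvestors(rows):
    """Return {portco: number of distinct funds other than the focal one}."""
    funds = defaultdict(set)
    for portco, fund in rows:
        # A set keeps each fund once, so repeat rounds are not double counted.
        funds[portco].add(fund)
    return {portco: len(fund_set) - 1 for portco, fund_set in funds.items()}
```

With these rows, "p1" has one coinvestor and "p2", which has a single fund, has zero.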

Looking into the synthetic variables problem, the main problem is in the variables synsumprevsameindu100 synsumprevsameindu20 synsumprevsameindu synsumprevsamesector synnumprevportcos syntotsameindu100 syntotsameindu20 syntotsameindu syntotsamesector.

They count the number of PortCos a VC invested in that share the current PortCo's industry code, before meeting the current matched PortCo (synsum*) and over all time (syntot*). So they are integers and tot >= sum. However, for the synthetic firms, the sum variables are mostly -1 and the tot variables are missing.
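The constraints stated above (non-missing, non-negative, tot >= sum) can be checked mechanically. The sketch below assumes each dataset row is a dict and that each synsum* variable pairs with the syntot* variable of the same suffix; that pairing and the function name are assumptions made for illustration.

```python
def check_history_counts(rows, pairs):
    """Return (row index, sum column, problem) for every violated constraint."""
    problems = []
    for i, row in enumerate(rows):
        for sum_col, tot_col in pairs:
            s, t = row.get(sum_col), row.get(tot_col)
            if s is None or t is None:
                problems.append((i, sum_col, "missing"))
            elif s < 0 or t < 0:
                problems.append((i, sum_col, "negative"))
            elif t < s:
                # The all-time count can never be smaller than the historical count.
                problems.append((i, sum_col, "tot < sum"))
    return problems


# Assumed pairing of historical and all-time variables.
PAIRS = [("synsumprevsameindu100", "syntotsameindu100")]

sample_rows = [
    {"synsumprevsameindu100": 2, "syntotsameindu100": 5},      # consistent
    {"synsumprevsameindu100": -1, "syntotsameindu100": None},  # the pattern described above
    {"synsumprevsameindu100": 3, "syntotsameindu100": 1},      # tot < sum
]
```

Running `check_history_counts(sample_rows, PAIRS)` lists the second and third rows as problems and leaves the first one out.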

Looking at the code that generates the synthetic variables, there seems to be a problem when joining and subtracting one from the sum of dummies where, for example, A.code100 = B.code100. Can't figure out how to correct it yet.

2018-06-21:

Plans for today: get a full understanding of dataset and variables, start making some summary statistics.

Inspected matched dataset and found inconsistencies in the investment amounts of VCs in PortCos. Talked to Egan about this; we will check it out carefully in the source SQL code tomorrow.

Made summary statistics of firm variables. There do not seem to be inconsistencies there.

2018-06-20: Created folder at "E:\McNair\Projects\MatchingEntrepsToVC\Stata\", imported files into Stata, and made master dofile.

Research

Amir Kazempour

Amir Kazempour Work Logs (log page)

Ben Baldazo

Ben Baldazo Work Logs (Work Log)

contributing Projects

Crunchbase Data / Accelerator Seed List (Data) : Combined this data in a table discussed on Crunchbase Data

Houston Innovation District

Augusta Startup Ecosystem

Houston Entrepreneurship Ecosystem Project

Houston Entrepreneurship

Start-Up Guide (Issue Brief)

Houston Accelerators and Incubators (Report)

Cofounding in Exchange for Equity

Start-Ups of Houston (Map)

worklog

2017-11-21: Worked with Ed to set up all of the ground work to begin joining tables for the purpose stated in yesterday's work log. Should be able to finish it upon returning next week, but until then, notes are all held within "Z:\bulk\crunchbase\AccFunding.psql" with the important parts under the header of "From Ed on 21st of Nov. To finish on Nov 27"

2017-11-20: Attempting to link 3 tables from psql crunchbasebulk to find accelerators that have invested in companies. Likely found success with the table "Acc_Funded_Cos" but the investor column is dirty, thus trying to do it cleaner with the aforementioned 3 table link

This is all noted in "Z:\bulk\crunchbase\AccFunding.psql" and the code for "Acc_Funded_Cos" is emphasized

2017-09-25: Followed Talk:Ben Baldazo (Work Log) to create documentation infrastructure for Augusta Startup Ecosystem

Catherine Kirby

Catherine Kirby Work Logs (log page)

2017-12-07: 1pm-3pm: work on report

2017-12-06: 1pm-3pm; work on report

2017-12-01: 3pm-5pm: work on report

2017-11-30: 4pm-5pm: work on report

2017-11-29: 10am-11am; 3:00pm-5:00pm: work on report

2017-11-28: 4pm-5pm: work on report

2017-11-27: 10am-11am; 3:00pm-5:00pm: work on report

2017-11-22: 10am-12pm: work on report

2017-11-21: 3:30-4:30pm: work on report

2017-11-20: 10am-11:30am; 3:30-5:00pm: work on report

2017-11-16: 3:30pm-5:00pm: work on report

2017-11-15: 10am-12pm; 3:30-5:00pm: work on report

2017-11-14: 4:00pm-5:00pm: work on report

2017-11-13: 11:00am-12:00pm, 3:30pm-5:00pm: begin writing

2017-11-10: 1:00pm-3:00pm: work on graphs

2017-11-07: 11:00am-12:00pm: edit graphs, begin intro

2017-11-02: 11:00am-12:00pm; 4:00-5:00pm: make graphs

2017-11-01: 10:00-12:00pm; 3:00pm-5:00pm: make graphs

2017-10-30: 10:00-12:00pm; 3:00-5:00pm: make graphs

2017-10-26: 11:00-12:00pm; 4:00-5:00pm: zipcodes

2017-10-25: 3:30-5:00pm: zipcodes

2017-10-24: 10am-12pm; 4:00pm-5:00pm: zipcodes

2017-10-23: 2pm-3:30pm: find hospitals in zipcode area

2017-10-20: 1pm-3:30pm: find hospitals in zipcode area

2017-10-18: 10am-12pm: look at code running results. Clinical trial data finished

2017-10-17: 11am-12pm; 3:30-5:00pm: run newbiotech code

2017-10-16: 10am-12pm; 3:15-4:45pm: finish pulling Texas Contract data. rewrite med center code with new zipcodes

2017-10-12: 10am-12pm: pull data for Texas Contracts

2017-10-11: 10am-12pm: pull data for all Texas Federal Grants from 2005-2017

2017-10-04: 10am-12pm; 3:30-5pm: find all government grants in Augusta

2017-10-03: 11am-12pm: work on DOD grants

2017-10-02: 10am-12pm; 3:45-6:15: work with new zipcodes, pull DOD grants

2017-09-29: 3:00pm-4:30pm: examine hospitals in zipcodes

2017-09-27: 3:30-4:30pm: edit paper

2017-09-26: 10:30am-12pm; 2:00pm-5:00pm: find nsf, nih, clinical trial data for georgia and augusta university

2017-09-25: 10am-12pm: edit paper

2017-09-21: 10:15-12:15pm; 3:30-5:00pm: search for zipcode areas, edit graph

2017-09-20: 10am-12pm; 3:30-5:00pm: edit graphs and writing

2017-09-18: 10am-12pm: edit graphs and figures

2017-09-13: 10am-12pm; simplify report

2017-09-12: 10am-12pm; 3:30-5:00: simplify report

2017-09-11: 10am-12pm; 3:30-5:00: edit report and figures, add the edited graphs to report

2017-09-06: 10am-12pm: edit report

Connor Rothschild

7/30/2018 -

Recoded founders' education
In the process of recoding founders' job experience
Worked with Minh to test MTurk survey
Talked through MTurk logistics and strategy with Minh
Recoded equity and investment variables given new SeedDB data
Renormalized investment amount based on midpoint of ranges, and upper bounds (upon Hira's request)
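A minimal sketch of turning a coded investment range into a midpoint and an upper bound. The string formats handled here ("$20,000-$50,000", "up to $100,000", a single figure) and the function names are assumptions, as is treating "up to X" as a range starting at 0; the actual coding in the sheet may differ.

```python
import re


def parse_amount_range(text):
    """Return (low, high) as floats, or None if the text contains no number."""
    numbers = [float(n.replace(",", "")) for n in re.findall(r"\d[\d,]*(?:\.\d+)?", text)]
    if not numbers:
        return None
    if len(numbers) == 1:
        # Assumption: "up to X" has no stated lower bound, so use 0 as the low end.
        low = 0.0 if text.strip().lower().startswith("up to") else numbers[0]
        return (low, numbers[0])
    return (min(numbers), max(numbers))


def midpoint_and_upper(text):
    """Return (midpoint, upper bound) for a coded range, or None if unparseable."""
    bounds = parse_amount_range(text)
    if bounds is None:
        return None
    low, high = bounds
    return ((low + high) / 2, high)
```

For example, `midpoint_and_upper("$20,000-$50,000")` gives a midpoint of 35000.0 and an upper bound of 50000.0.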

7/29/2018 -

Finalized multiple campuses work, refined addresses
Upon Hira's request, recoded dead/alive variable for updated accuracy

7/27/2018 -

Recoded founders
Fixed multiple campuses and cohorts
Fixed the Google Sheet

7/26/2018 -

Cleaned up and fixed the Google Sheet with timing info.
Recoded the employee count variable.
Normalized investment amount

7/25/2018 -

Created a comprehensive Google Sheet with new timing info, collaborated with other interns to find data. Cleaned up sheet.

7/24/2018 - Sick day :(

7/23/2018 -

Helped Minh with Demo Day information.

7/19/2018 -

Helped Minh with training data for Demo Day Crawler

7/18/2018 -

Helped Augi with MA cleaning
Talked to Minh about Demo Day progress

7/17/2018 -

Worked with Ed to add/merge data from Crunchbase to existing data. This was a replication of the process but done by Ed in SQL, not Excel. New data can be found in

/McNair/Projects/Accelerators/Summer 2018/Merged With Crunchbase Info as of July 17.xlsx

NOTE: Use this data rather than the sheet mentioned in yesterday's entry.

7/16/2018 -

Merged cohort company data with Crunchbase data, by doing a Vlookup then cleaning up data. I used a =IF(A2="",B2,A2) formula to merge cells only when blanks were present. This provided us updated data for four columns:
- colocation (removed 6324 blanks)
- codescription (removed 5151 blanks)
- costatus (removed 7342 blanks)
- courl (removed 6670 blanks)

and new columns:

- address
- founded_on date
- employee_count
- linkedin_url

These new variables can be found in:

/McNair/Projects/Accelerators/Summer 2018/Crunchbase Info Populated Empty Cells.xlsx (OUTDATED:: DON'T USE)

Upon Ed's approval, I'll move this sheet to replace Cohort Companies in The File to Rule Them All.
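The =IF(A2="",B2,A2) rule from this entry can also be written outside Excel. The sketch below assumes rows are dicts and that companies are matched on a name column; the key name "coname", the function name, and the sample values are hypothetical.

```python
def fill_blanks(our_rows, crunchbase_by_name, columns, key="coname"):
    """Fill blank cells from Crunchbase and count how many blanks were removed per column."""
    filled = {col: 0 for col in columns}
    merged = []
    for row in our_rows:
        match = crunchbase_by_name.get(row[key], {})
        new_row = dict(row)
        for col in columns:
            current = (row.get(col) or "").strip()
            candidate = (match.get(col) or "").strip()
            # Same rule as =IF(A2="",B2,A2): keep our value unless it is blank.
            if current == "" and candidate != "":
                new_row[col] = candidate
                filled[col] += 1
        merged.append(new_row)
    return merged, filled


ours = [
    {"coname": "Acme", "colocation": "", "courl": "acme.com"},
    {"coname": "Beta", "colocation": "Houston", "courl": ""},
]
crunchbase = {
    "Acme": {"colocation": "Austin", "courl": "acme.io"},
    "Beta": {"colocation": "Dallas", "courl": "beta.com"},
}
merged_rows, blanks_removed = fill_blanks(ours, crunchbase, ["colocation", "courl"])
```

Existing values such as "acme.com" and "Houston" are kept, and `blanks_removed` plays the role of the "removed N blanks" tallies listed above.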

7/13/2018 -

Using SQL, matched our cohort companies with information from Crunchbase. This gave us a lot of new information, like employee counts, company status, the date founded, and the location of the company. This data can be found here:

/McNair/Projects/Accelerators/Summer 2018/Cohort Companies With Crunchbase Info.xlsx

7/12/2018 -

Created 'The File to Rule them All' with finalized info on accelerators, cohort companies, and founders.
Attempted to match our company data to Crunchbase data with SQL to get more info on companies.

7/11/2018 -

Worked on LinkedIn Founders data. Cleaned up data, removed duplicates, checked for fidelity.
Worked with Maxine to finish Crunchbase matching.

7/10/2018 -

Merged Clean Cohort Data (Veeral) and Cohort List (new) in the Accelerator Master Variable List file. Cross-referenced this list with Ed's data sent last week, titled accelerator_data_noflag.txt. We found that there are 4866 more entries in the new merged file, meaning Ed's merging may have dropped valid entries. (This was after filtering the list so we only looked at the accelerators on our list).

7/9/2018 -

Worked with Maxine to remove duplicates/gather clean data for Crunchbase matching

06/29/2018 -

Finished manually coding an equity variable in Master Variable List sheet (with the help of Maxine Tao).
Finished editing terms of joining accelerator:
Given the above two tasks, there are five new columns in our Master Variable List sheet:
- Terms of joining - terms of joining accelerator and important details about program
- equity (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if an accelerator definitively does not, and are blank if we could not find that information
- equity amount - the % of equity the accelerator will take (can sometimes be a range (eg. 5-7%))
- investment - the $ the accelerator invests in a company to begin, if relevant (also could be a range or a "up to $######")
- notes - anything to comment on previous 4 columns
Taught Maxine Tao how to VLookup :D

06/28/2018 -

Began manually coding an equity variable in Master Variable List sheet.
Edited terms of joining accelerator.
Helped Grace with LinkedIn crawler.

06/27/2018 -

Finished coding duplicates. Final file can be found at:

/bulk/McNair/Projects/Accelerators/Summer 2018/Duplicate Companies.xlsx

Dylan taught interns Excel skills

06/26/2018 -

Began coding duplicates in CohortMainBaseWCounts.txt file that Ed sent. Sorted by company name alphabetically, then used conditional formatting to highlight when an accelerator had the same name as the accelerator above. This narrowed down the results to instances in which a company would go through the same accelerator twice. Most of the time, this was due to an error with the normalizer, so I moved those un-normalized company names to their own sheet and deleted them from the file.
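The sort-and-compare step above can be sketched as follows, assuming each row reduces to a (company name, accelerator) pair. The function name and sample rows are hypothetical, and the sketch only flags candidates; deciding whether a flag is a normalizer error still needs a manual look.

```python
def flag_repeat_accelerators(rows):
    """Return rows whose company and accelerator both equal the row above after sorting."""
    ordered = sorted(rows)  # sorts by company name, then accelerator
    flagged = []
    for previous, current in zip(ordered, ordered[1:]):
        # Equivalent of the conditional-formatting rule: same values as the row above.
        if previous == current:
            flagged.append(current)
    return flagged


sample = [
    ("Acme", "YC"),
    ("Beta", "Techstars"),
    ("Acme", "YC"),
    ("Acme", "500"),
]
```

Here `flag_repeat_accelerators(sample)` returns only the repeated ("Acme", "YC") row; "Acme" appearing under a different accelerator is not flagged.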

06/25/2018 -

Went through and manually fixed discrepancies between our accelerator data and the Crunchbase data, found at

/bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators Matched by Name and Homepage URL.xlsx

Finalized a sheet with a list of accelerator names as we code them, as Crunchbase codes them, and the appropriate UUID for each accelerator. I recommend updating the names in our spreadsheet of accelerators to the Crunchbase list so that we will be able to look up that name without having an in-between. The list can be found in the rightmost columns here:

/bulk/McNair/Projects/Accelerators/Summer 2018/Accelerator Master Variable List - Revised by Ed V2.xlsx

and here:

https://docs.google.com/spreadsheets/d/1n1sX5DqZrm_0vbUXG9ZaZIagF9sa0Kva9PAno-6H854/edit?usp=sharing

Worked with Minh Le to better understand and begin documenting the Demo Day Page Parser project.

06/22/2018 -

Finished going through Accelerator Master Variable List to refine industry classification and update addresses/accelerator statuses.

06/21/2018 -

Began manually editing entries in Accelerator Master Variable List.
Reached out to Grace and Maxine and sent them the necessary sheets/txt files so they could begin on their Crunchbase project.
I also made these graphics to better represent what our collaborative work would look like, and what the final project would include:

https://docs.google.com/document/d/13Mb7lOLydm9r-ENYxSlZJVGgY9wxClATR6Hy8F9YK1Y/edit?usp=sharing

06/20/2018 -

Talked with Ed about project details.
Began looking through the Accelerator Master List to better understand project description.
Sent Grace and Maxine the relevant company names listed in the Accelerator Master Spreadsheet so they could begin using their relevant parsers and tools to sort through data.

06/19/2018 -

Set up work stations on balcony, trained

06/18/2018 -

Trained, met other interns

Diana Carranza

Diana Carranza Work Logs (log page)

Dylan Dickens

2018-03-06: Troubleshot Key Terms program with Christy, continued to read articles.

2018-03-05: Tested the Key Terms program, found it not to be working. Troubleshot and alerted Christy.

2018-03-01: Started to read articles for key-terms testing.

2018-02-28: Adjusted some wiki pages, started testing the revamped tools.

2018-02-27: Drafted email with concerns to Ed, met with Ed to resolve concerns. Created action plan of testing the revamped tools and codifying a subset of known papers.

2018-02-26: Reviewed Christy's new documentation, prepared to meet with Ed.

2018-02-22: Tested RegEx-Excel Filter process, flagged some additional questions that need guidance from Ed. Met with Christy and worked to resolve coding issues.

2018-02-21: Finished RegEx-Excel Filter process, spoke with Ed about long-term goals of project.

2018-02-20: Continued working on the RegEx-Excel filter.

2018-02-19: Continued working on the RegEx-Excel filter.

2018-02-15: Started developing a RegEx and Excel filter for processing and cross-referencing sources.

2018-02-14: Identified the status of all codes. Drafted an email to Christy about returning temporarily to help with the codes.

2018-02-13: Ran the KeyTerms and PDF Converter Python Codes.

2018-02-12: Finished troubleshooting crawler, reached out to Ed for guidance. Was redirected to testing Key Terms code.

2018-02-07: Troubleshot the crawler with Christy.

2018-02-06: Troubleshot the crawler with Christy.

2018-02-05: Reached out to Ed for guidance, was redirected to testing the scholar crawler.

2018-02-01: Continued PDF - BibTex filtering

2018-01-31: Started PDF - BibTex filtering process as per meetings with Christy and Lauren.

2018-01-30: Met with both Christy and Lauren.

2018-01-29: Reviewed the current state of PTLR project in order to prepare for meetings on Tuesday.

2018-01-26: Assisted with the McNair Center Event.

2018-01-25: Reached out to previous project owners to gather information for next steps. Was on standby to assist with the Lyceum Research Page

2018-01-24: Searched for tools to accomplish the strategies outlined in Patent Thicket Strategic Planning. Had a hard time locating anything, or getting a good grasp on where exactly the project is and what it needs. Gathered contact information for previous owners to make communications later this week. Also continued to prep the Lyceum Research Page for Ed.

2018-01-23: Finished Patent Thicket Strategic Planning and sent to Ed. Ed approved.

2018-01-22: Read Patent Thicket literature. Met with Ed to discuss broad strategy, began planning for next steps. Patent Thicket Strategic Planning

2018-01-18: Met with Ed to discuss Patent Thicket Project. Helped complete his research for the Amazon HQ2 Report.

2018-01-11: Finalized sourcing for Venture Capital Gap for Women

2018-01-10: Sourced all of Venture Capital Gap for Women, downloaded PDF's for about 3/4 of sources

2018-01-09: Found additional sources on the Venture Capital Gap for Women, as well as Fondren availability for a portion of the sources.

Hira Farooqi

Hira Farooqi Work Logs (log page)

James Chen

James Chen Work Logs (log page)

Joe Reilly

Joe Reilly Work Logs (log page)

2017-3-26: Added Score column to Accelerator Master Variable List google Doc; began filling in necessary info.

2017-3-19: Filled in part of 'duration' column on Accelerator Master List Doc.

2017-3-9: Created a google doc of the Accelerator master list

2017-3-5: Created "Potential Other Variables full list" in E:\McNair\Projects\Accelerators\Spring 2018\Grouping project of ListOfAccs. Began comprehensive list of all possible variables that could be included using current info in Master Accelerator Variable Master List Project Excel File.

2017-3-2: Fixed errors on Master Accelerator Variable Master List Project in E:\McNair\Projects\Accelerators\Spring 2018; created "Potential Other Variables" in E:\McNair\Projects\Accelerators\Spring 2018\Grouping project of ListOfAccs.

2017-2-28: Organized and delegated tasks for completion of Accelerator Variable Master List Project among Michelle, Cindy, Yunnie, and me.

[fill in days]

2017-2-23: Accelerator Type Project: researched whether foreign-based accelerators had a significant US presence.

2017-2-16: Accelerator Type Project

2017-2-15: Accelerator type project: wrote instructions, saved as "Instructions for Accelerator type project" in E:\McNair\Projects\Accelerators\Spring 2018\Grouping project of ListOfAccs.

2017-2-14: Accelerator type project

2017-2-12: Accelerator type project

2017-2-7: Accelerator type project

2017-2-6: Accelerator Data meeting.

2017-2-1: Accelerator type project

2017-1-31: Accelerator type project

2017-1-29: Accelerator type project

2017-1-24: Continued on Accelerator Type Project. Began process of collecting files through Zotero for future "Gender and MGMT style" lit review page.

2017-1-22: Worked on Accelerator Type Project. See http://mcnair.bakerinstitute.org/wiki/Accelerator_Seed_List_(Data)#Accelerator_Type_project. Also see E:\McNair\Projects\Accelerators\Spring 2017\Grouping project of ListOfAccs.

Julia Wang

Julia Wang Work Logs (log page)

12/4-12/8 finalizing University Patents report

12/4 9-12 edits, sent to Ed, confirming catering for party
12/5 9-12 final edits, sent to Ed
12/6 1-4 making City Agglomeration graphics
12/7 1-3 wrapping up everything

11/27-12/1

11/27 10-12 edits, sent to Ed
11/29 10-12 catering order for lunch party, wiki page organization
11/30 2:30-4:30 edits
12/1 10-12 met with Ed, edits

11/20-11/22

11/20 10-12 edits
11/21 10-12 met with Ed, edits
11/22 10-12 met with Ed, edits

11/13-11/17 deadline 11/16 final draft

11/13 10-12 redoing reg table
11/14 10-12 edits
11/15 10-12:30 edits
11/16 2:30-4 met with Ed, edits
11/17 10-12 edits

11/6-11/10 deadline 11/16 final draft

11/6 10-12 4th draft
11/8 10-12 met with Ed, editing
11/9 2:30-5:30 redoing graphs, restructuring introduction
11/10 10-12, 3-4:30 redoing charts, rewriting body

10/30-11/3 deadline 11/16 final draft

10/30 10-12 revisions
10/31 10-12 revisions
11/1 10-12 revisions, new data for basic funding
11/2 10-12 revisions

10/23-10/27

10/23 10-12 editing University Patents
10/24 10-12 reran regressions, fixed problem with Cornell!
10/25 10-12 sent 2nd draft to Anne
10/26 2:30-4:30 revisions
10/27 10-12 sent 3rd draft

10/16-10/20

10/16 10-12 pulled Houston patent addresses
10/17 10-12 pulled Houston patent addresses
10/18 10-12 edited University Patents
10/19 2:30-4:30 edited University Patents, tabled patent database reorganization until it is cleaned by Oliver/Shelby/Ed
10/20 10-12 edited University Patents

10/11-10/13

10/11 10-12 pulling Houston patents
10/12 10-12 University patents, sent draft to Anne

10/2-10/6

10/2 10-12 work on University Patents draft, close to sending
10/3 10-12 Distracted by Augusta project, Reorganizing patent database
10/4 10-12 Reorganizing whole patent database by city, state, pulling Crunchbase data for Augusta
10/5 2:15-2:45 Augusta patents
10/6 10-12 Reorganizing patents, figure out misspellings

9/25-9/29 finish draft

9/25 10-12 remaking charts
9/26 10-12 data pull for Augusta University
9/27 10-12 data pull for Augusta University
9/29 10-12 reran log regressions

9/18-9/22/2017

9/18 10-12 cleaning data
9/19 10-12 cleaning data
9/20 10-12 created time-series data set
9/21 2:30-4 reran regressions
9/22 10-12, 2-3 remade charts

9/11-9/15/2017 Deadline: 9/15 - convert data to time-series, new charts

9/11 10-12 Converting to time-series
9/12 10-12 Check accuracy, converting to time-series, talked to Jeemin about next project
9/13 10-12 Fix R&D data, previous SQL code
9/14 2:30pm-4pm, 10:30pm-12am Fix R&D data
9/15 10-12, 2-3 join data

9/5-9/8/2017 Putting together University Patents report

9/5 10-12 Looked at report, created artifacts, cleaned University Patents folder
9/6 10-12 Spoke with Ed about project organization
9/7 2:30-4:30 Writing report
9/8 10-12 Making data into time series: gyear, make tables and charts

Matthew Ringheanu

Matthew Ringheanu Work Logs (log page)

9/11/2017 2:00-5:00 pm

Spoke to Ed about the project going forward. Organized the current updated data for our project.

9/12/2017 3:00-5:00 pm

Began going through the Cleaned Cohort Data Excel file and found a few problems with it. Will continue the cleaning process for the rest of the week.

9/13/2017 2:00-5:00 pm

Sorted through Cleaned Cohort Data and finalized our List of Accelerators. We can begin the process of creating our PercentVC table.

9/14/2017 3:00-5:00 pm

Completely finalized our dataset of accelerators and startups. Met with Michelle Passo to discuss objectives of the research for credit course.

9/18/2017 2:00-4:00 pm

Talked with Peter about the LinkedIn crawler data. Went through VC page that Meghana sent me.

9/19/2017 3:00-5:00 pm

Completed SDC pull of updated VC Data.

9/20/2017 2:00-5:00 pm

Attempted several times to run the Matcher. Cleaned our pulled data.

9/21/2017 3:00-5:00 pm

Came extremely close to running the Matcher correctly. Reviewed the final LinkedIn data from Peter.

9/25/2017 2:00-5:00 pm

Finalized the matched file of accelerator companies with VC portfolio companies. Gave Ben the data on Georgia accelerators.

9/26/2017 3:00-5:00 pm

Worked on finding the duplicates in our Matched file in order to have the most accurate data.

9/27/2017 2:00-5:00 pm

Attempted to find a way to organize the duplicate matches.

9/28/2017 4:00-5:00 pm

Continued running through matched data in order to organize it effectively.

10/2/2017 2:00-5:00 pm

Talked to Ed about next steps for the project. Practiced accessing the crunchbase database on SQL. Brushed up on SQL code.

10/3/2017 3:00-5:00 pm

Searched the database for crunchbase investment information.

10/4/2017 2:00-5:00 pm

Pulled the funding rounds table from SQL and matched it with the companies that have received VC funding in order to gather round dates.

10/6/2017 3:00-5:00 pm

Went through the matched data. Brainstormed ways to get the dates for cohort companies going through accelerators.

10/11/2017 2:00-3:30 pm:

Looked into using the WhoIs Parser in order to find when the companies went through their accelerators.

10/12/2017 3:00-5:00 pm

Discovered that the Wayback Machine will not be a good option for finding when companies went through their accelerators. Created a list of VCCompanies and their earliest round date. Included a column for the date they went through their accelerators and will fill it in when we find a good method of finding this date.
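A minimal sketch of building the list of companies with their earliest round date, assuming the funding data is available as (company, round date) pairs. The function name and sample rows are hypothetical.

```python
from datetime import date


def earliest_round_dates(rounds):
    """Return {company: earliest round date} from (company, round date) pairs."""
    earliest = {}
    for company, when in rounds:
        # Keep the smallest date seen so far for each company.
        if company not in earliest or when < earliest[company]:
            earliest[company] = when
    return earliest


sample_rounds = [
    ("Acme", date(2015, 6, 1)),
    ("Acme", date(2014, 2, 10)),
    ("Beta", date(2016, 9, 30)),
]
```

The accelerator date column described above can then be added next to each company once a reliable source for it is found.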

10/16/2017 2:00-3:30 pm

Continued working on sorting VCCompanies by their earliest round date.

10/17/2017 3:00-5:00 pm

Worked with Ben to find a solution to our problem of data acquisition. Finalized earliest round date for VCCompanies.

10/18/2017 2:00-5:00 pm

Updated our VC data with Ed's help in order to increase the accuracy and completion of our data.

10/19/2017 3:00-5:00 pm

Organized all of our matched data and updated it in order to reflect the most recent SDC pull with Ed. Matched Crunchbase data with our cohort companies.

10/20/2017 2:00-3:30 pm

Generated the new list of VCCompanies as well as their earliest round dates.

10/23/2017 2:00-3:30 pm

Worked on sorting out the discrepancies in our matched data.

10/24/2017 3:00-5:00 pm

Went through list of VCCompanies and began adding respective accelerators in order to proceed with VCPercentage table.

10/25/2017 2:00-5:00 pm

Continued going through list of VCCompanies and adding accelerators.

10/26/2017 3:30-5:30 pm

Continued going through list of VCCompanies and adding accelerators. Will have this completed on Monday.

10/30/2017 2:00-3:30 pm

Finished adding all of the accelerators to the list of VCCompanies. Added a column indicating whether or not the company went through two or more accelerators.

10/31/2017 3:00-5:00 pm

Began compiling data in the column for Date Company went through Accelerator.

11/1/2017 2:00-4:00 pm

Finalized entering dates for Y Combinator cohort companies.

11/2/2017 4:00-5:30 pm

Continued entering cohort company dates into Excel file.

11/6/2017 2:00-4:00 pm

Continued entering cohort company dates into Excel file. Began compiling a list of keywords for demo day press releases.

11/7/2017 3:00-5:00 pm

Finished coming up with keywords for demo day crawler. Sent the final list to Peter.

11/8/2017 2:00-3:30 pm

Spoke to Ed and organized all of our current data.

11/9/2017 3:00-5:00 pm

Created a new project page called Accelerator Data and listed all relevant files as well as descriptions.

11/14/2017 3:00-5:00 pm

Looked up URLs and decided whether or not the website was relevant.

11/15/2017 2:00-5:00 pm

Created SQL database entitled "acceleratordata" and began creating tables from folder of All Relevant Files.

11/16/2017 3:00-5:00 pm

Continued to input tables into SQL database.

11/20/2017 2:00-5:00 pm

Cleaned text files in order to import tables into SQL database.

11/27/2017 2:00-5:00 pm

Worked with Peter to find and exclude irrelevant keywords on HTML pages. Began categorizing relevant demo day pages.

11/28/2017 3:00-5:00 pm

Finished inputting tables of relevant files into SQL database.

11/29/2017 2:00-5:00 pm

Went through accelerator HTML URLs. Spoke with Ed about going through HTMLs and classifying based on overall and specific relevance.

12/1/2017 3:00-5:00 pm

Worked through accelerator links and classified pages based on whether or not they provided relevant information about startup timing.

12/4/2017 10:00-12:00 pm

Continued running through demo day crawl URLs and scoring them based on relevance.

12/7/2017 1:00-4:30 pm

Finalized scoring of demo day URLs for the original crawl. Last day of work for this semester.

Meghana Gaur

Meghana Gaur Work Logs (log page)

2017-12-1: worked with ed to build tables with firm/portco data on distance and fund/portco data on performance

2017-11-16: finished calculating great circle distances between firms, portco's, and branch offices (look at roundlinewithgcd table)
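For reference, a great circle distance between two latitude/longitude points can be sketched with the haversine formula. The log does not say which formula the roundlinewithgcd table uses, so the formula choice, the kilometre unit, and the function name are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before taking asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
```

A quarter of the equator, `great_circle_km(0, 0, 0, 90)`, comes out as one quarter of the circumference implied by the radius constant.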

2017-11-14: worked on getting all roundline tables down to the firm level, instead of fund; running into small problems with calculating gcd between firms and portco's (will discuss with Ed)

2017-11-14: worked on joining ipo information to roundline; aggregated ipo information to the firm level (rather than fund)

2017-11-09: reloaded firm coords and also fund coords - re-building roundlinewithgcd (code is written, but fund coords weren't correctly loaded, so this code will be re-run), wrote code for fundtofirms and portcotofirms, but this code will be re-run once the firm codes are correctly loaded; working on joining portcoexitmaster to roundlinejoinerlean

2017-11-08: loaded roundlinewithgcd table (calculating gcd between portcos and funds), created GCD example with notes in datawork folder in MatchingEntrepsToVCs, worked on building portcostofirms

2017-11-07: loaded portcocoords table, joined portcocoords to roundlinejoinerlean, calculate gcd distance between funds and portco's, work on joining funds to firms

2017-11-03: loaded table/sql script for firms office locations into vcdb2 with latitude and longitude coordinates; joined coordinates to all clean base tables for firms, funds, branch offices, joined co and fund coordinates to roundlinejoinerlean in new table: roundlinecoords

2017-11-02: met with Ed; loaded tables/sql script for branch office and fund office locations into vcdb2 with latitude and longitude coordinates

2017-10-27: come up with next steps for matching firms to funds - for geocoding branch offices

2017-10-26: update VC Database Rebuild wiki; identify key for bocore table; verify that fundbasecore table was correctly cleaned after being rebuilt by Ed

2017-10-24: met with Ed to discuss firmbase and branch office tables; find key for firmbasecore table; remove undisclosed firms from both firmbasecore and bocore

2017-10-12: peer edit and put Shelby's blog post into Wordpress; see what needs to be done on VC project; continue literature review for matching models

2017-10-11: finished loading tables (firmbase and branchoffice)

2017-10-6: load data using SQL code into tables, which is on Retrieving US VC Data From SDC

2017-09-29: completed pulling/normalizing data, still need to load data using SQL code into tables, which is on Retrieving US VC Data From SDC

2017-09-28: met with Ed, worked on pulling firm and branch office data from SDC

2017-09-22: join portcos and funds; begin literature review of matching games/venture capital (located in "Matching Entreps to VC's project folder" on E drive)

2017-09-21: work with Ed on research project

2017-09-19: continue to work on joining portcoexits and roundlinejoiner tables in vcdb2, in MatchingEntrepsToVC folder under project management

2017-09-15: work on joining portcoexits and roundlinejoiner; create txt file called "Notes on Matching Funds to portcos" in the "Matching Entreps to VC's project folder" on E drive.

2017-09-14: build table roundlinejoinerapprop (apportion the amounts between funds); work on joining portcoexits and roundlinejoiner

2017-09-27: rebuild portcoexits and work on apportioning amounts in roundlinejoiner

2017-09-07: work with Ed to familiarize with SQL script for VC project/vcdb2 database

2017-09-05: receive project from Ed; reacquaint with wiki, RDP, etc.

Shrey Agarwal

Shrey Agarwal Work Logs (log page)

1/23/18 15:00 - 17:00

Became reacclimatized with the project, spoke with Ed about the direction for the rest of the semester

1/25/18 15:00 - 17:00

Began examining the data on pulled webpages relating to demo days

1/26/18 13:00 - 17:00

Began categorizing demo day pages based on: 1) relevance to accelerators, 2) relevance to the particular accelerator (got to 200)

1/30/18 15:00 - 17:00

Continued working through the demo day pages, spoke with Ed about using the data to work a better set (got to 450)

2/01/18 15:00 - 17:00

Finished the match and created pivot tables to count the number of repetitions (companies going through more than one accelerator)

2/06/18 15:00 - 17:00

Discussed with Matthew the best way to collect the VC data from the repetitions. We tried different matches through our SDC data to no avail

2/08/18 15:00 - 18:00

Continued attempting to match with SDC the different columns. Didn't work without separating the data into individual files, a very tedious process.

2/13/18 15:00 - 17:00

Spoke with Ed about incubators project, will begin as soon as we can time the accelerator startup investments. Ed is expecting us to begin sometime in the next two months, using a similar process as we did for incubators. The process should be handled by a new worker.

2/15/18 15:00 - 17:00

Talked to Ed about next steps for the project. Practiced accessing the CrunchBase database on SQL and brushed up on SQL code.

2/16/18 13:00 - 17:00

Sifted through the database for Crunchbase investment information.

2/20/18 15:00 - 17:00

Pulled the funding rounds table from SQL and matched it with the companies that have received VC funding in order to gather round dates.

2/22/18 15:00 - 18:00

Went through the matched data. Brainstormed ways to get the dates for cohort companies going through accelerators.

2/27/18 15:00 - 17:00

Looked into using the WhoIs Parser in order to find when the companies went through their accelerators.

9/19/17 15:00 - 17:00

Became reacclimatized with the project, spoke with Ed about the direction for the rest of the semester

9/20/17 15:00 - 17:00

Worked on setting up a new pull for the updated SDC data

9/21/17 15:00 - 17:00

Finished the pull and sorted the data from the updated accelerator list

9/22/17 15:00 - 17:00

Tried to set up the matcher with Matthew; ran into some difficulties in PowerShell, which returned a blank output file

9/26/17 15:00 - 17:00

Finished the match and created pivot tables to count the number of repetitions (companies going through more than one accelerator)

9/27/17 15:00 - 17:00

Discussed with Matthew the best way to collect the VC data from the repetitions. We tried different matches through our SDC data to no avail

9/28/17 16:00 - 17:00

Continued attempting to match with SDC the different columns. Didn't work without separating the data into individual files, a very tedious process.

9/29/17 15:00 - 17:00

Spoke with Ed about incubators project, will begin as soon as we can time the accelerator startup investments. Ed is expecting us to begin sometime in the next two months, using a similar process as we did for incubators. The process should be handled by a new worker.

10/02/17 15:00 - 17:00

Talked to Ed about next steps for the project. Practiced accessing the CrunchBase database on SQL and brushed up on SQL code.

10/03/17 15:00 - 17:00

Sifted through the database for Crunchbase investment information.

10/04/17 15:00 - 17:00

Pulled the funding rounds table from SQL and matched it with the companies that have received VC funding in order to gather round dates.

10/06/17 15:00 - 17:00

Went through the matched data. Brainstormed ways to get the dates for cohort companies going through accelerators.

10/11/17 15:00 - 17:00

Looked into using the WhoIs Parser in order to find when the companies went through their accelerators.

10/12/17 15:00 - 17:00

Discovered that the Wayback Machine will not be a good option for identifying the time when a company went through the accelerator. Created a list of VC Companies and their earliest round date. Included a column for the date they went through their accelerators and will fill it in when we find a good method of finding this date.

10/16/17 15:00 - 17:00

Continued working on sorting VCCompanies by their earliest round date.

10/17/17 15:00 - 17:00

Worked with Ben to find a solution to our problem of data acquisition. Finalized earliest round date for VCCompanies.

10/18/17 15:00 - 17:00

Updated our VC data with Ed's help in order to increase the accuracy and completion of our data.

10/19/17 15:00 - 17:00

Organized all of our matched data and updated it in order to reflect the most recent SDC pull with Ed. Matched Crunchbase data with our cohort companies.

10/20/17 15:00 - 17:00

Generated the new list of VCCompanies as well as their earliest round dates.
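
Finding each company's earliest round date can be sketched as follows. This is only an illustration, assuming the funding rounds are available as (company, date) pairs; the function name and input shape are hypothetical and not taken from the project files:

```python
from datetime import date


def earliest_round_dates(rounds):
    """Map each company name to the earliest date among its funding rounds.

    rounds: iterable of (company_name, round_date) pairs (assumed input shape).
    """
    earliest = {}
    for company, round_date in rounds:
        # Keep the smaller (earlier) date seen so far for this company.
        if company not in earliest or round_date < earliest[company]:
            earliest[company] = round_date
    return earliest
```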

10/23/17 15:00 - 17:00

Worked on sorting out the discrepancies in our matched data.

10/24/17 15:00 - 17:00

Went through list of VCCompanies and began adding respective accelerators in order to proceed with VCPercentage table.

10/25/17 15:00 - 17:00

Continued going through list of VCCompanies and adding accelerators.

10/26/17 15:00 - 17:00

Continued going through list of VCCompanies and adding accelerators. Will have this completed on Monday.

10/30/17 15:00 - 17:00

Finished adding all of the accelerators to the list of VCCompanies. Added a column indicating whether or not the company went through two or more accelerators.

10/31/17 15:00 - 17:00

Began compiling data in the column for the dates that a specific company went through an Accelerator.

11/01/17 15:00 - 17:00

Finalized entering dates for Y Combinator cohort companies.

11/02/17 15:00 - 17:00

Continued entering cohort company dates into Excel file.

11/06/17 15:00 - 17:00

Began looking at keywords for identifying the cohort class dates for each company

11/07/17 15:00 - 17:00

Received list from Peter with the accelerator founders matched from the Crunchbase LinkedIn URLs and proceeded to find the links for those founders without a match on Crunchbase. Data found in "Unfound Founders List" in the Fall 2017 folder

Taylor Jacobe

Taylor Jacobe Work Logs (log page)

2017-12-01: Finished up the California post, ready to publish.

2017-11-30: Finished and published the Augusta post. Worked on California post in Wordpress, adding a bit more content suggested by Ed.

2017-11-29: Cleaned up Augusta findings post and cleaned out spam comments on Wordpress; there were almost 2000 spam comments within the last 3 weeks, which is concerning. Maybe there is a reason it has increased so quickly?

2017-11-17: Worked on Augusta Findings post.

2017-11-16: Finished first draft of California post, California Growth (Blog Post). Continued looking into a "Future of Communication" post and what that would look like. Anne also suggested I write a post about Augusta Findings (Blog Post), so I began that!

2017-11-15: Peer edited Yunnie and Dianna's blog post drafts. Anne suggested another post: McNair projects>Agglomeration>PeterHarrison. Research growth of high-tech high-growth enterprises in California from 1986-2016. Use file of maps. Started working on the post.

2017-11-10: Spent the morning cleaning out the spam comments on the blog. More than 1000 of them! Kept investigating Blockchain; I think I've determined that it might not be worth doing a post about because there are already a lot of sources that have published pieces that explain blockchain in simple terms. Continued looking for future blog post ideas: new social media (https://www.techworld.com/social-media/bumble-founder-whitney-wolfe-herd-talks-harvey-weinstein-linkedin-future-of-social-network-3666350/), 3D printing, security in a time of increasing automation and digitalization, the future of communication (smartphones, etc.: what's next?)

2017-11-09: Reorganized work log. Continued researching blockchain and began a draft of a post that will explain the concept in simpler terms and discuss potential impacts of this new technology! Created graphs for the Fund of Funds post. Finished the post and put everything into wordpress.

2017-11-08: Compiled a list of cities in Greater Cincinnati to use for data for blog post. Tried to educate myself on blockchain to eventually write a post about it

2017-11-01: Tried to gather research to improve the VC FOF post. Edited and redrafted. Investigated other potential blog posts.

2017-10-26: Worked on Fund of Funds in VC Blog post

2017-10-25: Looked over summary and edited. Started working on a blog post on the role of fund of funds in venture capital by direction of Anne and Ed

2017-10-23: Worked on a summary document for the Houston Innovation District project, verbal summaries of data analysis

2017-10-19: Worked on Houston Innovation District more. All work is documented on the wiki page

2017-10-18: Working on Houston Innovation District project. Figuring out what we've done and what needs to be done. McNair center servers went down, while I was working, so I lost a decent amount of work that I had been doing to summarize what we had and had to start over when it was rebooted. Cleaned up the wiki page and summarized where we are so far: what data we have & where it is, what data we are currently collecting, and what data we still want/need. Began working on collecting information about tax codes, incentives for development offered in Houston.

2017-10-11: Worked on prep for Houston Innovation District project

2017-10-06: Added more slides and edited. Updated wiki page with info.

2017-10-05: Spent quite a while trying to figure out source data for patent data in slides for Augusta, then worked on cleaning up and adding to Augusta slides that were unfinished/not great, created cybersecurity slide

2017-10-04: Continued working on Augusta project

2017-09-27: Worked on data analysis and research for Augusta Project, looked into Augusta business news (there isn't very much of it!)

2017-09-21: Continued preparation for Augusta Startup Ecosystem and Houston Innovation District Projects

2017-09-20: Preliminary preparations for Augusta and Houston Projects

Yunnie Huang

2017-12-1: Finished blog post on wordpress with help from Tay.

2017-11-28: Accelerator Type List Excel Edits.

2017-11-20: TIF Project from Ed.

2017-11-15: Got advice from Tay about the blog post.

2017-11-14: Organized the list of articles related to Houston startups.

2017-11-13: Literature Review on Accelerators & Incubators

2017-11-10: Organized Literature Review documents into folder and included a few more articles.

2017-11-07: Started writing narrative description of the manufacturing incubator blog from notes. Met with Ed to talk about Literature Review.

2017-11-06: Talked to Ed and edited the literature review.

2017-11-03: Looked through some websites for manufacturing incubators for the blog post.

2017-10-31: Continued blog post for manufacturing incubators. Edited Startup Density lit review

2017-10-30: Edited Startup Density literature review. Started blog post for Manufacturing Incubators

2017-10-27: Finished literature review

2017-10-24: Set up RDP. Continue literature review using Zotero and Textpad.

2017-10-23: First day! Set up and edited wiki page. Started literature review.

Technical

Christy Warden

Christy Warden Work Logs (log page)

2017-12-12: Scholar Crawler Main Program Accelerator Website Images

2017-11-28: PTLR Webcrawler Internal Link Parser

2017-11-21: PTLR Webcrawler

2017-09-21: PTLR Webcrawler

2017-09-14: Ran into some problems with the scholar crawler. Cannot download PDFs easily, since many of the links point to paid websites rather than to PDFs. Trying to adjust the crawler to pick up as many PDFs as it can without having to do anything manually. Adjusted code so that it outputs tab-delimited text rather than CSV and practiced on several articles.

2017-09-12: Got started on Google Scholar Crawling. Found Harsh's code from last year and figured out how to run it on scholar queries. Adjusted provided code to save the results of the query in a tab-delimited text file named after the query itself so that it can be found again in the future.
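
Saving results to a tab-delimited file named after the query might look like the sketch below. The file-naming rule, the column names, and both function names are assumptions for illustration; the actual crawler code may differ:

```python
import csv
import os
import re


def query_to_filename(query):
    """Turn a search query into a safe file name (hypothetical naming rule)."""
    safe = re.sub(r"[^A-Za-z0-9]+", "_", query).strip("_")
    return f"{safe}.txt"


def save_results(query, rows, directory="."):
    """Write (title, url) rows to a tab-delimited text file named after the query."""
    path = os.path.join(directory, query_to_filename(query))
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(["title", "url"])  # assumed columns
        writer.writerows(rows)
    return path
```

Naming the file after the query makes it possible to find the results of an earlier search without rerunning it.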

2017-09-11: Barely started Ideas for CS Mentorship before getting introduced to my new project for the semester. Began by finding old code for pdf ripping, implementing it and trying it out on a file.

2017-09-07: Reoriented myself with the Wiki and my previous projects. Met new team members. Began tracking down my former Wikis (they all seem pretty clear to me thus far about where to get my code for everything). Looking through my C drive to figure out where the pieces of code I have in my personal directory belong in the real world (luckily I am a third degree offender only).

Grace Tan

Grace Tan Work Logs (log page)

2018-08-03: Fixed/debugged minor coding with priority ranking. Helped Connor find timing info for missing companies. Cleaned up wiki pages.

2018-08-02: Redid minor codes with priority ranking.
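
Choosing one minor code by priority when a cohort has several could be sketched like this. The actual ranking comes from the key in Ed's Slack message and is not reproduced here; the function name and the priority list passed in are placeholders:

```python
def pick_minor_code(codes, priority):
    """Return the code from `codes` that appears earliest in `priority`.

    codes: minor codes attached to one cohort (may be empty).
    priority: all known codes, highest priority first (placeholder ordering).
    Returns None when none of the codes appears in the priority list.
    """
    rank = {code: position for position, code in enumerate(priority)}
    ranked = [code for code in codes if code in rank]
    if not ranked:
        return None
    return min(ranked, key=rank.__getitem__)
```

Unlike picking the first code listed, the result here does not depend on the order in which the codes were attached.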

2018-08-01: Entered the rest of the minor codes and arbitrarily picked the first one for those that had multiple codes attached to them.

2018-07-31: Minor coded cohorts based on contents of category group list in the file to rule them all. See Ed's slack message for key/legend. The rows highlighted in red are the ones that I'm not sure about. I wrote a python script to code most of them - E:\McNair\Projects\Accelerators\Summer 2018\codecategory.py . The coded sheet is a google sheet that I will add onto the wiki page once I make one.

2018-07-30: Matched employers with VC firms, funds, and startups. There were 40 matches with firms and funds, and 4 matches with startups. Coded these into 2 columns in Founder Experience table. Updated all wiki pages.

2018-07-27: Reformatted the timing info data to separate out the companies to look like The File To Rule Them All. It is located in E:/McNair/Projects/Accelerators/Summer 2018/Formatted Timing Info.txt. Looked at the WhoIs Parser but Maxine said she figured it out so she will finish it. I will start with Founders Experience on Monday and I do not understand what minorcode lookup means.

2018-07-26: Finished Demo Day Timing Info data. Talked with Ed and Hira about what to do for the last week. Cleaned up Timing Info data.

2018-07-25: Converted the 608 pdfs to txt files using PDF to Text Converter. All of them converted to txt files but some txt files are empty or do not contain the content of the paper. I do not know of a way to fix it or clean up the txt files to get only the txt files that are actually academic papers. Worked on Demo Day Timing Info data.

2018-07-24: Realized that some pdfs did not download properly because the link was not to an immediate pdf. Found all pdfs possible and came up with 608 total and 5 that I could not find pdfs for. Ran pdf_to_txt_bulk_PTLR.py on the 608 pdfs.

2018-07-23: Found my missing pdfs. There were 7 that were not able to download through the script for some reason so I manually downloaded 6 of them and I could not find the pdf for 1 of them. Now I have 612 pdfs on file and am ready to start converting to txt tomorrow.

2018-07-20: Reinstalled pdfminer and the test that GitHub provides works so it should be installed correctly. Ran all the pdf urls through pdfdownloader.py but came up short with 602/613 pdfs. I found 8 papers that did not successfully get downloaded but that still leaves 3 mystery papers and I'm not sure which ones they are. Did some sql to try to figure it out but hasn't worked yet.

2018-07-19: Started on converting pdfs to txt files. I found pdf_to_txt_bulk_PTLR.py in E:/McNair/Software/Google_Scholar_Crawler. I copied this and moved my data to E:/McNair/Software/Patent_Thicket. When I tried to run the program, it was giving me an error when trying to import pdfminer because I originally tried running it in Z, so I installed pdfminer.six and it did not complain with python3.6. It gave me a different error about a missing package, which Wei said we shouldn't touch, so I moved everything back to E. It now runs but cannot convert any pdfs. I have no idea how to fix this.

2018-07-18: Finished running the rest of the 100 pages. Took quite a long time because google scholar was catching me after 2-5 pages rather than 5-10. It helped to switch between the different wifi networks (rice visitor, rice owls, eduroam). Altogether resulted in 958 bibtex files and 613 pdfs from 1000 entries. There might be more entries but I'm not sure where to find them. I saved the data and code onto the rdp by connecting to it from the selenium box.

2018-07-17: Ran google scholar crawler. When google scholar blocks me with a 403 error code, I exit the program and rerun it at the page that it last looked at by clicking on the correct page number before crawling. I finished running through 68/100 pages of google scholar.

2018-07-16: Ran through 10 pages of google scholar first thing without a problem. Tried running through all 100 pages but kept on getting caught. Helped Augi with discrepancies in data and will try google scholar crawler again tomorrow.

2018-07-13: Fixed problem where pdf urls were not saving to txt file. Created another txt file to save urls that are not pdfs. Didn't run into a single recaptcha all morning. Towards the end, it started catching me at the 7th query and forced the program to restart. For some reason, selecting the css element triggered google scholar to find me. I changed the css element tag for the "next" button to the path and I was able to get through the 4th page. It is still not able to click on the actual link but I'm not sure if that's supposed to do anything.

2018-07-12: Figured out how to save BibTeX files to computer. Still had to do recaptcha tests. After giving it some time, I was able to run it completely once but I only got 49 BibTeX files and about half as many pdf links. When I tried to work on it further the recaptcha wasn't loading and gave me the error - "Cannot contact reCAPTCHA. Check your connection and try again." I ended up moving to the selenium computer and spent the rest of the day converting the code to python3 and messing with regex because for some reason it wasn't matching the text correctly.

2018-07-11: Started on Google Scholar Crawler for Patent Thicket Project. I'm not sure what the problem is. The code seems to work except that Google constantly blocks me and makes me do reCAPTCHA tests. I am also not sure if the crawler is saving any data to txt files and if so, where those files are located.

2018-07-10: Finished LinkedIn Crawler! When no search results were found, I did not find the href and instead added the founder and company name to a txt file. Spent way too much time doing reCAPTCHA tests and logging out and logging in again because firefox and wifi were being slow. Cleaned up code and put it on the rdp as well as fixed the wiki page - Crunchbase Accelerator Founders.

2018-07-09: Continued working on the LinkedIn Crawler. I figured out how to get the web element of the first search result using path. You can achieve the same result by looking for the css element, which probably would have been easier. I then used get_attribute('href') on the web element of the first result to get the url of the founder. Note that previously, there was another function to click on the name on the screen, which would open another window with the profile, but I found it easier to just extract the url. Next, I will run the (hopefully) working crawler on all the founder data. Update - ran into an error when there are no search results found.

2018-06-29: Spent a large part of today clicking on road signs and cars to prove to LinkedIn that I am not a robot. Figured out how to find search box with css element instead of xpath. Now trying to get information from search results.

2018-06-28: Tried to figure out the xpath for the search box but came up with no solutions. Halfway through, linkedin discovered that I was a bot so I moved to the selenium computer and used the Rice Visitor wifi. Linkedin still wouldn't let me in so we made another test account (see project page for details). It finally let me in at the end of the day but I set one of the delays to be 3-4 min so that I have time to do the tests that linkedin gives to ensure that you are not a bot. I still have no idea how to find search box xpath.

2018-06-27: Took the dictionary of accelerator to founder UUIDs and formed a table with the UUIDs combined with names of founders, gender, and linkedin_url. The file is in Z:\crunchbase2\FounderAccInfo.txt . Started looking at linkedin crawler documentation and got the crawler to get information from profiles with known urls. It crashed when it tried searching up founder names and accelerators so will work on that tomorrow.

2018-06-26: Ended up finding all founders manually. Then talked to Ed and figured out how to get the founder data off the crunchbase API with a link (see project page). Created a python script that goes through all the API pages for each accelerator API and returns a dictionary of accelerator UUIDs mapped to founder UUIDs. I found 209 founders on the API but 224 manually so we'll look at the discrepancy tomorrow.

2018-06-25: We took the 157 Accelerator UUIDs we found and created a new table that includes all the attributes of the accelerator that we want from organizations.csv called AccAllInfo. Maxine and I then split into our respective projects. I tried joining people to the companies they are linked to in order to find the founders of each accelerator. I found about 90 matches, but there are still a lot of gaps since some accelerators have no founders and others have multiple founders. Still unsure of how to fix this.

2018-06-22: Matched Connor's master list of accelerators with organizations.csv based on homepage_url and company_name. Found 90 that matched along with 76 blanks. Then tried matching with homepage_url or company_name and manually found about 30 more that had slight variations in url or name that we should keep. Using ILIKE we found ~25 more company UUIDs that match with accelerators on the list.
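
The matching above was done in SQL, including ILIKE for near matches. As an alternative illustration of how slight variations in a URL can be handled, the sketch below normalizes homepage URLs before comparing them. It is a swapped-in technique in Python, not the project's SQL, and the function names and input shapes are hypothetical:

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a homepage URL to lowercase host plus path, without scheme, 'www.', or trailing slash."""
    url = url.strip().lower()
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to locate the host
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def match_by_url(accelerators, organizations):
    """Map accelerator name -> organization UUID (or None) using normalized URLs.

    accelerators: list of (name, homepage_url); organizations: list of (uuid, name, homepage_url).
    """
    by_url = {normalize_url(url): uuid for uuid, _name, url in organizations if url}
    return {name: by_url.get(normalize_url(url)) for name, url in accelerators if url}
```

Accelerators that come back as None would still need a name-based or manual check, as described in the entry above.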

2018-06-21: Downloaded all 17 v3.1 csv tables and updated LoadTables.sql to match our data. We did this by manually updating the name and size of the fields. To solve the problem of "" from yesterday, we used regular expressions to change the empty string to nothing (see project page). We then worked with Connor to start extracting the accelerators from the organizations in the Crunchbase data. We found a lot of null matches based on company_name and a few that have the same name but are actually different companies. Maybe try matching with homepage_url tomorrow.
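
The exact regular expression is on the project page and is not reproduced here. A minimal sketch of the idea, assuming comma-separated files in which an empty value is written as a quoted "" field, could look like this (the pattern and function name are illustrative):

```python
import re

# Matches a field that consists only of "" at the start of a line or after a comma,
# when it is followed by a comma or the end of the line.
EMPTY_FIELD = re.compile(r'(^|,)""(?=,|$)')


def blank_empty_fields(line):
    """Replace quoted empty fields ("") with truly empty fields in one CSV line."""
    return EMPTY_FIELD.sub(r"\1", line)
```

Doubled quotes inside a non-empty quoted field are left alone because they are not directly preceded by a comma or the start of the line. The sketch does not handle every CSV edge case, such as a quoted field whose text itself contains a comma followed by two quotes.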

2018-06-20: Learned more SQL. Started working on Crunchbase Data project with Maxine. Old code contained 22 csv tables but new Crunchbase data only has 17 csv tables. We will be using the new Crunchbase API v3.1 (not v3) with only 17 csv tables as data. We then started updating the old SQL tables to align with the 17 tables we have. We ran into a problem where a date-type field contained "" in the data and SQL did not like that. Ed was helping us with this but we have not found a solution yet.

2018-06-19: Set up monitors and continued learning SQL. We were also introduced to our projects. I will be continuing Christy's work on the Google Scholar Crawler as well as working with Maxine to update the Crunchbase data and then use that data to crawl Linkedin to find data on startup founders that go through accelerators.

2018-06-18: Introduced to the wiki, connected to RDP, and learned SQL.

Harrison Brown

Harrison Brown Work Logs (log page)

2017-11-29:

Got the tab-delimited text files written for USITC data. Added detail to project page.

2017-11-29:

Finishing up converting JSON to tab-delimited text, see USITC/JSON_scraping_python. Worked on creating images with ArcGIS

2017-11-13:

Worked on getting JSON to tab-delimited text
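
The project code lives in USITC/JSON_scraping_python. As a rough sketch of the conversion, assuming the JSON is a list of flat records, it could be done with the standard json and csv modules (the function name and field names used with it are hypothetical):

```python
import csv
import json


def json_to_tsv(json_text, fields, out):
    """Write a JSON list of flat records to `out` as tab-delimited text.

    fields: column names to extract, in order; missing keys become empty cells.
    out: any writable text file object.
    """
    records = json.loads(json_text)
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    for record in records:
        writer.writerow([record.get(name, "") for name in fields])
```

Nested JSON objects would need to be flattened first; this sketch only covers the flat case.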

2017-11-01:

Looked at Oliver's code. Got git repository set up for the project on Bonobo. Started messing around with reading the XML documents in Java.

2017-10-30:

Worked on seeing what data can be gathered from the CSV and XML files. Started project page for project.

2017-10-26:

Met with Ed to talk about the direction of the project. Starting to work on extracting information from the XML files. Working on adding documentation to wiki and work log. Looking into work from other projects that may use XML.

2017-10-25:

Found information about a USITC database that we could use. Added this information to the wiki, and updated information on USITC wiki page.

2017-10-19:

Continued to look into NLTK. Talked with Ed about looking into alternative approaches to gathering this data.

2017-10-18:

Trying to figure out the best way to extract respondents from the documents. Right now using exclusively NLTK will not get us any more accuracy than using regular expressions. Currently neither will allow us to match every entity correctly, so trying to figure out alternate approaches.

2017-10-16:

NLTK
- NLTK Information
  - Need to convert text to ascii. Had issues with my PDF texts and had to convert
  - Can use sent_tokenize() function to split document into sentences, easier than regular expressions
  - Use pos_tag() to tag the sentences. This can be used to extract proper nouns
    - Trying to figure out how to use this to grab location data from these documents
  - Worked with Peter to try to extract geographic information from the documents. We looked into tools Geograpy and GeoText. Geograpy does not have the functionality that we would like. GeoText looks to be better but we have issues with dependencies. Will try to resolve these next time.

2017-10-11:

Started to use NLTK library for gathering information to extract respondents. See code in Projects/USITC/ProcessingTexts

2017-10-05:

Made photos for the requested maps in ArcGIS with Peter and Jeemin.

       To access:
       Go to E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS
       The photos can be found in there
       To generate the photos, open ArcMap with the beginMapArc file
       To generate a PNG, click File, then Export to export the photos
       To adjust the data, right-click on the table name in the layers lab, hit Properties, then Query Builder

2017-10-04:

Worked with Peter on connecting ArcGIS to the database and displaying different points in ArcGIS

2017-10-02:

Started work with ArcGIS. Got the data with startups from Houston into the ArcGIS application. For notes see McNair/Projects/Agglomeration

2017-09-28:

Helped Christy with set up on Postgres server. Looked through text documents to see what information I could gather. Looked at Stanford NLTK library for extracting the respondents from the documents.

2017-09-28:

Got the PDFs parsed to text. Some of the formatting is off, so I will need to determine if data can still be gathered.

2017-09-25:

Got 3000 PDFs downloaded. Script works. Completed a task to get emails for people who had written papers about economics and entrepreneurship. Started work on parsing the PDFs to text.

2017-09-20:

Shell program did not work. Created a Python program that can catch all exceptions (url does not exist, lost connection, and improperly formatted url). Hopefully it will complete with no problems. This program is found in the database server under the USITC folder.
Got connected to the database server and mounted the drive onto my computer. Got the list of all the PDFS on the website and started a shell script on the database server to download all of the PDFs. I will leave it running overnight hopefully it completes by tomorrow.
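
The actual program is in the USITC folder on the database server. A minimal sketch of the same idea, assuming the standard urllib module is used and that a failed download should be reported rather than stop the run, might look like this (the function name and return convention are illustrative):

```python
import urllib.error
import urllib.request


def download_pdf(url, dest_path, timeout=30):
    """Try to download `url` to `dest_path`; return (ok, message) instead of raising."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            data = response.read()
    except ValueError as exc:
        # Improperly formatted URL, e.g. a missing scheme.
        return False, f"bad url: {exc}"
    except urllib.error.HTTPError as exc:
        # The server answered with an error status, e.g. 404 when the URL does not exist.
        return False, f"http error {exc.code}"
    except urllib.error.URLError as exc:
        # The resource could not be reached at all.
        return False, f"connection problem: {exc.reason}"
    except OSError as exc:
        # Timeouts and dropped connections while reading.
        return False, f"io error: {exc}"
    with open(dest_path, "wb") as fh:
        fh.write(data)
    return True, "ok"
```

Calling this in a loop over the list of PDF URLs and logging every (False, message) result lets an overnight run continue past individual failures.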

2017-09-17:

Added features to python program to pull the dates in numerical form. Worked on pulling the PDFs from the website. Currently working on pulling them in Python. The program can run and pull PDFs on my local machine but it doesn't work on the Remote Desktop. I will work on this next time.

2017-09-14:

Have a python program that can scrape the entire webpage and navigate through all of the pages that contain section 337 documents. You can see these files and more information on the USITC project page. It can pull all of the information that is in the HTML that can be gathered for each case. The PDFs now need to be scraped; will start work on that next time. Generated a csv file with more than 4000 entries from the webpage. There is a small edge case I need to fix where the entry does not contain the Investigation No.

2017-09-13:

Worked on parsing the USITC website Section 337 Notices. Nearly have all of the data I can scrape. Scraper works, but there are a few edge cases where information in the tables is part of a Notice but does not have Investigation Numbers. Will hopefully finish this next time. Also added my USITC project to the projects page; I did not have it linked.

2017-09-11: Met with Dr. Egan and got assigned project. Set Up Project Page USITC, Started Coding in Python for the Web Crawler. Look in McNair/Projects/USITC for project notes and code.

2017-09-07: Set Up Work Log Pages, Slack, Microsoft Remote Desktop

Jeemin Sim

2017-12-05:

Uploaded tables with KML files in TIF folder in bulk (E:) drive
- Allegheny County
- Atlanta
- Chicago
- Columbus, OH
- Dublin, OH
- Vermont
- Washington, D.C.

Updated documentation for:
- http://www.edegan.com/wiki/TIF_Project#Uploading_TIF_Data_onto_database_.28tigertest.29

shp2pgsql needs to be installed to upload Shapefiles to PostgreSQL database
- applies for Houston, TX and Dallas, TX
- DONE

2017-12-04:

To upload KML file into database with specified table name:

researcher@McNairDBServ:/bulk/tigertest$ ogr2ogr -f PostgreSQL PG:"dbname=tigertest" chicagotif.kml -nln chicagotif

chicagotif table now resides in tigertest database
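
When several KML files need loading, the command above can be assembled per file. The following is a sketch under assumptions: the helper name build_ogr2ogr_command is hypothetical, and only the flags already shown in this log (-f and -nln) are used.

```python
import shlex

def build_ogr2ogr_command(kml_path, table_name, dbname="tigertest"):
    """Build the ogr2ogr argument list for loading a KML file into PostgreSQL."""
    return [
        "ogr2ogr",
        "-f", "PostgreSQL",
        f"PG:dbname={dbname}",
        kml_path,
        "-nln", table_name,  # -nln names the new layer, which becomes the table name
    ]

cmd = build_ogr2ogr_command("chicagotif.kml", "chicagotif")
# Shell-safe rendering of the same command, useful for logging.
printable = " ".join(shlex.quote(part) for part in cmd)
```

The list form can be passed to subprocess.run(cmd, check=True) on a machine where ogr2ogr is installed; the sketch itself only builds the command and does not run it.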

2017-11-30:

Displayed map with TIF districts and startups
Location:

E:\McNair\Projects\Agglomeration\TIF\McNair_Color_Chicago_TIF_and_startups.png

Added information and steps to ArcMap/ArcGIS documentation page
Create a project page for 'Working with POSTGIS' and add instructions for uploading KML file onto POSTGIS
- Command used :
  - Logged in as : researcher@McNairDBServ

/bulk/tigertest$ ogr2ogr -f PostgreSQL PG:"dbname=tigertest" chicagotif.kml

Chicago TIF KML file is currently loaded in tigertest with a table name of Layer0
- Figure out how to change layer name while loading kml file
- Instructions pulled from : http://wiki.wildsong.biz/index.php/Loading_data_into_PostGIS#Loading_data_from_KMZ_files

2017-11-28:

Created and edited ArcMap / ArcGIS Documentation
Plotted points for TIFS and Startups in Chicago in one map.
- Location:

E:\McNair\Projects\Agglomeration\TIF\Jeemin_Chicago_TIF_and_Startups_Attempt1

Used https://mygeodata.cloud/result to convert from KML file to CSV (which was then saved as txt file to be uploaded onto ArcMap)
Text files located in Local Disk (C:) Drive

2017-11-27:

Download Chicago TIF data

2017-11-13:

2017-11-09:

Notes on data downloaded:
- Year 2010-2012 data are based on total population, not 25 yrs or over (case for all other tables)
  - Record appears five times total, with same exact column name
  - For example: 'Total; Estimate; High school graduate (includes equivalency)' appears five times, with different values.
TODO:
- Make Projects page for ACS Data
  - American Community Survey (ACS) Data

2017-11-07:

Yesterday, narrowed down columns of interest from ACS_S1501_educationattain_2016 table.

Id
Id2
Geography
Total; Estimate; Population 25 years and over
Total; Estimate; Population 25 years and over - High school graduate (includes equivalency)
Total; Margin of Error; Population 25 years and over - High school graduate (includes equivalency)
Percent; Estimate; Population 25 years and over - High school graduate (includes equivalency)
Percent; Margin of Error; Population 25 years and over - High school graduate (includes equivalency)
Total; Estimate; Population 25 years and over - Associate's degree	
Total; Margin of Error; Population 25 years and over - Associate's degree
Percent; Estimate; Population 25 years and over - Associate's degree
Percent; Margin of Error; Population 25 years and over - Associate's degree
Total; Estimate; Population 25 years and over - Bachelor's degree	
Total; Margin of Error; Population 25 years and over - Bachelor's degree
Percent; Estimate; Population 25 years and over - Bachelor's degree
Percent; Margin of Error; Population 25 years and over - Bachelor's degree
Total; Estimate; Population 25 years and over - Graduate or professional degree
Total; Margin of Error; Population 25 years and over - Graduate or professional degree
Percent; Estimate; Population 25 years and over - Graduate or professional degree
Percent; Margin of Error; Population 25 years and over - Graduate or professional degree
Percent; Estimate; Percent high school graduate or higher	
Percent; Margin of Error; Percent high school graduate or higher
Percent; Estimate; Percent bachelor's degree or higher	
Percent; Margin of Error; Percent bachelor's degree or higher

Complications:
- For csv files corresponding to years 2015 & 2016, all of the above columns exist.
- For csv files corresponding to years 2005 - 2014, no 'Percent' columns exist
  - Instead their 'Total' columns are percentage values
- For csv file corresponding to year 2005, columns regarding Graduate or professional degree are labeled differently.
- 2012 data doesn't correspond to Population 25 years and over.

Temporary Solution:
- Since the above problems may be specific to this set of tables, will go through csv files and adjust columns.

Python script location:

E:\McNair\Projects\Agglomeration\ACS_Downloaded_Data\pullCertainColumns.py
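
A column-pulling step of this kind might look like the sketch below. It is not the contents of pullCertainColumns.py; the function name pull_columns and the sample data are hypothetical. It keeps only the wanted columns and reports which ones a given year's file lacks, which is the situation described under Complications.

```python
import csv
import io

def pull_columns(csv_text, wanted):
    """Keep only the wanted columns and report which wanted columns are missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    present = [c for c in wanted if c in header]
    missing = [c for c in wanted if c not in header]
    rows = [{c: row[c] for c in present} for row in reader]
    return rows, missing

# Hypothetical two-line file standing in for one year's download.
sample = "Id,Geography,Total; Estimate; Population 25 years and over\n1,Houston,100\n"
rows, missing = pull_columns(
    sample,
    ["Id", "Geography", "Percent; Estimate; Percent bachelor's degree or higher"],
)
```

Note that csv.DictReader keeps only the last value when a header name repeats, so files with duplicated column names would need a positional approach instead.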

2017-10-31:

Finished downloading files from ACS.
Started loading tables into tigertest.
The commands run can be found in

E:\McNair\Projects\Agglomeration\ACS_Downloaded_Data\DataLoading_SQL_Commands.txt

2017-10-30:

Downloaded data from ACS, to be continued
File path:

E:\McNair\Projects\Agglomeration\ACS_Downloaded_Data

Fields of interest:

S1401 SCHOOL ENROLLMENT
S1501 EDUCATIONAL ATTAINMENT
S2301 EMPLOYMENT STATUS
B01003 TOTAL POPULATION
B02001 RACE
B07201 GEOGRAPHICAL MOBILITY
B08303 TRAVEL TIME TO WORK
B19013 MEDIAN HOUSEHOLD INCOME
B19053 SELF-EMPLOYMENT INCOME IN THE PAST 12 MONTHS FOR HOUSEHOLDS
B19083 GINI INDEX OF INCOME INEQUALITY
B25003 TENURE
B25105 MEDIAN MONTHLY HOUSING COSTS
B28011 INTERNET SUBSCRIPTIONS IN HOUSEHOLD
G001 GEOGRAPHIC IDENTIFIERS

2017-10-23:

Talked to Ed with Peter & Oliver about upcoming tasks & projects.
Loaded acs_place table 2017 (does not contain population) on tigertest.
- SQL commands used:

DROP TABLE acs_place;

CREATE TABLE acs_place (
       USPS varchar(5),
       GEOID  varchar(30),
       ANSICODE varchar(30),
       NAME varchar(100),
       LSAD varchar(30),
       FUNCSTAT varchar(10),
       ALAND varchar(30),
       AWATER varchar(30),
       ALAND_SQMI varchar(30),
       AWATER_SQMI varchar(30),
       INTPTLAT varchar(30),
       INTPTLONG varchar(30)
);

\COPY acs_place FROM '/bulk/2017_Gaz_place_national.txt';
--COPY 29578

TODO:
- Find acs place 2016 data for population
- Find larger acs files, ideally at the place level
- Provide more documentation on POSTGIS & geocoding

2017-10-16:

Exported maps of points from the Bay Area each year.
- Map used location: E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS\Jeemin_Bay_Area Points_Every_Year\BayAreaEveryYearMap
- Zoom scale: 1:650.000
Location of Bay Area Points png files:

E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS\Jeemin_Bay_Area Points_Every_Year

2017-10-10:

Discoveries/Struggles regarding .gdb to .shp file conversion:
- Esri Production Mapping (costly)
  - License needs to be purchased: http://www.esri.com/software/arcgis/extensions/production-mapping/pricing
- Use ogr2ogr from gdal package
  - https://gis.stackexchange.com/questions/14432/migrating-geodatabase-data-into-postgis-without-esri-apps
  - Command: ogr2ogr -f "ESRI Shapefile" [Destination of shapefile] [path to gdb file]
  - Problem installing gdal

2017-10-09:

TODO'S:
- Downloading data onto tigertest
  - Road
  - Railway
  - Coastline
  - Instructions: http://www.edegan.com/wiki/PostGIS_Installation#Bulk_Download_TIGER_Shapefiles
- Configure census data from American Community Survey (ACS)
  - 1) Work out what data is of our interest (confirm ACS)
  - 2) Determine appropriate shape file unit:
    - census block vs. census block group vs. census tract
  - 3) Load into tigertest

Done:
- Downloaded data from https://www.census.gov/cgi-bin/geo/shapefiles/index.php

 tl_2017_us_coastline -- 4209
 tl_2017_us_primaryroads -- 11574 
 tl_2017_us_rails -- 176237

- Link found to potentially download ACS data: https://www.census.gov/geo/maps-data/data/tiger-data.html
  - But most files on it come with .gdb extension and not .shp

2017-10-03:

Installed PostGIS; it is now visible in pgAdmin III

ArcGIS (connect to postgis database):
- 1) Open ArcMap
- 2) Either open blank or open existing file/project
- 3) Click on 'Add Data' button with a cross and a yellow diamond (under Selection toolbar)
- 4) Go to the top-most directory by pressing on the arrow that points left-then-up (on the left of home button)
- 5) Click on 'Database Connections'
- 6) Click on 'Add Database Connection' (if 'Connection to localhost.sde' does not exist already)
- 7) Fill in the following fields:
  - Database Platform: PostgreSQL
  - Instance: localhost
  - User name: postgres
  - Password:
  - Database: tigertest
- 8) Press 'OK'
- 9) Now you'll have 'Connection to localhost.sde' in your Database Connections
- 10) Double click on 'Connection to localhost.sde'
- 11) Double click on the table of interest
- 12) Click 'Finish'
- 13) You'll see information populated on map, as one of the 'Layers'
  - Tested with: tigertest.public.copointplacescontains

On running & altering Oliver's script:
- Location: E:\McNair\Projects\OliverLovesCircles\src\python\vc_circles.py
- Ed manipulated file names so that underscores would replace dots (St.Louis --> St_Louis)
- Takes instances and sweep times as arguments, but they have no effect because those variables are hardcoded in the script
- Ran vc_circles.py with the following variables with changed values:
  - SWEEP_CYCLE_SECONDS = 10 (used to be 30)
  - NUMBER_INSTANCES = 16 (used to be 8)
- New output to be found in: E:\McNair\Projects\OliverLovesCircles\out

2017-10-02:

Talked to Harrison & Peter regarding ArcGIS
- Currently have points plotted on Houston
- Trouble interpreting geometry type, as currently reads in from text file
- Documents located in : E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS
Attempted to install the PostGIS spatial extension for PostgreSQL but got a 'spatial database creation failed' error message.
- Referenced instructions:
  - https://www.gpsfiledepot.com/tutorials/installing-and-setting-up-postgresql-with-postgis/
  - http://www.bostongis.com/PrinterFriendly.aspx?content_name=postgis_tut01

2017-09-26:

Created a table that maps a state to the database name.
- http://www.edegan.com/wiki/PostGIS_Installation#Translating_Table_names_to_corresponding_States
Added more GIS-information (functions, realm & outliers to consider)
- http://www.edegan.com/wiki/Urban_Start-up_Agglomeration#GIS_Resources
Visualization in PostGIS or connecting to ArcGIS for visualization (import/export data)
Spatial indexing:
- http://revenant.ca/www/postgis/workshop/indexing.html

2017-09-25:

Talked to Ed about GIS, Census data, and how to determine the correctness of a reported 'place.' Currently the script makes a cross product of each reported place and each existing place, outputting a boolean column that indicates whether the reported place's coordinates fall within a place's geometric boundaries. Another approach we discussed is to first check whether the reported coordinates fall within the reported place's boundaries. If they don't, we'll fall back to the cross product method.
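
The two-stage check discussed above can be illustrated outside the database. This is a sketch only: in PostGIS the containment test would be done with ST_Contains, whereas here a plain ray-casting test stands in for it, and the function names and toy square "places" are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def match_place(point, reported, places):
    """Check the reported place first; fall back to scanning every other place."""
    x, y = point
    if reported in places and point_in_polygon(x, y, places[reported]):
        return reported
    for name, polygon in places.items():
        if name != reported and point_in_polygon(x, y, polygon):
            return name
    return None

# Two toy square "places" standing in for real place geometries.
places = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(2, 0), (3, 0), (3, 1), (2, 1)],
}
```

When most reported places are correct, the first branch answers with a single containment test, and the full scan runs only for the mismatches.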

To add documentation :
- http://www.edegan.com/wiki/PostGIS_Installation
- http://www.edegan.com/wiki/Urban_Start-up_Agglomeration

Discussed the need to maintain venture capital database.

Relevant File paths:
- E:\McNair\Projects\Agglomeration\TestGIS.sql
- Z:\VentureCapitalData\SDCVCData\vcdb2\ProecssingCoLevelSimple.sql
- Z:\VentureCapitalData\SDCVCData\vcdb2\CitiesWithGT10Active.txt

2017-09-21:

Functions for Linear Referencing:

ST_LineInterpolatePoint(geometry A, double measure): Returns a point interpolated along a line.
ST_LineLocatePoint(geometry A, geometry B): Returns a float between 0 and 1 representing the location of the closest point on LineString to the given Point.
ST_Line_Substring(geometry A, double from, double to): Return a linestring being a substring of the input one starting and ending at the given fractions of total 2d length.
ST_Locate_Along_Measure(geometry A, double measure): Return a derived geometry collection value with elements that match the specified measure.
ST_Locate_Between_Measures(geometry A, double from, double to): Return a derived geometry collection value with elements that match the specified range of measures inclusively.
ST_AddMeasure(geometry A, double from, double to): Return a derived geometry with measure elements linearly interpolated between the start and end points. If the geometry has no measure dimension, one is added.
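
To make the idea of linear referencing concrete, the sketch below reimplements in plain Python what interpolating a point along a line means. It is an illustration of the concept under simple planar assumptions, not PostGIS code; the function name line_interpolate_point is chosen here for readability.

```python
import math

def line_interpolate_point(line, fraction):
    """Return the point located at the given fraction (0 to 1) of the line's length."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    segments = list(zip(line, line[1:]))
    lengths = [math.dist(a, b) for a, b in segments]
    remaining = fraction * sum(lengths)
    for (a, b), length in zip(segments, lengths):
        if length > 0 and remaining <= length:
            t = remaining / length
            return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
        remaining -= length
    # Rounding can leave a tiny remainder; the end of the line is the answer then.
    return tuple(line[-1])

# An L-shaped line of total length 20.
path = [(0, 0), (10, 0), (10, 10)]
```

For example, a fraction of 0.75 on this path walks 15 units: the full first segment and half of the second.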

3-D Functions:

ST_3DClosestPoint — Returns the 3-dimensional point on g1 that is closest to g2. This is the first point of the 3D shortest line.
ST_3DDistance — For geometry type Returns the 3-dimensional cartesian minimum distance (based on spatial ref) between two geometries in projected units.
ST_3DDWithin — For 3d (z) geometry type Returns true if two geometries 3d distance is within number of units.
ST_3DDFullyWithin — Returns true if all of the 3D geometries are within the specified distance of one another.
ST_3DIntersects — Returns TRUE if the Geometries “spatially intersect” in 3d - only for points and linestrings
ST_3DLongestLine — Returns the 3-dimensional longest line between two geometries
ST_3DMaxDistance — For geometry type Returns the 3-dimensional cartesian maximum distance (based on spatial ref) between two geometries in projected units.
ST_3DShortestLine — Returns the 3-dimensional shortest line between two geometries

Relevant PostgreSQL Commands:

\dt *.* Show all tables
\q Exit psql

Specificities/Outliers to consider:

New York (decompose)
Princeton area (keep Princeton unique)
Reston, Virginia (keep)
San Diego (include La Jolla)
Silicon Valley (all distinct)

Continue reading from: https://postgis.net/docs/postgis_installation.html

2017-09-20:

Attended first intro to GIS course yesterday
Updated above notes on GIS

2017-09-19:

Useful functions for spatial joins:

sum(expression): aggregate to return a sum for a set of records
count(expression): aggregate to return the size of a set of records
ST_Area(geometry) returns the area of the polygons
ST_AsText(geometry) returns WKT text
ST_Buffer(geometry, distance): For geometry: Returns a geometry that represents all points whose distance from this Geometry is less than or equal to distance. Calculations are in the Spatial Reference System of this Geometry. For geography: Uses a planar transform wrapper.
ST_Contains(geometry A, geometry B) returns true if geometry A contains geometry B
ST_Distance(geometry A, geometry B) returns the minimum distance between geometry A and geometry B
ST_DWithin(geometry A, geometry B, radius) returns true if geometry A is radius distance or less from geometry B
ST_GeomFromText(text) returns geometry
ST_Intersection(geometry A, geometry B): Returns a geometry that represents the shared portion of geomA and geomB. The geography implementation does a transform to geometry to do the intersection and then transform back to WGS84
ST_Intersects(geometry A, geometry B) returns true if geometry A intersects geometry B
ST_Length(linestring) returns the length of the linestring
ST_Touches(geometry A, geometry B) returns true if the boundary of geometry A touches geometry B
ST_Within(geometry A, geometry B) returns true if geometry A is within geometry B
geometry_a && geometry_b: Returns TRUE if A’s bounding box overlaps B’s.
geometry_a = geometry_b: Returns TRUE if A’s bounding box is the same as B’s.
ST_SetSRID(geometry, srid): Sets the SRID on a geometry to a particular integer value.
ST_SRID(geometry): Returns the spatial reference identifier for the ST_Geometry as defined in spatial_ref_sys table.
ST_Transform(geometry, srid): Returns a new geometry with its coordinates transformed to the SRID referenced by the integer parameter.
ST_Union(): Returns a geometry that represents the point set union of the Geometries.
substring(string [from int] [for int]): PostgreSQL string function to extract a substring by start position and length.
ST_Relate(geometry A, geometry B): Returns a text string representing the DE9IM relationship between the geometries.
ST_GeoHash(geometry A): Returns a text string representing the GeoHash of the bounds of the object.

Native functions for geography:

ST_AsText(geography) returns text
ST_GeographyFromText(text) returns geography
ST_AsBinary(geography) returns bytea
ST_GeogFromWKB(bytea) returns geography
ST_AsSVG(geography) returns text
ST_AsGML(geography) returns text
ST_AsKML(geography) returns text
ST_AsGeoJson(geography) returns text
ST_Distance(geography, geography) returns double
ST_DWithin(geography, geography, float8) returns boolean
ST_Area(geography) returns double
ST_Length(geography) returns double
ST_Covers(geography, geography) returns boolean
ST_CoveredBy(geography, geography) returns boolean
ST_Intersects(geography, geography) returns boolean
ST_Buffer(geography, float8) returns geography [1]
ST_Intersection(geography, geography) returns geography [1]

Continue reading from: http://workshops.boundlessgeo.com/postgis-intro/geography.html

2017-09-18:

Read documentation on PostGIS and tiger geocoder
Continue reading from: http://workshops.boundlessgeo.com/postgis-intro/joins.html

2017-09-12:

Clarified University Matching output file.
Helped Christy with pdf-reader, capturing keywords in readable format.

2017-09-11:

Ensured that documentation exists for the projects worked on last semester.

Kyran Adams

Kyran Adams Work Logs (log page)

2018-05-18: Cleaned up demo_day_classifier directory and fleshed out the writeup on the page.

2018-05-16: Wrote a script (classify_all_accelerator.py) to pull all of the unclassified accelerators from the master variable list (if they are not already in the Cohort List page), and then classify them. This works best if the creation years are provided in the Master Variable List. Started the run on the whole dataset. This will definitely pull up a lot of duplicate results, so it might be valuable to run a program at the end to remove duplicates.
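
A duplicate-removal pass of the kind suggested above could be sketched as follows. This is an assumption-laden illustration: it treats two results as duplicates when their URLs match after dropping the scheme, a leading "www." and a trailing slash, and the helper names are hypothetical.

```python
from urllib.parse import urlsplit

def url_key(url):
    """Crude comparison key: host without 'www.' plus path without trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")

def drop_duplicates(urls):
    """Keep the first occurrence of each URL key, preserving the original order."""
    seen = set()
    unique = []
    for url in urls:
        key = url_key(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique

# Hypothetical crawl results.
results = [
    "http://www.example.com/demo-day/",
    "https://example.com/demo-day",
    "http://example.com/cohort",
]
deduped = drop_duplicates(results)
```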

2018-05-11/12: Ran on data; predicted html files are saved in the positive directory. Also determined that the model overfits heavily; more data is probably the only fix.

2018-05-06: Changed crawl_and_classify so that the html pages are separated based on what they are predicted to be. Added support for individual year searching. Started running on actual data. Tuned hyperparameters too, should save to params.txt.

2018-05-04: Same. Also cleaned up directory, wiki. Model now achieves 0.80 (+/- 0.15) accuracy.

2018-05-03: Played around with different features and increased dataset.

2018-04-23: So auto-generated features actually reduce accuracy, probably because there isn't enough data. I've gone back to my hand-picked features and I'm just focusing on making the dataset larger.

2018-04-16: Still working through using auto-generated features. It takes forever. :/ I reduced the number of words looked at to about 3000. This makes it a lot faster, and seems like it should still be accurate, because the most frequent words are words like "demo" and "accelerator". I also switched from using beautiful soup for text extraction to html2text. I might consider using Sublinear tf scaling (parameter in the tf model).
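
The two ideas in this entry, capping the vocabulary at the most frequent words and sublinear tf scaling, can be sketched in plain Python. This is an illustration, not the project's feature code: the helper names are hypothetical, and whitespace tokenization is a simplifying assumption. Sublinear scaling replaces a raw count tf with 1 + log(tf).

```python
import math
from collections import Counter

def top_words(documents, limit):
    """Return the `limit` most frequent words across all documents."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    return [word for word, _ in counts.most_common(limit)]

def sublinear_tf(document, vocabulary):
    """Feature vector with 1 + log(count) for present words and 0 for absent ones."""
    counts = Counter(document.lower().split())
    return [1.0 + math.log(counts[w]) if counts[w] > 0 else 0.0 for w in vocabulary]

# Toy documents standing in for extracted page text.
docs = ["demo day demo day", "accelerator demo"]
vocab = top_words(docs, 2)
features = sublinear_tf("demo day demo", vocab)
```

In scikit-learn, TfidfVectorizer exposes the same two ideas through its max_features and sublinear_tf parameters.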

2018-04-16: I think I'm going to transition from using hand-picked feature words to automatically generated features. This webpage has a good example. I could also use n-grams, instead of unigrams. I might also consider using a SVM instead of a random forest, or a combination of the two.

2018-04-12: Continued increasing the dataset size as well as going back and correcting some wrong classifications in the dataset. I'm wondering whether the accuracy would be improved most by an increased dataset, a different approach to features, or changes to the model itself. I am considering using something like word2vec with, for example, five words before and after each instance of the words "startup" or "demo day" in the pages. The problem with this is that this would need its own dataset (which would be easier to create). However, semantic understanding of the text might be an improvement. Or, maybe, I could train this on the title of the article, because the title should have enough semantic meaning. But even this dataset might have to be curated, because a lot of the 0's are demoday pages, they just don't list the cohorts.

2018-04-11: Increased the dataset size using the classifier. Ironed out some bugs in the code.

2018-04-09: Wrote the code to put everything together. It runs the google crawler, creates the features matrix from the results, and then runs the classifier on it. This can be used to increase the size of the dataset and improve the accuracy of the classifier.

Steps to train the model: Put all of the html files to be used in DemoDayHTMLFull. Then run web_demo_features.py to generate the features matrix, training_features.txt. Then, run demo_day_classifier_randforest.py to generate the model, classifier.pkl.

Steps to run: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then, run crawl_and_classify using python3. It will download all of the html files into the directory CrawledHTMLPages, and then it will generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and then it will save the results to predicted.txt.

2018-04-05: The classifier doesn't work as well when there is an imbalance of positive and negative training cases, so I made it use the same number of cases from each. Also had a meeting, my next task is to run the google crawler to create a larger dataset, which we can then use to improve the classifier.
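
Using the same number of cases from each class amounts to downsampling the larger class. A minimal sketch, assuming labels are 1 for demo day pages and 0 otherwise; the function name and the fixed seed are illustrative choices.

```python
import random

def balance_classes(examples, labels, seed=0):
    """Downsample so both classes contribute the same number of examples."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    positives = [e for e, l in zip(examples, labels) if l == 1]
    negatives = [e for e, l in zip(examples, labels) if l == 0]
    n = min(len(positives), len(negatives))
    chosen = [(e, 1) for e in rng.sample(positives, n)]
    chosen += [(e, 0) for e in rng.sample(negatives, n)]
    rng.shuffle(chosen)
    return chosen

# Six toy examples: two positives and four negatives.
balanced = balance_classes(list("abcdef"), [1, 0, 0, 0, 0, 1])
```

The cost of this approach is that some examples from the larger class go unused, which matters when the dataset is already small.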

2018-04-04: Continued to classify examples, and tried using images as features. It didn't give great results, so I abandoned that. Currently the features are wordcounts in the headers and title. I might consider the number of "simple" links in the page, like "www.abc.com". Complicated links would be used a lot for many things, but simple links are often used to bring someone to a home page, as in a cohort company's page, so this might be a good indicator of a list of cohort companies.
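
The "simple link" feature could be computed roughly as below. This is a sketch under assumptions: a link counts as simple when it has no path beyond "/", no query string and no fragment, and links are found with a regular expression rather than a full HTML parser. The names are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Matches absolute http(s) links inside href attributes.
HREF = re.compile(r'href=["\'](https?://[^"\']+)["\']', re.IGNORECASE)

def count_simple_links(html):
    """Count links that point at a bare home page such as http://www.abc.com."""
    count = 0
    for url in HREF.findall(html):
        parts = urlsplit(url)
        if parts.path in ("", "/") and not parts.query and not parts.fragment:
            count += 1
    return count

# Toy page: two home-page links and one deep link.
page = (
    '<a href="http://www.abc.com">ABC</a> '
    '<a href="https://news.site.com/2018/04/story?id=3">Story</a> '
    '<a href="https://xyz.io/">XYZ</a>'
)
simple_links = count_simple_links(page)
```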

2018-04-02: Mostly worked on increasing the size of the dataset. So far, with just 30ish more examples, there was a large increase in accuracy, so I'm just going to spend the rest of the time doing this. As seen in the graph, this is probably due not only to the size of the dataset but also to its breadth. Also used scikit-learn's built-in hyperparameter grid search; will run this overnight once the dataset is large enough. Another thing I'm thinking about is adding images as a feature, for example the number of images over 200 px wide, because some demo day pages use logos as the cohort company list.

This graph shows the number of training examples given versus the accuracy.

2018-03-28: Changed to using scikit-learn's random forest instead of tensorflow, because this would allow me to see which features have a lot of value and which might be affecting the model negatively. One observation I made is that certain years affect the model highly... Maybe I should generalize it for the occurrence of any year. Also, I discovered that using just hand-picked features, rather than all of the word counts, improved accuracy by 10%. After that, the only other feature I can think of is the number of images in the page or in the center of the page, because often there are images with all of the cohort companies' logos. Tomorrow I am also going to work on hyperparameter tuning and increasing the amount of data we have.

2018-03-26: Rewrote the classifier to handle missing data. Removed normalization of data because it only makes sense for the logistic regression classifier, not the random forest classifier. Even with the reorganized data, I still have a lot of false negatives....

2018-03-22: Continued redoing the dataset for the HTML pages. Once I get enough data points, I can rewrite the code to use this instead.

2018-03-14: Realized that a lot of the entries in the data might be out of order, and that the HTML pages are often different than the actual pages because of errors. I'm going to redo the dataset using the HTML pages.

2018-03-08: Finished the random forest code, but am having some problems with the tensorflow binary. Might be a windows problem, so I'm trying to run it on the linux box. Ran it on Linux, and it seems that the model is mostly outputting 0s. I should rethink the data going into the model.

2018-03-07: Kept categorizing page lists. Researched some other possible models with better accuracy for the task: We might consider dimensionality reduction due to the large number of variables, and maybe gradient boosting/random forest instead of logistic regression. Started implementing random forest.

2018-03-05: Kept categorizing page lists. Most of the inaccuracies are probably just miscategorized data. Realized that some of the false positives may be due to the fact that these pages are demo day pages, but for the wrong company. I should add the company name as a feature to the model. Some other feature ideas: URL length,

2018-03-01: Didn't really help much, but running on the new data with just the pages with cohort lists should improve accuracy. I'll help with categorizing those.

2018-02-28: Finished the new features. Will run overnight to see if it improves accuracy.

2018-02-22: Subtracting the means helped very marginally. I'm going to try coming up with some new ideas for features to improve accuracy. Started working on a program web_demo_features.py that will parse URLs and HTML files and count word hits from a file words.txt. This will be a feature for the ML model. Also met with project team, somebody is going to look through the training output data and look for lists of cohort companies, as opposed to just demo day pages. I will train the model on this, and hopefully get more useful results.

2018-02-21: A lot of the inaccuracy was probably due to mismatched data (for some reason two entries, Mergelane & Dreamit, were missing from the hand-classified data, and some of the entries were alphabetized differently), cleaned up and matched full datasets. However, now, it seems like we might be overfitting the data. I think it might be necessary to input a few different features. Found the KeyTerms.py file (translated it to python3) to see if I could augment it with some other features. Yang also suggested normalizing the features by subtracting their means, so I will start implementing that.
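
Normalizing features by subtracting their means, as Yang suggested, can be sketched as follows. The function name is hypothetical and the matrix is a toy example; each row is one page and each column one feature.

```python
def center_columns(matrix):
    """Subtract each column's mean from that column; return centered data and means."""
    n_rows = len(matrix)
    means = [sum(column) / n_rows for column in zip(*matrix)]
    centered = [[value - mean for value, mean in zip(row, means)] for row in matrix]
    return centered, means

# Two toy rows with two features each.
centered, means = center_columns([[1.0, 10.0], [3.0, 30.0]])
```

The means are returned so that the same values computed on the training data can be subtracted from test data, rather than recomputing them on the test set.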

2018-02-19: Finally got the classifier to work, but it has pretty low accuracy (70% training, 60% testing). I used code from an extremely similar tutorial. Will work on improving the accuracy.

2018-02-14: Kept working on the classifier. Fixed bug that shows 100% accuracy of classifier. Gradient descent isn't working, for some reason.

2018-02-12: Talked with Ed regarding the classifier. Collected data that will be needed for the classifier in "E:\McNair\Projects\Accelerators\Spring 2018\google_classifier". Took a simple neural network from online that I tried to adapt for this data.

2018-02-07: Got tensorflow working on my computer. It runs SO slowly from Komodo. Talked with Ed about the matlab page and the demo day project. Peter's word counter is supposedly called DemoDayHits.py, I need to find that.

2018-02-05: Added comments to the code about what I ~think~ the code is doing. Researched a bit about word2vec. Created Demo Day Page Google Classifier page.

2018-02-01: Tried to document and clean up/refactor the code a bit. Had a meeting about the accelerator data project.

2018-01-31: Found a slightly better solution to the bug (documented on matlab page). Verified that the code won't crash for all types of tasks. Also ran data for the first time, took about 81 min.

2018-01-29: Solved matrix dimension bug (sort of). The bug and solution is documented on the matlab page.

2018-01-26: Started to try to fix a bug that occurs both in the Adjusted and Readjusted code. The bug is that on the second stage of gmm_2stage_estimation, there is a matrix dimension mismatch. Also, I'm not really sure of the point of the solver at all. The genetic algorithm only runs a few iterations (< 5), and it is actually gurobi that does most of the work. Useful finding: this runs astronomically faster with gurobi display turned off (about 20x lol).

2018-01-25: Continued working on matlab page. Deleted unused smle and cmaes estimators. Not sure why we have so many estimators (that apparently don't work) in the first place. Deleted unused example.m. Profiled program and created an option to plot the GA's progress. Put up signs with Dr. Dayton for event tomorrow.

2018-01-24: Kept working through the codebase and editing the matlab page. Deleted "strip_" files because they only work with task="monte_data". Deleted JCtest files because they were unnecessary. Deleted gurobi_gmm... because it was unused.

2018-01-22: Kept working on the Matlab page. Read reference paper in Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists. Talked with Ed about what I'm supposed to do; I'm going to refactor the code and hopefully simplify it a bit and remove some redundancies, and figure out what each file does, what the data files mean, what the program outputs, and the general control flow.

Possibly useful info:

only 'ga' and 'msm' work apparently, I have to verify this
Christy and Abhijit both worked on this
This program is supposed to solve for the distribution of "unobserved complementarities" between VC and firms?

2018-01-19: Wrote page Using R in PostgreSQL. Also started wiki page Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code. Tried to understand even a little of what's going on in this codebase

2018-01-18: Started work on running R functions from PostgreSQL queries following this tutorial. First installed R 3.4.3, then realized we needed R 3.3.0 for PL/R to work with PostgreSQL 9.5. Installed R 3.3.0. Link from tutorial for PL/R doesn't work, I used this instead. To use R from pgAdmin III, follow the instructions in the tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This was run on databases tigertest and template1. You should be able to run the given examples. Another possibly useful presentation on PL/R. Keep in mind, if the version of PostgreSQL is updated, both the R version and PL/R version will have to match it.

Minh Le

Minh Le Work Logs (log page)

2018-08-03:

For some reason, when we search Capital Innovators, there are more options in the "Tools" section. Need to figure out a way around this. Did a quick fix, but nothing permanent.
Finished crawling, started classifying.
Finished classifying.
Pushed the batch to MTurk.

2018-08-02:

Cleaned up code
Published the big MTurk batch.
Got results after 2 hours.
Processed the data and trimmed extra columns off.
Helped Grace with a minor code issue
Helped Maxine with the url classifier
Improved crawler to take date arguments per Ed's request.
Ran the crawler again.

2018-08-01:

Built the SeedDB parser with Maxine and Connor
Finished getting the data from Seed DB and sent it to Connor.

2018-07-31:

Talked to Connor and Maxine to figure out SeedDB
Published the first small batch of MTurk with interjudge reliability (2 workers per HIT) and got good results
Tested SeedDB server
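
With two workers per HIT, interjudge reliability can be summarized by the raw agreement rate and by Cohen's kappa, which corrects for agreement expected by chance. A minimal sketch with hypothetical function names and made-up labels, not the batch's actual results:

```python
from collections import Counter

def agreement_rate(labels_a, labels_b):
    """Fraction of items on which the two workers gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected) / (1 - expected)."""
    n = len(labels_a)
    observed = agreement_rate(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both workers always used the same single label
    return (observed - expected) / (1.0 - expected)

# Made-up answers from two workers on four HITs.
worker_1 = ["yes", "yes", "no", "no"]
worker_2 = ["yes", "no", "no", "no"]
```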

2018-07-30:

Finalized the design for MTurk, sent to Ed for thoughts and opinions
Tried publishing a batch on MTurk using the sandbox, and talked to Connor to test it out together.

2018-07-29:

Worked on HTML mockup for MTurk
Crawled data for MTurk

2018-07-28:

Worked on HTML mockup for MTurk

2018-07-27:

Worked on MTurk

2018-07-26:

Worked on collecting data with others.
Skyped Ed, Hira along with others.

2018-07-25:

Worked with MTurk with Connor
Talked with Ed about the project progress. We agreed that the RNN can wait and that we should focus on collecting the data, because the data seems much more usable now.
Hand-collected data along with fellow interns.

2018-07-24:

Tried to tweak some more. Still no progress. I might change to word2vec finally?
Looked into MTurk

2018-07-23:

The tuning has not been completed yet. However, checking from the results, it seemed that the last 6 parameters did not significantly affect the result?
This tuning had been fruitless. I stopped the code.
Looked into using Yang's preprocessing code.
Maxine was borrowing my crawler for her work and found a bug where the crawler would never take the first result. I think it is because Google updated their web display. Anyway, fixed it.
Worked on the wiki page

2018-07-20:

Ran parameters tuning to tweak 11 different parameters:

dropout_rate_firstlayer, dropout_rate_secondlayer, rec_dropout_rate_firstlayer, rec_dropout_rate_secondlayer, embedding_vector_length, firstlayer_units, secondlayer_units, dropout_rate_dropout_layer, epochs, batch_size, validation_split

Talked to Ed about potentially just doing a test run with the RandomForest model because we need data soon.
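
With 11 parameters, an exhaustive grid grows multiplicatively, so one way to keep a tuning run short is to sample random combinations instead. The sketch below illustrates that idea only: the candidate values are made up, random search is a swapped-in technique, and this is not the tuning script used here.

```python
import math
import random

# Hypothetical candidate values; the ranges actually searched are not recorded.
SEARCH_SPACE = {
    "dropout_rate_firstlayer": [0.1, 0.3, 0.5],
    "dropout_rate_secondlayer": [0.1, 0.3, 0.5],
    "rec_dropout_rate_firstlayer": [0.1, 0.3, 0.5],
    "rec_dropout_rate_secondlayer": [0.1, 0.3, 0.5],
    "embedding_vector_length": [32, 64, 128],
    "firstlayer_units": [50, 100, 200],
    "secondlayer_units": [50, 100, 200],
    "dropout_rate_dropout_layer": [0.1, 0.3, 0.5],
    "epochs": [5, 10, 20],
    "batch_size": [16, 32, 64],
    "validation_split": [0.1, 0.2],
}


def grid_size(space):
    """Number of combinations an exhaustive grid search would have to train."""
    return math.prod(len(values) for values in space.values())


def sample_configs(space, n_trials, seed=0):
    """Draw n_trials random parameter combinations from the search space."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_trials)
    ]


def to_tsv_row(config, names):
    """Format one configuration as a tab-separated line in a fixed column order."""
    return "\t".join(str(config[name]) for name in names)
```

With these illustrative values, grid_size(SEARCH_SPACE) is 118098, while sample_configs(SEARCH_SPACE, 50) trains only 50 models; each configuration can be logged with to_tsv_row under a header made of the parameter names.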

2018-07-19:

Helped Grace with her Thicket project
Helped Maxine with her classifier
Delegated the data collecting task to Connor
Continued optimizing the current Keras LSTM. The accuracy is around 50% right now.

2018-07-18:

Edited the wiki page with more content and ideas.
Tried an MLP with the lbfgs solver and got around 65% test accuracy (the train accuracy of 1.0 suggests overfitting):

FINISHED classifying. Train accuracy score:
1.0
FINISHED classifying. Test accuracy score:
0.652542372881356

Building a full-fledged LSTM (not a prototype) to see how things go

2018-07-17:

Tried tuning the LSTM in keras but did not manage to increase the accuracy by much. Accuracy fluctuates around 50%.

2018-07-16:

Worked to adapt the data to the RNN
Installed keras for BOTH python 2 and 3.
For python2, installed using the command:

pip install keras

For python3, installed by first downloading github repo:

git clone https://github.com/keras-team/keras.git

then running the following commands:

cd keras
python3 setup.py install

Normally, running the command for python 2 should be sufficient, but we have both anaconda2 and anaconda3, and for some reason pip can't detect the anaconda3 folder, so we have to install it manually like that. Note that you can run:

python setup.py install

to install to python2 as well (and skip the pip installation). Source: https://keras.io/
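
A common way around pip targeting the wrong interpreter is to invoke pip as a module of the interpreter that should receive the package. The sketch below shows that general technique; it was not tried in this log, and the helper names are hypothetical.

```python
import importlib.util
import subprocess
import sys


def is_installed(module_name):
    """Return True if this interpreter can import the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", package]


def ensure_installed(package, module_name=None):
    """Install the package into the current interpreter if it is missing."""
    module_name = module_name or package
    if not is_installed(module_name):
        subprocess.check_call(pip_install_command(package))
```

Running this script with python3 and calling ensure_installed("keras") would install keras into that same python3 environment, regardless of which pip is first on the PATH.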

Prototyped a simple LSTM in keras, and the accuracy was 0.53. This is promising; once I complete the full model, the accuracy could be much higher.

2018-07-13:

Finished installing tensorflow for all users. Created a new folder on the DBServer for work that uses tensorflow. The folder can be found here:

Z:\AcceleratorDemoDay

or, if accessed from PuTTY, use the following command:

cd /bulk/AcceleratorDemoDay

The new RNN currently uses word frequencies as input features

2018-07-12:

Followed the instructions here: https://www.tensorflow.org/install/install_linux#InstallingVirtualenv and installed tensorflow with Wei. Specifics are below.
1. Installed the CUDA Toolkit 9.0 Base Installer. The toolkit is in

/usr/local/cuda-9.0

Did NOT install the NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.81 (we believe we already have a different, much newer graphics driver, version 396.26). Installed the CUDA 9.0 samples in

HOME/MCNAIR/CUDA-SAMPLES.

2. Installed Patch 1, 2 and 3. The command to install was

sudo sh cuda_9.0.176.2_linux.run # (9.0.176.1 for patch 1 and 9.0.176.3 for patch 3)

3. This was supposed to be what to do next:

""" Set up the environment variables: The PATH variable needs to include /usr/local/cuda-9.0/bin To add this path to the PATH variable:

$ export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}

In addition, when using the runfile installation method, the LD_LIBRARY_PATH variable needs to contain /usr/local/cuda-9.0/lib64 on a 64-bit system To change the environment variables for 64-bit operating systems:

$ export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Note that the above paths change when using a custom install path with the runfile installation method. """ But when we went to /usr/local/ we saw cuda-9.2, which we did not install. So we are WAITING for Yang to get back to us so we can proceed.
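
The ${VAR:+:${VAR}} idiom in the quoted instructions appends the old value, preceded by a colon, only when the variable is already set and non-empty. The sketch below mimics that behavior in Python to make the logic explicit; the function name is hypothetical, and a change made this way affects only the current Python process and its children, not the login shell.

```python
import os


def prepend_path(var_name, new_dir, environ=None):
    """Mimic `export VAR=new_dir${VAR:+:${VAR}}` on a mapping of variables."""
    environ = os.environ if environ is None else environ
    current = environ.get(var_name, "")
    # ${VAR:+:${VAR}} expands to ":<old value>" only when VAR is set and
    # non-empty, so an unset variable does not leave a trailing colon behind.
    environ[var_name] = new_dir + (":" + current if current else "")
    return environ[var_name]
```

For example, prepend_path("PATH", "/usr/local/cuda-9.0/bin", {"PATH": "/usr/bin"}) gives "/usr/local/cuda-9.0/bin:/usr/bin", while an empty mapping gives just "/usr/local/cuda-9.0/bin".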

For now, I can't build anything without tensorflow, so I am going to continue classifying data.
Helped Grace with Google Scholar Crawler's regex
All installation notes can be seen here: Installing TensorFlow

2018-07-11:

With an extended dataset, the accuracy went down with the random forest model. Accuracy: 0.71 (+/- 0.15)
Built code for an RNN, but ran into the problem of not having tensorflow installed
Helped Grace with her Google Scholar Crawler.
Asked Wei to help with installing tensorflow GPU version.

2018-07-10:

Did further research into how an RNN can be used for classification
Reorganized the code under a new folder, "Experiment", to prepare for testing with a new RNN
Ran the reorganized code to make sure there was no problem. I kept running into this error: "TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule safe"
Apparently this was caused by stray question marks (??) in the column. Removed them and it seems to run fine.
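
This kind of error typically appears when a numeric column contains strings, so the values cannot be checked for finiteness. A minimal sketch of a pre-cleaning step is below; the function names are hypothetical, and dropping the offending rows is only one option (the log entry removed the question marks by hand).

```python
import math


def to_float_or_none(value):
    """Return value as a finite float, or None if it is not numeric (e.g. '??')."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return number if math.isfinite(number) else None


def drop_non_numeric_rows(rows):
    """Keep only rows whose every cell converts cleanly to a finite float."""
    cleaned = []
    for row in rows:
        converted = [to_float_or_none(cell) for cell in row]
        if all(cell is not None for cell in converted):
            cleaned.append(converted)
    return cleaned
```

Running the feature rows through drop_non_numeric_rows before handing them to the classifier removes any row that still contains a stray marker such as "??".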

2018-07-09:

Continued studying machine learning models.
Helped Grace with her LinkedIn Crawler.
Cleaned up working folder.
Populated the project page with some information.

2018-07-06:

Reviewed Augi's classified training data to make sure it meets the requirements.
Continued studying machine learning models and neural nets

2018-07-05:

Studied different machine learning models and different classifier algorithms to prepare to build the RNN.
Worked on classifying more training data.

2018-07-03:

Ran a classifier with 0.84 accuracy on the newly crawled data from the Chrome driver. From observation, the data still was not good enough. I will start building the RNN.
Still waiting for Augi to release the lock on the new Excel data so I can work on it.

2018-07-02:

Why did the code not run while I was logged out of RDP? It had been running for 3 hours the last time I logged off :(
The accuracy got to 0.875 today with just the new improved word list, which may mean the model overfit the data. This result was also a one-off: I never got it again.
Ran the improved crawler again to see how it went. (The run started at 10 AM. About 5 hours in, it had only processed 50% of the list.)
After painfully watching Firefox crawl (literally) through webpages, I installed chromedriver in the working folder and changed DemoDayCrawler.py back to the Chrome WebDriver.
It seems that Firefox has a tendency to pause randomly when I am not logged into RDP keeping an eye on it. Chrome resolves this problem.

2018-06-29:

Delegated Augi to work on building the training data.
Started to work on the classifier by studying machine learning models
Edited words.txt to add new words and remove words that I don't think help with the classification. Removed: march. Added: rundown, list, mentors, overview, graduating, company, founders, autumn.
The new words.txt increased the accuracy from 0.76 to 0.83 in the first run.
The accuracy fluctuated a lot between runs: it got as low as 0.74, and the highest run so far was 0.866.
Note: testing inside of KyranGoogleClassifier instead of the main folder because the main folder was being used to test the new improved crawler.
It also seemed that rundown and autumn were the least important words, each with an importance score of 0.0, so I removed them.
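
Turning a word list into classifier features can be sketched as counting how often each listed word appears in a page's text. The code below is illustrative only: it assumes words.txt holds one word per line, and the function names are hypothetical rather than taken from the classifier.

```python
import re
from collections import Counter


def load_keywords(lines):
    """Read one keyword per line, ignoring blank lines and case."""
    return [line.strip().lower() for line in lines if line.strip()]


def keyword_counts(text, keywords):
    """Count how often each keyword appears as a whole word in the text."""
    tokens = Counter(re.findall(r"[a-z0-9']+", text.lower()))
    return [tokens[word] for word in keywords]
```

Each page then becomes a fixed-length vector with one count per keyword, so adding or removing a word in words.txt adds or removes one feature column.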

2018-06-28:

Continued to find more ways to optimize the crawler: added several constraints and blacklisted websites like Eventbrite, LinkedIn and Twitter. Eventbrite would require a way to bypass its time-expiry script, LinkedIn requires a login before showing details, and Twitter posts are too short and frankly distracting.
Ran improved results on the classifier.
Classified some training data.
Helped Grace debug the LinkedIn Crawler.
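
Blacklisting sites can be done by comparing each result's host name against a set of blocked domains. The sketch below uses the three sites named in this entry; the function names are hypothetical and this is not the crawler's actual filter.

```python
from urllib.parse import urlparse

# Sites named in the log entry above; extend as needed.
BLACKLISTED_DOMAINS = {"eventbrite.com", "linkedin.com", "twitter.com"}


def is_blacklisted(url, blacklist=BLACKLISTED_DOMAINS):
    """True if the URL's host is a blacklisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in blacklist)


def filter_results(urls, blacklist=BLACKLISTED_DOMAINS):
    """Drop blacklisted URLs while preserving the original result order."""
    return [url for url in urls if not is_blacklisted(url, blacklist)]
```

Matching on the parsed host name rather than on a substring of the URL avoids dropping unrelated pages that merely mention one of these sites in their path.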

2018-06-27:

Worked on optimizing and fixing issues with the crawler.
It was observed that we may not need to change our criteria for the demo day pages. The page containing the cohort list often includes dates (a data point we now need to find). I might add more words to the bag of words to improve it further, but that seems unnecessary for now.

2018-06-26:

Finished running the Analysis code (for some reason the shell didn't run after I logged off of RDP).
Talked to Ed about where to head with the code
Connected the 2 projects together: got rid of Kyran's crawler and Peter's analysis script for now (we might want the analysis code later on to see how good the crawler was)
Ran on the list of accelerators Connor gave me. Got mixed results (probably because the 80% is low), and we had to deal with websites that use an expiry timestamp, like Eventbrite (the HTML source contained the list, but rendering the HTML in a web browser did not show it). Found a problem: the crawler only gets the results on the first page, so it would not work if we want to gather a large number of results.

2018-06-25:

Fixed Peter's Parser's compatibility issue with Python3. All code can now be used with Python 3
Ran through everything in the Parser on a small test set.
Completed moving all the files.
Ran the Parser on the entire list.
The run took 3h45m to execute the crawling (not counting the other steps) with 5 results per accelerator.
Update @ 6:00 PM: The Analysis has been running for an hour and 30 minutes and is only 80% done. I need to go home now, but these steps are taking a lot of time.

2018-06-22:

Moved Peter's Parser into my project folder. Details can be read under the folder "E:\McNair\Projects\Accelerator Demo Day\Notes. READ THIS FIRST\movelog".
The current Selenium version and Chrome seem to hate each other on the RDP (throwing a bunch of errors about a registry key), so I had to switch to a Firefox webdriver. Adjusted the code and inserted a bunch of sleep statements.
For some reason (yet to be understood), if I save HTML pages with the utf-8 encoding, it gets mad at me, so I commented that out for now.
The code seemed slow compared to the code in Kyran's project. Might attempt to optimize and parallelize it?
It seems that Python 3 does not support write(stuff).encoding('utf-8')?
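
In Python 3, a file opened in text mode expects str and does the encoding itself, so passing already-encoded bytes to write() raises a TypeError. If the original call followed the Python 2 pattern of encoding before writing, either form below should work; this is a sketch with hypothetical function names, not the Parser's code.

```python
def save_html(path, html_text):
    """Write page source as UTF-8 text; Python 3 encodes on write."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(html_text)


def save_html_bytes(path, html_text):
    """Alternative: encode explicitly and write to a file opened in binary mode."""
    with open(path, "wb") as handle:
        handle.write(html_text.encode("utf-8"))
```

The first form is usually the simpler choice, since reading the page back with open(path, encoding="utf-8") mirrors it exactly.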

2018-06-21:

Continued reading through past projects (it's so disorganized...)
Moved Kyran's Google Classifier to my project folder. Details can be read under the folder "Notes. READ THIS FIRST\movelog".
Tried running the Classifier from a new folder. The shell crashed once on web_demo_feature.py.
Ran through everything in the Classifier. Things seemed to be functioning, with occasional error messages.
Talked to Kyran about the project and cleared up some confusion.
Made a to-do list in the general note file ("Notes. READ THIS FIRST\NotesAndTasks.txt")

2018-06-20:

Set up Work Log page.
Edited Profile page with more information.
Created project page: Accelerator Demo Day.
Made new project folder at E:\McNair\Projects\Accelerator Demo Day.
Read through old projects and started copying scripts over as well as cleaned things up.
Created movelog.txt to track these moving details.
Talked to Ed more about the project goals and purposes

2018-06-19: More SQL. Talked to Ed and received my project (Demo Day Crawler).

2018-06-18: Set up RDPs, Slacks, Profile page and learned about SQL.

Oliver Chang

Oliver Chang Work Logs (log page)

2017-12-02: communicated results of running ECA on 2 stddev table (not the error source); update db and web server software; check if scipy elision is ECA bug (it is not)

2017-12-01: re-tasked Kyran with the hierarchical clustering algorithm implementation; create 2 standard deviations tables; freed up DB space

2017-11-30: stub out implementation, add parsing code and mapping code...just need the meat of the algorithm now

2017-11-27: documentation & bug finding on the parallel enclosing circle project; research hierarchical linkage approach

2017-11-14: hand off work on xpathing to Shelby; walked through some program design decisions

2017-11-13: finish javadoc of common/ and some trickier parts about downloading; added descriptions, results to one-off java/python scripts so that they actually make sense in context

2017-11-10: add javadoc documentation to patent reproducibility project after forgetting half of the stuff myself

2017-10-25: start ingestion of application xml files and deal with all the bugs which accompany that

2017-10-21: create xml explorer script to mass-inspect xpaths (can be found at E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\xml_schema_explorer)

2017-10-20: re-run assignment import, review xpaths

2017-10-19: repopulate patents database and create spreadsheet of assignment equivalencies

2017-10-11: add issues to work on to the PECA wiki page; cleanup PECA git repo; start patent application code from granted patent code and start customizing to new domain

2017-10-03: troubleshoot vc_circles.py and make command line interface a little nicer

2017-10-02: discuss mapping strategies & investigate missing eca data

2017-09-23: make Project/OliverLovesCircles usable and add initial splitting ability

2017-09-22: goal setting & server debugging & meet with Yang

Peter Jalbert

Peter Jalbert Work Logs (log page)

2017-12-21: Last minute adjustments to the Moroccan Data. Continued working on Selenium Documentation.

2017-12-20: Working on Selenium Documentation. Wrote 2 demo files. Wiki Page is available here. Created 3 spreadsheets for the Moroccan data.

2017-12-19: Finished fixing the Demo Day Crawler. Changed files and installed as appropriate to make the LinkedIn crawler compatible with the RDP. Removed some of the bells and whistles.

2017-12-18: Continued finding errors with the Demo Day Crawler analysis. Rewrote the parser to remove any search terms that were in the top 10000 most common English words according to Google. Finished uploading and submitting Moroccan data.
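
Removing search terms that are common English words amounts to a set-membership filter. The sketch below assumes the common-word list is available as one word per line, most frequent first; the function names are hypothetical and this is not the rewritten parser itself.

```python
def load_common_words(lines, limit=10000):
    """Build a lowercase set from the first `limit` non-blank lines of a word list."""
    words = []
    for line in lines:
        word = line.strip().lower()
        if word:
            words.append(word)
        if len(words) >= limit:
            break
    return set(words)


def remove_common_terms(terms, common_words):
    """Keep only search terms that are not common English words."""
    return [term for term in terms if term.lower() not in common_words]
```

Using a set keeps each lookup constant-time, which matters when every term on every crawled page is checked against 10000 words.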

2017-12-15: Found errors with the Demo Day Crawler. Fixed scripts to download Moroccan Law Data.

2017-12-14: Uploading Morocco Parliament Written Questions. Creating script for next Morocco Parliament download. Begin writing Selenium documentation. Continuing to download TIGER data.

2017-12-06: Running Morocco Parliament Written Questions script. Analyzing Demo Day Crawler results. Continued downloading for TIGER geocoder.

2017-11-28: Debugging Morocco Parliament Crawler. Running Demo Day Crawler for all accelerators and 10 pages per accelerator. TIGER geocoder is back to Forbidden Error.

2017-11-27: Rerunning Morocco Parliament Crawler. Fixed KeyTerms.py and running it again. Continued downloading for TIGER geocoder.

2017-11-20: Continued running Demo Day Page Parser. Fixed KeyTerms.py and trying to run it again. Forbidden Error continues with the TIGER Geocoder. Began Image download for Image Classification on cohort pages. Clarifying specs for Morocco Parliament crawler.

2017-11-16: Continued running Demo Day Page Parser. Fixed KeyTerms.py and trying to run it again. Forbidden Error continues with the TIGER Geocoder. Began Image download for Image Classification on cohort pages. Clarifying specs for Morocco Parliament crawler.

2017-11-15: Continued running Demo Day Page Parser. Wrote a script to extract counts that were greater than 2 from Keyword Matcher. Continued downloading for TIGER Geocoder. Finished re-formatting work logs.

2017-11-14: Continued running Demo Day Page Parser. Wrote an HTML to Text parser. See Parser Demo Day Page for file location. Continued downloading for TIGER Geocoder.
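
An HTML to text conversion can be sketched with the standard library's html.parser, as below. This is a minimal illustration that skips script and style contents; the class and function names are hypothetical, and it is not the parser referenced on the Parser Demo Day Page.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML document as one space-joined string."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.text()
```

The resulting plain text is what a keyword matcher would then scan, without markup or inline JavaScript inflating the counts.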

2017-11-13: Built Demo Day Page Parser.

2017-11-09: Running demo version of Demo Day crawler (Accelerator Google Crawler). Fixing work log format.

2017-11-07: Created file with 0s and 1s detailing whether crunchbase has the founder information for an accelerator. Details posted as a TODO on Accelerator Seed List page. Still waiting for feedback on the PostGIS installation from Tiger Geocoder. Continued working on Accelerator Google Crawler.

2017-11-06: Contacted Geography Center for the US Census Bureau, here, and began email exchange on PostGIS installation problems. Began working on the Selenium Documentation. Also began working on an Accelerator Google Crawler that will be used with Yang and ML to find Demo Days for cohort companies.

2017-11-01: Attempted to continue downloading, however ran into HTTP Forbidden errors. Listed the errors on the Tiger Geocoder Page.

2017-10-31: Began downloading blocks of data for individual states for the Tiger Geocoder project. Wrote out the new wiki page for installation, and beginning to write documentation on usage.

2017-10-30: With Ed's help, was able to get the national data from Tiger installed onto a database server. The process required much jumping around and changing users, and all the things we learned are outlined in the database server documentation under "Editing Users".

2017-10-25: Continued working on the TigerCoder Installation.

2017-10-24: Throw some addresses into a database, use address normalizer and geocoder. May need to install things. Details on the installation process can be found on the PostGIS Installation page.

2017-10-23: Finished Yelp crawler for Houston Innovation District Project.

2017-10-19: Continued work on Yelp crawler for Houston Innovation District Project.

2017-10-18: Continued work on Yelp crawler for Innovation District Project.

2017-10-17: Constructed ArcGIS maps for the agglomeration project. Finished maps of points for every year in the state of California. Finished maps of Route 128. Began working on selenium Yelp crawler to get cafe locations within the 610-loop.

2017-10-16: Assisted Harrison on the USITC project. Looked for natural language processing tools to extract complainants and defendants along with their location from case files. Experimented with pulling based on parts of speech tags, as well as using geotext or geograpy to pull locations from a case segment.

2017-10-13: Updated various project wiki pages.

2017-10-12: Continued work on Patent Thicket project, awaiting further project specs.

2017-10-05: Emergency ArcGIS creation for Agglomeration project.

2017-10-04: Emergency ArcGIS creation for Agglomeration project.

2017-10-02: Worked on ArcGIS data. See Harrison's Work Log for the details.

2017-09-28: Added collaborative editing feature to PyCharm.

2017-09-27: Worked on big database file.

2017-09-25: New task -- Create text file with company, description, and company type.

VC Database Rebuild
psql vcdb2
table name, sdccompanybasecore2
Combine with Crunchbasebulk

TODO: Write wiki on linkedin crawler, write wiki on creating accounts.

2017-09-21: Wrote wiki on Linkedin crawler, met with Laura about patents project.

2017-09-20: Finished running linkedin crawler. Transferred data to RDP. Will write wikis next.

2017-09-19: Began running linkedin crawler. Helped Yang create RDP account, get permissions, and get wiki setup.

2017-09-18: Finished implementation of Experience Crawler, continued working on Education Crawler for LinkedIn.

2017-09-14: Continued implementing LinkedIn Crawler for profiles.

2017-09-13: Implemented LinkedIn Crawler for main portion of profiles. Began working on crawling Experience section of profiles.

2017-09-12: Continued working on the LinkedIn Crawler for Accelerator Founders Data. Added to the wiki on this topic.

2017-09-11: Continued working on the LinkedIn Crawler for Accelerator Founders Data.

2017-09-06: Combined founders data retrieved with the Crunchbase API with the crunchbasebulk data to get linkedin urls for different accelerator founders. For more information, see here.

2017-09-05: Post Harvey. Finished retrieving names from the Crunchbase API on founders. Next step is to query crunchbase bulk database to get linkedin urls. For more information, see here.

2017-08-24: Began using the Crunchbase API to retrieve founder information for accelerators. Halfway through compiling a dictionary that translates accelerator names into proper Crunchbase API URLs.

2017-08-23: Decided with Ed to abandon LinkedIn crawling to retrieve accelerator founder data, and instead use crunchbase. Spent the day navigating the crunchbasebulk database, and seeing what useful information was contained in it.

2017-08-22: Discovered that LinkedIn Profiles cannot be viewed through LinkedIn if the target is 3rd degree or further. However, if entering LinkedIn through a Google search, the profile can still be viewed if the user has previously logged into LinkedIn. Devising a workaround crawler that utilizes Google search. Continued blog post here under Section 4.

2017-08-21: Began work on extracting founders for accelerators through LinkedIn Crawler. Discovered that Python3 is not installed on RDP, so the virtual environment for the project cannot be fired up. Continued working on Ubuntu machine.

Shelby Bice

Shelby Bice Work Logs (log page)

2017-12-08 1:00 pm - 2:45 pm - updated the ER Diagram on my project page to include the three tables for reissue, plant, and design patents respectively. Finished typing up the status of the project as I am leaving it with notes to Oliver and Ed

2017-12-07 3:15 PM - 4:45 PM (came in late due to finals) - finished debugging additions to Oliver's code for the tables that are related to design, reissue, and plant patents, added a troubleshooting section to Oliver's page with instructions on how to deal with issues importing the project.

2017-12-04 2:45 pm - 4:00 pm - continued debugging and started typing up troubleshooting tips for the next person who alters the patent code

2017-12-01 3:15 pm - 5:00 pm - ran code (and ran into errors) which I have been working on fixing. If I don't finish today, I'll continue doing so on Monday. I plan to write up some of the mistakes I made in adding the tables to the database and add them to Oliver's page, with his permission, so that people in the future who are not familiar with the code (like I wasn't) hopefully won't fall into the same pitfalls. The main things are 1) how to set up a maven project (if IntelliJ doesn't automatically set it up for you when you open/import the project), and 2) how to set up the data source so you can run SQL scripts and actually load data into the database on the RDP.

2017-11-30 1:55 pm - 3:55 pm - continued altering code. Wrote creation tables script in SQL for creating tables for the design, reissue, and plant patents, and went through the checklist to make sure I had done everything to create these new tables based on Oliver's Reproducible Patent Data page. Will definitely run code tomorrow and will type up the exact process I went through to create new tables.

2017-11-17 2:15 pm - 5:00 pm - continued altering code to include the special fields for plant, reissue, and design patent. Their models and the changes to XmlParser relevant to those models should be done. I will go through the rest of the code and check where else I need to make alterations next time I am in.

2017-11-16 8:45 am - 10:30 am - continued altering code to include the special fields for plant, reissue, and design patents

2017-11-14 8:30 - 10:30 am - met with Oliver to go over his code, began working on scripts and finding paths for the fields unique to design, reissue, and plant patents.

2017-11-10: 2:00 - 5:00 pm - updated format of worklog, finished researching fields of the new design, utility, and reissue tables, and documenting what each table contains

2017-11-03: 2:30 - 3:30 pm, 3:45 pm - 5:00 pm (I went to get a snack because I was starving!) - Finished up finding unique fields, added them to database design, and researched what the fields represented - again, this is taking a while since there is little to no documentation online about what fields on these XML files actually represent.

2017-11-02: 2:15 pm - 4:00 pm - made a main data page and continued going through the xpaths Oliver found. Notified Michelle so she can continue adding patent pages

2017-10-31: 9:00 am - 10:30 am - started trying to determine from Oliver's research what paths are unique to utility, reissue, and plant patents that we have not yet accounted for in our design of the table

2017-10-27: 2:00 pm - 3:30 pm - Kept updating tables. For some of the new values I've been adding to tables (I didn't realize we had XPaths for some fields) I'm struggling to find out exactly what the field means (for example, the date on a citation - is this the date the citedpatent was granted? or is the date that the citingpatent cited the citedpatent?) so the process is slow going. I want to get the documentation right though so in the future someone can look up and see exactly what each field in the table represents and not have to guess

2017-10-26: 2:00 pm - 4:00 pm - worked on updating the designs for the patent databases tables based on what was discussed on Friday (specifically adding inventors and lawyers tables, altering fields on patent tables) - will continue this on Friday

2017-10-20: - 2:00 pm - 5:00 pm - met with Ed and Oliver about patent database design

2017-10-19: 2:00 pm - 4:00 pm - finished up ER diagram and description of all tables for patent database, started reading papers concerning creating an inventor's database (looks like other research groups have merged the USPTO data with the Harvard Dataverse data in order to create an inventors table)

2017-10-17: 8:45 am - 10:30 am - continued trying to solve twitter issues, worked on ER diagram, skimmed through a paper I found (linked at the top of the page for Redesign Assignment and Patent Database) to see how they cleaned the data, since I assume that will be the next step to research after finishing the ER diagrams

2017-10-12: 2:30 pm - 5:00 pm - wrote blog post on Grace Hopper Celebration and attempted to solve Twitter issue

2017-10-03: 8:45 am - 10:30 am - finished adding descriptions to the fields for each table in the patent database, started work on an ER diagram

2017-09-29: 2:00 pm - 5:00 pm - finished adding the design for the Patent database to the project page, added descriptions of the fields for each table to the project page including the datatype that I think the field will be when it's loaded into the database

2017-09-28: 9:00 am - 10:30 am - continued going over and documenting last semester's patent database design and adding the details to the Redesign Assignment and Patent Database project page. Additionally, I began trying to determine how to match up the information in the Document_Info table in Assignment to match up with a patent_id in the Patent table in Patent.

Tomorrow I will finish adding the design for the Patent database to the project page, add descriptions of the fields for each table to the project page, and start working on ER diagrams for the two databases.

Links for creating ER diagrams: https://erdplus.com/#/standalone or https://creately.com/app/?tempID=hqdgwjki1&login_type=demo#

2017-09-26: 8:45 am - 10:00 am - continued working on design of Assignment database by checking my design against the work done last semester on the assignment data restructure to make sure I didn't miss anything major. Began going over my patent database design from last semester to tweak it. Will need to sync up with Joe Reilly to see if there are any new fields that we are pulling from the data. Additionally, I made a new project page called Redesign Assignment and Patent Database that encompasses the new design for the Assignment database and Patent database redesign and moved some of the notes from McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper to the project page.

The main takeaway from looking over Patent Assignment Data Restructure is that, after assembling the table according to my design (which doesn't seem to have any contradictions with the Patent Assignment Data Restructure), there will be multiple steps for cleaning the data, specifically the fields relating to location and address in the assignment table. While the Patent Assignment Data Restructure mentions connecting to the Patent database, it is not clear from the page what field would be used to connect to the Patent database.

2017-09-23: 2:00 pm - 4:00 pm - continued working on design of Assignment database and how it will connect to Patent database by writing out what will be in each table in Assignment and questions about different possible structures of tables that we will have to address before finalizing the design - the notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper. Questions are highlighted in yellow throughout the document

2017-09-22: 8:30 am - 10:30 am - continued looking at paper on USPTO assignment data and adding to the notes on what the design of that database should look like, specifically on what I need for different tables and what I don't know yet about the design. Had to set up connection to RDP again due to technical issues.

2017-09-15: 2:00 pm - 5:00 pm - introduced to new patent database project, reviewed and took notes on USPTO Assignment data (notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper)

Yang Zhang

Yang Zhang Work Logs (log page)

2017-09-28: the LSTM model achieves 67% testing accuracy

2017-09-26: modify the previous model and train on the "longdescriptionindu.txt"

2017-09-22: the two models both can achieve 90+% training accuracy and ~60% testing accuracy. Notice that the task is hard even for humans and thus 60% is acceptable, plus the baseline is under 10% for random guessing

2017-09-21: start to build two different kinds of deep neural networks (Convolutional and LSTM) to classify the companies' industry, specifically for the file "Descriptions.txt"

2017-09-19: accounts setup and get familiar with the rules

Administrative

Cindy Ryoo

Cindy Ryoo Work Logs (log page)

2018-05-10: Researched for Accelerator Seed List (Data)

2018-05-09: Researched for Accelerator Seed List (Data)

2018-05-08: Researched for Accelerator Seed List (Data)

2018-05-07: Researched for Accelerator Seed List (Data)

2018-05-06: Researched for Accelerator Seed List (Data)

2018-05-05: Researched for Accelerator Seed List (Data)

2018-05-04: Researched for Accelerator Seed List (Data)

2018-05-03: Researched for Accelerator Seed List (Data)

2018-05-02: Researched for Accelerator Seed List (Data)

2018-05-01: Researched for Accelerator Seed List (Data)

2018-04-30: Researched for Accelerator Seed List (Data)

2018-04-27: Researched for Accelerator Seed List (Data)

2018-04-26: Researched for Accelerator Seed List (Data)

2018-04-25: Researched for Accelerator Seed List (Data)

2018-04-24: Researched for Accelerator Seed List (Data)

2018-04-23: Researched for Accelerator Seed List (Data), practiced regex

2018-04-19: Researched for Accelerator Seed List (Data), practiced regex

2018-04-18: Researched for Accelerator Seed List (Data), practiced regex

2018-04-17: Researched for Accelerator Seed List (Data), learned regex

2018-04-13: Researched for Accelerator Seed List (Data)

2018-04-12: Researched for Accelerator Seed List (Data)

2018-04-11: Researched for Accelerator Seed List (Data)

2018-04-10: Researched for Accelerator Seed List (Data)

2018-04-05: Researched for Accelerator Seed List (Data)

2018-04-04: Researched for Accelerator Seed List (Data)

2018-04-03: Researched for Accelerator Seed List (Data)

2018-03-29: Researched for Accelerator Seed List (Data)

2018-03-28: Researched for Accelerator Seed List (Data)

2018-03-27: Researched for Accelerator Seed List (Data)

2018-03-22: Researched for Accelerator Seed List (Data)

2018-03-20: Researched for Accelerator Seed List (Data)

2018-03-09: Researched for Accelerator Seed List (Data)

2018-03-08: Organized list of media reports for Beebe and Diamond

2018-03-06: Compiled list of press contacts for issue brief, found articles for Agglomeration Lit Review

2018-02-27: Found articles for Agglomeration Lit Review

2018-02-22: Found articles for Agglomeration Lit Review, constructed bibliography for Agglomeration project

2018-02-20: Researched for TIF Project wiki page, researched for Accelerator Seed List (Data)

2018-02-15: Researched for TIF Project wiki page

2018-02-13: Edited and researched for TIF Project wiki page

2018-02-08: Edited TIF Project wiki page, finished proofreading Women 2017 paper

2018-02-07: Proofread Women 2017 paper, edited TIF Project wiki page

2018-02-06: Filled out Concur reports, proofread Women 2017 paper

2018-02-02: Worked on Women 2017 bibliography

2018-02-01: Worked on Women 2017 bibliography

2018-01-30: Drafted conference invitation letters, proofread report, filled out Concur reports, compiled lists of catering companies, conference paper deadlines, and university contact information

2018-01-26: Assisted with Texas Innovation Conference

2018-01-25: Proofread report, drafted conference invitation letters, researched media reports, organized references for women & entrepreneurship

2018-01-24: Organized and searched for literature reviews

2018-01-23: Organized media reports, filled out Concur reports, organized meetings

2018-01-19: Filled out Concur reports

2018-01-18: Created posts for Hootsuite

2018-01-11: Created spreadsheet, managed Hootsuite

2018-01-09: Filled out Concur reports, created Google Cloud Account wiki page, proofread report

Lin Yang

Lin Yang (Work Log)

Michelle Huang

Michelle Huang Work Logs (log page)

2018-08-16: Updated Interns list

2018-08-15: Updated tables for paper

2018-08-14: Updated tables for paper
