Difference between revisions of "Maxine Tao (Work Log)"


Revision as of 16:39, 29 June 2018

Summer 2018

6/21 -- Downloaded Crunchbase data using API version 3.1, loaded 17 files into the crunchbase2 database, checked each table to make sure the specs matched the new data, and updated line counts. Grace and I ran into an issue with blank strings in date fields: dates given as "" were not being read as null. We fixed this using a one-line command that we've written up on Crunchbase Data. Later we used Connor's master list of 166 accelerators and tried to create a table of accelerators and their UUIDs by using the 'organizations' table. Some names matched multiple times and some did not match at all, so we ended up with 179 matches, which we will clean tomorrow.
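
A minimal sketch of the blank-date problem, written in Python rather than the one-line command recorded on Crunchbase Data (which is not reproduced here). The column names and sample rows are made up for illustration; the only point is that "" in a date field gets turned into a null before loading.

```python
import csv
import io

# Hypothetical date columns; the real Crunchbase files may use different names.
DATE_COLUMNS = {"founded_on", "closed_on"}

def blank_dates_to_null(rows, date_columns=DATE_COLUMNS, null_marker=None):
    """Yield copies of rows in which empty-string dates become null_marker."""
    for row in rows:
        cleaned = dict(row)
        for col in date_columns:
            value = cleaned.get(col)
            if value is not None and value.strip() == "":
                cleaned[col] = null_marker
        yield cleaned

# Made-up sample data standing in for one of the downloaded files.
sample = io.StringIO(
    "company_name,founded_on,closed_on\n"
    'Example Co,2012-05-01,""\n'
    'Other Co,"",""\n'
)
rows = list(blank_dates_to_null(csv.DictReader(sample)))
for row in rows:
    print(row)
```

When writing the cleaned rows back out for a database load, null_marker could be set to whatever null string the loader expects instead of None.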

6/22 -- Loaded the Accelerator Master List as a table and matched on accelerator name or accelerator URL. Manually removed bad results that had the same name but different URLs, or the same URL but different names. There were 34 entries from the master accelerator list that could not be matched to anything in the Crunchbase table 'organizations'. Grace and I manually searched for these using ILIKE and found a number of matches that we added back into our spreadsheet of matches. Now we have a clean list of accelerator names, their matches from the Crunchbase data, and their UUIDs.
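
A rough Python sketch of the name-or-URL match, not the query we actually ran. The field names, the normalization rules, and the sample records (including the placeholder UUIDs) are all assumptions for illustration. It flags which key each match came from, so results that agree on only one of the two can be checked by hand.

```python
def normalize_name(name):
    """Lowercase and collapse whitespace so trivially different names compare equal."""
    return " ".join(name.lower().split())

def normalize_url(url):
    """Strip scheme, leading 'www.' and trailing slash (an assumed, simple rule)."""
    url = url.lower().strip()
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
    if url.startswith("www."):
        url = url[4:]
    return url.rstrip("/")

def match_accelerators(master, organizations):
    """Return (master name, org name, org uuid, same_name, same_url) for each match."""
    matches = []
    for acc in master:
        for org in organizations:
            same_name = normalize_name(acc["name"]) == normalize_name(org["name"])
            # An empty URL should never count as a URL match.
            same_url = bool(acc["url"]) and normalize_url(acc["url"]) == normalize_url(org["url"])
            if same_name or same_url:
                matches.append((acc["name"], org["name"], org["uuid"], same_name, same_url))
    return matches

# Made-up records; the UUIDs are placeholders, not real Crunchbase values.
master = [
    {"name": "Example Accelerator", "url": "http://www.example-accelerator.test/"},
    {"name": "Unmatched Labs", "url": ""},
]
organizations = [
    {"name": "example accelerator", "url": "https://example-accelerator.test", "uuid": "uuid-1"},
    {"name": "Example Accelerator", "url": "https://another-site.test", "uuid": "uuid-2"},
    {"name": "Something Else", "url": "", "uuid": "uuid-3"},
]

matches = match_accelerators(master, organizations)
matched_names = {m[0] for m in matches}
unmatched = [acc["name"] for acc in master if acc["name"] not in matched_names]

for m in matches:
    print(m)
print("unmatched:", unmatched)
```

In this sample the second match agrees on name but not on URL, which is the kind of row that needed manual review, and the unmatched list corresponds to the entries we then searched for with ILIKE.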


6/25 -- Updated the list of accelerators and their UUIDs with Connor and Grace (we now have 163 matches) and created a table in database crunchbase2 called 'AccUUIDsFinal'. This table has 3 columns: accelerator names from the master list, accelerator names from Crunchbase, and accelerator UUIDs from Crunchbase. Then we joined this table back to the needed info fields from Crunchbase. This new table is called 'AccAllInfo'. From this table, joining accelerator UUIDs to company UUIDs does not work: it returns investors that have invested in accelerators. From this, Connor and I figured that company_name/company_uuid actually refers to the company being invested in. Joining accelerator names to investor names also returns nothing. However, when I manually searched for Y Combinator as an investor name, I got results back. Not sure what is going on; I think the accelerator names to investor names join should work.

6/26 -- Fixed yesterday's issue of no matches. The problem was that the values in the investor_names field were surrounded with curly braces. I removed these and a clean version is saved in 'funding_rounds-no brackets.txt'. I found that matching accelerator UUIDs to investor UUIDs gives more matches than matching accelerator names to investor names. There are 631 matches, most of which are labeled as seed-type investments.
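
A small Python illustration of why the name join came back empty and what removing the braces does. This is only a sketch, not the command used to produce the cleaned file, and it assumes each investor_names value holds a single name wrapped in one pair of braces.

```python
def strip_braces(value):
    """Remove one pair of surrounding curly braces, if present."""
    value = value.strip()
    if value.startswith("{") and value.endswith("}"):
        value = value[1:-1]
    return value

raw_investor_name = "{Y Combinator}"   # how the field looked before cleaning
accelerator_name = "Y Combinator"      # how the name appears in the accelerator list

# An exact comparison fails on the raw value, so a join on names finds nothing.
print(raw_investor_name == accelerator_name)
# After stripping the braces the two values compare equal.
print(strip_braces(raw_investor_name) == accelerator_name)
```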

6/27 -- Filled in a spreadsheet of the unique accelerators I got from yesterday's matches, with flags indicating whether or not they take equity and notes about specifics. This is incomplete; there are some that I'm not sure about or couldn't find information for. Also helped Connor with manually filtering out duplicated company names. Helped Grace with the LinkedIn crawler; it seems to work for founders that we have URLs for, but it crashes otherwise.

6/28 -- Worked with Minh and Grace to debug the LinkedIn crawler. We had an issue with the XPath of the LinkedIn search box. Also helped Connor with filling in accelerator terms on the master variable list. I filtered the list of accelerators and the companies they've invested in by investment amount. If the amounts match what is given on the accelerator's website, I put them into a separate sheet under 'Accelerators and Investments'.

6/29 -- I filtered out the accelerator and investment matches that had the same data as the terms of joining given on the accelerator website. Then I took this data and matched it against the cohort list for companies without cohort years. I was only able to find 5 companies, which means this approach will not get us the data we want. After calling Ed, I matched a list of company names (from our data) to itself and a list of company names (from Crunchbase) to itself. These two files have not been cleaned, but they are in McNair\Software\Database Scripts\Crunchbase2 and have -MATCHED at the ends of their file names.