Difference between revisions of "Venture Capital (Data)"

From edegan.com
Jump to navigation Jump to search
Line 214: Line 214:
 
  insurance company focused on renters insurances';
 
  insurance company focused on renters insurances';
 
  --1
 
  --1
 +
 +
==Removing duplicates from ipos table==
 +
If you create the following two tables you'll notice that there are duplicate entries in the ipos table.
 +
DROP TABLE ipospkey;
 +
CREATE TABLE ipospkey AS
 +
SELECT *,
 +
concat(issuer::text, issuedate::text, statecode::text, principalamt::text) AS pkey
 +
FROM ipos;
 +
--10440
 +
 +
DROP TABLE iposcount;
 +
CREATE TABLE iposcount AS
 +
SELECT issuer, issuedate, statecode, COUNT(pkey)
 +
FROM ipospkey
 +
GROUP BY issuer, issuedate, statecode;
 +
--9491

Revision as of 17:26, 26 June 2017


McNair Project
Venture Capital (Data)
Project logo 02.png
Project Information
Project Title Venture Capital (Data)
Owner Adrian Smart, Jake Silberman
Start Date
Deadline
Keywords Data
Primary Billing
Notes
Has project status Active
Copyright © 2016 edegan.com. All Rights Reserved.


Project Objective

To create a master venture capital data set on which all other projects requiring venture capital data will be built.

Specifically, we will be building data to support the following:

  • Lists and counts by year as well as state, MSA, and city, zip code, etc.
  • Identify seed, early and later stage deals (first investment), dollars invested and rounds
  • Identify transactional VC dollars invested and rounds
  • Acquisitions and IPOs of VC backed firms
  • Identify when VC backed firms are "alive"
  • Geocode all portfolio companies
  • Geocode VC headquarters and branch offices
  • Compute distances between investors and their portfolio companies
  • Identify lead investors for portfolio companies
  • Identify top 100 VCs

Retrieving data

Old notes are Old Venture Capital Data Work

Retrieving US VC Data From SDC requires SDC platinum, perl scripts and SQL scripts, as well as some manual processing.

Scripts and other info

SDC and Perl Scripts are in:

E:\McNair\Projects\VC Database

This includes:

  • NormalizeFixedWidth.pl
  • RoundOnOneLine.pl - Needed for round on one line

SQL Scripts and finished data are in:

Z:\VentureCapitalData\SDCVCData

Notes:

  • Portfolio companies attributes to be extracted from roundlevel information
  • Round-on-one-line processed using RoundOnOneLine.pl and NormalizeFixedWidth.pl
  • Firms includes branch office so attributes must be extracted
  • Portfolio company descriptions - just the portco name, state, date of first inv, and the long description - has to be custom processed.

The next steps are detailed in VC Database Rebuild.


Additional datafiles (in E:\McNair\Projects\VC Database):

  • GeocodedVCData.txt 43,724 records, tab-delimited with companynames but with "none" for some geocoords.

Processing the base tables

The SQL scripts need to do the following:

  • Extract company infomation from roundbase then process it
    • Add variable for growth (1), transactional (2), or non-VC (3) (see below)
    • Create clean 5-digit ZIP field
  • Extract firm information from firmbase, then:
    • Add indicator variable for US
    • Create clean 5-digit ZIP field
  • Build out BranchOffice (BO) table
  • Build lookups for:
    • State Code
    • raisestatus

SQL file to process data is:

Z:/VentureCapitalData/SDCVC/ProcessData.sql

Lookup Tables

Stage of Investment

We are generating 6 binary indicators as follows:

Growth VC (4)
--------------
Seed (1)
Early Stage (2)
Later Stage (3)
Tranactional VC (5)
--------------
Acq. for Expansion
Acquisition
Bridge Loan
Expansion
Pending Acq
Recap or Turnaround
Exclude (non-VC) (6)
--------------
LBO
MBO
Open Market Purchase
PIPE
Secondary Buyout
VC Partnership
Other

Building companybase and round

Both are built from roundbase which contains portfolio companies and round details. The scripts are below.

CREATE TABLE companybase AS
SELECT DISTINCT 
coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode,
statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip 
FROM roundbase
ORDER BY coname;
CREATE TABLE round AS
SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors
FROM roundbase
ORDER BY coname;

Building stageflags

Stageflags are built from the stage3 column of round. Using the tables above, I made 6 flags for seed, early, later, growth, transaction and exclude. The script is below. Verify that no row entries are missing flags in excel by copying the round table to a txt file. The last 3 columns should always sum to 1 for every row entry. There should be no zero sums and no sums greater than 1.

CREATE TABLE stageflags AS
SELECT coname, stage3,
CASE
 WHEN stage3 = 'Seed' THEN 1
 ELSE 0
 END AS seedflag,
CASE
 WHEN stage3 = 'Early Stage' THEN 1
 ELSE 0
 END AS earlyflag,
CASE
 WHEN stage3 = 'Later Stage' THEN 1
 ELSE 0
 END AS laterflag,
CASE
 WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1
 ELSE 0
 END AS growthflag,
CASE
 WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1
 ELSE 0
 END AS transactionflag,
CASE
 WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' 
OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1
 ELSE 0
 END AS excludeflag
FROM round;

Sample output

The psql file was copied to an output text file using:

\COPY stageflags TO 'flag.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV; 

A sum column was added in excel which is the sum of the last 3 columns. Notice how all entries are 1. This is good.

Checking companybase table

The following psql code creates a new table called companybase1 that checks for Undisclosed companies and nonUS companies in the companybase table and also inserts a primary key column which is a concatenation of coname, statecode and datefirstinv.

CREATE TABLE companybase1 AS 
SELECT *,
concat(coname::text, statecode::text, datefirstinv::text) AS pkey,
CASE
 WHEN nationcode = 'US' THEN 1::int
 ELSE 0::int
 END AS alwaysusflag,
CASE
 WHEN coname = 'Undisclosed Company' THEN 1::int
 ELSE 0::int
 END AS undisclosedflag
FROM companybase;
\COPY companybase1 TO 'companybase1.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV;

Checking if pkey is valid

The following psql code creates a new table called companybasepkeyvalid that counts the number of times the pkey created in companybase1 occurs in companybase1. In this case there are 3 pkeys that are not valid for companies: Undisclosed Company, New York Digital Health LLC and Undisclosed Company.

CREATE TABLE companybasepkeyvalid AS
SELECT coname, pkey, COUNT(pkey)
FROM companybase1
GROUP BY coname, pkey;
--44771
\COPY companybasepkeyvalid TO 'companybasepkeyvalid.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV;

The Undisclosed Companies will be updated manually using the description to extract the company name. To remove the duplicate records of the company New York Digital Health LLC from the roundbase table run the following command:

DELETE FROM roundbase WHERE url = 'www.digitalhealthaccelerator.c';
--1

After the Undisclosed company names are updated, you will need to recreate the companybase and roundbase tables. Copy all the undisclosed companies to a text file and pull out the company names from the description.

DROP TABLE undisclosedcompanies
CREATE TABLE undisclosedcompanies AS
SELECT coname, description, url  FROM roundbase
WHERE coname = 'Undisclosed Company';
--37
\COPY undisclosedcompanies TO 'undisclosedcompanies.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV;

After manually pulling out the company names from the description from all 37 entries, I created the following commands to update the original roundbase table below. After the updates there are still 18 Undisclosed companies. Recheck to make sure there are no primary key issues again.

UPDATE roundbase SET coname = 'Standard Bariatrics' WHERE coname = 'Undisclosed Company' AND description = 'Standard 
Bariatrics is taking bariatric procedure and creating instruments';
--1
UPDATE roundbase SET coname = 'Jersey Watch' WHERE coname = 'Undisclosed Company' AND description = 'Jersey Watch provides 
organizers to professional.';
--1
UPDATE roundbase SET coname = 'CareSkore' WHERE coname = 'Undisclosed Company' AND description = 'CareSkore provides an 
evidence-based predictive systems.';
--1
UPDATE roundbase SET coname = 'Navistone' WHERE coname = 'Undisclosed Company' AND description = 'Navistone provides digital 
marketers a way to target the right web browsers';
--2
UPDATE roundbase SET coname = 'PolySync' WHERE coname = 'Undisclosed Company' AND description = 'Offers PolySync platform 
for driverless vehicle applications.';
--2
UPDATE roundbase SET coname = 'Platform9' WHERE coname = 'Undisclosed Company' AND description = 'Platform9 Systems Inc is a 
United States-based company.';
--2
UPDATE roundbase SET coname = 'Salamanca' WHERE coname = 'Undisclosed Company' AND description = 'Salamanca provides 
products and services.';
--1
UPDATE roundbase SET coname = 'Sourcegraph Inc.' WHERE coname = 'Undisclosed Company' AND description = 'Sourcegraph Inc. 
provides a code search engine for open source.';
--1
UPDATE roundbase SET coname = 'Sona' WHERE coname = 'Undisclosed Company' AND description = 'Sona provides products and 
services.';
--1
UPDATE roundbase SET coname = 'AlphaSource' WHERE coname = 'Undisclosed Company' AND description = 'AlphaSource in provides 
healthcare technology solutions.';
--2
UPDATE roundbase SET coname = 'Reflecktive' WHERE coname = 'Undisclosed Company' AND description = 'Reflecktive provides 
products and services.';
--1
UPDATE roundbase SET coname = 'SuperH' WHERE coname = 'Undisclosed Company' AND description = 'SuperH provides an SaaS 
platform for open source data';
--1
UPDATE roundbase SET coname = 'Islands Media Inc' WHERE coname = 'Undisclosed Company' AND description = 'Islands Media Inc 
is an IT company.';
--1
UPDATE roundbase SET coname = 'Envoy Commerce Corp.' WHERE coname = 'Undisclosed Company' AND description = 'Envoy Commerce 
Corp. provides online grocery delivery service.';
--1
UPDATE roundbase SET coname = 'Quilt' WHERE coname = 'Undisclosed Company' AND description = 'Quilt is a consumer-centric 
insurance company focused on renters insurances';
--1

Removing duplicates from ipos table

If you create the following two tables you'll notice that there are duplicate entries in the ipos table.

DROP TABLE ipospkey;
CREATE TABLE ipospkey AS
SELECT *,
concat(issuer::text, issuedate::text, statecode::text, principalamt::text) AS pkey
FROM ipos;
--10440
DROP TABLE iposcount;
CREATE TABLE iposcount AS
SELECT issuer, issuedate, statecode, COUNT(pkey)
FROM ipospkey
GROUP BY issuer, issuedate, statecode;
--9491