VentureXpert Data

{{Project
|Has project output=Data,How-to
|Has sponsor=McNair Center
|Has title=VentureXpert Data
|Has owner=Augi Liebster,
}}
[[Augi Liebster (Work Log)]]

The successors to this project include:
*[[VCDB24]], which is the most recent iteration.
*[[VCDB23]]
*[[VCDB20Q3]]
*[[VCDB20H1]]
*[[VCDB4]]

==Relevant Former Projects==

==Location==
My scripts for SDC pulls are located in the E drive in the location:
 E:\McNair\Projects\VentureXpertDatabase\ScriptsForSDCExtract

My successfully pulled and normalized files are stored in the location:
 E:\McNair\Projects\VentureXpertDatabase\ExtractedDataQ2

My scripts for loading tables and data are in:
 E:\McNair\Projects\VentureXpertDatabase\vcdb3\LoadingScripts

There are a variety of SQL files in there with self-explanatory names. The file that has all of the loading scripts is called LoadingScriptsV1. The folder vcdb2 is there for reference, to see what people had done before. ExtractedData is there because I pulled data before July 1st, and Ed asked me to repull the data.

==Goal==
 
   datefirstinv date,
   rounddate date,
   disclosedamt real,
   fundname varchar(255)
  );

 \COPY mas FROM 'MAUSTargetComp100pc1985-July2018-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --119432

I encountered no problems loading in this data.
 DROP TABLE longdescription;
 CREATE TABLE longdescription(
   coname varchar(255), --the column name was missing here; coname is assumed from the matching tables
   statecode varchar(10),
   fundingdate date, --date co received first inv
   codescription varchar(10000) --long description
 );

 \COPY longdescription FROM 'PortCoLongDesc-Ready-normal-fixed.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --48037

I encountered no problems loading this data.

==Cleaning Companybase, Fundbase, Firmbase, and BranchOffice==

===Cleaning Company===
The primary key for portcos will be coname, datefirstinv, and statecode. Before checking whether this is a valid primary key, remove the undisclosed companies. I will explain the second part of the query, concerning New York Digital Health, later.

 DROP TABLE companybasecore;
 CREATE TABLE companybasecore AS
 SELECT *
 FROM Companybase WHERE nationcode = 'US' AND coname != 'Undisclosed Company'
 AND NOT (coname='New York Digital Health LLC' AND statecode='NY' AND datefirstinv='2015-08-13' AND updateddate='2015-10-20');
 --48001

 SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore) AS T;
 --48001

Since the count of the table and the count of the distinct primary key are equal, you know that the primary key is valid. In the initial cleaning of the table, I first removed only the undisclosed companies. That table had 48002 rows. I then ran the DISTINCT query above and found that there are 48001 distinct rows with the coname, datefirstinv, statecode primary key. Thus there must be two rows that share a primary key. I found this key using the following query:

 SELECT coname, datefirstinv, statecode FROM companybase GROUP BY coname, datefirstinv, statecode HAVING COUNT(*) > 1;

The company named 'New York Digital Health LLC' came up as the company causing the problem. I queried to find the two rows that list this company name in companybase and chose to keep the row that had the earlier updated date. It is good practice to avoid deleting rows from tables when possible, so I added the filter as a WHERE clause to exclude one of the New York Digital listings.
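A minimal sketch of that lookup, using only columns that already appear in the filter above:

 SELECT coname, statecode, datefirstinv, updateddate
 FROM companybase
 WHERE coname='New York Digital Health LLC';
 --should return the two conflicting rows; the WHERE clause above keeps the one that was not updated on 2015-10-20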

===Cleaning Fundbase===
The primary key for funds will be only the fundname. First get rid of all of the undisclosed funds.

 DROP TABLE fundbasenound;
 CREATE TABLE fundbasenound AS
 SELECT DISTINCT * FROM fundbase WHERE fundname NOT LIKE '%Undisclosed Fund%';
 --28886

 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbasenound)a;
 --28833

As you can see, fundbase still has rows that share fundnames. If you are wondering why the DISTINCT in the first query did not eliminate these, it is because that DISTINCT applies to the whole row, not to individual fundnames. Thus, only completely duplicate rows are eliminated in the first query. I chose to keep the funds that have the earlier last investment date.

 DROP TABLE fundups;
 CREATE TABLE fundups AS SELECT
 fundname, max(lastinvdate) AS lastinvdate FROM fundbasenound GROUP BY fundname HAVING COUNT(*)>1;
 --53

 DROP TABLE fundbasecore;
 CREATE TABLE fundbasecore AS
 SELECT A.* FROM fundbasenound AS A LEFT JOIN fundups AS B ON A.fundname=B.fundname AND A.lastinvdate=B.lastinvdate WHERE B.fundname IS NULL AND B.lastinvdate IS NULL;
 --28833

Since the count of fundbasecore is the same as the number of distinct fund names, we know that the fundbasecore table is clean. In the first query I find the duplicated fund names and take the row with the greater last investment date. I then match this table back to fundbasenound and keep all the rows for which there is no corresponding fund in fundups based on fund name and date of last investment. This allows the funds with the earlier date of last investment to be chosen.
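A quick way to verify that claim is to compare the two counts directly (a small check sketch; both should return 28833):

 SELECT COUNT(*) FROM fundbasecore;
 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbasecore)a;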

===Cleaning Firmbase===
The primary key for firms will be firm name. First I got rid of all undisclosed firms. I also filtered out two firms that have identical firm names and founding dates. This is because I use founding dates to filter out duplicate firm names: if there are two rows that have the same firm name and founding date, they will not be filtered out by the third query below. Thus, I chose to filter those firms out completely.

 DROP TABLE firmbasenound;
 CREATE TABLE firmbasenound AS
 SELECT DISTINCT * FROM firmbase WHERE firmname NOT LIKE '%Undisclosed Firm%' AND firmname NOT LIKE '%Amundi%' AND firmname NOT LIKE '%Schroder Adveq Management%';
 --15452

 SELECT COUNT(*) FROM(SELECT DISTINCT firmname FROM firmbasenound)a;
 --15437

Since these counts are not equal, we will have to clean the table further, using the same method as before.

 DROP TABLE firmdups;
 CREATE TABLE firmdups AS SELECT
 firmname, max(foundingdate) as foundingdate FROM firmbasenound GROUP BY firmname HAVING COUNT(*)>1;
 --15

 DROP TABLE firmbasecore;
 CREATE TABLE firmbasecore AS
 SELECT A.* FROM firmbasenound AS A LEFT JOIN firmdups AS B ON A.firmname=B.firmname AND A.foundingdate=B.foundingdate WHERE B.firmname IS NULL AND B.foundingdate IS NULL;
 --15437

Since the count of firmbasecore and the DISTINCT query are the same, the firm table is now clean.

===Cleaning Branch Offices===
When cleaning the branch offices, I had to remove all duplicates in the table. This is because the table is so sparse that often the only data in a row would be the firm name the branch was associated with. Thus, I couldn't filter based on dates as I had been doing previously for firms and funds. The primary key is firm name.

 DROP TABLE bonound;
 CREATE TABLE bonound AS
 SELECT *, CASE WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int ELSE 0::int END AS undisclosedflag
 FROM branchoffices;
 --10353

 SELECT COUNT(*) FROM(SELECT DISTINCT firmname FROM bonound)a;
 --10042

Since these counts aren't the same, we will have to work a little more to clean the table. As stated above, I did this by excluding the firm names that were duplicated.

 DROP TABLE branchofficecore;
 CREATE TABLE branchofficecore AS
 SELECT A.* FROM bonound AS A JOIN (
 SELECT bonound.firmname, COUNT(*) FROM bonound GROUP BY firmname
 HAVING COUNT(*) =1
 ) AS B
 ON A.firmname=B.firmname WHERE undisclosedflag=0;
 --10032

 SELECT COUNT(*) FROM (SELECT DISTINCT firmname FROM branchofficecore)a;
 --10032

Since these counts are the same, we are good to go. The count is 10 lower because we completely removed 10 firm names from the listing by throwing out the duplicates.
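If you want to see exactly which firm names were dropped in this step, a sketch like the following lists the disclosed firm names that appear more than once in bonound:

 SELECT firmname, COUNT(*) FROM bonound
 WHERE undisclosedflag=0
 GROUP BY firmname HAVING COUNT(*) > 1;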
  
 
==Instructions on Matching PortCos to Issuers and M&As From Ed==
===Company Standardizing===

Get portco keys

===MA Cleaning and Matching===
First remove all of the duplicates in the MA data. Do this by running aggregate queries on every column except for the primary key:

 DROP TABLE MANoDups;
 CREATE TABLE MANoDups AS
 SELECT targetname, targetstate, announceddate, MIN(effectivedate) AS effectivedate, MIN(acquirorname) AS acquirorname, MIN(acquirorstate) AS acquirorstate,
 MAX(transactionamt) AS transactionamt, MAX(enterpriseval) AS enterpriseval, MIN(acquirorstatus) AS acquirorstatus
 FROM mas
 GROUP BY targetname, targetstate, announceddate ORDER BY targetname, targetstate, announceddate;
 --119374

 SELECT COUNT(*) FROM(SELECT DISTINCT targetname, targetstate, announceddate FROM manodups)a;
 --119374

Since these counts are equivalent, the data set is clean. Then get all the primary keys from the table and copy the distinct target names into a text file.

 DROP TABLE makey;
 CREATE TABLE makey AS
 SELECT targetname, targetstate, announceddate
 FROM manodups;
 --119374

 \COPY (SELECT DISTINCT targetname FROM makey) TO 'DistinctTargetName.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;
 --117212

After running this list of distinct target names through the matcher, put the standardized MA list into the database.

 DROP TABLE MaStd;
 CREATE TABLE MaStd (
   targetnamestd varchar(255),
   targetname varchar(255),
   norm varchar(100),
   x1 varchar(255),
   x2 varchar(255)
 );

 \COPY mastd FROM 'DistinctTargetName.txt-DistinctTargetName.txt.matched' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --117212

Then match the list of standardized names back to the makey table to get a table with standardized keys and primary keys. This will be your input for matching against portcos.

 DROP TABLE makeysstd;
 CREATE TABLE makeysstd AS
 SELECT B.targetnamestd, A.*
 FROM makey AS A
 JOIN mastd AS B ON A.targetname=B.targetname;
 --119374

 \COPY makeysstd TO 'MAMatchInput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --119374

Use this text file to match against the PortCoMatchInput. Your job will be to determine whether the matches between the MAs and PortCos are true matches. The techniques that I used are described in the section below.
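As a sketch of the matching step, the matcher is called from the command line with -file1 and -file2 arguments; something along these lines, using the file names from this section, should work (verify the exact options before running):

 perl Matcher.pl -file1="PortCoMatchInput.txt" -file2="MAMatchInput.txt"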
===IPO Cleaning and Matching===
The process is the same for IPOs.

 DROP TABLE iponodups;
 CREATE TABLE iponodups AS
 SELECT issuer, statecode, issuedate, MAX(principalamt) AS principalamt, MAX(proceedsamt) AS proceedsamt, MIN(naiccode) as naicode, MIN(zipcode) AS zipcode, MIN(status) AS status,
 MIN(foundeddate) AS foundeddate
 FROM ipos GROUP BY issuer, statecode, issuedate ORDER BY issuer, statecode, issuedate;
 --11149

 SELECT COUNT(*) FROM(SELECT DISTINCT issuer, statecode, issuedate FROM iponodups)a;
 --11149

 DROP TABLE ipokeys;
 CREATE TABLE ipokeys AS
 SELECT issuer, statecode, issuedate
 FROM iponodups;
 --11149

 \COPY (SELECT DISTINCT issuer FROM ipokeys) TO 'IPODistinctIssuer.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --10803

 DROP TABLE ipokeysstd;
 CREATE TABLE ipokeysstd (
   issuerstd varchar(255),
   issuer varchar(255),
   norm varchar(100),
   x1 varchar(255),
   x2 varchar(255)
 );

 \COPY ipokeysstd FROM 'IPODistinctIssuer.txt-IPODistinctIssuer.txt.matched' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --10803

 DROP TABLE ipostd;
 CREATE TABLE ipostd AS
 SELECT B.issuerstd, A.*
 FROM ipokeys AS A
 JOIN ipokeysstd AS B ON A.issuer=B.issuer;
 --11149

 \COPY ipostd TO 'IPOMatchInput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --11149

As with MA, match this file against the PortCoMatchInput file without mode 2. Then manually check the matches using the techniques described below.
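The corresponding matcher call for the IPO side would look something like this (again a sketch, not the exact invocation):

 perl Matcher.pl -file1="PortCoMatchInput.txt" -file2="IPOMatchInput.txt"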
I generally use MAX for amounts and MIN for dates. I also chose to use MIN on text strings.

==Cleaning IPO and MA Data==
It is important to follow Ed's direction of cleaning the data using aggregate functions before putting the data into excel. This will save you a lot of unnecessary manual checking. When ready, paste the data you have into an excel file. In that excel file, I made three columns: one to check whether the state codes were equivalent, one checking whether the date of first investment was 3 years before the MA or IPO, and one checking whether both of these conditions were satisfied for each company. I did this using simple if statements. This process is manual checking and filtering to see whether matches are correct or not, and it is thus extremely subjective and tedious. First, I went through and checked the companies that did not have equivalent state codes. If the company was one that I knew, or the name was unique to the point that I did not believe the same name would appear in another state, I marked the state codes as equivalent. I did the same for the date of first investment vs the MA/IPO date. Then I removed all duplicates that had the marking Warning Multiple Matches, and the data sheets were clean.

==Process For Creating the PortCoExits Table==
===MA Process===
First we must load the clean, manually checked tables back into the database.

 DROP TABLE MAClean;
 CREATE TABLE MAClean (
   conamestd varchar(255),
   targetnamestd varchar(255),
   method varchar(100),
   x1 varchar(255),
   coname varchar(255),
   statecode varchar(10),
   datefirstinv date,
   x2 varchar(255),
   targetname varchar(255),
   targetstate varchar(10),
   announceddate date
 );

 \COPY MAClean FROM 'MAClean.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --7205

 SELECT COUNT(*) FROM(SELECT DISTINCT targetname, targetstate, announceddate FROM MAClean)a;
 --7188

As you can see there are still duplicate primary keys in the table. To get rid of these I wrote a query that chooses primary keys that occur only once and matches them against MANoDups. That way you will have unique primary keys by construction.

 DROP TABLE MACleanNoDups;
 CREATE TABLE MACleanNoDups AS
 SELECT A.*, effectivedate, transactionamt, enterpriseval, acquirorstatus
 FROM MAClean AS A
 JOIN (
 SELECT targetname, targetstate, announceddate, COUNT(*) FROM MAClean
 GROUP BY targetname, targetstate, announceddate HAVING COUNT(*)=1
 ) AS B
 ON A.targetname=B.targetname AND A.targetstate=B.targetstate AND A.announceddate=B.announceddate
 LEFT JOIN MANoDups AS C ON A.targetnamestd=C.targetname AND A.announceddate=C.announceddate;

 SELECT COUNT(*) FROM(SELECT DISTINCT coname, statecode, datefirstinv FROM MACleanNoDups)a;
 --7171

Thus the portco primary key is unique in the table. We will use this later.
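As a sanity check, the distinct-key count can be compared against the table's total row count (a small sketch; if the key really is unique, both numbers should be 7171):

 SELECT COUNT(*) FROM MACleanNoDups;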
Now do the same for the IPOs.

===IPO Process===
 DROP TABLE IPOClean;
 CREATE TABLE IPOClean (
   conamestd varchar(255),
   issuernamestd varchar(255),
   method varchar(100),
   x1 varchar(255),
   coname varchar(255),
   statecode varchar(10),
   datefirstinv date,
   x2 varchar(255),
   issuername varchar(255),
   issuerstate varchar(10),
   issuedate date
 );

 \COPY IPOClean FROM 'IPOClean.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --2146

 SELECT COUNT(*) FROM(SELECT DISTINCT issuername, issuerstate, issuedate FROM IPOClean)a;
 --2141

As with the MA process, there were duplicates in the clean IPO table. Get rid of these using the same process as with MAs. Only choose the primary keys that occur once and join these to the IPONoDups table.

 DROP TABLE IPOCleanNoDups;
 CREATE TABLE IPOCleanNoDups AS
 SELECT A.*, principalamt, proceedsamt, naicode as naics, zipcode, status, foundeddate
 FROM IPOClean AS A
 JOIN (
 SELECT issuername, issuerstate, issuedate, COUNT(*) FROM IPOClean
 GROUP BY issuername, issuerstate, issuedate HAVING COUNT(*)=1
 ) AS B
 ON A.issuername=B.issuername AND A.issuerstate=B.issuerstate AND A.issuedate=B.issuedate
 LEFT JOIN IPONoDups AS C ON A.issuername=C.issuer AND A.issuerstate=C.statecode AND A.issuedate=C.issuedate;
 --2136

 SELECT COUNT(*) FROM(SELECT DISTINCT coname, statecode, datefirstinv FROM IPOCleanNoDups)a;
 --2136

Now the duplicates are out of the MAClean and IPOClean data and we can start to construct the ExitKeys table.
 
 CREATE TABLE IPOMAForReview AS
 SELECT A.*, B.targetname, B.targetstate, B.announceddate
 FROM IPOCleanNoDups AS A
 JOIN MACleanNoDups AS B ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv;
 --92

I then pulled out the IPOs that were only IPOs and the MAs that were only MAs. I also added in a column that indicates whether a company underwent an IPO or an MA.

 DROP TABLE IPONoConflict;
 CREATE TABLE IPONoConflict AS
 SELECT A.*, 1::int as IPOvsMA
 FROM IPOCleanNoDups AS A
 LEFT JOIN MACleanNoDups AS B
 ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv
 WHERE B.statecode IS NULL AND B.coname IS NULL AND B.datefirstinv IS NULL;
 --2044

 DROP TABLE MANoConflict;
 CREATE TABLE MANoConflict AS
 SELECT A.*, 0::int as IPOvsMA
 FROM MACleanNoDups AS A
 LEFT JOIN IPOCleanNoDups AS B
 ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv
 WHERE B.statecode IS NULL AND B.coname IS NULL AND B.datefirstinv IS NULL;
 --7079

Since 2136-92=2044 and 7171-92=7079, we know that the duplicate companies were extracted successfully.

I then wrote a query to check whether the IPO issue date or the announced date of the MA was earlier, and used that to indicate whether I chose the company to have undergone an MA or an IPO in the column IPOvsMA. A 0 in the column represents an MA being chosen and a 1 represents an IPO being chosen.
 
 DROP TABLE MASelected;
 CREATE TABLE MASelected AS
 SELECT coname, statecode, datefirstinv,
 targetname, targetstate, announceddate,
 0::int as IPOvsMA
 FROM IPOMAForReview
 WHERE issuedate >= announceddate;
 --25

 DROP TABLE IPOSelected;
 CREATE TABLE IPOSelected AS
 SELECT coname, statecode, datefirstinv,
 issuername, issuerstate, issuedate,
 1::int as IPOvsMA
 FROM IPOMAForReview
 WHERE issuedate < announceddate;
 --67

I then made the ExitKeys table using the portco primary key and the IPOvsMA indicator column.

 DROP TABLE ExitKeys;
 CREATE TABLE ExitKeys AS
 SELECT coname, statecode, datefirstinv, ipovsma FROM IPONoConflict
 UNION SELECT coname, statecode, datefirstinv, ipovsma FROM IPOSelected
 UNION SELECT coname, statecode, datefirstinv, ipovsma FROM MANoConflict
 UNION SELECT coname, statecode, datefirstinv, ipovsma FROM MASelected;
 --9215

==Create the PortCoExit And PortCoAliveDead Tables==
From consulting with Ed and the VC Database Rebuild wiki, I decided to make the PortCoExit table with ipovsma, exit, exitdate, exityear, and exitvaluem columns. I use the ipovsma column to add in data. It is very important that you have constructed this column.
 DROP TABLE PortCoExit;
 CREATE TABLE PortCoExit AS
 SELECT A.coname, A.statecode, A.datefirstinv, A.datelastinv, A.city, B.ipovsma,
 CASE WHEN B.ipovsma IS NOT NULL THEN 1::int ELSE 0::int END AS Exit,
 CASE WHEN B.ipovsma=1 THEN C.proceedsamt::numeric WHEN ipovsma=0 THEN D.transactionamt::numeric ELSE NULL::numeric END AS exitvaluem,
 CASE WHEN B.ipovsma=1 THEN C.issuedate WHEN ipovsma=0 THEN D.announceddate ELSE NULL::date END AS exitdate,
 CASE WHEN B.ipovsma=1 THEN extract(year from C.issuedate) WHEN ipovsma=0 THEN extract(year from D.announceddate) ELSE NULL::int END AS exityear
 FROM companybasecore AS A
 LEFT JOIN ExitKeys AS B ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv
 LEFT JOIN IPOCleanNoDups AS C ON A.coname=C.coname AND A.statecode=C.statecode AND A.datefirstinv=C.datefirstinv
 LEFT JOIN MACleanNoDups AS D ON A.coname=D.coname AND A.statecode=D.statecode AND A.datefirstinv=D.datefirstinv;
 --48001

I then used this table to build one that has information as to whether a company was dead or alive. I found this information by checking whether a company had undergone an IPO or MA, which indicates that the company is dead. Alternatively, if the company's date of last investment was more than 5 years ago, I marked the company as dead.

 DROP TABLE PortCoAliveDead;
 CREATE TABLE PortCoAliveDead AS
 SELECT *,
 datefirstinv as alivedate, extract(year from datefirstinv) as aliveyear,
 CASE WHEN exitdate IS NOT NULL then exitdate
 WHEN exitdate IS NULL AND (datelastinv + INTERVAL '5 year') < '7/1/2018' THEN (datelastinv + INTERVAL '5 year')
 ELSE NULL::date END AS deaddate,
 CASE WHEN exitdate IS NOT NULL then exityear
 WHEN exitdate IS NULL AND (datelastinv + INTERVAL '5 year') < '7/1/2018' THEN extract(year from (datelastinv + INTERVAL '5 year'))
 ELSE NULL::int END AS deadyear
 FROM PortCoExit;
 --48001

==GeoCoding Companies, Firms, and Branch Offices==
A helpful page here is the [[Geocode.py]] page, which explains how to use the geocoding script. You will have to tweak the Geocode script when geocoding, as each of these tables has a different primary key. It is vital that you include the primary keys in the file you input and output from the Geocoding script. Without these, you will not be able to join the latitudes and longitudes back to the firm, branch office, or company base tables.

Geocoding costs money since we are using the Google Maps API. The process doesn't cost much, but in order to save money I tried to salvage as much of the preexisting geocode information as I could find.

===Companies===
I found the table of old companies with latitudes and longitudes in vcdb2 and loaded these into vcdb3.

 DROP TABLE oldgeocords;
 CREATE TABLE oldgeocords (
   coname varchar(255),
   statecode varchar(10),
   datefirstinv date,
   ivestedk real,
   city varchar(255),
   addr1 varchar(255),
   addr2 varchar(100),
   latitude numeric,
   longitude numeric
   );

 \COPY oldgeocords FROM 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --44740

The API will occasionally give erroneous latitude and longitude readings. In order to catch only the good ones, I found the latitude and longitude lines that encompass the mainland US and created an exclude flag to make sure companies were in this box. I then created flags to include companies in Puerto Rico, Hawaii, and Alaska. Companies in these places often had wrong latitude and longitude readings of 44.93, 7.54, so I ran a query making sure that these weren't listed.
 DROP TABLE geoallcoords;
 CREATE TABLE geoallcoords AS
 SELECT *, CASE
 WHEN longitude < -125 OR longitude > -66 OR latitude < 24 OR latitude > 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE
 0::int END AS excludeflag FROM oldgeocords;
 --44740

 DROP TABLE geoallcoords1;
 CREATE TABLE geoallcoords1 AS SELECT
 *, CASE WHEN statecode='PR' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as prflag,
 CASE WHEN statecode='HI' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as hiflag,
 CASE WHEN statecode='AK' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as akflag
 FROM geoallcoords;
 --44740

I then included only companies that were either in the mainland US, Hawaii, Alaska, or Puerto Rico.

 DROP TABLE goodgeoold;
 CREATE TABLE goodgeoold AS SELECT
 A.*, B.latitude, B.longitude, B.prflag, B.excludeflag, B.hiflag, B.akflag FROM companybasecore AS A LEFT JOIN geoallcoords1 AS B ON
 A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv WHERE excludeflag=0 or prflag=1 or hiflag=1 or akflag=1;
 --38498

I then found the remaining companies that needed to be geocoded. Only companies that have addresses listed can be accurately geocoded. If we attempt to geocode based on city, the location returned will simply be the center of the city. Thus, I chose the companies that we did not already have listings for and that had a valid address.

 DROP TABLE remaininggeo;
 CREATE TABLE remaininggeo AS SELECT A.coname, A.statecode, A.datefirstinv, A.addr1, A.addr2, A.city, A.zip FROM companybasecore AS A LEFT JOIN goodgeoold AS B ON A.coname=B.coname
 AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv
 WHERE B.coname IS NULL AND A.addr1 IS NOT NULL;
 --5955

 \COPY remaininggeo TO 'RemainingGeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --5955

I copied this table into excel to concatenate the address, city, state, and zipcode columns into one column. This can and should be done in SQL, but I was not aware of that at the time. I then ran remaininggeo through the Geocode script with the columns coname, statecode, datefirstinv, and address in the input file.
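For reference, the concatenation can be done directly in SQL when exporting, for example with concat_ws, which skips NULL parts such as a missing addr2 (a sketch using the remaininggeo columns above; the output file name is illustrative):

 \COPY (SELECT coname, statecode, datefirstinv, concat_ws(', ', addr1, addr2, city, statecode, zip) AS address FROM remaininggeo) TO 'RemainingGeoWithAddress.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV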
 DROP TABLE remaining;
 CREATE TABLE remaining (
   coname varchar(255),
   statecode varchar(10),
   datefirstinv date,
   latitude numeric,
   longitude numeric
   );

 \COPY remaining FROM 'RemainingLatLong.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --5955

I then ran the same geographical checks on the newly geocoded companies and found all of the good geocodes.

 DROP TABLE geoallcoords2;
 CREATE TABLE geoallcoords2 AS
 SELECT *, CASE
 WHEN longitude < -125 OR longitude > -66 OR latitude < 24 OR latitude > 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE
 0::int END AS excludeflag FROM remaining;
 --5955

 DROP TABLE geoallcoords3;
 CREATE TABLE geoallcoords3 AS
 SELECT *, CASE WHEN statecode='PR' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as prflag,
 CASE WHEN statecode='HI' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as hiflag,
 CASE WHEN statecode='AK' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as akflag
 FROM geoallcoords2;
 --5955

 DROP TABLE goodgeonew;
 CREATE TABLE goodgeonew AS
 SELECT A.*, B.latitude, B.longitude, B.prflag, B.excludeflag, B.hiflag, B.akflag FROM companybasecore AS A LEFT JOIN geoallcoords3 AS B ON
 A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv WHERE excludeflag=0 or prflag=1 or hiflag=1 or akflag=1;
 --5913

I then combined the old and new geocodes and matched them back to the company base table to get a geo table for companies.

 DROP TABLE geocodesportco;
 CREATE TABLE geocodesportco AS
 SELECT * FROM goodgeonew
 UNION
 SELECT * FROM goodgeoold;
 --44411

 DROP TABLE portcogeo;
 CREATE TABLE portcogeo AS SELECT
 A.*, B.latitude, B.longitude FROM companybasecore AS A LEFT JOIN Geocodesportco AS B ON A.coname=B.coname AND A.datefirstinv=B.datefirstinv AND A.statecode=B.statecode;
 --48001

===Firms===
This process is largely the same as for companies. I found old firms that had already been geocoded and checked for accuracy.

 DROP TABLE oldfirmcoords;
 CREATE TABLE oldfirmcoords (
   firmname varchar(255),
   latitude numeric,
   longitude numeric
   );

 \COPY oldfirmcoords FROM 'FirmCoords.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --5556

 DROP TABLE firmoldfilter;
 CREATE TABLE firmoldfilter AS
 SELECT *, CASE
 WHEN longitude < -125 OR longitude > -66 OR latitude < 24 OR latitude > 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE
 0::int END AS excludeflag FROM oldfirmcoords;
 --5556

Since oldfirmcoords does not have state codes, we have to find a way to include state codes to add in companies based in Puerto Rico, Hawaii, and Alaska. I did this by matching the firmoldfilter table back to the firm base table. I used the coalesce function because we wanted to exclude companies that we had not geocoded due to faulty addresses.

 DROP TABLE firmcoordsmatch1;
 CREATE TABLE firmcoordsmatch1 AS SELECT
 A.firmname, A.state, B.latitude, B.longitude, COALESCE(B.excludeflag, 1) AS excludeflag FROM firmbasecore AS A LEFT JOIN firmoldfilter AS B ON A.firmname=B.firmname;
 --15437

Then the process of tagging the PR, HI, and AK companies and including only correctly tagged companies is the same as for companies.

 DROP TABLE firmcoordsexternal;
 CREATE TABLE firmcoordsexternal AS
 SELECT *, CASE WHEN state='PR' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as prflag,
 CASE WHEN state='HI' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as hiflag,
 CASE WHEN state='AK' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as akflag
 FROM firmcoordsmatch1;
 --15437

 DROP TABLE goodfirmgeoold;
 CREATE TABLE goodfirmgeoold AS SELECT
 A.*, B.latitude, B.longitude, B.prflag, B.excludeflag, B.hiflag, B.akflag FROM firmcoreonedupremoved AS A LEFT JOIN firmcoordsexternal AS B ON A.firmname=B.firmname
 WHERE excludeflag=0 or prflag=1 or hiflag=1 or akflag=1;
 --5346

Find the remaining firms and run the geocode script on these firms.

 DROP TABLE remainingfirm;
 CREATE TABLE remainingfirm AS SELECT A.firmname, A.addr1, A.addr2, A.city, A.state, A.zip FROM firmcoreonedupremoved AS A LEFT JOIN goodfirmgeoold AS B ON A.firmname=B.firmname
 WHERE B.firmname IS NULL AND A.addr1 IS NOT NULL AND A.msacode!='9999';
 --706

 \COPY remainingfirm TO 'FirmGeoRemaining.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --706

 DROP TABLE firmremainingcoords;
 CREATE TABLE firmremainingcoords(
   firmname varchar(255),
   latitude numeric,
   longitude numeric
   );

 \COPY firmremainingcoords FROM 'FirmRemainingCoords.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --706

Follow the same filtering process as above to get the good geocodes.

 DROP TABLE firmnewfilter;
 CREATE TABLE firmnewfilter AS
 SELECT *, CASE
 WHEN longitude < -125 OR longitude > -66 OR latitude < 24 OR latitude > 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE
 0::int END AS excludeflag FROM firmremainingcoords;
 --706

 DROP TABLE firmcoordsmatch2;
 CREATE TABLE firmcoordsmatch2 AS SELECT
 A.firmname, A.state, B.latitude, B.longitude, COALESCE(B.excludeflag, 1) AS excludeflag FROM firmcoreonedupremoved AS A LEFT JOIN firmnewfilter AS B ON A.firmname=B.firmname;
 --15437

 DROP TABLE firmcoordsexternalremaining;
 CREATE TABLE firmcoordsexternalremaining AS
 SELECT *, CASE WHEN state='PR' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as prflag,
 CASE WHEN state='HI' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as hiflag,
 CASE WHEN state='AK' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as akflag
 FROM firmcoordsmatch2;
 --15437

 DROP TABLE goodfirmgeonew;
 CREATE TABLE goodfirmgeonew AS SELECT A.*, B.latitude, B.longitude, B.prflag, B.excludeflag, B.hiflag, B.akflag FROM firmcoreonedupremoved AS A LEFT JOIN firmcoordsexternalremaining AS B
 ON A.firmname=B.firmname
 WHERE excludeflag=0 or prflag=1 or hiflag=1 or akflag=1;
 --703
  
Combine the old and new geocoded firms and match them to firm base to get a firm geo table.

 DROP TABLE firmgeocoords;
 CREATE TABLE firmgeocoords AS
 SELECT * FROM goodfirmgeonew
 UNION
 SELECT * FROM goodfirmgeoold;
 --6049

 DROP TABLE firmgeocore;
 CREATE TABLE firmgeocore AS
 SELECT A.*, B.latitude, B.longitude FROM firmbasecore AS A LEFT JOIN firmgeocoords AS B ON A.firmname=B.firmname;
 --15437

===Branch Offices===
I did not use old branch office data because I could not find it anywhere in the old data set. I have since found old data in the table firmbasecoords in vcdb2.

First copy all of the needed data out of the database to do the geocoding.

 \COPY (SELECT A.firmname, A.boaddr1, A.boaddr2, A.bocity, A.bostate, A.bozip FROM bonound AS A WHERE A.boaddr1 IS NOT NULL) TO 'BranchOffices.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --2046

Then load the data into the database and follow the same filtering process as above.

 DROP TABLE bogeo;
 CREATE TABLE bogeo (
   firmname varchar(255),
   latitude numeric,
   longitude numeric
   );

 \COPY bogeo FROM 'BranchOfficesGeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --2046

 DROP TABLE bogeo1;
 CREATE TABLE bogeo1 AS SELECT *, CASE
 WHEN longitude < -125 OR longitude > -66 OR latitude < 24 OR latitude > 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE
 0::int END AS excludeflag FROM bogeo;
 --2046

 DROP TABLE bomatchgeo;
 CREATE TABLE bomatchgeo AS
 SELECT A.*, B.latitude, B.longitude, COALESCE(B.excludeflag, 1) AS excludeflag FROM branchofficecore AS A LEFT JOIN bogeo1 AS B ON A.firmname=B.firmname;
 --10032

 DROP TABLE bogeo2;
 CREATE TABLE bogeo2 AS
 SELECT *, CASE WHEN bostate='PR' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as prflag,
 CASE WHEN bostate='HI' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as hiflag,
 CASE WHEN bostate='AK' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as akflag
 FROM bomatchgeo;
 --10032

Match the correctly geocoded branch offices back to firm base to get the final table.

 DROP TABLE bogeocore1;
 CREATE TABLE bogeocore1 AS
 SELECT * FROM bogeo2 WHERE excludeflag=0 or prflag=1 or hiflag=1 or akflag=1;
 --1161

 DROP TABLE firmbogeo;
 CREATE TABLE firmbogeo AS
 SELECT A.*, B.latitude AS BOLatitude, B.longitude AS BOLongitude FROM firmgeocore AS A LEFT JOIN bogeocore1 AS B ON A.firmname=B.firmname;
 --15437

==Creating People Tables==
We pulled data on executives in both portcos and funds. I describe the process below. If any of the explanations don't make sense, I also describe most tables in the section called Marcos's Code.
===Company People===
 DROP TABLE titlelookup;
 CREATE TABLE titlelookup(
   fulltitle varchar(150),
   charman int,
   ceo int,
   cfo int,
   coo int,
   cio int,
   cto int,
   otherclvl int,
   boardmember int,
   president int,
   vp int,
   founder int,
   director int
 );

 \COPY titlelookup FROM 'Important Titles in Women2017 dataset.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --628

This table lists various titles one can have and identifies where they fall under traditional executive titles.

 DROP TABLE copeople;
 CREATE TABLE copeople(
   datefirstinv date,
   cname varchar(150),
   statecode varchar(2),
   prefix varchar(5),
   firstname varchar(50),
   lastname varchar(50),
   jobtitle varchar(150),
   nonmanaging varchar(1),
   prevpos varchar(255)
 );

 \COPY copeople FROM 'Executives-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --194359

This table gets various executives from portcos. This is loaded from SDC. Next we have to identify which traditional executive level job the listed job title corresponds to. It also identifies whether a prefix identifies an executive as male or female. I made a stupid mistake of writing cname instead of coname when loading in the data. If you want to save yourself work, write coname.
 
  
  SELECT COUNT(*) FROM(SELECT issuername, issuerstate, issuedate FROM SelectedIPOMA)a;
+
DROP TABLE copeoplebase;
  --2116
+
CREATE TABLE copeoplebase AS
 +
  SELECT copeople.*,
 +
CASE WHEN prefix='Ms' THEN 1::int
 +
WHEN prefix='Mr' THEN 0::int
 +
ELSE Null::int END AS titlefemale,
 +
CASE WHEN prefix='Ms' THEN 0::int
 +
WHEN prefix='Mr' THEN 1::int
 +
ELSE Null::int END AS titlemale,
 +
CASE WHEN prefix='Dr' THEN 1::int
 +
ELSE 0::int END AS doctor,
 +
CASE WHEN prefix IS NULL THEN 0::int
 +
ELSE 1::int END AS hastitle,
 +
CASE WHEN prefix IS NULL AND firstname IS NULL AND lastname IS NULL THEN 0::int
 +
ELSE 1::int END AS hasperson,
 +
CASE WHEN fulltitle IS NOT NULL THEN 1::int ELSE 0::int END AS hastitlelookup,
 +
CASE WHEN charman IS NOT NULL THEN charman ELSE 0::int END AS chairman,
 +
CASE WHEN ceo IS NOT NULL THEN ceo ELSE 0::int END AS ceo,
 +
CASE WHEN cfo IS NOT NULL THEN cfo ELSE 0::int END AS cfo,
 +
CASE WHEN coo IS NOT NULL THEN coo ELSE 0::int END AS coo,
 +
CASE WHEN cio IS NOT NULL THEN cio ELSE 0::int END AS cio,
 +
CASE WHEN cto IS NOT NULL THEN cto ELSE 0::int END AS cto,
 +
CASE WHEN otherclvl IS NOT NULL THEN otherclvl ELSE 0::int END AS otherclvl,
 +
CASE WHEN boardmember IS NOT NULL THEN boardmember ELSE 0::int END AS boardmember,
 +
CASE WHEN president IS NOT NULL THEN president ELSE 0::int END AS president,
 +
CASE WHEN vp IS NOT NULL THEN vp ELSE 0::int END AS vp,
 +
CASE WHEN founder IS NOT NULL THEN founder ELSE 0::int END AS founder,
 +
CASE WHEN director IS NOT NULL THEN director ELSE 0::int END AS director
 +
FROM copeople
 +
LEFT JOIN titlelookup ON copeople.jobtitle=titlelookup.fulltitle;
 +
  --194359
  
SELECT COUNT(*) FROM(SELECT DISTINCT issuername, issuerstate, issuedate FROM SelectedIPOMA)a;
+
Next we will try to identify whether an executive is male or female based on their names.
 DROP TABLE namegender;
 CREATE TABLE namegender AS
 SELECT firstname,
 CASE WHEN countfemale > 0 AND countmale=0 THEN 1::int ELSE 0::int END AS exclusivelyfemale,
 CASE WHEN countmale > 0 AND countfemale=0 THEN 1::int ELSE 0::int END AS exclusivelymale
 FROM
 (SELECT firstname, COALESCE(sum(titlefemale),0) as countfemale, COALESCE(sum(titlemale),0) as countmale
 FROM copeoplebase WHERE doctor=0
 GROUP BY firstname) As T
 WHERE NOT (countfemale > 0 AND countmale>0);
 --12736

The next table expands CoPeopleBase to include information on executive gender and executive position.

 DROP TABLE CoPeopleFull;
 CREATE TABLE CoPeopleFull AS
 SELECT copeoplebase.*,
 CASE WHEN titlefemale=1 THEN 1::int
 WHEN exclusivelyfemale=1 THEN 1::int ELSE 0::int END AS female,
 CASE WHEN titlemale=1 THEN 1::int
 WHEN exclusivelymale=1 THEN 1::int ELSE 0::int END AS male,
 CASE WHEN (titlefemale=1 OR titlemale=1 OR exclusivelymale=1 OR exclusivelyfemale=1) THEN 0::int ELSE 1::int END AS unknowngender,
 CASE WHEN (ceo=1 OR president=1) THEN 1::int ELSE 0::int END AS ceopres,
 CASE WHEN (chairman=1 OR ceo=1 OR cfo=1 OR coo=1 OR cio=1 OR cto=1 OR otherclvl=1 OR president=1) THEN 1::int ELSE 0::int END AS CLevel,
 CASE WHEN (chairman=1 OR ceo=1 OR cfo=1 OR coo=1 OR cio=1 OR cto=1 OR otherclvl=1 OR president=1 OR director=1 OR boardmember=1) THEN 1::int ELSE 0::int END AS board,
 CASE WHEN (chairman=1 OR ceo=1 OR cfo=1 OR coo=1 OR cio=1 OR cto=1 OR otherclvl=1 OR president=1 OR director=1 OR boardmember=1 OR vp=1 OR founder=1) THEN 1::int ELSE 0::int END AS vpandabove
 FROM copeoplebase
 LEFT JOIN namegender ON namegender.firstname=copeoplebase.firstname
 WHERE hasperson=1;
 --177547

The next table only keeps executive listings that have a valid portco primary key associated with them.
 DROP TABLE CoPeopleKey;
 CREATE TABLE CoPeopleKey AS
 SELECT A.*
 FROM CoPeopleFull AS A
 JOIN (SELECT firstname, lastname, cname, datefirstinv, statecode, count(*) FROM CoPeopleFull
 WHERE firstname IS NOT NULL AND lastname IS NOT NULL AND cname IS NOT NULL AND datefirstinv IS NOT NULL AND statecode IS NOT NULL
 GROUP BY firstname, lastname, cname, datefirstinv, statecode HAVING COUNT(*)=1) AS B
 ON A.firstname=B.firstname AND A.lastname=B.lastname AND A.datefirstinv=B.datefirstinv AND A.cname=B.cname AND A.statecode=B.statecode;
 --176251

The next table identifies whether a person previously held executive positions.

 CREATE TABLE CoPeopleSerial AS
 SELECT firstname, lastname, cname, datefirstinv, statecode,
 COALESCE(sum(hasperson),0) as prev,
 COALESCE(sum(ceo),0) as prevceo,
 COALESCE(sum(ceopres),0) as prevceopres,
 COALESCE(sum(founder),0) as prevfounder,
 COALESCE(sum(clevel),0) as prevclevel,
 COALESCE(sum(board),0) as prevboard,
 COALESCE(sum(vpandabove),0) as prevvpandabove,
 CASE WHEN sum(hasperson) >=1 THEN 1::int ELSE 0::int END AS serial,
 CASE WHEN sum(ceo) >=1 THEN 1::int ELSE 0::int END AS serialceo,
 CASE WHEN sum(ceopres) >=1 THEN 1::int ELSE 0::int END AS serialceopres,
 CASE WHEN sum(founder) >=1 THEN 1::int ELSE 0::int END AS serialfounder,
 CASE WHEN sum(clevel) >=1 THEN 1::int ELSE 0::int END AS serialclevel,
 CASE WHEN sum(board) >=1 THEN 1::int ELSE 0::int END AS serialboard,
 CASE WHEN sum(vpandabove) >=1 THEN 1::int ELSE 0::int END AS serialvpandabove
 FROM (
 SELECT A.prefix, A.firstname, A.lastname, A.cname, A.datefirstinv, A.statecode,
 B.cname as prevcname, B.datefirstinv as prevdatefirstinv, B.statecode as prevstatecode, B.ceo, B.ceopres, B.founder, B.clevel, B.board, B.vpandabove, B.hasperson
 FROM CoPeopleKey AS A
 LEFT JOIN CoPeopleKey AS B ON A.firstname=B.firstname AND A.lastname=B.lastname AND A.datefirstinv > B.datefirstinv
 ) AS T
 GROUP BY firstname, lastname, cname, datefirstinv, statecode;
 --176251

The last table aggregates a ton of information on executives for each company. There is too much information to explain it all.
 +
 
 +
 DROP TABLE copeopleagg;
 CREATE TABLE copeopleagg AS
 SELECT A.cname, A.datefirstinv, A.statecode,
 sum(hasperson) as numperson,
 sum(hastitle) as numtitled,
 CASE WHEN sum(ceopres) >=1 THEN 1::int ELSE 0::int END AS hasceopres,
 CASE WHEN sum(founder) >=1 THEN 1::int ELSE 0::int END AS hasfounder,
 CASE WHEN sum(clevel) >=1 THEN 1::int ELSE 0::int END AS hasclevel,
 CASE WHEN sum(board) >=1 THEN 1::int ELSE 0::int END AS hasboard,
 CASE WHEN sum(vpandabove) >=1 THEN 1::int ELSE 0::int END AS hasvpandabove,
 sum(female) as females,
 sum(male) as males,
 sum(unknowngender) as ugs,
 sum(doctor*female) as femaledoctors,
 sum(doctor*male) as maledoctors,
 sum(doctor*unknowngender) as ugdoctors,
 sum(ceopres*female) as femaleceos,
 sum(ceopres*male) as maleceos,
 sum(ceopres*unknowngender) as ugceos,
 sum(ceopres*female*doctor) as femaledoctorceos,
 sum(ceopres*male*doctor) as maledoctorceos,
 sum(ceopres*unknowngender*doctor) as ugdoctorceos,
 sum(founder*female) as femalefounders,
 sum(founder*male) as malefounders,
 sum(founder*unknowngender) as ugfounders,
 sum(founder*female*doctor) as femaledoctorfounders,
 sum(founder*male*doctor) as maledoctorfounders,
 sum(founder*unknowngender*doctor) as ugdoctorfounders,
 sum(clevel*female) as femaleclevels,
 sum(clevel*male) as maleclevels,
 sum(clevel*unknowngender) as ugclevels,
 sum(clevel*female*doctor) as femaledoctorclevels,
 sum(clevel*male*doctor) as maledoctorclevels,
 sum(clevel*unknowngender*doctor) as ugdoctorclevels,
 sum(board*female) as femaleboards,
 sum(board*male) as maleboards,
 sum(board*unknowngender) as ugboards,
 sum(board*female*doctor) as femaledoctorboards,
 sum(board*male*doctor) as maledoctorboards,
 sum(board*unknowngender*doctor) as ugdoctorboards,
 sum(vpandabove*female) as femaleabovevps,
 sum(vpandabove*male) as maleabovevps,
 sum(vpandabove*unknowngender) as ugabovevps,
 sum(vpandabove*female*doctor) as femaledoctorabovevps,
 sum(vpandabove*male*doctor) as maledoctorabovevps,
 sum(vpandabove*unknowngender*doctor) as ugdoctorabovevps,
 sum(prev*female) as femaleprevs,
 sum(prev*male) as maleprevs,
 sum(prev*unknowngender) as ugprevs,
 sum(prevceopres*female) as femaleprevceopres,
 sum(prevceopres*male) as maleprevceopres,
 sum(prevceopres*unknowngender) as ugprevceopres,
 sum(prevfounder*female) as femaleprevfounder,
 sum(prevfounder*male) as maleprevfounder,
 sum(prevfounder*unknowngender) as ugprevfounder,
 sum(prevclevel*female) as femaleprevclevel,
 sum(prevclevel*male) as maleprevclevel,
 sum(prevclevel*unknowngender) as ugprevclevel,
 sum(prevboard*female) as femaleprevboard,
 sum(prevboard*male) as maleprevboard,
 sum(prevboard*unknowngender) as ugprevboard,
 sum(prevvpandabove*female) as femaleprevvpandabove,
 sum(prevvpandabove*male) as maleprevvpandabove,
 sum(prevvpandabove*unknowngender) as ugprevvpandabove,
 sum(serial*female) as femaleserials,
 sum(serial*male) as maleserials,
 sum(serial*unknowngender) as ugserials,
 sum(serialceopres*female) as femaleserialceopres,
 sum(serialceopres*male) as maleserialceopres,
 sum(serialceopres*unknowngender) as ugserialceopres,
 sum(serialfounder*female) as femaleserialfounder,
 sum(serialfounder*male) as maleserialfounder,
 sum(serialfounder*unknowngender) as ugserialfounder,
 sum(serialclevel*female) as femaleserialclevel,
 sum(serialclevel*male) as maleserialclevel,
 sum(serialclevel*unknowngender) as ugserialclevel,
 sum(serialboard*female) as femaleserialboard,
 sum(serialboard*male) as maleserialboard,
 sum(serialboard*unknowngender) as ugserialboard,
 sum(serialvpandabove*female) as femaleserialvpandabove,
 sum(serialvpandabove*male) as maleserialvpandabove,
 sum(serialvpandabove*unknowngender) as ugserialvpandabove,
 sum(ceopres*serialceopres*female) as femaleceopresserialceopres,
 sum(ceopres*serialceopres*male) as maleceopresserialceopres,
 sum(ceopres*serialceopres*unknowngender) as ugceopresserialceopres,
 sum(founder*serialfounder*female) as femalefounderserialfounder,
 sum(founder*serialfounder*male) as malefounderserialfounder,
 sum(founder*serialfounder*unknowngender) as ugfounderserialfounder
 FROM CoPeopleKey AS A
 JOIN CoPeopleSerial AS B
 ON A.firstname=B.firstname AND A.lastname=B.lastname AND A.datefirstinv=B.datefirstinv AND A.cname=B.cname AND A.statecode=B.statecode
 GROUP BY A.cname, A.datefirstinv, A.statecode;
 --30413

Since this table is so big, it is a good idea to have a smaller, more manageable table to work with.

 DROP TABLE copeopleaggsimple;
 CREATE TABLE copeopleaggsimple AS
 SELECT cname, datefirstinv, statecode, numperson, females, males, ugs, ugdoctors, maleserials+femaleserials+ugserials AS serials
 FROM copeopleagg;
 --30413

===Fund People===
Luckily, this process is much easier than the company people process. First, simply load the data into the database.

 DROP TABLE fundpeople;
 CREATE TABLE fundpeople(
   fundname varchar(255),
   fundyear int,
   prefix varchar(5),
   firstname varchar(50),
   lastname varchar(50),
   jobtitle varchar(150),
   prevpos varchar(255)
 );

 \COPY fundpeople FROM 'Executives-Funds-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --328994

The next table identifies degree and sex information about the executives of the fund.

 DROP TABLE fundpeoplebase;
 CREATE TABLE fundpeoplebase AS
 SELECT fundpeople.*,
 CASE WHEN prefix='Ms' THEN 1::int
 WHEN prefix='Mr' THEN 0::int
 ELSE Null::int END AS titlefemale,
 CASE WHEN prefix='Ms' THEN 0::int
 WHEN prefix='Mr' THEN 1::int
 ELSE Null::int END AS titlemale,
 CASE WHEN prefix='Dr' THEN 1::int
 ELSE 0::int END AS doctor,
 CASE WHEN prefix IS NULL THEN 0::int
 ELSE 1::int END AS hastitle,
 CASE WHEN prefix IS NULL AND firstname IS NULL AND lastname IS NULL THEN 0::int
 ELSE 1::int END AS hasperson
 FROM fundpeople;
 --328994

The next table tries to identify the sex of each executive using the namegender table defined earlier. It only keeps rows where a person is actually listed.

 DROP TABLE FundPeopleFull;
 CREATE TABLE FundPeopleFull AS
 SELECT fundpeoplebase.*, exclusivelyfemale, exclusivelymale,
 CASE WHEN titlefemale=1 THEN 1::int
 WHEN exclusivelyfemale=1 AND exclusivelymale=0 AND (titlemale=0 OR titlemale IS NULL) THEN 1::int ELSE 0::int END AS female,
 CASE WHEN titlemale=1 THEN 1::int
 WHEN exclusivelymale=1 AND exclusivelyfemale=0 AND (titlefemale=0 OR titlefemale IS NULL) THEN 1::int ELSE 0::int END AS male,
 CASE WHEN (titlefemale=1 OR titlemale=1 OR exclusivelymale=1 OR exclusivelyfemale=1) THEN 0::int ELSE 1::int END AS unknowngender
 FROM fundpeoplebase
 LEFT JOIN namegender ON namegender.firstname=fundpeoplebase.firstname
 WHERE hasperson=1;
 --320915

The next table gives you information on executives aggregated by fund.

 DROP TABLE FundPeopleAgg;
 CREATE TABLE FundPeopleAgg AS
 SELECT fundname,
 sum(female) as numfemale,
 sum(male) as nummale,
 sum(unknowngender) as numunknowngender,
 sum(doctor) as numdoctor,
 sum(female*doctor) as numfemaledoctor,
 sum(male*doctor) as nummaledoctor,
 sum(unknowngender*doctor) as numunknowngenderdoctor,
 sum(hastitle) as numtitled,
 sum(hasperson) as numpeople,
 CASE WHEN sum(hasperson) > 0 THEN sum(female)::real/sum(hasperson) ELSE NULL END as fracfemale, --cast to real to avoid integer division
 CASE WHEN sum(male) > 0 THEN sum(female)::real/sum(male) ELSE NULL END as ratiofemale
 FROM FundPeopleFull
 GROUP BY fundname;
 --21536

It is also good to have this information for firms. We do not pull firm people information from SDC, but we have enough information to create it from preexisting tables.

 DROP TABLE firmpeopleagg;
 CREATE TABLE firmpeopleagg AS
 SELECT _firmname as firmname, sum(numfemale) as firmwomen, sum(nummale) as firmmen, sum(numunknowngender) as firmugs,
 sum(numdoctor) as firmdoctors, sum(numpeople) as firmpeople,
 CASE WHEN sum(numpeople) > 0 THEN (sum(numfemale)::real/sum(numpeople)) ELSE NULL END as firmfracwomen, --cast before dividing to avoid integer division
 CASE WHEN sum(nummale) > 0 THEN (sum(numfemale)::real/sum(nummale)) ELSE NULL END as firmratiowomen
 FROM roundlineaggfunds AS A
 JOIN fundpeopleagg AS B ON A._fundname=B.fundname
 GROUP BY _firmname;
 --5273

==Marcos's Code==
This is code that a Rice student, Marcos Lee, wrote. I cleaned it and ran it. I have described the tables that I built, and where they come from, below. My code is located in:
 E:\McNair\Projects\VentureXpert Database\vcdb3\LoadingScripts\MatchingEntrepsV3

If you have trouble following my explanations, go to this location and read the queries; most of them are straightforward.
===Describing Stacks Created in Code===
CoPeopleBase:
-Builds from copeople and titlelookup
-Identifies what roles people played in their companies

namegender:
-built from copeoplebase
-identifies male/female/unknown

CoPeopleFull:
-built from copeoplebase and namegender
-builds more extensive information on each executive, including specifically what level of executive they are

CoPeopleKey:
-built from CoPeopleFull
-creates a table where only executives with full primary keys are kept

CoPeopleSerial:
-built from copeoplekey
-keeps track of executives' previous jobs at the executive level

CoPeopleAgg:
-built from copeoplekey and copeopleserial
-gets extensive information on executives for each company

FundPeopleBase:
-built from fundpeople
-identifies male/female/doctor
-hasperson column is slightly weird because a row can have only the lastname, without a prefix or first name, and still get a 1 in the column. Seems to be of little use/too broad

FundPeopleFull:
-built from fundpeoplebase, namegender
-adds in male/female

FundPeopleAgg:
-built from fundpeoplefull
-has aggregations of gender info for each fund

RoundLineJoinerLeanffWlistno:
-built from roundlinejoinerleanff
-adds listno to funds

RoundLineAggFunds:
-built from roundlinejoinerleanffwlistno and roundlineaggfirms
-if there are two funds from one firm that invest in the same portco, we choose only one and leave the others behind

RoundLineAggWExit:
-built from roundlineaggfirms, portcoexitupdated, roundlineaggfunds
-adds in exit information for each company in roundlineaggfirms

FirmPerf:
-built from roundlineaggwexit
-adds in various performance measures for a given firm

PortCoFundDemo:
-built from roundlinejoinerleanffclean and fundpeopleagg
-gives information on executives of funds who invested in the portcos

PortCoPeopleMaster:
-built from PortCoMaster, PortCoIndustry, PortCoPatent, PortCoSBIR, copeopleagg, PortCoFundDemo, CPI, statelookupint
-huge amount of data about companies and their executives

RoundAggDistBase:
-built from portcogeo, firmbogeo, roundlineaggwexit
-creates geographic points using long, lat from geocoding

RoundAggDist:
-Built from roundaggdistbase
-gets actual distances between portcos and firms; if a branch office exists and its distance is less than the distance to the firm, it chooses that; also generates a random number (see the sketch after this list)

FirmPeopleAgg:
-built from roundlineaggfunds, fundpeopleagg
-finds information on executives from different firms

PortCoMatchmaster:
-built from portcopatent, portcoindustry, portcosbir, copeopleaggsimple, portcoid
-gets all information together about portcos

FirmMatchMaster:
-built from firmperf, firmvars, firmpeopleagg, firmid
-gets all information together about firms

RoundLineMasterBase:
-built from portcomatchmaster, firmmatchmaster, roundaggdist, roundlineaggwexit
-builds a large amount of information about portcos and firms, specifically info about exits and distances

MatchMostNumerous:
-built from roundlinemasterbase
-finds the max number of portcos invested in by a firm that also invested in the company being grouped by

MatchHighestRandom:
-built from matchmostnumerous
-if two firms that invested in one company had the same max number of portcos, this randomly chooses one

FirmActiveYearsCode20:
-built from roundlinejoinerleanffclean, portcoindustry
-adds firmname to the industry code; not exactly sure why DISTINCT is used in the query

RealMatchesCode20:
-built from MatchHighestRandom, PortCoIndustry
-real matches between portcos and the firms that invested in them, including the code20

SyntheticFirmSetBaseCode20:
-built from realmatchescode20, firmactiveyearscode20
-cross product of firms and portcos; finds firms that invested in the same year the portco received its first investment and that invested in the same type of company, and makes sure matches are unique

AllMatchKeys:
-built from SyntheticFirmSetBaseCode20, RealMatchesCode20
-combines synthetic and real matches

SynthRoundAggDistBaseCode20:
-built from allmatchkeys, portcogeo, firmbogeo
-builds points for all portco, firm listings in allmatchkeys

SynthRoundAggDistCode20:
-built from synthroundaggdistbasecode20
-finds the actual distance between portcos and firms using the installed extensions; chooses the branch office if the distance between the portco and the branch office is less than the distance to the firm

SynthFirmnameInduBlowoutCode20:
-built from allmatchkeys, roundlinemasterbase
-gets every firm combination and checks whether the companies that those firms invested in are in the same general industry

SynthFirmNameroundInduHistCode20:
-built from SynthFirmnameInduBlowoutCode20
-gets information, by portco and firmname match, about the firm's past investment patterns

MasterWithSynthBaseCode20Portco:
-built from AllMatchKeys, matchhighestrandom, synthroundaggdistcode20, synthfirmnameroundinduhistcode20, synthfirmnameroundindutotalcode20, firmvars, copeopleaggsimple, portcomaster
-builds a bunch of information about synthetic and real matches

SynthFirmnameRoundInduTotalCode20:
-built from allmatchkeys, roundlinemasterbase
-finds the number of portcos in certain industries by firmname

MasterWithSynthCode20Firms:
-built with firmmatchmaster, allmatchkeys
-matches a bunch of information to all firms

MasterWithSynthcode20:
-built from masterwithsynthbasecode20portco, masterwithsynthcode20firms
-brings together a huge amount of info on real and synthetic matches for firms and companies

MasterReals:
-built from masterwithsynthcode20
-gets just the real matches

MasterOneSynth:
-built from masterwithsynthcode20
-gets just one randomly chosen synthetic match between companies and firms

MasterRealOneSynth:
-built from masteronesynth, masterreals
-combines the real and one-synth tables
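
The distance step in RoundAggDist and SynthRoundAggDistCode20 is not reproduced here, but it looks roughly like the sketch below. This is only a hedged illustration: it assumes the PostGIS extension and the latitude/longitude columns described for portcogeo and firmbogeo; the actual queries in MatchingEntrepsV3 may differ.

 --Hedged sketch of the portco-to-firm distance step (assumes PostGIS; the real code may differ)
 SELECT R.coname, R.statecode, R.datefirstinv, R.firmname,
        ST_Distance(ST_SetSRID(ST_MakePoint(P.longitude, P.latitude), 4326)::geography,
                    ST_SetSRID(ST_MakePoint(F.longitude, F.latitude), 4326)::geography)/1000.0 AS distkm,
        random() AS rand --random tie-breaker, as described for RoundAggDist
 FROM roundlineaggwexit AS R
 JOIN portcogeo AS P ON R.coname=P.coname AND R.statecode=P.statecode AND R.datefirstinv=P.datefirstinv
 JOIN firmbogeo AS F ON R.firmname=F.firmname;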
  
==Ranking Tables and Graphs==
This is a slight detour from the creation of VCDB3. However, this is a cool process because you actually get to use the data you have been working with. The process is extensive, but the queries are easy to understand. If you wish to have a deeper understanding of the process, read the code. It is located in:

 E:\McNair\Projects\VentureXpert Database\vcdb3\LoadingScripts\RoundRanking.SQL

First you must create a table that has aggregate round information grouped by city and round year. Since this is a little difficult to picture, I will attach the code.

 DROP TABLE roundleveloutput;
 CREATE TABLE roundleveloutput AS SELECT
 city, statecode, roundyear AS year,
 SUM(rndamtestm*seedflag) AS seedamnt,
 SUM(rndamtestm*earlyflag) AS earlyamnt,
 SUM(rndamtestm*laterflag) AS lateramnt,
 SUM(rndamtestm*growthflag) AS selamnt,
 SUM(growthflag*dealflag) AS numseldeals
 FROM round GROUP BY city, statecode, roundyear;
 --30028

Next, work through the following steps; each is its own table in RoundRanking.SQL:
# Create a table that lists the all-time SEL amount by city. Keep the state code: city names are often repeated in different states, and this ensures you have the right city.
# Create a table that lists each unique city, state pair for every year since 1980.
# Build a table that matches portcos to this city, state, year blowout for every year they were alive. This table should be relatively large, since it lists each company once for every year it was alive up until the present.
# Create a table that displays the number of companies alive in each city every year since 1980.
# Add a table that brings together, by city, state, and year, all of the information you have built in the previous tables, and also add in population.
# Run the ranking queries.
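
The first of these steps, the all-time SEL-by-city table, is just an aggregation of roundleveloutput. A hedged sketch follows; the output table and column names here are assumptions, not necessarily those in RoundRanking.SQL:

 DROP TABLE selbycityalltime;
 CREATE TABLE selbycityalltime AS
 SELECT city, statecode,
 SUM(selamnt) AS alltimeselamnt,
 SUM(numseldeals) AS alltimeseldeals
 FROM roundleveloutput
 GROUP BY city, statecode;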

For states, follow the same general process, but group by state rather than by city and state.

If this explanation was not enough (it was not meant to be in depth), go to the location given above and read the actual code. With the description here, you should be able to piece together what each query does.

==Master Tables==
Throughout the creation of the database, there are inevitably some tables that are vital to creating a solid foundation. The following are the master tables, each with a quick explanation:
* '''Companybasecore''' - The base table for portcos. This is data drawn directly from SDC and not changed other than for cleaning purposes. Count: 48001
* '''BranchOfficeCore''' - The base table for branch offices. This is data drawn directly from SDC. Only branch offices with distinct firm names are included. Count: 10032
* '''FirmBaseCore''' - The base table for firms. This is also data taken directly from SDC and not changed other than for cleaning purposes. Count: 15437
* '''FundBaseCore''' - The base table for funds. This is also data taken directly from SDC and not changed other than for cleaning purposes. Count: 28833
* '''IPOCleanNoDups''' - The clean table of IPOs after being run through the matcher against portcos. It was cleaned manually and had duplicates removed. Count: 2136
* '''IPONoDups''' - The IPO table before the cleaning process of matching to portcos. There could be problems with this table as we used aggregate functions here. Be careful using this table. Count: 11149
* '''MACleanNoDups''' - The clean table of MAs after being run through the matcher against portcos. It was cleaned manually and had duplicates removed. Count: 7171
* '''MANoDups''' - The MA table before the cleaning process of matching to portcos. There could be problems with this table as we used aggregate functions here as well. Be careful using this table. Count: 119374
* '''Round''' - The master round table. It has SEL flags attached to it and has the most round info. RoundBaseClean is also a decent table but has less information. This table is your best bet for round information. Count: 151323
* '''RoundLineJoinerLeanFFClean''' - The master round table for joining purposes. It was cleaned and used for widespread joining. Count: 163157
* '''CoPeople''' - The base table for PortCo people information. It was pulled directly from SDC. Count: 194359
* '''FirmBoGeo''' - The base table for firm/branch office geocoding. This table was cleaned and contains lat/long readings for firms and branch offices where the information was available. Count: 15437
* '''PortCoGeo''' - The base table for portco geocoding. The table was cleaned and contains lat/long readings for portcos where the Google API returned a valid reading. Count: 48001
* '''FirmPerf''' - A wide-reaching table about the performance of firms. It was mainly used later in the project but is extremely useful. Count: 8336
* '''FundPeople''' - The base table for fund people information. It was pulled directly from SDC. Count: 328994
* '''PortCoExitUpdated''' - The master exit table for portcos. The difference between this and PortCoExit is that Updated has two columns marking MAs and IPOs, while the other has the single IPOvsMA column. Use whichever one is more convenient. Count: 48001
* '''PortCoMaster''' - This table is great. There is a ton of information on PortCos, including SEL flags, round amounts, and industry classifications. Count: 48001
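
If you rebuild any of these tables, it is worth re-checking the counts against the numbers above. A hedged sketch of such a check for a few of the core tables (extend the UNION for the rest):

 SELECT 'companybasecore' AS tab, COUNT(*) AS n, 48001 AS expected FROM companybasecore
 UNION ALL SELECT 'firmbasecore', COUNT(*), 15437 FROM firmbasecore
 UNION ALL SELECT 'fundbasecore', COUNT(*), 28833 FROM fundbasecore
 UNION ALL SELECT 'branchofficecore', COUNT(*), 10032 FROM branchofficecore;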

Latest revision as of 11:36, 24 March 2024




The successors to this project include:

Relevant Former Projects

  1. Venture Capital (Data)
  2. Retrieving US VC Data From SDC
  3. VC Database Rebuild

Location

My scripts for SDC pulls are located in the E drive in the location:

E:\McNair\Projects\VentureXpertDatabase\ScriptsForSDCExtract

My successfully pulled and normalized files are stored in the location:

E:\McNair\Projects\VentureXpertDatabase\ExtractedDataQ2

My scripts for loading tables and data are in:

E:\McNair\Projects\VentureXpertDatabase\vcdb3\LoadingScripts

There are a variety of SQL files in there with self explanatory names. The file that has all of the loading scripts is called LoadingScriptsV1. The folder vcdb2 is there for reference to see what people before had done. ExtractedData is there because I pulled data before July 1st, and Ed asked me to repull the data.

Goal

I will be looking to redesign the VentureXpert Database in a way that is more intuitively built than the previous one. I will also update the database with current data.

Initial Stages

The first step of the project was to figure out what primary keys to use for each major table that I create. I looked at the primary keys used in the creation of the VC Database Rebuild and found primary keys that are decent. I have updated them and list them below:

  1. CompanyBaseCore - coname, statecode, datefirstinv
  2. IPOCore - issuer, issuedate, statecode
  3. MACore - target name, target state code, announceddate
  4. Geo - city, statecode, coname, datefirst, year
  5. DeadDate - coname, statecode, datefirst, rounddate (tentative; could still change)
  6. RoundCore - coname, statecode, datefirst, rounddate
  7. FirmBaseCore - firmname
  8. FundBaseCore - fund name (firstinvdate doesn't work because not every row has an entry)

These are my initial listings and I will come back to update them if needed.
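
A quick way to confirm that a candidate key really is a primary key is to compare the table's row count with the count of distinct key values, as is done repeatedly further down this page. A generic sketch, using the CompanyBaseCore key as the example (the table itself is built later):

SELECT COUNT(*) FROM companybasecore;
SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore) AS T;
-- If the two counts match, the key uniquely identifies rows.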

The second part of the initial stage has been to pull data from the SDC Platinum platform. I did it in July to ensure that I had two full quarters of data.


SDC Pull

When pulling data from SDC, it is a good idea to look for previously made rpt files that have the names of the pulls you will need to do. They have already been created and will save you a lot of work. The rpt files that I used are in the folder VentureXpertDB/ScriptsForSDCExtract. The files will come in pairs with one being saved as an ssh file and one as a rpt file. To update the dates to make them recent, go into the ssh file of the pair and change the date of last investment. When you open SDC, you will be given a variety of choices for which database to pull from. For each type of file chose the following:

  1. VentureXpert - PortCo, PortCoLong, USVC, Firms, BranchOffices, Funds, Rounds, VCFirmLong
  2. Mergers & Acquisitions - MAs
  3. Global New Issues Databases - IPOs

Help on pulling data from SDC is on the SDC Platinum (Wiki) page.

VCFund Pull Problem

When pulling VCFund1980-Present, I encountered two problems. First, SDC is not able to filter down to US-only funds with its built-in filters. Second, there are multiple rpt files that specify different variables for the fund pull. I pulled from both to be safe, but the VC Database Rebuild page has a section on the fund pull where Ed specifies which rpt file he used. Regardless, I have both saved in the ExtractedData folder. After speaking with Ed, he told me to use the VCFund1980-present.rpt file to extract the data. I had various problems extracting the data, including the SDC program freezing and an Out of Memory error; check the SDC Platinum (Wiki) page to fix these issues.


Loading Tables

When I describe errors I encountered, I will not describe them using line numbers, because as soon as any data is added the line numbers become useless. Instead, I recommend copying the normalized file you are working with into an Excel file and using the filter feature. This way you can find the line in your specific file that is causing errors and fix it in the file itself. The line numbers that PuTTY errors display are often wrong, so I relied on Excel to find errors fastest. If my instructions are not enough to locate an error, my advice is to take key words from the line PuTTY reports and filter for them in Excel.


DROP TABLE roundbase;
CREATE TABLE roundbase (
  coname varchar(255),
  rounddate date,
  updateddate date,
  foundingdate date,
  datelastinv date,
  datefirstinv date,
  investedk real,
  city varchar(100),
  description varchar(5000),
  msa varchar(100),
  msacode varchar(10),
  nationcode varchar(10),
  statecode varchar(10),
  addr1 varchar(100),
  addr2 varchar(100),
  indclass varchar(100),
  indsubgroup3 varchar(100),
  indminor varchar(100),
  url varchar(5000),
  zip varchar(10),
  stage1 varchar(100),
  stage3 varchar(100),
  rndamtdisck real,
  rndamtestk real,
  roundnum integer,
  numinvestors integer
);
\COPY roundbase FROM 'USVC1980-2018q2-Good.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--151549

The only error I encountered here was with Cardtronic Technology Inc. Here there was a problem with a mixture of quotation marks which cause errors in loading. Find this using the excel trick and remove it manually.

DROP TABLE ipos;
CREATE TABLE ipos (
  issuedate date,
  issuer varchar(255),
  statecode varchar(10), 
  principalamt money, --million
  proceedsamt money, --sum of all markets in million
  naiccode varchar(255), --primary NAIC code
  zipcode varchar(10),
  status varchar (20),
  foundeddate date
);
\COPY ipos FROM 'IPO1980-2018q2-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--12107

I encountered no errors while loading this data.


DROP TABLE branchoffices;
CREATE TABLE branchoffices (
  firmname varchar(255),
  bocity varchar(100),
  bostate varchar(2),
  bocountrycode varchar(2),
  bonation varchar(100),
  bozip varchar(10),
  boaddr1 varchar(100),
  boaddr2 varchar(100)

);
\COPY branchoffices FROM 'USVCFirmBranchOffices1980-2018q2-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--10353

I encountered no errors while loading this data.

DROP TABLE roundline;
CREATE TABLE roundline (
  coname varchar(255),
  statecode varchar(2),
  datelastinv date,
  datefirstinv date,
  rounddate date,
  disclosedamt real,
  fundname varchar(255)
);
\COPY roundline FROM 'USVCRound1980-2018q2-NoFoot-normal-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--403189

I encountered no errors while loading this data.

DROP TABLE fundbase;
CREATE TABLE fundbase (
  fundname varchar(255),
  closedate date, --mm-dd-yyyy
  lastinvdate date, --mm-dd-yyyy
  firstinvdate date, --mm-dd-yyyy
  numportcos integer,
  investedk real,
  city varchar(100),
  fundyear varchar(4), --yyyy
  zip varchar(10),
  statecode varchar(2),
  fundsizem real,
  fundstage varchar(100),
  firmname varchar(255),
  dateinfoupdate date,
  invtype varchar(100),
  msacode varchar(10),
  nationcode varchar(10),
  raisestatus varchar(100),
  seqnum integer,
  targetsizefund real,
  fundtype varchar(100)
);
\COPY fundbase FROM 'VCFund1980-2018q2-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--29397

There is a Ukrainian fund that has stray quotation marks in its name. It is called something along the lines of "VAT "ZNVKIF "Skhidno-Evropeis'lyi investytsiynyi Fond". If this does not help, you can filter in Excel using Kiev as the keyword in the city column to find the line where you are getting errors. Then manually remove the stray quotation marks in the actual text file. After that, the table should load correctly.

DROP TABLE firmbase;
CREATE TABLE firmbase(
  firmname varchar(255),
  foundingdate date, --mm-dd-yyyy
  datefirstinv date, --mm-dd-yyyy  
  datelastinv date, --mm-dd-yyyy
  addr1 varchar(100),
  addr2 varchar(100),
  location varchar(100),
  city varchar(100),
  zip varchar(10),
  areacode integer,
  county varchar(100),
  state varchar(2),
  nationcode varchar(10),
  nation varchar(100),
  worldregion varchar(100),
  numportcos integer,
  numrounds integer,
  investedk money,
  capitalundermgmt money,  
  invstatus varchar(100),
  msacode varchar(10),
  rolepref varchar(100),
  geogpref varchar(100),
  indpref varchar(100),
  stagepref varchar(100),
  type varchar(100)
);


\COPY firmbase FROM 'USVCFirms1980-2018q2-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--15899

The normalization for this file was wrong when I tried to load the data. To fix this go to the file where you have removed the footer and find the column header titled Firm Capital under Mgmt{0Mil}. Delete the {0mil} and renormalize the file. Then everything should be ok. A good way to check this is to copy and paste the normalized file into an excel sheet and see whether the entries line up with their column header correctly. The second error I found was with the Kerala Ventures firm. Here the address has the word l"opera in it. This quotation will cause errors so find the line number using excel and remove it manually. The third error is in an area code where 1-8 is written. This hyphen causes errors. Interestingly, the line number given by PuTTY was correct, and I found it in my text file and deleted it manually. These were the only errors I encountered while loading this table.


DROP TABLE mas;
CREATE TABLE mas (
  announceddate date,
  effectivedate date,
  targetname varchar(255),
  targetstate varchar(100),
  acquirorname varchar(255),
  acquirorstate varchar(100),
  transactionamt money,
  enterpriseval varchar(255),
  acquirorstatus varchar(150)
);
\COPY mas FROM 'MAUSTargetComp100pc1985-July2018-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--119432

I encountered no problems loading in this data.

DROP TABLE longdescription;
CREATE TABLE longdescription(
  coname varchar(255), --company name (the column name is missing in the original; coname is assumed here)
  statecode varchar(10),
  fundingdate date, --date co received first inv
  codescription varchar(10000) --long description
);
\COPY longdescription FROM 'PortCoLongDesc-Ready-normal-fixed.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--48037

I encountered no problems loading this data.

Cleaning Companybase, Fundbase, Firmbase, and BranchOffice

Cleaning Company

The primary key for port cos will be coname, datefirstinv, and statecode. Before checking whether this is a valid primary key, remove the undisclosed companies. I will explain the second part of the query concerning New York Digital Health later.

DROP TABLE companybasecore;
CREATE TABLE companybasecore AS
SELECT * 
FROM Companybase WHERE nationcode = 'US' AND coname != 'Undisclosed Company' 
AND NOT (coname='New York Digital Health LLC' AND statecode='NY' AND datefirstinv='2015-08-13' AND updateddate='2015-10-20');
--48001
SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore) AS T;
--48001

Since the count of the table and the count of the distinct primary key are equivalent, you know that the primary key is valid. In the initial cleaning of the table, I first filtered out only the undisclosed companies. That table had 48002 rows. I then ran the DISTINCT query above and found that there are 48001 distinct rows on the coname, datefirstinv, statecode primary key. Thus there must be two rows that share a primary key. I found this key using the following query:

SELECT * FROM (SELECT coname, datefirstinv, statecode FROM companybase) as key GROUP BY coname, datefirstinv, statecode HAVING COUNT(key) > 1;

The company named 'New York Digital Health LLC' came up as the company that is causing the problems. I queried to find the two rows that list this company name in companybase and chose to keep the row that had the earlier updated date. It is a good practice to avoid deleting rows from tables when possible, so I added the filter as a WHERE clause to exclude one of the New York Digital listings.

Cleaning Fundbase

The primary key for funds will be only the fundname. First get rid of all of the undisclosed funds.

DROP TABLE fundbasenound;
CREATE TABLE fundbasenound AS 
SELECT DISTINCT * FROM fundbase WHERE fundname NOT LIKE '%Undisclosed Fund%';
--28886
SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbasenound)a;
--28833

As you can see, fundbase still has rows that share fundnames. If you are wondering why the DISTINCT in the first query did not eliminate these, it is because this DISTINCT applies to the whole row not individual fundnames. Thus, only completely duplicate rows will be eliminated in the first query. I chose to keep the funds that have the earlier last investment date.

DROP TABLE fundups;
CREATE TABLE fundups AS SELECT
fundname, max(lastinvdate) AS lastinvdate FROM fundbasenound GROUP BY fundname HAVING COUNT(*)>1;
--53
DROP TABLE fundbasecore;
CREATE TABLE fundbasecore AS
SELECT A.* FROM fundbasenound AS A LEFT JOIN fundups AS B ON A.fundname=B.fundname AND A.lastinvdate=B.lastinvdate WHERE B.fundname IS NULL AND B.lastinvdate IS NULL;
--28833

Since the count of fundbasecore is the same as the number of distinct fund names, we know that the fundbasecore table is clean. In the first query I find the duplicated fund names and take the row with the greater last investment date. I then match this table back to fundbasenound, but keep all the rows from fundbasenound for which there is no corresponding fund in fundups based on fund name and date of last investment. This leaves the funds with the earlier date of last investment.
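
The same dedup can also be written more compactly with PostgreSQL's DISTINCT ON, ordering so that the earlier last investment date wins. This is only an alternative sketch, not the query that was actually run, and the table name fundbasecorealt is made up:

DROP TABLE fundbasecorealt;
CREATE TABLE fundbasecorealt AS
SELECT DISTINCT ON (fundname) *
FROM fundbasenound
ORDER BY fundname, lastinvdate ASC;
-- Keeps one row per fundname: the one with the earliest lastinvdate.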

Cleaning Firmbase

The primary key for firms will be firm name. First I got rid of all undisclosed firms. I also filtered out two firms that have identical firm names and founding dates. The reason for this is because I use founding dates to filter out duplicate firm names. If there are two rows that have the same firm name and founding date, they will not be filtered out by the third query below. Thus, I chose to filter those out completely.

DROP TABLE firmbasenound;
CREATE TABLE firmbasenound AS 
SELECT DISTINCT * FROM firmbase WHERE firmname NOT LIKE '%Undisclosed Firm%' AND firmname NOT LIKE '%Amundi%' AND firmname NOT LIKE '%Schroder Adveq Management%';
--15452
SELECT COUNT(*) FROM(SELECT DISTINCT firmname FROM firmbasenound)a;
--15437

Since these counts are not equal we will have to clean the table further. We will use the same method from before.

DROP TABLE firmdups;
CREATE TABLE firmdups AS SELECT
firmname, max(foundingdate) as foundingdate FROM firmbasenound GROUP BY firmname HAVING COUNT(*)>1;
--15
DROP TABLE firmbasecore;
CREATE TABLE firmbasecore AS
SELECT A.* FROM firmbasenound AS A LEFT JOIN firmdups AS B ON A.firmname=B.firmname AND A.foundingdate=B.foundingdate WHERE B.firmname IS NULL AND B.foundingdate IS NULL;
--15437

Since the count of firmbasecore and the DISTINCT query are the same, the firm table is now clean.

Cleaning Branch Offices

When cleaning the branch offices, I had to remove all duplicates in the table. This is because the table is so sparse that often the only data in a row is the firm name the branch is associated with. Thus, I couldn't filter based on dates as I had done previously for firms and funds. The primary key is firm name.

DROP TABLE bonound;
CREATE TABLE bonound AS
SELECT *, CASE WHEN firmname LIKE '%Undisclosed Firm%' THEN 1::int ELSE 0::int END AS undisclosedflag
FROM branchoffices;
--10353
SELECT COUNT(*) FROM(SELECT DISTINCT firmname FROM bonound)a;
--10042

Since these counts aren't the same, we will have to work a little more to clean the table. As stated above, I did this by excluding the firm names that were duplicated.

DROP TABLE branchofficecore;
CREATE TABLE branchofficecore AS
SELECT A.* FROM bonound AS A JOIN (
		SELECT bonound.firmname, COUNT(*) FROM bonound GROUP BY firmname
		HAVING COUNT(*) =1
		) AS B
ON A.firmname=B.firmname WHERE undisclosedflag=0;
--10032
SELECT COUNT(*) FROM (SELECT DISTINCT firmname FROM branchofficecore)a;
--10032

Since these counts are the same, we are good to go. The count is 10 lower because we completely removed 10 firmnames from the listing by throwing out the duplicates.

Instructions on Matching PortCos to Issuers and M&As From Ed

Company Standardizing

Get portco keys

DROP TABLE portcokeys;
CREATE TABLE portcokeys AS
SELECT coname, statecode, datefirst
FROM portcocore;
--CHECK COUNT IS SAME AS portcocore OR THESE KEYS ARE VALID AND FIX THAT FIRST

Get distinct coname and put it in a file

\COPY (SELECT DISTINCT coname FROM portcokeys) TO 'DistinctConame.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV

Match that to itself

Move DistinctConame.txt to E:\McNair\Software\Scripts\Matcher\Input
Open powershell and change directory to E:\McNair\Software\Scripts\Matcher
Run the matcher in mode2:
 perl Matcher.pl -file1="DistinctConame.txt" -file2="DistinctConame.txt" -mode=2
Pick up the output file from E:\McNair\Software\Scripts\Matcher\Output (it is probably called DistinctConame.txt-DistinctConame.txt.matched) and move it to your Z drive directory

Load the matches into the dbase

DROP TABLE PortcoStd;
CREATE TABLE PortcoStd (
   conamestd  varchar(255),
   coname   varchar(255),
   norm  varchar(100),
   x1  varchar(255),
   x2  varchar(255)
);

\COPY PortcoStd FROM 'DistinctConame.txt-DistinctConame.txt.matched' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--YOUR COUNT

Join the Conamestd back to the portcokeys table to create your matching table

DROP TABLE portcokeysstd;
CREATE TABLE portcokeysstd AS
SELECT B.conamestd, A.*
FROM portcokeys AS A
JOIN PortcoStd AS B ON A.coname=B.coname;
--CHECK COUNT IS SAME AS portcokeys OR YOU LOST SOME NAMES OR INFLATED THE DATA

Put that in a file for matching (conamestd is in first column by construction)

 \COPY portcokeysstd TO 'PortCoMatchInput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --YOUR COUNT


MA Cleaning and Matching

First remove all of the duplicates in the MA data. Do this by running aggregate queries on every column except for the primary key:

DROP TABLE MANoDups;
CREATE TABLE MANoDups AS
SELECT targetname, targetstate, announceddate, min(effectivedate) AS effectivedate, MIN(acquirorname) as acquirorname, MIN(acquirorstate) as acquirorstate, MAX(transactionamt) as 
transactionamt, MAX(enterpriseval) as enterpriseval, MIN(acquirorstatus) as acquirorstatus
FROM mas 
GROUP BY targetname, targetstate, announceddate ORDER BY targetname, targetstate, announceddate;
--119374
SELECT COUNT(*) FROM(SELECT DISTINCT targetname, targetstate, announceddate FROM manodups)a;
--119374

Since these counts are equivalent, the data set is clean. Then get all the primary keys from the table and copy the distinct target names into a text file.

DROP TABLE makey;
CREATE TABLE makey AS
SELECT targetname, targetstate, announceddate
FROM manodups;
--119374
\COPY (SELECT DISTINCT targetname FROM makey) TO 'DistinctTargetName.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--117212

After running this list of distinct target names through the matcher, put the standardized MA list into the data base.

DROP TABLE MaStd;
CREATE TABLE MaStd (
  targetnamestd varchar(255),
  targetname varchar(255),
  norm varchar(100),
  x1 varchar(255),
  x2 varchar(255)
);
\COPY mastd FROM 'DistinctTargetName.txt-DistinctTargetName.txt.matched' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--117212

Then match the list of standardized names back to the makey table to get a table with standardized keys and primary keys. This will be your input for matching against port cos.

DROP TABLE makeysstd;
CREATE TABLE makeysstd AS
SELECT B.targetnamestd, A.*
FROM makey AS A
JOIN mastd AS B ON A.targetname=B.targetname;
--119374
 \COPY makeysstd TO 'MAMatchInput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 --119374

Use this text file to match against the PortCoMatchInput. Your job will be to determine whether the matches between the MAs and PortCos are true matches. The techniques that I used are described in the section below.
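
For reference, the matcher call for this step follows the same pattern as the company-standardization call above, just without -mode=2. This is a hedged example rather than a command copied from the project, so adjust the file names to whatever you produced:

 perl Matcher.pl -file1="PortCoMatchInput.txt" -file2="MAMatchInput.txt"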

IPO Cleaning and Matching

The process is the same for IPOs.

DROP TABLE iponodups;
CREATE TABLE iponodups
AS SELECT issuer, statecode, issuedate, MAX(principalamt) AS principalamt, MAX(proceedsamt) AS proceedsamt, MIN(naiccode) as naicode, MIN(zipcode) AS zipcode, MIN(status) AS status, 
MIN(foundeddate) AS foundeddate
FROM ipos GROUP BY issuer, statecode, issuedate ORDER BY issuer, statecode, issuedate; 
--11149
SELECT COUNT(*) FROM(SELECT DISTINCT issuer, statecode, issuedate FROM iponodups)a;
--11149
DROP TABLE ipokeys;
CREATE TABLE ipokeys AS
SELECT issuer, statecode, issuedate
FROM iponodups;
--11149
\COPY (SELECT DISTINCT issuer FROM ipokeys) TO 'IPODistinctIssuer.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--10803
DROP TABLE ipokeysstd;
CREATE TABLE ipokeysstd (
   issuerstd varchar(255),
   issuer varchar(255),
   norm varchar(100),
   x1 varchar(255),
   x2 varchar(255)
  );

\COPY ipokeysstd FROM 'IPODistinctIssuer.txt-IPODistinctIssuer.txt.matched' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--10803
DROP TABLE ipostd;
CREATE TABLE ipostd AS
SELECT B.issuerstd, A.*
FROM ipokeys AS A
JOIN ipokeysstd AS B ON A.issuer=B.issuer;
--11149
\COPY ipostd TO 'IPOMatchInput.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--11149

As with MA, match this file against the PortCoMatchInput file without mode 2. Then manually check the matches using the techniques described below.

I generally use MAX for amounts and MIN for dates. I also chose to use MIN on text strings.

Cleaning IPO and MA Data

It is important to follow Ed's direction of cleaning the data with aggregate functions before putting the data into Excel; this will save you a lot of unnecessary manual checking. When ready, paste the data you have into an Excel file. In that Excel file, I made three columns: one checking whether the state codes were equivalent, one checking whether the date of first investment was 3 years before the MA or IPO, and one checking whether both conditions were satisfied for each company. I did this using simple IF statements. This process is manual checking and filtering to see whether matches are correct, and it is subjective and tedious. First, I went through and checked the companies that did not have equivalent state codes. If the company was one that I knew, or the name was unique enough that I did not believe the same name would appear in another state, I marked the state codes as equivalent. I did the same for the date of first investment vs the MA/IPO date. Then I removed all duplicates that had the marking Warning Multiple Matches, and the data sheets were clean.
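
If you want a first pass at these checks before moving to Excel, the same flags can be computed in SQL. A hedged sketch, run against whatever table holds the loaded matcher output (rawmatches is an assumed name; the columns are the ones used above):

-- Hedged sketch: "rawmatches" is an assumed name for the loaded matcher output.
SELECT coname, statecode, datefirstinv, targetname, targetstate, announceddate,
       CASE WHEN statecode=targetstate THEN 1 ELSE 0 END AS samestate,
       (announceddate - datefirstinv) AS daysfirstinvtoexit --compare this gap against the three-year rule described above
FROM rawmatches;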


Process For Creating the PortCoExits Table

MA Process

First we must load the clean, manually checked tables back into the database.

DROP TABLE MAClean;
CREATE TABLE MAClean (
 conamestd varchar(255),
 targetnamestd varchar(255),
 method varchar(100),
 x1 varchar(255),
 coname varchar(255),
 statecode varchar(10),
 datefirstinv date,
 x2 varchar(255),
 targetname varchar(255),
 targetstate varchar(10),
 announceddate date
);
\COPY MAClean FROM 'MAClean.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--7205
SELECT COUNT(*) FROM(SELECT DISTINCT targetname, targetstate, announceddate FROM MAClean)a;
--7188

As you can see there are still duplicate primary keys in the table. To get rid of these I wrote a query that chooses primary keys that occur only once and matches them against MANoDups. That way you will have unique primary keys by construction.

DROP TABLE MACleanNoDups;
CREATE TABLE MACleanNoDups AS
SELECT A.*, effectivedate, transactionamt, enterpriseval, acquirorstatus
FROM MAClean AS A
JOIN (
	SELECT targetname, targetstate, announceddate, COUNT(*) FROM MAClean
	GROUP BY targetname, targetstate, announceddate HAVING COUNT(*)=1
	) AS B
ON A.targetname=B.targetname AND A.targetstate=B.targetstate AND A.announceddate=B.announceddate
LEFT JOIN MANoDups AS C ON A.targetnamestd=C.targetname AND A.announceddate=C.announceddate;
SELECT COUNT(*) FROM(SELECT DISTINCT coname, statecode, datefirstinv FROM MACleanNoDups)a;
--7171

Thus the portco primary key is unique in the table. We will use this later. Now do the same for the IPOs.

IPO Process

DROP TABLE IPOClean;
CREATE TABLE IPOClean (
 conamestd varchar(255),
 issuernamestd varchar(255),
 method varchar(100),
 x1 varchar(255),
 coname varchar(255),
 statecode varchar(10),
 datefirstinv date,
 x2 varchar(255),
 issuername varchar(255),
 issuerstate varchar(10),
 issuedate date
);

\COPY IPOClean FROM 'IPOClean.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--2146
SELECT COUNT(*) FROM(SELECT DISTINCT issuername, issuerstate, issuedate FROM IPOClean)a;
--2141

As with the MA process, there were duplicates in the clean IPO table. Get rid of these using the same process as with MAs. Only choose the primary keys that occur once and join these to the IPONoDups table.

DROP TABLE IPOCleanNoDups;
CREATE TABLE IPOCleanNoDups AS
SELECT A.*, principalamt, proceedsamt, naicode as naics, zipcode, status, foundeddate
FROM IPOClean AS A
JOIN (
	SELECT issuername, issuerstate, issuedate, COUNT(*) FROM IPOClean
	GROUP BY issuername, issuerstate, issuedate HAVING COUNT(*)=1
	) AS B
ON A.issuername=B.issuername AND A.issuerstate=B.issuerstate AND A.issuedate=B.issuedate
LEFT JOIN IPONoDups AS C ON A.issuername=C.issuer AND A.issuerstate=C.statecode AND A.issuedate=C.issuedate;
--2136


SELECT COUNT(*) FROM(SELECT DISTINCT coname, statecode, datefirstinv FROM IPOCleanNoDups)a;
--2136

Now the duplicates are out of the MAClean and IPOClean data and we can start to construct the ExitKeysClean table.

Creating ExitKeysClean

First I looked for the PortCos that were in both the MAs and the IPOs. I did this using:

DROP TABLE IPOMAForReview;
CREATE TABLE IPOMAForReview AS
SELECT A.*, B.targetname, B.targetstate, B.announceddate
FROM IPOCleanNoDups AS A
JOIN MACleanNoDups AS B ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv;
--92

I then pulled out the IPOs that were only IPOs and MAs that were only MAs. I also added in a column that indicated whether a company underwent an IPO or a MA.

DROP TABLE IPONoConflict;
CREATE TABLE IPONoConflict AS
SELECT A.*, 1::int as IPOvsMA
FROM IPOCleanNoDups AS A 
LEFT JOIN MACleanNoDups AS B 
ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv 
WHERE B.statecode IS NULL AND B.coname IS NULL AND B.datefirstinv IS NULL;
--2044
DROP TABLE MANoConflict;
CREATE TABLE MANoConflict AS
SELECT A.*, 0::int as IPOvsMA
FROM MACleanNoDups AS A
LEFT JOIN IPOCleanNoDups AS B 
ON A.coname=B.Coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv
WHERE B.statecode IS NULL AND B.coname IS NULL AND B.datefirstinv IS NULL;
--7079

Since 2136-92=2044 and 7171-92=7079, we know that the duplicate companies were extracted successfully.

I then wrote a query to check whether the IPO issue date or the announced date of the MA was earlier, and used that to decide whether the company was treated as having undergone an MA or an IPO in the IPOvsMA column. A 0 in the column represents an MA being chosen and a 1 represents an IPO being chosen.


Then, out of this table, I extracted the MAs and IPOs using the created IPOvsMA flag:

DROP TABLE MASelected;
CREATE TABLE MASelected AS
SELECT coname, statecode, datefirstinv, 
targetname, targetstate, announceddate,
0::int as IPOvsMA
FROM IPOMAForReview 
WHERE issuedate >= announceddate;
--25
DROP TABLE IPOSelected;
CREATE TABLE IPOSelected AS
SELECT coname, statecode, datefirstinv, 
issuername, issuerstate, issuedate,
1::int as IPOvsMA
FROM IPOMAForReview 
WHERE issuedate < announceddate;
--67

I then made the ExitKeys table using the portco primary key and the IPOvsMA indicator column.

DROP TABLE ExitKeys;
CREATE TABLE ExitKeys AS
SELECT coname, statecode, datefirstinv, ipovsma FROM IPONoConflict
UNION SELECT coname, statecode, datefirstinv, ipovsma FROM IPOSelected
UNION SELECT coname, statecode, datefirstinv, ipovsma FROM MANoConflict
UNION SELECT coname, statecode, datefirstinv, ipovsma FROM MASelected;
--9215

Create the PortCoExit And PortCoAliveDead Tables

From consulting with Ed and the VC Database Rebuild wiki, I decided to make the PortCoExit table with ipovsma, exit, exitvaluem, exitdate, and exityear columns. I use the IPOvsMA column to pull in the right data, so it is very important that you have constructed this column.

DROP TABLE PortCoExit;
CREATE TABLE PortCoExit AS
SELECT A.coname, A.statecode, A.datefirstinv, A.datelastinv, A.city, B.ipovsma,
CASE WHEN B.ipovsma IS NOT NULL THEN 1::int ELSE 0::int END AS Exit,
CASE WHEN B.ipovsma=1 THEN C.proceedsamt::numeric WHEN ipovsma=0 THEN D.transactionamt::numeric ELSE NULL::numeric END AS exitvaluem,
CASE WHEN B.ipovsma=1 THEN C.issuedate WHEN ipovsma=0 THEN D.announceddate ELSE NULL::date END AS exitdate,
CASE WHEN B.ipovsma=1 THEN extract(year from C.issuedate) WHEN ipovsma=0 THEN extract(year from D.announceddate) ELSE NULL::int END AS exityear
FROM companybasecore AS A
LEFT JOIN ExitKeys AS B ON A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv
LEFT JOIN IPOCleanNoDups AS C ON A.coname=C.coname AND A.statecode=C.statecode AND A.datefirstinv=C.datefirstinv
LEFT JOIN MACleanNoDups AS D ON A.coname=D.coname AND A.statecode=D.statecode AND A.datefirstinv=D.datefirstinv;
--48001

I then used this table to build one that has information as to whether a company was dead or alive. I found this information by checking whether a company had undergone an IPO or MA, indicating the company was dead. Alternatively, if the company's date of last investment was more than 5 years ago, I marked the company as dead.

DROP TABLE PortCoAliveDead;
CREATE TABLE PortCoAliveDead AS
SELECT *, 
datefirstinv as alivedate, extract(year from datefirstinv) as aliveyear,
CASE WHEN exitdate IS NOT NULL then exitdate 
	WHEN exitdate IS NULL AND (datelastinv + INTERVAL '5 year') < '7/1/2018' THEN (datelastinv + INTERVAL '5 year') 
	ELSE NULL::date END AS deaddate,
CASE WHEN exitdate IS NOT NULL then exityear 
	WHEN exitdate IS NULL AND (datelastinv + INTERVAL '5 year') < '7/1/2018' THEN extract(year from (datelastinv + INTERVAL '5 year')) 
	ELSE NULL::int END AS deadyear
FROM PortCoExit;
--48001

GeoCoding Companies, Firms, and Branch Offices

A helpful page here is the Geocode.py page which explains how to use the Geocoding script. You will have to tweak the Geocode script when geocoding as each of these tables has a different primary key. It is vital that you include the primary keys in the file you input and output from the Geocoding script. Without these, you will not be able to join the latitudes and longitudes back to the firm, branch office, or company base tables.

Geocoding costs money since we are using the Google Maps API. The process doesn't cost much, but in order to save money I tried to salvage as much of the preexisting geocode information I could find.

Companies

I found the table of old companies with latitudes and longitudes in vcdb2 and loaded these into vcdb3.

DROP TABLE oldgeocords;
CREATE TABLE oldgeocords (
  coname varchar(255),
  statecode varchar(10),
  datefirstinv date,
  ivestedk real,
  city varchar(255),
  addr1 varchar(255),
  addr2 varchar(100),
  latitude numeric,
  longitude numeric
  );
\COPY oldgeocords FROM 'companybasegeomaster.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--44740

The API occasionally will give erroneous latitude and longitude readings. In order to catch only the good ones, I found the latitude and longitude lines that encompass the mainland US and created an exclude flag to make sure companies were in this box. I then created flags to include companies in Puerto Rico, Hawaii, and Alaska. Companies that were in these places often had wrong latitude and longitude readings of 44.93, 7.54, so I ran a query making sure that these weren't listed.

DROP TABLE geoallcoords;
CREATE TABLE geoallcoords AS
SELECT *, CASE
WHEN longitude < -125 OR longitude > -66 OR latitude < 24 OR latitude > 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE 
0::int END AS excludeflag FROM oldgeocords;
--44740
DROP TABLE geoallcoords1;
CREATE TABLE geoallcoords1 AS SELECT
*, CASE WHEN statecode='PR' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as prflag,
CASE WHEN statecode='HI' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as hiflag,
CASE WHEN statecode='AK' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int  ELSE 0::int END as akflag
FROM geoallcoords;
--44740

I then included only companies that were either in the mainland US, Hawaii, Alaska, or Puerto Rico.

DROP TABLE goodgeoold;
CREATE TABLE goodgeoold AS SELECT
A.*, B.latitude, B.longitude, B.prflag, B.excludeflag, B.hiflag, B.akflag FROM companybasecore AS A LEFT JOIN geoallcoords1 AS B ON
A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv WHERE excludeflag=0 or prflag=1 or hiflag=1 or akflag=1;
--38498

I then found the remaining companies that needed to be geocoded. Only companies that have addresses listed are able to be accurately geocoded. If we attempt to geocode based on city, the location returned will simply be the center of the city. Thus, I chose the companies that we did not already have listings for and had a valid address.

DROP TABLE remaininggeo;
CREATE TABLE remaininggeo AS SELECT A.coname, A.statecode, A.datefirstinv, A.addr1, A.addr2, A.city, A.zip FROM companybasecore AS A LEFT JOIN goodgeoold AS B ON A.coname=B.coname 
AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv
WHERE B.coname IS NULL AND A.addr1 IS NOT NULL;
--5955
\COPY remaininggeo TO 'RemainingGeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--5955

I copied this table into Excel to concatenate the address, city, state, and zipcode columns into one column. This can and should be done in SQL, but I was not aware of that at the time. I then ran remaininggeo through the Geocode script, with the columns coname, statecode, datefirstinv, and address in the input file.
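
For reference, the concatenation can be done directly in the export; a hedged sketch is below. concat_ws skips NULL parts, which handles missing addr2, and the output file name here is made up:

\COPY (SELECT coname, statecode, datefirstinv, concat_ws(', ', addr1, addr2, city, statecode, zip) AS address FROM remaininggeo) TO 'RemainingGeoAddress.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV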

DROP TABLE remaining;
CREATE TABLE remaining (
  coname varchar(255),
  statecode varchar(10),
  datefirstinv date,
  latitude numeric,
  longitude numeric
  );
\COPY remaining FROM 'RemainingLatLong.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--5955

I then ran the same geographical checks on the newly geocoded companies and found all of the good geocodes.

DROP TABLE geoallcoords2;
CREATE TABLE geoallcoords2 AS
SELECT *, CASE
WHEN longitude < -125 OR longitude > -66 OR latitude < 24 OR latitude > 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE 
0::int END AS excludeflag FROM remaining;
--5955
DROP TABLE geoallcoords3;
CREATE TABLE geoallcoords3 AS
SELECT *, CASE WHEN statecode='PR' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as prflag,
CASE WHEN statecode='HI' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as hiflag,
CASE WHEN statecode='AK' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int  ELSE 0::int END as akflag
FROM geoallcoords2;
--5955
DROP TABLE goodgeonew;
CREATE TABLE goodgeonew AS
SELECT A.*, B.latitude, B.longitude, B.prflag, B.excludeflag, B.hiflag, B.akflag FROM companybasecore AS A LEFT JOIN geoallcoords3 AS B ON
A.coname=B.coname AND A.statecode=B.statecode AND A.datefirstinv=B.datefirstinv WHERE excludeflag=0 or prflag=1 or hiflag=1 or akflag=1;
--5913

I then combined the old and new geocodes and matched them back to the company base table to get a geo table for companies.

DROP TABLE geocodesportco;
CREATE TABLE geocodesportco AS
SELECT * FROM goodgeonew
UNION
SELECT * FROM goodgeoold;
--44411
DROP TABLE portcogeo;
CREATE TABLE portcogeo AS SELECT
A.*, B.latitude, B.longitude FROM companybasecore AS A LEFT JOIN Geocodesportco AS B ON A.coname=B.coname AND A.datefirstinv=B.datefirstinv AND A.statecode=B.statecode;
--48001

===Firms===

This process is largely the same as for companies. I found old firms that had already been geocoded and checked for accuracy.

DROP TABLE oldfirmcoords;
CREATE TABLE oldfirmcoords (
  firmname varchar(255),
  latitude numeric,
  longitude numeric
  );

\COPY oldfirmcoords FROM 'FirmCoords.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV
--5556
DROP TABLE firmoldfilter;
CREATE TABLE firmoldfilter AS
SELECT *, CASE
WHEN longitude < -125 OR longitude > -66 OR latitude < 24 OR latitude > 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE 
0::int END AS excludeflag FROM oldfirmcoords;
--5556

Since oldfirmcoords does not have state codes, we need to bring in state codes so that firms based in Puerto Rico, Hawaii, and Alaska can be added back in. I did this by matching the firmoldfilter table back to the firm base table. I used the COALESCE function because we want to exclude firms that were never geocoded due to faulty addresses; without it, their excludeflag would be NULL rather than 1.

DROP TABLE firmcoordsmatch1;
CREATE TABLE firmcoordsmatch1 AS SELECT 
A.firmname, A.state, B.latitude, B.longitude, COALESCE(B.excludeflag, 1) AS excludeflag FROM firmbasecore AS A LEFT JOIN firmoldfilter AS B ON A.firmname=B.firmname;
--15437

The process of tagging the PR, HI, and AK firms and keeping only correctly tagged ones is then the same as for companies.

DROP TABLE firmcoordsexternal;
CREATE TABLE firmcoordsexternal AS
SELECT *, CASE WHEN state='PR' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as prflag,
CASE WHEN state='HI' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as hiflag,
CASE WHEN state='AK' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int  ELSE 0::int END as akflag
FROM firmcoordsmatch1;
--15437
DROP TABLE goodfirmgeoold;
CREATE TABLE goodfirmgeoold AS SELECT
A.*, B.latitude, B.longitude, B.prflag, B.excludeflag, B.hiflag, B.akflag FROM firmcoreonedupremoved AS A LEFT JOIN firmcoordsexternal AS B ON A.firmname=B.firmname
WHERE excludeflag=0 or prflag=1 or hiflag=1 or akflag=1;
--5346

Next, find the remaining firms and run the Geocode script on them.

DROP TABLE remainingfirm;
CREATE TABLE remainingfirm AS SELECT A.firmname, A.addr1, A.addr2, A.city, A.state, A.zip FROM firmcoreonedupremoved AS A LEFT JOIN goodfirmgeoold AS B ON A.firmname=B.firmname
WHERE B.firmname IS NULL AND A.addr1 IS NOT NULL AND A.msacode!='9999';
--706
\COPY remainingfirm TO 'FirmGeoRemaining.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV
--706
DROP TABLE firmremainingcoords;
CREATE TABLE firmremainingcoords(
  firmname varchar(255),
  latitude numeric,
  longitude numeric
  );
\COPY firmremainingcoords FROM 'FirmRemainingCoords.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV
--706

Follow the same filtering process as above to get the good geocodes.

DROP TABLE firmnewfilter;
CREATE TABLE firmnewfilter AS
SELECT *, CASE
WHEN longitude < -125 OR longitude > -66 OR latitude < 24 OR latitude > 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE 
0::int END AS excludeflag FROM firmremainingcoords;
--706
DROP TABLE firmcoordsmatch2;
CREATE TABLE firmcoordsmatch2 AS SELECT 
A.firmname, A.state, B.latitude, B.longitude, COALESCE(B.excludeflag, 1) AS excludeflag FROM firmcoreonedupremoved AS A LEFT JOIN firmnewfilter AS B ON A.firmname=B.firmname;
--15437
DROP TABLE firmcoordsexternalremaining;
CREATE TABLE firmcoordsexternalremaining AS
SELECT *, CASE WHEN state='PR' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as prflag,
CASE WHEN state='HI' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as hiflag,
CASE WHEN state='AK' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int  ELSE 0::int END as akflag
FROM firmcoordsmatch2;
--15437
DROP TABLE goodfirmgeonew;
CREATE TABLE goodfirmgeonew AS SELECT A.*, B.latitude, B.longitude, B.prflag, B.excludeflag, B.hiflag, B.akflag FROM firmcoreonedupremoved AS A LEFT JOIN firmcoordsexternalremaining AS B 
ON A.firmname=B.firmname
WHERE excludeflag=0 or prflag=1 or hiflag=1 or akflag=1;
--703

Combine the old and new geocoded firms and match them to firm base to get a firm geo table.

DROP TABLE firmgeocoords;
CREATE TABLE firmgeocoords AS
SELECT * FROM goodfirmgeonew
UNION
SELECT * FROM goodfirmgeoold;
--6049
DROP TABLE firmgeocore;
CREATE TABLE firmgeocore AS
SELECT A.*, B.latitude, B.longitude FROM firmbasecore AS A LEFT JOIN firmgeocoords AS B ON A.firmname=B.firmname;
--15437

===Branch Offices===

I did not use old branch office data because I could not find it anywhere in the old data set. I have since found old data in the table firmbasecoords in vcdb2.

First copy all of the needed data out of the database to do geocoding.

\COPY (SELECT A.firmname, A.boaddr1, A.boaddr2, A.bocity, A.bostate, A.bozip FROM bonound AS A WHERE A.boaddr1 IS NOT NULL) TO 'BranchOffices.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--2046

Then load the data into the database and follow the same filtering process as above.

DROP TABLE bogeo;
CREATE TABLE bogeo (
  firmname varchar(255),
  latitude numeric,
  longitude numeric
  );
\COPY bogeo FROM 'BranchOfficesGeo.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--2046
DROP TABLE bogeo1;
CREATE TABLE bogeo1 AS SELECT *, CASE
WHEN longitude < -125 OR longitude > -66 OR latitude < 24 OR latitude > 50 OR latitude IS NULL OR longitude IS NULL THEN 1::int ELSE 
0::int END AS excludeflag FROM bogeo;
--2046
DROP TABLE bomatchgeo;
CREATE TABLE bomatchgeo AS
SELECT A.*, B.latitude, B.longitude, COALESCE(B.excludeflag, 1) AS excludeflag FROM branchofficecore AS A LEFT JOIN bogeo1 AS B ON A.firmname=B.firmname;
--10032
DROP TABLE bogeo2;
CREATE TABLE bogeo2 AS
SELECT *, CASE WHEN bostate='PR' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as prflag,
CASE WHEN bostate='HI' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int ELSE 0::int END as hiflag,
CASE WHEN bostate='AK' AND latitude!=44.9331 AND longitude!=7.54012 THEN 1::int  ELSE 0::int END as akflag
FROM bomatchgeo;
--10032

Match the correctly geocoded branch offices back to firm base to get the final table.

DROP TABLE bogeocore1;
CREATE TABLE bogeocore1 AS
SELECT * FROM bogeo2 WHERE excludeflag=0 or prflag=1 or hiflag=1 or akflag=1;
--1161
DROP TABLE firmbogeo;
CREATE TABLE firmbogeo AS
SELECT A.*, B.latitude AS BOLatitude, B.longitude AS BOLongitude FROM firmgeocore AS A LEFT JOIN bogeocore1 AS B ON A.firmname=B.firmname;
--15437

==Creating People Tables==

We pulled data on executives for both portcos and funds. I describe the process below. If any of the explanations do not make sense, most of the tables are also described in the section called Marcos's Code.

===Company People===

DROP TABLE titlelookup;
CREATE TABLE titlelookup(
	fulltitle varchar(150),
	charman int, -- sic: this misspelling is the column name used in the queries below
	ceo int,
	cfo int,
	coo int,
	cio int,
	cto int,
	otherclvl int,
	boardmember int,
	president int,
	vp int,
	founder int,
	director int
);
\COPY titlelookup FROM 'Important Titles in Women2017 dataset.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--628

This table lists various titles one can have and identifies where they fall under traditional executive titles.

DROP TABLE copeople;
CREATE TABLE copeople(
	datefirstinv   date,
	cname varchar(150),
	statecode  varchar(2),
	prefix varchar(5),
	firstname varchar(50),
	lastname varchar(50),
	jobtitle varchar(150),
	nonmanaging  varchar(1),
	prevpos  varchar(255)
);
\COPY copeople FROM 'Executives-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--194359

This table lists the executives of each portco and is loaded directly from SDC. Next we need to identify which traditional executive-level job each listed job title corresponds to, and whether the prefix identifies an executive as male or female. I made the mistake of writing cname instead of coname when loading the data; if you want to save yourself work, use coname (see the note below).
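If the data is already loaded under the wrong name, a hedged one-line fix is shown below. Note, however, that the queries that follow on this page were written against cname, so either rename the column and update those queries consistently, or leave it as-is.

-- Hedged fix: rename the mistakenly-named column to match the coname convention
-- used elsewhere on this page. The queries below still reference cname, so only
-- do this if you also update them.
ALTER TABLE copeople RENAME COLUMN cname TO coname;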

DROP TABLE copeoplebase;
CREATE TABLE copeoplebase AS
SELECT copeople.*,
CASE WHEN prefix='Ms' THEN 1::int
	WHEN prefix='Mr' THEN 0::int
	ELSE Null::int END AS titlefemale,
CASE WHEN prefix='Ms' THEN 0::int
	WHEN prefix='Mr' THEN 1::int
	ELSE Null::int END AS titlemale,
CASE WHEN prefix='Dr' THEN 1::int
	ELSE 0::int END AS doctor,
CASE WHEN prefix IS NULL THEN 0::int
	ELSE 1::int END AS hastitle,
CASE WHEN prefix IS NULL AND firstname IS NULL AND lastname IS NULL THEN 0::int
	ELSE 1::int END AS hasperson,
CASE WHEN fulltitle IS NOT NULL THEN 1::int ELSE 0::int END AS hastitlelookup,
CASE WHEN charman IS NOT NULL THEN charman ELSE 0::int END AS chairman,
CASE WHEN ceo IS NOT NULL THEN ceo ELSE 0::int END AS ceo,
CASE WHEN cfo IS NOT NULL THEN cfo ELSE 0::int END AS cfo,
CASE WHEN coo IS NOT NULL THEN coo ELSE 0::int END AS coo,
CASE WHEN cio IS NOT NULL THEN cio ELSE 0::int END AS cio,
CASE WHEN cto IS NOT NULL THEN cto ELSE 0::int END AS cto,
CASE WHEN otherclvl IS NOT NULL THEN otherclvl ELSE 0::int END AS otherclvl,
CASE WHEN boardmember IS NOT NULL THEN boardmember ELSE 0::int END AS boardmember,
CASE WHEN president IS NOT NULL THEN president ELSE 0::int END AS president,
CASE WHEN vp IS NOT NULL THEN vp ELSE 0::int END AS vp,
CASE WHEN founder IS NOT NULL THEN founder ELSE 0::int END AS founder,
CASE WHEN director IS NOT NULL THEN director ELSE 0::int END AS director
FROM copeople
LEFT JOIN titlelookup ON copeople.jobtitle=titlelookup.fulltitle;
--194359

Next we will try to identify whether an executive is male or female based on their names.

DROP TABLE namegender;
CREATE TABLE namegender AS
SELECT firstname, 
CASE WHEN countfemale > 0 AND countmale=0 THEN 1::int ELSE 0::int END AS exclusivelyfemale,
CASE WHEN countmale > 0 AND countfemale=0 THEN 1::int ELSE 0::int END AS exclusivelymale
FROM
	(SELECT firstname, COALESCE(sum(titlefemale),0) as countfemale,  COALESCE(sum(titlemale),0) as countmale 
	FROM copeoplebase WHERE doctor=0
	GROUP BY firstname) As T
WHERE NOT (countfemale > 0 AND countmale>0);
--12736

The next table expands CoPeopleBase to include information on executive gender and executive position.

DROP TABLE CoPeopleFull;
CREATE TABLE CoPeopleFull AS
SELECT copeoplebase.*,
CASE WHEN titlefemale=1 THEN 1::int 
	WHEN exclusivelyfemale=1 THEN 1::int ELSE 0::int END AS female,
CASE WHEN titlemale=1 THEN 1::int 
	WHEN exclusivelymale=1 THEN 1::int ELSE 0::int END AS male,	
CASE WHEN (titlefemale=1 OR titlemale=1 OR exclusivelymale=1 OR exclusivelyfemale=1) THEN 0::int ELSE 1::int END AS unknowngender,
CASE WHEN (ceo=1 OR president=1) THEN 1::int ELSE 0::int END AS ceopres,
CASE WHEN (chairman=1 OR ceo=1 OR cfo=1 OR coo=1 OR cio=1 OR cto=1 OR otherclvl=1 OR president=1) THEN 1::int ELSE 0::int END AS CLevel,
CASE WHEN (chairman=1 OR ceo=1 OR cfo=1 OR coo=1 OR cio=1 OR cto=1 OR otherclvl=1 OR president=1 OR director=1 OR boardmember=1) THEN 1::int ELSE 0::int END AS board,
CASE WHEN (chairman=1 OR ceo=1 OR cfo=1 OR coo=1 OR cio=1 OR cto=1 OR otherclvl=1 OR president=1 OR director=1 OR boardmember=1 OR vp=1 OR founder=1) THEN 1::int ELSE 
0::int END AS vpandabove
FROM copeoplebase
LEFT JOIN namegender ON namegender.firstname=copeoplebase.firstname
WHERE hasperson=1;
--177547

The next table only keeps executive listings that have a valid portco primary key associated with them.

DROP TABLE CoPeopleKey;
CREATE TABLE CoPeopleKey AS
SELECT A.*
FROM CoPeopleFull AS A
JOIN (SELECT firstname, lastname, cname, datefirstinv, statecode, count(*) FROM CoPeopleFull 
WHERE firstname IS NOT NULL AND lastname IS NOT NULL AND cname IS NOT NULL AND datefirstinv IS NOT NULL AND statecode IS NOT NULL
GROUP BY firstname, lastname, cname, datefirstinv, statecode HAVING COUNT(*)=1) AS B
ON A.firstname=B.firstname AND A.lastname=B.lastname AND A.datefirstinv=B.datefirstinv AND A.cname=B.cname AND A.statecode=B.statecode;
--176251

The next table identifies whether a person previously held executive positions.

CREATE TABLE CoPeopleSerial AS
SELECT firstname, lastname, cname, datefirstinv, statecode, 
COALESCE(sum(hasperson),0) as prev,
COALESCE(sum(ceo),0) as prevceo,
COALESCE(sum(ceopres),0) as prevceopres,
COALESCE(sum(founder),0) as prevfounder,
COALESCE(sum(clevel),0) as prevclevel,
COALESCE(sum(board),0) as prevboard,
COALESCE(sum(vpandabove),0) as prevvpandabove,
CASE WHEN sum(hasperson) >=1 THEN 1::int ELSE 0::int END AS serial,
CASE WHEN sum(ceo) >=1 THEN 1::int ELSE 0::int END AS serialceo,
CASE WHEN sum(ceopres) >=1 THEN 1::int ELSE 0::int END AS serialceopres,
CASE WHEN sum(founder) >=1 THEN 1::int ELSE 0::int END AS serialfounder,
CASE WHEN sum(clevel) >=1 THEN 1::int ELSE 0::int END AS serialclevel,
CASE WHEN sum(board) >=1 THEN 1::int ELSE 0::int END AS serialboard,
CASE WHEN sum(vpandabove) >=1 THEN 1::int ELSE 0::int END AS serialvpandabove
FROM (
	SELECT A.prefix, A.firstname, A.lastname, A.cname, A.datefirstinv, A.statecode, 
	B.cname as prevcname, B.datefirstinv as prevdatefirstinv, B.statecode as prevstatecode, B.ceo, B.ceopres, B.founder, B.clevel, B.board, B.vpandabove, B.hasperson
	FROM CoPeopleKey AS A
	LEFT JOIN CoPeopleKey AS B ON A.firstname=B.firstname AND A.lastname=B.lastname AND A.datefirstinv > B.datefirstinv
) AS T
GROUP BY firstname, lastname, cname, datefirstinv, statecode;
--176251

The last table aggregates a ton of information on executives for each company. There is too much information to explain it all.

DROP TABLE copeopleagg;
CREATE TABLE copeopleagg AS
SELECT A.cname, A.datefirstinv, A.statecode, 
sum(hasperson) as numperson,
sum(hastitle) as numtitled,
CASE WHEN sum(ceopres) >=1 THEN 1::int ELSE 0::int END AS hasceopres,
CASE WHEN sum(founder) >=1 THEN 1::int ELSE 0::int END AS hasfounder,
CASE WHEN sum(clevel) >=1 THEN 1::int ELSE 0::int END AS hasclevel,
CASE WHEN sum(board) >=1 THEN 1::int ELSE 0::int END AS hasboard,
CASE WHEN sum(vpandabove) >=1 THEN 1::int ELSE 0::int END AS hasvpandabove,
sum(female) as females,
sum(male) as males,
sum(unknowngender) as ugs,
sum(doctor*female) as femaledoctors,
sum(doctor*male) as maledoctors,
sum(doctor*unknowngender) as ugdoctors,
sum(ceopres*female) as femaleceos,
sum(ceopres*male) as maleceos,
sum(ceopres*unknowngender) as ugceos,
sum(ceopres*female*doctor) as femaledoctorceos,
sum(ceopres*male*doctor) as maledoctorceos,
sum(ceopres*unknowngender*doctor) as ugdoctorceos,
sum(founder*female) as femalefounders,
sum(founder*male) as malefounders,
sum(founder*unknowngender) as ugfounders,
sum(founder*female*doctor) as femaledoctorfounders,
sum(founder*male*doctor) as maledoctorfounders,
sum(founder*unknowngender*doctor) as ugdoctorfounders,
sum(clevel*female) as femaleclevels,
sum(clevel*male) as maleclevels,
sum(clevel*unknowngender) as ugclevels,
sum(clevel*female*doctor) as femaledoctorclevels,
sum(clevel*male*doctor) as maledoctorclevels,
sum(clevel*unknowngender*doctor) as ugdoctorclevels,
sum(board*female) as femaleboards,
sum(board*male) as maleboards,
sum(board*unknowngender) as ugboards,
sum(board*female*doctor) as femaledoctorboards,
sum(board*male*doctor) as maledoctorboards,
sum(board*unknowngender*doctor) as ugdoctorboards,
sum(vpandabove*female) as femaleabovevps,
sum(vpandabove*male) as maleabovevps,
sum(vpandabove*unknowngender) as ugabovevps,
sum(vpandabove*female*doctor) as femaledoctorabovevps,
sum(vpandabove*male*doctor) as maledoctorabovevps,
sum(vpandabove*unknowngender*doctor) as ugdoctorabovevps,
sum(prev*female) as femaleprevs,
sum(prev*male) as maleprevs,
sum(prev*unknowngender) as ugprevs,
sum(prevceopres*female) as femaleprevceopres,
sum(prevceopres*male) as maleprevceopres,
sum(prevceopres*unknowngender) as ugprevceopres,
sum(prevfounder*female) as femaleprevfounder,
sum(prevfounder*male) as maleprevfounder,
sum(prevfounder*unknowngender) as ugprevfounder,
sum(prevclevel*female) as femaleprevclevel,
sum(prevclevel*male) as maleprevclevel,
sum(prevclevel*unknowngender) as ugprevclevel,
sum(prevboard*female) as femaleprevboard,
sum(prevboard*male) as maleprevboard,
sum(prevboard*unknowngender) as ugprevboard,
sum(prevvpandabove*female) as femaleprevvpandabove,
sum(prevvpandabove*male) as maleprevvpandabove,
sum(prevvpandabove*unknowngender) as ugprevvpandabove,
sum(serial*female) as femaleserials,
sum(serial*male) as maleserials,
sum(serial*unknowngender) as ugserials,
sum(serialceopres*female) as femaleserialceopres,
sum(serialceopres*male) as maleserialceopres,
sum(serialceopres*unknowngender) as ugserialceopres,
sum(serialfounder*female) as femaleserialfounder,
sum(serialfounder*male) as maleserialfounder,
sum(serialfounder*unknowngender) as ugserialfounder,
sum(serialclevel*female) as femaleserialclevel,
sum(serialclevel*male) as maleserialclevel,
sum(serialclevel*unknowngender) as ugserialclevel,
sum(serialboard*female) as femaleserialboard,
sum(serialboard*male) as maleserialboard,
sum(serialboard*unknowngender) as ugserialboard,
sum(serialvpandabove*female) as femaleserialvpandabove,
sum(serialvpandabove*male) as maleserialvpandabove,
sum(serialvpandabove*unknowngender) as ugserialvpandabove,
sum(ceopres*serialceopres*female) as femaleceopresserialceopres,
sum(ceopres*serialceopres*male) as maleceopresserialceopres,
sum(ceopres*serialceopres*unknowngender) as ugceopresserialceopres,
sum(founder*serialfounder*female) as femalefounderserialfounder,
sum(founder*serialfounder*male) as malefounderserialfounder,
sum(founder*serialfounder*unknowngender) as ugfounderserialfounder 
FROM CoPeoplekey AS A
JOIN CoPeopleSerial AS B 
ON A.firstname=B.firstname AND A.lastname=B.lastname AND A.datefirstinv=B.datefirstinv AND A.cname=B.cname AND A.statecode=B.statecode
GROUP BY A.cname, A.datefirstinv, A.statecode;
--30413

Since this table is so big, it is a good idea to have a smaller, more manageable table to work with.

DROP TABLE copeopleaggsimple;

CREATE TABLE copeopleaggsimple AS
SELECT cname, datefirstinv, statecode, numperson, females, males, ugs, ugdoctors, maleserials+femaleserials+ugserials AS serials
FROM copeopleagg;
--30413

===Fund People===

Luckily, this process is much easier than for company people. First, load the data into the database.

DROP TABLE fundpeople;
CREATE TABLE fundpeople(
	fundname  varchar(255),
	fundyear  int,
	prefix varchar(5),
	firstname varchar(50),
	lastname varchar(50),
	jobtitle varchar(150),
	 prevpos  varchar(255)
);
\COPY fundpeople FROM 'Executives-Funds-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--328994

The next table identifies degree and sex information about the executives of the fund.

DROP TABLE fundpeoplebase;
CREATE TABLE fundpeoplebase AS
SELECT fundpeople.*,
CASE WHEN prefix='Ms' THEN 1::int
	WHEN prefix='Mr' THEN 0::int
	ELSE Null::int END AS titlefemale,
CASE WHEN prefix='Ms' THEN 0::int
	WHEN prefix='Mr' THEN 1::int
	ELSE Null::int END AS titlemale,
CASE WHEN prefix='Dr' THEN 1::int
	ELSE 0::int END AS doctor,
CASE WHEN prefix IS NULL THEN 0::int
	ELSE 1::int END AS hastitle,
CASE WHEN prefix IS NULL AND firstname IS NULL AND lastname IS NULL THEN 0::int
	ELSE 1::int END AS hasperson
FROM fundpeople;
--328994

The next table tries to identify the sex of the executive using the above defined namegender table. It only selects rows where a person is actually listed.

DROP TABLE FundPeopleFull;
CREATE TABLE FundPeopleFull AS
SELECT fundpeoplebase.*, exclusivelyfemale, exclusivelymale,
CASE WHEN titlefemale=1 THEN 1::int 
	WHEN exclusivelyfemale=1 AND exclusivelymale=0 AND (titlemale=0 OR titlemale IS NULL) THEN 1::int ELSE 0::int END AS female,
CASE WHEN titlemale=1 THEN 1::int 
	WHEN exclusivelymale=1  AND exclusivelyfemale=0 AND (titlefemale =0 OR titlefemale IS NULL) THEN 1::int ELSE 0::int END AS male,	
CASE WHEN (titlefemale=1 OR titlemale=1 OR exclusivelymale=1 OR exclusivelyfemale=1) THEN 0::int ELSE 1::int END AS unknowngender
FROM fundpeoplebase
LEFT JOIN namegender ON namegender.firstname=fundpeoplebase.firstname
WHERE hasperson=1;
--320915

The next table gives you information on executives aggregated by fund.

DROP TABLE FundPeopleAgg;
CREATE TABLE FundPeopleAgg AS
SELECT fundname, 
sum(female) as numfemale,
sum(male) as nummale,
sum(unknowngender) as numunknowngender,
sum(doctor) as numdoctor,
sum(female*doctor) as numfemaledoctor,
sum(male*doctor) as nummaledoctor,
sum(unknowngender*doctor) as numunknowngenderdoctor,
sum(hastitle) as numtitled,
sum(hasperson) as numpeople, 
CASE WHEN sum(hasperson) > 0 THEN sum(female)::real/sum(hasperson) ELSE NULL END as fracfemale,
CASE WHEN sum(male) > 0 THEN sum(female)::real/sum(male) ELSE NULL END as ratiofemale
FROM FundPeopleFull
GROUP BY fundname;
--21536

It is also useful to have this information at the firm level. We do not pull firm people information from SDC, but we have enough information to build it from the preexisting tables.

DROP TABLE firmpeopleagg;
CREATE TABLE firmpeopleagg AS 
SELECT _firmname as firmname, sum(numfemale) as firmwomen, sum(nummale) as firmmen, sum(numunknowngender) as firmugs, 
sum(numdoctor) as firmdoctors, sum(numpeople) as firmpeople,
CASE WHEN sum(numpeople) > 0 THEN (sum(numfemale)::real/sum(numpeople)) ELSE NULL END as firmfracwomen,
CASE WHEN sum(nummale) > 0 THEN (sum(numfemale)::real/sum(nummale)) ELSE NULL END as firmratiowomen
FROM roundlineaggfunds AS A
JOIN fundpeopleagg AS B ON A._fundname=B.fundname
GROUP BY _firmname;
--5273

==Marcos's Code==

This is code that a Rice student, Marcos Lee, wrote. I cleaned it and ran it. The tables it builds, and where they come from, are described below. The code is located in:

 E:\McNair\Projects\VentureXpert Database\vcdb3\LoadingScripts\MatchingEntrepsV3

If you have trouble with my explanations, go to this location and read the queries. Most of them are straightforward.

===Describing Stacks Created in Code===

CoPeopleBase:
-Builds from copeople and titlelookup
-Identifies what roles people played in their companies
namegender:
-built from copeoplebase
-identifies male/female/unknown
CoPeopleFull:
-built from copeoplebase and namegender
-builds more extensive information on each executive, specifically what level of executive they are
CoPeopleKey:
-built from CoPeopleFull
-creates table where only executives with full primary keys are kept
CoPeopleSerial:
-built from copeoplekey
-keeps track of executives previous jobs at executive level
CoPeopleAgg:
-built from copeoplekey and copeopleserial
-gets extensive information on executives for each company
FundPeopleBase:
-built from fundpeople
-identifies male/female/doctor
-the hasperson column is slightly odd because a row with only a last name (no prefix or first name) still gets a 1; it seems too broad to be of much use
FundPeopleFull:
-built from fundpeoplebase, namegender
-adds in male/female 
Fundpeopleagg:
-built from fundpeoplefull
-has aggregations of gender info for each fund
RoundLineJoinerLeanffWlistno:
-built from roundlinejoinerleanff
-adds listno to funds
RoundLineAggFunds:
-built from roundlinejoinerleanffwlistno and roundlineaggfirms
-if two funds from one firm invest in the same portco, we choose only one and leave the others behind
RoundLineAggWExit:
-built from roundlineaggfirms, portcoexitupdated, roundlineaggfunds
-adds in exit information for each company in roundlineaggfirms
FirmPerf:
-built from roundlineaggwexit
-adds in various performance measures for a given firm 
PortCoFundDemo:
-built from roundlinejoinerleanffclean and fundpeopleagg
-gives information on executives of funds who invested in the portcos
PortCoPeopleMaster:
-built from PortCoMaster, PortCoIndustry, PortCoPatent, PortCoSBIR, copeopleagg, PortCoFundDemo, CPI, statelookupint
-huge amount of data about companies and their executives
RoundAggDistBase:
-built from portcogeo, firmbogeo, roundlineaggwexit
-creates geographic points using long, lat from geocoding
RoundAggDist:
-Built from roundaggdistbase
-gets actual distances between portcos and firms; if a branch office exists and its distance is less than the distance to the firm, that distance is chosen instead; also generates a random number (see the hedged distance sketch after this list)
FirmPeopleAgg:
-built from roundlineaggfunds, fundpeopleagg
-finds information on executives from different firms
PortCoMatchmaster:
-built from portcopatent, portcoindustry, portcosbir, copeopleaggsimple, portcoid
-gets all information together about portcos
FirmMatchMaster:
-built from firmperf, firmvars, firmpeopleagg, firmid
-gets all information together about firms
RoundLineMasterBase:
-built from portcomatchmaster, firmmatchmaster, roundaggdist, roundlineaggwexit
-builds a large amount of information about portcos and firms, specifically info about exits and distances
MatchMostNumerous:
-built from roundlinemasterbase
-for each company, finds the maximum number of portcos invested in by a firm that also invested in that company
MatchHighestRandom:
-built from matchmostnumerous
-if two firms that invested in one company are tied for the maximum number of portcos, this randomly chooses one of them
FirmActiveYearsCode20:
-built from roundlinejoinerleanffclean, portcoindustry
-adds firmname to the industry code; it is not clear why DISTINCT is used in the query
RealMatchesCode20:
-built from MatchHighestRandom, PortCoIndustry
-real matches between portcos and firms that invested in them including the code20
SyntheticFirmSetBaseCode20:
-built from realmatchescode20, firmactiveyearscode20
-cross product of firms and portcos; finds firms that invested in the same year the portco received its first investment and in the same type of company, and makes sure matches are unique
AllMatchKeys:
-built from SyntheticFirmSetBaseCode20, RealMatchesCode20
-combines synthetic and real matches
SynthRoundAggDistBaseCode20:
-built from allmatchkeys, portcogeo, firmbogeo
-builds geographic points for all portco-firm listings in allmatchkeys
SynthRoundAggDistCode20:
-built from synthroundaggdistbasecode20
-finds the actual distance between portcos and firms using the installed extensions; chooses the branch office if the distance between the portco and the branch office is less than the distance to the firm
SynthFirmnameInduBlowoutCode20:
-built from allmatchkeys, roundlinemasterbase
-gets every firm combination and checks whether the companies that those firms invested in are in the same general industry
SynthFirmNameroundInduHistCode20:
-built from SynthFirmnameInduBlowoutcode20
-gets information, by portco-firmname match, about the firm's past investment patterns
MasterWithSynthBaseCode20Portco:
-built from Allmatchkeys, matchhighestrandom, synthroundaggdistcode20, synthfirmnameroundinduhistcode20, synthfirmnameroundindutotalcode20, firmvars, copeopleaggsimple, portcomaster
-builds a bunch of information about synthetic and real matches
SynthFirmnameRoundInduTotalCode20:
-built from allmatchkeys, roundlinemasterbase
-finds number of portcos in certain industries by firmnames
MasterWithSynthCode20Firms:
-built with firmmatchmaster, allmatchkeys
-matching a bunch of information to all firms
MasterWithSynthcode20:
-built from masterwithsynthbasecode20portco, masterwithsynthcode20firms
-gets a huge amount of info together on real and synthetic matches about firms and companies
MasterReals:
-built from masterwithsynthcode20
-gets just real matches from code
MasterOneSynth:
-built from masterwithsynthcode20
-gets just one randomly chosen synthetic match between companies and firms
MasterRealOneSynth:
-built from masteronesynth, masterreals
-combines the real and one synth table
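Since the distance steps (RoundAggDist and SynthRoundAggDistCode20) are only described above, here is a minimal hedged sketch of that calculation, assuming PostGIS is the installed extension referred to. In the real code the portco-firm pairs come from the round tables (roundlineaggwexit/allmatchkeys), not a cross join, and the output table name portcofirmdist is made up for illustration.

-- Hedged sketch of the portco-firm distance calculation using PostGIS.
-- portcogeo and firmbogeo are the geocoded tables built earlier on this page.
CREATE EXTENSION IF NOT EXISTS postgis;

DROP TABLE IF EXISTS portcofirmdist;
CREATE TABLE portcofirmdist AS
SELECT C.coname, C.statecode, C.datefirstinv, F.firmname,
  ST_Distance(
    ST_SetSRID(ST_MakePoint(C.longitude, C.latitude), 4326)::geography,
    ST_SetSRID(ST_MakePoint(F.longitude, F.latitude), 4326)::geography
  ) / 1000.0 AS firmdistkm,
  ST_Distance(
    ST_SetSRID(ST_MakePoint(C.longitude, C.latitude), 4326)::geography,
    ST_SetSRID(ST_MakePoint(F.bolongitude, F.bolatitude), 4326)::geography
  ) / 1000.0 AS branchdistkm
FROM portcogeo AS C
CROSS JOIN firmbogeo AS F            -- restrict to actual investor pairs in practice
WHERE C.latitude IS NOT NULL AND F.latitude IS NOT NULL;

-- The distance actually used is then the smaller of the two, falling back to the
-- firm's own office when there is no branch office reading:
-- LEAST(firmdistkm, COALESCE(branchdistkm, firmdistkm))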

==Ranking Tables and Graphs==

This is a slight detour from the creation of VCDB3, but it is a rewarding process because you actually get to use the data you have been working with. The process is extensive, but the queries are easy to understand. If you want a deeper understanding, read the code, which is located in:

 E:\McNair\Projects\VentureXpert Database\vcdb3\LoadingScripts\RoundRanking.SQL

First, create a table that has aggregate round information grouped by city and round year. Since this is a little difficult to picture, the code is attached below.

DROP TABLE roundleveloutput;
CREATE TABLE roundleveloutput AS SELECT
city, statecode, roundyear AS year,
SUM(rndamtestm*seedflag) AS seedamnt,
SUM(rndamtestm*earlyflag) AS earlyamnt,
SUM(rndamtestm*laterflag) AS lateramnt,
SUM(rndamtestm*growthflag) AS selamnt,
SUM(growthflag*dealflag) AS numseldeals
FROM round GROUP BY city, statecode, roundyear;
--30028

The rest of the process is a sequence of tables. Keep the state code throughout, since city names are often repeated in different states and the state ensures you have the right city.
#Create a table that lists the all-time SEL amount by city.
#Create a table that lists every unique city, state pair for every year since 1980.
#Build a table that matches portcos to this city, state, year blowout for each year they were alive. This table will be relatively large, since it lists each company once for every year it was alive up until the present.
#Create a table that displays the number of companies alive in each city every year since 1980.
#Combine all of the information built in the previous tables by city, state, year, and also add in population.
#Run the ranking queries (a hedged sketch follows below).
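To make the last step concrete, here is a minimal hedged sketch of a ranking query. It assumes the combined city-level table described above has already been built; citymaster is a made-up name for that table, which would hold the roundleveloutput amounts, alive-company counts, and population by city, state, and year.

-- Hedged sketch: rank cities by SEL dollars within each year.
-- citymaster(city, statecode, year, selamnt, ...) is assumed to be the combined table.
DROP TABLE IF EXISTS cityselrank;
CREATE TABLE cityselrank AS
SELECT city, statecode, year, selamnt,
  RANK() OVER (PARTITION BY year ORDER BY selamnt DESC NULLS LAST) AS selrank
FROM citymaster;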

For states, follow the same general process but group by state rather than by city and state.

If this explanation was not enough (it was not meant to be in-depth), go to the location given above and read the actual code. With the description here, you should be able to piece together what each query does.

==Master Tables==

Throughout the creation of the database, some tables are vital to building a solid foundation. The following are the master tables, each with a quick explanation:

*Companybasecore - The base table for portcos. This data was drawn directly from SDC and was not changed other than for cleaning purposes. Count: 48001
*BranchOfficeCore - The base table for branch offices. This data was drawn directly from SDC. Only branch offices with distinct firm names are included. Count: 10032
*FirmBaseCore - The base table for firms. This data was also taken directly from SDC and was not changed other than for cleaning purposes. Count: 15437
*FundBaseCore - The base table for funds. This data was also taken directly from SDC and was not changed other than for cleaning purposes. Count: 28833
*IPOCleanNoDups - The clean table of IPOs after being run through the matcher against portcos. It was cleaned manually and had duplicates removed. Count: 2136
*IPONoDups - The IPO table before the cleaning process of matching to portcos. There could be problems with this table because an aggregate function was used; be careful using it. Count: 11149
*MACleanNoDups - The clean table of MAs after being run through the matcher against portcos. It was cleaned manually and had duplicates removed. Count: 7171
*MANoDups - The MA table before the cleaning process of matching to portcos. There could be problems with this table because an aggregate function was used here as well; be careful using it. Count: 119374
*Round - The master round table. It has SEL flags attached and the most round info. RoundBaseClean is also a decent table but has less information. This table is your best bet for round information. Count: 151323
*RoundLineJoinerLeanFFClean - The master round table for joining purposes. It was cleaned and used for widespread joining. Count: 163157
*CoPeople - The base table for portco people information. It was pulled directly from SDC. Count: 194359
*FirmBoGeo - The base table for firm/branch office geocoding. This table was cleaned and contains lat/long readings for firms and branch offices where the information was available. Count: 15437
*PortCoGeo - The base table for portco geocoding. The table was cleaned and contains lat/long readings for portcos where the Google API returned a valid reading. Count: 48001
*FirmPerf - A wide-reaching table about the performance of firms. It was mainly used later in the project but is extremely useful. Count: 8336
*FundPeople - The base table for fund people information. It was pulled directly from SDC. Count: 328994
*PortCoExitUpdated - The master exit table for portcos. The difference between this and PortCoExit is that Updated has two columns marking MAs and IPOs, while the other has a single MAvsIPO column. Use whichever one is more convenient. Count: 48001
*PortCoMaster - This table is great. It has a ton of information on portcos, including SEL flags, round amounts, and industry classifications. Count: 48001
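A minimal hedged sketch for spot-checking that your loaded tables match the counts listed above (table names as used on this page; adjust if your schema differs):

-- Hedged sanity check: compare row counts against the counts listed above.
SELECT 'companybasecore' AS tbl, COUNT(*) AS n FROM companybasecore      -- expect 48001
UNION ALL SELECT 'firmbasecore', COUNT(*) FROM firmbasecore              -- expect 15437
UNION ALL SELECT 'fundbasecore', COUNT(*) FROM fundbasecore              -- expect 28833
UNION ALL SELECT 'branchofficecore', COUNT(*) FROM branchofficecore      -- expect 10032
UNION ALL SELECT 'round', COUNT(*) FROM round;                           -- expect 151323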