Difference between revisions of "VentureXpert Data"

From edegan.com
Jump to navigation Jump to search
Line 60: Line 60:
  
 
==Loading Tables==
 
==Loading Tables==
When I describe errors I encountered, I will not describe them using line numbers. This is because as soon as any data is added, the line numbers will become useless. Instead I recommend that you copy the normalized file you are working with into an excel file and using the filter feature. This way you can find the line number in your specific file that is causing errors and fix it in the file itself. The line numbers that PuTTY errors display are often wrong, so I relied on excel to discover the error fastest.
+
When I describe errors I encountered, I will not describe them using line numbers. This is because as soon as any data is added, the line numbers will become useless. Instead I recommend that you copy the normalized file you are working with into an excel file and using the filter feature. This way you can find the line number in your specific file that is causing errors and fix it in the file itself. The line numbers that PuTTY errors display are often wrong, so I relied on excel to discover the error fastest. If my instructions are not enough for you to find the error, my advice would be to find key words in the line that PuTTY is telling you is causing errors and filter through excel.
  
  
Line 178: Line 178:
 
  \COPY fundbase FROM 'VCFund1980-2018q2-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 
  \COPY fundbase FROM 'VCFund1980-2018q2-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 
  --29397
 
  --29397
 +
 +
There is a Ukranian fund that has stray quotation marks in its name. It is called something along the lines of "VAT "ZNVKIF "Skhidno-Evropeis'lyi investytsiynyi Fond". If this does not help, you can filter in excel using Kiev as the keyword in the city column and find the line where you are getting errors. Then manually remove the commas in the actual text file. After that, the table should load correctly.
 +
DROP TABLE firmbase;
 +
CREATE TABLE firmbase(
 +
  firmname varchar(255),
 +
  foundingdate date, --mm-dd-yyyy
 +
  datefirstinv date, --mm-dd-yyyy 
 +
  datelastinv date, --mm-dd-yyyy
 +
  addr1 varchar(100),
 +
  addr2 varchar(100),
 +
  location varchar(100),
 +
  city varchar(100),
 +
  zip varchar(10),
 +
  areacode integer,
 +
  county varchar(100),
 +
  state varchar(2),
 +
  nationcode varchar(10),
 +
  nation varchar(100),
 +
  worldregion varchar(100),
 +
  numportcos integer,
 +
  numrounds integer,
 +
  investedk money,
 +
  capitalundermgmt money, 
 +
  invstatus varchar(100),
 +
  msacode varchar(10),
 +
  rolepref varchar(100),
 +
  geogpref varchar(100),
 +
  indpref varchar(100),
 +
  stagepref varchar(100),
 +
  type varchar(100)
 +
);
 +
 +
 +
\COPY firmbase FROM 'USVCFirms1980-2018q2-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
 +
--15899
 +
 +
The normalization for this file was wrong when I tried to load the data. To fix this go to the file where you have removed the footer and find the column header titled Firm Capital under Mgmt{0Mil}. Delete the {0mil} and renormalize the file. Then everything should be ok. A good way to check this is to copy and paste the normalized file into an excel sheet and see whether the entries line up with their column header correctly.
 +
The second error I found was with the Kerala Ventures firm. Here the address has the word l"opera in it. This quotation will cause errors so find the line number using excel and remove it manually.
 +
The third error is in an area code where 1-8 is written. This hyphen causes errors. Interestingly, the line number given by PuTTY was correct, and I found it in my text file and deleted it manually.
 +
These were the only errors I encountered while loading this table.

Revision as of 16:32, 11 July 2018


McNair Project
VentureXpert Data
Project logo 02.png
Project Information
Project Title VentureXpert Data
Owner Augi Liebster
Start Date June 20, 2018
Deadline
Primary Billing
Notes
Has project status Active
Copyright © 2016 edegan.com. All Rights Reserved.


Augi Liebster (Work Log)

Relevant Former Projects

  1. Venture Capital (Data)
  2. Retrieving US VC Data From SDC
  3. VC Database Rebuild

Location

My scripts for SDC pulls are located in the Z drive in the location:

Z:\VentureXpertDB\ScriptsForSDCExtract

My successfully pulled and normalized files are stored in the location:

Z:\VentureXpertDB\ExtractedDataQ2

My script for loading data is in one big text file in the location:

Z:\VentureXpertDB\vcdb3

The folder vcdb2 is there for reference to see what people before had done. ExtractedData is there because I pulled data before July 1st, and Ed asked me to repull the data.


Goal

I will be looking to redesign the VC Database in a way that is more intuitively built than the previous one. I will also update the database with current data.

Initial Stages

The first step of the project was to figure out what primary keys to use for each major table that I create. I looked at the primary keys used in the creation of the VC Database Rebuild and found primary keys that are decent. I have updated them and list them below:

  1. CompanyBaseCore- coname, statecode, datefirstinv
  2. IPOCore- issuer, issuedate, statecode
  3. MACore- target name, target state code, announceddate
  4. Geo - city, statecode, coname, datefirst, year
  5. DeadDate - conname, statecode, datefirst, rounddate (tentative could still change)
  6. RoundCore- conname, statecode, datefirst, rounddate
  7. FirmBaseCore - firmname
  8. FundBaseCore - fund name (firstinvedate doesn't work because not every row has an entry)

These are my initial listings and I will come back to update them if needed.

The second part of the initial stage has been to pull data from the SDC Platinum platform. I did it in July to ensure that I had two full quarters of data.


SDC Pull

When pulling data from SDC, it is a good idea to look for previously made rpt files that have the names of the pulls you will need to do. They have already been created and will save you a lot of work. The rpt files that I used are in the folder VentureXpertDB/ScriptsForSDCExtract. The files will come in pairs with one being saved as an ssh file and one as a rpt file. To update the dates to make them recent, go into the ssh file of the pair and change the date of last investment. When you open SDC, you will be given a variety of choices for which database to pull from. For each type of file chose the following:

  1. VentureXpert - PortCo, PortCoLong, USVC, Firms, BranchOffices, Funds, Rounds, VCFirmLong
  2. Mergres & Acquisition - MAs
  3. Global New Issues Databases - IPOs

Help on pulling data from SDC is on the SDC Platinum (Wiki) page.

VCFund Pull Problem

When pulling the VCFund1980-Present, I encountered two problems. One, is that SDC is not able to sort through the funds that are US only with the built in filters. Two, there are multiple rpt files that specify different variables for the fund pull. I pulled from both to be safe, but in the VC Database Rebuild page there is a section on the fund pull where Ed specifies which rpt file he used to pull data from SDC. Regardless I have both saved in the ExtractedData folder. After speaking with Ed, he told me to use the VCFund1980-present.rpt file to extract the data. Had various problems extracting data including freezing of SDC program or getting error Out of Memory. Check the SDC Platinum (Wiki) page to fix these issues.


Loading Tables

When I describe errors I encountered, I will not describe them using line numbers. This is because as soon as any data is added, the line numbers will become useless. Instead I recommend that you copy the normalized file you are working with into an excel file and using the filter feature. This way you can find the line number in your specific file that is causing errors and fix it in the file itself. The line numbers that PuTTY errors display are often wrong, so I relied on excel to discover the error fastest. If my instructions are not enough for you to find the error, my advice would be to find key words in the line that PuTTY is telling you is causing errors and filter through excel.


DROP TABLE roundbase;
CREATE TABLE roundbase (
  coname varchar(255),
  rounddate date,
  updateddate date,
  foundingdate date,
  datelastinv date,
  datefirstinv date,
  investedk real,
  city varchar(100),
  description varchar(5000),
  msa varchar(100),
  msacode varchar(10),
  nationcode varchar(10),
  statecode varchar(10),
  addr1 varchar(100),
  addr2 varchar(100),
  indclass varchar(100),
  indsubgroup3 varchar(100),
  indminor varchar(100),
  url varchar(5000),
  zip varchar(10),
  stage1 varchar(100),
  stage3 varchar(100),
  rndamtdisck real,
  rndamtestk real,
  roundnum integer,
  numinvestors integer
);
\COPY roundbase FROM 'USVC1980-2018q2-Good.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
--151549

The only error I encountered here was with Cardtronic Technology Inc. Here there was a problem with a mixture of quotation marks which cause errors in loading. Find this using the excel trick and remove it manually.

DROP TABLE ipos;
CREATE TABLE ipos (
  issuedate date,
  issuer varchar(255),
  statecode varchar(10), 
  principalamt money, --million
  proceedsamt money, --sum of all markets in million
  naiccode varchar(255), --primary NAIC code
  zipcode varchar(10),
  status varchar (20),
  foundeddate date
);
\COPY ipos FROM 'IPO1980-2018q2-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
--12107

I encountered no errors while loading this data.


DROP TABLE branchoffices;
CREATE TABLE branchoffices (
  firmname varchar(255),
  bocity varchar(100),
  bostate varchar(2),
  bocountrycode varchar(2),
  bonation varchar(100),
  bozip varchar(10),
  boaddr1 varchar(100),
  boaddr2 varchar(100)

);
\COPY branchoffices FROM 'USVCFirmBranchOffices1980-2018q2-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
--10353

I encountered no errors while loading this data.

DROP TABLE roundline;
CREATE TABLE roundline (
  coname varchar(255),
  statecode varchar(2),
  datelastinv date,
  datefirstinv date,
  rounddate date,
  disclosedamt money,
  investor varchar(255)
);
\COPY roundline FROM 'USVCRound1980-2018q2-NoFoot-normal-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
--403189

I encountered no errors while loading this data.

DROP TABLE fundbase;
CREATE TABLE fundbase (
  fundname varchar(255),
  closedate date, --mm-dd-yyyy
  lastinvdate date, --mm-dd-yyyy
  firstinvdate date, --mm-dd-yyyy
  numportcos integer,
  investedk real,
  city varchar(100),
  fundyear varchar(4), --yyyy
  zip varchar(10),
  statecode varchar(2),
  fundsizem real,
  fundstage varchar(100),
  firmname varchar(255),
  dateinfoupdate date,
  invtype varchar(100),
  msacode varchar(10),
  nationcode varchar(10),
  raisestatus varchar(100),
  seqnum integer,
  targetsizefund real,
  fundtype varchar(100)
);
\COPY fundbase FROM 'VCFund1980-2018q2-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
--29397

There is a Ukranian fund that has stray quotation marks in its name. It is called something along the lines of "VAT "ZNVKIF "Skhidno-Evropeis'lyi investytsiynyi Fond". If this does not help, you can filter in excel using Kiev as the keyword in the city column and find the line where you are getting errors. Then manually remove the commas in the actual text file. After that, the table should load correctly.

DROP TABLE firmbase;
CREATE TABLE firmbase(
  firmname varchar(255),
  foundingdate date, --mm-dd-yyyy
  datefirstinv date, --mm-dd-yyyy  
  datelastinv date, --mm-dd-yyyy
  addr1 varchar(100),
  addr2 varchar(100),
  location varchar(100),
  city varchar(100),
  zip varchar(10),
  areacode integer,
  county varchar(100),
  state varchar(2),
  nationcode varchar(10),
  nation varchar(100),
  worldregion varchar(100),
  numportcos integer,
  numrounds integer,
  investedk money,
  capitalundermgmt money,  
  invstatus varchar(100),
  msacode varchar(10),
  rolepref varchar(100),
  geogpref varchar(100),
  indpref varchar(100),
  stagepref varchar(100),
  type varchar(100)
);


\COPY firmbase FROM 'USVCFirms1980-2018q2-NoFoot-normal.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
--15899

The normalization for this file was wrong when I tried to load the data. To fix this go to the file where you have removed the footer and find the column header titled Firm Capital under Mgmt{0Mil}. Delete the {0mil} and renormalize the file. Then everything should be ok. A good way to check this is to copy and paste the normalized file into an excel sheet and see whether the entries line up with their column header correctly. The second error I found was with the Kerala Ventures firm. Here the address has the word l"opera in it. This quotation will cause errors so find the line number using excel and remove it manually. The third error is in an area code where 1-8 is written. This hyphen causes errors. Interestingly, the line number given by PuTTY was correct, and I found it in my text file and deleted it manually. These were the only errors I encountered while loading this table.