Latest revision as of 12:41, 21 September 2020
VentureXpert Database | |
---|---|
Project Information | |
Has title | VentureXpert Database |
Has owner | Vineet Anne, Khai Nguyen |
Has start date | |
Has deadline date | |
Has project status | Active |
Dependent(s): | Ecosystem Organization Classifier |
Has sponsor | Kauffman Incubator Project |
Has project output | Data |
Copyright © 2019 edegan.com. All Rights Reserved. |
This project was conducted by students at Georgetown in the Spring of 2019. It was replaced by the vcdb4 project, conducted by Ed Egan, in the Fall of 2019.
The purpose of this project is to create a new, updated database of venture capital deals, portfolio companies, funds and firms, including information on exits, executives, and other events and entities, called vcdb4. The old VentureXpert Data project created the database vcdb3. Data is retrieved using SDC Platinum, for which we have a license.
Updating data from vcdb4
Using the Researcher RDP account:
1. Go to E:\mcnair\Projects\VentureXpert Database\ScriptsForSDCExtract\ and copy all necessary files (for each pull, the txt, ssh, and rpt files) to E:\projects\vcdb4.
2. Open SDC Platinum, enter your initials, and create a project description.
3. Open the ssh file in SDC Platinum and save it as a new file for updating (do not overwrite the old file).
4. Right-click on the line with the date and change it to the most current date.
5. Delete the line that generates the report.
6. Click the Report tab at the top to generate a Custom Report. Refer to the txt document of the pre-upload session for the metrics that need to be collected.
7. Click OK, OK, and then No in response to the window that pops up.
8. A window should now pop up with the option "Save As". Use that option to save the report as a txt file in the same vcdb4 folder.
9. Click Execute to generate a rpt file and a txt file for the now-updated session in the vcdb4 folder. This will take some time. Exit the session when it is done (make sure it is saved).
10. Go to the vcdb4 folder, open the txt file, and scroll to the bottom to delete the footer. Save the footer-less txt document as a new file.
11. Move everything related to the updated session (ssh, rpt, txt, txt with no footer) to the Updated folder (all current updated files are in there).
Notes from Ed: I'm not sure about steps 5 and 6! There is no need to remake the report manually. You can use the old report by editing the .ssh file directly to correct the path before loading it, or you can load the existing report inside of SDC once the session is loaded.
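The footer-removal step above (opening the txt file and deleting the footer by hand) could also be scripted. The following is only a minimal sketch: this page does not describe what the SDC footer looks like, so the function takes a caller-supplied test, `is_footer_start`, that recognizes the first footer line. The function name, the `max_footer_lines` limit, and the latin-1 decoding are all assumptions made for illustration.

```python
from pathlib import Path


def strip_footer(src, dst, is_footer_start, max_footer_lines=50):
    """Copy src to dst without its footer and return the number of lines dropped.

    is_footer_start is a caller-supplied function (hypothetical: the footer
    layout is not documented on this page) that returns True for the line
    where the footer begins. Only the last max_footer_lines lines are
    examined, so a matching line in the body of the data is left alone.
    """
    # Work on bytes so the original line endings are preserved exactly.
    raw_lines = Path(src).read_bytes().splitlines(keepends=True)
    cut = len(raw_lines)
    lowest = max(0, len(raw_lines) - max_footer_lines)
    # Scan upward from the end of the file and stop at the first match.
    for i in range(len(raw_lines) - 1, lowest - 1, -1):
        if is_footer_start(raw_lines[i].decode("latin-1")):
            cut = i
            break
    Path(dst).write_bytes(b"".join(raw_lines[:cut]))
    return len(raw_lines) - cut
```

A call would pass the exported txt file, a new output name, and the test for the footer's first line. Writing to a new file rather than overwriting keeps both the txt file and the footer-less txt file, as the steps above require. If no line matches, the file is copied unchanged and the function returns 0.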
After updating all files through SDC, clean up the Data
1. Download the data from SDC.
2. Cut off the footers and run the files through Normalizer.pl.
3. Load the normalized files into a PostgreSQL database, cleaning them as needed.
4. Process the data in SQL and with Matcher.pl to find IPOs and M&As for VC-backed firms, etc.
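As an illustration of the loading step, the sketch below builds a CREATE TABLE statement and a psql `\copy` command from the header row of a normalized file. It assumes that Normalizer.pl output is tab-delimited with a single header row, which this page does not confirm. The function names, the all-text column types, and the example table name are hypothetical, and duplicate column names are not handled.

```python
import re


def sanitize(name):
    """Turn a column header into a lowercase SQL-friendly identifier."""
    ident = re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")
    if not ident or ident[0].isdigit():
        ident = "c_" + ident
    return ident


def build_load_sql(header_line, table, data_path):
    """Return (create_statement, copy_command) for one tab-delimited file.

    Every column is declared as text so that cleaning and type casting can
    happen later in SQL. Assumes a tab-delimited file with one header row.
    """
    columns = [sanitize(c) for c in header_line.rstrip("\r\n").split("\t")]
    col_defs = ",\n".join(f"    {c} text" for c in columns)
    create = f"CREATE TABLE {table} (\n{col_defs}\n);"
    # Forward slashes avoid psql's backslash-escape handling in quoted arguments.
    path = data_path.replace("\\", "/")
    copy = (
        f"\\copy {table} FROM '{path}' "
        "WITH (FORMAT csv, DELIMITER E'\\t', HEADER true, NULL '')"
    )
    return create, copy
```

Calling `build_load_sql` with the first line of a normalized file, a table name such as `portcos`, and the file's path returns two strings that can be pasted into a psql session. Declaring every column as text is a deliberate simplification: it lets the load succeed on messy values and leaves the "cleaning it as needed" part to SQL.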
April 30th:
Current Work
Files pulled from the old McNair files:
- USVCFirms1980-present2019
- USVCFunds1980-present2019
- USVCPortCos1980-present2019
- IPO1980-present
- MA1980-2018q2-updated
Files we have not pulled or need help processing:
- USCompanyLongDescription1980-present
- VCFirmsLongDescription
- USVCRound1980-present -- This is the Round-On-One-Line pull. See Retrieving_US_VC_Data_From_SDC#Round_On_One_Line
- USVC1980-present -- This is the Round pull. Note from Vineet: Tried updating this, but it seems the session was saved as USVCPortCos1980-present2019. Please advise.
- Branch offices -- Retrieving_US_VC_Data_From_SDC#Branch_Offices
- PortCo Executives
- Fund Executives
Normalized Files Status Update
Files that are normalized:
- USVCFirms1980-present2019
- USVCFunds1980-present2019
- USVCPortCos1980-present2019
- IPO1980-present
Files that still need to be normalized:
- MA1980-2018q2-updated -- The .txt file for the MA activity is missing, and I am unable to create another one from the .ssh file. Will work on it more tomorrow. - Vineet
April 16th:
List of files with statuses:
- USVCFirms1980-present2019 -- Done by Khai
- USVCFunds1980-present2019 -- Done by Khai
- USVCPortCos1980-present2019 -- Done by Khai
- USCompanyLongDescription1980-present -- May require special processing
- VCFirmsLongDescription -- May require special processing
- IPO1980-present -- Done by Vineet; renamed to IPO1980-April2019
- MA1980-2018q2-updated -- Done by Vineet
- USVCRound1980-present -- This is the Round-On-One-Line pull. See Retrieving_US_VC_Data_From_SDC#Round_On_One_Line
- USVC1980-present -- This is the Round pull - Note from Vineet: Tried updating this but it seems the session was saved as USVCPortCos1980-present2019...please advise
We also need:
- Branch offices -- Retrieving_US_VC_Data_From_SDC#Branch_Offices
- PortCo Executives
- Fund Executives
Notes from Vineet:
- For IPO1980-present, SDC says that there is no matching .rpt file, so the session could not be loaded. There is a .rpt file called IPO1980-present-Done.rpt that I believe holds the same data as the .ssh file. Should I rename one of them to match the other?
See Retrieving US VC Data From SDC
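The mismatch Vineet describes for IPO1980-present could be caught before opening SDC by checking a folder for session files that lack a report of the same name. This is a rough sketch that assumes a report is expected to share the session's base name, which the note suggests but this page does not confirm. Ed's note above indicates that the .ssh file stores a path to its report, so editing that path is an alternative to renaming. The function name is hypothetical.

```python
from pathlib import Path


def sessions_missing_report(folder):
    """List .ssh files in folder that have no .rpt file with the same base name.

    Assumption (not confirmed on this page): a session and its report are
    expected to share a base name, e.g. X.ssh and X.rpt.
    """
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    # Compare base names case-insensitively, as Windows file names are.
    report_stems = {p.stem.lower() for p in files if p.suffix.lower() == ".rpt"}
    return sorted(
        p.name
        for p in files
        if p.suffix.lower() == ".ssh" and p.stem.lower() not in report_stems
    )
```

Run against a folder holding IPO1980-present.ssh and IPO1980-present-Done.rpt, the function would report IPO1980-present.ssh, which is the case described in the note.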
Update (4/30/19)
Files that have been normalized: IPO, USVC Firms, USVC Firms Branch Office, USVC Funds, USVC PortCos.
All these files are in E:\projects\vcdb4\Updated. Each contains:
1. An updated session file
2. A rpt file
3. A txt file
4. A txt file with no footer
5. A normalized txt file derived from (4)
Currently having issues with the MA session: the ssh file in the Updated folder for MA cannot be executed. SDC displays an error message but does not specify what the error is.
Next step is to do Fund and PortCo executives.
Update: The Fund Executives script has been created, but it cannot be executed (error message: Out of memory).