I encountered no problems loading this data.
==Cleaning Companybase, Fundbase, Firmbase, and BranchOffice==
===Cleaning Company===
First clean companybase. Here the primary key will be coname, datefirstinv, and statecode. Before checking whether this is a valid primary key, remove the undisclosed companies. I will explain the second part of the query concerning New York Health later.
DROP TABLE companybasecore;
CREATE TABLE companybasecore AS
SELECT *
FROM Companybase WHERE nationcode = 'US' AND coname != 'Undisclosed Company'
AND NOT (coname='New York Digital Health LLC' AND statecode='NY' AND datefirstinv='2015-08-13' AND updateddate='2015-10-20');
--48001
SELECT COUNT(*) FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybasecore) AS T;
--48001
Since the count of the table and the count of the distinct primary key is equivalent, you know that the primary key is valid. In the initial cleaning of the table, I first sorted out only the undisclosed companies. This table had 48002 rows. I then ran the DISTINCT query above and found that there are 48001 distinct rows with the coname, datefirstinv, statecode primary key. Thus there must two rows that share a primary key. I found this key using the following query:
SELECT * FROM (SELECT coname, datefirstinv, statecode FROM companybase) as key GROUP BY coname, datefirstinv, statecode HAVING COUNT(key) > 1;
The company named 'New York Digital Health LLC' came up as the company that is causing the problems. I queried to find the two rows that list this company name in companybase and chose to keep the row that had a lower updateddate. It is a good practice to avoid deleting rows from tables when possible, so I added the filter as a WHERE clause to exclude one of the New York Digital listings.
==Instructions on Matching PortCos to Issuers and M&As From Ed==