==Cleaning Companybase, Fundbase, Firmbase, and BranchOffice==
===Cleaning Company===
First clean companybase. Here the The primary key for port cos will be coname, datefirstinv, and statecode. Before checking whether this is a valid primary key, remove the undisclosed companies. I will explain the second part of the query concerning New York Digital Health later.
DROP TABLE companybasecore;
The company named 'New York Digital Health LLC' came up as the company that is causing the problems. I queried to find the two rows that list this company name in companybase and chose to keep the row that had a lower updateddate. It is a good practice to avoid deleting rows from tables when possible, so I added the filter as a WHERE clause to exclude one of the New York Digital listings.
==CleaningFundbase==The primary key for funds will be only the fundname. First get rid of all of the undisclosed funds as we do not care about these. DROP TABLE fundbasenound; CREATE TABLE fundbasenound AS SELECT DISTINCT * FROM fundbase WHERE fundname NOT LIKE '%Undisclosed Fund%'; --28899 SELECT COUNT(*) FROM (SELECT DISTINCT fundname FROM fundbasenound)a; --28833 As you can see, fundbase still has rows that share fundnames. If you are wondering why the DISTINCT in the first query did not eliminate these, it is because this DISTINCT applies to the whole row not individual fundnames. Thus, only completely duplicate rows will be eliminated in the first query. I chose to keep the funds that have the earlier last investment date. DROP TABLE fundups; CREATE TABLE fundups AS SELECT fundname, max(lastinvdate) AS lastinvdate FROM fundbasenound GROUP BY fundname HAVING COUNT(*)>1; --53 DROP TABLE fundbasecore; CREATE TABLE fundbasecore AS SELECT A.* FROM fundbasenound AS A LEFT JOIN fundups AS B ON A.fundname=B.fundname AND A.lastinvdate=B.lastinvdate WHERE B.fundname IS NULL AND B.lastinvdate IS NULL; --28833 Since the count of fundbasecore is the same as the number of distinct fundnames, we know that the fundbasecore table is clean. In the first query I am finding duplicate rows and choosing the row that has the greater last investment date. I then match this table back to fundbasenound but choose all the rows from fundbasecore for which there is no corresponding fund in fundups. This allows the funds with the earlier date of last investment to be chosen.
==Instructions on Matching PortCos to Issuers and M&As From Ed==