Difference between revisions of "Normalizer Documentation"

From edegan.com
Jump to navigation Jump to search
imported>Kunal
(New page: The perl script Normalizer.pl is provided with a pod-generated HTML documentation and a (CPAN clone) CSS style sheet page. The documentations below was created using the [http://search.cpa...)
(No difference)

Revision as of 15:02, 18 March 2016

The perl script Normalizer.pl is provided with a pod-generated HTML documentation and a (CPAN clone) CSS style sheet page. The documentations below was created using the Pod::Simple::Wiki module available from CPAN and is not perfect.

FILES

NAME

Normalize Fixed Width - Converts unnormalized data, such as in files downloaded through Thomson SDC Platinum, to 3rd normal form. Processed files may be opened as tabbed delimited text files or columned spreadsheets. Ideal for database inputs, as commas in numbers, percent symbols, ampersands, and other database-averse symbols have been removed.

SYNOPSIS

perl NormalizerFixedWidth.pl -file=<file> [-headerdiscards=<int>] [-footerdiscards=<int>] [-columnnames=<1|0>] [-safenames=<1|0>] [-outfile=<file>] [h]

OPTIONS

   -file=<file>:           Name of file to Normalize
   -headerdiscards=<int>:  The number of header lines to discard (Default: 0)
   -footerdiscards=<int>:  The number of footer lines to discard (Default: 0)
   -columnnames=<1|0>:     Whether or not the file contains column names on the first line, after discards (Default: 1)
   -safenames=<1|0>:       Determines whether column names will be stripped of non-alphanumeric characters (Default: 1)
   -outfile=<file>:        The name of the outfile sequence (Default: file.txt - expands to file1.txt, file2.txt,etc)
   -h:                     Display help

USAGE & FEATURES

Normalizer.pl takes a tab-delimited, string-quoted textfile, such as might be made by copying and pasting an excel sheet into a text file, and creates a series of files containing the data in third normal form and suitable for import into a database.

Several major data providers, notably Thomson Financial's SDC service, use 'in cell' carriage returns to seperate multiple items of data. This makes processing of the data very difficult (without some software to help!). An example spreadsheet might look like this:

   --------------------------------------------------------
   |   col1   |   col2   |   col3   |   col4   |   col5   |
   --------------------------------------------------------
   |    A     |    A     |    A     |    A     |     A    |
   --------------------------------------------------------
   |    B     |    B     |    B     |    B     |     B    |
   |          |          |    B     |    B     |     B    |
   |          |          |          |    B     |     B    |
   --------------------------------------------------------        
   |          |    C     |    C     |    C     |     C    |
   |          |          |          |    C     |     C    |
   --------------------------------------------------------

Normalize.pl will prepare tables for import into the database, in 3rd normal form, by doing the following:

1 Adding a unique key field (col0) to the 1st output file. ; ; 2 Inserting all data (including null values) that has a 1
1 relationship with the key into the 1st output file. ; ; 3 If, as in the case of col4 and col5 above, data has a many to one relationship with the key, it will be grouped into sets. ; ; 4 Foreach of these sets (the example has two) a new output file, with the first key and it's own key, will be created. ;

The set identification is fully automatic and dictated by the data. In the example given, col1 and col2 would be assigned to the 1:1 file, while "col4 & col5" and "col3" would be two seperate sets.

Note that if the software were given only the first line (A) it would only identify a 1:1 relationship. Likewise if the software were given only the first (A) and the third (C) lines, it would identify two sets (col1, col2 & col3; and col4 & col5). Thus, the larger the file, the more accurate the set identification. However, a conceptual underidentification could be corrected manually after import into a database.

BUGS & FEEDBACK

Worked as expected to process three large Hubs pulls:

  • Hubs2_4RoundsPart2.txt
  • Hubs3_4_PortfolioCompanies.txt
  • Hubs4_4_funds.txt

A fourth test file, which was pulled in a particularly difficult format, required use of a second Perl Program to sculpt it into the proper form.

  • Hubs1_4CombinedRoundInfo.txt

After processing by NormalizeFixedWidth:

  • Hubs1_4CombinedRoundInfo-normal.txt

After separating Regex groups by tabs with HubsCapture.pl:

  • Hubs1_4CombinedRoundInfo-tabbedv3

All of the normalized files are stored under Software/Hubs\ Pulls on the server and are marked with "-normal" suffix on their filename. The original pulls are stored under Original Pulls. The concoction of files used to fix the CombinedRoundInfo are under the HubsCombinedRound directory.