Python file saved in
E:\projects\listing page identifier\cnn.py
===Workflow===
This section summarizes a general process of utilizing above tools to get appropriate input for our CNN model, also serves as a guidance for anyone who wants to implement upon those tools.
# Feed raw data (as for now, our raw data is the <code>The File to Rule Them All.csv</code>) into <code> generate_dataset.py</code> to get text files (<code>train.txt</code> and<code>text.txt</code>) that contain a list of all internal urls with their corresponding indicator (class label)
# Create 2 folders: train and test, located in the same directory as <code>train.txt</code> and <code>text.txt</code>, also create 2 sub-folders: cohort and not_cohort within these 2 folders
# Feed the directory/path of <code>train.txt</code> and <code>text.txt</code> into <code>screen_shot_tool.py</code>. This process will automatically group images into their corresponding folders that we just created in step 2