Amazon Mechanical Turk for Analyzing Demo Day Classifier's Results

From edegan.com
Revision as of 16:21, 3 August 2018 by Leminh.ams

Login info:

username: mcnair@rice.edu
password: amount

There's a file in the folder CrawledHTMLFull called FinalResultWithURL that was manually created by combining the file crawled_demoday_page_list.txt in the parent folder and the file predicted.txt.

This file pairs each prediction with the actual URL of the website.
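The manual merge could also be done with a short script. This is only a minimal sketch: it assumes the two files are line-aligned (line i of predicted.txt is the prediction for line i of crawled_demoday_page_list.txt) and writes tab-separated output. Neither the alignment nor the output format is confirmed by the files themselves, and `combine_predictions` is a hypothetical helper name.

```python
from pathlib import Path


def combine_predictions(url_file, prediction_file, output_file):
    """Pair each crawled URL with its prediction, one tab-separated pair per line.

    Assumes both input files list the same pages in the same order
    (an assumption, not a documented property of the files).
    """
    urls = [ln.strip() for ln in Path(url_file).read_text().splitlines() if ln.strip()]
    preds = [ln.strip() for ln in Path(prediction_file).read_text().splitlines() if ln.strip()]
    if len(urls) != len(preds):
        # A mismatch means the files are not line-aligned, so stop early.
        raise ValueError(f"{len(urls)} URLs but {len(preds)} predictions")
    rows = [f"{url}\t{pred}" for url, pred in zip(urls, preds)]
    Path(output_file).write_text("\n".join(rows) + "\n")
    return rows
```

The length check is there so that a missing or extra line fails loudly instead of silently shifting every prediction onto the wrong URL.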

Since MTurk makes it hard for us to display the downloaded HTML, it is much faster to copy the URL into the question box than to try to display the downloaded HTML.

The advantage of this is that some websites, such as techcrunch.com, behave abnormally when downloaded as HTML, so opening these websites in the browser is actually better because the UI is not messed up. Moreover, if a website has a paywall or pop-up ads, the user can click out of them. Since paywalls and pop-ups are most often scripts within the HTML, the classifier can't rule these pages out, because the body of the HTML may still contain the information we are looking for. Major paywalled websites and websites that require log-ins, such as linkedin, have been blacklisted in the crawler. More detail in the crawler section below.

However, there is a disadvantage to this: websites are ever changing, so it is possible that in the future a URL will no longer work or will point to something else. Downloaded HTML, on the other hand, stays the same: it does not require an internet connection to render, so its content is static.

To create the MTurk task for this project, follow the tutorial in Mechanical Turk (Tool). For testing and development purposes, use https://requestersandbox.mturk.com/

Test account: email: mcboatfaceboaty670@gmail.com password: sameastheoneforemail2018

For this project, the fields asked of the user are:

  • Whether the page had a list of companies going through an accelerator
  • The month and year of the demo day (or article)
  • Accelerator name
  • Companies going through accelerator

Layout:

Demodayfinal.png

Pricing

We priced our task at $1.25 per HIT. Assuming workers take less than 10 minutes per HIT, this translates into more than $7.50 per hour.

We sent out the task in two batches. The first was 20 HITs, each to be completed by two workers, in order to test inter-judge reliability.

The second batch was the remaining 264 HITs, to be completed by one worker each.
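The page does not say which reliability statistic was used on the first batch. As one simple possibility, the sketch below computes plain percent agreement between the two workers on a single yes/no field. The function name and the answers are made up for illustration.

```python
def percent_agreement(answers_a, answers_b):
    """Share of HITs on which two workers gave the same answer."""
    if len(answers_a) != len(answers_b) or not answers_a:
        raise ValueError("need two non-empty answer lists of equal length")
    matches = sum(a == b for a, b in zip(answers_a, answers_b))
    return matches / len(answers_a)


# Made-up answers to "does the page list companies?" for 5 HITs.
worker_1 = ["yes", "yes", "no", "yes", "no"]
worker_2 = ["yes", "no", "no", "yes", "no"]
print(percent_agreement(worker_1, worker_2))  # 0.8
```

Percent agreement does not correct for agreement by chance; a chance-corrected statistic such as Cohen's kappa would be a stricter check.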

MTurk charged a fee of $0.25 per HIT plus an additional $0.0625, meaning each HIT cost us $1.5625.

OUR FINAL PRICE: ((20*2)+264)*1.5625 = $475.00
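The arithmetic behind the final price can be checked with a few lines. The variable names are illustrative; the figures are the ones quoted in this section.

```python
from decimal import Decimal

reward = Decimal("1.25")        # paid to the worker per completed HIT
fee = Decimal("0.25")           # MTurk fee per completed HIT
extra_fee = Decimal("0.0625")   # additional MTurk charge per completed HIT
cost_per_submission = reward + fee + extra_fee

# Batch 1: 20 HITs x 2 workers each; batch 2: 264 HITs x 1 worker each.
submissions = 20 * 2 + 264
total = cost_per_submission * submissions

# Lower bound on worker pay if a HIT takes at most 10 minutes.
minutes_per_hit = 10
hourly_rate = reward * 60 / minutes_per_hit

print(cost_per_submission, submissions, total, hourly_rate)
```

`Decimal` is used instead of floats so the dollar amounts are exact.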

Hand Collecting Data

For the crawl, we only looked for data on accelerators whose companies did not receive venture capital (VC data was found by Ed via VentureXpert) and lacked timing info. The purpose of this crawl is to find timing info where we cannot find it otherwise; if a company received VC, we can find timing info via that investment. The file we used to find companies that lacked both timing info and VC is:

/bulk/McNair/Projects/Accelerators/Summer 2018/Merged W Crunchbase Data as of July 17.xlsx

We filtered this sheet in Excel (and checked our work by filtering in SQL) and found 809 companies that lacked timing info and didn't receive VC. From this, we found 74 accelerators which we needed to crawl for.
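The filter itself is simple: keep companies with no timing info and no VC, then collect their distinct accelerators. Below is a minimal sketch on made-up rows. The column names (`company`, `accelerator`, `cohort_date`, `received_vc`) are hypothetical and do not come from the actual spreadsheet.

```python
def accelerators_to_crawl(rows):
    """Return (companies, accelerators) that still need a crawl.

    A company qualifies when it has no timing info and did not receive VC.
    """
    companies = [r for r in rows if not r["cohort_date"] and not r["received_vc"]]
    accelerators = sorted({r["accelerator"] for r in companies})
    return companies, accelerators


# Made-up example rows; the real sheet has its own column names.
rows = [
    {"company": "A", "accelerator": "X", "cohort_date": "", "received_vc": False},
    {"company": "B", "accelerator": "X", "cohort_date": "2016-05", "received_vc": False},
    {"company": "C", "accelerator": "Y", "cohort_date": "", "received_vc": True},
    {"company": "D", "accelerator": "Z", "cohort_date": "", "received_vc": False},
]
companies, accelerators = accelerators_to_crawl(rows)
print(len(companies), accelerators)  # 2 ['X', 'Z']
```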

We used the crawler to search for cohort companies listed for these accelerators.

During the initial test run, the number of good pages was 359. The data was then coded by hand by fellow interns.

The file for hand-coding is in:

/bulk/McNair/Projects/Accelerator Demo Day/Test Run/CrawledDemoDayHTMLFull/FinalResultWithURL

For the sake of collaboration, the team copied this information to a Google Sheet, accessible here:

https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing

We split the process into four parts. Each intern will do the following:

1. Go to the given URL.

2. Record whether the page is good data (column F); this can later be used by Minh Le to refine/fine-tune training data.

3. Record whether the page is announcing a cohort or recapping/explaining a demo day (column G). This variable will be used to decide if we should subtract weeks from the given date (e.g. if it is recapping a demo day, the cohort went through the accelerator for the past ~12 weeks, and we should subtract weeks as such).

4. Record date, month, year, and the companies listed for that given accelerator.

5. Note any other information, such as a cohort's special name.
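The date adjustment described in step 3 can be sketched as follows. The 12-week program length is the approximate figure from step 3, not a value known for every accelerator, and `estimated_cohort_start` is a hypothetical helper name.

```python
from datetime import date, timedelta

PROGRAM_WEEKS = 12  # approximate program length from step 3; an assumption


def estimated_cohort_start(page_date, is_recap, program_weeks=PROGRAM_WEEKS):
    """Estimate when the cohort started.

    Announcement pages are taken at face value; demo day recaps are
    shifted back by the program length.
    """
    if is_recap:
        return page_date - timedelta(weeks=program_weeks)
    return page_date


print(estimated_cohort_start(date(2018, 8, 3), is_recap=True))   # 2018-05-11
print(estimated_cohort_start(date(2018, 8, 3), is_recap=False))  # 2018-08-03
```

Keeping the program length as a parameter lets an accelerator with a known, different program length override the default.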

Once this process is finished, we will filter only the 1s in Column F, and Connor Rothschild and Maxine Tao will work to populate empty cells in The File to Rule Them All with that data.