Revision as of 18:59, 28 February 2017
Eventbrite Webcrawler (Tool) | |
---|---|
Project Information | |
Project Title | Eventbrite Webcrawler (Tool) |
Owner | Gunny Liu |
Start Date | Summer 2016 |
Deadline | |
Primary Billing | |
Notes | |
Has project status | Active |
Copyright © 2016 edegan.com. All Rights Reserved. |
Description
Notes: The Eventbrite Webcrawler aims to create an automated system to systematically locate, retrieve and store data regarding entrepreneurship-related events documented by the Eventbrite database, such as demo days, hackathons, open houses, startup weekends, and more. To be developed around Eventbrite APIv3 and Python 2.7.
Input: Eventbrite developer database
Output: Local database documenting entrepreneurship-related events defined by the keys "organiser," "date," and "street level address," and possibly more.
Development Notes
6/30: Project start
Eventbrite APIv3
7/5: Eventbrite API First-Take
- Eventbrite developer account for McNair Center:
- first name: Anne, last name: Dayton
- Login Email: admin@mcnaircenter.org
- Login Password: amount
- Eventbrite API is well-documented and its database readily accessible. In the Python dev environment, I am using the HTTP requests library to make queries to the database and obtain JSON data containing event objects, which in turn contain organizer objects, venue objects, start/end time values, and longitude/latitude values specific to each event. The requests library has a built-in .json() access method, simplifying the JSON reading/writing process. Bang.
- In querying for events organized by Techstars, one of the biggest startup program organizations in the U.S., I use the following. Note that the organizer ID of Techstars is 2300226659.
  import requests

  # List the events published by one organizer (ID taken from the note above).
  response = requests.get(
      "https://www.eventbriteapi.com/v3/organizers/2300226659/events/",
      headers={
          "Authorization": "Bearer CRAQ5MAXEGHKEXSUSWXN",
      },
      verify=True,
  )
- In querying for, instead, keywords such as "startup weekend," I use the following.
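  import requests

  # Pass the search phrase as the "q" query parameter so that requests
  # URL-encodes it, instead of embedding nested quotes in the URL string.
  response = requests.get(
      "https://www.eventbriteapi.com/v3/events/search/",
      params={"q": "startup weekend"},
      headers={
          "Authorization": "Bearer CRAQ5MAXEGHKEXSUSWXN",
      },
      verify=True,
  )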
- In querying for events parked under the category "science and technology", I use the following. However, this query also returns scientific seminars unrelated to entrepreneurship and is yet to be refined.
- Note that the category ID of science and technology is 102.
  import requests

  # Query the category endpoint for "science and technology" (ID 102).
  response = requests.get(
      "https://www.eventbriteapi.com/v3/categories/102",
      headers={
          "Authorization": "Bearer CRAQ5MAXEGHKEXSUSWXN",
      },
      verify=True,
  )
- In each case, the variable response holds the server's reply, whose JSON body can be read in Python using the requests method response.json(). Each endpoint used above is an instance of an Eventbrite API method, e.g. GET events/search/ or GET categories/:id. Each GET method accepts different parameters that can be used to get more specific results. To populate a comprehensive local database, the dream is to run systematic queries against different endpoints and collect all results, without repetition, in a centralized database. In order to do this, I'll have to familiarize myself further with these GET methods and develop a systematic approach to automate queries to the Eventbrite server. One way to do this is to import entrepreneurship buzzword libraries that are available on the web, and make queries by iterating through these search strings systematically.
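The idea of iterating through search strings and collecting results without repetition could be sketched roughly as follows. This is only an illustration, not the crawler's actual code: `collect_unique_events` and the `search_events` callable are hypothetical names, and the sketch assumes each event dict carries an `"id"` key that identifies it uniquely.

```python
def collect_unique_events(keywords, search_events):
    """Run one search per keyword and merge the results without duplicates.

    `search_events` is a hypothetical callable that takes a keyword and
    returns a list of event dicts, e.g. a thin wrapper around
    GET events/search/. Events are assumed to carry a unique "id" key.
    """
    unique = {}  # event id -> event dict, kept in first-seen order
    for keyword in keywords:
        for event in search_events(keyword):
            event_id = event.get("id")
            if event_id is None:
                continue  # skip malformed entries rather than guess a key
            # setdefault keeps the first copy and ignores later repeats
            unique.setdefault(event_id, event)
    return list(unique.values())
```

Passing the search function in as an argument keeps the de-duplication logic testable without touching the network; a real run would hand in a function that performs the HTTP request.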
- Eventbrite event objects in JSON are well-organized and consistent. There are many interesting fields, such as the longitude/latitude decimals, beyond the name/location/organizer/start-time/end-time data that we want to amass initially.
- For instance, the upcoming startup weekend event in Seville looks like the following.
- In the event object, organizer and venue are represented by IDs and have to be queried separately, since each contains a multitude of key-value pairs such as "description", "logo", and "url" in the case of organizer data. Huge opportunity here for more data extraction. Kudos to Eventbrite for documenting their stuff meticulously. Can you tell I'm impressed?
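Because the organizer and venue arrive as separate objects, each event has to be merged with them before it can become one database row. A minimal sketch of that merge is shown here. The function name `flatten_event` and the output column names are hypothetical, and the nested field names (`name.text`, `start.utc`, `address.address_1`, and so on) are assumptions about the JSON layout that should be checked against a real response.

```python
def flatten_event(event, venue, organizer):
    """Merge one event dict with its venue and organizer dicts into a flat row.

    All field names read here are assumptions about the JSON layout;
    missing fields simply come back as None instead of raising.
    """
    address = venue.get("address") or {}
    return {
        "event_id": event.get("id"),
        "name": (event.get("name") or {}).get("text"),
        "start": (event.get("start") or {}).get("utc"),
        "end": (event.get("end") or {}).get("utc"),
        "organizer": organizer.get("name"),
        "organizer_url": organizer.get("url"),
        "street_address": address.get("address_1"),
        "city": address.get("city"),
        "latitude": venue.get("latitude"),
        "longitude": venue.get("longitude"),
    }
```

Using `.get` throughout means an event with an incomplete venue still yields a row, with the gaps left empty for later inspection.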
- To produce a local database, I'm using the pandas library (import pandas as pd), the pandas.DataFrame object and the pandas.DataFrame.to_csv() method. Currently, I initialize a dataframe with columns of variables that I seek to extract, and iterate through event objects and venue/organizer objects within to populate the dataframe with rows of event data.
- Still debugging/writing at the moment.
- RDP went down, major sadness.
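The DataFrame-to-CSV step described in the notes above might look something like the sketch below. It is not the build's actual code: the column list `COLUMNS` and both function names are hypothetical, and rows are assumed to be plain dicts keyed by column name.

```python
import pandas as pd

# Hypothetical column set; the real build may extract different variables.
COLUMNS = ["event_id", "name", "organizer", "start", "street_address"]


def rows_to_frame(rows, columns=COLUMNS):
    """Build a DataFrame with a fixed column order from a list of row dicts.

    Keys not listed in `columns` are dropped; missing keys become NaN.
    An empty list yields an empty frame that still has the columns.
    """
    return pd.DataFrame(rows, columns=columns)


def frame_to_csv_text(frame):
    """Serialize the frame as CSV text, without the numeric row index."""
    return frame.to_csv(index=False)
```

To write a file instead of returning text, `frame.to_csv("events.csv", index=False)` takes a path as its first argument; the file name here is only an example.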
7/6: Alpha Development
- Eventbrite stipulates a system of ID-numbering for all organizer and venue objects. For instance:
- For the endpoint GET /venues/:id/, replace :id with the venue_id associated with the desired venue
- For the endpoint GET /organizers/:id, replace :id with the organizer_id associated with the desired organizer
- Where are these ID numbers located, you ask? Any query for an event will return them as the values of the keys "venue_id" and "organizer_id"
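As a rough illustration of the ID substitution described above, the following sketch derives the venue and organizer endpoint URLs from an event dict and wraps a fetch function in a small cache, since many events share one organizer. The names `related_urls`, `make_cached_fetch` and `fetch_json` are hypothetical, and the trailing slash on both endpoints is an assumption.

```python
API_ROOT = "https://www.eventbriteapi.com/v3"


def related_urls(event):
    """Return the venue and organizer endpoint URLs for one event dict."""
    urls = {}
    if event.get("venue_id"):
        urls["venue"] = "{}/venues/{}/".format(API_ROOT, event["venue_id"])
    if event.get("organizer_id"):
        urls["organizer"] = "{}/organizers/{}/".format(
            API_ROOT, event["organizer_id"]
        )
    return urls


def make_cached_fetch(fetch_json):
    """Wrap `fetch_json(url) -> dict` so each URL is requested only once."""
    cache = {}

    def cached(url):
        if url not in cache:
            cache[url] = fetch_json(url)
        return cache[url]

    return cached
```

With the cache in place, looking up the same organizer for 45 events costs one request rather than 45.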
- Script development slowed considerably by lack of modularity and debugging functionality
- Modules to generate query url strings from input GET
- Module to create an empty pandas.DataFrame table based on input rows and columns
- Modules to retrieve information from venue and organizer data from their respective ID numbers
- To learn and operate the Komodo debugger and write appropriate tests for each module, detached from the main driver function
- To learn pandas.DataFrame and appropriate methods to update it
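A module that generates query URL strings, together with a test that runs detached from the main driver, could be sketched as below. This is one possible shape rather than the code in the repository: `build_query_url` and the test class are hypothetical names, and the sketch assumes Python 3 (for `urllib.parse`) even though the project targets Python 2.7.

```python
import unittest
from urllib.parse import urlencode

API_ROOT = "https://www.eventbriteapi.com/v3"


def build_query_url(endpoint, **params):
    """Return the full URL for an endpoint path plus optional query parameters.

    Parameters whose value is None are omitted; the rest are URL-encoded
    in sorted order so the output is deterministic and easy to test.
    """
    url = "{}/{}/".format(API_ROOT, endpoint.strip("/"))
    clean = {key: value for key, value in params.items() if value is not None}
    if clean:
        url += "?" + urlencode(sorted(clean.items()))
    return url


class BuildQueryUrlTest(unittest.TestCase):
    """A small test that exercises the module without the main driver."""

    def test_search_query_is_encoded(self):
        self.assertEqual(
            build_query_url("events/search", q="startup weekend"),
            API_ROOT + "/events/search/?q=startup+weekend",
        )
```

Running `python -m unittest` on the file would execute the test case; nothing here contacts the Eventbrite server.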
- Notes and Ideas
- Develop smart iteration to query for all events sought
- To create intelligent searches:
- Note that eventbrite is esp good for free events
- Note that past events may extend only to a certain point
- Note that Eventbrite was launched in 2006 and is the first major player in online event ticketing
- Category is always science and tech
- Organiser is impt; some entrepreneurship events are organised by known collectives
- Organiser description also has many impt keywords
- Keywords from the SEO material on Marketing Artfully are very good
- Event series, dates and venues endpoints are secondarily important
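One way to act on the observation that organizer descriptions carry useful keywords is to screen each description against a buzzword list, which may also help weed out unrelated science-and-technology seminars. The sketch below is only an assumption about how such a filter might work; `keyword_hits` is a hypothetical name and the matching rule (whole words or phrases, case-insensitive) is a design choice, not something the notes specify.

```python
import re


def keyword_hits(text, keywords):
    """Return the keywords found in `text` as whole words or phrases.

    Matching is case-insensitive. Word boundaries mean "startup" does not
    match "startups", so plural forms need their own entries in `keywords`.
    """
    hits = []
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text or "", flags=re.IGNORECASE):
            hits.append(keyword)
    return hits
```

An event could then be kept when `keyword_hits(description, buzzwords)` is non-empty, or ranked by the number of hits.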
7/7: Alpha Development #2
- Full swing: pseudo-code, modularity, docstrings, tests, naming style
- Komodo debugger works
- Alpha development complete. All tests passed. Complete code as below.
https://github.com/scroungemyvibe/mcnair_center_builds/blob/master/EventBrite_Webcrawler_Build.py
- Notes
- Current query (without input parameters) by organizer ID returns only active events listed under that organizer. For instance, Techstars has 45 upcoming events and I am pulling 45 JSON event objects from the database.
- Current build should be applied systematically to lists of organizer IDs
- Further build ideas/notes documented in code proper on the git
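Applying the organizer query to a list of organizer IDs might be sketched as follows. This is an illustration under stated assumptions, not the code in the repository: `fetch_events_for_organizers` is a hypothetical name, `session` is assumed to be any object with a requests-style `.get(url, headers=...)` method (for example `requests.Session()`), and the response body is assumed to hold its events under an `"events"` key. Results beyond the first page of a response are not handled.

```python
API_ROOT = "https://www.eventbriteapi.com/v3"


def fetch_events_for_organizers(organizer_ids, session, token):
    """Return {organizer_id: [event, ...]} for each organizer ID given.

    `session` must offer a requests-style .get(url, headers=...) method.
    Only the first page of each response is read in this sketch.
    """
    headers = {"Authorization": "Bearer {}".format(token)}
    results = {}
    for organizer_id in organizer_ids:
        url = "{}/organizers/{}/events/".format(API_ROOT, organizer_id)
        response = session.get(url, headers=headers)
        response.raise_for_status()  # stop on HTTP errors instead of storing bad data
        results[organizer_id] = response.json().get("events", [])
    return results
```

A real run could call `fetch_events_for_organizers(["2300226659"], requests.Session(), token)`, with the token read from configuration or an environment variable rather than written into the script.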