Difference between revisions of "Govtrack Webcrawler"
imported>Sahil (New page: The goal of the Govtrack Webcrawler is to create and automated system in perl by which bills relevant to a certian topic can be pulled from the Govtrack API which can be found [https://www...) |
imported>Sahil (Finished up documentation on how webcrawler was made) |
Latest revision as of 17:26, 1 February 2016
The goal of the Govtrack Webcrawler is to create an automated system in ActivePerl by which bills relevant to a certain topic can be pulled from the Govtrack API, which can be found at https://www.govtrack.us/api/v2/bill?congress=114&order_by=-current_status_date.
Process
Several libraries are used to perform this task. Most of them come with ActivePerl, but JSON::XS is also used to make parsing the JSON data simpler. The LWP::UserAgent and HTTP::Request libraries are used to pull data from the API.
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use JSON::XS;
Next the useragent object is created.
my $ua = LWP::UserAgent->new;
Next, the parameters used to search the API are set.
my $queryName = "insert query here";
my $congressNo = "insert congress num here";
my $limit = "insert maximum number of bills to search here";
Using these parameters, the URL can be constructed.
my $genUrl = "https://www.govtrack.us/api/v2/bill?order_by=-current_status_date&congress=" . $congressNo . "&q=" . $queryName . "&limit=" . $limit;
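Plain string concatenation produces a malformed URL when the query contains spaces or other reserved characters. A minimal sketch of one way around this uses the URI module, which is normally installed alongside LWP. The function name build_search_url and the example values are hypothetical and do not come from the original script.

```perl
use strict;
use warnings;
use URI;

# Build the bill search URL; query_form percent-encodes each value.
sub build_search_url {
    my ($query, $congress, $limit) = @_;
    my $uri = URI->new("https://www.govtrack.us/api/v2/bill");
    $uri->query_form(
        order_by => "-current_status_date",
        congress => $congress,
        q        => $query,
        limit    => $limit,
    );
    return $uri->as_string;
}

# Example values are placeholders, not taken from the original script.
print build_search_url("small business", 114, 50), "\n";
```

The returned string can be passed to $ua->get in place of a hand-concatenated URL.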
The useragent object can now retrieve the text at the URL and decode it into a JSON string.
my $genResponse = $ua->get($genUrl);
my $genContent = $genResponse->decoded_content;
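The request can fail (network error, bad parameters, server error), in which case the body is not usable JSON. The sketch below shows one way to check the response before using it. The helper name content_or_die is hypothetical, and the responses are built by hand so the example runs without network access; in the crawler, the argument would be the object returned by $ua->get.

```perl
use strict;
use warnings;
use HTTP::Response;

# Return the body of a successful response, or die with the status line.
sub content_or_die {
    my ($response) = @_;
    die "Request failed: " . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Hand-built responses stand in for $ua->get($genUrl) so this runs offline.
my $ok  = HTTP::Response->new(200, "OK", ["Content-Type" => "application/json"], '{"objects":[]}');
my $bad = HTTP::Response->new(500, "Internal Server Error");

print content_or_die($ok), "\n";
my $result = eval { content_or_die($bad) };
print "Caught: $@" unless defined $result;
```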
Using the JSON::XS module, the JSON string can be converted into a data structure that can be searched for relevant data.
my $JSONcontent = JSON::XS::decode_json($genContent);
Once the JSON data has been decoded, an array of all the bills matching the parameters can be found at
@{$JSONcontent->{objects}}
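As an illustration of walking this array, the sketch below decodes a small made-up JSON string shaped like the search response and loops over its objects. It uses the core JSON::PP module rather than JSON::XS so that it runs on a stock Perl install; both export a decode_json function. The sample data and the title field are assumptions for illustration; only the objects array and the id field are confirmed by the text above.

```perl
use strict;
use warnings;
use JSON::PP;   # core module, used here in place of JSON::XS

# Made-up sample shaped like the search response: a top-level "objects" array.
my $sample = '{"objects":[{"id":101,"title":"Example Bill A"},{"id":102,"title":"Example Bill B"}]}';
my $JSONcontent = decode_json($sample);

# Collect each bill's ID and print one line per bill.
my @ids;
for my $bill (@{ $JSONcontent->{objects} }) {
    push @ids, $bill->{id};
    print "$bill->{id}\t$bill->{title}\n";
}
```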
While iterating over this array, more specific information about each bill can be retrieved by constructing a URL from the bill's ID number:
my $billurl = "https://www.govtrack.us/api/v2/bill/" . $bill->{id};
After retrieving this page, the data can once again be parsed from a string into a data structure.
my $billresponse = $ua->get($billurl);
my $billcontent = JSON::XS::decode_json($billresponse->decoded_content);
From this bill-specific page, the tags of each bill can be used to determine whether the bill is relevant and should be reviewed by McNair Center staff. An array of these tags can be found at
@{$billcontent->{terms}}
The data retrieved from this search is then written to several tab-delimited text files containing sets of useful information about the bills deemed relevant.
The following tags are currently considered relevant:
- Commerce: ID 5914
- Business Investment and Capital: ID 5918
- Intellectual Property: ID 5927
- Small Business: ID 5935
- Advanced Technology and Technological Innovations: ID 6294
- Computers and Information Tech: ID 6300
- Small Business Administration: ID 6769
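The relevance check and the tab-delimited output described above could be sketched as follows. This is not the original script: the helper names is_relevant and tsv_line, the sample bills, and the title field are hypothetical, and the code assumes each element of the terms array is a hash with an id field matching the tag IDs listed above.

```perl
use strict;
use warnings;

# Relevant tag IDs from the list above, keyed for quick lookup.
my %relevant = map { $_ => 1 } (5914, 5918, 5927, 5935, 6294, 6300, 6769);

# Returns the number of the bill's terms that carry a relevant ID (0 if none).
# Assumes each element of "terms" is a hash with an "id" field.
sub is_relevant {
    my ($billcontent) = @_;
    return scalar grep { $relevant{ $_->{id} } } @{ $billcontent->{terms} || [] };
}

# Join fields into one tab-delimited line, replacing embedded tabs and newlines.
sub tsv_line {
    my @fields = map { my $f = defined $_ ? $_ : ""; $f =~ s/[\t\r\n]+/ /g; $f } @_;
    return join("\t", @fields) . "\n";
}

# Made-up bills for illustration only.
my @bills = (
    { id => 101, title => "Example Bill A", terms => [ { id => 5935 }, { id => 1 } ] },
    { id => 102, title => "Example Bill B", terms => [ { id => 2 } ] },
);

# Print one tab-delimited line for each relevant bill.
for my $bill (grep { is_relevant($_) } @bills) {
    print tsv_line($bill->{id}, $bill->{title});
}
```

To write to a file instead of standard output, the same lines can be printed to a filehandle opened with open(my $fh, ">", $path).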