PhD Masterclass - How to Build a Web Crawler
Revision as of 18:51, 31 January 2011
We wrote a couple of simple scripts together to get to grips with Perl.
===Running a Perl Script===
close OUTPUT;
#Close the output filehandle - this will flush the write buffer
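For context, the close line above is the last step of the usual open/print/close pattern for writing a file. The following is a minimal sketch of that pattern, not the script from the session: the helper name write_lines and the temporary file are assumptions made for illustration, and it uses a lexical filehandle instead of the bareword OUTPUT.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write each line to $path, one line per row
sub write_lines {
    my ($path, @lines) = @_;
    open(my $out, '>', $path) or die "Cannot open $path for writing: $!";
    print {$out} "$_\n" for @lines;
    # Closing the filehandle flushes the write buffer to disk
    close($out) or die "Cannot close $path: $!";
    return scalar @lines;
}

# Hypothetical usage with a temporary file that is removed at exit
my (undef, $path) = tempfile(UNLINK => 1);
my $written = write_lines($path, 'first line', 'second line');
```

Checking the return value of both open and close means a full disk or a permissions problem is reported rather than silently losing output.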
==Modules==
One of the joys of Perl is [http://www.cpan.org/ CPAN - The Comprehensive Perl Archive Network], which acts as a repository for Perl modules (as well as scripts, distros and much else). There are modules written by people from all over the world for almost every conceivable purpose, so there is usually no need to reinvent the wheel in Perl - just grab a module (e.g. Wheel::Base)!
We tested some code using LWP::UserAgent and HTML::TreeBuilder. Useful documentation is here:
*[http://search.cpan.org/~gaas/libwww-perl-5.837/lib/LWP/UserAgent.pm LWP::UserAgent]
*[http://search.cpan.org/~petdance/WWW-Mechanize-1.66/lib/WWW/Mechanize.pm WWW::Mechanize]
*[http://search.cpan.org/~gaas/libwww-perl-5.837/lib/HTTP/Response.pm HTTP::Response]
*[http://search.cpan.org/~jfearn/HTML-Tree-4.1/lib/HTML/TreeBuilder.pm HTML::TreeBuilder]
*[http://search.cpan.org/~jfearn/HTML-Tree-4.1/lib/HTML/Element.pm HTML::Element]
*[http://annocpan.org/~GAAS/libwww-perl-5.837/lib/LWP/RobotUA.pm LWP::RobotUA]
*[http://annocpan.org/~GRANTM/XML-Simple-2.18/lib/XML/Simple.pm XML::Simple]
Below is a simple UserAgent example:
use LWP::UserAgent;
#Load the LWP::UserAgent module
my $ua = LWP::UserAgent->new;
#Create a new UserAgent object
my $url="http://www.contractormisconduct.org/index.cfm/1,73,222,html?CaseID=2";
#Set up a string containing a URL
my $response = $ua->get($url);
#Use the UA 'get' method to retrieve the webpage. This returns an HTTP::Response object
die "Could not fetch $url: ", $response->status_line unless $response->is_success;
#Stop and report the HTTP status line if the request failed
my $content=$response->decoded_content;
#Get the decoded response body as one long text string, so we can work with it...
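For a crawler that requests many pages, the same get call can go through LWP::RobotUA (linked in the documentation list above), which consults each site's robots.txt and spaces out requests to the same host. This is a sketch under stated assumptions rather than code from the session: the agent name, the contact address and the helper polite_fetch are all placeholders.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Assumed values: the agent name and contact address are placeholders
my $robot = LWP::RobotUA->new(
    agent => 'MasterclassCrawler/0.1',
    from  => 'crawler@example.org',
);
# Wait at least 0.5 minutes (30 seconds) between requests to the same host
$robot->delay(0.5);

# Hypothetical helper: fetch a page and return its text, or undef on failure
sub polite_fetch {
    my ($ua, $url) = @_;
    my $response = $ua->get($url);
    return $response->is_success ? $response->decoded_content : undef;
}
```

Calling polite_fetch($robot, $url) would make the robot check the site's robots.txt first; a page that the site disallows comes back as a failed response, so the helper returns undef.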
And now for a TreeBuilder example:
use HTML::TreeBuilder;
#Load the HTML::TreeBuilder module
my $tree = HTML::TreeBuilder->new; # empty tree
#Create a new tree object
$tree->parse($content);
$tree->eof;
#Load up the tree from the content string (that we got using UA), then call eof to signal the end of the document
my $dump=$tree->as_text;
#Dump the text content of the tree, with the tags stripped out
my $incidentelement=$tree->look_down("id","primecontent");
#Or use the HTML::Element method look_down to find the first element whose id attribute is "primecontent"
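A crawler usually needs the links inside an element rather than the element itself. The sketch below shows one way to collect them with look_down and attr. The helper name links_in_element and the sample HTML string are assumptions for illustration, so it runs without fetching the real page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the href of every <a> inside the element with the given id
sub links_in_element {
    my ($html, $id) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;
    my @links;
    # In scalar context look_down returns the first matching element, or undef
    my $container = $tree->look_down('id', $id);
    if ($container) {
        # In list context look_down returns every matching descendant
        for my $anchor ($container->look_down('_tag', 'a')) {
            my $href = $anchor->attr('href');
            push @links, $href if defined $href;
        }
    }
    # Free the tree explicitly once the plain strings have been copied out
    $tree->delete;
    return @links;
}

# Made-up sample document standing in for a fetched page
my $sample = '<html><body><div id="primecontent"><a href="/case/1">One</a> '
           . '<a href="/case/2">Two</a></div><a href="/elsewhere">Other</a></body></html>';
my @links = links_in_element($sample, 'primecontent');
```

Only the two links inside the primecontent element are collected; the link outside it is ignored, which keeps the crawler within the part of the page that holds the data.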