Comments:"Web Scraping with Modern Perl (Part 2 - Speed Edition) | stas [blogs.perl.org]"
URL:http://blogs.perl.org/users/stas/2013/02/web-scraping-with-modern-perl-part-2---speed-edition.html
tl;dr
Grab the gist with the complete, working source code. Benchmark it against the one featured in the previous article:
$ \time perl mojo-crawler.pl
23.08user 0.07system 2:11.99elapsed 17%CPU (0avgtext+0avgdata 84064maxresident)k
16inputs+0outputs (0major+6356minor)pagefaults 0swaps
$ \time perl yada-crawler.pl
8.83user 0.30system 0:12.60elapsed 72%CPU (0avgtext+0avgdata 131008maxresident)k
0inputs+0outputs (0major+8607minor)pagefaults 0swaps
How can it be 10x faster while consuming less than half the CPU time?!
Perl as a glue
Sorry, I cheated a bit on the mojo-crawler.pl benchmark results.
Mojolicious implicitly uses EV, a high-performance, full-featured event loop library, whenever it is installed, even though EV is not required for Mojolicious to work properly.
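By the way, a quick check along these lines (a sketch, assuming a reasonably recent Mojolicious) reveals which reactor backend Mojo::IOLoop has picked:
#!/usr/bin/env perl
use 5.016;
use Mojo::IOLoop;

# Prints Mojo::Reactor::EV when the EV module is installed (and MOJO_REACTOR
# does not override it), Mojo::Reactor::Poll otherwise.
say ref(Mojo::IOLoop->singleton->reactor);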
Let's disable it:
$ MOJO_REACTOR=Mojo::Reactor::Poll time perl mojo-crawler.pl
113.99user 13.37system 2:08.46elapsed 99%CPU (0avgtext+0avgdata 83808maxresident)k
2912inputs+0outputs (18major+5789minor)pagefaults 0swaps
The elapsed time is the same with or without EV, but now the pure-Perl crawler hogs the CPU!
Why? EV provides an interface to libev, which clearly does a better job of connection polling than 100% interpreted code. The bridge between Perl and the compiled library is called XS; as the perlxs documentation puts it, "XS is an interface description file format used to create an extension interface between Perl and C code (or a C library) which one wishes to use with Perl."
In fact, CPAN is full of high-performance XS-based modules for many tasks; the crawler below relies on EV (libev), AnyEvent::Net::Curl::Queued (libcurl, via Net::Curl) and HTML::TreeBuilder::LibXML (libxml2), among others.
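For the curious, an XS interface file looks roughly like this (a minimal sketch for a hypothetical Acme::FastSum module, not something the crawler uses; real bindings are usually scaffolded with h2xs or Module::Starter and are far more involved):
/* Hypothetical Acme/FastSum.xs -- minimal XS sketch, not part of the crawler */
#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"

MODULE = Acme::FastSum    PACKAGE = Acme::FastSum

int
add(a, b)
    int a
    int b
  CODE:
    /* plain C runs here; RETVAL is marshalled back into a Perl scalar */
    RETVAL = a + b;
  OUTPUT:
    RETVAL
From the Perl side, Acme::FastSum::add() would then be called like any other subroutine.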
Thus, an efficient and fast web crawler/scraper could be constructed with those "bare-metal" building blocks ;)
Show me the code!
#!/usr/bin/env perl
use 5.016;
use common::sense;
use utf8::all;

# Use fast binary libraries
use EV;
use Web::Scraper::LibXML;
use YADA 0.039;

YADA->new(
    common_opts => {
        # Available opts @ http://curl.haxx.se/libcurl/c/curl_easy_setopt.html
        encoding        => '',
        followlocation  => 1,
        maxredirs       => 5,
    }, http_response => 1, max => 4,
)->append([qw[
    http://sysd.org/page/1/
    http://sysd.org/page/2/
    http://sysd.org/page/3/
]] => sub {
    my ($self) = @_;
    return if $self->has_error
        or not $self->response->is_success
        or not $self->response->content_is_html;

    # Declare the scraper once and then reuse it
    state $scraper = scraper {
        process q(html title), title => q(text);
        process q(a), q(links[]) => q(@href);
    };

    # Employ amazing Perl (en|de)coding powers to handle HTML charsets
    my $doc = $scraper->scrape(
        $self->response->decoded_content,
        $self->final_url,
    );

    printf qq(%-64s %s\n), $self->final_url, $doc->{title};

    # Enqueue links from the parsed page
    $self->queue->prepend([
        grep {
            $_->can(q(host)) and $_->scheme =~ m{^https?$}x
                and $_->host eq $self->initial_url->host
                and (grep { length } $_->path_segments) <= 3
        } @{$doc->{links} // []}
    ] => __SUB__);
})->wait;
Now what?!
The example above has half the lines of code of the previous one. This comes at the cost of installing a bunch of external dependencies from the CPAN:
$ cpanm AnyEvent::Net::Curl::Queued EV HTML::TreeBuilder::LibXML Web::Scraper utf8::all
Despite the use 5.016 pragma, this code works fine on Perl 5.10 if you get rid of the __SUB__ reference.
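If you do need to stay on 5.10, one way (a sketch, not from the original gist) is to hold the callback in a lexical and recurse through that instead of __SUB__:
#!/usr/bin/env perl
# Sketch: a self-referencing lexical instead of __SUB__ (works on Perl 5.10).
use strict;
use warnings;

my $callback;
$callback = sub {
    my ($n) = @_;
    return if $n <= 0;
    print "level $n\n";
    $callback->($n - 1);    # recurse via the lexical instead of __SUB__
};

$callback->(3);
undef $callback;            # break the closure's self-reference when done
In the crawler itself you would pass $callback to both append and prepend where __SUB__ currently sits.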
So, which approach is the better one? Obviously, it depends. There is no silver bullet: web crawling is ultimately I/O-bound! Specialized, well-tested libraries simply help keep the bottleneck in the network rather than in the CPU.
For instance, trimming the ::LibXML part from the use Web::Scraper::LibXML statement slows down our tiny crawler considerably, because the pure-Perl HTML parsing then consumes more CPU cycles than the connection polling.
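To see the difference for yourself, a rough benchmark along these lines (a sketch, assuming both parsers are installed and you feed it a saved HTML page) should show the libxml2-backed tree builder pulling well ahead:
#!/usr/bin/env perl
# Rough sketch: compare the pure-Perl and libxml2-backed HTML tree builders
# on a page saved to disk (perl bench.pl page.html).
use strict;
use warnings;
use Benchmark qw(cmpthese);
use HTML::TreeBuilder;
use HTML::TreeBuilder::LibXML;

my $html = do { local $/; <> };

cmpthese(-2, {
    pure_perl => sub {
        my $tree = HTML::TreeBuilder->new;
        $tree->parse($html);
        $tree->eof;
        $tree->delete;    # pure-Perl trees must be freed explicitly
    },
    libxml => sub {
        my $tree = HTML::TreeBuilder::LibXML->new;
        $tree->parse($html);
        $tree->eof;
    },
});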
As an edge case, let's see how the venerable GNU Wget tool behaves (see also yada, which comes bundled with the AE::N::C::Queued distribution):
$ "time" wget -r --follow-tags a http://sysd.org/
0.23user 0.41system 1:10.20elapsed 0%CPU (0avgtext+0avgdata 23920maxresident)k
0inputs+40704outputs (0major+4323minor)pagefaults 0swaps
Despite its clear disadvantage of using a single connection, it is almost completely I/O-bound, since its URL extraction code doesn't require a complete parse of the HTML.
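To appreciate why that matters, here is a deliberately crude sketch (not wget's actual code) of regex-based link extraction: scanning for href attributes costs far less CPU than building a full parse tree, at the price of robustness:
#!/usr/bin/env perl
# Crude illustration only: pull href attributes out of raw HTML with a regex,
# skipping the full parse. Cheap on CPU, fragile on real-world markup.
use strict;
use warnings;

my $html = do { local $/; <> };    # feed it a saved page on STDIN
my @links = $html =~ m{<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']}gi;
print "$_\n" for @links;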