Web Scraping with Modern Perl (Part 2 - Speed Edition) | stas [blogs.perl.org]

URL: http://blogs.perl.org/users/stas/2013/02/web-scraping-with-modern-perl-part-2---speed-edition.html


tl;dr

Grab the gist with the complete, working source code. Benchmark it against the one featured in the previous article:

$ \time perl mojo-crawler.pl
23.08user 0.07system 2:11.99elapsed 17%CPU (0avgtext+0avgdata 84064maxresident)k
16inputs+0outputs (0major+6356minor)pagefaults 0swaps
$ \time perl yada-crawler.pl
8.83user 0.30system 0:12.60elapsed 72%CPU (0avgtext+0avgdata 131008maxresident)k
0inputs+0outputs (0major+8607minor)pagefaults 0swaps

How can it be 10x faster while consuming less than half the CPU resources?!

Perl as a glue

Sorry, I cheated a bit on the mojo-crawler.pl benchmark results: Mojolicious implicitly uses EV, a high-performance, full-featured event loop library, whenever it is installed, even though EV is not required for Mojolicious to work properly. Let's disable it:

$ MOJO_REACTOR=Mojo::Reactor::Poll time perl mojo-crawler.pl
113.99user 13.37system 2:08.46elapsed 99%CPU (0avgtext+0avgdata 83808maxresident)k
2912inputs+0outputs (18major+5789minor)pagefaults 0swaps
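
To double-check which reactor a given Mojolicious installation picks up, a quick sanity check like the following one-liner should do; it prints Mojo::Reactor::EV when EV is available and Mojo::Reactor::Poll otherwise:

$ perl -MMojo::IOLoop -E 'say ref Mojo::IOLoop->singleton->reactor'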

The elapsed time is the same with or without EV, but now the pure-Perl crawler hogs the CPU!

Why? EV provides an interface to libev, which clearly does a better job of connection polling than 100% interpreted code. The bridge between Perl and the compiled library is called XS:

XS is an interface description file format used to create an extension interface between Perl and C code (or a C library) which one wishes to use with Perl.
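
For a feel of how thin that bridge is, here is a minimal, crawler-agnostic EV sketch: the watcher is created from Perl, but the actual waiting happens inside compiled libev code.

use strict;
use warnings;
use EV;

# One-shot timer watcher: fires once, one second from now
my $w = EV::timer 1, 0, sub { print "tick\n" };

# Run the libev event loop; it returns once no active watchers remain
EV::run;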

Actually, CPAN is full of high-performance XS-based modules for many tasks; the ones used below are EV (libev bindings), Net::Curl (libcurl bindings, which drive AnyEvent::Net::Curl::Queued and its YADA shortcut) and HTML::TreeBuilder::LibXML (libxml2-backed HTML parsing behind Web::Scraper::LibXML).

Thus, an efficient and fast web crawler/scraper can be constructed from those "bare-metal" building blocks ;)

Show me the code!

#!/usr/bin/env perl
use 5.016;
use common::sense;
use utf8::all;

# Use fast binary libraries
use EV;
use Web::Scraper::LibXML;
use YADA 0.039;

YADA->new(
    common_opts => {
        # Available opts @ http://curl.haxx.se/libcurl/c/curl_easy_setopt.html
        encoding        => '',
        followlocation  => 1,
        maxredirs       => 5,
    }, http_response => 1, max => 4,
)->append([qw[
    http://sysd.org/page/1/
    http://sysd.org/page/2/
    http://sysd.org/page/3/
]] => sub {
    my ($self) = @_;
    return if $self->has_error
        or not $self->response->is_success
        or not $self->response->content_is_html;

    # Declare the scraper once and then reuse it
    state $scraper = scraper {
        process q(html title), title => q(text);
        process q(a), q(links[]) => q(@href);
    };

    # Employ amazing Perl (en|de)coding powers to handle HTML charsets
    my $doc = $scraper->scrape(
        $self->response->decoded_content,
        $self->final_url,
    );
    printf qq(%-64s %s\n), $self->final_url, $doc->{title};

    # Enqueue links from the parsed page
    $self->queue->prepend([
        grep {
            $_->can(q(host)) and $_->scheme =~ m{^https?$}x
                and $_->host eq $self->initial_url->host
                and (grep { length } $_->path_segments) <= 3
        } @{$doc->{links} // []}
    ] => __SUB__);
})->wait;

Now what?!

The example above has half the lines of code of the previous one, but this comes at the cost of installing a bunch of external dependencies from CPAN:

$ cpanm AnyEvent::Net::Curl::Queued EV HTML::TreeBuilder::LibXML Web::Scraper utf8::all

Despite the use 5.016 pragma, this code works fine on Perl 5.10 if you get rid of the __SUB__ reference.
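
A minimal sketch of the usual pre-5.16 workaround: hold the callback in a lexical and pass that lexical wherever __SUB__ was used. The body below is trimmed down; in the real crawler it is the same callback as above, and @links stands for whatever URLs you extract.

use strict;
use warnings;
use YADA;

my $crawl;
$crawl = sub {
    my ($self) = @_;
    return if $self->has_error;
    printf qq(%s\n), $self->final_url;
    # In the real crawler: scrape the page, then re-enqueue the extracted
    # links with the named reference instead of __SUB__, e.g.
    # $self->queue->prepend([ @links ] => $crawl);
};

YADA->new->append([qw[ http://sysd.org/ ]] => $crawl)->wait;

The closure holding a reference to itself is a circular reference, which is harmless in a short-lived script like this one; otherwise, weaken it with Scalar::Util.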

So, which approach is the better one? Obviously, it depends. There is no silver bullet: web crawling is ultimately I/O-bound! However, it is the specialized and well-tested libraries that keep it I/O-bound in practice. For instance, trimming the ::LibXML part from the use Web::Scraper::LibXML statement slows down our tiny crawler considerably, because HTML parsing then eats more CPU cycles than connection polling.
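
The switch is essentially one use line, since Web::Scraper::LibXML provides the same scraper/process interface and only the parser underneath changes; re-run the \time benchmark after flipping it to see the difference on your own hardware:

# XS parser: HTML::TreeBuilder::LibXML on top of libxml2
use Web::Scraper::LibXML;

# Pure-Perl parser: HTML::TreeBuilder::XPath (same API, far more CPU-hungry here)
#use Web::Scraper;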

As an edge case, let's see how the venerable GNU Wget tool behaves (see also yada, which comes bundled with the AE::N::C::Queued distribution):

$ "time" wget -r --follow-tags a http://sysd.org/
0.23user 0.41system 1:10.20elapsed 0%CPU (0avgtext+0avgdata 23920maxresident)k
0inputs+40704outputs (0major+4323minor)pagefaults 0swaps

Despite its clear disadvantage of using a single connection, it is almost completely I/O-bound, since its URL extraction code doesn't require complete parsing of the HTML.

