Friday, October 06, 2006

Scrape Google AdWords

Scrape the AdWords from a saved Google results page into a form suitable for importing into a spreadsheet or database.

Google's AdWords (the text ads that appear to the right of the regular search results) are delivered on a cost-per-click basis, and purchasers of the AdWords are allowed to set a ceiling on the amount of money they spend on their ad. This means that, even if you run a search for the same query word multiple times, you won't necessarily get the same set of ads each time.

If you're considering using Google AdWords to run ads, you might want to gather up and save the ads that are running for the query words that interest you. Google AdWords is not included in the functionality provided by the Google API, so you have to do a little scraping to get at that data.

The Code

Save this code to a text file named adwords.pl:

#!/usr/bin/perl

# usage: perl adwords.pl results.html
#
use strict;
use HTML::TokeParser;

die "I need at least one file\n"
    unless @ARGV;

my @Ads;
for my $file (@ARGV) {
    # skip if the file doesn't exist
    # you could add more file testing here.
    # errors go to STDERR so they won't
    # pollute our csv file

    unless (-e $file) {
        warn "What??: $file -- $! \n-- skipping --\n";
        next;
    }

    # now parse the file
    my $p = HTML::TokeParser->new($file);
    while (my $token = $p->get_token) {
        # keep only <a> start tags whose id looks like aw1, aw2, ...
        next unless $token->[0] eq 'S'
            and $token->[1] eq 'a'
            and $token->[2]{id} =~ /^aw\d$/;
        my $link = $token->[2]{href};
        my $ad;
        if ($link =~ /pagead/) {
            # the destination is carried in the adurl= parameter
            my ($url) = $link =~ /adurl=([^\&]+)/;
            $ad->{href} = $url;
        } elsif ($link =~ m{^/url\?}) {
            # the destination is carried in the q= parameter;
            # undo the URL-encoding of ?, =, and %
            my ($url) = $link =~ /\&q=([^&]+)/;
            $url =~ s/%3F/\?/;
            $url =~ s/%3D/=/g;
            $url =~ s/%25/%/g;
            $ad->{href} = $url;
        }
        $ad->{adwords} = $p->get_trimmed_text('/a');
        $ad->{desc}    = $p->get_trimmed_text('/font');
        # the displayed URL is the last word of the description
        ($ad->{url}) = $ad->{desc} =~ /([\S]+)$/;
        push(@Ads, $ad);
    }
}

# header row, then one row per ad
print quoted( qw( AdWords HREF Description URL ) );
for my $ad (@Ads) {
    print quoted( @$ad{qw( adwords href desc url )} );
}

# wrap each field in double quotes and join with commas
sub quoted {
    return join( ",", map { "\"$_\"" } @_ )."\n";
}
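
The two branches that recover the destination URL can be tried on their own, without a saved results page. What follows is a minimal sketch rather than part of the original script: extract_target is a hypothetical helper name, and the sample links are invented for illustration, not copied from a real Google page. It uses the same two patterns as adwords.pl, with one labeled difference: it decodes every percent-escape instead of only %3F, %3D, and %25.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the advertiser's URL out of an ad link.
# Mirrors the two cases handled in adwords.pl.
sub extract_target {
    my ($link) = @_;
    my $url;
    if ($link =~ /pagead/) {
        # /pagead/... links carry the target in an adurl= parameter
        ($url) = $link =~ /adurl=([^&]+)/;
    } elsif ($link =~ m{^/url\?}) {
        # /url?... links carry the target in a q= parameter
        ($url) = $link =~ /&q=([^&]+)/;
        # decode all percent-escapes (the script handles only three)
        $url =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge if defined $url;
    }
    return $url;
}

# Invented sample links, for illustration only
my @samples = (
    '/pagead/iclk?sa=l&ai=XYZ&adurl=http://www.example.com/cars',
    '/url?sa=l&q=http://www.example.com/deals%3Fid%3D42&ai=XYZ',
);
print "$_\n" for map { extract_target($_) } @samples;
```

A link that matches neither pattern yields undef, which is also what the script stores in that case, so an empty HREF column in the output usually means Google changed the link format.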

Running the Code

Call this script on the command line, providing the name of the saved Google results page and a file in which to put the CSV results:

% perl adwords.pl input.html > output.csv

input.html is the name of the Google results page that you've saved. output.csv is the name of the comma-delimited file to which you want to save your results. You can also provide multiple input files on the command line if you'd like:

% perl adwords.pl input.html input2.html > output.csv

The results appear in a comma-delimited format, with a header row followed by one line per ad.
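
One caveat about this format: the script's quoted routine wraps each field in double quotes but does not escape a double quote that appears inside an ad's text, which can confuse a spreadsheet import. A hedged sketch of a variant follows. The name quoted_csv is hypothetical and not in the original, and the ad text is invented. It doubles embedded quotes, the convention most CSV importers expect, and turns undefined fields into empty strings.

```perl
use strict;
use warnings;

# Hypothetical replacement for quoted(): doubles any embedded
# double quotes so each field stays intact on import.
sub quoted_csv {
    my @fields = map { defined $_ ? $_ : '' } @_;   # undef becomes empty
    for (@fields) { s/"/""/g }                       # " becomes ""
    return join(',', map { qq("$_") } @fields) . "\n";
}

# Invented ad text, for illustration only
print quoted_csv('Buy a "New" Car', 'http://www.example.com/',
                 'Low prices www.example.com', 'www.example.com');
```

Swapping this in for quoted leaves the column order and header row unchanged.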

The script returns the AdWords headline, the link URL, the description in the ad, and the URL on the ad (this is the URL that appears in the ad text, while the HREF is what the URL links to). With the file in hand, you can open output.csv in Excel and see which companies are using which headlines and descriptions. Scraping AdWords is a quick way to get a feel for how others are using the service.
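
If you would rather not open a spreadsheet, the same tally can be done in Perl. This is a rough sketch under stated assumptions: count_by_url is a hypothetical helper, the rows are invented, and the simple pattern assumes no field contains an embedded double quote. It counts how many ads share each displayed URL.

```perl
use strict;
use warnings;

# Hypothetical helper: count ads per displayed URL, given the CSV
# lines (header first) in the format adwords.pl writes.
# Assumes fields hold no embedded double quotes.
sub count_by_url {
    my @lines = @_;
    shift @lines;                        # drop the header row
    my %count;
    for my $line (@lines) {
        my @f = $line =~ /"([^"]*)"/g;   # AdWords, HREF, Description, URL
        next unless @f == 4;             # skip malformed rows
        $count{ lc $f[3] }++;
    }
    return %count;
}

# Invented rows, for illustration only
my %seen = count_by_url(
    qq("AdWords","HREF","Description","URL"\n),
    qq("New Cars","http://www.example.com/","Great deals www.example.com","www.example.com"\n),
    qq("Car Quotes","http://www.example.com/q","Free quotes www.example.com","www.example.com"\n),
    qq("Used Cars","http://www.example.org/","Local listings www.example.org","www.example.org"\n),
);
printf "%-20s %d\n", $_, $seen{$_} for sort keys %seen;
```

To run it against a real file, read the lines of output.csv into a list and pass that list to count_by_url in place of the sample rows.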
