Screen scraping can give you access to Amazon.com community features not yet implemented through Amazon.com's public Web Services API. In this hack, we'll implement a script to scrape customer buying advice. Customer buying advice isn't available through the Web Services API, so if you'd like to include this information on a remote site, you'll have to scrape it from Amazon.com's site.

The first step is knowing where to find all the customer advice on one page. The following URL links directly to the advice page for a given ASIN (the unique ID Amazon.com displays for each product [Hack #52]):

    http://amazon.com/o/tg/detail/-/insert ASIN/?vi=advice

For example, here is the advice page for Mac OS X Hacks:

    http://amazon.com/o/tg/detail/-/0596004605/?vi=advice

The Code

This Perl script splits the advice page into two variables, based on the headings "in addition to" and "instead of." It then loops through those sections, using regular expressions to match each product's information, and formats and prints the results.

Save the following script to a file called get_advice.pl:

    #!/usr/bin/perl -w
    # get_advice.pl
    #
    # A script to scrape Amazon to retrieve customer buying advice
    # Usage: perl get_advice.pl <asin>
    use strict;
    use LWP::Simple;

    # Take the ASIN from the command line.
    my $asin = shift @ARGV or die "Usage: perl get_advice.pl <asin>\n";

    # Assemble the URL from the passed ASIN.
    my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=advice";

    # Set up unescape-HTML rules. Quicker than URI::Escape.
    my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
    my $unescape_re = join '|' => keys %unescape;

    # Request the URL.
    my $content = get($url);
    die "Could not retrieve $url" unless $content;

    # Get our matching data.
    my ($inAddition) = (join '', $content) =~ m!in addition to(.*?)(instead of)?</td></tr>!mis;
    my ($instead) = (join '', $content) =~ m!recommendations instead of(.*?)</table>!mis;

    # Look for "in addition to" advice.
    if ($inAddition) {
        print "-- In Addition To --\n\n";
        while ($inAddition =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!mgis) {
            # Capture the rank, ASIN, title, and recommendation count.
            my ($place,$thisAsin,$title,$number) = ($1||'',$2||'',$3||'',$4||'');
            $title =~ s/($unescape_re)/$unescape{$1}/g; #unescape HTML
            print "$place $title ($thisAsin)\n(Recommendations: $number)\n\n";
        }
    }

    # Look for "instead of" advice.
    if ($instead) {
        print "-- Instead Of --\n\n";
        while ($instead =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!mgis) {
            # Capture the rank, ASIN, title, and recommendation count.
            my ($place,$thisAsin,$title,$number) = ($1||'',$2||'',$3||'',$4||'');
            $title =~ s/($unescape_re)/$unescape{$1}/g; #unescape HTML
            print "$place $title ($thisAsin)\n(Recommendations: $number)\n\n";
        }
    }

Running the Hack

You can run this script from the command line, passing in any ASIN. Here is the one for Mac OS X Hacks:

    % perl get_advice.pl 0596004605
    -- In Addition To --

    1. Mac OS X: The Missing Manual, Second Edition (0596004508)
    (Recommendations: 1)

    2. Mac Upgrade and Repair Bible, Third Edition (0764525948)
    (Recommendations: 1)

If the book has long lists of alternate products, send the output to a text file. This example sends all alternate product recommendations for Google Hacks to a file called advice.txt:

    % perl get_advice.pl 0596004478 > advice.txt
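The entity-unescaping step the script applies to each title can be tried on its own, without fetching a page. What follows is a minimal sketch rather than part of get_advice.pl: the unescape_html helper and the sample string are hypothetical names introduced for illustration, and the table assumes only the three entities the script handles. It shows the hash keys being joined into a single alternation so that each match is replaced by its lookup value in one pass.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Entity table, assumed to cover only the three entities handled above.
my %unescape = (
    '&quot;' => '"',
    '&amp;'  => '&',
    '&nbsp;' => ' ',
);

# Join the keys with '|' so the pattern matches any one entity.
# quotemeta escapes the '&' and ';' so they are always taken literally.
my $unescape_re = join '|', map { quotemeta } sort keys %unescape;

# Hypothetical helper: replace every known entity with its character.
sub unescape_html {
    my ($text) = @_;
    $text =~ s/($unescape_re)/$unescape{$1}/g;
    return $text;
}

print unescape_html('Mac OS X &amp; &quot;Unix&quot;&nbsp;Hacks'), "\n";
# prints: Mac OS X & "Unix" Hacks
```

Because the substitution runs in a single pass, already-decoded text is not decoded a second time: the input &amp;quot; comes out as the literal text &quot; rather than as a quotation mark.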
Paul Bausch