Spidering Hacks
Format: Paperback
Author: Kevin Hemenway
ReleaseDate: 01 November, 2003
Publisher: O'Reilly Media
Perl-intensive book on web crawler design
This book is about how to create programs that perform the functions of a web crawler, with most of the Hacks being written in Perl. A spider (also known as a web crawler or web robot) is a program that browses the World Wide Web in a methodical, automated manner. Like the rest of the Hacks series, this book presents 100 bite-sized chunks of code or technique to tackle specific activities. In this book these range from the simple (how to download a set of image files) to the complex (cross-referencing the output from one site with another to generate a third set of data). Whatever the complexity, each hack is clearly explained, with the code samples balanced by instructions, examples and notes on how to hack the hack.
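To make "methodical, automated" concrete, here is a minimal sketch of a breadth-first crawl. It is written in Python rather than the book's Perl and is not taken from the book. The `LinkParser` class, the `crawl` function and the in-memory `FAKE_SITE` are all hypothetical names invented for illustration; a real spider would fetch pages over HTTP instead of from a dictionary.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl. `fetch` maps a URL to HTML text, or None."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links against the page they appeared on.
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "http://example.test/": '<a href="a.html">A</a> <a href="/b.html">B</a>',
    "http://example.test/a.html": '<a href="/b.html">B</a>',
    "http://example.test/b.html": '<a href="/">home</a>',
}

order = crawl("http://example.test/", FAKE_SITE.get)
print(order)
```

The `seen` set is what keeps the crawl from looping forever when pages link back to each other, and the queue is what makes the traversal methodical rather than random.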
As already mentioned, the hacks in this book mostly use Perl, though scattered here and there you'll find some Java, Python and PHP. If you really hate Perl, then you will not like this book. On the other hand, the authors assume only a rudimentary knowledge of Perl, and no knowledge of network programming is required. After the opening chapter, which gives guidance on being a good spidering citizen (how to respect the sites you are taking data from), there is a second chapter which details how to create a spidering toolkit (how to find and install the suite of modules that many of the hacks depend on).
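One common piece of good spidering citizenship is honoring a site's robots.txt. The sketch below uses Python's standard `urllib.robotparser` rather than the Perl modules the book covers, so it illustrates the idea and is not the book's code. The robots.txt content, the `ExampleSpider/0.1` agent name and the `allowed` helper are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real spider would download the
# file from the target site before crawling it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(ROBOTS_TXT)


def allowed(url, agent="ExampleSpider/0.1"):
    """Return True if the parsed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


print(allowed("http://example.test/index.html"))
print(allowed("http://example.test/private/data.html"))
```

A spider would call a check like `allowed` before every request and skip any URL the site has asked robots to stay away from.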
With a toolkit in place and a knowledge of good behavior, the book dives into the various hacks that are organized by topic: collecting media files, gleaning data from databases (with many examples for Yahoo!, Amazon, Google, Alexa and other popular information sources), maintaining your collections (more automation with "cron" or other scheduling tools) and a final chapter on giving something back (creating a web service, generating RSS feeds and so on).
The bulk of the hacks are in chapter four, which looks at extracting data from databases. Aside from the obvious sources such as Amazon and Google, these include online banks, FedEx package tracking and more. A range of techniques is used to grab and filter the data, so even if a data source you want to use isn't listed, the chances are that one of these hacks can be refactored to do what you want.
If Perl is not your thing, then the very light sprinkling of non-Perl hacks probably isn't enough to make this a worthwhile purchase. If you're a Perl hacker interested in spidering, there is without doubt a ton of material for you here. If you are a student looking for a good supplement on building a web spider from scratch, this is probably not the book for you either, but the various hacks will give you some ideas on what you might want to do in your own spider if you wish to write one in a higher-level language such as Java. Amazon does not show the table of contents, so I list it here for completeness:
Chapter 1. Walking Softly
1. A Crash Course in Spidering and Scraping
2. Best Practices for You and Your Spider
3. Anatomy of an HTML Page
4. Registering Your Spider
5. Preempting Discovery
6. Keeping Your Spider Out of Sticky Situations
7. Finding the Patterns of Identifiers
Chapter 2. Assembling a Toolbox
Perl Modules
Resources You May Find Helpful
8. Installing Perl Modules
9. Simply Fetching with LWP::Simple
10. More Involved Requests with LWP::UserAgent
11. Adding HTTP Headers to Your Request
12. Posting Form Data with LWP
13. Authentication, Cookies, and Proxies
14. Handling Relative and Absolute URLs
15. Secured Access and Browser Attributes
16. Respecting Your Scrapee's Bandwidth
17. Respecting robots.txt
18. Adding Progress Bars to Your Scripts
19. Scraping with HTML::TreeBuilder
20. Parsing with HTML::TokeParser
21. WWW::Mechanize 101
22. Scraping with WWW::Mechanize
23. In Praise of Regular Expressions
24. Painless RSS with Template::Extract
25. A Quick Introduction to XPath
26. Downloading with curl and wget
27. More Advanced wget Techniques
28. Using Pipes to Chain Commands
29. Running Multiple Utilities at Once
30. Utilizing the Web Scraping Proxy
31. Being Warned When Things Go Wrong
32. Being Adaptive to Site Redesigns
Chapter 3. Collecting Media Files
33. Detective Case Study: Newgrounds
34. Detective Case Study: iFilm
35. Downloading Movies from the Library of Congress
36. Downloading Images from Webshots
37. Downloading Comics with dailystrips
38. Archiving Your Favorite Webcams
39. News Wallpaper for Your Site
40. Saving Only POP3 Email Attachments
41. Downloading MP3s from a Playlist
42. Downloading from Usenet with nget
Chapter 4. Gleaning Data from Databases
43. Archiving Yahoo! Groups Messages with yahoo2mbox
44. Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups
45. Gleaning Buzz from Yahoo!
46. Spidering the Yahoo! Catalog
47. Tracking Additions to Yahoo!
48. Scattersearch with Yahoo! and Google
49. Yahoo! Directory Mindshare in Google
50. Weblog-Free Google Results
51. Spidering, Google, and Multiple Domains
52. Scraping Amazon.com Product Reviews
53. Receive an Email Alert for Newly Added Amazon.com Reviews
54. Scraping Amazon.com Customer Advice
55. Publishing Amazon.com Associates Statistics
56. Sorting Amazon.com Recommendations by Rating
57. Related Amazon.com Products with Alexa
58. Scraping Alexa's Competitive Data with Java
59. Finding Album Information with FreeDB and Amazon.com
60. Expanding Your Musical Tastes
61. Saving Daily Horoscopes to Your iPod
62. Graphing Data with RRDTOOL
63. Stocking Up on Financial Quotes
64. Super Author Searching
65. Mapping O'Reilly Best Sellers to Library Popularity
66. Using All Consuming to Get Book Lists
67. Tracking Packages with FedEx
68. Checking Blogs for New Comments
69. Aggregating RSS and Posting Changes
70. Using the Link Cosmos of Technorati
71. Finding Related RSS Feeds
72. Automatically Finding Blogs of Interest
73. Scraping TV Listings
74. What's Your Visitor's Weather Like?
75. Trendspotting with Geotargeting
76. Getting the Best Travel Route by Train
77. Geographic Distance and Back Again
78. Super Word Lookup
79. Word Associations with Lexical Freenet
80. Reformatting Bugtraq Reports
81. Keeping Tabs on the Web via Email
82. Publish IE's Favorites to Your Web Site
83. Spidering GameStop.com Game Prices
84. Bargain Hunting with PHP
85. Aggregating Multiple Search Engine Results
86. Robot Karaoke
87. Searching the Better Business Bureau
88. Searching for Health Inspections
89. Filtering for Content
Chapter 5. Maintaining Your Collections
90. Using cron to Automate Tasks
91. Scheduling Tasks Without cron
92. Mirroring Web Sites with wget and rsync
93. Accumulating Search Results Over Time
Chapter 6. Giving Back to the World
94. Using XML::RSS to Repurpose Data
95. Placing RSS Headlines on Your Site
96. Making Your Resources Scrapable with Regular Expressions
97. Making Your Resources Scrapable with a REST Interface
98. Making Your Resources Scrapable with XML-RPC
99. Creating an IM Interface
100. Going Beyond the Book
What is in a name?
Well, sometimes a generalizing lie.
IMHO, this book should have been named "(some) Spidering Hacks using Perl".
They could have spared the "100" and "industrial strength" sales pitches from the title as well.
The very little Python and Java code that is mentioned or included as examples is there, I think, to pepper the content and make it look more appealing to a broader audience.
The book is mostly about Perl scripts. (You could compile Perl to C and then use c2java, for example, but why bother if, as I noticed right away, it is mostly toy code?) I wonder what the "industrial strength" thing was all about.
There are also some GNU utility examples (wget and curl), for which you could find better examples online.
The book has "examples" that don't make any sense (to me) and that you could see as a total waste of time: why bother scraping Amazon's pages if they offer SOAP/RSS feeds? And then he goes on to tell you how to scrape a site offering financial stock information, too!?! I would have started by splitting the book in two: cases for which you don't really need scraping at all, and those for which you do.
In an attempt to reach the "100" mark, the author included cases on how to download, say, MP3s of Beatles songs and PDF files from IRS sites as separate cases :-? I wonder what the difference is once you have a connection to the data feed?!?
There is "Web Content Mining with Java" (ISBN: 047084311X); as you can see, its publishers and authors named that book after what it is all about. If you want to read about "industrial strength" approaches, I would recommend "Mining the Web" (ISBN: 1558607544).
Usually "hacks" books are about hacks, meaning you already know your stuff and are learning some tricks. If you know the basics of spiders and how to retrieve data off the Net programmatically, this book is not for you. If, on the other hand, you are new to this subject and are a Perl programmer, you may learn a few things from it.
otf.
Good, but needs more variety of languages
Now, if only the authors had included examples written for ASP, ColdFusion, etc. Nearly all of the examples were written using Perl, but the few pages written with PHP contained some very useful nuggets!
I especially liked the use of the explode() function to split a table-formatted HTML report into multiple PHP array elements for individual processing. With more languages covered, they could have appealed to a much wider audience!
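The book's PHP code is not reproduced in this review, so the following is only a rough Python analogue of the splitting idea, with `str.split` standing in for PHP's `explode()`. The sample `report` string and the `split_rows` helper are invented for illustration, and the approach assumes tidy markup with plain `<tr>` and `<td>` tags and no attributes.

```python
# Hypothetical table-formatted report, standing in for a scraped page.
report = (
    "<table>"
    "<tr><td>Widget</td><td>9.99</td></tr>"
    "<tr><td>Gadget</td><td>24.50</td></tr>"
    "</table>"
)


def split_rows(html):
    """Split a simple HTML table into a list of rows of cell strings."""
    rows = []
    # Everything before the first "<tr>" is table preamble, so skip it.
    for chunk in html.split("<tr>")[1:]:
        row_html = chunk.split("</tr>")[0]
        cells = [cell.split("</td>")[0] for cell in row_html.split("<td>")[1:]]
        rows.append(cells)
    return rows


rows = split_rows(report)
print(rows)
```

Delimiter splitting like this is quick to write but brittle: tags with attributes or different casing would defeat it, and an HTML parser is the sturdier choice in those cases.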