Web Scraping with Python
Table of Contents
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
1. Introduction to Web Scraping
When is web scraping useful?
Is web scraping legal?
Background research
Checking robots.txt
Examining the Sitemap
Estimating the size of a website
Identifying the technology used by a website
Finding the owner of a website
Crawling your first website
Downloading a web page
Retrying downloads
Setting a user agent
Sitemap crawler
ID iteration crawler
Link crawler
Advanced features
Parsing robots.txt
Supporting proxies
Throttling downloads
Avoiding spider traps
Final version
Summary
2. Scraping the Data
Analyzing a web page
Three approaches to scrape a web page
Regular expressions
Beautiful Soup
Lxml
CSS selectors
Comparing performance
Scraping results
Overview
Adding a scrape callback to the link crawler
Summary
3. Caching Downloads
Adding cache support to the link crawler
Disk cache
Implementation
Testing the cache
Saving disk space
Expiring stale data
Drawbacks
Database cache
What is NoSQL?
Installing MongoDB
Overview of MongoDB
MongoDB cache implementation
Compression
Testing the cache
Summary
4. Concurrent Downloading
One million web pages
Parsing the Alexa list
Sequential crawler
Threaded crawler
How threads and processes work
Implementation
Cross-process crawler
Performance
Summary
5. Dynamic Content
An example dynamic web page
Reverse engineering a dynamic web page
Edge cases
Rendering a dynamic web page
PyQt or PySide
Executing JavaScript
Website interaction with WebKit
Waiting for results
The Render class
Selenium
Summary
6. Interacting with Forms
The Login form
Loading cookies from the web browser
Extending the login script to update content
Automating forms with the Mechanize module
Summary
7. Solving CAPTCHA
Registering an account
Loading the CAPTCHA image
Optical Character Recognition
Further improvements
Solving complex CAPTCHAs
Using a CAPTCHA solving service
Getting started with 9kw
9kw CAPTCHA API
Integrating with registration
Summary
8. Scrapy
Installation
Starting a project
Defining a model
Creating a spider
Tuning settings
Testing the spider
Scraping with the shell command
Checking results
Interrupting and resuming a crawl
Visual scraping with Portia
Installation
Annotation
Tuning a spider
Checking results
Automated scraping with Scrapely
Summary
9. Overview
Google search engine
The website
The API
Gap
BMW
Summary
Index