Title Page
Copyright
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
Introduction to Web Scraping
When is web scraping useful?
Is web scraping legal?
Python 3
Background research
Checking robots.txt
Examining the Sitemap
Estimating the size of a website
Identifying the technology used by a website
Finding the owner of a website
Crawling your first website
Scraping versus crawling
Downloading a web page
Retrying downloads
Setting a user agent
Sitemap crawler
ID iteration crawler
Link crawlers
Advanced features
Parsing robots.txt
Supporting proxies
Throttling downloads
Avoiding spider traps
Final version
Using the requests library
Summary
Scraping the Data
Analyzing a web page
Three approaches to scrape a web page
Regular expressions
Beautiful Soup
Lxml
CSS selectors and your Browser Console
XPath Selectors
LXML and Family Trees
Comparing performance
Scraping results
Overview of Scraping
Adding a scrape callback to the link crawler
Summary
Caching Downloads
When to use caching?
Adding cache support to the link crawler
Disk Cache
Implementing DiskCache
Testing the cache
Saving disk space
Expiring stale data
Drawbacks of DiskCache
Key-value storage cache
What is key-value storage?
Installing Redis
Overview of Redis
Redis cache implementation
Compression
Testing the cache
Exploring requests-cache
Summary
Concurrent Downloading
One million web pages
Parsing the Alexa list
Sequential crawler
Threaded crawler
How threads and processes work
Implementing a multithreaded crawler
Multiprocessing crawler
Performance
Summary
Dynamic Content
An example dynamic web page
Reverse engineering a dynamic web page
Edge cases
Rendering a dynamic web page
PyQt or PySide
Debugging with Qt
Executing JavaScript
Website interaction with WebKit
Waiting for results
The Render class
Selenium
Selenium and Headless Browsers
Summary
Interacting with Forms
The Login form
Loading cookies from the web browser
Extending the login script to update content
Automating forms with Selenium
"Humanizing" methods for Web Scraping
Summary
Solving CAPTCHA
Registering an account
Loading the CAPTCHA image
Optical character recognition
Further improvements
Solving complex CAPTCHAs
Using a CAPTCHA solving service
Getting started with 9kw
The 9kw CAPTCHA API
Reporting errors
Integrating with registration
CAPTCHAs and machine learning
Summary
Scrapy
Installing Scrapy
Starting a project
Defining a model
Creating a spider
Tuning settings
Testing the spider
Different Spider Types
Scraping with the shell command
Checking results
Interrupting and resuming a crawl
Scrapy Performance Tuning
Visual scraping with Portia
Installation
Annotation
Running the Spider
Checking results
Automated scraping with Scrapely
Summary
Putting It All Together
Google search engine
The website
Facebook API
Gap
BMW
Summary