Python Web Scraping - Second Edition (e-book)

Price: ¥

Authors: Katharine Jarmul, Richard Lawson

Publisher: Packt Publishing

Publication date: 2017-05-30

Word count: 259,000

Category: Imported Books > Foreign-Language Originals > Computers/Internet

Note: digital products cannot be returned or exchanged, source files are not provided, and export and printing are not supported.

Successfully scrape data from any website with the power of Python 3.x

About This Book

  • A hands-on guide to web scraping using Python with solutions to real-world problems
  • Create a number of different web scrapers in Python to extract information
  • This book includes practical examples on using the popular and well-maintained libraries in Python for your web scraping needs

Who This Book Is For

This book is aimed at developers who want to use web scraping for legitimate purposes. Prior programming experience with Python would be useful but not essential. Anyone with general knowledge of programming languages should be able to pick up the book and understand the principles involved.

What You Will Learn

  • Extract data from web pages with simple Python programming
  • Build a concurrent crawler to process web pages in parallel
  • Follow links to crawl a website
  • Extract features from the HTML
  • Cache downloaded HTML for reuse
  • Compare concurrent models to determine the fastest crawler
  • Find out how to parse JavaScript-dependent websites
  • Interact with forms and sessions

In Detail

The Internet contains the most useful set of data ever assembled, most of which is publicly accessible for free. However, this data is not easily usable. It is embedded within the structure and style of websites and needs to be carefully extracted. Web scraping is becoming increasingly useful as a means to gather and make sense of the wealth of information available online.

This book is the ultimate guide to using the latest features of Python 3.x to scrape data from websites. In the early chapters, you'll see how to extract data from static web pages. You'll learn to use caching with databases and files to save time and manage the load on servers. After covering the basics, you'll get hands-on practice building a more sophisticated crawler using browsers, crawlers, and concurrent scrapers.

You'll determine when and how to scrape data from a JavaScript-dependent website using PyQt and Selenium. You'll get a better understanding of how to submit forms on complex websites protected by CAPTCHA. You'll find out how to automate these actions with Python packages such as mechanize. You'll also learn how to create class-based scrapers with Scrapy libraries and implement your learning on real websites.

By the end of the book, you will have explored testing websites with scrapers, remote scraping, best practices, working with images, and many other relevant topics.

Style and approach

This hands-on guide is full of real-life examples and solutions, starting simple and then progressively becoming more complex. Each chapter in this book introduces a problem and then provides one or more possible solutions.
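
To make the "extract data from web pages" and "follow links to crawl a website" items above more concrete, here is a minimal sketch. It is not taken from the book: it uses only Python's standard library (`html.parser` and `urllib.parse`) rather than the third-party parsers the book covers, and the names `LinkExtractor` and `extract_links` and the sample URLs are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags.

    Hypothetical helper for illustration only.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the links a crawler would follow.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute link targets found in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    sample = '<a href="/page/1">One</a> <a href="http://example.com/x">X</a>'
    print(extract_links(sample, "http://example.com"))
```

A real crawler would download each page first, then feed the response body to `extract_links` and queue the results, while respecting robots.txt and throttling requests.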
Table of Contents

Title Page

Copyright

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Customer Feedback

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

Introduction to Web Scraping

When is web scraping useful?

Is web scraping legal?

Python 3

Background research

Checking robots.txt

Examining the Sitemap

Estimating the size of a website

Identifying the technology used by a website

Finding the owner of a website

Crawling your first website

Scraping versus crawling

Downloading a web page

Retrying downloads

Setting a user agent

Sitemap crawler

ID iteration crawler

Link crawlers

Advanced features

Parsing robots.txt

Supporting proxies

Throttling downloads

Avoiding spider traps

Final version

Using the requests library

Summary

Scraping the Data

Analyzing a web page

Three approaches to scrape a web page

Regular expressions

Beautiful Soup

Lxml

CSS selectors and your Browser Console

XPath Selectors

LXML and Family Trees

Comparing performance

Scraping results

Overview of Scraping

Adding a scrape callback to the link crawler

Summary

Caching Downloads

When to use caching?

Adding cache support to the link crawler

Disk Cache

Implementing DiskCache

Testing the cache

Saving disk space

Expiring stale data

Drawbacks of DiskCache

Key-value storage cache

What is key-value storage?

Installing Redis

Overview of Redis

Redis cache implementation

Compression

Testing the cache

Exploring requests-cache

Summary

Concurrent Downloading

One million web pages

Parsing the Alexa list

Sequential crawler

Threaded crawler

How threads and processes work

Implementing a multithreaded crawler

Multiprocessing crawler

Performance

Summary

Dynamic Content

An example dynamic web page

Reverse engineering a dynamic web page

Edge cases

Rendering a dynamic web page

PyQt or PySide

Debugging with Qt

Executing JavaScript

Website interaction with WebKit

Waiting for results

The Render class

Selenium

Selenium and Headless Browsers

Summary

Interacting with Forms

The Login form

Loading cookies from the web browser

Extending the login script to update content

Automating forms with Selenium

"Humanizing" methods for Web Scraping

Summary

Solving CAPTCHA

Registering an account

Loading the CAPTCHA image

Optical character recognition

Further improvements

Solving complex CAPTCHAs

Using a CAPTCHA solving service

Getting started with 9kw

The 9kw CAPTCHA API

Reporting errors

Integrating with registration

CAPTCHAs and machine learning

Summary

Scrapy

Installing Scrapy

Starting a project

Defining a model

Creating a spider

Tuning settings

Testing the spider

Different Spider Types

Scraping with the shell command

Checking results

Interrupting and resuming a crawl

Scrapy Performance Tuning

Visual scraping with Portia

Installation

Annotation

Running the Spider

Checking results

Automated scraping with Scrapely

Summary

Putting It All Together

Google search engine

Facebook

The website

Facebook API

Gap

BMW

Summary
