Web Scraping with Python (e-book)

Author: Richard Lawson

Publisher: Packt Publishing

Publication date: 2015-10-28

Word count: 982,000

Category: Imported Books > Original Foreign-Language Editions > Computers/Internet

Successfully scrape data from any website with the power of Python

About This Book

  • A hands-on guide to web scraping with real-life problems and solutions
  • Techniques to download and extract data from complex websites
  • Create a number of different web scrapers to extract information

Who This Book Is For

This book is aimed at developers who want to use web scraping for legitimate purposes. Prior programming experience with Python would be useful but is not essential. Anyone with general knowledge of programming languages should be able to pick up the book and understand the principles involved.

What You Will Learn

  • Extract data from web pages with simple Python programming
  • Build a threaded crawler to process web pages in parallel
  • Follow links to crawl a website
  • Cache downloads to reduce bandwidth
  • Use multiple threads and processes to scrape faster
  • Learn how to parse JavaScript-dependent websites
  • Interact with forms and sessions
  • Solve CAPTCHAs on protected web pages
  • Discover how to track the state of a crawl

In Detail

The Internet contains the most useful set of data ever assembled, largely publicly accessible for free. However, this data is not easily reusable. It is embedded within the structure and style of websites and needs to be carefully extracted to be useful. Web scraping is becoming increasingly useful as a means to easily gather and make sense of the plethora of information available online. With a simple language like Python, you can extract information from complex websites using straightforward code.

This book is the ultimate guide to using Python to scrape data from websites. The early chapters cover how to extract data from static web pages and how to use caching to manage the load on servers. After the basics, we'll get our hands dirty building a more sophisticated crawler with threads and move on to more advanced topics. Learn step by step how to use Ajax URLs, employ the Firebug extension for monitoring, and indirectly scrape data. Discover further details of scraping, such as using a browser renderer, managing cookies, and submitting forms to extract data from complex websites protected by CAPTCHA. The book wraps up by showing how to create high-level scrapers with Scrapy and how to apply what has been learned to real websites.

Style and approach

This book is a hands-on guide with real-life examples and solutions, starting simple and then becoming progressively more complex. Each chapter introduces a problem and then provides one or more possible solutions.
Web Scraping with Python

Table of Contents

Credits

About the Author

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Errata

Piracy

Questions

1. Introduction to Web Scraping

When is web scraping useful?

Is web scraping legal?

Background research

Checking robots.txt

Examining the Sitemap

Estimating the size of a website

Identifying the technology used by a website

Finding the owner of a website

Crawling your first website

Downloading a web page

Retrying downloads

Setting a user agent

Sitemap crawler

ID iteration crawler

Link crawler

Advanced features

Parsing robots.txt

Supporting proxies

Throttling downloads

Avoiding spider traps

Final version

Summary

2. Scraping the Data

Analyzing a web page

Three approaches to scrape a web page

Regular expressions

Beautiful Soup

Lxml

CSS selectors

Comparing performance

Scraping results

Overview

Adding a scrape callback to the link crawler

Summary

3. Caching Downloads

Adding cache support to the link crawler

Disk cache

Implementation

Testing the cache

Saving disk space

Expiring stale data

Drawbacks

Database cache

What is NoSQL?

Installing MongoDB

Overview of MongoDB

MongoDB cache implementation

Compression

Testing the cache

Summary

4. Concurrent Downloading

One million web pages

Parsing the Alexa list

Sequential crawler

Threaded crawler

How threads and processes work

Implementation

Cross-process crawler

Performance

Summary

5. Dynamic Content

An example dynamic web page

Reverse engineering a dynamic web page

Edge cases

Rendering a dynamic web page

PyQt or PySide

Executing JavaScript

Website interaction with WebKit

Waiting for results

The Render class

Selenium

Summary

6. Interacting with Forms

The Login form

Loading cookies from the web browser

Extending the login script to update content

Automating forms with the Mechanize module

Summary

7. Solving CAPTCHA

Registering an account

Loading the CAPTCHA image

Optical Character Recognition

Further improvements

Solving complex CAPTCHAs

Using a CAPTCHA solving service

Getting started with 9kw

9kw CAPTCHA API

Integrating with registration

Summary

8. Scrapy

Installation

Starting a project

Defining a model

Creating a spider

Tuning settings

Testing the spider

Scraping with the shell command

Checking results

Interrupting and resuming a crawl

Visual scraping with Portia

Installation

Annotation

Tuning a spider

Checking results

Automated scraping with Scrapely

Summary

9. Overview

Google search engine

Facebook

The website

The API

Gap

BMW

Summary

Index
