Python Web Scraping Cookbook (e-book)

Author: Michael Heydt

Publisher: Packt Publishing

Publication date: 2018-02-09

Word count: 422,000

Category: Imported books > Foreign-language originals > Computers/Internet

Untangle your web scraping complexities and access web data with ease using Python scripts

About This Book

  • Hands-on recipes for advancing your web scraping skills to expert level
  • A one-stop solution guide to address complex and challenging web scraping tasks using Python
  • Understand web page structure and collect meaningful data from websites with ease

Who This Book Is For

This book is ideal for Python programmers, web administrators, security professionals, and anyone who wants to perform web analytics. Familiarity with Python and a basic understanding of web scraping will help you take full advantage of this book.

What You Will Learn

  • Use a wide variety of tools to scrape any website and data, including BeautifulSoup, Scrapy, Selenium, and many more
  • Master expression languages such as XPath, CSS, and regular expressions to extract web data
  • Deal with scraping traps such as hidden form fields, throttling, pagination, and different status codes
  • Build robust scraping pipelines with SQS and RabbitMQ
  • Scrape assets such as images and media, and know what to do when a scraper fails to run
  • Explore ETL techniques to build a customized crawler and parser, and convert structured and unstructured data from websites
  • Deploy and run your scraper-as-a-service in AWS Elastic Container Service

In Detail

Python Web Scraping Cookbook is a solution-focused book that will teach you techniques to develop high-performance scrapers and deal with crawlers, sitemaps, forms automation, Ajax-based sites, caches, and more. You'll explore a number of real-world scenarios where every part of the development/product life cycle is fully covered. You will not only develop the skills to design and build reliable, performant data flows, but also deploy your codebase to AWS.

If you are involved in software engineering, product development, or data mining (or are interested in building data-driven products), you will find this book useful, as each recipe has a clear purpose and objective. From extracting data from websites to writing a sophisticated web crawler, the book's independent recipes will be a godsend on the job. This book covers Python libraries such as requests and BeautifulSoup. You will learn about crawling, web spidering, working with AJAX websites, paginated items, and more. You will also learn to tackle problems such as 403 errors, working with proxies, scraping images, LXML, and more. By the end of this book, you will be able to scrape websites more efficiently and to deploy and operate your scraper in the cloud.

Style and approach

This book is a rich collection of recipes that will come in handy when you are scraping a website using Python. Addressing your common and not-so-common pain points while scraping websites, this is a book that you must have on the shelf.

Table of Contents

Title Page

Copyright and Credits

Python Web Scraping Cookbook

Contributors

About the author

About the reviewers

Packt is searching for authors like you

Packt Upsell

Why subscribe?

PacktPub.com

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Conventions used

Get in touch

Reviews

Getting Started with Scraping

Introduction

Setting up a Python development environment

Getting ready

How to do it...

Scraping Python.org with Requests and Beautiful Soup

Getting ready...

How to do it...

How it works...

Scraping Python.org in urllib3 and Beautiful Soup

Getting ready...

How to do it...

How it works

There's more...

Scraping Python.org with Scrapy

Getting ready...

How to do it...

How it works

Scraping Python.org with Selenium and PhantomJS

Getting ready

How to do it...

How it works

There's more...

Data Acquisition and Extraction

Introduction

How to parse websites and navigate the DOM using BeautifulSoup

Getting ready

How to do it...

How it works

There's more...

Searching the DOM with Beautiful Soup's find methods

Getting ready

How to do it...

Querying the DOM with XPath and lxml

Getting ready

How to do it...

How it works

There's more...

Querying data with XPath and CSS selectors

Getting ready

How to do it...

How it works

There's more...

Using Scrapy selectors

Getting ready

How to do it...

How it works

There's more...

Loading data in unicode / UTF-8

Getting ready

How to do it...

How it works

There's more...

Processing Data

Introduction

Working with CSV and JSON data

Getting ready

How to do it

How it works

There's more...

Storing data using AWS S3

Getting ready

How to do it

How it works

There's more...

Storing data using MySQL

Getting ready

How to do it

How it works

There's more...

Storing data using PostgreSQL

Getting ready

How to do it

How it works

There's more...

Storing data in Elasticsearch

Getting ready

How to do it

How it works

There's more...

How to build robust ETL pipelines with AWS SQS

Getting ready

How to do it - posting messages to an AWS queue

How it works

How to do it - reading and processing messages

How it works

There's more...

Working with Images, Audio, and other Assets

Introduction

Downloading media content from the web

Getting ready

How to do it

How it works

There's more...

Parsing a URL with urllib to get the filename

Getting ready

How to do it

How it works

There's more...

Determining the type of content for a URL

Getting ready

How to do it

How it works

There's more...

Determining the file extension from a content type

Getting ready

How to do it

How it works

There's more...

Downloading and saving images to the local file system

How to do it

How it works

There's more...

Downloading and saving images to S3

Getting ready

How to do it

How it works

There's more...

Generating thumbnails for images

Getting ready

How to do it

How it works

Taking a screenshot of a website

Getting ready

How to do it

How it works

Taking a screenshot of a website with an external service

Getting ready

How to do it

How it works

There's more...

Performing OCR on an image with pytesseract

Getting ready

How to do it

How it works

There's more...

Creating a Video Thumbnail

Getting ready

How to do it

How it works

There's more...

Ripping an MP4 video to an MP3

Getting ready

How to do it

There's more...

Scraping - Code of Conduct

Introduction

Scraping legality and scraping politely

Getting ready

How to do it

Respecting robots.txt

Getting ready

How to do it

How it works

There's more...

Crawling using the sitemap

Getting ready

How to do it

How it works

There's more...

Crawling with delays

Getting ready

How to do it

How it works

There's more...

Using identifiable user agents

How to do it

How it works

There's more...

Setting the number of concurrent requests per domain

How it works

Using auto throttling

How to do it

How it works

There's more...

Using an HTTP cache for development

How to do it

How it works

There's more...

Scraping Challenges and Solutions

Introduction

Retrying failed page downloads

How to do it

How it works

Supporting page redirects

How to do it

How it works

Waiting for content to be available in Selenium

How to do it

How it works

Limiting crawling to a single domain

How to do it

How it works

Processing infinitely scrolling pages

Getting ready

How to do it

How it works

There's more...

Controlling the depth of a crawl

How to do it

How it works

Controlling the length of a crawl

How to do it

How it works

Handling paginated websites

Getting ready

How to do it

How it works

There's more...

Handling forms and forms-based authorization

Getting ready

How to do it

How it works

There's more...

Handling basic authorization

How to do it

How it works

There's more...

Preventing bans by scraping via proxies

Getting ready

How to do it

How it works

Randomizing user agents

How to do it

Caching responses

How to do it

There's more...

Text Wrangling and Analysis

Introduction

Installing NLTK

How to do it

Performing sentence splitting

How to do it

There's more...

Performing tokenization

How to do it

Performing stemming

How to do it

Performing lemmatization

How to do it

Determining and removing stop words

How to do it

There's more...

Calculating the frequency distributions of words

How to do it

There's more...

Identifying and removing rare words

How to do it

Removing punctuation marks

How to do it

There's more...

Piecing together n-grams

How to do it

There's more...

Scraping a job listing from StackOverflow

Getting ready

How to do it

There's more...

Reading and cleaning the description in the job listing

Getting ready

How to do it...

Searching, Mining and Visualizing Data

Introduction

Geocoding an IP address

Getting ready

How to do it

How to collect IP addresses of Wikipedia edits

Getting ready

How to do it

How it works

There's more...

Visualizing contributor location frequency on Wikipedia

How to do it

Creating a word cloud from a StackOverflow job listing

Getting ready

How to do it

Crawling links on Wikipedia

Getting ready

How to do it

How it works

There's more...

Visualizing page relationships on Wikipedia

Getting ready

How to do it

How it works

There's more...

Calculating degrees of separation

How to do it

How it works

There's more...

Creating a Simple Data API

Introduction

Creating a REST API with Flask-RESTful

Getting ready

How to do it

How it works

There's more...

Integrating the REST API with scraping code

Getting ready

How to do it

Adding an API to find the skills for a job listing

Getting ready

How to do it

Storing data in Elasticsearch as the result of a scraping request

Getting ready

How to do it

How it works

There's more...

Checking Elasticsearch for a listing before scraping

How to do it

There's more...

Creating Scraper Microservices with Docker

Introduction

Installing Docker

Getting ready

How to do it

Installing a RabbitMQ container from Docker Hub

Getting ready

How to do it

Running a Docker container (RabbitMQ)

Getting ready

How to do it

There's more...

Creating and running an Elasticsearch container

How to do it

Stopping/restarting a container and removing the image

How to do it

There's more...

Creating a generic microservice with Nameko

Getting ready

How to do it

How it works

There's more...

Creating a scraping microservice

How to do it

There's more...

Creating a scraper container

Getting ready

How to do it

How it works

Creating an API container

Getting ready

How to do it

There's more...

Composing and running the scraper locally with docker-compose

Getting ready

How to do it

There's more...

Making the Scraper as a Service Real

Introduction

Creating and configuring an Elastic Cloud trial account

How to do it

Accessing the Elastic Cloud cluster with curl

How to do it

Connecting to the Elastic Cloud cluster with Python

Getting ready

How to do it

There's more...

Performing an Elasticsearch query with the Python API

Getting ready

How to do it

There's more...

Using Elasticsearch to query for jobs with specific skills

Getting ready

How to do it

Modifying the API to search for jobs by skill

How to do it

How it works

There's more...

Storing configuration in the environment

How to do it

Creating an AWS IAM user and a key pair for ECS

Getting ready

How to do it

Configuring Docker to authenticate with ECR

Getting ready

How to do it

Pushing containers into ECR

Getting ready

How to do it

Creating an ECS cluster

How to do it

Creating a task to run our containers

Getting ready

How to do it

How it works

Starting and accessing the containers in AWS

Getting ready

How to do it

There's more...

Other Books You May Enjoy

Leave a review - let other readers know what you think
