PySpark Cookbook (ebook)

Authors: Denny Lee, Tomasz Drabas

Publisher: Packt Publishing

Publication date: 2018-06-29

Word count: 325,000

Category: Imported Books > Foreign-Language Originals > Computers/Networking

Book description

Combine the power of Apache Spark and Python to build effective big data applications.

About This Book

  • Perform effective data processing, machine learning, and analytics using PySpark
  • Overcome challenges in developing and deploying Spark solutions using Python
  • Explore recipes for efficiently combining Python and Apache Spark to process data

Who This Book Is For

The PySpark Cookbook is for you if you are a Python developer looking for hands-on recipes for using the Apache Spark 2.x ecosystem in the best possible way. A thorough understanding of Python (and some familiarity with Spark) will help you get the best out of the book.

What You Will Learn

  • Configure a local instance of PySpark in a virtual environment
  • Install and configure Jupyter in local and multi-node environments
  • Create DataFrames from JSON and a dictionary using pyspark.sql
  • Explore regression and clustering models available in the ML module
  • Use DataFrames to transform data used for modeling
  • Connect to PubNub and perform aggregations on streams

In Detail

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. The PySpark Cookbook presents effective and time-saving recipes for leveraging the power of Python and putting it to use in the Spark ecosystem. You'll start by learning the Apache Spark architecture and how to set up a Python environment for Spark. You'll then get familiar with the modules available in PySpark and start using them effortlessly. In addition, you'll discover how to abstract data with RDDs and DataFrames, and understand the streaming capabilities of PySpark. You'll then move on to using ML and MLlib to solve machine learning problems with PySpark, and use GraphFrames to solve graph-processing problems. Finally, you will explore how to deploy your applications to the cloud using the spark-submit command.

By the end of this book, you will be able to use the Python API for Apache Spark to solve any problems associated with building data-intensive applications.

Style and Approach

This book is a rich collection of recipes that will come in handy when you are working with PySpark. Addressing your common and not-so-common pain points, it is a book that you must have on the shelf.
Table of contents

Title Page

Copyright and Credits

PySpark Cookbook

Packt Upsell

Why subscribe?

PacktPub.com

Contributors

About the authors

About the reviewer

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Sections

Getting ready

How to do it...

How it works...

There's more...

See also

Get in touch

Reviews

Installing and Configuring Spark

Introduction

Installing Spark requirements

Getting ready

How to do it...

How it works...

There's more...

Installing Java

Installing Python

Installing R

Installing Scala

Installing Maven

Updating PATH

Installing Spark from sources

Getting ready

How to do it...

How it works...

There's more...

See also

Installing Spark from binaries

Getting ready

How to do it...

How it works...

There's more...

Configuring a local instance of Spark

Getting ready

How to do it...

How it works...

See also

Configuring a multi-node instance of Spark

Getting ready

How to do it...

How it works...

See also

Installing Jupyter

Getting ready

How to do it...

How it works...

There's more...

See also

Configuring a session in Jupyter

Getting ready

How to do it...

How it works...

There's more...

See also

Working with Cloudera Spark images

Getting ready

How to do it...

How it works...

Abstracting Data with RDDs

Introduction

Creating RDDs

Getting ready

How to do it...

How it works...

Spark context parallelize method

.take(...) method

Reading data from files

Getting ready

How to do it...

How it works...

.textFile(...) method

.map(...) method

Partitions and performance

Overview of RDD transformations

Getting ready

How to do it...

.map(...) transformation

.filter(...) transformation

.flatMap(...) transformation

.distinct() transformation

.sample(...) transformation

.join(...) transformation

.repartition(...) transformation

.zipWithIndex() transformation

.reduceByKey(...) transformation

.sortByKey(...) transformation

.union(...) transformation

.mapPartitionsWithIndex(...) transformation

How it works...

Overview of RDD actions

Getting ready

How to do it...

.take(...) action

.collect() action

.reduce(...) action

.count() action

.saveAsTextFile(...) action

How it works...

Pitfalls of using RDDs

Getting ready

How to do it...

How it works...

Abstracting Data with DataFrames

Introduction

Creating DataFrames

Getting ready

How to do it...

How it works...

There's more...

From JSON

From CSV

See also

Accessing underlying RDDs

Getting ready

How to do it...

How it works...

Performance optimizations

Getting ready

How to do it...

How it works...

There's more...

See also

Inferring the schema using reflection

Getting ready

How to do it...

How it works...

See also

Specifying the schema programmatically

Getting ready

How to do it...

How it works...

See also

Creating a temporary table

Getting ready

How to do it...

How it works...

There's more...

Using SQL to interact with DataFrames

Getting ready

How to do it...

How it works...

There's more...

Overview of DataFrame transformations

Getting ready

How to do it...

The .select(...) transformation

The .filter(...) transformation

The .groupBy(...) transformation

The .orderBy(...) transformation

The .withColumn(...) transformation

The .join(...) transformation

The .unionAll(...) transformation

The .distinct(...) transformation

The .repartition(...) transformation

The .fillna(...) transformation

The .dropna(...) transformation

The .dropDuplicates(...) transformation

The .summary() and .describe() transformations

The .freqItems(...) transformation

See also

Overview of DataFrame actions

Getting ready

How to do it...

The .show(...) action

The .collect() action

The .take(...) action

The .toPandas() action

See also

Preparing Data for Modeling

Introduction

Handling duplicates

Getting ready

How to do it...

How it works...

There's more...

Only IDs differ

ID collisions

Handling missing observations

Getting ready

How to do it...

How it works...

Missing observations per row

Missing observations per column

There's more...

See also

Handling outliers

Getting ready

How to do it...

How it works...

See also

Exploring descriptive statistics

Getting ready

How to do it...

How it works...

There's more...

Descriptive statistics for aggregated columns

See also

Computing correlations

Getting ready

How to do it...

How it works...

There's more...

Drawing histograms

Getting ready

How to do it...

How it works...

There's more...

See also

Visualizing interactions between features

Getting ready

How to do it...

How it works...

There's more...

Machine Learning with MLlib

Loading the data

Getting ready

How to do it...

How it works...

There's more...

Exploring the data

Getting ready

How to do it...

How it works...

Numerical features

Categorical features

There's more...

See also

Testing the data

Getting ready

How to do it...

How it works...

See also

Transforming the data

Getting ready

How to do it...

How it works...

There's more...

See also

Standardizing the data

Getting ready

How to do it...

How it works...

Creating an RDD for training

Getting ready

How to do it...

Classification

Regression

How it works...

There's more...

See also

Predicting hours of work for census respondents

Getting ready

How to do it...

How it works...

Forecasting the income levels of census respondents

Getting ready

How to do it...

How it works...

There's more...

Building clustering models

Getting ready

How to do it...

How it works...

There's more...

See also

Computing performance statistics

Getting ready

How to do it...

How it works...

Regression metrics

Classification metrics

See also

Machine Learning with the ML Module

Introducing Transformers

Getting ready

How to do it...

How it works...

There's more...

See also

Introducing Estimators

Getting ready

How to do it...

How it works...

There's more...

Introducing Pipelines

Getting ready

How to do it...

How it works...

See also

Selecting the most predictable features

Getting ready

How to do it...

How it works...

There's more...

See also

Predicting forest coverage types

Getting ready

How to do it...

How it works...

There's more...

Estimating forest elevation

Getting ready

How to do it...

How it works...

There's more...

Clustering forest cover types

Getting ready

How to do it...

How it works...

See also

Tuning hyperparameters

Getting ready

How to do it...

How it works...

There's more...

Extracting features from text

Getting ready

How to do it...

How it works...

There's more...

See also

Discretizing continuous variables

Getting ready

How to do it...

How it works...

Standardizing continuous variables

Getting ready

How to do it...

How it works...

Topic mining

Getting ready

How to do it...

How it works...

Structured Streaming with PySpark

Introduction

Understanding Spark Streaming

Understanding DStreams

Getting ready

How to do it...

Terminal 1 – Netcat window

Terminal 2 – Spark Streaming window

How it works...

There's more...

Understanding global aggregations

Getting ready

How to do it...

Terminal 1 – Netcat window

Terminal 2 – Spark Streaming window

How it works...

Continuous aggregation with structured streaming

Getting ready

How to do it...

Terminal 1 – Netcat window

Terminal 2 – Spark Streaming window

How it works...

GraphFrames – Graph Theory with PySpark

Introduction

Installing GraphFrames

Getting ready

How to do it...

How it works...

Preparing the data

Getting ready

How to do it...

How it works...

There's more...

Building the graph

How to do it...

How it works...

Running queries against the graph

Getting ready

How to do it...

How it works...

Understanding the graph

Getting ready

How to do it...

How it works...

Using PageRank to determine airport ranking

Getting ready

How to do it...

How it works...

Finding the fewest number of connections

Getting ready

How to do it...

How it works...

There's more...

See also

Visualizing the graph

Getting ready

How to do it...

How it works...
