Learning PySpark (e-book)

Author: Tomasz Drabas

Publisher: Packt Publishing

Publication date: 2017-02-01

Word count: 1.633 million

Build data-intensive applications locally and deploy at scale using the combined powers of Python and Spark 2.0

About This Book

  • Learn why and how you can efficiently use Python to process data and build machine learning models in Apache Spark 2.0
  • Develop and deploy efficient, scalable real-time Spark solutions
  • Take your understanding of using Spark with Python to the next level with this jump start guide

Who This Book Is For

If you are a Python developer who wants to learn about the Apache Spark 2.0 ecosystem, this book is for you. A firm understanding of Python is expected to get the best out of the book. Familiarity with Spark would be useful, but is not mandatory.

What You Will Learn

  • Learn about Apache Spark and the Spark 2.0 architecture
  • Build and interact with Spark DataFrames using Spark SQL
  • Learn how to solve graph and deep learning problems using GraphFrames and TensorFrames respectively
  • Read, transform, and understand data and use it to train machine learning models
  • Build machine learning models with MLlib and ML
  • Learn how to submit your applications programmatically using spark-submit
  • Deploy locally built applications to a cluster

In Detail

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This book will show you how to leverage the power of Python and put it to use in the Spark ecosystem. You will start by getting a firm understanding of the Spark 2.0 architecture and how to set up a Python environment for Spark.

You will get familiar with the modules available in PySpark. You will learn how to abstract data with RDDs and DataFrames and understand the streaming capabilities of PySpark. Also, you will get a thorough overview of machine learning capabilities of PySpark using ML and MLlib, graph processing using GraphFrames, and polyglot persistence using Blaze. Finally, you will learn how to deploy your applications to the cloud using the spark-submit command.

By the end of this book, you will have established a firm understanding of the Spark Python API and how it can be used to build data-intensive applications.

Style and Approach

This book takes a very comprehensive, step-by-step approach so you understand how the Spark ecosystem can be used with Python to develop efficient, scalable solutions. Every chapter is standalone and written in a very easy-to-understand manner, with a focus on both the hows and the whys of each concept.
Table of Contents

Learning PySpark

Table of Contents

Learning PySpark

Credits

Foreword

About the Authors

About the Reviewer

www.PacktPub.com

Customer Feedback

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

1. Understanding Spark

What is Apache Spark?

Spark Jobs and APIs

Execution process

Resilient Distributed Dataset

DataFrames

Datasets

Catalyst Optimizer

Project Tungsten

Spark 2.0 architecture

Unifying Datasets and DataFrames

Introducing SparkSession

Tungsten phase 2

Structured Streaming

Continuous applications

Summary

2. Resilient Distributed Datasets

Internal workings of an RDD

Creating RDDs

Schema

Reading from files

Lambda expressions

Global versus local scope

Transformations

The .map(...) transformation

The .filter(...) transformation

The .flatMap(...) transformation

The .distinct(...) transformation

The .sample(...) transformation

The .leftOuterJoin(...) transformation

The .repartition(...) transformation

Actions

The .take(...) method

The .collect(...) method

The .reduce(...) method

The .count(...) method

The .saveAsTextFile(...) method

The .foreach(...) method

Summary

3. DataFrames

Python to RDD communications

Catalyst Optimizer refresh

Speeding up PySpark with DataFrames

Creating DataFrames

Generating our own JSON data

Creating a DataFrame

Creating a temporary table

Simple DataFrame queries

DataFrame API query

SQL query

Interoperating with RDDs

Inferring the schema using reflection

Programmatically specifying the schema

Querying with the DataFrame API

Number of rows

Running filter statements

Querying with SQL

Number of rows

Running filter statements using the where Clauses

DataFrame scenario – on-time flight performance

Preparing the source datasets

Joining flight performance and airports

Visualizing our flight-performance data

Spark Dataset API

Summary

4. Prepare Data for Modeling

Checking for duplicates, missing observations, and outliers

Duplicates

Missing observations

Outliers

Getting familiar with your data

Descriptive statistics

Correlations

Visualization

Histograms

Interactions between features

Summary

5. Introducing MLlib

Overview of the package

Loading and transforming the data

Getting to know your data

Descriptive statistics

Correlations

Statistical testing

Creating the final dataset

Creating an RDD of LabeledPoints

Splitting into training and testing

Predicting infant survival

Logistic regression in MLlib

Selecting only the most predictable features

Random forest in MLlib

Summary

6. Introducing the ML Package

Overview of the package

Transformer

Estimators

Classification

Regression

Clustering

Pipeline

Predicting the chances of infant survival with ML

Loading the data

Creating transformers

Creating an estimator

Creating a pipeline

Fitting the model

Evaluating the performance of the model

Saving the model

Parameter hyper-tuning

Grid search

Train-validation splitting

Other features of PySpark ML in action

Feature extraction

NLP - related feature extractors

Discretizing continuous variables

Standardizing continuous variables

Classification

Clustering

Finding clusters in the births dataset

Topic mining

Regression

Summary

7. GraphFrames

Introducing GraphFrames

Installing GraphFrames

Creating a library

Preparing your flights dataset

Building the graph

Executing simple queries

Determining the number of airports and trips

Determining the longest delay in this dataset

Determining the number of delayed versus on-time/early flights

What flights departing Seattle are most likely to have significant delays?

What states tend to have significant delays departing from Seattle?

Understanding vertex degrees

Determining the top transfer airports

Understanding motifs

Determining airport ranking using PageRank

Determining the most popular non-stop flights

Using Breadth-First Search

Visualizing flights using D3

Summary

8. TensorFrames

What is Deep Learning?

The need for neural networks and Deep Learning

What is feature engineering?

Bridging the data and algorithm

What is TensorFlow?

Installing Pip

Installing TensorFlow

Matrix multiplication using constants

Matrix multiplication using placeholders

Running the model

Running another model

Discussion

Introducing TensorFrames

TensorFrames – quick start

Configuration and setup

Launching a Spark cluster

Creating a TensorFrames library

Installing TensorFlow on your cluster

Using TensorFlow to add a constant to an existing column

Executing the Tensor graph

Blockwise reducing operations example

Building a DataFrame of vectors

Analysing the DataFrame

Computing elementwise sum and min of all vectors

Summary

9. Polyglot Persistence with Blaze

Installing Blaze

Polyglot persistence

Abstracting data

Working with NumPy arrays

Working with pandas' DataFrame

Working with files

Working with databases

Interacting with relational databases

Interacting with the MongoDB database

Data operations

Accessing columns

Symbolic transformations

Operations on columns

Reducing data

Joins

Summary

10. Structured Streaming

What is Spark Streaming?

Why do we need Spark Streaming?

What is the Spark Streaming application data flow?

Simple streaming application using DStreams

A quick primer on global aggregations

Introducing Structured Streaming

Summary

11. Packaging Spark Applications

The spark-submit command

Command line parameters

Deploying the app programmatically

Configuring your SparkSession

Creating SparkSession

Modularizing code

Structure of the module

Calculating the distance between two points

Converting distance units

Building an egg

User defined functions in Spark

Submitting a job

Monitoring execution

Databricks Jobs

Summary

Index
