Apache Spark 2: Data Processing and Real-Time Analytics (e-book)


Author: Romeo Kienzler

Publisher: Packt Publishing

Publication date: 2018-12-21

Word count: 655,000

Category: Imported Books > Foreign-Language Originals > Computers/Internet


Build efficient data flow and machine learning programs with this flexible, multi-functional open-source cluster-computing framework.

Key Features

  • Master the art of real-time big data processing and machine learning
  • Explore a wide range of use cases for analyzing large data sets
  • Discover ways to optimize your work using the many features of Spark 2.x and Scala

Book Description

Apache Spark is an in-memory, cluster-based data processing system that provides a wide range of functionalities, such as big data processing, analytics, machine learning, and more. With this Learning Path, you can take your knowledge of Apache Spark to the next level by learning how to expand Spark's functionality and build your own data flow and machine learning programs on this platform. You will work with the different modules in Apache Spark, such as interactive querying with Spark SQL, using DataFrames and Datasets, implementing streaming analytics with Spark Streaming, and applying machine learning and deep learning techniques on Spark using MLlib and various external tools.

By the end of this elaborately designed Learning Path, you will have all the knowledge you need to master Apache Spark and to build your own big data processing and analytics pipeline quickly and without any hassle.

This Learning Path includes content from the following Packt products:

  • Mastering Apache Spark 2.x by Romeo Kienzler
  • Scala and Spark for Big Data Analytics by Md. Rezaul Karim and Sridhar Alla
  • Apache Spark 2.x Machine Learning Cookbook by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, and Shuen Mei

What You Will Learn

  • Get to grips with all the features of Apache Spark 2.x
  • Perform highly optimized real-time big data processing
  • Use ML and DL techniques with Spark MLlib and third-party tools
  • Analyze structured and unstructured data using SparkSQL and GraphX
  • Understand tuning, debugging, and monitoring of big data applications
  • Build scalable and fault-tolerant streaming applications
  • Develop scalable recommendation engines

Who This Book Is For

If you are an intermediate-level Spark developer looking to master the advanced capabilities and use cases of Apache Spark 2.x, this Learning Path is ideal for you. Big data professionals who want to learn how to integrate and use the features of Apache Spark to build a strong big data pipeline will also find this Learning Path useful. To grasp the concepts explained in this Learning Path, you must know the fundamentals of Apache Spark and Scala.
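To give a flavor of the APIs this Learning Path covers, here is a minimal sketch in Scala (the book's language). It assumes Spark 2.x on the classpath and a local session; the object name, toy data, and column names are illustrative only and are not taken from the book:

    import org.apache.spark.sql.SparkSession

    // Illustrative example (not from the book): SparkSession is the single
    // Spark 2.x entry point for DataFrames, Datasets, and Spark SQL.
    object QuickTaste {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("QuickTaste")
          .master("local[*]") // local mode; a cluster would use standalone, YARN, or Mesos
          .getOrCreate()
        import spark.implicits._

        // A small in-memory DataFrame (toy data for illustration).
        val ratings = Seq((1, "spark", 5), (2, "scala", 4), (3, "spark", 3))
          .toDF("userId", "topic", "rating")

        // Interactive querying with Spark SQL over a temporary view.
        ratings.createOrReplaceTempView("ratings")
        spark.sql("SELECT topic, AVG(rating) AS avg_rating FROM ratings GROUP BY topic")
          .show()

        spark.stop()
      }
    }

The same SparkSession also fronts the Structured Streaming and MLlib pipeline APIs covered in the later chapters.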
Table of Contents

Title Page

Copyright

Apache Spark 2: Data Processing and Real-Time Analytics

About Packt

Why Subscribe?

Packt.com

Contributors

About the Authors

Packt Is Searching for Authors Like You

Preface

Who This Book Is For

What This Book Covers

To Get the Most out of This Book

Download the Example Code Files

Conventions Used

Get in Touch

Reviews

A First Taste and What's New in Apache Spark V2

Spark machine learning

Spark Streaming

Spark SQL

Spark graph processing

Extended ecosystem

What's new in Apache Spark V2?

Cluster design

Cluster management

Local

Standalone

Apache YARN

Apache Mesos

Cloud-based deployments

Performance

The cluster structure

Hadoop Distributed File System

Data locality

Memory

Coding

Cloud

Summary

Apache Spark Streaming

Overview

Errors and recovery

Checkpointing

Streaming sources

TCP stream

File streams

Flume

Kafka

Summary

Structured Streaming

The concept of continuous applications

True unification - same code, same engine

Windowing

How streaming engines use windowing

How Apache Spark improves windowing

Increased performance with good old friends

How transparent fault tolerance and the exactly-once delivery guarantee are achieved

Replayable sources can replay streams from a given offset

Idempotent sinks prevent data duplication

State versioning guarantees consistent results after reruns

Example - connection to an MQTT message broker

Controlling continuous applications

More on stream life cycle management

Summary

Apache Spark MLlib

Architecture

The development environment

Classification with Naive Bayes

Theory on Classification

Naive Bayes in practice

Clustering with K-Means

Theory on Clustering

K-Means in practice

Artificial neural networks

ANN in practice

Summary

Apache SparkML

What does the new API look like?

The concept of pipelines

Transformers

String indexer

OneHotEncoder

VectorAssembler

Pipelines

Estimators

RandomForestClassifier

Model evaluation

CrossValidation and hyperparameter tuning

CrossValidation

Hyperparameter tuning

Winning a Kaggle competition with Apache SparkML

Data preparation

Feature engineering

Testing the feature engineering pipeline

Training the machine learning model

Model evaluation

CrossValidation and hyperparameter tuning

Using the evaluator to assess the quality of the cross-validated and tuned model

Summary

Apache SystemML

Why do we need just another library?

Why on Apache Spark?

The history of Apache SystemML

A cost-based optimizer for machine learning algorithms

An example - alternating least squares

Apache SystemML architecture

Language parsing

High-level operators are generated

How low-level operators are optimized

Performance measurements

Apache SystemML in action

Summary

Apache Spark GraphX

Overview

Graph analytics/processing with GraphX

The raw data

Creating a graph

Example 1 – counting

Example 2 – filtering

Example 3 – PageRank

Example 4 – triangle counting

Example 5 – connected components

Summary

Spark Tuning

Monitoring Spark jobs

Spark web interface

Jobs

Stages

Storage

Environment

Executors

SQL

Visualizing Spark application using web UI

Observing the running and completed Spark jobs

Debugging Spark applications using logs

Logging with log4j in Spark

Spark configuration

Spark properties

Environment variables

Logging

Common mistakes in Spark app development

Application failure

Slow jobs or unresponsiveness

Optimization techniques

Data serialization

Memory tuning

Memory usage and management

Tuning the data structures

Serialized RDD storage

Garbage collection tuning

Level of parallelism

Broadcasting

Data locality

Summary

Testing and Debugging Spark

Testing in a distributed environment

Distributed environment

Issues in a distributed system

Challenges of software testing in a distributed environment

Testing Spark applications

Testing Scala methods

Unit testing

Testing Spark applications

Method 1: Using Scala JUnit test

Method 2: Testing Scala code using FunSuite

Method 3: Making life easier with Spark testing base

Configuring Hadoop runtime on Windows

Debugging Spark applications

Logging with log4j in Spark: a recap

Debugging the Spark application

Debugging a Spark application in Eclipse as a Scala debug session

Debugging Spark jobs running in local and standalone mode

Debugging Spark applications on a YARN or Mesos cluster

Debugging a Spark application using SBT

Summary

Practical Machine Learning with Spark Using Scala

Introduction

Apache Spark

Machine learning

Scala

Software versions and libraries used in this book

Configuring IntelliJ to work with Spark and run Spark ML sample codes

Getting ready

How to do it...

There's more...

See also

Running a sample ML code from Spark

Getting ready

How to do it...

Identifying data sources for practical machine learning

Getting ready

How to do it...

See also

Running your first program using Apache Spark 2.0 with the IntelliJ IDE

How to do it...

How it works...

There's more...

See also

How to add graphics to your Spark program

How to do it...

How it works...

There's more...

See also

Spark's Three Data Musketeers for Machine Learning - Perfect Together

Introduction

RDDs - what started it all...

DataFrame - a natural evolution to unite API and SQL via a high-level API

Dataset - a high-level unifying Data API

Creating RDDs with Spark 2.0 using internal data sources

How to do it...

How it works...

Creating RDDs with Spark 2.0 using external data sources

How to do it...

How it works...

There's more...

See also

Transforming RDDs with Spark 2.0 using the filter() API

How to do it...

How it works...

There's more...

See also

Transforming RDDs with the super useful flatMap() API

How to do it...

How it works...

There's more...

See also

Transforming RDDs with set operation APIs

How to do it...

How it works...

See also

RDD transformation/aggregation with groupBy() and reduceByKey()

How to do it...

How it works...

There's more...

See also

Transforming RDDs with the zip() API

How to do it...

How it works...

See also

Join transformation with paired key-value RDDs

How to do it...

How it works...

There's more...

Reduce and grouping transformation with paired key-value RDDs

How to do it...

How it works...

See also

Creating DataFrames from Scala data structures

How to do it...

How it works...

There's more...

See also

Operating on DataFrames programmatically without SQL

How to do it...

How it works...

There's more...

See also

Loading DataFrames and setup from an external source

How to do it...

How it works...

There's more...

See also

Using DataFrames with standard SQL language - SparkSQL

How to do it...

How it works...

There's more...

See also

Working with the Dataset API using a Scala Sequence

How to do it...

How it works...

There's more...

See also

Creating and using Datasets from RDDs and back again

How to do it...

How it works...

There's more...

See also

Working with JSON using the Dataset API and SQL together

How to do it...

How it works...

There's more...

See also

Functional programming with the Dataset API using domain objects

How to do it...

How it works...

There's more...

See also

Common Recipes for Implementing a Robust Machine Learning System

Introduction

Spark's basic statistical API to help you build your own algorithms

How to do it...

How it works...

There's more...

See also

ML pipelines for real-life machine learning applications

How to do it...

How it works...

There's more...

See also

Normalizing data with Spark

How to do it...

How it works...

There's more...

See also

Splitting data for training and testing

How to do it...

How it works...

There's more...

See also

Common operations with the new Dataset API

How to do it...

How it works...

There's more...

See also

Creating and using RDD versus DataFrame versus Dataset from a text file in Spark 2.0

How to do it...

How it works...

There's more...

See also

LabeledPoint data structure for Spark ML

How to do it...

How it works...

There's more...

See also

Getting access to a Spark cluster in Spark 2.0

How to do it...

How it works...

There's more...

See also

Getting access to a Spark cluster pre-Spark 2.0

How to do it...

How it works...

There's more...

See also

Getting access to the SparkContext vis-à-vis the SparkSession object in Spark 2.0

How to do it...

How it works...

There's more...

See also

New model export and PMML markup in Spark 2.0

How to do it...

How it works...

There's more...

See also

Regression model evaluation using Spark 2.0

How to do it...

How it works...

There's more...

See also

Binary classification model evaluation using Spark 2.0

How to do it...

How it works...

There's more...

See also

Multiclass classification model evaluation using Spark 2.0

How to do it...

How it works...

There's more...

See also

Multilabel classification model evaluation using Spark 2.0

How to do it...

How it works...

There's more...

See also

Using the Scala Breeze library to do graphics in Spark 2.0

How to do it...

How it works...

There's more...

See also

Recommendation Engine that Scales with Spark

Introduction

Content filtering

Collaborative filtering

Neighborhood method

Latent factor models techniques

Setting up the required data for a scalable recommendation engine in Spark 2.0

How to do it...

How it works...

There's more...

See also

Exploring the movies data details for the recommendation system in Spark 2.0

How to do it...

How it works...

There's more...

See also

Exploring the ratings data details for the recommendation system in Spark 2.0

How to do it...

How it works...

There's more...

See also

Building a scalable recommendation engine using collaborative filtering in Spark 2.0

How to do it...

How it works...

There's more...

See also

Dealing with implicit input for training

Unsupervised Clustering with Apache Spark 2.0

Introduction

Building a KMeans classifying system in Spark 2.0

How to do it...

How it works...

KMeans (Lloyd's algorithm)

KMeans++ (Arthur's algorithm)

KMeans|| (pronounced as KMeans Parallel)

There's more...

See also

Bisecting KMeans, the new kid on the block in Spark 2.0

How to do it...

How it works...

There's more...

See also

Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data

How to do it...

How it works...

New GaussianMixture()

There's more...

See also

Classifying the vertices of a graph using Power Iteration Clustering (PIC) in Spark 2.0

How to do it...

How it works...

There's more...

See also

Latent Dirichlet Allocation (LDA) to classify documents and text into topics

How to do it...

How it works...

There's more...

See also

Streaming KMeans to classify data in near real-time

How to do it...

How it works...

There's more...

See also

Implementing Text Analytics with Spark 2.0 ML Library

Introduction

Doing term frequency with Spark - everything that counts

How to do it...

How it works...

There's more...

See also

Displaying similar words with Spark using Word2Vec

How to do it...

How it works...

There's more...

See also

Downloading a complete dump of Wikipedia for a real-life Spark ML project

How to do it...

There's more...

See also

Using Latent Semantic Analysis for text analytics with Spark 2.0

How to do it...

How it works...

There's more...

See also

Topic modeling with Latent Dirichlet allocation in Spark 2.0

How to do it...

How it works...

There's more...

See also

Spark Streaming and Machine Learning Library

Introduction

Structured streaming for near real-time machine learning

How to do it...

How it works...

There's more...

See also

Streaming DataFrames for real-time machine learning

How to do it...

How it works...

There's more...

See also

Streaming Datasets for real-time machine learning

How to do it...

How it works...

There's more...

See also

Streaming data and debugging with queueStream

How to do it...

How it works...

See also

Downloading and understanding the famous Iris data for unsupervised classification

How to do it...

How it works...

There's more...

See also

Streaming KMeans for a real-time on-line classifier

How to do it...

How it works...

There's more...

See also

Downloading wine quality data for streaming regression

How to do it...

How it works...

There's more...

Streaming linear regression for a real-time regression

How to do it...

How it works...

There's more...

See also

Downloading Pima Diabetes data for supervised classification

How to do it...

How it works...

There's more...

See also

Streaming logistic regression for an on-line classifier

How to do it...

How it works...

There's more...

See also

Other Books You May Enjoy

Leave a review - let other readers know what you think
