Mastering Apache Spark 2.x - Second Edition

Author: Romeo Kienzler

Publisher: Packt Publishing

Publication date: 2017-07-26

Word count: 375,000

Book Description

Advanced analytics on your Big Data with the latest Apache Spark 2.x

About This Book

  • An advanced guide combining instructions with practical examples to extend the most up-to-date Spark functionality.
  • Extend your data processing capabilities to process huge chunks of data in minimum time using advanced concepts in Spark.
  • Master the art of real-time processing with the help of Apache Spark 2.x.

Who This Book Is For

If you are a developer with some experience with Spark and want to strengthen your knowledge of how to get around in the world of Spark, then this book is ideal for you. Basic knowledge of Linux, Hadoop, and Spark is assumed. Reasonable knowledge of Scala is expected.

What You Will Learn

  • Examine advanced machine learning and deep learning with MLlib, SparkML, SystemML, H2O, and DeepLearning4j
  • Study highly optimized, unified batch and real-time data processing using SparkSQL and Structured Streaming
  • Evaluate large-scale graph processing and analysis using GraphX and GraphFrames
  • Apply Apache Spark in elastic deployments using Jupyter and Zeppelin notebooks, Docker, Kubernetes, and the IBM Cloud
  • Understand the internal details of the cost-based optimizers used in Catalyst, SystemML, and GraphFrames
  • Learn how specific parameter settings affect the overall performance of an Apache Spark cluster
  • Leverage Scala, R, and Python for your data science projects

In Detail

Apache Spark is an in-memory, cluster-based parallel processing system that provides a wide range of functionality such as graph processing, machine learning, stream processing, and SQL. This book aims to take your knowledge of Spark to the next level by teaching you how to expand Spark's functionality and implement your data flows and machine/deep learning programs on top of the platform. The book commences with an overview of the Spark ecosystem. It introduces you to Project Tungsten and Catalyst, two of the major advancements of Apache Spark 2.x. You will understand how memory management and binary processing, cache-aware computation, and code generation are used to speed things up dramatically. The book goes on to show how to incorporate H2O, SystemML, and Deeplearning4j for machine learning, and Jupyter notebooks and Kubernetes/Docker for cloud-based Spark. Along the way, you will learn about the latest enhancements in Apache Spark 2.x, such as interactive querying of live data and the unification of DataFrames and Datasets. You will also learn about the updates to the APIs and how DataFrames and Datasets affect SQL, machine learning, graph processing, and streaming. You will learn to use Spark as a big data operating system, understand how to implement advanced analytics on the new APIs, and explore how easy it is to use Spark in day-to-day tasks.

Style and approach

This book is an extensive guide to Apache Spark modules and tools, showing how Spark's functionality can be extended for real-time processing and storage with worked examples.
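
The unification of DataFrames and Datasets mentioned above is easy to see in code: since Spark 2.x, a DataFrame is simply an alias for Dataset[Row], and both are created through the single SparkSession entry point. The following minimal Scala sketch (the Client case class, app name, and local master are illustrative, not taken from the book) shows the same data queried through SQL and through the typed Dataset API:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative domain type; any case class with matching column names works.
case class Client(name: String, age: Int)

object DataFrameDatasetExample {
  def main(args: Array[String]): Unit = {
    // SparkSession is the single entry point to structured data processing in Spark 2.x.
    val spark = SparkSession.builder()
      .appName("DataFrameDatasetExample")
      .master("local[*]") // local mode, for experimentation only
      .getOrCreate()

    import spark.implicits._ // provides toDF/toDS and the encoders needed by .as[T]

    // In Spark 2.x a DataFrame is just Dataset[Row] (untyped rows).
    val df = Seq(Client("Alice", 30), Client("Bob", 25)).toDF()

    // The same data viewed as a strongly typed Dataset[Client].
    val ds = df.as[Client]

    // The same query via SQL and via the typed API -- one unified engine underneath.
    df.createOrReplaceTempView("clients")
    spark.sql("SELECT name FROM clients WHERE age > 26").show()
    ds.filter(_.age > 26).map(_.name).show()

    spark.stop()
  }
}
```

Because both variants compile down to the same Catalyst logical plan, they benefit equally from the optimizer and Tungsten code generation covered in the chapters below.
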
Table of Contents

Title Page

Second Edition

Copyright

Mastering Apache Spark 2.x

Second Edition

Credits

About the Author

About the Reviewer

www.PacktPub.com

Why subscribe?

Customer Feedback

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

A First Taste and What’s New in Apache Spark V2

Spark machine learning

Spark Streaming

Spark SQL

Spark graph processing

Extended ecosystem

What's new in Apache Spark V2?

Cluster design

Cluster management

Local

Standalone

Apache YARN

Apache Mesos

Cloud-based deployments

Performance

The cluster structure

Hadoop Distributed File System

Data locality

Memory

Coding

Cloud

Summary

Apache Spark SQL

The SparkSession - your gateway to structured data processing

Importing and saving data

Processing text files

Processing JSON files

Processing Parquet files

Understanding the DataSource API

Implicit schema discovery

Predicate push-down on smart data sources

DataFrames

Using SQL

Defining schemas manually

Using SQL subqueries

Applying SQL table joins

Using Datasets

The Dataset API in action

User-defined functions

RDDs versus DataFrames versus Datasets

Summary

The Catalyst Optimizer

Understanding the workings of the Catalyst Optimizer

Managing temporary views with the catalog API

The SQL abstract syntax tree

How to go from Unresolved Logical Execution Plan to Resolved Logical Execution Plan

Internal class and object representations of LEPs

How to optimize the Resolved Logical Execution Plan

Physical Execution Plan generation and selection

Code generation

Practical examples

Using the explain method to obtain the PEP

How smart data sources work internally

Summary

Project Tungsten

Memory management beyond the Java Virtual Machine Garbage Collector

Understanding the UnsafeRow object

The null bit set region

The fixed length values region

The variable length values region

Understanding the BytesToBytesMap

A practical example on memory usage and performance

Cache-friendly layout of data in memory

Cache eviction strategies and pre-fetching

Code generation

Understanding columnar storage

Understanding whole stage code generation

A practical example on whole stage code generation performance

Operator fusing versus the volcano iterator model

Summary

Apache Spark Streaming

Overview

Errors and recovery

Checkpointing

Streaming sources

TCP stream

File streams

Flume

Kafka

Summary

Structured Streaming

The concept of continuous applications

True unification - same code, same engine

Windowing

How streaming engines use windowing

How Apache Spark improves windowing

Increased performance with good old friends

How transparent fault tolerance and the exactly-once delivery guarantee are achieved

Replayable sources can replay streams from a given offset

Idempotent sinks prevent data duplication

State versioning guarantees consistent results after reruns

Example - connecting to an MQTT message broker

Controlling continuous applications

More on stream life cycle management

Summary

Apache Spark MLlib

Architecture

The development environment

Classification with Naive Bayes

Theory on Classification

Naive Bayes in practice

Clustering with K-Means

Theory on Clustering

K-Means in practice

Artificial neural networks

ANN in practice

Summary

Apache SparkML

What does the new API look like?

The concept of pipelines

Transformers

String indexer

OneHotEncoder

VectorAssembler

Pipelines

Estimators

RandomForestClassifier

Model evaluation

CrossValidation and hyperparameter tuning

CrossValidation

Hyperparameter tuning

Winning a Kaggle competition with Apache SparkML

Data preparation

Feature engineering

Testing the feature engineering pipeline

Training the machine learning model

Model evaluation

CrossValidation and hyperparameter tuning

Using the evaluator to assess the quality of the cross-validated and tuned model

Summary

Apache SystemML

Why do we need just another library?

Why on Apache Spark?

The history of Apache SystemML

A cost-based optimizer for machine learning algorithms

An example - alternating least squares

Apache SystemML architecture

Language parsing

High-level operators are generated

How low-level operators are optimized

Performance measurements

Apache SystemML in action

Summary

Deep Learning on Apache Spark with DeepLearning4j and H2O

H2O

Overview

The build environment

Architecture

Sourcing the data

Data quality

Performance tuning

Deep Learning

Example code – income

Example code – MNIST

H2O Flow

Deeplearning4j

ND4J - high-performance linear algebra for the JVM

Deeplearning4j

Example: an IoT real-time anomaly detector

Mastering chaos: the Lorenz attractor model

Deploying the test data generator

Deploying the Node-RED IoT Starter Boilerplate to the IBM Cloud

Deploying the test data generator flow

Testing the test data generator

Installing the Deeplearning4j example within Eclipse

Running the examples in Eclipse

Running the examples in Apache Spark

Summary

Apache Spark GraphX

Overview

Graph analytics/processing with GraphX

The raw data

Creating a graph

Example 1 – counting

Example 2 – filtering

Example 3 – PageRank

Example 4 – triangle counting

Example 5 – connected components

Summary

Apache Spark GraphFrames

Architecture

Graph-relational translation

Materialized views

Join elimination

Join reordering

Examples

Example 1 – counting

Example 2 – filtering

Example 3 – PageRank

Example 4 – triangle counting

Example 5 – connected components

Summary

Apache Spark with Jupyter Notebooks on IBM DataScience Experience

Why notebooks are the new standard

Learning by example

The IEEE PHM 2012 data challenge bearing dataset

ETL with Scala

Interactive, exploratory analysis using Python and Pixiedust

Real data science work with SparkR

Summary

Apache Spark on Kubernetes

Bare metal, virtual machines, and containers

Containerization

Namespaces

Control groups

Linux containers

Understanding the core concepts of Docker

Understanding Kubernetes

Using Kubernetes for provisioning containerized Spark applications

Example - Apache Spark on Kubernetes

Prerequisites

Deploying the Apache Spark master

Deploying the Apache Spark workers

Deploying the Zeppelin notebooks

Summary
