Scala and Spark for Big Data Analytics (eBook)


Authors: Md. Rezaul Karim, Sridhar Alla

Publisher: Packt Publishing

Publication date: 2017-07-25

Length: 1.015 million characters

Category: Imported Books > Foreign-Language Originals > Computers/Internet

Harness the power of Scala to program Spark and analyze tonnes of data in the blink of an eye!

About This Book

  • Learn Scala's sophisticated type system, which combines functional programming and object-oriented concepts
  • Work on a wide array of applications, from simple batch jobs to stream processing and machine learning
  • Explore common as well as more complex use cases for performing large-scale data analysis with Spark

Who This Book Is For

Anyone who wishes to learn how to perform data analysis by harnessing the power of Spark will find this book extremely useful. No knowledge of Spark or Scala is assumed, although prior programming experience (especially with other JVM languages) will help you pick up the concepts more quickly.

What You Will Learn

  • Understand the object-oriented and functional programming concepts of Scala
  • Gain an in-depth understanding of the Scala collection APIs
  • Work with RDDs and DataFrames to learn Spark's core abstractions
  • Analyze structured and unstructured data using Spark SQL and GraphX
  • Develop scalable and fault-tolerant streaming applications using Spark Structured Streaming
  • Learn machine learning best practices for classification, regression, dimensionality reduction, and recommender systems to build predictive models with the widely used algorithms in Spark MLlib and ML
  • Build clustering models to cluster vast amounts of data
  • Understand tuning, debugging, and monitoring of Spark applications
  • Deploy Spark applications on real clusters in standalone, Mesos, and YARN modes

In Detail

Scala has seen wide adoption over the past few years, especially in the fields of data science and analytics. Spark, built on Scala, has gained a lot of recognition and is widely used in production. So if you want to leverage the power of Scala and Spark to make sense of big data, this book is for you.

The first part introduces you to Scala, helping you understand the object-oriented and functional programming concepts needed for Spark application development. It then moves on to Spark, covering its core abstractions, RDDs and DataFrames. This will help you develop scalable and fault-tolerant streaming applications by analyzing structured and unstructured data using Spark SQL, GraphX, and Spark Structured Streaming. Finally, the book moves on to advanced topics such as monitoring, configuration, debugging, testing, and deployment. You will also learn how to develop Spark applications using the SparkR and PySpark APIs, perform interactive data analytics with Zeppelin, and process data in memory with Alluxio. By the end of this book, you will have a thorough understanding of Spark, and you will be able to perform full-stack data analytics, confident that no amount of data is too big.

Style and approach

Filled with practical examples and use cases, this book will not only help you get up and running with Spark, but will also take you farther down the road to becoming a data scientist.
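As a small taste of the functional style the book teaches, here is a hedged sketch in plain Scala (no Spark dependency; the data and names are purely illustrative) showing how the collection methods map, filter, and reduce mirror the transformations and actions you later apply to Spark RDDs:

```scala
object WordStats {
  def main(args: Array[String]): Unit = {
    // Illustrative data only; in Spark this would be a distributed RDD or DataFrame
    val words = List("scala", "spark", "big", "data")

    // Transformation-style operations: map and filter build new immutable collections
    val lengths   = words.map(_.length)          // List(5, 5, 3, 4)
    val longWords = words.filter(_.length > 3)   // List("scala", "spark", "data")

    // Action-style operation: reduce collapses the collection to a single value
    val totalChars = lengths.reduce(_ + _)       // 17

    println(s"longWords=$longWords totalChars=$totalChars")
  }
}
```

The same call chain carries over almost verbatim to Spark, e.g. `sc.parallelize(words).map(_.length).reduce(_ + _)`, except that Spark evaluates the transformations lazily and distributes them across a cluster.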
Table of Contents

Title Page

Copyright

Scala and Spark for Big Data Analytics

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Why subscribe?

Customer Feedback

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Introduction to Scala

History and purposes of Scala

Platforms and editors

Installing and setting up Scala

Installing Java

Windows

Mac OS

Using Homebrew installer

Installing manually

Linux

Scala: the scalable language

Scala is object-oriented

Scala is functional

Scala is statically typed

Scala runs on the JVM

Scala can execute Java code

Scala can do concurrent and synchronized processing

Scala for Java programmers

All types are objects

Type inference

Scala REPL

Nested functions

Import statements

Operators as methods

Methods and parameter lists

Methods inside methods

Constructor in Scala

Objects instead of static methods

Traits

Scala for the beginners

Your first line of code

I'm the hello world program, explain me well!

Run Scala interactively!

Compile it!

Execute it with Scala command

Summary

Object-Oriented Scala

Variables in Scala

Reference versus value immutability

Data types in Scala

Variable initialization

Type annotations

Type ascription

Lazy val

Methods, classes, and objects in Scala

Methods in Scala

The return in Scala

Classes in Scala

Objects in Scala

Singleton and companion objects

Companion objects

Comparing and contrasting: val and final

Access and visibility

Constructors

Traits in Scala

A trait syntax

Extending traits

Abstract classes

Abstract classes and the override keyword

Case classes in Scala

Packages and package objects

Java interoperability

Pattern matching

Implicit in Scala

Generic in Scala

Defining a generic class

SBT and other build systems

Build with SBT

Maven with Eclipse

Gradle with Eclipse

Summary

Functional Programming Concepts

Introduction to functional programming

Advantages of functional programming

Functional Scala for the data scientists

Why FP and Scala for learning Spark?

Why Spark?

Scala and the Spark programming model

Scala and the Spark ecosystem

Pure functions and higher-order functions

Pure functions

Anonymous functions

Higher-order functions

Function as a return value

Using higher-order functions

Error handling in functional Scala

Failure and exceptions in Scala

Throwing exceptions

Catching exception using try and catch

Finally

Creating an Either

Future

Run one task, but block

Functional programming and data mutability

Summary

Collection APIs

Scala collection APIs

Types and hierarchies

Traversable

Iterable

Seq, LinearSeq, and IndexedSeq

Mutable and immutable

Arrays

Lists

Sets

Tuples

Maps

Option

Exists

Forall

Filter

Map

Take

GroupBy

Init

Drop

TakeWhile

DropWhile

FlatMap

Performance characteristics

Performance characteristics of collection objects

Memory usage by collection objects

Java interoperability

Using Scala implicits

Implicit conversions in Scala

Summary

Tackle Big Data – Spark Comes to the Party

Introduction to data analytics

Inside the data analytics process

Introduction to big data

4 Vs of big data

Variety of Data

Velocity of Data

Volume of Data

Veracity of Data

Distributed computing using Apache Hadoop

Hadoop Distributed File System (HDFS)

HDFS High Availability

HDFS Federation

HDFS Snapshot

HDFS Read

HDFS Write

MapReduce framework

Here comes Apache Spark

Spark core

Spark SQL

Spark streaming

Spark GraphX

Spark ML

PySpark

SparkR

Summary

Start Working with Spark – REPL and RDDs

Dig deeper into Apache Spark

Apache Spark installation

Spark standalone

Spark on YARN

YARN client mode

YARN cluster mode

Spark on Mesos

Introduction to RDDs

RDD Creation

Parallelizing a collection

Reading data from an external source

Transformation of an existing RDD

Streaming API

Using the Spark shell

Actions and Transformations

Transformations

General transformations

Math/Statistical transformations

Set theory/relational transformations

Data structure-based transformations

map function

flatMap function

filter function

coalesce

repartition

Actions

reduce

count

collect

Caching

Loading and saving data

Loading data

textFile

wholeTextFiles

Load from a JDBC Datasource

Saving RDD

Summary

Special RDD Operations

Types of RDDs

Pair RDD

DoubleRDD

SequenceFileRDD

CoGroupedRDD

ShuffledRDD

UnionRDD

HadoopRDD

NewHadoopRDD

Aggregations

groupByKey

reduceByKey

aggregateByKey

combineByKey

Comparison of groupByKey, reduceByKey, combineByKey, and aggregateByKey

Partitioning and shuffling

Partitioners

HashPartitioner

RangePartitioner

Shuffling

Narrow Dependencies

Wide Dependencies

Broadcast variables

Creating broadcast variables

Cleaning broadcast variables

Destroying broadcast variables

Accumulators

Summary

Introduce a Little Structure - Spark SQL

Spark SQL and DataFrames

DataFrame API and SQL API

Pivots

Filters

User-Defined Functions (UDFs)

Schema - structure of data

Implicit schema

Explicit schema

Encoders

Loading and saving datasets

Loading datasets

Saving datasets

Aggregations

Aggregate functions

Count

First

Last

approx_count_distinct

Min

Max

Average

Sum

Kurtosis

Skewness

Variance

Standard deviation

Covariance

groupBy

Rollup

Cube

Window functions

ntiles

Joins

Inner workings of join

Shuffle join

Broadcast join

Join types

Inner join

Left outer join

Right outer join

Outer join

Left anti join

Left semi join

Cross join

Performance implications of join

Summary

Stream Me Up, Scotty - Spark Streaming

A Brief introduction to streaming

At least once processing

At most once processing

Exactly once processing

Spark Streaming

StreamingContext

Creating StreamingContext

Starting StreamingContext

Stopping StreamingContext

Input streams

receiverStream

socketTextStream

rawSocketStream

fileStream

textFileStream

binaryRecordsStream

queueStream

textFileStream example

twitterStream example

Discretized streams

Transformations

Window operations

Stateful/stateless transformations

Stateless transformations

Stateful transformations

Checkpointing

Metadata checkpointing

Data checkpointing

Driver failure recovery

Interoperability with streaming platforms (Apache Kafka)

Receiver-based approach

Direct stream

Structured streaming

Handling Event-time and late data

Fault tolerance semantics

Summary

Everything is Connected - GraphX

A brief introduction to graph theory

GraphX

VertexRDD and EdgeRDD

VertexRDD

EdgeRDD

Graph operators

Filter

MapValues

aggregateMessages

TriangleCounting

Pregel API

ConnectedComponents

Traveling salesman problem

ShortestPaths

PageRank

Summary

Learning Machine Learning - Spark MLlib and Spark ML

Introduction to machine learning

Typical machine learning workflow

Machine learning tasks

Supervised learning

Unsupervised learning

Reinforcement learning

Recommender system

Semisupervised learning

Spark machine learning APIs

Spark machine learning libraries

Spark MLlib

Spark ML

Spark MLlib or Spark ML?

Feature extraction and transformation

CountVectorizer

Tokenizer

StopWordsRemover

StringIndexer

OneHotEncoder

Spark ML pipelines

Dataset abstraction

Creating a simple pipeline

Unsupervised machine learning

Dimensionality reduction

PCA

Using PCA

Regression Analysis - a practical use of PCA

Dataset collection and exploration

What is regression analysis?

Binary and multiclass classification

Performance metrics

Binary classification using logistic regression

Breast cancer prediction using logistic regression of Spark ML

Dataset collection

Developing the pipeline using Spark ML

Multiclass classification using logistic regression

Improving classification accuracy using random forests

Classifying MNIST dataset using random forest

Summary

Advanced Machine Learning Best Practices

Machine learning best practices

Beware of overfitting and underfitting

Stay tuned with Spark MLlib and Spark ML

Choosing the right algorithm for your application

Considerations when choosing an algorithm

Accuracy

Training time

Linearity

Inspect your data when choosing an algorithm

Number of parameters

How large is your training set?

Number of features

Hyperparameter tuning of ML models

Hyperparameter tuning

Grid search parameter tuning

Cross-validation

Credit risk analysis – An example of hyperparameter tuning

What is credit risk analysis? Why is it important?

The dataset exploration

Step-by-step example with Spark ML

A recommendation system with Spark

Model-based recommendation with Spark

Data exploration

Movie recommendation using ALS

Topic modelling - A best practice for text clustering

How does LDA work?

Topic modeling with Spark MLlib

Scalability of LDA

Summary

My Name is Bayes, Naive Bayes

Multinomial classification

Transformation to binary

Classification using One-Vs-The-Rest approach

Exploration and preparation of the OCR dataset

Hierarchical classification

Extension from binary

Bayesian inference

An overview of Bayesian inference

What is inference?

How does it work?

Naive Bayes

An overview of Bayes' theorem

My name is Bayes, Naive Bayes

Building a scalable classifier with NB

Tune me up!

The decision trees

Advantages and disadvantages of using DTs

Decision tree versus Naive Bayes

Building a scalable classifier with DT algorithm

Summary

Time to Put Some Order - Cluster Your Data with Spark MLlib

Unsupervised learning

Unsupervised learning example

Clustering techniques

Unsupervised learning and the clustering

Hierarchical clustering

Centroid-based clustering

Distribution-based clustering

Centroid-based clustering (CC)

Challenges in CC algorithm

How does K-means algorithm work?

An example of clustering using K-means of Spark MLlib

Hierarchical clustering (HC)

An overview of HC algorithm and challenges

Bisecting K-means with Spark MLlib

Bisecting K-means clustering of the neighborhood using Spark MLlib

Distribution-based clustering (DC)

Challenges in DC algorithm

How does a Gaussian mixture model work?

An example of clustering using GMM with Spark MLlib

Determining number of clusters

A comparative analysis between clustering algorithms

Submitting Spark job for cluster analysis

Summary

Text Analytics Using Spark ML

Understanding text analytics

Text analytics

Sentiment analysis

Topic modeling

TF-IDF (term frequency - inverse document frequency)

Named entity recognition (NER)

Event extraction

Transformers and Estimators

Standard Transformer

Estimator Transformer

Tokenization

StopWordsRemover

NGrams

TF-IDF

HashingTF

Inverse Document Frequency (IDF)

Word2Vec

CountVectorizer

Topic modeling using LDA

Implementing text classification

Summary

Spark Tuning

Monitoring Spark jobs

Spark web interface

Jobs

Stages

Storage

Environment

Executors

SQL

Visualizing Spark application using web UI

Observing the running and completed Spark jobs

Debugging Spark applications using logs

Logging with log4j with Spark

Spark configuration

Spark properties

Environmental variables

Logging

Common mistakes in Spark app development

Application failure

Slow jobs or unresponsiveness

Optimization techniques

Data serialization

Memory tuning

Memory usage and management

Tuning the data structures

Serialized RDD storage

Garbage collection tuning

Level of parallelism

Broadcasting

Data locality

Summary

Time to Go to ClusterLand - Deploying Spark on a Cluster

Spark architecture in a cluster

Spark ecosystem in brief

Cluster design

Cluster management

Pseudocluster mode (aka Spark local)

Standalone

Apache YARN

Apache Mesos

Cloud-based deployments

Deploying the Spark application on a cluster

Submitting Spark jobs

Running Spark jobs locally and in standalone

Hadoop YARN

Configuring a single-node YARN cluster

Step 1: Downloading Apache Hadoop

Step 2: Setting the JAVA_HOME

Step 3: Creating users and groups

Step 4: Creating data and log directories

Step 5: Configuring core-site.xml

Step 6: Configuring hdfs-site.xml

Step 7: Configuring mapred-site.xml

Step 8: Configuring yarn-site.xml

Step 9: Setting Java heap space

Step 10: Formatting HDFS

Step 11: Starting the HDFS

Step 12: Starting YARN

Step 13: Verifying on the web UI

Submitting Spark jobs on YARN cluster

Advanced job submissions in a YARN cluster

Apache Mesos

Client mode

Cluster mode

Deploying on AWS

Step 1: Key pair and access key configuration

Step 2: Configuring Spark cluster on EC2

Step 3: Running Spark jobs on the AWS cluster

Step 4: Pausing, restarting, and terminating the Spark cluster

Summary

Testing and Debugging Spark

Testing in a distributed environment

Distributed environment

Issues in a distributed system

Challenges of software testing in a distributed environment

Testing Spark applications

Testing Scala methods

Unit testing

Testing Spark applications

Method 1: Using Scala JUnit test

Method 2: Testing Scala code using FunSuite

Method 3: Making life easier with Spark testing base

Configuring Hadoop runtime on Windows

Debugging Spark applications

Logging with log4j with Spark recap

Debugging the Spark application

Debugging Spark application on Eclipse as Scala debug

Debugging Spark jobs running as local and standalone mode

Debugging Spark applications on YARN or Mesos cluster

Debugging Spark application using SBT

Summary

PySpark and SparkR

Introduction to PySpark

Installation and configuration

By setting SPARK_HOME

Using Python shell

By setting PySpark on Python IDEs

Getting started with PySpark

Working with DataFrames and RDDs

Reading a dataset in Libsvm format

Reading a CSV file

Reading and manipulating raw text files

Writing UDF on PySpark

Let's do some analytics with k-means clustering

Introduction to SparkR

Why SparkR?

Installing and getting started

Getting started

Using external data source APIs

Data manipulation

Querying SparkR DataFrame

Visualizing your data on RStudio

Summary

Accelerating Spark with Alluxio

The need for Alluxio

Getting started with Alluxio

Downloading Alluxio

Installing and running Alluxio locally

Overview

Browse

Configuration

Workers

In-Memory Data

Logs

Metrics

Current features

Integration with YARN

Alluxio worker memory

Alluxio master memory

CPU vcores

Using Alluxio with Spark

Summary

Interactive Data Analytics with Apache Zeppelin

Introduction to Apache Zeppelin

Installation and getting started

Installation and configuration

Building from source

Starting and stopping Apache Zeppelin

Creating notebooks

Configuring the interpreter

Data processing and visualization

Complex data analytics with Zeppelin

The problem definition

Dataset description and exploration

Data and results collaborating

Summary
