Title Page
Copyright
Scala and Spark for Big Data Analytics
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Introduction to Scala
History and purposes of Scala
Platforms and editors
Installing and setting up Scala
Installing Java
Windows
Mac OS
Using Homebrew installer
Installing manually
Linux
Scala: the scalable language
Scala is object-oriented
Scala is functional
Scala is statically typed
Scala runs on the JVM
Scala can execute Java code
Scala can do concurrent and synchronized processing
Scala for Java programmers
All types are objects
Type inference
Scala REPL
Nested functions
Import statements
Operators as methods
Methods and parameter lists
Methods inside methods
Constructor in Scala
Objects instead of static methods
Traits
Scala for the beginners
Your first line of code
I'm the hello world program, explain me well!
Run Scala interactively!
Compile it!
Execute it with Scala command
Summary
Object-Oriented Scala
Variables in Scala
Reference versus value immutability
Data types in Scala
Variable initialization
Type annotations
Type ascription
Lazy val
Methods, classes, and objects in Scala
Methods in Scala
The return in Scala
Classes in Scala
Objects in Scala
Singleton and companion objects
Companion objects
Comparing and contrasting: val and final
Access and visibility
Constructors
Traits in Scala
A trait syntax
Extending traits
Abstract classes
Abstract classes and the override keyword
Case classes in Scala
Packages and package objects
Java interoperability
Pattern matching
Implicit in Scala
Generic in Scala
Defining a generic class
SBT and other build systems
Build with SBT
Maven with Eclipse
Gradle with Eclipse
Summary
Functional Programming Concepts
Introduction to functional programming
Advantages of functional programming
Functional Scala for the data scientists
Why FP and Scala for learning Spark?
Why Spark?
Scala and the Spark programming model
Scala and the Spark ecosystem
Pure functions and higher-order functions
Pure functions
Anonymous functions
Higher-order functions
Function as a return value
Using higher-order functions
Error handling in functional Scala
Failure and exceptions in Scala
Throwing exceptions
Catching exception using try and catch
Finally
Creating an Either
Future
Run one task, but block
Functional programming and data mutability
Summary
Collection APIs
Scala collection APIs
Types and hierarchies
Traversable
Iterable
Seq, LinearSeq, and IndexedSeq
Mutable and immutable
Arrays
Lists
Sets
Tuples
Maps
Option
Exists
Forall
Filter
Map
Take
GroupBy
Init
Drop
TakeWhile
DropWhile
FlatMap
Performance characteristics
Performance characteristics of collection objects
Memory usage by collection objects
Java interoperability
Using Scala implicits
Implicit conversions in Scala
Summary
Tackle Big Data – Spark Comes to the Party
Introduction to data analytics
Inside the data analytics process
Introduction to big data
4 Vs of big data
Variety of Data
Velocity of Data
Volume of Data
Veracity of Data
Distributed computing using Apache Hadoop
Hadoop Distributed File System (HDFS)
HDFS High Availability
HDFS Federation
HDFS Snapshot
HDFS Read
HDFS Write
MapReduce framework
Here comes Apache Spark
Spark core
Spark SQL
Spark streaming
Spark GraphX
Spark ML
PySpark
SparkR
Summary
Start Working with Spark – REPL and RDDs
Dig deeper into Apache Spark
Apache Spark installation
Spark standalone
Spark on YARN
YARN client mode
YARN cluster mode
Spark on Mesos
Introduction to RDDs
RDD Creation
Parallelizing a collection
Reading data from an external source
Transformation of an existing RDD
Streaming API
Using the Spark shell
Actions and Transformations
Transformations
General transformations
Math/Statistical transformations
Set theory/relational transformations
Data structure-based transformations
map function
flatMap function
filter function
coalesce
repartition
Actions
reduce
count
collect
Caching
Loading and saving data
Loading data
textFile
wholeTextFiles
Load from a JDBC Datasource
Saving RDD
Summary
Special RDD Operations
Types of RDDs
Pair RDD
DoubleRDD
SequenceFileRDD
CoGroupedRDD
ShuffledRDD
UnionRDD
HadoopRDD
NewHadoopRDD
Aggregations
groupByKey
reduceByKey
aggregateByKey
combineByKey
Comparison of groupByKey, reduceByKey, combineByKey, and aggregateByKey
Partitioning and shuffling
Partitioners
HashPartitioner
RangePartitioner
Shuffling
Narrow Dependencies
Wide Dependencies
Broadcast variables
Creating broadcast variables
Cleaning broadcast variables
Destroying broadcast variables
Accumulators
Summary
Introduce a Little Structure - Spark SQL
Spark SQL and DataFrames
DataFrame API and SQL API
Pivots
Filters
User-Defined Functions (UDFs)
Schema structure of data
Implicit schema
Explicit schema
Encoders
Loading and saving datasets
Loading datasets
Saving datasets
Aggregations
Aggregate functions
Count
First
Last
approx_count_distinct
Min
Max
Average
Sum
Kurtosis
Skewness
Variance
Standard deviation
Covariance
groupBy
Rollup
Cube
Window functions
ntiles
Joins
Inner workings of join
Shuffle join
Broadcast join
Join types
Inner join
Left outer join
Right outer join
Outer join
Left anti join
Left semi join
Cross join
Performance implications of join
Summary
Stream Me Up, Scotty - Spark Streaming
A brief introduction to streaming
At least once processing
At most once processing
Exactly once processing
Spark Streaming
StreamingContext
Creating StreamingContext
Starting StreamingContext
Stopping StreamingContext
Input streams
receiverStream
socketTextStream
rawSocketStream
fileStream
textFileStream
binaryRecordsStream
queueStream
textFileStream example
twitterStream example
Discretized streams
Transformations
Window operations
Stateful/stateless transformations
Stateless transformations
Stateful transformations
Checkpointing
Metadata checkpointing
Data checkpointing
Driver failure recovery
Interoperability with streaming platforms (Apache Kafka)
Receiver-based approach
Direct stream
Structured streaming
Structured streaming
Handling Event-time and late data
Fault tolerance semantics
Summary
Everything is Connected - GraphX
A brief introduction to graph theory
GraphX
VertexRDD and EdgeRDD
VertexRDD
EdgeRDD
Graph operators
Filter
MapValues
aggregateMessages
TriangleCounting
Pregel API
ConnectedComponents
Traveling salesman problem
ShortestPaths
PageRank
Summary
Learning Machine Learning - Spark MLlib and Spark ML
Introduction to machine learning
Typical machine learning workflow
Machine learning tasks
Supervised learning
Unsupervised learning
Reinforcement learning
Recommender system
Semisupervised learning
Spark machine learning APIs
Spark machine learning libraries
Spark MLlib
Spark ML
Spark MLlib or Spark ML?
Feature extraction and transformation
CountVectorizer
Tokenizer
StopWordsRemover
StringIndexer
OneHotEncoder
Spark ML pipelines
Dataset abstraction
Creating a simple pipeline
Unsupervised machine learning
Dimensionality reduction
PCA
Using PCA
Regression Analysis - a practical use of PCA
Dataset collection and exploration
What is regression analysis?
Binary and multiclass classification
Performance metrics
Binary classification using logistic regression
Breast cancer prediction using logistic regression of Spark ML
Dataset collection
Developing the pipeline using Spark ML
Multiclass classification using logistic regression
Improving classification accuracy using random forests
Classifying MNIST dataset using random forest
Summary
Advanced Machine Learning Best Practices
Machine learning best practices
Beware of overfitting and underfitting
Stay tuned with Spark MLlib and Spark ML
Choosing the right algorithm for your application
Considerations when choosing an algorithm
Accuracy
Training time
Linearity
Inspect your data when choosing an algorithm
Number of parameters
How large is your training set?
Number of features
Hyperparameter tuning of ML models
Hyperparameter tuning
Grid search parameter tuning
Cross-validation
Credit risk analysis – An example of hyperparameter tuning
What is credit risk analysis? Why is it important?
The dataset exploration
Step-by-step example with Spark ML
A recommendation system with Spark
Model-based recommendation with Spark
Data exploration
Movie recommendation using ALS
Topic modeling - A best practice for text clustering
How does LDA work?
Topic modeling with Spark MLlib
Scalability of LDA
Summary
My Name is Bayes, Naive Bayes
Multinomial classification
Transformation to binary
Classification using One-Vs-The-Rest approach
Exploration and preparation of the OCR dataset
Hierarchical classification
Extension from binary
Bayesian inference
An overview of Bayesian inference
What is inference?
How does it work?
Naive Bayes
An overview of Bayes' theorem
My name is Bayes, Naive Bayes
Building a scalable classifier with NB
Tune me up!
The decision trees
Advantages and disadvantages of using DTs
Decision tree versus Naive Bayes
Building a scalable classifier with DT algorithm
Summary
Time to Put Some Order - Cluster Your Data with Spark MLlib
Unsupervised learning
Unsupervised learning example
Clustering techniques
Unsupervised learning and the clustering
Hierarchical clustering
Centroid-based clustering
Distribution-based clustering
Centroid-based clustering (CC)
Challenges in CC algorithm
How does K-means algorithm work?
An example of clustering using K-means of Spark MLlib
Hierarchical clustering (HC)
An overview of HC algorithm and challenges
Bisecting K-means with Spark MLlib
Bisecting K-means clustering of the neighborhood using Spark MLlib
Distribution-based clustering (DC)
Challenges in DC algorithm
How does a Gaussian mixture model work?
An example of clustering using GMM with Spark MLlib
Determining number of clusters
A comparative analysis between clustering algorithms
Submitting Spark job for cluster analysis
Summary
Text Analytics Using Spark ML
Understanding text analytics
Text analytics
Sentiment analysis
Topic modeling
TF-IDF (term frequency - inverse document frequency)
Named entity recognition (NER)
Event extraction
Transformers and Estimators
Standard Transformer
Estimator Transformer
Tokenization
StopWordsRemover
NGrams
TF-IDF
HashingTF
Inverse Document Frequency (IDF)
Word2Vec
CountVectorizer
Topic modeling using LDA
Implementing text classification
Summary
Spark Tuning
Monitoring Spark jobs
Spark web interface
Jobs
Stages
Storage
Environment
Executors
SQL
Visualizing Spark application using web UI
Observing the running and completed Spark jobs
Debugging Spark applications using logs
Logging with log4j with Spark
Spark configuration
Spark properties
Environmental variables
Logging
Common mistakes in Spark app development
Application failure
Slow jobs or unresponsiveness
Optimization techniques
Data serialization
Memory tuning
Memory usage and management
Tuning the data structures
Serialized RDD storage
Garbage collection tuning
Level of parallelism
Broadcasting
Data locality
Summary
Time to Go to ClusterLand - Deploying Spark on a Cluster
Spark architecture in a cluster
Spark ecosystem in brief
Cluster design
Cluster management
Pseudocluster mode (aka Spark local)
Standalone
Apache YARN
Apache Mesos
Cloud-based deployments
Deploying the Spark application on a cluster
Submitting Spark jobs
Running Spark jobs locally and in standalone
Hadoop YARN
Configuring a single-node YARN cluster
Step 1: Downloading Apache Hadoop
Step 2: Setting the JAVA_HOME
Step 3: Creating users and groups
Step 4: Creating data and log directories
Step 5: Configuring core-site.xml
Step 6: Configuring hdfs-site.xml
Step 7: Configuring mapred-site.xml
Step 8: Configuring yarn-site.xml
Step 9: Setting Java heap space
Step 10: Formatting HDFS
Step 11: Starting the HDFS
Step 12: Starting YARN
Step 13: Verifying on the web UI
Submitting Spark jobs on YARN cluster
Advanced job submissions in a YARN cluster
Apache Mesos
Client mode
Cluster mode
Deploying on AWS
Step 1: Key pair and access key configuration
Step 2: Configuring Spark cluster on EC2
Step 3: Running Spark jobs on the AWS cluster
Step 4: Pausing, restarting, and terminating the Spark cluster
Summary
Testing and Debugging Spark
Testing in a distributed environment
Distributed environment
Issues in a distributed system
Challenges of software testing in a distributed environment
Testing Spark applications
Testing Scala methods
Unit testing
Testing Spark applications
Method 1: Using Scala JUnit test
Method 2: Testing Scala code using FunSuite
Method 3: Making life easier with Spark testing base
Configuring Hadoop runtime on Windows
Debugging Spark applications
Logging with log4j with Spark recap
Debugging the Spark application
Debugging Spark application on Eclipse as Scala debug
Debugging Spark jobs running as local and standalone mode
Debugging Spark applications on YARN or Mesos cluster
Debugging Spark application using SBT
Summary
PySpark and SparkR
Introduction to PySpark
Installation and configuration
By setting SPARK_HOME
Using Python shell
By setting PySpark on Python IDEs
Getting started with PySpark
Working with DataFrames and RDDs
Reading a dataset in Libsvm format
Reading a CSV file
Reading and manipulating raw text files
Writing UDF on PySpark
Let's do some analytics with k-means clustering
Introduction to SparkR
Why SparkR?
Installing and getting started
Getting started
Using external data source APIs
Data manipulation
Querying SparkR DataFrame
Visualizing your data on RStudio
Summary
Accelerating Spark with Alluxio
The need for Alluxio
Getting started with Alluxio
Downloading Alluxio
Installing and running Alluxio locally
Overview
Browse
Configuration
Workers
In-Memory Data
Logs
Metrics
Current features
Integration with YARN
Alluxio worker memory
Alluxio master memory
CPU vcores
Using Alluxio with Spark
Summary
Interactive Data Analytics with Apache Zeppelin
Introduction to Apache Zeppelin
Installation and getting started
Installation and configuration
Building from source
Starting and stopping Apache Zeppelin
Creating notebooks
Configuring the interpreter
Data processing and visualization
Complex data analytics with Zeppelin
The problem definition
Dataset description and exploration
Data and results collaboration
Summary