Big Data Analytics
Table of Contents
Big Data Analytics
Credits
About the Author
Acknowledgement
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Big Data Analytics at a 10,000-Foot View
Big Data analytics and the role of Hadoop and Spark
A typical Big Data analytics project life cycle
Identifying the problem and outcomes
Identifying the necessary data
Data collection
Preprocessing data and ETL
Performing analytics
Visualizing data
The role of Hadoop and Spark
Big Data science and the role of Hadoop and Spark
A fundamental shift from data analytics to data science
Data scientists versus software engineers
Data scientists versus data analysts
Data scientists versus business analysts
A typical data science project life cycle
Hypothesis and modeling
Measuring the effectiveness
Making improvements
Communicating the results
The role of Hadoop and Spark
Tools and techniques
Real-life use cases
Summary
2. Getting Started with Apache Hadoop and Apache Spark
Introducing Apache Hadoop
Hadoop Distributed File System
Features of HDFS
MapReduce
MapReduce features
MapReduce v1 versus MapReduce v2
MapReduce v1 challenges
YARN
Storage options on Hadoop
File formats
Sequence file
Protocol Buffers and Thrift
Avro
Parquet
RCFile and ORCFile
Compression formats
Standard compression formats
Introducing Apache Spark
Spark history
What is Apache Spark?
What Apache Spark is not
MapReduce issues
Spark's stack
Why Hadoop plus Spark?
Hadoop features
Spark features
Frequently asked questions about Spark
Installing Hadoop plus Spark clusters
Summary
3. Deep Dive into Apache Spark
Starting Spark daemons
Working with CDH
Working with HDP, MapR, and Spark pre-built packages
Learning Spark core concepts
Ways to work with Spark
Spark Shell
Exploring the Spark Scala shell
Spark applications
Connecting to the Kerberos Security Enabled Spark Cluster
Resilient Distributed Dataset
Method 1 – parallelizing a collection
Method 2 – reading from a file
Reading files from HDFS
Reading files from HDFS with HA enabled
Spark context
Transformations and actions
Parallelism in RDDs
Lazy evaluation
Lineage Graph
Serialization
Leveraging Hadoop file formats in Spark
Data locality
Shared variables
Pair RDDs
Lifecycle of Spark program
Pipelining
Spark execution summary
Spark applications
Spark Shell versus Spark applications
Creating a Spark context
SparkConf
SparkSubmit
Spark Conf precedence order
Important application configurations
Persistence and caching
Storage levels
What level to choose?
Spark resource managers – Standalone, YARN, and Mesos
Local versus cluster mode
Cluster resource managers
Standalone
YARN
Dynamic resource allocation
Client mode versus cluster mode
Mesos
Which resource manager to use?
Summary
4. Big Data Analytics with Spark SQL, DataFrames, and Datasets
History of Spark SQL
Architecture of Spark SQL
Introducing SQL, Datasources, DataFrame, and Dataset APIs
Evolution of DataFrames and Datasets
What's wrong with RDDs?
RDD Transformations versus Dataset and DataFrames Transformations
Why Datasets and DataFrames?
Optimization
Speed
Automatic Schema Discovery
Multiple sources, multiple languages
Interoperability between RDDs and others
Select and read necessary data only
When to use RDDs, Datasets, and DataFrames?
Analytics with DataFrames
Creating SparkSession
Creating DataFrames
Creating DataFrames from structured data files
Creating DataFrames from RDDs
Creating DataFrames from tables in Hive
Creating DataFrames from external databases
Converting DataFrames to RDDs
Common Dataset/DataFrame operations
Input and Output Operations
Basic Dataset/DataFrame functions
DSL functions
Built-in functions, aggregate functions, and window functions
Actions
RDD operations
Caching data
Performance optimizations
Analytics with the Dataset API
Creating Datasets
Converting a DataFrame to a Dataset
Converting a Dataset to a DataFrame
Accessing metadata using Catalog
Data Sources API
Read and write functions
Built-in sources
Working with text files
Working with JSON
Working with Parquet
Working with ORC
Working with JDBC
Working with CSV
External sources
Working with Avro
Working with XML
Working with Pandas
DataFrame based Spark-on-HBase connector
Spark SQL as a distributed SQL engine
Spark SQL's Thrift server for JDBC/ODBC access
Querying data using beeline client
Querying data from Hive using spark-sql CLI
Integration with BI tools
Hive on Spark
Summary
5. Real-Time Analytics with Spark Streaming and Structured Streaming
Introducing real-time processing
Pros and cons of Spark Streaming
History of Spark Streaming
Architecture of Spark Streaming
Spark Streaming application flow
Stateless and stateful stream processing
Spark Streaming transformations and actions
Union
Join
Transform operation
updateStateByKey
mapWithState
Window operations
Output operations
Input sources and output stores
Basic sources
Advanced sources
Custom sources
Receiver reliability
Output stores
Spark Streaming with Kafka and HBase
Receiver-based approach
Role of ZooKeeper
Direct approach (no receivers)
Integration with HBase
Advanced concepts of Spark Streaming
Using DataFrames
MLlib operations
Caching/persistence
Fault-tolerance in Spark Streaming
Failure of executor
Failure of driver
Recovering with checkpointing
Recovering with WAL
Performance tuning of Spark Streaming applications
Monitoring applications
Introducing Structured Streaming
Structured Streaming application flow
When to use Structured Streaming?
Streaming Datasets and Streaming DataFrames
Input sources and output sinks
Operations on Streaming Datasets and Streaming DataFrames
Summary
6. Notebooks and Dataflows with Spark and Hadoop
Introducing web-based notebooks
Introducing Jupyter
Installing Jupyter
Analytics with Jupyter
Introducing Apache Zeppelin
Jupyter versus Zeppelin
Installing Apache Zeppelin
Ambari service
The manual method
Analytics with Zeppelin
The Livy REST job server and Hue Notebooks
Installing and configuring the Livy server and Hue
Using the Livy server
An interactive session
A batch session
Sharing SparkContexts and RDDs
Using Livy with Hue Notebook
Using Livy with Zeppelin
Introducing Apache NiFi for dataflows
Installing Apache NiFi
Dataflows and analytics with NiFi
Summary
7. Machine Learning with Spark and Hadoop
Introducing machine learning
Machine learning on Spark and Hadoop
Machine learning algorithms
Supervised learning
Unsupervised learning
Recommender systems
Feature extraction and transformation
Optimization
Spark MLlib data types
An example of machine learning algorithms
Logistic regression for spam detection
Building machine learning pipelines
An example of a pipeline workflow
Building an ML pipeline
Saving and loading models
Machine learning with H2O and Spark
Why Sparkling Water?
An application flow on YARN
Getting started with Sparkling Water
Introducing Hivemall
Introducing Hivemall for Spark
Summary
8. Building Recommendation Systems with Spark and Mahout
Building recommendation systems
Content-based filtering
Collaborative filtering
User-based collaborative filtering
Item-based collaborative filtering
Limitations of a recommendation system
A recommendation system with MLlib
Preparing the environment
Creating RDDs
Exploring the data with DataFrames
Creating training and testing datasets
Creating a model
Making predictions
Evaluating the model with testing data
Checking the accuracy of the model
Explicit versus implicit feedback
The Mahout and Spark integration
Installing Mahout
Exploring the Mahout shell
Building a universal recommendation system with Mahout and search tool
Summary
9. Graph Analytics with GraphX
Introducing graph processing
What is a graph?
Graph databases versus graph processing systems
Introducing GraphX
Graph algorithms
Getting started with GraphX
Basic operations of GraphX
Creating a graph
Counting
Filtering
inDegrees, outDegrees, and degrees
Triplets
Transforming graphs
Transforming attributes
Modifying graphs
Joining graphs
VertexRDD and EdgeRDD operations
Mapping VertexRDD and EdgeRDD
Filtering VertexRDDs
Joining VertexRDDs
Joining EdgeRDDs
Reversing edge directions
GraphX algorithms
Triangle counting
Connected components
Analyzing flight data using GraphX
Pregel API
Introducing GraphFrames
Motif finding
Loading and saving GraphFrames
Summary
10. Interactive Analytics with SparkR
Introducing R and SparkR
What is R?
Introducing SparkR
Architecture of SparkR
Getting started with SparkR
Installing and configuring R
Using SparkR shell
Local mode
Standalone mode
YARN mode
Creating a local DataFrame
Creating a DataFrame from a DataSources API
Creating a DataFrame from Hive
Using SparkR scripts
Using DataFrames with SparkR
Using SparkR with RStudio
Machine learning with SparkR
Using the Naive Bayes model
Using the k-means model
Using SparkR with Zeppelin
Summary
Index