Spark for Data Science
Credits
Foreword
About the Authors
About the Reviewers
www.PacktPub.com
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Big Data and Data Science – An Introduction
Big data overview
Challenges with big data analytics
Computational challenges
Analytical challenges
Evolution of big data analytics
Spark for data analytics
The Spark stack
Spark core
Spark SQL
Spark streaming
MLlib
GraphX
SparkR
Summary
References
2. The Spark Programming Model
The programming paradigm
Supported programming languages
Scala
Java
Python
R
Choosing the right language
The Spark engine
Driver program
The Spark shell
SparkContext
Worker nodes
Executors
Shared variables
Flow of execution
The RDD API
RDD basics
Persistence
RDD operations
Creating RDDs
Transformations on normal RDDs
The filter operation
The distinct operation
The intersection operation
The union operation
The map operation
The flatMap operation
The keys operation
The cartesian operation
Transformations on pair RDDs
The groupByKey operation
The join operation
The reduceByKey operation
The aggregate operation
Actions
The collect() function
The count() function
The take(n) function
The first() function
The takeSample() function
The countByKey() function
Summary
References
3. Introduction to DataFrames
Why DataFrames?
Spark SQL
The Catalyst optimizer
The DataFrame API
DataFrame basics
RDDs versus DataFrames
Similarities
Differences
Creating DataFrames
Creating DataFrames from RDDs
Creating DataFrames from JSON
Creating DataFrames from databases using JDBC
Creating DataFrames from Apache Parquet
Creating DataFrames from other data sources
DataFrame operations
Under the hood
Summary
References
4. Unified Data Access
Data abstractions in Apache Spark
Datasets
Working with Datasets
Creating Datasets from JSON
Datasets API's limitations
Spark SQL
SQL operations
Under the hood
Structured Streaming
The Spark streaming programming model
Under the hood
Comparison with other streaming engines
Continuous applications
Summary
References
5. Data Analysis on Spark
Data analytics life cycle
Data acquisition
Data preparation
Data consolidation
Data cleansing
Missing value treatment
Outlier treatment
Duplicate values treatment
Data transformation
Basics of statistics
Sampling
Simple random sample
Systematic sampling
Stratified sampling
Data distributions
Frequency distributions
Probability distributions
Descriptive statistics
Measures of location
Mean
Median
Mode
Measures of spread
Range
Variance
Standard deviation
Summary statistics
Graphical techniques
Inferential statistics
Discrete probability distributions
Bernoulli distribution
Binomial distribution
Sample problem
Poisson distribution
Sample problem
Continuous probability distributions
Normal distribution
Standard normal distribution
Chi-square distribution
Sample problem
Student's t-distribution
F-distribution
Standard error
Confidence level
Margin of error and confidence interval
Variability in the population
Estimating sample size
Hypothesis testing
Null and alternate hypotheses
Chi-square test
F-test
Problem
Correlations
Summary
References
6. Machine Learning
Introduction
The evolution
Supervised learning
Unsupervised learning
MLlib and the Pipeline API
MLlib
ML pipeline
Transformer
Estimator
Introduction to machine learning
Parametric methods
Non-parametric methods
Regression methods
Linear regression
Loss function
Optimization
Regularizations on regression
Ridge regression
Lasso regression
Elastic net regression
Classification methods
Logistic regression
Linear Support Vector Machines (SVM)
Linear kernel
Polynomial kernel
Radial Basis Function kernel
Sigmoid kernel
Training an SVM
Decision trees
Impurity measures
Gini Index
Entropy
Variance
Stopping rule
Split candidates
Categorical features
Continuous features
Advantages of decision trees
Disadvantages of decision trees
Example
Ensembles
Random forests
Advantages of random forests
Gradient-Boosted Trees
Multilayer perceptron classifier
Clustering techniques
K-means clustering
Disadvantages of k-means
Example
Summary
References
7. Extending Spark with SparkR
SparkR basics
Accessing SparkR from the R environment
RDDs and DataFrames
Getting started
Advantages and limitations
Programming with SparkR
Function name masking
Subsetting data
Column functions
Grouped data
SparkR DataFrames
SQL operations
Set operations
Merging DataFrames
Machine learning
The Naive Bayes model
The Gaussian GLM model
Summary
References
8. Analyzing Unstructured Data
Sources of unstructured data
Processing unstructured data
Count vectorizer
TF-IDF
Stop-word removal
Normalization/scaling
Word2Vec
n-gram modelling
Text classification
Naive Bayes classifier
Text clustering
K-means
Dimensionality reduction
Singular Value Decomposition
Principal Component Analysis
Summary
References
9. Visualizing Big Data
Why visualize data?
A data engineer's perspective
A data scientist's perspective
A business user's perspective
Data visualization tools
IPython notebook
Apache Zeppelin
Third-party tools
Data visualization techniques
Summarizing and visualizing
Subsetting and visualizing
Sampling and visualizing
Modeling and visualizing
Summary
References
Data source citations
10. Putting It All Together
A quick recap
Introducing a case study
The business problem
Data acquisition and data cleansing
Developing the hypothesis
Data exploration
Data preparation
Too many levels in a categorical variable
Numerical variables with too much variation
Missing data
Continuous data
Categorical data
Preparing the data
Model building
Data visualization
Communicating the results to business users
Summary
References
11. Building Data Science Applications
Scope of development
Expectations
Presentation options
Interactive notebooks
References
Web API
References
PMML and PFA
References
Development and testing
References
Data quality management
The Scala advantage
Spark development status
Spark 2.0's features and enhancements
Unifying Datasets and DataFrames
Structured Streaming
Project Tungsten phase 2
What's in store?
The big data trends
Summary
References