Spark for Data Science (eBook)

Authors: Srinivas Duvvuri, Bikramaditya Singhal

Publisher: Packt Publishing

Publication date: 2016-09-01

Word count: 3.07 million

Category: Imported Books > Foreign-Language Originals > Computers/Internet

Book Description

Analyze your data and delve deep into the world of machine learning with the latest Spark version, 2.0.

About This Book

  • Perform data analysis and build predictive models on huge datasets that leverage Apache Spark
  • Learn to integrate data science algorithms and techniques with the fast and scalable computing features of Spark to address big data challenges
  • Work through practical examples of real-world problems with sample code snippets

Who This Book Is For

This book is for anyone who wants to leverage Apache Spark for data science and machine learning. If you are a technologist who wants to expand your knowledge to perform data science operations in Spark, a data scientist who wants to understand how algorithms are implemented in Spark, or a newcomer with minimal development experience who wants to learn about Big Data analytics, this book is for you!

What You Will Learn

  • Consolidate, clean, and transform data acquired from various data sources
  • Perform statistical analysis of data to find hidden insights
  • Explore graphical techniques to see what your data looks like
  • Use machine learning techniques to build predictive models
  • Build scalable data products and solutions
  • Start programming using the RDD, DataFrame, and Dataset APIs (see the brief sketch after this description)
  • Become an expert by improving your data analytical skills

In Detail

This is the era of Big Data. The words "Big Data" imply big innovation and enable a competitive advantage for businesses. Apache Spark was designed to perform Big Data analytics at scale, so it is equipped with the necessary algorithms and supports multiple programming languages. Whether you are a technologist, a data scientist, or a beginner in Big Data analytics, this book will provide you with the skills necessary to perform statistical data analysis and data visualization, build predictive models, and create scalable data products or solutions using Python, Scala, and R. With ample case studies and real-world examples, Spark for Data Science will help you ensure the successful execution of your data science projects.

Style and Approach

This book takes a step-by-step approach to statistical analysis and machine learning, explained in a conversational and easy-to-follow style. Each topic is explained sequentially, with a focus on the fundamentals as well as the advanced concepts of algorithms and techniques. Real-world examples with sample code snippets are also included.
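
The description above refers to the RDD, DataFrame, and Dataset APIs introduced across the early chapters. The following is a minimal, illustrative Scala sketch (not an excerpt from the book) of what getting started with the RDD and DataFrame APIs on Spark 2.0 typically looks like; the object name and sample data are purely hypothetical.

```scala
// Minimal sketch, assuming a Spark 2.x dependency on the classpath.
import org.apache.spark.sql.SparkSession

object QuickStartSketch {
  def main(args: Array[String]): Unit = {
    // SparkSession is the unified entry point introduced in Spark 2.0.
    val spark = SparkSession.builder()
      .appName("SparkForDataScienceSketch")
      .master("local[*]")              // run locally for illustration
      .getOrCreate()

    // RDD API: parallelize a small collection, apply a transformation and an action.
    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))
    val sumOfSquares = rdd.map(x => x * x).reduce(_ + _)
    println(s"Sum of squares: $sumOfSquares")

    // DataFrame API: build a DataFrame from a local sequence and run a simple aggregation.
    import spark.implicits._
    val df = Seq(("alice", 34), ("bob", 28), ("carol", 45)).toDF("name", "age")
    df.groupBy().avg("age").show()

    spark.stop()
  }
}
```
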
Table of Contents

Spark for Data Science

Credits

Foreword

About the Authors

About the Reviewers

www.PacktPub.com

Why subscribe?

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

1. Big Data and Data Science – An Introduction

Big data overview

Challenges with big data analytics

Computational challenges

Analytical challenges

Evolution of big data analytics

Spark for data analytics

The Spark stack

Spark core

Spark SQL

Spark streaming

MLlib

GraphX

SparkR

Summary

References

2. The Spark Programming Model

The programming paradigm

Supported programming languages

Scala

Java

Python

R

Choosing the right language

The Spark engine

Driver program

The Spark shell

SparkContext

Worker nodes

Executors

Shared variables

Flow of execution

The RDD API

RDD basics

Persistence

RDD operations

Creating RDDs

Transformations on normal RDDs

The filter operation

The distinct operation

The intersection operation

The union operation

The map operation

The flatMap operation

The keys operation

The cartesian operation

Transformations on pair RDDs

The groupByKey operation

The join operation

The reduceByKey operation

The aggregate operation

Actions

The collect() function

The count() function

The take(n) function

The first() function

The takeSample() function

The countByKey() function

Summary

References

3. Introduction to DataFrames

Why DataFrames?

Spark SQL

The Catalyst optimizer

The DataFrame API

DataFrame basics

RDDs versus DataFrames

Similarities

Differences

Creating DataFrames

Creating DataFrames from RDDs

Creating DataFrames from JSON

Creating DataFrames from databases using JDBC

Creating DataFrames from Apache Parquet

Creating DataFrames from other data sources

DataFrame operations

Under the hood

Summary

References

4. Unified Data Access

Data abstractions in Apache Spark

Datasets

Working with Datasets

Creating Datasets from JSON

Datasets API's limitations

Spark SQL

SQL operations

Under the hood

Structured Streaming

The Spark streaming programming model

Under the hood

Comparison with other streaming engines

Continuous applications

Summary

References

5. Data Analysis on Spark

Data analytics life cycle

Data acquisition

Data preparation

Data consolidation

Data cleansing

Missing value treatment

Outlier treatment

Duplicate values treatment

Data transformation

Basics of statistics

Sampling

Simple random sample

Systematic sampling

Stratified sampling

Data distributions

Frequency distributions

Probability distributions

Descriptive statistics

Measures of location

Mean

Median

Mode

Measures of spread

Range

Variance

Standard deviation

Summary statistics

Graphical techniques

Inferential statistics

Discrete probability distributions

Bernoulli distribution

Binomial distribution

Sample problem

Poisson distribution

Sample problem

Continuous probability distributions

Normal distribution

Standard normal distribution

Chi-square distribution

Sample problem

Student's t-distribution

F-distribution

Standard error

Confidence level

Margin of error and confidence interval

Variability in the population

Estimating sample size

Hypothesis testing

Null and alternate hypotheses

Chi-square test

F-test

Problem:

Correlations

Summary

References

6. Machine Learning

Introduction

The evolution

Supervised learning

Unsupervised learning

MLlib and the Pipeline API

MLlib

ML pipeline

Transformer

Estimator

Introduction to machine learning

Parametric methods

Non-parametric methods

Regression methods

Linear regression

Loss function

Optimization

Regularizations on regression

Ridge regression

Lasso regression

Elastic net regression

Classification methods

Logistic regression

Linear Support Vector Machines (SVM)

Linear kernel

Polynomial kernel

Radial Basis Function kernel

Sigmoid kernel

Training an SVM

Decision trees

Impurity measures

Gini Index

Entropy

Variance

Stopping rule

Split candidates

Categorical features

Continuous features

Advantages of decision trees

Disadvantages of decision trees

Example

Ensembles

Random forests

Advantages of random forests

Gradient-Boosted Trees

Multilayer perceptron classifier

Clustering techniques

K-means clustering

Disadvantages of k-means

Example

Summary

References

7. Extending Spark with SparkR

SparkR basics

Accessing SparkR from the R environment

RDDs and DataFrames

Getting started

Advantages and limitations

Programming with SparkR

Function name masking

Subsetting data

Column functions

Grouped data

SparkR DataFrames

SQL operations

Set operations

Merging DataFrames

Machine learning

The Naive Bayes model

The Gaussian GLM model

Summary

References

8. Analyzing Unstructured Data

Sources of unstructured data

Processing unstructured data

Count vectorizer

TF-IDF

Stop-word removal

Normalization/scaling

Word2Vec

n-gram modelling

Text classification

Naive Bayes classifier

Text clustering

K-means

Dimensionality reduction

Singular Value Decomposition

Principal Component Analysis

Summary

References

9. Visualizing Big Data

Why visualize data?

A data engineer's perspective

A data scientist's perspective

A business user's perspective

Data visualization tools

IPython notebook

Apache Zeppelin

Third-party tools

Data visualization techniques

Summarizing and visualizing

Subsetting and visualizing

Sampling and visualizing

Modeling and visualizing

Summary

References

Data source citations

10. Putting It All Together

A quick recap

Introducing a case study

The business problem

Data acquisition and data cleansing

Developing the hypothesis

Data exploration

Data preparation

Too many levels in a categorical variable

Numerical variables with too much variation

Missing data

Continuous data

Categorical data

Preparing the data

Model building

Data visualization

Communicating the results to business users

Summary

References

11. Building Data Science Applications

Scope of development

Expectations

Presentation options

Interactive notebooks

References

Web API

References

PMML and PFA

References

Development and testing

References

Data quality management

The Scala advantage

Spark development status

Spark 2.0's features and enhancements

Unifying Datasets and DataFrames

Structured Streaming

Project Tungsten phase 2

What's in store?

The big data trends

Summary

References
