Big Data Analytics (eBook)


Author: Venkat Ankam

Publisher: Packt Publishing

Publication date: 2016-09-01

Word count: 3.423 million

Category: Imported Books > Foreign-Language Originals > Computers/Networking


Book Description
A handy reference guide for data analysts and data scientists to help obtain value from big data analytics using Spark on Hadoop clusters.

About This Book

This book is based on the latest 2.0 version of Apache Spark and the 2.7 version of Hadoop, integrated with the most commonly used tools. Learn all the Spark stack components, including the latest topics such as DataFrames, Datasets, GraphFrames, Structured Streaming, DataFrame-based ML Pipelines, and SparkR. It covers integrations with frameworks such as HDFS and YARN, and with tools such as Jupyter, Zeppelin, NiFi, Mahout, the HBase Spark connector, GraphFrames, H2O, and Hivemall.

Who This Book Is For

Though this book is primarily aimed at data analysts and data scientists, it will also help architects, programmers, and practitioners. Knowledge of either Spark or Hadoop would be beneficial. It is assumed that you have a basic programming background in Scala, Python, SQL, or R, with basic Linux experience. Working experience within big data environments is not mandatory.

What You Will Learn

  • Find out about and implement the tools and techniques of big data analytics using Spark on Hadoop clusters, with the wide variety of tools used alongside Spark and Hadoop
  • Understand all the Hadoop and Spark ecosystem components
  • Get to know all the Spark components: Spark Core, Spark SQL, DataFrames, Datasets, conventional and Structured Streaming, MLlib, ML Pipelines, and GraphX
  • See batch and real-time data analytics using Spark Core, Spark SQL, and conventional and Structured Streaming
  • Get to grips with data science and machine learning using MLlib, ML Pipelines, H2O, Hivemall, GraphX, and SparkR

In Detail

Big Data Analytics aims to provide the fundamentals of Apache Spark and Hadoop. All the Spark components (Spark Core, Spark SQL, DataFrames, Datasets, conventional streaming, Structured Streaming, MLlib, and GraphX) and the Hadoop core components (HDFS, MapReduce, and YARN) are explored in depth, with implementation examples on Spark + Hadoop clusters.

The industry is moving away from MapReduce to Spark, so the advantages of Spark over MapReduce are explained in depth to reap the benefits of in-memory speeds. The DataFrames API, the Data Sources API, and the new Dataset API are explained for building big data analytical applications. Real-time data analytics using Spark Streaming with Apache Kafka and HBase is covered to help with building streaming applications. The new Structured Streaming concept is explained with an IoT (Internet of Things) use case. Machine learning techniques are covered using MLlib, ML Pipelines, and SparkR, and graph analytics is covered with the GraphX and GraphFrames components of Spark. Readers will also get an opportunity to get started with web-based notebooks such as Jupyter and Apache Zeppelin, and the dataflow tool Apache NiFi, to analyze and visualize data.

Style and Approach

This step-by-step pragmatic guide will make life easy no matter what your level of experience. You will dive deep into Apache Spark on Hadoop clusters through ample exciting real-life examples. This practical tutorial explains data science in simple terms to help programmers and data analysts get started with data science.
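The MapReduce-versus-Spark comparison above centers on the classic map/shuffle/reduce pattern and the disk I/O it incurs between stages. As a language-neutral illustration only (plain Python rather than actual Hadoop or Spark API calls; all function names below are invented for this sketch), the canonical word-count job can be broken into the three phases the book contrasts:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key. In Hadoop MapReduce this step
    # spills to disk between the map and reduce stages, which is the
    # main cost Spark's in-memory execution model avoids.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["spark on hadoop", "spark streaming"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # → {'spark': 2, 'on': 1, 'hadoop': 1, 'streaming': 1}
```

In Spark the same logic is expressed as chained transformations on an RDD or DataFrame, with intermediate results kept in memory across stages rather than written to disk between them.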

Big Data Analytics

Table of Contents

Big Data Analytics

Credits

About the Author

Acknowledgement

About the Reviewers

www.PacktPub.com

eBooks, discount offers, and more

Why subscribe?

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

1. Big Data Analytics at a 10,000-Foot View

Big Data analytics and the role of Hadoop and Spark

A typical Big Data analytics project life cycle

Identifying the problem and outcomes

Identifying the necessary data

Data collection

Preprocessing data and ETL

Performing analytics

Visualizing data

The role of Hadoop and Spark

Big Data science and the role of Hadoop and Spark

A fundamental shift from data analytics to data science

Data scientists versus software engineers

Data scientists versus data analysts

Data scientists versus business analysts

A typical data science project life cycle

Hypothesis and modeling

Measuring the effectiveness

Making improvements

Communicating the results

The role of Hadoop and Spark

Tools and techniques

Real-life use cases

Summary

2. Getting Started with Apache Hadoop and Apache Spark

Introducing Apache Hadoop

Hadoop Distributed File System

Features of HDFS

MapReduce

MapReduce features

MapReduce v1 versus MapReduce v2

MapReduce v1 challenges

YARN

Storage options on Hadoop

File formats

Sequence file

Protocol buffers and thrift

Avro

Parquet

RCFile and ORCFile

Compression formats

Standard compression formats

Introducing Apache Spark

Spark history

What is Apache Spark?

What Apache Spark is not

MapReduce issues

Spark's stack

Why Hadoop plus Spark?

Hadoop features

Spark features

Frequently asked questions about Spark

Installing Hadoop plus Spark clusters

Summary

3. Deep Dive into Apache Spark

Starting Spark daemons

Working with CDH

Working with HDP, MapR, and Spark pre-built packages

Learning Spark core concepts

Ways to work with Spark

Spark Shell

Exploring the Spark Scala shell

Spark applications

Connecting to the Kerberos Security Enabled Spark Cluster

Resilient Distributed Dataset

Method 1 – parallelizing a collection

Method 2 – reading from a file

Reading files from HDFS

Reading files from HDFS with HA enabled

Spark context

Transformations and actions

Parallelism in RDDs

Lazy evaluation

Lineage Graph

Serialization

Leveraging Hadoop file formats in Spark

Data locality

Shared variables

Pair RDDs

Lifecycle of Spark program

Pipelining

Spark execution summary

Spark applications

Spark Shell versus Spark applications

Creating a Spark context

SparkConf

SparkSubmit

Spark Conf precedence order

Important application configurations

Persistence and caching

Storage levels

What level to choose?

Spark resource managers – Standalone, YARN, and Mesos

Local versus cluster mode

Cluster resource managers

Standalone

YARN

Dynamic resource allocation

Client mode versus cluster mode

Mesos

Which resource manager to use?

Summary

4. Big Data Analytics with Spark SQL, DataFrames, and Datasets

History of Spark SQL

Architecture of Spark SQL

Introducing SQL, Datasources, DataFrame, and Dataset APIs

Evolution of DataFrames and Datasets

What's wrong with RDDs?

RDD Transformations versus Dataset and DataFrames Transformations

Why Datasets and DataFrames?

Optimization

Speed

Automatic Schema Discovery

Multiple sources, multiple languages

Interoperability between RDDs and others

Select and read necessary data only

When to use RDDs, Datasets, and DataFrames?

Analytics with DataFrames

Creating SparkSession

Creating DataFrames

Creating DataFrames from structured data files

Creating DataFrames from RDDs

Creating DataFrames from tables in Hive

Creating DataFrames from external databases

Converting DataFrames to RDDs

Common Dataset/DataFrame operations

Input and Output Operations

Basic Dataset/DataFrame functions

DSL functions

Built-in functions, aggregate functions, and window functions

Actions

RDD operations

Caching data

Performance optimizations

Analytics with the Dataset API

Creating Datasets

Converting a DataFrame to a Dataset

Converting a Dataset to a DataFrame

Accessing metadata using Catalog

Data Sources API

Read and write functions

Built-in sources

Working with text files

Working with JSON

Working with Parquet

Working with ORC

Working with JDBC

Working with CSV

External sources

Working with AVRO

Working with XML

Working with Pandas

DataFrame based Spark-on-HBase connector

Spark SQL as a distributed SQL engine

Spark SQL's Thrift server for JDBC/ODBC access

Querying data using beeline client

Querying data from Hive using spark-sql CLI

Integration with BI tools

Hive on Spark

Summary

5. Real-Time Analytics with Spark Streaming and Structured Streaming

Introducing real-time processing

Pros and cons of Spark Streaming

History of Spark Streaming

Architecture of Spark Streaming

Spark Streaming application flow

Stateless and stateful stream processing

Spark Streaming transformations and actions

Union

Join

Transform operation

updateStateByKey

mapWithState

Window operations

Output operations

Input sources and output stores

Basic sources

Advanced sources

Custom sources

Receiver reliability

Output stores

Spark Streaming with Kafka and HBase

Receiver-based approach

Role of Zookeeper

Direct approach (no receivers)

Integration with HBase

Advanced concepts of Spark Streaming

Using DataFrames

MLlib operations

Caching/persistence

Fault-tolerance in Spark Streaming

Failure of executor

Failure of driver

Recovering with checkpointing

Recovering with WAL

Performance tuning of Spark Streaming applications

Monitoring applications

Introducing Structured Streaming

Structured Streaming application flow

When to use Structured Streaming?

Streaming Datasets and Streaming DataFrames

Input sources and output sinks

Operations on Streaming Datasets and Streaming DataFrames

Summary

6. Notebooks and Dataflows with Spark and Hadoop

Introducing web-based notebooks

Introducing Jupyter

Installing Jupyter

Analytics with Jupyter

Introducing Apache Zeppelin

Jupyter versus Zeppelin

Installing Apache Zeppelin

Ambari service

The manual method

Analytics with Zeppelin

The Livy REST job server and Hue Notebooks

Installing and configuring the Livy server and Hue

Using the Livy server

An interactive session

A batch session

Sharing SparkContexts and RDDs

Using Livy with Hue Notebook

Using Livy with Zeppelin

Introducing Apache NiFi for dataflows

Installing Apache NiFi

Dataflows and analytics with NiFi

Summary

7. Machine Learning with Spark and Hadoop

Introducing machine learning

Machine learning on Spark and Hadoop

Machine learning algorithms

Supervised learning

Unsupervised learning

Recommender systems

Feature extraction and transformation

Optimization

Spark MLlib data types

An example of machine learning algorithms

Logistic regression for spam detection

Building machine learning pipelines

An example of a pipeline workflow

Building an ML pipeline

Saving and loading models

Machine learning with H2O and Spark

Why Sparkling Water?

An application flow on YARN

Getting started with Sparkling Water

Introducing Hivemall

Introducing Hivemall for Spark

Summary

8. Building Recommendation Systems with Spark and Mahout

Building recommendation systems

Content-based filtering

Collaborative filtering

User-based collaborative filtering

Item-based collaborative filtering

Limitations of a recommendation system

A recommendation system with MLlib

Preparing the environment

Creating RDDs

Exploring the data with DataFrames

Creating training and testing datasets

Creating a model

Making predictions

Evaluating the model with testing data

Checking the accuracy of the model

Explicit versus implicit feedback

The Mahout and Spark integration

Installing Mahout

Exploring the Mahout shell

Building a universal recommendation system with Mahout and search tool

Summary

9. Graph Analytics with GraphX

Introducing graph processing

What is a graph?

Graph databases versus graph processing systems

Introducing GraphX

Graph algorithms

Getting started with GraphX

Basic operations of GraphX

Creating a graph

Counting

Filtering

inDegrees, outDegrees, and degrees

Triplets

Transforming graphs

Transforming attributes

Modifying graphs

Joining graphs

VertexRDD and EdgeRDD operations

Mapping VertexRDD and EdgeRDD

Filtering VertexRDDs

Joining VertexRDDs

Joining EdgeRDDs

Reversing edge directions

GraphX algorithms

Triangle counting

Connected components

Analyzing flight data using GraphX

Pregel API

Introducing GraphFrames

Motif finding

Loading and saving GraphFrames

Summary

10. Interactive Analytics with SparkR

Introducing R and SparkR

What is R?

Introducing SparkR

Architecture of SparkR

Getting started with SparkR

Installing and configuring R

Using SparkR shell

Local mode

Standalone mode

Yarn mode

Creating a local DataFrame

Creating a DataFrame from a DataSources API

Creating a DataFrame from Hive

Using SparkR scripts

Using DataFrames with SparkR

Using SparkR with RStudio

Machine learning with SparkR

Using the Naive Bayes model

Using the k-means model

Using SparkR with Zeppelin

Summary

Index
