Big Data Analytics with Hadoop 3 (e-book)

Author: Sridhar Alla

Publisher: Packt Publishing

Publication date: 2018-05-31

Word count: 416,000

Explore big data concepts, platforms, analytics, and their applications using the power of Hadoop 3.

About This Book

  • Learn Hadoop 3 to build effective big data analytics solutions on-premises and in the cloud
  • Integrate Hadoop with other big data tools such as R, Python, Apache Spark, and Apache Flink
  • Exploit big data using Hadoop 3 with real-world examples

Who This Book Is For

Big Data Analytics with Hadoop 3 is for you if you are looking to build high-performance analytics solutions for your enterprise or business using Hadoop 3’s powerful features, or if you’re new to big data analytics. A basic understanding of the Java programming language is required.

What You Will Learn

  • Explore the new features of Hadoop 3 along with HDFS, YARN, and MapReduce
  • Get well-versed with the analytical capabilities of the Hadoop ecosystem using practical examples
  • Integrate Hadoop with R and Python for more efficient big data processing
  • Learn to use Hadoop with Apache Spark and Apache Flink for real-time data analytics
  • Set up a Hadoop cluster on the AWS cloud
  • Perform big data analytics on AWS using Elastic MapReduce

In Detail

Apache Hadoop is the most popular platform for big data processing, and can be combined with a host of other big data tools to build powerful analytics solutions. Big Data Analytics with Hadoop 3 shows you how to do just that, by providing insights into the software as well as its benefits with the help of practical examples.

Once you have taken a tour of Hadoop 3’s latest features, you will get an overview of HDFS, MapReduce, and YARN, and how they enable faster, more efficient big data processing. You will then move on to learning how to integrate Hadoop with open source tools, such as Python and R, to analyze and visualize data and perform statistical computing on big data. As you get acquainted with all this, you will explore how to use Hadoop 3 with Apache Spark and Apache Flink for real-time data analytics and stream processing.

In addition to this, you will understand how to use Hadoop to build analytics solutions on the cloud and an end-to-end pipeline to perform big data analysis using practical use cases. By the end of this book, you will be well-versed with the analytical capabilities of the Hadoop ecosystem. You will be able to build powerful solutions to perform big data analytics and gain insights effortlessly.

Style and approach

Filled with practical examples and use cases, this book will not only help you get up and running with Hadoop, but will also take you further down the road in dealing with big data analytics.
Table of contents

Title Page

Copyright and Credits

Big Data Analytics with Hadoop 3

Packt Upsell

Why subscribe?

PacktPub.com

Contributors

About the author

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Introduction to Hadoop

Hadoop Distributed File System

High availability

Intra-DataNode balancer

Erasure coding

Port numbers

MapReduce framework

Task-level native optimization

YARN

Opportunistic containers

Types of container execution

YARN timeline service v.2

Enhancing scalability and reliability

Usability improvements

Architecture

Other changes

Minimum required Java version

Shell script rewrite

Shaded-client JARs

Installing Hadoop 3

Prerequisites

Downloading

Installation

Set up password-less SSH

Setting up the NameNode

Starting HDFS

Setting up the YARN service

Erasure Coding

Intra-DataNode balancer

Installing YARN timeline service v.2

Setting up the HBase cluster

Simple deployment for HBase

Enabling the co-processor

Enabling timeline service v.2

Running timeline service v.2

Enabling MapReduce to write to timeline service v.2

Summary

Overview of Big Data Analytics

Introduction to data analytics

Inside the data analytics process

Introduction to big data

Variety of data

Velocity of data

Volume of data

Veracity of data

Variability of data

Visualization

Value

Distributed computing using Apache Hadoop

The MapReduce framework

Hive

Downloading and extracting the Hive binaries

Installing Derby

Using Hive

Creating a database

Creating a table

SELECT statement syntax

WHERE clauses

INSERT statement syntax

Primitive types

Complex types

Built-in operators and functions

Built-in operators

Built-in functions

Language capabilities

A cheat sheet on retrieving information

Apache Spark

Visualization using Tableau

Summary

Big Data Processing with MapReduce

The MapReduce framework

Dataset

Record reader

Map

Combiner

Partitioner

Shuffle and sort

Reduce

Output format

MapReduce job types

Single mapper job

Single mapper reducer job

Multiple mappers reducer job

SingleMapperCombinerReducer job

Scenario

MapReduce patterns

Aggregation patterns

Average temperature by city

Record count

Min/max/count

Average/median/standard deviation

Filtering patterns

Join patterns

Inner join

Left anti join

Left outer join

Right outer join

Full outer join

Left semi join

Cross join

Summary

Scientific Computing and Big Data Analysis with Python and Hadoop

Installation

Installing standard Python

Installing Anaconda

Using Conda

Data analysis

Summary

Statistical Big Data Computing with R and Hadoop

Introduction

Install R on workstations and connect to the data in Hadoop

Install R on a shared server and connect to Hadoop

Utilize Revolution R Open

Execute R inside of MapReduce using RMR2

Summary and outlook for pure open source options

Methods of integrating R and Hadoop

RHADOOP – install R on workstations and connect to data in Hadoop

RHIPE – execute R inside Hadoop MapReduce

R and Hadoop Streaming

RHIVE – install R on workstations and connect to data in Hadoop

ORCH – Oracle connector for Hadoop

Data analytics

Summary

Batch Analytics with Apache Spark

SparkSQL and DataFrames

DataFrame APIs and the SQL API

Pivots

Filters

User-defined functions

Schema – structure of data

Implicit schema

Explicit schema

Encoders

Loading datasets

Saving datasets

Aggregations

Aggregate functions

count

first

last

approx_count_distinct

min

max

avg

sum

kurtosis

skewness

Variance

Standard deviation

Covariance

groupBy

Rollup

Cube

Window functions

ntiles

Joins

Inner workings of join

Shuffle join

Broadcast join

Join types

Inner join

Left outer join

Right outer join

Outer join

Left anti join

Left semi join

Cross join

Performance implications of join

Summary

Real-Time Analytics with Apache Spark

Streaming

At-least-once processing

At-most-once processing

Exactly-once processing

Spark Streaming

StreamingContext

Creating StreamingContext

Starting StreamingContext

Stopping StreamingContext

Input streams

receiverStream

socketTextStream

rawSocketStream

fileStream

textFileStream

binaryRecordsStream

queueStream

textFileStream example

twitterStream example

Discretized Streams

Transformations

Window operations

Stateful/stateless transformations

Stateless transformations

Stateful transformations

Checkpointing

Metadata checkpointing

Data checkpointing

Driver failure recovery

Interoperability with streaming platforms (Apache Kafka)

Receiver-based

Direct Stream

Structured Streaming

Getting deeper into Structured Streaming

Handling event time and late data

Fault-tolerance semantics

Summary

Batch Analytics with Apache Flink

Introduction to Apache Flink

Continuous processing for unbounded datasets

Flink, the streaming model, and bounded datasets

Installing Flink

Downloading Flink

Installing Flink

Starting a local Flink cluster

Using the Flink cluster UI

Batch analytics

Reading file

File-based

Collection-based

Generic

Transformations

GroupBy

Aggregation

Joins

Inner join

Left outer join

Right outer join

Full outer join

Writing to a file

Summary

Stream Processing with Apache Flink

Introduction to streaming execution model

Data processing using the DataStream API

Execution environment

Data sources

Socket-based

File-based

Transformations

map

flatMap

filter

keyBy

reduce

fold

Aggregations

window

Global windows

Tumbling windows

Sliding windows

Session windows

windowAll

union

Window join

split

Select

Project

Physical partitioning

Custom partitioning

Random partitioning

Rebalancing partitioning

Rescaling

Broadcasting

Event time and watermarks

Connectors

Kafka connector

Twitter connector

RabbitMQ connector

Elasticsearch connector

Cassandra connector

Summary

Visualizing Big Data

Introduction

Tableau

Chart types

Line charts

Pie chart

Bar chart

Heat map

Using Python to visualize data

Using R to visualize data

Big data visualization tools

Summary

Introduction to Cloud Computing

Concepts and terminology

Cloud

IT resource

On-premise

Cloud consumers and Cloud providers

Scaling

Types of scaling

Horizontal scaling

Vertical scaling

Cloud service

Cloud service consumer

Goals and benefits

Increased scalability

Increased availability and reliability

Risks and challenges

Increased security vulnerabilities

Reduced operational governance control

Limited portability between Cloud providers

Roles and boundaries

Cloud provider

Cloud consumer

Cloud service owner

Cloud resource administrator

Additional roles

Organizational boundary

Trust boundary

Cloud characteristics

On-demand usage

Ubiquitous access

Multi-tenancy (and resource pooling)

Elasticity

Measured usage

Resiliency

Cloud delivery models

Infrastructure as a Service

Platform as a Service

Software as a Service

Combining Cloud delivery models

IaaS + PaaS

IaaS + PaaS + SaaS

Cloud deployment models

Public Clouds

Community Clouds

Private Clouds

Hybrid Clouds

Summary

Using Amazon Web Services

Amazon Elastic Compute Cloud

Elastic web-scale computing

Complete control of operations

Flexible Cloud hosting services

Integration

High reliability

Security

Inexpensive

Easy to start

Instances and Amazon Machine Images

Launching multiple instances of an AMI

Instances

AMIs

Regions and availability zones

Region and availability zone concepts

Regions

Availability zones

Available regions

Regions and endpoints

Instance types

Tag basics

Amazon EC2 key pairs

Amazon EC2 security groups for Linux instances

Elastic IP addresses

Amazon EC2 and Amazon Virtual Private Cloud

Amazon Elastic Block Store

Amazon EC2 instance store

What is AWS Lambda?

When should I use AWS Lambda?

Introduction to Amazon S3

Getting started with Amazon S3

Comprehensive security and compliance capabilities

Query in place

Flexible management

Most supported platform with the largest ecosystem

Easy and flexible data transfer

Backup and recovery

Data archiving

Data lakes and big data analytics

Hybrid Cloud storage

Cloud-native application data

Disaster recovery

Amazon DynamoDB

Amazon Kinesis Data Streams

What can I do with Kinesis Data Streams?

Accelerated log and data feed intake and processing

Real-time metrics and reporting

Real-time data analytics

Complex stream processing

Benefits of using Kinesis Data Streams

AWS Glue

When should I use AWS Glue?

Amazon EMR

Practical AWS EMR cluster

Summary
