Mastering Hadoop 3 (e-book)

Author: Chanchal Singh

Publisher: Packt Publishing

Publication date: 2019-02-28

Word count: 722,000

Category: Imported Books > Foreign-Language Originals > Computers/Internet

A comprehensive guide to mastering the most advanced Hadoop 3 concepts

Key Features

* Get to grips with the newly introduced features and capabilities of Hadoop 3
* Crunch and process data using MapReduce, YARN, and a host of tools within the Hadoop ecosystem
* Sharpen your Hadoop skills with real-world case studies and code

Book Description

Apache Hadoop is one of the most popular big data solutions for distributed storage and for processing large chunks of data. With Hadoop 3, Apache promises to provide a high-performance, more fault-tolerant, and highly efficient big data processing platform, with a focus on improved scalability and increased efficiency.

With this guide, you’ll understand advanced concepts of the Hadoop ecosystem tools. You’ll learn how Hadoop works internally, study advanced concepts of different ecosystem tools, discover solutions to real-world use cases, and understand how to secure your cluster. It will then walk you through HDFS, YARN, MapReduce, and Hadoop 3 concepts. You’ll be able to address common challenges such as using Kafka efficiently, designing low-latency Kafka systems with reliable message delivery, and handling high data volumes. As you advance, you’ll discover how to address major challenges when building an enterprise-grade messaging system, and how to use different stream processing systems along with Kafka to fulfil your enterprise goals.

By the end of this book, you’ll have a complete understanding of how components in the Hadoop ecosystem are effectively integrated to implement a fast and reliable data pipeline, and you’ll be equipped to tackle a range of real-world problems in data pipelines.

What you will learn

* Gain an in-depth understanding of distributed computing using Hadoop 3
* Develop enterprise-grade applications using Apache Spark, Flink, and more
* Build scalable and high-performance Hadoop data pipelines with security, monitoring, and data governance
* Explore batch data processing patterns and how to model data in Hadoop
* Master best practices for enterprises using, or planning to use, Hadoop 3 as a data platform
* Understand security aspects of Hadoop, including authorization and authentication

Who this book is for

If you want to become a big data professional by mastering the advanced concepts of Hadoop, this book is for you. You’ll also find this book useful if you’re a Hadoop professional looking to strengthen your knowledge of the Hadoop ecosystem. Fundamental knowledge of the Java programming language and the basics of Hadoop is necessary to get started with this book.

Table of Contents

Title Page

Copyright and Credits

Mastering Hadoop 3

Dedication

About Packt

Why subscribe?

Packt.com

Foreword

Contributors

About the authors

About the reviewer

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Code in action

Conventions used

Get in touch

Reviews

Section 1: Introduction to Hadoop 3

Journey to Hadoop 3

Hadoop origins and Timelines

Origins

MapReduce origin

Timelines

Overview of Hadoop 3 and its features

Hadoop logical view

Hadoop distributions

On-premise distribution

Cloud distributions

Points to remember

Summary

Deep Dive into the Hadoop Distributed File System

Technical requirements

Defining HDFS

Deep dive into the HDFS architecture

HDFS logical architecture

Concepts of the data group

Blocks

Replication

HDFS communication architecture

NameNode internals

Data locality and rack awareness

DataNode internals

Quorum Journal Manager (QJM)

HDFS high availability in Hadoop 3.x

Data management

Metadata management

Checkpoint using a secondary NameNode

Data integrity

HDFS Snapshots

Data rebalancing

Best practices for using balancer

HDFS reads and writes

Write workflows

Read workflows

Short circuit reads

Managing disk-skewed data in Hadoop 3.x

Lazy persist writes in HDFS

Erasure encoding in Hadoop 3.x

Advantages of erasure coding

Disadvantages of erasure coding

HDFS common interfaces

HDFS read

HDFS write

HDFSFileSystemWrite.java

HDFS delete

HDFS command reference

File System commands

Distributed copy

Admin commands

Points to remember

Summary

YARN Resource Management in Hadoop

Architecture

Resource Manager component

Node manager core

Introduction to YARN job scheduling

FIFO scheduler

Capacity scheduler

Configuring capacity scheduler

Fair scheduler

Scheduling queues

Configuring fair scheduler

Resource Manager high availability

Architecture of RM high availability

Configuring Resource Manager high availability

Node labels

Configuring node labels

YARN Timeline server in Hadoop 3.x

Configuring YARN Timeline server

Opportunistic containers in Hadoop 3.x

Configuring opportunistic containers

Docker containers in YARN

Configuring Docker containers

Running the Docker image

Running the container

YARN REST APIs

Resource Manager API

Node Manager REST API

YARN command reference

User command

Application commands

Logs command

Administration commands

Summary

Internals of MapReduce

Technical requirements

Deep dive into the Hadoop MapReduce framework

YARN and MapReduce

MapReduce workflow in the Hadoop framework

Common MapReduce patterns

Summarization patterns

Word count example

Mapper

Reducer

Combiner

Minimum and maximum

Filtering patterns

Top-k MapReduce implementation

Join pattern

Reduce side join

Map side join (replicated join)

Composite join

Sorting and partitioning

MapReduce use case

MovieRatingMapper

MovieRatingReducer

MovieRatingDriver

Optimizing MapReduce

Hardware configuration

Operating system tuning

Optimization techniques

Runtime configuration

File System optimization

Summary

Section 2: Hadoop Ecosystem

SQL on Hadoop

Technical requirements

Presto – introduction

Presto architecture

Presto installation and basic query execution

Functions

Conversion functions

Mathematical functions

String functions

Presto connectors

Hive connector

Kafka connector

Configuration properties

MySQL connector

Redshift connector

MongoDB connector

Hive

Apache Hive architecture

Installing and running Hive

Hive queries

Hive table creation

Loading data to a table

The select query

Choosing file format

Splittable and non-splittable file formats

Query performance

Disk usage and compression

Schema change

Introduction to HCatalog

Introduction to HiveServer2

Hive UDF

Understanding ACID in HIVE

Example

Partitioning and bucketing

Prerequisite

Partitioning

Bucketing

Best practices

Impala

Impala architecture

Understanding the Impala interface and queries

Practicing Impala

Loading Data from CSV files

Best practices

Summary

Real-Time Processing Engines

Technical requirements

Spark

Apache Spark internals

Spark driver

Spark workers

Cluster manager

Spark application job flow

Deep dive into resilient distributed datasets

RDD features

RDD operations

Installing and running our first Spark job

Spark-shell

Spark submit command

Maven dependencies

Accumulators and broadcast variables

Understanding dataframe and dataset

Dataframes

Dataset

Spark cluster managers

Best practices

Apache Flink

Flink architecture

Apache Flink ecosystem component

Dataset and data stream API

Dataset API

Transformation

Data sinks

Data streams

Exploring the table API

Best practices

Storm/Heron

Deep dive into the Storm/Heron architecture

Concept of a Storm application

Introduction to Apache Heron

Heron architecture

Understanding Storm Trident

Storm integrations

Best practices

Summary

Widely Used Hadoop Ecosystem Components

Technical requirements

Pig

Apache Pig architecture

Installing and running Pig

Introducing Pig Latin and Grunt

Writing UDF in Pig

Eval function

Filter function

How to use custom UDF in Pig

Pig with Hive

Best practices

HBase

HBase architecture and its concept

CAP theorem

HBase operations and its examples

Put operation

Get operation

Delete operation

Batch operation

Installation

Local mode Installation

Distributed mode installation

Master node configuration

Slave node configuration

Best practices

Kafka

Apache Kafka architecture

Installing and running Apache Kafka

Local mode installation

Distributed mode

Internals of producer and consumer

Producer

Consumer

Writing producer and consumer application

Kafka Connect for ETL

Best practices

Flume

Apache Flume architecture

Deep dive into source, channel, and sink

Sources

Pollable source

Event-driven source

Channels

Memory channel

File channel

Kafka channel

Sinks

Flume interceptor

Timestamp interceptor

Universally Unique Identifier (UUID) interceptor

Regex filter interceptor

Writing a custom interceptor

Use case – Twitter data

Best practices

Summary

Section 3: Hadoop in the Real World

Designing Applications in Hadoop

Technical requirements

File formats

Understanding file formats

Row format and column format

Schema evolution

Splittable versus non-splittable

Compression

Text

Sequence file

Avro

Optimized Row Columnar (ORC)

Parquet

Data compression

Types of data compression in Hadoop

Gzip

BZip2

Lempel-Ziv-Oberhumer

Snappy

Compression format consideration

Serialization

Data ingestion

Batch ingestion

Macro batch ingestion

Real-time ingestion

Data processing

Batch processing

Micro batch processing

Real-time processing

Common batch processing pattern

Slowly changing dimension

Slowly changing dimensions – type 1

Slowly changing dimensions – type 2

Duplicate record and small files

Real-time lookup

Airflow for orchestration

Data governance

Data governance pillars

Metadata management

Data life cycle management

Data classification

Summary

Real-Time Stream Processing in Hadoop

Technical requirements

What are streaming datasets?

Stream data ingestion

Flume event-based data ingestion

Kafka

Common stream data processing patterns

Unbounded data batch processing

Streaming design considerations

Latency

Data availability, integrity, and security

Unbounded data sources

Data lookups

Data formats

Serializing your data

Parallel processing

Out-of-order events

Message delivery semantics

Micro-batch processing case study

Real-time processing case study

Main code

Executing the code

Summary

Machine Learning in Hadoop

Technical requirements

Machine learning steps

Common machine learning challenges

Spark machine learning

Transformer function

Estimator

Spark ML pipeline

Hadoop and R

Mahout

Machine learning case study in Spark

Sentiment analysis using Spark ML

Summary

Hadoop in the Cloud

Technical requirements

Logical view of Hadoop in the cloud

Network

Regions and availability zone

VPC and subnet

Security groups/firewall rules

Practical example using AWS

Managing resources

Cloud-watch

Data pipelines

Amazon Data Pipeline

Airflow

Airflow components

Sample data pipeline DAG example

High availability (HA)

Server failure

Server instance high availability

Region and zone failure

Cloud storage high availability

Amazon S3 outage case history

Summary

Hadoop Cluster Profiling

Introduction to benchmarking and profiling

HDFS

DFSIO

NameNode

NNBench

NNThroughputBenchmark

Synthetic load generator (SLG)

YARN

Scheduler Load Simulator (SLS)

Hive

TPC-DS

TPC-H

Mix-workloads

Rumen

Gridmix

Summary

Section 4: Securing Hadoop

Who Can Do What in Hadoop

Hadoop security pillars

System security

Kerberos authentication

Kerberos advantages

Kerberos authentication flows

Service authentication

User authentication

Communication between the authenticated client and the authenticated Hadoop service

Symmetric key-based communication in Hadoop

User authorization

Ranger

Sentry

List of security features that have been worked upon in Hadoop 3.0

Summary

Network and Data Security

Securing Hadoop networks

Segregating different types of networks

Network firewalls

Tools for securing Hadoop services' network perimeter

Encryption

Data in transit encryption

Data at rest encryption

Masking

Filtering

Row-level filtering

Column-level filtering

Summary

Monitoring Hadoop

General monitoring

HDFS metrics

NameNode metrics

DataNode metrics

YARN metrics

ZooKeeper metrics

Apache Ambari

Security monitoring

Security information and event management

How does SIEM work?

Intrusion detection system

Intrusion prevention system

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think
