Learning Hadoop 2 (e-book)

Author: Garry Turkington

Publisher: Packt Publishing

Publication date: 2015-02-13

Word count: 4,795,000

Category: Imported books > Foreign-language originals > Computers/Internet

If you are a system or application developer interested in learning how to solve practical problems using the Hadoop framework, then this book is ideal for you. You are expected to be familiar with the Unix/Linux command-line interface and have some experience with the Java programming language. Familiarity with Hadoop would be a plus.
Learning Hadoop 2

Table of Contents

Learning Hadoop 2

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Introduction

A note on versioning

The background of Hadoop

Components of Hadoop

Common building blocks

Storage

Computation

Better together

Hadoop 2 – what's the big deal?

Storage in Hadoop 2

Computation in Hadoop 2

Distributions of Apache Hadoop

A dual approach

AWS – infrastructure on demand from Amazon

Simple Storage Service (S3)

Elastic MapReduce (EMR)

Getting started

Cloudera QuickStart VM

Amazon EMR

Creating an AWS account

Signing up for the necessary services

Using Elastic MapReduce

Getting Hadoop up and running

How to use EMR

AWS credentials

The AWS command-line interface

Running the examples

Data processing with Hadoop

Why Twitter?

Building our first dataset

One service, multiple APIs

Anatomy of a Tweet

Twitter credentials

Programmatic access with Python

Summary

2. Storage

The inner workings of HDFS

Cluster startup

NameNode startup

DataNode startup

Block replication

Command-line access to the HDFS filesystem

Exploring the HDFS filesystem

Protecting the filesystem metadata

Secondary NameNode not to the rescue

Hadoop 2 NameNode HA

Keeping the HA NameNodes in sync

Client configuration

How a failover works

Apache ZooKeeper – a different type of filesystem

Implementing a distributed lock with sequential ZNodes

Implementing group membership and leader election using ephemeral ZNodes

Java API

Building blocks

Further reading

Automatic NameNode failover

HDFS snapshots

Hadoop filesystems

Hadoop interfaces

Java FileSystem API

Libhdfs

Thrift

Managing and serializing data

The Writable interface

Introducing the wrapper classes

Array wrapper classes

The Comparable and WritableComparable interfaces

Storing data

Serialization and Containers

Compression

General-purpose file formats

Column-oriented data formats

RCFile

ORC

Parquet

Avro

Using the Java API

Summary

3. Processing – MapReduce and Beyond

MapReduce

Java API to MapReduce

The Mapper class

The Reducer class

The Driver class

Combiner

Partitioning

The optional partition function

Hadoop-provided mapper and reducer implementations

Sharing reference data

Writing MapReduce programs

Getting started

Running the examples

Local cluster

Elastic MapReduce

WordCount, the Hello World of MapReduce

Word co-occurrences

Trending topics

The Top N pattern

Sentiment of hashtags

Text cleanup using chain mapper

Walking through a run of a MapReduce job

Startup

Splitting the input

Task assignment

Task startup

Ongoing JobTracker monitoring

Mapper input

Mapper execution

Mapper output and reducer input

Reducer input

Reducer execution

Reducer output

Shutdown

Input/Output

InputFormat and RecordReader

Hadoop-provided InputFormat

Hadoop-provided RecordReader

OutputFormat and RecordWriter

Hadoop-provided OutputFormat

Sequence files

YARN

YARN architecture

The components of YARN

Anatomy of a YARN application

Life cycle of a YARN application

Fault tolerance and monitoring

Thinking in layers

Execution models

YARN in the real world – Computation beyond MapReduce

The problem with MapReduce

Tez

Hive-on-Tez

Apache Spark

Apache Samza

YARN-independent frameworks

YARN today and beyond

Summary

4. Real-time Computation with Samza

Stream processing with Samza

How Samza works

Samza high-level architecture

Samza's best friend – Apache Kafka

YARN integration

An independent model

Hello Samza!

Building a tweet parsing job

The configuration file

Getting Twitter data into Kafka

Running a Samza job

Samza and HDFS

Windowing functions

Multijob workflows

Tweet sentiment analysis

Bootstrap streams

Stateful tasks

Summary

5. Iterative Computation with Spark

Apache Spark

Cluster computing with working sets

Resilient Distributed Datasets (RDDs)

Actions

Deployment

Spark on YARN

Spark on EC2

Getting started with Spark

Writing and running standalone applications

Scala API

Java API

WordCount in Java

Python API

The Spark ecosystem

Spark Streaming

GraphX

MLlib

Spark SQL

Processing data with Apache Spark

Building and running the examples

Running the examples on YARN

Finding popular topics

Assigning a sentiment to topics

Data processing on streams

State management

Data analysis with Spark SQL

SQL on data streams

Comparing Samza and Spark Streaming

Summary

6. Data Analysis with Apache Pig

An overview of Pig

Getting started

Running Pig

Grunt – the Pig interactive shell

Elastic MapReduce

Fundamentals of Apache Pig

Programming Pig

Pig data types

Pig functions

Load/store

Eval

The tuple, bag, and map functions

The math, string, and datetime functions

Dynamic invokers

Macros

Working with data

Filtering

Aggregation

Foreach

Join

Extending Pig (UDFs)

Contributed UDFs

Piggybank

Elephant Bird

Apache DataFu

Analyzing the Twitter stream

Prerequisites

Dataset exploration

Tweet metadata

Data preparation

Top n statistics

Datetime manipulation

Sessions

Capturing user interactions

Link analysis

Influential users

Summary

7. Hadoop and SQL

Why SQL on Hadoop

Other SQL-on-Hadoop solutions

Prerequisites

Overview of Hive

The nature of Hive tables

Hive architecture

Data types

DDL statements

File formats and storage

JSON

Avro

Columnar stores

Queries

Structuring Hive tables for given workloads

Partitioning a table

Overwriting and updating data

Bucketing and sorting

Sampling data

Writing scripts

Hive and Amazon Web Services

Hive and S3

Hive on Elastic MapReduce

Extending HiveQL

Programmatic interfaces

JDBC

Thrift

Stinger initiative

Impala

The architecture of Impala

Co-existing with Hive

A different philosophy

Drill, Tajo, and beyond

Summary

8. Data Lifecycle Management

What data lifecycle management is

Importance of data lifecycle management

Tools to help

Building a tweet analysis capability

Getting the tweet data

Introducing Oozie

A note on HDFS file permissions

Making development a little easier

Extracting data and ingesting into Hive

A note on workflow directory structure

Introducing HCatalog

Using HCatalog

The Oozie sharelib

HCatalog and partitioned tables

Producing derived data

Performing multiple actions in parallel

Calling a subworkflow

Adding global settings

Challenges of external data

Data validation

Validation actions

Handling format changes

Handling schema evolution with Avro

Final thoughts on using Avro schema evolution

Only make additive changes

Manage schema versions explicitly

Think about schema distribution

Collecting additional data

Scheduling workflows

Other Oozie triggers

Pulling it all together

Other tools to help

Summary

9. Making Development Easier

Choosing a framework

Hadoop streaming

Streaming word count in Python

Differences in jobs when using streaming

Finding important words in text

Calculate term frequency

Calculate document frequency

Putting it all together – TF-IDF

Kite Data

Data Core

Data HCatalog

Data Hive

Data MapReduce

Data Spark

Data Crunch

Apache Crunch

Getting started

Concepts

Data serialization

Data processing patterns

Aggregation and sorting

Joining data

Pipelines implementation and execution

SparkPipeline

MemPipeline

Crunch examples

Word co-occurrence

TF-IDF

Kite Morphlines

Concepts

Morphline commands

Summary

10. Running a Hadoop Cluster

I'm a developer – I don't care about operations!

Hadoop and DevOps practices

Cloudera Manager

To pay or not to pay

Cluster management using Cloudera Manager

Cloudera Manager and other management tools

Monitoring with Cloudera Manager

Finding configuration files

Cloudera Manager API

Cloudera Manager lock-in

Ambari – the open source alternative

Operations in the Hadoop 2 world

Sharing resources

Building a physical cluster

Physical layout

Rack awareness

Service layout

Upgrading a service

Building a cluster on EMR

Considerations about filesystems

Getting data into EMR

EC2 instances and tuning

Cluster tuning

JVM considerations

The small files problem

Map and reduce optimizations

Security

Evolution of the Hadoop security model

Beyond basic authorization

The future of Hadoop security

Consequences of using a secured cluster

Monitoring

Hadoop – where failures don't matter

Monitoring integration

Application-level metrics

Troubleshooting

Logging levels

Access to logfiles

ResourceManager, NodeManager, and Application Manager

Applications

Nodes

Scheduler

MapReduce

MapReduce v1

MapReduce v2 (YARN)

JobHistory Server

NameNode and DataNode

Summary

11. Where to Go Next

Alternative distributions

Cloudera Distribution for Hadoop

Hortonworks Data Platform

MapR

And the rest…

Choosing a distribution

Other computational frameworks

Apache Storm

Apache Giraph

Apache HAMA

Other interesting projects

HBase

Sqoop

Whirr

Mahout

Hue

Other programming abstractions

Cascading

AWS resources

SimpleDB and DynamoDB

Kinesis

Data Pipeline

Sources of information

Source code

Mailing lists and forums

LinkedIn groups

HUGs

Conferences

Summary

Index
