Learning Hadoop 2
Table of Contents
Learning Hadoop 2
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introduction
A note on versioning
The background of Hadoop
Components of Hadoop
Common building blocks
Storage
Computation
Better together
Hadoop 2 – what's the big deal?
Storage in Hadoop 2
Computation in Hadoop 2
Distributions of Apache Hadoop
A dual approach
AWS – infrastructure on demand from Amazon
Simple Storage Service (S3)
Elastic MapReduce (EMR)
Getting started
Cloudera QuickStart VM
Amazon EMR
Creating an AWS account
Signing up for the necessary services
Using Elastic MapReduce
Getting Hadoop up and running
How to use EMR
AWS credentials
The AWS command-line interface
Running the examples
Data processing with Hadoop
Why Twitter?
Building our first dataset
One service, multiple APIs
Anatomy of a Tweet
Twitter credentials
Programmatic access with Python
Summary
2. Storage
The inner workings of HDFS
Cluster startup
NameNode startup
DataNode startup
Block replication
Command-line access to the HDFS filesystem
Exploring the HDFS filesystem
Protecting the filesystem metadata
Secondary NameNode not to the rescue
Hadoop 2 NameNode HA
Keeping the HA NameNodes in sync
Client configuration
How a failover works
Apache ZooKeeper – a different type of filesystem
Implementing a distributed lock with sequential ZNodes
Implementing group membership and leader election using ephemeral ZNodes
Java API
Building blocks
Further reading
Automatic NameNode failover
HDFS snapshots
Hadoop filesystems
Hadoop interfaces
Java FileSystem API
Libhdfs
Thrift
Managing and serializing data
The Writable interface
Introducing the wrapper classes
Array wrapper classes
The Comparable and WritableComparable interfaces
Storing data
Serialization and containers
Compression
General-purpose file formats
Column-oriented data formats
RCFile
ORC
Parquet
Avro
Using the Java API
Summary
3. Processing – MapReduce and Beyond
MapReduce
Java API to MapReduce
The Mapper class
The Reducer class
The Driver class
Combiner
Partitioning
The optional partition function
Hadoop-provided mapper and reducer implementations
Sharing reference data
Writing MapReduce programs
Getting started
Running the examples
Local cluster
Elastic MapReduce
WordCount, the Hello World of MapReduce
Word co-occurrences
Trending topics
The Top N pattern
Sentiment of hashtags
Text cleanup using chain mapper
Walking through a run of a MapReduce job
Startup
Splitting the input
Task assignment
Task startup
Ongoing JobTracker monitoring
Mapper input
Mapper execution
Mapper output and reducer input
Reducer input
Reducer execution
Reducer output
Shutdown
Input/Output
InputFormat and RecordReader
Hadoop-provided InputFormat
Hadoop-provided RecordReader
OutputFormat and RecordWriter
Hadoop-provided OutputFormat
Sequence files
YARN
YARN architecture
The components of YARN
Anatomy of a YARN application
Life cycle of a YARN application
Fault tolerance and monitoring
Thinking in layers
Execution models
YARN in the real world – Computation beyond MapReduce
The problem with MapReduce
Tez
Hive-on-Tez
Apache Spark
Apache Samza
YARN-independent frameworks
YARN today and beyond
Summary
4. Real-time Computation with Samza
Stream processing with Samza
How Samza works
Samza high-level architecture
Samza's best friend – Apache Kafka
YARN integration
An independent model
Hello Samza!
Building a tweet parsing job
The configuration file
Getting Twitter data into Kafka
Running a Samza job
Samza and HDFS
Windowing functions
Multijob workflows
Tweet sentiment analysis
Bootstrap streams
Stateful tasks
Summary
5. Iterative Computation with Spark
Apache Spark
Cluster computing with working sets
Resilient Distributed Datasets (RDDs)
Actions
Deployment
Spark on YARN
Spark on EC2
Getting started with Spark
Writing and running standalone applications
Scala API
Java API
WordCount in Java
Python API
The Spark ecosystem
Spark Streaming
GraphX
MLlib
Spark SQL
Processing data with Apache Spark
Building and running the examples
Running the examples on YARN
Finding popular topics
Assigning a sentiment to topics
Data processing on streams
State management
Data analysis with Spark SQL
SQL on data streams
Comparing Samza and Spark Streaming
Summary
6. Data Analysis with Apache Pig
An overview of Pig
Getting started
Running Pig
Grunt – the Pig interactive shell
Elastic MapReduce
Fundamentals of Apache Pig
Programming Pig
Pig data types
Pig functions
Load/store
Eval
The tuple, bag, and map functions
The math, string, and datetime functions
Dynamic invokers
Macros
Working with data
Filtering
Aggregation
Foreach
Join
Extending Pig (UDFs)
Contributed UDFs
Piggybank
Elephant Bird
Apache DataFu
Analyzing the Twitter stream
Prerequisites
Dataset exploration
Tweet metadata
Data preparation
Top N statistics
Datetime manipulation
Sessions
Capturing user interactions
Link analysis
Influential users
Summary
7. Hadoop and SQL
Why SQL on Hadoop
Other SQL-on-Hadoop solutions
Prerequisites
Overview of Hive
The nature of Hive tables
Hive architecture
Data types
DDL statements
File formats and storage
JSON
Avro
Columnar stores
Queries
Structuring Hive tables for given workloads
Partitioning a table
Overwriting and updating data
Bucketing and sorting
Sampling data
Writing scripts
Hive and Amazon Web Services
Hive and S3
Hive on Elastic MapReduce
Extending HiveQL
Programmatic interfaces
JDBC
Thrift
Stinger initiative
Impala
The architecture of Impala
Co-existing with Hive
A different philosophy
Drill, Tajo, and beyond
Summary
8. Data Lifecycle Management
What data lifecycle management is
Importance of data lifecycle management
Tools to help
Building a tweet analysis capability
Getting the tweet data
Introducing Oozie
A note on HDFS file permissions
Making development a little easier
Extracting data and ingesting into Hive
A note on workflow directory structure
Introducing HCatalog
Using HCatalog
The Oozie sharelib
HCatalog and partitioned tables
Producing derived data
Performing multiple actions in parallel
Calling a subworkflow
Adding global settings
Challenges of external data
Data validation
Validation actions
Handling format changes
Handling schema evolution with Avro
Final thoughts on using Avro schema evolution
Only make additive changes
Manage schema versions explicitly
Think about schema distribution
Collecting additional data
Scheduling workflows
Other Oozie triggers
Pulling it all together
Other tools to help
Summary
9. Making Development Easier
Choosing a framework
Hadoop streaming
Streaming word count in Python
Differences in jobs when using streaming
Finding important words in text
Calculate term frequency
Calculate document frequency
Putting it all together – TF-IDF
Kite Data
Data Core
Data HCatalog
Data Hive
Data MapReduce
Data Spark
Data Crunch
Apache Crunch
Getting started
Concepts
Data serialization
Data processing patterns
Aggregation and sorting
Joining data
Pipelines implementation and execution
SparkPipeline
MemPipeline
Crunch examples
Word co-occurrence
TF-IDF
Kite Morphlines
Concepts
Morphline commands
Summary
10. Running a Hadoop Cluster
I'm a developer – I don't care about operations!
Hadoop and DevOps practices
Cloudera Manager
To pay or not to pay
Cluster management using Cloudera Manager
Cloudera Manager and other management tools
Monitoring with Cloudera Manager
Finding configuration files
Cloudera Manager API
Cloudera Manager lock-in
Ambari – the open source alternative
Operations in the Hadoop 2 world
Sharing resources
Building a physical cluster
Physical layout
Rack awareness
Service layout
Upgrading a service
Building a cluster on EMR
Considerations about filesystems
Getting data into EMR
EC2 instances and tuning
Cluster tuning
JVM considerations
The small files problem
Map and reduce optimizations
Security
Evolution of the Hadoop security model
Beyond basic authorization
The future of Hadoop security
Consequences of using a secured cluster
Monitoring
Hadoop – where failures don't matter
Monitoring integration
Application-level metrics
Troubleshooting
Logging levels
Access to logfiles
ResourceManager, NodeManager, and Application Manager
Applications
Nodes
Scheduler
MapReduce
MapReduce v1
MapReduce v2 (YARN)
JobHistory Server
NameNode and DataNode
Summary
11. Where to Go Next
Alternative distributions
Cloudera Distribution for Hadoop
Hortonworks Data Platform
MapR
And the rest…
Choosing a distribution
Other computational frameworks
Apache Storm
Apache Giraph
Apache HAMA
Other interesting projects
HBase
Sqoop
Whirr
Mahout
Hue
Other programming abstractions
Cascading
AWS resources
SimpleDB and DynamoDB
Kinesis
Data Pipeline
Sources of information
Source code
Mailing lists and forums
LinkedIn groups
HUGs
Conferences
Summary
Index