Learning Hadoop 2
Table of Contents
Learning Hadoop 2
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introduction
A note on versioning
The background of Hadoop
Components of Hadoop
Common building blocks
Storage
Computation
Better together
Hadoop 2 – what's the big deal?
Storage in Hadoop 2
Computation in Hadoop 2
Distributions of Apache Hadoop
A dual approach
AWS – infrastructure on demand from Amazon
Simple Storage Service (S3)
Elastic MapReduce (EMR)
Getting started
Cloudera QuickStart VM
Amazon EMR
Creating an AWS account
Signing up for the necessary services
Using Elastic MapReduce
Getting Hadoop up and running
How to use EMR
AWS credentials
The AWS command-line interface
Running the examples
Data processing with Hadoop
Why Twitter?
Building our first dataset
One service, multiple APIs
Anatomy of a Tweet
Twitter credentials
Programmatic access with Python
Summary
2. Storage
The inner workings of HDFS
Cluster startup
NameNode startup
DataNode startup
Block replication
Command-line access to the HDFS filesystem
Exploring the HDFS filesystem
Protecting the filesystem metadata
Secondary NameNode not to the rescue
Hadoop 2 NameNode HA
Keeping the HA NameNodes in sync
Client configuration
How a failover works
Apache ZooKeeper – a different type of filesystem
Implementing a distributed lock with sequential ZNodes
Implementing group membership and leader election using ephemeral ZNodes
Java API
Building blocks
Further reading
Automatic NameNode failover
HDFS snapshots
Hadoop filesystems
Hadoop interfaces
Java FileSystem API
Libhdfs
Thrift
Managing and serializing data
The Writable interface
Introducing the wrapper classes
Array wrapper classes
The Comparable and WritableComparable interfaces
Storing data
Serialization and containers
Compression
General-purpose file formats
Column-oriented data formats
RCFile
ORC
Parquet
Avro
Using the Java API
Summary
3. Processing – MapReduce and Beyond
MapReduce
Java API to MapReduce
The Mapper class
The Reducer class
The Driver class
Combiner
Partitioning
The optional partition function
Hadoop-provided mapper and reducer implementations
Sharing reference data
Writing MapReduce programs
Getting started
Running the examples
Local cluster
Elastic MapReduce
WordCount, the Hello World of MapReduce
Word co-occurrences
Trending topics
The Top N pattern
Sentiment of hashtags
Text cleanup using chain mapper
Walking through a run of a MapReduce job
Startup
Splitting the input
Task assignment
Task startup
Ongoing JobTracker monitoring
Mapper input
Mapper execution
Mapper output and reducer input
Reducer input
Reducer execution
Reducer output
Shutdown
Input/Output
InputFormat and RecordReader
Hadoop-provided InputFormat
Hadoop-provided RecordReader
OutputFormat and RecordWriter
Hadoop-provided OutputFormat
Sequence files
YARN
YARN architecture
The components of YARN
Anatomy of a YARN application
Life cycle of a YARN application
Fault tolerance and monitoring
Thinking in layers
Execution models
YARN in the real world – Computation beyond MapReduce
The problem with MapReduce
Tez
Hive-on-Tez
Apache Spark
Apache Samza
YARN-independent frameworks
YARN today and beyond
Summary
4. Real-time Computation with Samza
Stream processing with Samza
How Samza works
Samza high-level architecture
Samza's best friend – Apache Kafka
YARN integration
An independent model
Hello Samza!
Building a tweet parsing job
The configuration file
Getting Twitter data into Kafka
Running a Samza job
Samza and HDFS
Windowing functions
Multijob workflows
Tweet sentiment analysis
Bootstrap streams
Stateful tasks
Summary
5. Iterative Computation with Spark
Apache Spark
Cluster computing with working sets
Resilient Distributed Datasets (RDDs)
Actions
Deployment
Spark on YARN
Spark on EC2
Getting started with Spark
Writing and running standalone applications
Scala API
Java API
WordCount in Java
Python API
The Spark ecosystem
Spark Streaming
GraphX
MLlib
Spark SQL
Processing data with Apache Spark
Building and running the examples
Running the examples on YARN
Finding popular topics
Assigning a sentiment to topics
Data processing on streams
State management
Data analysis with Spark SQL
SQL on data streams
Comparing Samza and Spark Streaming
Summary
6. Data Analysis with Apache Pig
An overview of Pig
Getting started
Running Pig
Grunt – the Pig interactive shell
Elastic MapReduce
Fundamentals of Apache Pig
Programming Pig
Pig data types
Pig functions
Load/store
Eval
The tuple, bag, and map functions
The math, string, and datetime functions
Dynamic invokers
Macros
Working with data
Filtering
Aggregation
Foreach
Join
Extending Pig (UDFs)
Contributed UDFs
Piggybank
Elephant Bird
Apache DataFu
Analyzing the Twitter stream
Prerequisites
Dataset exploration
Tweet metadata
Data preparation
Top N statistics
Datetime manipulation
Sessions
Capturing user interactions
Link analysis
Influential users
Summary
7. Hadoop and SQL
Why SQL on Hadoop
Other SQL-on-Hadoop solutions
Prerequisites
Overview of Hive
The nature of Hive tables
Hive architecture
Data types
DDL statements
File formats and storage
JSON
Avro
Columnar stores
Queries
Structuring Hive tables for given workloads
Partitioning a table
Overwriting and updating data
Bucketing and sorting
Sampling data
Writing scripts
Hive and Amazon Web Services
Hive and S3
Hive on Elastic MapReduce
Extending HiveQL
Programmatic interfaces
JDBC
Thrift
Stinger initiative
Impala
The architecture of Impala
Co-existing with Hive
A different philosophy
Drill, Tajo, and beyond
Summary
8. Data Lifecycle Management
What data lifecycle management is
Importance of data lifecycle management
Tools to help
Building a tweet analysis capability
Getting the tweet data
Introducing Oozie
A note on HDFS file permissions
Making development a little easier
Extracting data and ingesting into Hive
A note on workflow directory structure
Introducing HCatalog
Using HCatalog
The Oozie sharelib
HCatalog and partitioned tables
Producing derived data
Performing multiple actions in parallel
Calling a subworkflow
Adding global settings
Challenges of external data
Data validation
Validation actions
Handling format changes
Handling schema evolution with Avro
Final thoughts on using Avro schema evolution
Only make additive changes
Manage schema versions explicitly
Think about schema distribution
Collecting additional data
Scheduling workflows
Other Oozie triggers
Pulling it all together
Other tools to help
Summary
9. Making Development Easier
Choosing a framework
Hadoop streaming
Streaming word count in Python
Differences in jobs when using streaming
Finding important words in text
Calculate term frequency
Calculate document frequency
Putting it all together – TF-IDF
Kite Data
Data Core
Data HCatalog
Data Hive
Data MapReduce
Data Spark
Data Crunch
Apache Crunch
Getting started
Concepts
Data serialization
Data processing patterns
Aggregation and sorting
Joining data
Pipelines implementation and execution
SparkPipeline
MemPipeline
Crunch examples
Word co-occurrence
TF-IDF
Kite Morphlines
Concepts
Morphline commands
Summary
10. Running a Hadoop Cluster
I'm a developer – I don't care about operations!
Hadoop and DevOps practices
Cloudera Manager
To pay or not to pay
Cluster management using Cloudera Manager
Cloudera Manager and other management tools
Monitoring with Cloudera Manager
Finding configuration files
Cloudera Manager API
Cloudera Manager lock-in
Ambari – the open source alternative
Operations in the Hadoop 2 world
Sharing resources
Building a physical cluster
Physical layout
Rack awareness
Service layout
Upgrading a service
Building a cluster on EMR
Considerations about filesystems
Getting data into EMR
EC2 instances and tuning
Cluster tuning
JVM considerations
The small files problem
Map and reduce optimizations
Security
Evolution of the Hadoop security model
Beyond basic authorization
The future of Hadoop security
Consequences of using a secured cluster
Monitoring
Hadoop – where failures don't matter
Monitoring integration
Application-level metrics
Troubleshooting
Logging levels
Access to logfiles
ResourceManager, NodeManager, and Application Manager
Applications
Nodes
Scheduler
MapReduce
MapReduce v1
MapReduce v2 (YARN)
JobHistory Server
NameNode and DataNode
Summary
11. Where to Go Next
Alternative distributions
Cloudera Distribution for Hadoop
Hortonworks Data Platform
MapR
And the rest…
Choosing a distribution
Other computational frameworks
Apache Storm
Apache Giraph
Apache HAMA
Other interesting projects
HBase
Sqoop
Whirr
Mahout
Hue
Other programming abstractions
Cascading
AWS resources
SimpleDB and DynamoDB
Kinesis
Data Pipeline
Sources of information
Source code
Mailing lists and forums
LinkedIn groups
HUGs
Conferences
Summary
Index