Hadoop Beginner's Guide
Table of Contents
Hadoop Beginner's Guide
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Time for action – heading
What just happened?
Pop quiz – heading
Have a go hero – heading
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. What It's All About
Big data processing
The value of data
Historically for the few and not the many
Classic data processing systems
Scale-up
Early approaches to scale-out
Limiting factors
A different approach
All roads lead to scale-out
Share nothing
Expect failure
Smart software, dumb hardware
Move processing, not data
Build applications, not infrastructure
Hadoop
Thanks, Google
Thanks, Doug
Thanks, Yahoo
Parts of Hadoop
Common building blocks
HDFS
MapReduce
Better together
Common architecture
What it is and isn't good for
Cloud computing with Amazon Web Services
Too many clouds
A third way
Different types of costs
AWS – infrastructure on demand from Amazon
Elastic Compute Cloud (EC2)
Simple Storage Service (S3)
Elastic MapReduce (EMR)
What this book covers
A dual approach
Summary
2. Getting Hadoop Up and Running
Hadoop on a local Ubuntu host
Other operating systems
Time for action – checking the prerequisites
What just happened?
Setting up Hadoop
A note on versions
Time for action – downloading Hadoop
What just happened?
Time for action – setting up SSH
What just happened?
Configuring and running Hadoop
Time for action – using Hadoop to calculate Pi
What just happened?
Three modes
Time for action – configuring the pseudo-distributed mode
What just happened?
Configuring the base directory and formatting the filesystem
Time for action – changing the base HDFS directory
What just happened?
Time for action – formatting the NameNode
What just happened?
Starting and using Hadoop
Time for action – starting Hadoop
What just happened?
Time for action – using HDFS
What just happened?
Time for action – WordCount, the Hello World of MapReduce
What just happened?
Have a go hero – WordCount on a larger body of text
Monitoring Hadoop from the browser
The HDFS web UI
The MapReduce web UI
Using Elastic MapReduce
Setting up an account in Amazon Web Services
Creating an AWS account
Signing up for the necessary services
Time for action – WordCount on EMR using the management console
What just happened?
Have a go hero – other EMR sample applications
Other ways of using EMR
AWS credentials
The EMR command-line tools
The AWS ecosystem
Comparison of local versus EMR Hadoop
Summary
3. Understanding MapReduce
Key/value pairs
What it means
Why key/value data?
Some real-world examples
MapReduce as a series of key/value transformations
Pop quiz – key/value pairs
The Hadoop Java API for MapReduce
The 0.20 MapReduce Java API
The Mapper class
The Reducer class
The Driver class
Writing MapReduce programs
Time for action – setting up the classpath
What just happened?
Time for action – implementing WordCount
What just happened?
Time for action – building a JAR file
What just happened?
Time for action – running WordCount on a local Hadoop cluster
What just happened?
Time for action – running WordCount on EMR
What just happened?
The pre-0.20 Java MapReduce API
Hadoop-provided mapper and reducer implementations
Time for action – WordCount the easy way
What just happened?
Walking through a run of WordCount
Startup
Splitting the input
Task assignment
Task startup
Ongoing JobTracker monitoring
Mapper input
Mapper execution
Mapper output and reduce input
Partitioning
The optional partition function
Reducer input
Reducer execution
Reducer output
Shutdown
That's all there is to it!
Apart from the combiner…maybe
Why have a combiner?
Time for action – WordCount with a combiner
What just happened?
When you can use the reducer as the combiner
Time for action – fixing WordCount to work with a combiner
What just happened?
Reuse is your friend
Pop quiz – MapReduce mechanics
Hadoop-specific data types
The Writable and WritableComparable interfaces
Introducing the wrapper classes
Primitive wrapper classes
Array wrapper classes
Map wrapper classes
Time for action – using the Writable wrapper classes
What just happened?
Other wrapper classes
Have a go hero – playing with Writables
Making your own
Input/output
Files, splits, and records
InputFormat and RecordReader
Hadoop-provided InputFormat
Hadoop-provided RecordReader
OutputFormat and RecordWriter
Hadoop-provided OutputFormat
Don't forget Sequence files
Summary
4. Developing MapReduce Programs
Using languages other than Java with Hadoop
How Hadoop Streaming works
Why use Hadoop Streaming
Time for action – implementing WordCount using Streaming
What just happened?
Differences in jobs when using Streaming
Analyzing a large dataset
Getting the UFO sighting dataset
Getting a feel for the dataset
Time for action – summarizing the UFO data
What just happened?
Examining UFO shapes
Time for action – summarizing the shape data
What just happened?
Time for action – correlating sighting duration to UFO shape
What just happened?
Using Streaming scripts outside Hadoop
Time for action – performing the shape/time analysis from the command line
What just happened?
Java shape and location analysis
Time for action – using ChainMapper for field validation/analysis
What just happened?
Have a go hero
Too many abbreviations
Using the Distributed Cache
Time for action – using the Distributed Cache to improve location output
What just happened?
Counters, status, and other output
Time for action – creating counters, task states, and writing log output
What just happened?
Too much information!
Summary
5. Advanced MapReduce Techniques
Simple, advanced, and in-between
Joins
When this is a bad idea
Map-side versus reduce-side joins
Matching account and sales information
Time for action – reduce-side join using MultipleInputs
What just happened?
DataJoinMapper and TaggedMapperOutput
Implementing map-side joins
Using the Distributed Cache
Have a go hero – implementing map-side joins
Pruning data to fit in the cache
Using a data representation instead of raw data
Using multiple mappers
To join or not to join...
Graph algorithms
Graph 101
Graphs and MapReduce – a match made somewhere
Representing a graph
Time for action – representing the graph
What just happened?
Overview of the algorithm
The mapper
The reducer
Iterative application
Time for action – creating the source code
What just happened?
Time for action – the first run
What just happened?
Time for action – the second run
What just happened?
Time for action – the third run
What just happened?
Time for action – the fourth and last run
What just happened?
Running multiple jobs
Final thoughts on graphs
Using language-independent data structures
Candidate technologies
Introducing Avro
Time for action – getting and installing Avro
What just happened?
Avro and schemas
Time for action – defining the schema
What just happened?
Time for action – creating the source Avro data with Ruby
What just happened?
Time for action – consuming the Avro data with Java
What just happened?
Using Avro within MapReduce
Time for action – generating shape summaries in MapReduce
What just happened?
Time for action – examining the output data with Ruby
What just happened?
Time for action – examining the output data with Java
What just happened?
Have a go hero – graphs in Avro
Going forward with Avro
Summary
6. When Things Break
Failure
Embrace failure
Or at least don't fear it
Don't try this at home
Types of failure
Hadoop node failure
The dfsadmin command
Cluster setup, test files, and block sizes
Fault tolerance and Elastic MapReduce
Time for action – killing a DataNode process
What just happened?
NameNode and DataNode communication
Have a go hero – NameNode log delving
Time for action – the replication factor in action
What just happened?
Time for action – intentionally causing missing blocks
What just happened?
When data may be lost
Block corruption
Time for action – killing a TaskTracker process
What just happened?
Comparing the DataNode and TaskTracker failures
Permanent failure
Killing the cluster masters
Time for action – killing the JobTracker
What just happened?
Starting a replacement JobTracker
Have a go hero – moving the JobTracker to a new host
Time for action – killing the NameNode process
What just happened?
Starting a replacement NameNode
The role of the NameNode in more detail
File systems, files, blocks, and nodes
The single most important piece of data in the cluster – fsimage
DataNode startup
Safe mode
SecondaryNameNode
So what to do when the NameNode process has a critical failure?
BackupNode/CheckpointNode and NameNode HA
Hardware failure
Host failure
Host corruption
The risk of correlated failures
Task failure due to software
Failure of slow-running tasks
Time for action – causing task failure
What just happened?
Have a go hero – HDFS programmatic access
Hadoop's handling of slow-running tasks
Speculative execution
Hadoop's handling of failing tasks
Have a go hero – causing tasks to fail
Task failure due to data
Handling dirty data through code
Using Hadoop's skip mode
Time for action – handling dirty data by using skip mode
What just happened?
To skip or not to skip...
Summary
7. Keeping Things Running
A note on EMR
Hadoop configuration properties
Default values
Time for action – browsing default properties
What just happened?
Additional property elements
Default storage location
Where to set properties
Setting up a cluster
How many hosts?
Calculating usable space on a node
Location of the master nodes
Sizing hardware
Processor / memory / storage ratio
EMR as a prototyping platform
Special node requirements
Storage types
Commodity versus enterprise class storage
Single disk versus RAID
Finding the balance
Network storage
Hadoop networking configuration
How blocks are placed
Rack awareness
The rack-awareness script
Time for action – examining the default rack configuration
What just happened?
Time for action – adding a rack awareness script
What just happened?
What is commodity hardware anyway?
Pop quiz – setting up a cluster
Cluster access control
The Hadoop security model
Time for action – demonstrating the default security
What just happened?
User identity
The super user
More granular access control
Working around the security model via physical access control
Managing the NameNode
Configuring multiple locations for the fsimage copies
Time for action – adding an additional fsimage location
What just happened?
Where to write the fsimage copies
Swapping to another NameNode host
Having things ready before disaster strikes
Time for action – swapping to a new NameNode host
What just happened?
Don't celebrate quite yet!
What about MapReduce?
Have a go hero – swapping to a new NameNode host
Managing HDFS
Where to write data
Using balancer
When to rebalance
MapReduce management
Command line job management
Have a go hero – command line job management
Job priorities and scheduling
Time for action – changing job priorities and killing a job
What just happened?
Alternative schedulers
Capacity Scheduler
Fair Scheduler
Enabling alternative schedulers
When to use alternative schedulers
Scaling
Adding capacity to a local Hadoop cluster
Have a go hero – adding a node and running balancer
Adding capacity to an EMR job flow
Expanding a running job flow
Summary
8. A Relational View on Data with Hive
Overview of Hive
Why use Hive?
Thanks, Facebook!
Setting up Hive
Prerequisites
Getting Hive
Time for action – installing Hive
What just happened?
Using Hive
Time for action – creating a table for the UFO data
What just happened?
Time for action – inserting the UFO data
What just happened?
Validating the data
Time for action – validating the table
What just happened?
Time for action – redefining the table with the correct column separator
What just happened?
Hive tables – real or not?
Time for action – creating a table from an existing file
What just happened?
Time for action – performing a join
What just happened?
Have a go hero – improve the join to use regular expressions
Hive and SQL views
Time for action – using views
What just happened?
Handling dirty data in Hive
Have a go hero – do it!
Time for action – exporting query output
What just happened?
Partitioning the table
Time for action – making a partitioned UFO sighting table
What just happened?
Bucketing, clustering, and sorting... oh my!
User-Defined Function
Time for action – adding a new User-Defined Function (UDF)
What just happened?
To preprocess or not to preprocess...
Hive versus Pig
What we didn't cover
Hive on Amazon Web Services
Time for action – running UFO analysis on EMR
What just happened?
Using interactive job flows for development
Have a go hero – using an interactive EMR cluster
Integration with other AWS products
Summary
9. Working with Relational Databases
Common data paths
Hadoop as an archive store
Hadoop as a preprocessing step
Hadoop as a data input tool
The serpent eats its own tail
Setting up MySQL
Time for action – installing and setting up MySQL
What just happened?
Did it have to be so hard?
Time for action – configuring MySQL to allow remote connections
What just happened?
Don't do this in production!
Time for action – setting up the employee database
What just happened?
Be careful with data file access rights
Getting data into Hadoop
Using MySQL tools and manual import
Have a go hero – exporting the employee table into HDFS
Accessing the database from the mapper
A better way – introducing Sqoop
Time for action – downloading and configuring Sqoop
What just happened?
Sqoop and Hadoop versions
Sqoop and HDFS
Time for action – exporting data from MySQL to HDFS
What just happened?
Mappers and primary key columns
Other options
Sqoop's architecture
Importing data into Hive using Sqoop
Time for action – exporting data from MySQL into Hive
What just happened?
Time for action – a more selective import
What just happened?
Datatype issues
Time for action – using a type mapping
What just happened?
Time for action – importing data from a raw query
What just happened?
Have a go hero
Sqoop and Hive partitions
Field and line terminators
Getting data out of Hadoop
Writing data from within the reducer
Writing SQL import files from the reducer
A better way – Sqoop again
Time for action – importing data from Hadoop into MySQL
What just happened?
Differences between Sqoop imports and exports
Inserts versus updates
Have a go hero
Sqoop and Hive exports
Time for action – importing Hive data into MySQL
What just happened?
Time for action – fixing the mapping and re-running the export
What just happened?
Other Sqoop features
Incremental merge
Avoiding partial exports
Sqoop as a code generator
AWS considerations
Considering RDS
Summary
10. Data Collection with Flume
A note about AWS
Data, data everywhere...
Types of data
Getting network traffic into Hadoop
Time for action – getting web server data into Hadoop
What just happened?
Have a go hero
Getting files into Hadoop
Hidden issues
Keeping network data on the network
Hadoop dependencies
Reliability
Re-creating the wheel
A common framework approach
Introducing Apache Flume
A note on versioning
Time for action – installing and configuring Flume
What just happened?
Using Flume to capture network data
Time for action – capturing network traffic in a log file
What just happened?
Time for action – logging to the console
What just happened?
Writing network data to log files
Time for action – capturing the output of a command to a flat file
What just happened?
Logs versus files
Time for action – capturing a remote file in a local flat file
What just happened?
Sources, sinks, and channels
Sources
Sinks
Channels
Or roll your own
Understanding the Flume configuration files
Have a go hero
It's all about events
Time for action – writing network traffic onto HDFS
What just happened?
Time for action – adding timestamps
What just happened?
To Sqoop or to Flume...
Time for action – multi-level Flume networks
What just happened?
Time for action – writing to multiple sinks
What just happened?
Selectors replicating and multiplexing
Handling sink failure
Have a go hero – handling sink failure
Next, the world
Have a go hero – Next, the world
The bigger picture
Data lifecycle
Staging data
Scheduling
Summary
11. Where to Go Next
What we did and didn't cover in this book
Upcoming Hadoop changes
Alternative distributions
Why alternative distributions?
Bundling
Free and commercial extensions
Cloudera Distribution for Hadoop
Hortonworks Data Platform
MapR
IBM InfoSphere BigInsights
Choosing a distribution
Other Apache projects
HBase
Oozie
Whirr
Mahout
MRUnit
Other programming abstractions
Pig
Cascading
AWS resources
HBase on EMR
SimpleDB
DynamoDB
Sources of information
Source code
Mailing lists and forums
LinkedIn groups
HUGs
Conferences
Summary
A. Pop Quiz Answers
Chapter 3, Understanding MapReduce
Pop quiz – key/value pairs
Pop quiz – walking through a run of WordCount
Chapter 7, Keeping Things Running
Pop quiz – setting up a cluster
Index