Mastering Hadoop
Table of Contents
Mastering Hadoop
Credits
About the Author
Acknowledgments
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Hadoop 2.X
The inception of Hadoop
The evolution of Hadoop
Hadoop's genealogy
Hadoop-0.20-append
Hadoop-0.20-security
Hadoop's timeline
Hadoop 2.X
Yet Another Resource Negotiator (YARN)
Architecture overview
Storage layer enhancements
High availability
HDFS Federation
HDFS snapshots
Other enhancements
Support enhancements
Hadoop distributions
Which Hadoop distribution?
Performance
Scalability
Reliability
Manageability
Available distributions
Cloudera Distribution of Hadoop (CDH)
Hortonworks Data Platform (HDP)
MapR
Pivotal HD
Summary
2. Advanced MapReduce
MapReduce input
The InputFormat class
The InputSplit class
The RecordReader class
Hadoop's "small files" problem
Filtering inputs
The Map task
The dfs.blocksize attribute
Sort and spill of intermediate outputs
Node-local Reducers or Combiners
Fetching intermediate outputs – Map-side
The Reduce task
Fetching intermediate outputs – Reduce-side
Merge and spill of intermediate outputs
MapReduce output
Speculative execution of tasks
MapReduce job counters
Handling data joins
Reduce-side joins
Map-side joins
Summary
3. Advanced Pig
Pig versus SQL
Different modes of execution
Complex data types in Pig
Compiling Pig scripts
The logical plan
The physical plan
The MapReduce plan
Development and debugging aids
The DESCRIBE command
The EXPLAIN command
The ILLUSTRATE command
The advanced Pig operators
The advanced FOREACH operator
The FLATTEN operator
The nested FOREACH operator
The COGROUP operator
The UNION operator
The CROSS operator
Specialized joins in Pig
The Replicated join
Skewed joins
The Merge join
User-defined functions
The evaluation functions
The aggregate functions
The Algebraic interface
The Accumulator interface
The filter functions
The load functions
The store functions
Pig performance optimizations
The optimization rules
Measurement of Pig script performance
Combiners in Pig
Memory for the Bag data type
Number of reducers in Pig
The multiquery mode in Pig
Best practices
The explicit usage of types
Early and frequent projection
Early and frequent filtering
The usage of the LIMIT operator
The usage of the DISTINCT operator
The reduction of operations
The usage of Algebraic UDFs
The usage of Accumulator UDFs
Eliminating nulls in the data
The usage of specialized joins
Compressing intermediate results
Combining smaller files
Summary
4. Advanced Hive
The Hive architecture
The Hive metastore
The Hive compiler
The Hive execution engine
The supporting components of Hive
Data types
File formats
Compressed files
ORC files
The Parquet files
The data model
Dynamic partitions
Semantics for dynamic partitioning
Indexes on Hive tables
Hive query optimizers
Advanced DML
The GROUP BY operation
ORDER BY versus SORT BY clauses
The JOIN operator and its types
Map-side joins
Advanced aggregation support
Other advanced clauses
UDF, UDAF, and UDTF
Summary
5. Serialization and Hadoop I/O
Data serialization in Hadoop
Writable and WritableComparable
Hadoop versus Java serialization
Avro serialization
Avro and MapReduce
Avro and Pig
Avro and Hive
Comparison – Avro versus Protocol Buffers / Thrift
File formats
The Sequence file format
Reading and writing Sequence files
The MapFile format
Other data structures
Compression
Splits and compressions
Scope for compression
Summary
6. YARN – Bringing Other Paradigms to Hadoop
The YARN architecture
Resource Manager (RM)
Application Master (AM)
Node Manager (NM)
YARN clients
Developing YARN applications
Writing YARN clients
Writing the Application Master entity
Monitoring YARN
Job scheduling in YARN
CapacityScheduler
FairScheduler
YARN commands
User commands
Administration commands
Summary
7. Storm on YARN – Low Latency Processing in Hadoop
Batch processing versus streaming
Apache Storm
Architecture of an Apache Storm cluster
Computation and data modeling in Apache Storm
Use cases for Apache Storm
Developing with Apache Storm
Apache Storm 0.9.1
Storm on YARN
Installing Apache Storm-on-YARN
Prerequisites
Installation procedure
Summary
8. Hadoop on the Cloud
Cloud computing characteristics
Hadoop on the cloud
Amazon Elastic MapReduce (EMR)
Provisioning a Hadoop cluster on EMR
Summary
9. HDFS Replacements
HDFS – advantages and drawbacks
Amazon AWS S3
Hadoop support for S3
Implementing a filesystem in Hadoop
Implementing an S3 native filesystem in Hadoop
Summary
10. HDFS Federation
Limitations of the older HDFS architecture
Architecture of HDFS Federation
Benefits of HDFS Federation
Deploying federated NameNodes
HDFS high availability
Secondary NameNode, Checkpoint Node, and Backup Node
High availability – edits sharing
Useful HDFS tools
Three-layer versus four-layer network topology
HDFS block placement
Pluggable block placement policy
Summary
11. Hadoop Security
The security pillars
Authentication in Hadoop
Kerberos authentication
The Kerberos architecture and workflow
Kerberos authentication and Hadoop
Authentication via HTTP interfaces
Authorization in Hadoop
Authorization in HDFS
Identity of an HDFS user
Group listings for an HDFS user
HDFS APIs and shell commands
Specifying the HDFS superuser
Turning off HDFS authorization
Limiting HDFS usage
Name quotas in HDFS
Space quotas in HDFS
Service-level authorization in Hadoop
Data confidentiality in Hadoop
HTTPS and encrypted shuffle
SSL configuration changes
Configuring the keystore and truststore
Audit logging in Hadoop
Summary
12. Analytics Using Hadoop
Data analytics workflow
Machine learning
Apache Mahout
Document analysis using Hadoop and Mahout
Term frequency
Document frequency
Term frequency – inverse document frequency
Tf-Idf in Pig
Cosine similarity distance measures
Clustering using k-means
K-means clustering using Apache Mahout
RHadoop
Summary
A. Hadoop for Microsoft Windows
Deploying Hadoop on Microsoft Windows
Prerequisites
Building Hadoop
Configuring Hadoop
Deploying Hadoop
Summary
Index