Hadoop Essentials
Table of Contents
Hadoop Essentials
Credits
About the Author
Acknowledgments
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introduction to Big Data and Hadoop
V's of big data
Volume
Velocity
Variety
Understanding big data
NoSQL
Types of NoSQL databases
Analytical database
Who is creating big data?
Big data use cases
Big data use case patterns
Big data as a storage pattern
Big data as a data transformation pattern
Big data for a data analysis pattern
Big data for data in a real-time pattern
Big data for a low latency caching pattern
Hadoop
Hadoop history
Description
Advantages of Hadoop
Uses of Hadoop
Hadoop ecosystem
Apache Hadoop
Hadoop distributions
Pillars of Hadoop
Data access components
Data storage component
Data ingestion in Hadoop
Streaming and real-time analysis
Summary
2. Hadoop Ecosystem
Traditional systems
Database trend
The Hadoop use cases
Hadoop's basic data flow
Hadoop integration
The Hadoop ecosystem
Distributed filesystem
HDFS
Distributed programming
NoSQL databases
Apache HBase
Data ingestion
Service programming
Apache YARN
Apache ZooKeeper
Scheduling
Data analytics and machine learning
System management
Apache Ambari
Summary
3. Pillars of Hadoop – HDFS, MapReduce, and YARN
HDFS
Features of HDFS
HDFS architecture
NameNode
DataNode
Checkpoint NameNode or Secondary NameNode
BackupNode
Data storage in HDFS
Read pipeline
Write pipeline
Rack awareness
Advantages of rack awareness in HDFS
HDFS federation
Limitations of HDFS 1.0
The benefit of HDFS federation
HDFS ports
HDFS commands
MapReduce
The MapReduce architecture
JobTracker
TaskTracker
Serialization data types
The Writable interface
WritableComparable interface
The MapReduce example
The MapReduce process
Mapper
Shuffle and sorting
Reducer
Speculative execution
FileFormats
InputFormats
RecordReader
OutputFormats
RecordWriter
Writing a MapReduce program
Mapper code
Reducer code
Driver code
Auxiliary steps
Combiner
Partitioner
Custom partitioner
YARN
YARN architecture
ResourceManager
NodeManager
ApplicationMaster
Applications powered by YARN
Summary
4. Data Access Components – Hive and Pig
The need for a data processing tool on Hadoop
Pig
Pig data types
The Pig architecture
The logical plan
The physical plan
The MapReduce plan
Pig modes
Grunt shell
Input data
Loading data
Dump
Store
FOREACH generate
Filter
Group By
Limit
Aggregation
Cogroup
DESCRIBE
EXPLAIN
ILLUSTRATE
Hive
The Hive architecture
Metastore
The Query compiler
The Execution engine
Data types and schemas
Installing Hive
Starting Hive shell
HiveQL
DDL (Data Definition Language) operations
DML (Data Manipulation Language) operations
The SQL operation
Joins
Aggregations
Built-in functions
Custom UDF (User Defined Functions)
Managing tables – external versus managed
SerDe
Partitioning
Bucketing
Summary
5. Storage Component – HBase
An Overview of HBase
Advantages of HBase
The Architecture of HBase
MasterServer
RegionServer
WAL
BlockCache
LRUBlockCache
SlabCache
BucketCache
Regions
MemStore
ZooKeeper
The HBase data model
Logical components of a data model
ACID properties
The CAP theorem
The Schema design
The Write pipeline
The Read pipeline
Compaction
The Compaction policy
Minor compaction
Major compaction
Splitting
Pre-Splitting
Auto Splitting
Forced Splitting
Commands
help
Create
List
Put
Scan
Get
Disable
Drop
HBase Hive integration
Performance tuning
Compression
Filters
Counters
HBase coprocessors
Summary
6. Data Ingestion in Hadoop – Sqoop and Flume
Data sources
Challenges in data ingestion
Sqoop
Connectors and drivers
Sqoop 1 architecture
Limitation of Sqoop 1
Sqoop 2 architecture
Imports
Exports
Apache Flume
Reliability
Flume architecture
Multitier topology
Flume master
Flume nodes
Components in Agent
Source
Sink
Channels
Memory channel
File Channel
JDBC Channel
Examples of configuring Flume
The Single agent example
Multiple flows in an agent
Configuring a multiagent setup
Summary
7. Streaming and Real-time Analysis – Storm and Spark
An introduction to Storm
Features of Storm
Physical architecture of Storm
Data architecture of Storm
Storm topology
Storm on YARN
Topology configuration example
Spouts
Bolts
Topology
An introduction to Spark
Features of Spark
Spark framework
Spark SQL
GraphX
MLlib
Spark streaming
Spark architecture
Directed Acyclic Graph engine
Resilient Distributed Dataset
Physical architecture
Operations in Spark
Transformations
Actions
Spark example
Summary
Index