Title Page
Copyright and Credits
Modern Big Data Processing with Hadoop
Packt Upsell
Why subscribe?
PacktPub.com
Contributors
About the authors
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Enterprise Data Architecture Principles
Data architecture principles
Volume
Velocity
Variety
Veracity
The importance of metadata
Data governance
Fundamentals of data governance
Data security
Application security
Input data
Big data security
RDBMS security
BI security
Physical security
Data encryption
Secure key management
Data as a Service
Evolution of data architecture with Hadoop
Hierarchical database architecture
Network database architecture
Relational database architecture
Employees
Devices
Department
Department and employee mapping table
Hadoop data architecture
Data layer
Data management layer
Job execution layer
Summary
Hadoop Life Cycle Management
Data wrangling
Data acquisition
Data structure analysis
Information extraction
Unwanted data removal
Data transformation
Data standardization
Data masking
Substitution
Static
Dynamic
Encryption
Hashing
Hiding
Erasing
Truncation
Variance
Shuffling
Data security
What is Apache Ranger?
Apache Ranger installation using Ambari
Ambari admin UI
Add service
Service placement
Service client placement
Database creation on master
Ranger database configuration
Configuration changes
Configuration review
Deployment progress
Application restart
Apache Ranger user guide
Login to UI
Access manager
Service details
Policy definition and auditing for HDFS
Summary
Hadoop Design Considerations
Understanding data structure principles
Installing Hadoop cluster
Configuring Hadoop on NameNode
Format NameNode
Start all services
Exploring HDFS architecture
Defining NameNode
Secondary NameNode
NameNode safe mode
DataNode
Data replication
Rack awareness
HDFS WebUI
Introducing YARN
YARN architecture
Resource manager
Node manager
Configuration of YARN
Configuring HDFS high availability
In Hadoop 1.x
In Hadoop 2.x and onwards
HDFS HA cluster using NFS
Important architecture points
Configuration of HA NameNodes with shared storage
HDFS HA cluster using the quorum journal manager
Important architecture points
Configuration of HA NameNodes with QJM
Automatic failover
Important architecture points
Configuring automatic failover
Hadoop cluster composition
Typical Hadoop cluster
Best practices for Hadoop deployment
Hadoop file formats
Text/CSV file
JSON
Sequence file
Avro
Parquet
ORC
Which file format is better?
Summary
Data Movement Techniques
Batch processing versus real-time processing
Batch processing
Real-time processing
Apache Sqoop
Sqoop import
Import into HDFS
Import a MySQL table into an HBase table
Sqoop export
Flume
Apache Flume architecture
Data flow using Flume
Flume complex data flow architecture
Flume setup
Log aggregation use case
Apache NiFi
Main concepts of Apache NiFi
Apache NiFi architecture
Key features
Real-time log capture dataflow
Kafka Connect
Kafka Connect – a brief history
Why Kafka Connect?
Kafka Connect features
Kafka Connect architecture
Kafka Connect worker modes
Standalone mode
Distributed mode
Kafka Connect cluster distributed architecture
Example 1
Example 2
Summary
Data Modeling in Hadoop
Apache Hive
Apache Hive and RDBMS
Supported datatypes
How Hive works
Hive architecture
Hive data model management
Hive tables
Managed tables
External tables
Hive table partition
Hive static partitions and dynamic partitions
Hive partition bucketing
How Hive bucketing works
Creating buckets in a non-partitioned table
Creating buckets in a partitioned table
Hive views
Syntax of a view
Hive indexes
Compact index
Bitmap index
JSON documents using Hive
Example 1 – Accessing simple JSON documents with Hive (Hive 0.14 and later versions)
Example 2 – Accessing nested JSON documents with Hive (Hive 0.14 and later versions)
Example 3 – Schema evolution with Hive and Avro (Hive 0.14 and later versions)
Apache HBase
Differences between HDFS and HBase
Differences between Hive and HBase
Key features of HBase
HBase data model
Difference between RDBMS table and column-oriented data store
HBase architecture
HBase architecture in a nutshell
HBase rowkey design
Example 4 – loading data from MySQL table to HBase table
Example 5 – incrementally loading data from MySQL table to HBase table
Example 6 – loading MySQL customer changed data into the HBase table
Example 7 – Hive HBase integration
Summary
Designing Real-Time Streaming Data Pipelines
Real-time streaming concepts
Data stream
Batch processing versus real-time data processing
Complex event processing
Continuous availability
Low latency
Scalable processing frameworks
Horizontal scalability
Storage
Real-time streaming components
Message queue
So what is Kafka?
Kafka features
Kafka architecture
Kafka architecture components
Kafka Connect deep dive
Kafka Connect architecture
Kafka Connect workers standalone versus distributed mode
Install Kafka
Create topics
Generate messages to verify the producer and consumer
Kafka Connect using file Source and Sink
Kafka Connect using JDBC and file Sink Connectors
Apache Storm
Features of Apache Storm
Storm topology
Storm topology components
Installing Storm on a single node cluster
Developing a real-time streaming pipeline with Storm
Streaming a pipeline from Kafka to Storm to MySQL
Streaming a pipeline with Kafka to Storm to HDFS
Other popular real-time data streaming frameworks
Kafka Streams API
Spark Streaming
Apache Flink
Apache Flink versus Spark
Apache Spark versus Storm
Summary
Large-Scale Data Processing Frameworks
MapReduce
Hadoop MapReduce
Streaming MapReduce
Java MapReduce
Summary
Apache Spark 2
Installing Spark using Ambari
Service selection in Ambari Admin
Add Service Wizard
Server placement
Clients and Slaves selection
Service customization
Software deployment
Spark installation progress
Service restarts and cleanup
Apache Spark data structures
RDDs, DataFrames, and Datasets
Apache Spark programming
Sample data for analysis
Interactive data analysis with pyspark
Standalone application with Spark
Spark streaming application
Spark SQL application
Summary
Building Enterprise Search Platform
The data search concept
The need for an enterprise search engine
Tools for building an enterprise search engine
Elasticsearch
Why Elasticsearch?
Elasticsearch components
Index
Document
Mapping
Cluster
Type
How to index documents in Elasticsearch?
Elasticsearch installation
Installation of Elasticsearch
Create index
Primary shard
Replica shard
Ingest documents into index
Bulk Insert
Document search
Meta fields
Mapping
Static mapping
Dynamic mapping
Elasticsearch-supported data types
Mapping example
Analyzer
Elasticsearch stack components
Beats
Logstash
Kibana
Use case
Summary
Designing Data Visualization Solutions
Data visualization
Bar/column chart
Line/area chart
Pie chart
Radar chart
Scatter/bubble chart
Other charts
Practical data visualization in Hadoop
Apache Druid
Druid components
Other required components
Apache Druid installation
Add service
Select Druid and Superset
Service placement on servers
Choose Slaves and Clients
Service configurations
Service installation
Installation summary
Sample data ingestion into Druid
MySQL database
Sample database
Download the sample dataset
Copy the data to MySQL
Verify integrity of the tables
Single Normalized Table
Apache Superset
Accessing the Superset application
Superset dashboards
Understanding Wikipedia edits data
Create Superset Slices using Wikipedia data
Unique users count
Word Cloud for top US regions
Sunburst chart – top 10 cities
Top 50 channels and namespaces via directed force layout
Top 25 countries/channels distribution
Creating Wikipedia edits dashboard from Slices
Apache Superset with RDBMS
Supported databases
Understanding employee database
Employees table
Departments table
Department manager table
Department employees table
Titles table
Salaries table
Normalized employees table
Superset Slices for employees database
Register MySQL database/table
Slices and Dashboard creation
Department salary breakup
Salary Diversity
Salary Change Per Role Per Year
Dashboard creation
Summary
Developing Applications Using the Cloud
What is the Cloud?
Available technologies in the Cloud
Planning the Cloud infrastructure
Dedicated servers versus shared servers
Dedicated servers
Shared servers
High availability
Business continuity planning
Infrastructure unavailability
Natural disasters
Business data
BCP design example
The Hot–Hot system
The Hot–Cold system
Security
Server security
Application security
Network security
Single Sign-On
The AAA requirement
Building a Hadoop cluster in the Cloud
Google Cloud Dataproc
Getting a Google Cloud account
Activating the Google Cloud Dataproc service
Creating a new Hadoop cluster
Logging in to the cluster
Deleting the cluster
Data access in the Cloud
Block storage
File storage
Encrypted storage
Cold storage
Summary
Production Hadoop Cluster Deployment
Apache Ambari architecture
The Ambari server
Daemon management
Software upgrade
Software setup
LDAP/PAM/Kerberos management
Ambari backup and restore
Miscellaneous options
Ambari Agent
Ambari web interface
Database
Setting up a Hadoop cluster with Ambari
Server configurations
Preparing the server
Installing the Ambari server
Preparing the Hadoop cluster
Creating the Hadoop cluster
Ambari web interface
The Ambari home page
Creating a cluster
Managing users and groups
Deploying views
The cluster install wizard
Naming your cluster
Selecting the Hadoop version
Selecting a server
Setting up the node
Selecting services
Service placement on nodes
Selecting slave and client nodes
Customizing services
Reviewing the services
Installing the services on the nodes
Installation summary
The cluster dashboard
Hadoop clusters
A single cluster for the entire business
Multiple Hadoop clusters
Redundancy
A fully redundant Hadoop cluster
A data redundant Hadoop cluster
Cold backup
High availability
Business continuity
Application environments
Hadoop data copy
HDFS data copy
Summary