Modern Big Data Processing with Hadoop (eBook)


Authors: V. Naresh Kumar, Prashant Shindgikar

Publisher: Packt Publishing

Publication date: 2018-03-30

Word count: 396,000


Book description

A comprehensive guide to designing, building, and executing effective Big Data strategies using Hadoop.

About This Book

  • Get an in-depth view of the Apache Hadoop ecosystem and an overview of the architectural patterns pertaining to this popular Big Data platform
  • Conquer different data processing and analytics challenges using a multitude of tools, such as Apache Spark, Elasticsearch, Tableau, and more
  • A comprehensive, step-by-step guide that will teach you everything you need to know to become an expert Hadoop architect

Who This Book Is For

This book is for Big Data professionals who want to fast-track their career in the Hadoop industry and become expert Big Data architects. Project managers and mainframe professionals looking to build a career in Big Data and Hadoop will also find this book useful. Some understanding of Hadoop is required to get the best out of this book.

What You Will Learn

  • Build an efficient enterprise Big Data strategy centered around Apache Hadoop
  • Gain a thorough understanding of using Hadoop with various Big Data frameworks, such as Apache Spark, Elasticsearch, and more
  • Set up and deploy your Big Data environment on premises or on the cloud with Apache Ambari
  • Design effective streaming data pipelines and build your own enterprise search solutions
  • Use historical data to build analytics solutions and visualize them with popular tools such as Apache Superset
  • Plan, set up, and administer your Hadoop cluster efficiently

In Detail

The complex structure of today's data requires sophisticated solutions for data transformation that make information more accessible to users. This book empowers you to build such solutions with relative ease, with the help of Apache Hadoop along with a host of other Big Data tools.

It gives you a complete understanding of data lifecycle management with Hadoop, followed by the modeling of structured and unstructured data in Hadoop. It also shows you how to design real-time streaming pipelines by leveraging tools such as Apache Spark, and how to build efficient enterprise search solutions using Elasticsearch. You will learn to build enterprise-grade analytics solutions on Hadoop and to visualize your data using tools such as Apache Superset. The book also covers techniques for deploying Big Data solutions on the cloud with Apache Ambari, as well as expert techniques for managing and administering your Hadoop cluster.

By the end of this book, you will have all the knowledge you need to build expert Big Data systems.

Style and Approach

A comprehensive guide with a perfect blend of theory, examples, and implementations of real-world use cases.
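For a taste of the hands-on style (for example, the "Interactive data analysis with pyspark" section under Apache Spark 2 in the contents below), here is a minimal PySpark sketch of the kind of workflow the book walks through. It is not taken from the book: the HDFS path and the department/salary column names are hypothetical placeholders.

# A minimal PySpark sketch (illustrative, not from the book): read a
# headered CSV of employee records from HDFS and compute a simple
# aggregate. The path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("employee-salary-summary")  # hypothetical app name
    .getOrCreate()
)

# Hypothetical input file stored in HDFS.
employees = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("hdfs:///data/employees.csv")
)

# Average salary per department, highest first.
summary = (
    employees.groupBy("department")
    .agg(F.avg("salary").alias("avg_salary"))
    .orderBy(F.desc("avg_salary"))
)

summary.show(10)
spark.stop()

A script like this can be pasted into the pyspark shell or submitted with spark-submit against any cluster that holds such a file.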
Table of contents

Title Page

Copyright and Credits

Modern Big Data Processing with Hadoop

Packt Upsell

Why subscribe?

PacktPub.com

Contributors

About the authors

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Enterprise Data Architecture Principles

Data architecture principles

Volume

Velocity

Variety

Veracity

The importance of metadata

Data governance

Fundamentals of data governance

Data security

Application security

Input data

Big data security

RDBMS security

BI security

Physical security

Data encryption

Secure key management

Data as a Service

Evolution of data architecture with Hadoop

Hierarchical database architecture

Network database architecture

Relational database architecture

Employees

Devices

Department

Department and employee mapping table

Hadoop data architecture

Data layer

Data management layer

Job execution layer

Summary

Hadoop Life Cycle Management

Data wrangling

Data acquisition

Data structure analysis

Information extraction

Unwanted data removal

Data transformation

Data standardization

Data masking

Substitution

Static

Dynamic

Encryption

Hashing

Hiding

Erasing

Truncation

Variance

Shuffling

Data security

What is Apache Ranger?

Apache Ranger installation using Ambari

Ambari admin UI

Add service

Service placement

Service client placement

Database creation on master

Ranger database configuration

Configuration changes

Configuration review

Deployment progress

Application restart

Apache Ranger user guide

Login to UI

Access manager

Service details

Policy definition and auditing for HDFS

Summary

Hadoop Design Considerations

Understanding data structure principles

Installing a Hadoop cluster

Configuring Hadoop on NameNode

Format NameNode

Start all services

Exploring HDFS architecture

Defining NameNode

Secondary NameNode

NameNode safe mode

DataNode

Data replication

Rack awareness

HDFS WebUI

Introducing YARN

YARN architecture

Resource manager

Node manager

Configuration of YARN

Configuring HDFS high availability

In Hadoop 1.x

In Hadoop 2.x and onwards

HDFS HA cluster using NFS

Important architecture points

Configuration of HA NameNodes with shared storage

HDFS HA cluster using the quorum journal manager

Important architecture points

Configuration of HA NameNodes with QJM

Automatic failover

Important architecture points

Configuring automatic failover

Hadoop cluster composition

Typical Hadoop cluster

Best practices for Hadoop deployment

Hadoop file formats

Text/CSV file

JSON

Sequence file

Avro

Parquet

ORC

Which file format is better?

Summary

Data Movement Techniques

Batch processing versus real-time processing

Batch processing

Real-time processing

Apache Sqoop

Sqoop import

Import into HDFS

Import a MySQL table into an HBase table

Sqoop export

Flume

Apache Flume architecture

Data flow using Flume

Flume complex data flow architecture

Flume setup

Log aggregation use case

Apache NiFi

Main concepts of Apache NiFi

Apache NiFi architecture

Key features

Real-time log capture dataflow

Kafka Connect

Kafka Connect – a brief history

Why Kafka Connect?

Kafka Connect features

Kafka Connect architecture

Kafka Connect worker modes

Standalone mode

Distributed mode

Kafka Connect cluster distributed architecture

Example 1

Example 2

Summary

Data Modeling in Hadoop

Apache Hive

Apache Hive and RDBMS

Supported datatypes

How Hive works

Hive architecture

Hive data model management

Hive tables

Managed tables

External tables

Hive table partition

Hive static partitions and dynamic partitions

Hive partition bucketing

How Hive bucketing works

Creating buckets in a non-partitioned table

Creating buckets in a partitioned table

Hive views

Syntax of a view

Hive indexes

Compact index

Bitmap index

JSON documents using Hive

Example 1 – accessing simple JSON documents with Hive (Hive 0.14 and later versions)

Example 2 – accessing nested JSON documents with Hive (Hive 0.14 and later versions)

Example 3 – schema evolution with Hive and Avro (Hive 0.14 and later versions)

Apache HBase

Differences between HDFS and HBase

Differences between Hive and HBase

Key features of HBase

HBase data model

Differences between an RDBMS table and a column-oriented data store

HBase architecture

HBase architecture in a nutshell

HBase rowkey design

Example 4 – loading data from MySQL table to HBase table

Example 5 – incrementally loading data from MySQL table to HBase table

Example 6 – loading MySQL customer changed data into the HBase table

Example 7 – Hive HBase integration

Summary

Designing Real-Time Streaming Data Pipelines

Real-time streaming concepts

Data stream

Batch processing versus real-time data processing

Complex event processing

Continuous availability

Low latency

Scalable processing frameworks

Horizontal scalability

Storage

Real-time streaming components

Message queue

So what is Kafka?

Kafka features

Kafka architecture

Kafka architecture components

Kafka Connect deep dive

Kafka Connect architecture

Kafka Connect workers: standalone versus distributed mode

Install Kafka

Create topics

Generate messages to verify the producer and consumer

Kafka Connect using file Source and Sink

Kafka Connect using JDBC and file Sink Connectors

Apache Storm

Features of Apache Storm

Storm topology

Storm topology components

Installing Storm on a single-node cluster

Developing a real-time streaming pipeline with Storm

Streaming a pipeline from Kafka to Storm to MySQL

Streaming a pipeline from Kafka to Storm to HDFS

Other popular real-time data streaming frameworks

Kafka Streams API

Spark Streaming

Apache Flink

Apache Flink versus Spark

Apache Spark versus Storm

Summary

Large-Scale Data Processing Frameworks

MapReduce

Hadoop MapReduce

Streaming MapReduce

Java MapReduce

Summary

Apache Spark 2

Installing Spark using Ambari

Service selection in Ambari Admin

Add Service Wizard

Server placement

Clients and Slaves selection

Service customization

Software deployment

Spark installation progress

Service restarts and cleanup

Apache Spark data structures

RDDs, DataFrames, and Datasets

Apache Spark programming

Sample data for analysis

Interactive data analysis with pyspark

Standalone application with Spark

Spark Streaming application

Spark SQL application

Summary

Building an Enterprise Search Platform

The data search concept

The need for an enterprise search engine

Tools for building an enterprise search engine

Elasticsearch

Why Elasticsearch?

Elasticsearch components

Index

Document

Mapping

Cluster

Type

How to index documents in Elasticsearch?

Elasticsearch installation

Installation of Elasticsearch

Create index

Primary shard

Replica shard

Ingest documents into index

Bulk Insert

Document search

Meta fields

Mapping

Static mapping

Dynamic mapping

Elasticsearch-supported data types

Mapping example

Analyzer

Elasticsearch stack components

Beats

Logstash

Kibana

Use case

Summary

Designing Data Visualization Solutions

Data visualization

Bar/column chart

Line/area chart

Pie chart

Radar chart

Scatter/bubble chart

Other charts

Practical data visualization in Hadoop

Apache Druid

Druid components

Other required components

Apache Druid installation

Add service

Select Druid and Superset

Service placement on servers

Choose Slaves and Clients

Service configurations

Service installation

Installation summary

Sample data ingestion into Druid

MySQL database

Sample database

Download the sample dataset

Copy the data to MySQL

Verify integrity of the tables

Single normalized table

Apache Superset

Accessing the Superset application

Superset dashboards

Understanding Wikipedia edits data

Create Superset Slices using Wikipedia data

Unique users count

Word Cloud for top US regions

Sunburst chart – top 10 cities

Top 50 channels and namespaces via directed force layout

Top 25 countries/channels distribution

Creating a Wikipedia edits dashboard from Slices

Apache Superset with RDBMS

Supported databases

Understanding employee database

Employees table

Departments table

Department manager table

Department employees table

Titles table

Salaries table

Normalized employees table

Superset Slices for employees database

Register MySQL database/table

Slices and Dashboard creation

Department salary breakup

Salary diversity

Salary change per role per year

Dashboard creation

Summary

Developing Applications Using the Cloud

What is the Cloud?

Available technologies in the Cloud

Planning the Cloud infrastructure

Dedicated servers versus shared servers

Dedicated servers

Shared servers

High availability

Business continuity planning

Infrastructure unavailability

Natural disasters

Business data

BCP design example

The Hot–Hot system

The Hot–Cold system

Security

Server security

Application security

Network security

Single Sign-On

The AAA requirement

Building a Hadoop cluster in the Cloud

Google Cloud Dataproc

Getting a Google Cloud account

Activating the Google Cloud Dataproc service

Creating a new Hadoop cluster

Logging in to the cluster

Deleting the cluster

Data access in the Cloud

Block storage

File storage

Encrypted storage

Cold storage

Summary

Production Hadoop Cluster Deployment

Apache Ambari architecture

The Ambari server

Daemon management

Software upgrade

Software setup

LDAP/PAM/Kerberos management

Ambari backup and restore

Miscellaneous options

Ambari Agent

Ambari web interface

Database

Setting up a Hadoop cluster with Ambari

Server configurations

Preparing the server

Installing the Ambari server

Preparing the Hadoop cluster

Creating the Hadoop cluster

Ambari web interface

The Ambari home page

Creating a cluster

Managing users and groups

Deploying views

The cluster install wizard

Naming your cluster

Selecting the Hadoop version

Selecting a server

Setting up the node

Selecting services

Service placement on nodes

Selecting slave and client nodes

Customizing services

Reviewing the services

Installing the services on the nodes

Installation summary

The cluster dashboard

Hadoop clusters

A single cluster for the entire business

Multiple Hadoop clusters

Redundancy

A fully redundant Hadoop cluster

A data redundant Hadoop cluster

Cold backup

High availability

Business continuity

Application environments

Hadoop data copy

HDFS data copy

Summary
