HDInsight Essentials - Second Edition (e-book)

Author: Rajesh Nadipalli

Publisher: Packt Publishing

Publication date: 2015-01-27

Word count: 953,000

Category: Imported books > Foreign-language originals > Computers/Internet

If you want to discover one of the latest tools designed to produce stunning Big Data insights, this book features everything you need to get to grips with your data. Whether you are a data architect, developer, or business strategist, HDInsight adds value across development, administration, and reporting.
HDInsight Essentials Second Edition

Table of Contents

HDInsight Essentials Second Edition

Credits

About the Author

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders

Instant updates on new Packt books

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Hadoop and HDInsight in a Heartbeat

Data is everywhere

Business value of big data

Hadoop concepts

Brief history of Hadoop

Core components

Hadoop cluster layout

HDFS overview

Writing a file to HDFS

Reading a file from HDFS

HDFS basic commands

YARN overview

YARN application life cycle

YARN workloads

Hadoop distributions

HDInsight overview

HDInsight and Hadoop relationship

Hadoop on Windows deployment options

Microsoft Azure HDInsight Service

HDInsight Emulator

Hortonworks Data Platform (HDP) for Windows

Summary

2. Enterprise Data Lake using HDInsight

Enterprise Data Warehouse architecture

Source systems

Data warehouse

Storage

Processing

User access

Provisioning and monitoring

Data governance and security

Pain points of EDW

The next generation Hadoop-based Enterprise data architecture

Source systems

Data Lake

Storage

Processing

User access

Provisioning and monitoring

Data governance, security, and metadata

Journey to your Data Lake dream

Ingestion and organization

Transformation (rules driven)

Access, analyze, and report

Tools and technology for Hadoop ecosystem

Use case powered by Microsoft HDInsight

Problem statement

Solution

Source systems

Storage

Processing

User access

Benefits

Summary

3. HDInsight Service on Azure

Registering for an Azure account

Azure storage

Provisioning an HDInsight cluster

Cluster topology

Provisioning using Azure PowerShell

Creating a storage container

Provisioning a new HDInsight cluster

HDInsight management dashboard

Dashboard

Monitor

Configuration

Exploring clusters using the remote desktop

Running a sample MapReduce

Deleting the cluster

HDInsight Emulator for the development

Installing HDInsight Emulator

Installation verification

Using HDInsight Emulator

Summary

4. Administering Your HDInsight Cluster

Monitoring cluster health

Name Node status

The Name Node Overview page

Datanode Status

Utilities and logs

Hadoop Service Availability

YARN Application Status

Azure storage management

Configuring your storage account

Monitoring your storage account

Managing access keys

Deleting your storage account

Azure PowerShell

Access Azure Blob storage using Azure PowerShell

Summary

5. Ingest and Organize Data Lake

End-to-end Data Lake solution

Ingesting to Data Lake using HDFS command

Connecting to a Hadoop client

Getting your files on the local storage

Transferring to HDFS

Loading data to Azure Blob storage using Azure PowerShell

Loading files to Data Lake using GUI tools

Storage access keys

Storage tools

CloudXplorer

Key benefits

Registering your storage account

Uploading files to your Blob storage

Using Sqoop to move data from RDBMS to Data Lake

Key benefits

Two modes of using Sqoop

Using Sqoop to import data (SQL to Hadoop)

Organizing your Data Lake in HDFS

Managing file metadata using HCatalog

Key benefits

Using HCatalog Command Line to create tables

Summary

6. Transform Data in the Data Lake

Transformation overview

Tools for transforming data in Data Lake

HCatalog

Persisting HCatalog metastore in a SQL database

Apache Hive

Hive architecture

Starting Hive in HDInsight

Basic Hive commands

Apache Pig

Pig architecture

Starting Pig in HDInsight node

Basic Pig commands

Pig or Hive

MapReduce

The mapper code

The reducer code

The driver code

Executing MapReduce on HDInsight

Azure PowerShell for execution of Hadoop jobs

Transformation for the OTP project

Cleaning data using Pig

Executing Pig script

Registering a refined and aggregate table using Hive

Executing Hive script

Reviewing results

Other tools used for transformation

Oozie

Spark

Summary

7. Analyze and Report from Data Lake

Data access overview

Analysis using Excel and Microsoft Hive ODBC driver

Prerequisites

Step 1 – installing the Microsoft Hive ODBC driver

Step 2 – creating Hive ODBC Data Source

Step 3 – importing data to Excel

Analysis using Excel Power Query

Prerequisites

Step 1 – installing the Microsoft Power Query for Excel

Step 2 – importing Azure Blob storage data into Excel

Step 3 – analyzing data using Excel

Other BI features in Excel

PowerPivot

Power View and Power Map

Step 1 – importing Azure Blob storage data into Excel

Step 2 – launch map view

Step 3 – configure the map

Power BI Catalog

Ad hoc analysis using Hive

Other alternatives for analysis

RHadoop

Apache Giraph

Apache Mahout

Azure Machine Learning

Summary

8. HDInsight 3.1 New Features

HBase

HBase positioning in Data Lake and use cases

Provisioning HDInsight HBase cluster

Creating a sample HBase schema

Designing the airline on-time performance table

Connecting to HBase using the HBase shell

Creating an HBase table

Loading data to the HBase table

Querying data from the HBase table

HBase additional information

Storm

Storm positioning in Data Lake

Storm key concepts

Provisioning HDInsight Storm cluster

Running a sample Storm topology

Connecting to Storm using Storm shell

Running the Storm Wordcount topology

Monitoring status of the Wordcount topology

Additional information on Storm

Apache Tez

Summary

9. Strategy for a Successful Data Lake Implementation

Challenges on building a production Data Lake

The success path for a production Data Lake

Identifying the big data problem

Proof of technology for Data Lake

Form a Data Lake Center of Excellence

Executive sponsors

Data Lake consumers

Development

Operations and infrastructure

Architectural considerations

Extensible and modular

Metadata-driven solution

Integration strategy

Security

Online resources

Summary

Index
