
Hadoop Blueprints (eBook)


Authors: Anurag Shrivastava, Tanmay Deshpande

Publisher: Packt Publishing

Publication date: 2016-09-01

Word count: 2.64 million (characters)

Category: Imported Books > Foreign-Language Originals > Computers/Networking

Use Hadoop to solve business problems by learning from a rich set of real-life case studies.

About This Book

  • Solve real-world business problems using Hadoop and other Big Data technologies
  • Build efficient data lakes in Hadoop, and develop systems for various business cases such as improving marketing campaigns, fraud detection, and more
  • Packed with six case studies to get you going with Hadoop for Business Intelligence

Who This Book Is For

If you are interested in building efficient business solutions using Hadoop, this is the book for you. It assumes that you have basic knowledge of Hadoop, Java, and any scripting language.

What You Will Learn

  • Learn about the evolution of Hadoop as the big data platform
  • Understand the basics of Hadoop architecture
  • Build a 360-degree view of your customer using Sqoop and Hive
  • Build and run classification models on Hadoop using BigML
  • Use Spark and Hadoop to build a fraud detection system
  • Develop a churn detection system using Java and MapReduce
  • Build an IoT-based data collection and visualization system
  • Get to grips with building a Hadoop-based data lake for large enterprises
  • Learn about the coexistence of NoSQL and in-memory databases in the Hadoop ecosystem

In Detail

If you have a basic understanding of Hadoop and want to put your knowledge to use to build Big Data solutions for business, then this book is for you. Build six real-life, end-to-end solutions using the tools in the Hadoop ecosystem, and take your knowledge of Hadoop to the next level.

Start off by understanding the various business problems that can be solved using Hadoop, and get acquainted with the common architectural patterns used to build Hadoop-based solutions. Build a 360-degree view of the customer by working with different types of data, and build an efficient fraud detection system for a financial institution. You will also develop a system in Hadoop to improve the effectiveness of marketing campaigns, build a churn detection system for a telecom company, develop an Internet of Things (IoT) system to monitor the environment in a factory, and build a data lake, all making use of the concepts and techniques described in this book.

The book also covers technologies and frameworks such as Apache Spark, Hive, and Sqoop, and shows how they can be used in conjunction with Hadoop. You will be able to try out the solutions explained in the book and use the knowledge gained to extend them further in your own problem space.

Style and approach

This is an example-driven book in which each chapter covers a single business problem and describes its solution by explaining the structure of the dataset and the tools required to process it. Every project is demonstrated step by step and explained in an easy-to-understand manner.
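To give a concrete flavour of the kind of program the book builds (for example, the "Building a MapReduce Version 2 program" section of Chapter 1), here is a minimal, self-contained sketch of a classic Hadoop MapReduce word count in Java. It is not taken from the book; the class name, job name, and HDFS paths are illustrative assumptions, but the mapper/reducer/driver structure is the standard MapReduce pattern the book works with.

// Minimal MapReduce word-count sketch (illustrative only, not from the book).
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every whitespace-separated token in the input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sum the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configure and submit the MapReduce job
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A hypothetical run against data already loaded into HDFS (paths are placeholders) might look like:

hadoop jar wordcount.jar WordCount /user/hue/input /user/hue/output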
Table of Contents

Hadoop Blueprints

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Why subscribe?

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Hadoop and Big Data

The beginning of the big data problem

Limitations of RDBMS systems

Scaling out a database on Google

Parallel processing of large datasets

Building open source Hadoop

Enterprise Hadoop

Social media and mobile channels

Data storage cost reduction

Enterprise software vendors

Pure Play Hadoop vendors

Cloud Hadoop vendors

The design of the Hadoop system

The Hadoop Distributed File System (HDFS)

Data organization in HDFS

HDFS file management commands

NameNode and DataNodes

Metadata store in NameNode

Preventing a single point of failure with Hadoop HA

Checkpointing process

Data Store on a DataNode

Handshakes and heartbeats

MapReduce

The execution model of MapReduce Version 1

Apache YARN

Building a MapReduce Version 2 program

Problem statement

Solution workflow

Getting the dataset

Studying the dataset

Cleaning the dataset

Loading the dataset on the HDFS

Starting with a MapReduce program

Installing Eclipse

Creating a project in Eclipse

Coding and building a MapReduce program

Run the MapReduce program locally

Examine the result

Run the MapReduce program on Hadoop

Further processing of results

Hadoop platform tools

Data ingestion tools

Data access tools

Monitoring tools

Data governance tools

Big data use cases

Creating a 360 degree view of a customer

Fraud detection systems for banks

Marketing campaign planning

Churn detection in telecom

Analyzing sensor data

Building a data lake

The architecture of Hadoop-based systems

Lambda architecture

Summary

2. A 360-Degree View of the Customer

Capturing business information

Collecting data from data sources

Creating a data processing approach

Presenting the results

Setting up the technology stack

Tools used

Installing Hortonworks Sandbox

Creating user accounts

Exploring HUE

Exploring MySQL and the Hive command line

Exploring Sqoop at the command line

Test driving Hive and Sqoop

Querying data using Hive

Importing data in Hive using Sqoop

Engineering the solution

Datasets

Loading customer master data into Hadoop

Loading web logs into Hadoop

Loading tweets into Hadoop

Creating the 360-degree view

Exporting data from Hadoop

Presenting the view

Building a web application

Installing Node.js

Coding the web application in Node.js

Summary

3. Building a Fraud Detection System

Understanding the business problem

Selecting and cleansing the dataset

Finding relevant fields

Machine learning for fraud detection

Clustering as an unsupervised machine learning method

Designing the high-level architecture

Introducing Apache Spark

Apache Spark architecture

Resilient Distributed Datasets

Transformation functions

Actions

Test driving Apache Spark

Calculating the yearly average stock prices using Spark

Apache Spark 2.X

Understanding MLlib

Test driving K-means using MLlib

Creating our fraud detection model

Building our K-means clustering model

Processing the data

Putting the fraud detection model to use

Generating a data stream

Processing the data stream using Spark streaming

Putting the model to use

Scaling the solution

Summary

4. Marketing Campaign Planning

Creating the solution outline

Supervised learning

Tree-structure models for classification

Finding the right dataset

Setting up the solution architecture

Coupon scan at POS

Join and transform

Train the classification model

Scoring

Mail merge

Building the machine learning model

Introducing BigML

Model building steps

Sign up as a user on BigML site

Upload the data file

Creating the dataset

Building the classification model

Downloading the classification model

Running the Model on Hadoop

Creating the target List

Post campaign activities

Summary

5. Churn Detection

A business case for churn detection

Creating the solution outline

Building a predictive model using Hadoop

Bayes' Theorem

Playing with the Bayesian predictor

Running a Node.js-based Bayesian predictor

Understanding the predictor code

Limitations of our solution

Building a churn predictor using Hadoop

Synthetic data generation tools

Preparing a synthetic historical churn dataset

The processing approach

Running the MapReduce program

Understanding the frequency counter code

Putting the model to use

Integrating the churn predictor

Summary

6. Analyze Sensor Data Using Hadoop

A business case for sensor data analytics

Creating the solution outline

Technology stack

Kafka

Flume

HDFS

Hive

Open TSDB

HBase

Grafana

Batch data analytics

Loading streams of sensor data from Kafka topics to HDFS

Using Hive to perform analytics on inserted data

Data visualization in MS Excel

Stream data analytics

Loading streams of sensor data

Data visualization using Grafana

Summary

7. Building a Data Lake

Data lake building blocks

Ingestion tier

Storage tier

Insights tier

Ops facilities

Limitation of open source Hadoop ecosystem tools

Hadoop security

HDFS permissions model

Fine-grained permissions with HDFS ACLs

Apache Ranger

Installing Apache Ranger

Test driving Apache Ranger

Define services and access policies

Examine the audit logs

Viewing users and groups in Ranger

Data Lake security with Apache Ranger

Apache Flume

Understanding the Design of Flume

Installing Apache Flume

Running Apache Flume

Apache Zeppelin

Installation of Apache Zeppelin

Test driving Zeppelin

Exploring data visualization features of Zeppelin

Define the gold price movement table in Hive

Load gold price history in the Table

Run a select query

Plot price change per month

Running the paragraph

Zeppelin in Data Lake

Technology stack for Data Lake

Data Lake business requirements

Understanding the business requirements

Understanding the IT systems and security

Designing the data pipeline

Building the data pipeline

Setting up the access control

Synchronizing the users and groups in Ranger

Setting up data access policies in Ranger

Restricting the access in Zeppelin

Testing our data pipeline

Scheduling the data loading

Refining the business requirements

Implementing the new requirements

Loading the stock holding data in Data Lake

Restricting the access to stock holding data in Data Lake

Testing the Loaded Data with Zeppelin

Adding stock feed in the Data Lake

Fetching data from Yahoo Service

Configuring Flume

Running Flume as Stock Feeder to Data Lake

Transforming the data in Data Lake

Growing Data Lake

Summary

8. Future Directions

Hadoop solutions team

The role of the data engineer

Data science for non-experts

From the data science model to business value

Hadoop on Cloud

Deploying Hadoop on cloud servers

Using Hadoop as a service

NoSQL databases

Types of NoSQL databases

Common observations about NoSQL databases

In-memory databases

Apache Ignite as an in-memory database

Apache Ignite as a Hadoop accelerator

Apache Spark versus Apache Ignite

Summary
