Hadoop Blueprints
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Hadoop and Big Data
The beginning of the big data problem
Limitations of RDBMS systems
Scaling out a database on Google
Parallel processing of large datasets
Building open source Hadoop
Enterprise Hadoop
Social media and mobile channels
Data storage cost reduction
Enterprise software vendors
Pure-play Hadoop vendors
Cloud Hadoop vendors
The design of the Hadoop system
The Hadoop Distributed File System (HDFS)
Data organization in HDFS
HDFS file management commands
NameNode and DataNodes
Metadata store in NameNode
Preventing a single point of failure with Hadoop HA
Checkpointing process
Data store on a DataNode
Handshakes and heartbeats
MapReduce
The execution model of MapReduce Version 1
Apache YARN
Building a MapReduce Version 2 program
Problem statement
Solution workflow
Getting the dataset
Studying the dataset
Cleaning the dataset
Loading the dataset into HDFS
Starting with a MapReduce program
Installing Eclipse
Creating a project in Eclipse
Coding and building a MapReduce program
Run the MapReduce program locally
Examine the result
Run the MapReduce program on Hadoop
Further processing of results
Hadoop platform tools
Data ingestion tools
Data access tools
Monitoring tools
Data governance tools
Big data use cases
Creating a 360-degree view of a customer
Fraud detection systems for banks
Marketing campaign planning
Churn detection in telecom
Analyzing sensor data
Building a data lake
The architecture of Hadoop-based systems
Lambda architecture
Summary
2. A 360-Degree View of the Customer
Capturing business information
Collecting data from data sources
Creating a data processing approach
Presenting the results
Setting up the technology stack
Tools used
Installing Hortonworks Sandbox
Creating user accounts
Exploring HUE
Exploring MySQL and the Hive command line
Exploring Sqoop at the command line
Test driving Hive and Sqoop
Querying data using Hive
Importing data into Hive using Sqoop
Engineering the solution
Datasets
Loading customer master data into Hadoop
Loading web logs into Hadoop
Loading tweets into Hadoop
Creating the 360-degree view
Exporting data from Hadoop
Presenting the view
Building a web application
Installing Node.js
Coding the web application in Node.js
Summary
3. Building a Fraud Detection System
Understanding the business problem
Selecting and cleansing the dataset
Finding relevant fields
Machine learning for fraud detection
Clustering as an unsupervised machine learning method
Designing the high-level architecture
Introducing Apache Spark
Apache Spark architecture
Resilient Distributed Datasets
Transformation functions
Actions
Test driving Apache Spark
Calculating the yearly average stock prices using Spark
Apache Spark 2.x
Understanding MLlib
Test driving K-means using MLlib
Creating our fraud detection model
Building our K-means clustering model
Processing the data
Putting the fraud detection model to use
Generating a data stream
Processing the data stream using Spark streaming
Putting the model to use
Scaling the solution
Summary
4. Marketing Campaign Planning
Creating the solution outline
Supervised learning
Tree-structure models for classification
Finding the right dataset
Setting up the solution architecture
Coupon scan at POS
Join and transform
Train the classification model
Scoring
Mail merge
Building the machine learning model
Introducing BigML
Model building steps
Sign up as a user on the BigML site
Upload the data file
Creating the dataset
Building the classification model
Downloading the classification model
Running the model on Hadoop
Creating the target list
Post campaign activities
Summary
5. Churn Detection
A business case for churn detection
Creating the solution outline
Building a predictive model using Hadoop
Bayes' Theorem
Playing with the Bayesian predictor
Running a Node.js-based Bayesian predictor
Understanding the predictor code
Limitations of our solution
Building a churn predictor using Hadoop
Synthetic data generation tools
Preparing a synthetic historical churn dataset
The processing approach
Running the MapReduce program
Understanding the frequency counter code
Putting the model to use
Integrating the churn predictor
Summary
6. Analyzing Sensor Data Using Hadoop
A business case for sensor data analytics
Creating the solution outline
Technology stack
Kafka
Flume
HDFS
Hive
OpenTSDB
HBase
Grafana
Batch data analytics
Loading streams of sensor data from Kafka topics to HDFS
Using Hive to perform analytics on inserted data
Data visualization in MS Excel
Stream data analytics
Loading streams of sensor data
Data visualization using Grafana
Summary
7. Building a Data Lake
Data lake building blocks
Ingestion tier
Storage tier
Insights tier
Ops facilities
Limitations of open source Hadoop ecosystem tools
Hadoop security
HDFS permissions model
Fine-grained permissions with HDFS ACLs
Apache Ranger
Installing Apache Ranger
Test driving Apache Ranger
Define services and access policies
Examine the audit logs
Viewing users and groups in Ranger
Data lake security with Apache Ranger
Apache Flume
Understanding the design of Flume
Installing Apache Flume
Running Apache Flume
Apache Zeppelin
Installing Apache Zeppelin
Test driving Zeppelin
Exploring data visualization features of Zeppelin
Define the gold price movement table in Hive
Load the gold price history into the table
Run a select query
Plot price change per month
Running the paragraph
Zeppelin in the data lake
Technology stack for the data lake
Data lake business requirements
Understanding the business requirements
Understanding the IT systems and security
Designing the data pipeline
Building the data pipeline
Setting up the access control
Synchronizing the users and groups in Ranger
Setting up data access policies in Ranger
Restricting access in Zeppelin
Testing our data pipeline
Scheduling the data loading
Refining the business requirements
Implementing the new requirements
Loading the stock holding data into the data lake
Restricting access to the stock holding data in the data lake
Testing the loaded data with Zeppelin
Adding a stock feed to the data lake
Fetching data from the Yahoo service
Configuring Flume
Running Flume as a stock feeder to the data lake
Transforming the data in the data lake
Growing the data lake
Summary
8. Future Directions
Hadoop solutions team
The role of the data engineer
Data science for non-experts
From the data science model to business value
Hadoop in the cloud
Deploying Hadoop on cloud servers
Using Hadoop as a service
NoSQL databases
Types of NoSQL databases
Common observations about NoSQL databases
In-memory databases
Apache Ignite as an in-memory database
Apache Ignite as a Hadoop accelerator
Apache Spark versus Apache Ignite
Summary