


Apache Mahout Essentials (eBook)

Price: ¥


Author: Jayani Withanawasam

Publisher: Packt Publishing


Word count: 943,000

Category: Imported Books > Foreign-Language Originals > Computers/Internet



If you are a Java developer or data scientist, haven't worked with Apache Mahout before, and want to get up to speed on implementing machine learning on big data, then this is the perfect guide for you.

Apache Mahout Essentials

Table of Contents

Apache Mahout Essentials


About the Author

About the Reviewers


Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders


What this book covers

What you need for this book

Who this book is for


Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book




1. Introducing Apache Mahout

Machine learning in a nutshell


Supervised learning versus unsupervised learning

Machine learning applications

Information retrieval


Market segmentation (clustering)

Stock market predictions (regression)

Health care

Using a mammogram for cancer tissue detection

Machine learning libraries

Open source or commercial


Languages used

Algorithm support

Batch processing versus stream processing

The story so far

Apache Mahout

Setting up Apache Mahout

How Apache Mahout works?

The high-level design

The distribution

From Hadoop MapReduce to Spark

Problems with Hadoop MapReduce

In-memory data processing with Spark and H2O

Why is Mahout shifting from Hadoop MapReduce to Spark?

When is it appropriate to use Apache Mahout?


2. Clustering

Unsupervised learning and clustering

Applications of clustering

Computer vision and image processing

Types of clustering

Hard clustering versus soft clustering

Flat clustering versus hierarchical clustering

Model-based clustering

K-Means clustering

Getting your hands dirty!

Running K-Means using Java programming

Data preparation

Understanding important parameters

Cluster visualization

Distance measure

Writing a custom distance measure

K-Means clustering with MapReduce

MapReduce in Apache Mahout

The map function

The reduce function

Additional clustering algorithms

Canopy clustering

Fuzzy K-Means

Streaming K-Means

The streaming step

The ball K-Means step

Spectral clustering

Dirichlet clustering

Text clustering

The vector space model and TF-IDF

N-grams and collocations

Preprocessing text with Lucene

Text clustering with the K-Means algorithm

Topic modeling

Optimizing clustering performance

Selecting the right features

Selecting the right algorithms

Selecting the right distance measure

Evaluating clusters

The initialization of centroids and the number of clusters

Tuning up parameters

The decision on infrastructure


3. Regression and Classification

Supervised learning

Target variables and predictor variables

Predictive analytics' techniques

Regression-based prediction

Model-based prediction

Tree-based prediction

Classification versus regression

Linear regression with Apache Spark

How does linear regression work?

A real-world example

The impact of smoking on mortality and different diseases

Linear regression with one variable and multiple variables

The integration of Apache Spark

Setting up Apache Spark with Apache Mahout

An example script

Distributed row matrix

An explanation of the code

Mahout references

The bias-variance trade-off

How to avoid over-fitting and under-fitting

Logistic regression with SGD

Logistic functions

Minimizing the cost function

Multinomial logistic regression versus binary logistic regression

A real-world example

An example script

Testing and evaluation

The confusion matrix

The area under the curve

The Naïve Bayes algorithm

The Bayes theorem

Text classification

Naïve assumption and its pros and cons in text classification

Improvements that Apache Mahout has made to the Naïve Bayes classification

A text classification coding example using the 20 newsgroups' example

Understand the 20 newsgroups' dataset

Text classification using Naïve Bayes – a MapReduce implementation with Hadoop

Text classification using Naïve Bayes – the Spark implementation

The Markov chain

Hidden Markov Model

A real-world example – developing a POS tagger using HMM supervised learning

POS tagging

HMM for POS tagging

HMM implementation in Apache Mahout

HMM supervised learning

The important parameters


The Baum-Welch algorithm

A code example

The important parameters

The Viterbi evaluator

The Apache Mahout references


4. Recommendations

Collaborative versus content-based filtering

Content-based filtering

Collaborative filtering

Hybrid filtering

User-based recommenders

A real-world example – movie recommendations

Data models

The similarity measure

The neighborhood


Evaluation techniques

The IR-based method (precision/recall)

Addressing the issues with inaccurate recommendation results

Item-based recommenders

Item-based recommenders with Spark

Matrix factorization-based recommenders

Alternating least squares

Singular value decomposition

Algorithm usage tips and tricks


5. Apache Mahout in Production


Apache Mahout with Hadoop

YARN with MapReduce 2.0

The resource manager

The application manager

A node manager

The application master


Managing storage with HDFS

The life cycle of a Hadoop application

Setting up Hadoop

Setting up Mahout in local mode


Java installation

Setting up Mahout in Hadoop distributed mode


Creating a Hadoop user

Passwordless SSH configuration

The pseudo-distributed mode

Configuration changes

Formatting the DFS filesystem

Starting the servers

The fully-distributed mode


Host file configuration

Hadoop configuration changes

Formatting the DFS filesystem

Starting servers

Monitoring Hadoop


Data nodes

Node managers

Web UIs

Setting up Mahout with Hadoop's fully-distributed mode

Troubleshooting Hadoop

Optimization tips


6. Visualization

The significance of visualization in machine learning


A visualization example for K-Means clustering







