Mastering Machine Learning with Spark 2.x (e-book)

Authors: Alex Tellez, Max Pumperla, Michal Malohlava

Publisher: Packt Publishing

Publication date: 2017-08-31

Word count: 446,000

Category: Imported books > Foreign-language originals > Computers/Internet

Unlock the complexities of machine learning algorithms in Spark to generate useful data insights through this data analysis tutorial.

About This Book

  • Process and analyze big data in a distributed and scalable way
  • Write sophisticated Spark pipelines that incorporate elaborate extraction
  • Build and use regression models to predict flight delays

Who This Book Is For

Are you a developer with a background in machine learning and statistics who feels limited by the current slow, "small data" machine learning tools? Then this is the book for you! In this book, you will create scalable machine learning applications to power a modern data-driven business using Spark. We assume that you already know the machine learning concepts and algorithms, have Spark up and running (whether on a cluster or locally), and have a basic knowledge of the various libraries contained in Spark.

What You Will Learn

  • Use Spark streams to cluster tweets online
  • Run the PageRank algorithm to compute user influence
  • Perform complex manipulation of DataFrames using Spark
  • Define Spark pipelines to compose individual data transformations
  • Utilize generated models for off-line/on-line prediction
  • Transfer the learning from an ensemble to a simpler neural network
  • Understand basic graph properties and important graph operations
  • Use GraphFrames, an extension of DataFrames to graphs, to study graphs using an elegant query language
  • Use the K-means algorithm to cluster a movie reviews dataset

In Detail

The purpose of machine learning is to build systems that learn from data. Being able to understand trends and patterns in complex data is critical to success; it is one of the key strategies for unlocking growth in today's challenging marketplace. With the meteoric rise of machine learning, developers are now keen to find out how they can make their Spark applications smarter. This book shows you how to transform data into actionable knowledge.

The book begins by defining the machine learning primitives provided by the MLlib and H2O libraries. You will learn how to use binary classification to detect the Higgs Boson particle in the huge amount of data produced by the CERN particle collider, and how to classify daily health activities using ensemble methods for multi-class classification. Next, you will solve a typical regression problem involving flight delay predictions and write sophisticated Spark pipelines. You will analyze Twitter data with the help of the doc2vec algorithm and K-means clustering. Finally, you will build different pattern mining models using MLlib, perform complex manipulation of DataFrames using Spark and Spark SQL, and deploy your app in a Spark streaming environment.

Style and approach

This book takes a practical approach to help you get to grips with using Spark for analytics and implementing machine learning algorithms. We'll teach you about advanced applications of machine learning through illustrative examples. These examples will equip you to harness the potential of machine learning, through Spark, in a variety of enterprise-grade systems.
Table of Contents

Title Page

Copyright

Mastering Machine Learning with Spark 2.x

Credits

About the Authors

About the Reviewer

www.PacktPub.com

Why subscribe?

Customer Feedback

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Introduction to Large-Scale Machine Learning and Spark

Data science

The sexiest role of the 21st century – data scientist?

A day in the life of a data scientist

Working with big data

The machine learning algorithm using a distributed environment

Splitting of data into multiple machines

From Hadoop MapReduce to Spark

What is Databricks?

Inside the box

Introducing H2O.ai

Design of Sparkling Water

What's the difference between H2O and Spark's MLlib?

Data munging

Data science - an iterative process

Summary

Detecting Dark Matter - The Higgs-Boson Particle

Type I versus type II error

Finding the Higgs-Boson particle

The LHC and data creation

The theory behind the Higgs-Boson

Measuring for the Higgs-Boson

The dataset

Spark start and data load

Labeled point vector

Data caching

Creating a training and testing set

What about cross-validation?

Our first model – decision tree

Gini versus Entropy

Next model – tree ensembles

Random forest model

Grid search

Gradient boosting machine

Last model - H2O deep learning

Build a 3-layer DNN

Adding more layers

Building models and inspecting results

Summary

Ensemble Methods for Multi-Class Classification

Data

Modeling goal

Challenges

Machine learning workflow

Starting Spark shell

Exploring data

Missing data

Summary of missing value analysis

Data unification

Missing values

Categorical values

Final transformation

Modelling data with Random Forest

Building a classification model using Spark RandomForest

Classification model evaluation

Spark model metrics

Building a classification model using H2O RandomForest

Summary

Predicting Movie Reviews Using NLP and Spark Streaming

NLP - a brief primer

The dataset

Dataset preparation

Feature extraction

Feature extraction method – bag-of-words model

Text tokenization

Declaring our stopwords list

Stemming and lemmatization

Featurization - feature hashing

Term Frequency - Inverse Document Frequency (TF-IDF) weighting scheme

Let's do some (model) training!

Spark decision tree model

Spark Naive Bayes model

Spark random forest model

Spark GBM model

Super-learner model

Super learner

Composing all transformations together

Using the super-learner model

Summary

Word2vec for Prediction and Clustering

Motivation of word vectors

Word2vec explained

What is a word vector?

The CBOW model

The skip-gram model

Fun with word vectors

Cosine similarity

Doc2vec explained

The distributed-memory model

The distributed bag-of-words model

Applying word2vec and exploring our data with vectors

Creating document vectors

Supervised learning task

Summary

Extracting Patterns from Clickstream Data

Frequent pattern mining

Pattern mining terminology

Frequent pattern mining problem

The association rule mining problem

The sequential pattern mining problem

Pattern mining with Spark MLlib

Frequent pattern mining with FP-growth

Association rule mining

Sequential pattern mining with prefix span

Pattern mining on MSNBC clickstream data

Deploying a pattern mining application

The Spark Streaming module

Summary

Graph Analytics with GraphX

Basic graph theory

Graphs

Directed and undirected graphs

Order and degree

Directed acyclic graphs

Connected components

Trees

Multigraphs

Property graphs

GraphX distributed graph processing engine

Graph representation in GraphX

Graph properties and operations

Building and loading graphs

Visualizing graphs with Gephi

Gephi

Creating GEXF files from GraphX graphs

Advanced graph processing

Aggregating messages

Pregel

GraphFrames

Graph algorithms and applications

Clustering

Vertex importance

GraphX in context

Summary

Lending Club Loan Prediction

Motivation

Goal

Data

Data dictionary

Preparation of the environment

Data load

Exploration – data analysis

Basic clean up

Useless columns

String columns

Loan progress columns

Categorical columns

Text columns

Missing data

Prediction targets

Loan status model

Base model

The emp_title column transformation

The desc column transformation

Interest Rate Model

Using models for scoring

Model deployment

Stream creation

Stream transformation

Stream output

Summary
