
Applied Supervised Learning with R (eBook)


Author: Karthik Ramasubramanian

Publisher: Packt Publishing

Publication date: 2019-05-31

Explore supervised machine learning with R by studying popular real-world use cases such as object detection in driverless cars, customer churn, and default prediction.

Key Features

  • Study supervised learning algorithms by using real-world datasets
  • Fine-tune optimal parameters with hyperparameter optimization
  • Select the best algorithm using the model evaluation framework

Book Description

R provides excellent visualization features that are essential for exploring data before using it in automated learning. Applied Supervised Learning with R covers the complete process of employing R to develop applications using supervised machine learning algorithms for your business needs.

The book starts by helping you develop your analytical thinking to create a problem statement using business inputs and domain research. You will then learn different evaluation metrics that compare various algorithms, and later progress to using these metrics to select the best algorithm for your problem. After finalizing the algorithm you want to use, you will study hyperparameter optimization techniques to fine-tune your set of optimal parameters. To prevent you from overfitting your model, a dedicated section demonstrates how you can add various regularization terms.

By the end of this book, you will have the advanced skills you need to model a supervised machine learning algorithm that precisely fulfills your business needs.

What You Will Learn

  • Develop analytical thinking to precisely identify a business problem
  • Wrangle data with dplyr, tidyr, and reshape2
  • Visualize data with ggplot2
  • Validate your supervised machine learning model using k-fold cross-validation
  • Optimize hyperparameters with grid and random search, and Bayesian optimization
  • Deploy your model on Amazon Web Services (AWS) Lambda with plumber
  • Improve your model's performance with feature selection and dimensionality reduction

Who This Book Is For

This book is designed for novice and intermediate-level data analysts, data scientists, and data engineers who want to explore different methods of supervised machine learning and its various use cases. Some background in statistics, probability, calculus, linear algebra, and programming will help you thoroughly understand and follow the content of this book.
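The workflow the description outlines (split the data, train a model, evaluate it with a metric) can be sketched in a few lines of base R. This is a minimal illustration only: it uses the built-in mtcars dataset and a glm logistic regression as stand-ins for the book's own datasets and classifiers.

```r
# Minimal supervised-learning sketch: train/test split, fit, evaluate.
# Uses the built-in mtcars dataset as a hypothetical stand-in.
set.seed(42)

data(mtcars)

# 70/30 train/test split; the binary target is `am` (transmission type)
idx   <- sample(seq_len(nrow(mtcars)), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Logistic regression, one of the classifiers covered in Chapters 3 and 5
model <- glm(am ~ mpg + wt, data = train, family = binomial)

# Accuracy: one of the confusion-matrix-based metrics from Chapter 3
pred     <- ifelse(predict(model, test, type = "response") > 0.5, 1, 0)
accuracy <- mean(pred == test$am)
print(accuracy)
```

The same pattern generalizes to the models in later chapters: swap `glm` for `rpart`, `randomForest`, or `xgboost`, and swap accuracy for whichever metric suits the problem.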
Table of Contents

Preface

About the Book

About the Authors

Learning Objectives

Audience

Approach

Minimum Hardware Requirements

Software Requirements

Conventions

Installation and Setup

Installing the Code Bundle

Additional Resources

Chapter 1:

R for Advanced Analytics

Introduction

Working with Real-World Datasets

Exercise 1: Using the unzip Method for Unzipping a Downloaded File

Reading Data from Various Data Formats

CSV Files

Exercise 2: Reading a CSV File and Summarizing its Column

JSON

Exercise 3: Reading a JSON file and Storing the Data in DataFrame

Text

Exercise 4: Reading a CSV File with Text Column and Storing the Data in VCorpus

Write R Markdown Files for Code Reproducibility

Activity 1: Create an R Markdown File to Read a CSV File and Write a Summary of Data

Data Structures in R

Vector

Matrix

Exercise 5: Performing Transformation on the Data to Make it Available for the Analysis

List

Exercise 6: Using the List Method for Storing Integers and Characters Together

Activity 2: Create a List of Two Matrices and Access the Values

DataFrame

Exercise 7: Performing Integrity Checks Using DataFrame

Data Table

Exercise 8: Exploring the File Read Operation

Data Processing and Transformation

cbind

Exercise 9: Exploring the cbind Function

rbind

Exercise 10: Exploring the rbind Function

The merge Function

Exercise 11: Exploring the merge Function

Inner Join

Left Join

Right Join

Full Join

The reshape Function

Exercise 12: Exploring the reshape Function

The aggregate Function

The Apply Family of Functions

The apply Function

Exercise 13: Implementing the apply Function

The lapply Function

Exercise 14: Implementing the lapply Function

The sapply Function

The tapply Function

Useful Packages

The dplyr Package

Exercise 15: Implementing the dplyr Package

The tidyr Package

Exercise 16: Implementing the tidyr Package

Activity 3: Create a DataFrame with Five Summary Statistics for All Numeric Variables from Bank Data Using dplyr and tidyr

The plyr Package

Exercise 17: Exploring the plyr Package

The caret Package

Data Visualization

Scatterplot

Scatter Plot between Age and Balance split by Marital Status

Line Charts

Histogram

Boxplot

Summary

Chapter 2:

Exploratory Analysis of Data

Introduction

Defining the Problem Statement

Problem-Designing Artifacts

Understanding the Science Behind EDA

Exploratory Data Analysis

Exercise 18: Studying the Data Dimensions

Univariate Analysis

Exploring Numeric/Continuous Features

Exercise 19: Visualizing Data Using a Box Plot

Exercise 20: Visualizing Data Using a Histogram

Exercise 21: Visualizing Data Using a Density Plot

Exercise 22: Visualizing Multiple Variables Using a Histogram

Activity 4: Plotting Multiple Density Plots and Boxplots

Exercise 23: Plotting a Histogram for the nr.employed, euribor3m, cons.conf.idx, and duration Variables

Exploring Categorical Features

Exercise 24: Exploring Categorical Features

Exercise 25: Exploring Categorical Features Using a Bar Chart

Exercise 26: Exploring Categorical Features Using a Pie Chart

Exercise 27: Automate Plotting Categorical Variables

Exercise 28: Automate Plotting for the Remaining Categorical Variables

Exercise 29: Exploring the Last Remaining Categorical Variable and the Target Variable

Bivariate Analysis

Studying the Relationship between Two Numeric Variables

Exercise 30: Studying the Relationship between Employee Variance Rate and Number of Employees

Studying the Relationship between a Categorical and a Numeric Variable

Exercise 31: Studying the Relationship between the y and age Variables

Exercise 32: Studying the Relationship between the Average Value and the y Variable

Exercise 33: Studying the Relationship between the cons.price.idx, cons.conf.idx, euribor3m, and nr.employed Variables

Studying the Relationship Between Two Categorical Variables

Exercise 34: Studying the Relationship Between the Target y and marital status Variables

Exercise 35: Studying the Relationship between the job and education Variables

Multivariate Analysis

Validating Insights Using Statistical Tests

Categorical Dependent and Numeric/Continuous Independent Variables

Exercise 36: Hypothesis 1 Testing for Categorical Dependent Variables and Continuous Independent Variables

Exercise 37: Hypothesis 2 Testing for Categorical Dependent Variables and Continuous Independent Variables

Categorical Dependent and Categorical Independent Variables

Exercise 38: Hypothesis 3 Testing for Categorical Dependent Variables and Categorical Independent Variables

Exercise 39: Hypothesis 4 and 5 Testing for a Categorical Dependent Variable and a Categorical Independent Variable

Collating Insights – Refine the Solution to the Problem

Summary

Chapter 3:

Introduction to Supervised Learning

Introduction

Summary of the Beijing PM2.5 Dataset

Exercise 40: Exploring the Data

Regression and Classification Problems

Machine Learning Workflow

Design the Problem

Source and Prepare Data

Code the Model

Train and Evaluate

Exercise 41: Creating a Train-and-Test Dataset Randomly Generated by the Beijing PM2.5 Dataset

Deploy the Model

Regression

Simple and Multiple Linear Regression

Assumptions in Linear Regression Models

Exploratory Data Analysis (EDA)

Exercise 42: Exploring the Time Series Views of the PM2.5, DEWP, TEMP, and PRES Variables of the Beijing PM2.5 Dataset

Exercise 43: Undertaking Correlation Analysis

Exercise 44: Drawing a Scatterplot to Explore the Relationship between PM2.5 Levels and Other Factors

Activity 5: Draw a Scatterplot between PRES and PM2.5 Split by Months

Model Building

Exercise 45: Exploring Simple and Multiple Regression Models

Model Interpretation

Classification

Logistic Regression

A Brief Introduction

Mechanics of Logistic Regression

Model Building

Exercise 46: Storing the Rolling 3-Hour Average in the Beijing PM2.5 Dataset

Activity 6: Transforming Variables and Deriving New Variables to Build a Model

Interpreting a Model

Evaluation Metrics

Mean Absolute Error (MAE)

Root Mean Squared Error (RMSE)

R-squared

Adjusted R-square

Mean Reciprocal Rank (MRR)

Exercise 47: Finding Evaluation Metrics

Confusion Matrix-Based Metrics

Accuracy

Sensitivity

Specificity

F1 Score

Exercise 48: Working with Model Evaluation on Training Data

Receiver Operating Characteristic (ROC) Curve

Exercise 49: Creating an ROC Curve

Summary

Chapter 4:

Regression

Introduction

Linear Regression

Exercise 50: Print the Coefficient and Residual Values Using the multiple_PM_25_linear_model Object

Activity 7: Printing Various Attributes Using Model Object Without Using the Summary Function

Exercise 51: Add the Interaction Term DEWP:TEMP:month in the lm() Function

Model Diagnostics

Exercise 52: Generating and Fitting Models Using the Linear and Quadratic Equations

Residual versus Fitted Plot

Normal Q-Q Plot

Scale-Location Plot

Residual versus Leverage

Improving the Model

Transform the Predictor or Target Variable

Choose a Non-Linear Model

Remove an Outlier or Influential Point

Adding the Interaction Effect

Quantile Regression

Exercise 53: Fit a Quantile Regression on the Beijing PM2.5 Dataset

Exercise 54: Plotting Various Quantiles with More Granularity

Polynomial Regression

Exercise 55: Performing Uniform Distribution Using the runif() Function

Ridge Regression

Regularization Term – L2 Norm

Exercise 56: Ridge Regression on the Beijing PM2.5 Dataset

LASSO Regression

Exercise 57: LASSO Regression

Elastic Net Regression

Exercise 58: Elastic Net Regression

Comparison between Coefficients and Residual Standard Error

Exercise 59: Computing the RSE of Linear, Ridge, LASSO, and Elastic Net Regressions

Poisson Regression

Exercise 60: Performing Poisson Regression

Exercise 61: Computing Overdispersion

Cox Proportional-Hazards Regression Model

NCCTG Lung Cancer Data

Exercise 62: Exploring the NCCTG Lung Cancer Data Using Cox-Regression

Summary

Chapter 5:

Classification

Introduction

Getting Started with the Use Case

Some Background on the Use Case

Defining the Problem Statement

Data Gathering

Exercise 63: Exploring Data for the Use Case

Exercise 64: Calculating the Null Value Percentage in All Columns

Exercise 65: Removing Null Values from the Dataset

Exercise 66: Engineer Time-Based Features from the Date Variable

Exercise 67: Exploring the Location Frequency

Exercise 68: Engineering the New Location with Reduced Levels

Classification Techniques for Supervised Learning

Logistic Regression

How Does Logistic Regression Work?

Exercise 69: Build a Logistic Regression Model

Interpreting the Results of Logistic Regression

Evaluating Classification Models

Confusion Matrix and Its Derived Metrics

What Metric Should You Choose?

Evaluating Logistic Regression

Exercise 70: Evaluate a Logistic Regression Model

Exercise 71: Develop a Logistic Regression Model with All of the Independent Variables Available in Our Use Case

Activity 8: Building a Logistic Regression Model with Additional Features

Decision Trees

How Do Decision Trees Work?

Exercise 72: Create a Decision Tree Model in R

Activity 9: Create a Decision Tree Model with Additional Control Parameters

Ensemble Modelling

Random Forest

Why Are Ensemble Models Used?

Bagging – Predecessor to Random Forest

How Does Random Forest Work?

Exercise 73: Building a Random Forest Model in R

Activity 10: Build a Random Forest Model with a Greater Number of Trees

XGBoost

How Does the Boosting Process Work?

What Are Some Popular Boosting Techniques?

How Does XGBoost Work?

Implementing XGBoost in R

Exercise 74: Building an XGBoost Model in R

Exercise 75: Improving the XGBoost Model's Performance

Deep Neural Networks

A Deeper Look into Deep Neural Networks

How Does the Deep Learning Model Work?

What Framework Do We Use for Deep Learning Models?

Building a Deep Neural Network in Keras

Exercise 76: Build a Deep Neural Network in R using R Keras

Choosing the Right Model for Your Use Case

Summary

Chapter 6:

Feature Selection and Dimensionality Reduction

Introduction

Feature Engineering

Discretization

Exercise 77: Performing Binary Discretization

Multi-Category Discretization

Exercise 78: Demonstrating the Use of Quantile Function

One-Hot Encoding

Exercise 79: Using One-Hot Encoding

Activity 11: Converting the CBWD Feature of the Beijing PM2.5 Dataset into One-Hot Encoded Columns

Log Transformation

Exercise 80: Performing Log Transformation

Feature Selection

Univariate Feature Selection

Exercise 81: Exploring Chi-Squared

Highly Correlated Variables

Exercise 82: Plotting a Correlated Matrix

Model-Based Feature Importance Ranking

Exercise 83: Exploring RFE Using RF

Exercise 84: Exploring the Variable Importance using the Random Forest Model

Feature Reduction

Principal Component Analysis (PCA)

Exercise 85: Performing PCA

Variable Clustering

Exercise 86: Using Variable Clustering

Linear Discriminant Analysis for Feature Reduction

Exercise 87: Exploring LDA

Summary

Chapter 7:

Model Improvements

Introduction

Bias-Variance Trade-off

What is Bias and Variance in Machine Learning Models?

Underfitting and Overfitting

Defining a Sample Use Case

Exercise 88: Loading and Exploring Data

Cross-Validation

Holdout Approach/Validation

Exercise 89: Performing Model Assessment Using Holdout Validation

K-Fold Cross-Validation

Exercise 90: Performing Model Assessment Using K-Fold Cross-Validation

Hold-One-Out Validation

Exercise 91: Performing Model Assessment Using Hold-One-Out Validation

Hyperparameter Optimization

Grid Search Optimization

Exercise 92: Performing Grid Search Optimization – Random Forest

Exercise 93: Grid Search Optimization – XGBoost

Random Search Optimization

Exercise 94: Using Random Search Optimization on a Random Forest Model

Exercise 95: Random Search Optimization – XGBoost

Bayesian Optimization

Exercise 96: Performing Bayesian Optimization on the Random Forest Model

Exercise 97: Performing Bayesian Optimization using XGBoost

Activity 12: Performing Repeated K-Fold Cross-Validation and Grid Search Optimization

Summary

Chapter 8:

Model Deployment

Introduction

What is an API?

Introduction to plumber

Exercise 98: Developing an ML Model and Deploying It as a Web Service Using Plumber

Challenges in Deploying Models with plumber

A Brief History of the Pre-Docker Era

Docker

Deploying the ML Model Using Docker and plumber

Exercise 99: Create a Docker Container for the R plumber Application

Disadvantages of Using plumber to Deploy R Models

Amazon Web Services

Introducing AWS SageMaker

Deploying an ML Model Endpoint Using SageMaker

Exercise 100: Deploy the ML Model as a SageMaker Endpoint

What is Amazon Lambda?

What is Amazon API Gateway?

Building Serverless ML Applications

Exercise 101: Building a Serverless Application Using API Gateway, AWS Lambda, and SageMaker

Deleting All Cloud Resources to Stop Billing

Activity 13: Deploy an R Model Using plumber

Summary

Chapter 9:

Capstone Project - Based on Research Papers

Introduction

Exploring Research Work

The mlr Package

OpenML Package

Problem Design from the Research Paper

Features in Scene Dataset

Implementing Multilabel Classifier Using the mlr and OpenML Packages

Exercise 102: Downloading the Scene Dataset from OpenML

Constructing a Learner

Adaptation Methods

Transformation Methods

Binary Relevance Method

Classifier Chains Method

Nested Stacking

Dependent Binary Relevance Method

Stacking

Exercise 103: Generating Decision Tree Model Using the classif.rpart Method

Train the Model

Exercise 104: Train the Model

Predicting the Output

Performance of the Model

Resampling the Data

Binary Performance for Each Label

Benchmarking Model

Conducting Benchmark Experiments

Exercise 105: Exploring How to Conduct a Benchmarking on Various Learners

Accessing Benchmark Results

Learner Performances

Predictions

Learners and Measures

Activity 14: Getting the Binary Performance Step with classif.C50 Learner Instead of classif.rpart

Working with OpenML Upload Functions

Summary

Appendix

Chapter 1: R for Advanced Analytics

Activity 1: Create an R Markdown File to Read a CSV File and Write a Summary of Data

Activity 2: Create a List of Two Matrices and Access the Values

Activity 3: Create a DataFrame with Five Summary Statistics for All Numeric Variables from Bank Data Using dplyr and tidyr

Chapter 2: Exploratory Analysis of Data

Activity 4: Plotting Multiple Density Plots and Boxplots

Chapter 3: Introduction to Supervised Learning

Activity 5: Draw a Scatterplot between PRES and PM2.5 Split by Months

Activity 6: Transforming Variables and Deriving New Variables to Build a Model

Chapter 4: Regression

Activity 7: Printing Various Attributes Using Model Object Without Using the summary Function

Chapter 5: Classification

Activity 8: Building a Logistic Regression Model with Additional Features

Activity 9: Create a Decision Tree Model with Additional Control Parameters

Activity 10: Build a Random Forest Model with a Greater Number of Trees

Chapter 6: Feature Selection and Dimensionality Reduction

Activity 11: Converting the CBWD Feature of the Beijing PM2.5 Dataset into One-Hot Encoded Columns

Chapter 7: Model Improvements

Activity 12: Perform Repeated K-Fold Cross-Validation and Grid Search Optimization

Chapter 8: Model Deployment

Activity 13: Deploy an R Model Using Plumber

Chapter 9: Capstone Project - Based on Research Papers

Activity 14: Getting the Binary Performance Step with classif.C50 Learner Instead of classif.rpart
