
Practical Predictive Analytics (ebook)


Author: Ralph Winters

Publisher: Packt Publishing

Publication date: 2017-07-07

Word count: 638,000

Make sense of your data and predict the unpredictable.

About This Book

  • A unique book centered on the six key practical skills needed to develop and implement predictive analytics
  • Apply the principles and techniques of predictive analytics to effectively interpret big data
  • Solve real-world analytical problems with the help of practical case studies and real-world scenarios drawn from healthcare, marketing, and other business domains

Who This Book Is For

This book is for readers with a mathematics/statistics background who wish to understand the concepts, techniques, and implementation of predictive analytics to resolve complex analytical issues. Basic familiarity with the R programming language is expected.

What You Will Learn

  • Master the core predictive analytics algorithms used in business today
  • Implement the six steps of a successful analytics project
  • Select the right algorithm for your requirements
  • Apply predictive analytics to research problems in healthcare
  • Implement predictive analytics to retain and acquire customers
  • Use text mining to understand unstructured data
  • Develop models on your own PC or in Spark/Hadoop environments
  • Implement predictive analytics products for customers

In Detail

This is the go-to book for anyone interested in the steps needed to develop predictive analytics solutions, with examples drawn from marketing, healthcare, and retail. We start with a brief history of predictive analytics and the different roles and functions people play within a predictive analytics project. We then cover the various ways of installing R, along with their pros and cons, a step-by-step installation of RStudio, and a description of best practices for organizing your projects. With installation complete, we begin to acquire the skills necessary to input, clean, and prepare data for modeling.

We then learn the six specific steps needed to implement and successfully deploy a predictive model, starting with asking the right questions, continuing through model development, and ending with deploying the model into production. We learn why collaboration is important and how agile, iterative modeling cycles can increase your chances of developing and deploying the most successful model. We continue the journey into the cloud by learning about Databricks and SparkR, which allow you to develop predictive models on many gigabytes of data.

Style and Approach

This book takes a practical, hands-on approach in which the algorithms are explained with the help of real-world use cases. It is written in a well-researched style that mixes theoretical and practical information. Code examples are supplied both for the theoretical concepts and for the case studies. Key references and summaries are provided at the end of each chapter so that you can explore those topics on your own.
Table of Contents

Title Page

Copyright

Practical Predictive Analytics

Credits

About the Author

About the Reviewers

www.PacktPub.com

Why subscribe?

Customer Feedback

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Getting Started with Predictive Analytics

Predictive analytics are in so many industries

Predictive Analytics in marketing

Predictive Analytics in healthcare

Predictive Analytics in other industries

Skills and roles that are important in Predictive Analytics

Related job skills and terms

Predictive analytics software

Open source software

Closed source software

Peaceful coexistence

Other helpful tools

Past the basics

Data analytics/research

Data engineering

Management

Team data science

Two different ways to look at predictive analytics

R

CRAN

R installation

Alternate ways of exploring R

How is a predictive analytics project organized?

Setting up your project and subfolders

GUIs

Getting started with RStudio

Rearranging the layout to correspond with the examples

Brief description of some important panes

Creating a new project

The R console

The source window

Creating a new script

Our first predictive model

Code description

Saving the script

Your second script

Code description

The predict function

Examining the prediction errors

R packages

The stargazer package

Installing stargazer package

Code description

Saving your work

References

Summary

The Modeling Process

Advantages of a structured approach

Ways in which structured methodologies can help

Analytic process methodologies

CRISP-DM and SEMMA

CRISP-DM and SEMMA chart

Agile processes

Six sigma and root cause

To sample or not to sample?

Using all of the data

Comparing a sample to the population

An analytics methodology outline – specific steps

Step 1 business understanding

Communicating business goals – the feedback loop

Internal data

External data

Tools of the trade

Process understanding

Data lineage

Data dictionaries

SQL

Example – Using SQL to get sales by region

Charts and plots

Spreadsheets

Simulation

Example – simulating if a customer contact will yield a sale

Example – simulating customer service calls

Step 2 data understanding

Levels of measurement

Nominal data

Ordinal data

Interval data

Ratio data

Converting from the different levels of measurement

Dependent and independent variables

Transformed variables

Single variable analysis

Summary statistics

Bivariate analysis

Types of questions that bivariate analysis can answer

Quantitative with quantitative variables

Code example

Nominal with nominal variables

Cross-tabulations

Mosaic plots

Nominal with quantitative variables

Point biserial correlation

Step 3 data preparation

Step 4 modeling

Description of specific models

Poisson (counts)

Logistic regression

Support vector machines (SVM)

Decision trees

Random forests

Example - comparing single decision trees to a random forest

An age decision tree

An alternative decision tree

The random forest model

Random forest versus decision trees

Variable importance plots

Dimension reduction techniques

Principal components

Clustering

Time series models

Naive Bayes classifier

Text mining techniques

Step 5 evaluation

Model validation

Area under the curve

Computing an ROC curve using the titanic dataset

In sample/out of sample tests, walk forward tests

Training/test/validation datasets

Time series validation

Benchmark against best champion model

Expert opinions: man against machine

Meta-analysis

Dart board method

Step 6 deployment

Model scoring

References

Notes

Summary

Inputting and Exploring Data

Data input

Text file Input

The read.table function

Database tables

Spreadsheet files

XML and JSON data

Generating your own data

Tips for dealing with large files

Data munging and wrangling

Joining data

Using the sqldf function

Housekeeping and loading of necessary packages

Generating the data

Examining the metadata

Merging data using Inner and Outer joins

Identifying members with multiple purchases

Eliminating duplicate records

Exploring the hospital dataset

Output from the str(df) function

Output from the View function

The colnames function

The summary function

Sending the output to an HTML file

Open the file in the browser

Plotting the distributions

Visual plotting of the variables

Breaking out summaries by groups

Standardizing data

Changing a variable to another type

Appending the variables to the existing dataframe

Extracting a subset

Transposing a dataframe

Dummy variable coding

Binning – numeric and character

Binning character data

Missing values

Setting up the missing values test dataset

The various types of missing data

Missing Completely at Random (MCAR)

Testing for MCAR

Missing at Random (MAR)

Not Missing at Random (NMAR)

Correcting for missing values

Listwise deletion

Imputation methods

Imputing missing values using the 'mice' package

Running a regression with imputed values

Imputing categorical variables

Outliers

Why outliers are important

Detecting outliers

Transforming the data

Tracking down the cause of the outliers

Ways to deal with outliers

Example – setting the outliers to NA

Multivariate outliers

Data transformations

Generating the test data

The Box-Cox Transform

Variable reduction/variable importance

Principal Components Analysis (PCA)

Where is PCA used?

A PCA example – US Arrests

All subsets regression

An example – airquality

Adjusted R-square plot

Variable importance

Variable influence plot

References

Summary

Introduction to Regression Algorithms

Supervised versus unsupervised learning models

Supervised learning models

Unsupervised learning models

Regression techniques

Advantages of regression

Generalized linear models

Linear regression using GLM

Logistic regression

The odds ratio

The logistic regression coefficients

Example - using logistic regression in health care to predict pain thresholds

Reading the data

Obtaining some basic counts

Saving your data

Fitting a GLM model

Examining the residuals

Residual plots

Added variable plots

Outliers in the regression

P-values and effect size

P-values and effect sizes

Variable selection

Interactions

Goodness of fit statistics

McFadden statistic

Confidence intervals and Wald statistics

Basic regression diagnostic plots

Description of the plots

An interactive game – guessing if the residuals are random

Goodness of fit – Hosmer-Lemeshow test

Goodness of fit example on the PainGLM data

Regularization

An example – ElasticNet

Choosing a correct lambda

Printing out the possible coefficients based on lambda

Summary

Introduction to Decision Trees, Clustering, and SVM

Decision tree algorithms

Advantages of decision trees

Disadvantages of decision trees

Basic decision tree concepts

Growing the tree

Impurity

Controlling the growth of the tree

Types of decision tree algorithms

Examining the target variable

Using formula notation in an rpart model

Interpretation of the plot

Printing a text version of the decision tree

The ctree algorithm

Pruning

Other options to render decision trees

Cluster analysis

Clustering is used in diverse industries

What is a cluster?

Types of clustering

Partitional clustering

K-means clustering

The k-means algorithm

Measuring distance between clusters

Clustering example using k-means

Cluster elbow plot

Extracting the cluster assignments

Graphically displaying the clusters

Cluster plots

Generating the cluster plot

Hierarchical clustering

Examining some examples from cluster 1

Examining some examples from cluster 2

Examining some examples from cluster 3

Support vector machines

Simple illustration of a mapping function

Analyzing consumer complaints data using SVM

Converting unstructured to structured data

References

Summary

Using Survival Analysis to Predict and Analyze Customer Churn

What is survival analysis?

Time-dependent data

Censoring

Left censoring

Right censoring

Our customer satisfaction dataset

Generating the data using probability functions

Creating the churn and no churn dataframes

Creating and verifying the new simulated variables

Recombining the churner and non-churners

Creating matrix plots

Partitioning into training and test data

Setting the stage by creating survival objects

Examining survival curves

Better plots

Contrasting survival curves

Testing for the gender difference between survival curves

Testing for the educational differences between survival curves

Plotting the customer satisfaction and number of service call curves

Improving the education survival curve by adding gender

Transforming service calls to a binary variable

Testing the difference between customers who called and those who did not

Cox regression modeling

Our first model

Examining the cox regression output

Proportional hazards test

Proportional hazard plots

Obtaining the cox survival curves

Plotting the curve

Partial regression plots

Examining subset survival curves

Comparing gender differences

Comparing customer satisfaction differences

Validating the model

Computing baseline estimates

Running the predict() function

Predicting the outcome at time 6

Determining concordance

Time-based variables

Changing the data to reflect the second survey

How survSplit works

Adjusting records to simulate an intervention

Running the time-based model

Comparing the models

Variable selection

Incorporating interaction terms

Displaying the formulas sublist

Comparing AIC among the candidate models

Summary

Using Market Basket Analysis as a Recommender Engine

What is market basket analysis?

Examining the groceries transaction file

Format of the groceries transaction Files

The sample market basket

Association rule algorithms

Antecedents and descendants

Evaluating the accuracy of a rule

Support

Calculating support

Examples

Confidence

Lift

Evaluating lift

Preparing the raw data file for analysis

Reading the transaction file

capture.output function

Analyzing the input file

Analyzing the invoice dates

Plotting the dates

Scrubbing and cleaning the data

Removing unneeded character spaces

Simplifying the descriptions

Removing colors automatically

The colors() function

Cleaning up the colors

Filtering out single item transactions

Looking at the distributions

Merging the results back into the original data

Compressing descriptions using camelcase

Custom function to map to camelcase

Extracting the last word

Creating the test and training datasets

Saving the results

Loading the analytics file

Determining the consequent rules

Replacing missing values

Making the final subset

Creating the market basket transaction file

Method one – Coercing a dataframe to a transaction file

Inspecting the transaction file

Obtaining the topN purchased items

Finding the association rules

Examining the rules summary

Examining the rules quality and observing the highest support

Confidence and lift measures

Filtering a large number of rules

Generating many rules

Plotting many rules

Method two – Creating a physical transactions file

Reading the transaction file back in

Plotting the rules

Creating subsets of the rules

Text clustering

Converting to a document term matrix

Removing sparse terms

Finding frequent terms

K-means clustering of terms

Examining cluster 1

Examining cluster 2

Examining cluster 3

Examining cluster 4

Examining cluster 5

Predicting cluster assignments

Using flexclust to predict cluster assignment

Running k-means to generate the clusters

Creating the test DTM

Running the apriori algorithm on the clusters

Summarizing the metrics

References

Summary

Exploring Health Care Enrollment Data as a Time Series

Time series data

Exploring time series data

Health insurance coverage dataset

Housekeeping

Read the data in

Subsetting the columns

Description of the data

Target time series variable

Saving the data

Determining all of the subset groups

Merging the aggregate data back into the original data

Checking the time intervals

Picking out the top groups in terms of average population size

Plotting the data using lattice

Plotting the data using ggplot

Sending output to an external file

Examining the output

Detecting linear trends

Automating the regressions

Ranking the coefficients

Merging scores back into the original dataframe

Plotting the data with the trend lines

Plotting all the categories on one graph

Adding labels

Performing some automated forecasting using the ets function

Converting the dataframe to a time series object

Smoothing the data using moving averages

Simple moving average

Computing the SMA using a function

Verifying the SMA calculation

Exponential moving average

Computing the EMA using a function

Selecting a smoothing factor

Using the ets function

Forecasting using ALL AGES

Plotting the predicted and actual values

The forecast (fit) method

Plotting future values with confidence bands

Modifying the model to include a trend component

Running the ets function iteratively over all of the categories

Accuracy measures produced by onestep

Comparing the Test and Training for the "UNDER 18 YEARS" group

Accuracy measures

References

Summary

Introduction to Spark Using R

About Spark

Spark environments

Cluster computing

Parallel computing

SparkR

Dataframes

Building our first Spark dataframe

Simulation

Importing the sample notebook

Notebook format

Creating a new notebook

Becoming large by starting small

The Pima Indians diabetes dataset

Running the code

Running the initialization code

Extracting the Pima Indians diabetes dataset

Examining the output

Output from the str() function

Output from the summary() function

Comparing outcomes

Checking for missing values

Imputing the missing values

Checking the imputations (reader exercise)

Missing values complete!

Calculating the correlation matrices

Calculating the column means

Simulating the data

Which correlations to use?

Checking the object type

Simulating the negative cases

Concatenating the positive and negative cases into a single Spark dataframe

Running summary statistics

Saving your work

Summary

Exploring Large Datasets Using Spark

Performing some exploratory analysis on positives

Displaying the contents of a Spark dataframe

Graphing using native graph features

Running pairwise correlations directly on a Spark dataframe

Cleaning up and caching the table in memory

Some useful Spark functions to explore your data

Count and groupby

Covariance and correlation functions

Creating new columns

Constructing a cross-tab

Contrasting histograms

Plotting using ggplot

Spark SQL

Registering tables

Issuing SQL through the R interface

Using SQL to examine potential outliers

Creating some aggregates

Picking out some potential outliers using a third query

Changing to the SQL API

SQL – computing a new column using the Case statement

Evaluating outcomes based upon the Age segment

Computing mean values for all of the variables

Exporting data from Spark back into R

Running local R packages

Using the pairs function (available in the base package)

Generating a correlation plot

Some tips for using Spark

Summary

Spark Machine Learning - Regression and Cluster Models

About this chapter/what you will learn

Reading the data

Running a summary of the dataframe and saving the object

Splitting the data into train and test datasets

Generating the training datasets

Generating the test dataset

A note on parallel processing

Introducing errors into the test data set

Generating a histogram of the distribution

Generating the new test data with errors

Spark machine learning using logistic regression

Examining the output

Regularization Models

Predicting outcomes

Plotting the results

Running predictions for the test data

Combining the training and test dataset

Exposing the three tables to SQL

Validating the regression results

Calculating goodness of fit measures

Confusion matrix

Confusion matrix for test group

Distribution of average errors by group

Plotting the data

Pseudo R-square

Root-mean-square error (RMSE)

Plotting outside of Spark

Collecting a sample of the results

Examining the distributions by outcome

Registering some additional tables

Creating some global views

User exercise

Cluster analysis

Preparing the data for analysis

Reading the data from the global views

Inputting the previously computed means and standard deviations

Joining the means and standard deviations with the training data

Joining the means and standard deviations with the test data

Normalizing the data

Displaying the output

Running the k-means model

Fitting the model to the training data

Fitting the model to the test data

Graphically display cluster assignment

Plotting via the Pairs function

Characterizing the clusters by their mean values

Calculating mean values for the test data

Summary

Spark Models – Rule-Based Learning

Loading the stop and frisk dataset

Importing the CSV file to databricks

Reading the table

Running the first cell

Reading the entire file into memory

Transforming some variables to integers

Discovering the important features

Eliminating some factors with a large number of levels

Test and train datasets

Examining the binned data

Running the OneR model

Interpreting the output

Constructing new variables

Running the prediction on the test sample

Another OneR example

The rules section

Constructing a decision tree using Rpart

First collect the sample

Decision tree using Rpart

Plot the tree

Running an alternative model in Python

Running a Python Decision Tree

Reading the Stop and Frisk table

Indexing the classification features

Mapping to an RDD

Specifying the decision tree model

Producing a larger tree

Visual trees

Comparing train and test decision trees

Summary
