Title Page
Copyright
Practical Predictive Analytics
Credits
About the Author
About the Reviewers
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Getting Started with Predictive Analytics
Predictive analytics is used in many industries
Predictive Analytics in marketing
Predictive Analytics in healthcare
Predictive Analytics in other industries
Skills and roles that are important in Predictive Analytics
Related job skills and terms
Predictive analytics software
Open source software
Closed source software
Peaceful coexistence
Other helpful tools
Past the basics
Data analytics/research
Data engineering
Management
Team data science
Two different ways to look at predictive analytics
R
CRAN
R installation
Alternate ways of exploring R
How is a predictive analytics project organized?
Setting up your project and subfolders
GUIs
Getting started with RStudio
Rearranging the layout to correspond with the examples
Brief description of some important panes
Creating a new project
The R console
The source window
Creating a new script
Our first predictive model
Code description
Saving the script
Your second script
Code description
The predict function
Examining the prediction errors
R packages
The stargazer package
Installing stargazer package
Code description
Saving your work
References
Summary
The Modeling Process
Advantages of a structured approach
Ways in which structured methodologies can help
Analytic process methodologies
CRISP-DM and SEMMA
CRISP-DM and SEMMA chart
Agile processes
Six sigma and root cause
To sample or not to sample?
Using all of the data
Comparing a sample to the population
An analytics methodology outline – specific steps
Step 1 business understanding
Communicating business goals – the feedback loop
Internal data
External data
Tools of the trade
Process understanding
Data lineage
Data dictionaries
SQL
Example – Using SQL to get sales by region
Charts and plots
Spreadsheets
Simulation
Example – simulating if a customer contact will yield a sale
Example – simulating customer service calls
Step 2 data understanding
Levels of measurement
Nominal data
Ordinal data
Interval data
Ratio data
Converting from the different levels of measurement
Dependent and independent variables
Transformed variables
Single variable analysis
Summary statistics
Bivariate analysis
Types of questions that bivariate analysis can answer
Quantitative with quantitative variables
Code example
Nominal with nominal variables
Cross-tabulations
Mosaic plots
Nominal with quantitative variables
Point biserial correlation
Step 3 data preparation
Step 4 modeling
Description of specific models
Poisson (counts)
Logistic regression
Support vector machines (SVM)
Decision trees
Random forests
Example – comparing single decision trees to a random forest
An age decision tree
An alternative decision tree
The random forest model
Random forest versus decision trees
Variable importance plots
Dimension reduction techniques
Principal components
Clustering
Time series models
Naive Bayes classifier
Text mining techniques
Step 5 evaluation
Model validation
Area under the curve
Computing an ROC curve using the titanic dataset
In-sample/out-of-sample tests, walk-forward tests
Training/test/validation datasets
Time series validation
Benchmark against best champion model
Expert opinions: man against machine
Meta-analysis
Dart board method
Step 6 deployment
Model scoring
References
Notes
Summary
Inputting and Exploring Data
Data input
Text file input
The read.table function
Database tables
Spreadsheet files
XML and JSON data
Generating your own data
Tips for dealing with large files
Data munging and wrangling
Joining data
Using the sqldf function
Housekeeping and loading of necessary packages
Generating the data
Examining the metadata
Merging data using inner and outer joins
Identifying members with multiple purchases
Eliminating duplicate records
Exploring the hospital dataset
Output from the str(df) function
Output from the View function
The colnames function
The summary function
Sending the output to an HTML file
Open the file in the browser
Plotting the distributions
Visual plotting of the variables
Breaking out summaries by groups
Standardizing data
Changing a variable to another type
Appending the variables to the existing dataframe
Extracting a subset
Transposing a dataframe
Dummy variable coding
Binning – numeric and character
Binning character data
Missing values
Setting up the missing values test dataset
The various types of missing data
Missing Completely at Random (MCAR)
Testing for MCAR
Missing at Random (MAR)
Not Missing at Random (NMAR)
Correcting for missing values
Listwise deletion
Imputation methods
Imputing missing values using the 'mice' package
Running a regression with imputed values
Imputing categorical variables
Outliers
Why outliers are important
Detecting outliers
Transforming the data
Tracking down the cause of the outliers
Ways to deal with outliers
Example – setting the outliers to NA
Multivariate outliers
Data transformations
Generating the test data
The Box-Cox Transform
Variable reduction/variable importance
Principal Components Analysis (PCA)
Where is PCA used?
A PCA example – US Arrests
All subsets regression
An example – airquality
Adjusted R-square plot
Variable importance
Variable influence plot
References
Summary
Introduction to Regression Algorithms
Supervised versus unsupervised learning models
Supervised learning models
Unsupervised learning models
Regression techniques
Advantages of regression
Generalized linear models
Linear regression using GLM
Logistic regression
The odds ratio
The logistic regression coefficients
Example – using logistic regression in health care to predict pain thresholds
Reading the data
Obtaining some basic counts
Saving your data
Fitting a GLM model
Examining the residuals
Residual plots
Added variable plots
Outliers in the regression
P-values and effect size
Variable selection
Interactions
Goodness of fit statistics
McFadden statistic
Confidence intervals and Wald statistics
Basic regression diagnostic plots
Description of the plots
An interactive game – guessing if the residuals are random
Goodness of fit – Hosmer-Lemeshow test
Goodness of fit example on the PainGLM data
Regularization
An example – ElasticNet
Choosing a correct lambda
Printing out the possible coefficients based on lambda
Summary
Introduction to Decision Trees, Clustering, and SVM
Decision tree algorithms
Advantages of decision trees
Disadvantages of decision trees
Basic decision tree concepts
Growing the tree
Impurity
Controlling the growth of the tree
Types of decision tree algorithms
Examining the target variable
Using formula notation in an rpart model
Interpretation of the plot
Printing a text version of the decision tree
The ctree algorithm
Pruning
Other options to render decision trees
Cluster analysis
Clustering is used in diverse industries
What is a cluster?
Types of clustering
Partitional clustering
K-means clustering
The k-means algorithm
Measuring distance between clusters
Clustering example using k-means
Cluster elbow plot
Extracting the cluster assignments
Graphically displaying the clusters
Cluster plots
Generating the cluster plot
Hierarchical clustering
Examining some examples from cluster 1
Examining some examples from cluster 2
Examining some examples from cluster 3
Support vector machines
Simple illustration of a mapping function
Analyzing consumer complaints data using SVM
Converting unstructured to structured data
References
Summary
Using Survival Analysis to Predict and Analyze Customer Churn
What is survival analysis?
Time-dependent data
Censoring
Left censoring
Right censoring
Our customer satisfaction dataset
Generating the data using probability functions
Creating the churn and no churn dataframes
Creating and verifying the new simulated variables
Recombining the churner and non-churners
Creating matrix plots
Partitioning into training and test data
Setting the stage by creating survival objects
Examining survival curves
Better plots
Contrasting survival curves
Testing for the gender difference between survival curves
Testing for the educational differences between survival curves
Plotting the customer satisfaction and number of service call curves
Improving the education survival curve by adding gender
Transforming service calls to a binary variable
Testing the difference between customers who called and those who did not
Cox regression modeling
Our first model
Examining the Cox regression output
Proportional hazards test
Proportional hazard plots
Obtaining the Cox survival curves
Plotting the curve
Partial regression plots
Examining subset survival curves
Comparing gender differences
Comparing customer satisfaction differences
Validating the model
Computing baseline estimates
Running the predict() function
Predicting the outcome at time 6
Determining concordance
Time-based variables
Changing the data to reflect the second survey
How survSplit works
Adjusting records to simulate an intervention
Running the time-based model
Comparing the models
Variable selection
Incorporating interaction terms
Displaying the formulas sublist
Comparing AIC among the candidate models
Summary
Using Market Basket Analysis as a Recommender Engine
What is market basket analysis?
Examining the groceries transaction file
Format of the groceries transaction files
The sample market basket
Association rule algorithms
Antecedents and consequents
Evaluating the accuracy of a rule
Support
Calculating support
Examples
Confidence
Lift
Evaluating lift
Preparing the raw data file for analysis
Reading the transaction file
capture.output function
Analyzing the input file
Analyzing the invoice dates
Plotting the dates
Scrubbing and cleaning the data
Removing unneeded character spaces
Simplifying the descriptions
Removing colors automatically
The colors() function
Cleaning up the colors
Filtering out single item transactions
Looking at the distributions
Merging the results back into the original data
Compressing descriptions using camelcase
Custom function to map to camelcase
Extracting the last word
Creating the test and training datasets
Saving the results
Loading the analytics file
Determining the consequent rules
Replacing missing values
Making the final subset
Creating the market basket transaction file
Method one – Coercing a dataframe to a transaction file
Inspecting the transaction file
Obtaining the topN purchased items
Finding the association rules
Examining the rules summary
Examining the rules quality and observing the highest support
Confidence and lift measures
Filtering a large number of rules
Generating many rules
Plotting many rules
Method two – Creating a physical transactions file
Reading the transaction file back in
Plotting the rules
Creating subsets of the rules
Text clustering
Converting to a document term matrix
Removing sparse terms
Finding frequent terms
K-means clustering of terms
Examining cluster 1
Examining cluster 2
Examining cluster 3
Examining cluster 4
Examining cluster 5
Predicting cluster assignments
Using flexclust to predict cluster assignment
Running k-means to generate the clusters
Creating the test DTM
Running the apriori algorithm on the clusters
Summarizing the metrics
References
Summary
Exploring Health Care Enrollment Data as a Time Series
Time series data
Exploring time series data
Health insurance coverage dataset
Housekeeping
Read the data in
Subsetting the columns
Description of the data
Target time series variable
Saving the data
Determining all of the subset groups
Merging the aggregate data back into the original data
Checking the time intervals
Picking out the top groups in terms of average population size
Plotting the data using lattice
Plotting the data using ggplot
Sending output to an external file
Examining the output
Detecting linear trends
Automating the regressions
Ranking the coefficients
Merging scores back into the original dataframe
Plotting the data with the trend lines
Plotting all the categories on one graph
Adding labels
Performing some automated forecasting using the ets function
Converting the dataframe to a time series object
Smoothing the data using moving averages
Simple moving average
Computing the SMA using a function
Verifying the SMA calculation
Exponential moving average
Computing the EMA using a function
Selecting a smoothing factor
Using the ets function
Forecasting using ALL AGES
Plotting the predicted and actual values
The forecast (fit) method
Plotting future values with confidence bands
Modifying the model to include a trend component
Running the ets function iteratively over all of the categories
Accuracy measures produced by onestep
Comparing the Test and Training for the "UNDER 18 YEARS" group
Accuracy measures
References
Summary
Introduction to Spark Using R
About Spark
Spark environments
Cluster computing
Parallel computing
SparkR
Dataframes
Building our first Spark dataframe
Simulation
Importing the sample notebook
Notebook format
Creating a new notebook
Becoming large by starting small
The Pima Indians diabetes dataset
Running the code
Running the initialization code
Extracting the Pima Indians diabetes dataset
Examining the output
Output from the str() function
Output from the summary() function
Comparing outcomes
Checking for missing values
Imputing the missing values
Checking the imputations (reader exercise)
Missing values complete!
Calculating the correlation matrices
Calculating the column means
Simulating the data
Which correlations to use?
Checking the object type
Simulating the negative cases
Concatenating the positive and negative cases into a single Spark dataframe
Running summary statistics
Saving your work
Summary
Exploring Large Datasets Using Spark
Performing some exploratory analysis on positives
Displaying the contents of a Spark dataframe
Graphing using native graph features
Running pairwise correlations directly on a Spark dataframe
Cleaning up and caching the table in memory
Some useful Spark functions to explore your data
Count and groupby
Covariance and correlation functions
Creating new columns
Constructing a cross-tab
Contrasting histograms
Plotting using ggplot
Spark SQL
Registering tables
Issuing SQL through the R interface
Using SQL to examine potential outliers
Creating some aggregates
Picking out some potential outliers using a third query
Changing to the SQL API
SQL – computing a new column using the CASE statement
Evaluating outcomes based upon the Age segment
Computing mean values for all of the variables
Exporting data from Spark back into R
Running local R packages
Using the pairs function (available in the base package)
Generating a correlation plot
Some tips for using Spark
Summary
Spark Machine Learning - Regression and Cluster Models
About this chapter/what you will learn
Reading the data
Running a summary of the dataframe and saving the object
Splitting the data into train and test datasets
Generating the training datasets
Generating the test dataset
A note on parallel processing
Introducing errors into the test data set
Generating a histogram of the distribution
Generating the new test data with errors
Spark machine learning using logistic regression
Examining the output
Regularization models
Predicting outcomes
Plotting the results
Running predictions for the test data
Combining the training and test dataset
Exposing the three tables to SQL
Validating the regression results
Calculating goodness of fit measures
Confusion matrix
Confusion matrix for test group
Distribution of average errors by group
Plotting the data
Pseudo R-square
Root-mean-square error (RMSE)
Plotting outside of Spark
Collecting a sample of the results
Examining the distributions by outcome
Registering some additional tables
Creating some global views
User exercise
Cluster analysis
Preparing the data for analysis
Reading the data from the global views
Inputting the previously computed means and standard deviations
Joining the means and standard deviations with the training data
Joining the means and standard deviations with the test data
Normalizing the data
Displaying the output
Running the k-means model
Fitting the model to the training data
Fitting the model to the test data
Graphically displaying cluster assignment
Plotting via the pairs function
Characterizing the clusters by their mean values
Calculating mean values for the test data
Summary
Spark Models – Rule-Based Learning
Loading the stop and frisk dataset
Importing the CSV file to databricks
Reading the table
Running the first cell
Reading the entire file into memory
Transforming some variables to integers
Discovering the important features
Eliminating some factors with a large number of levels
Test and train datasets
Examining the binned data
Running the OneR model
Interpreting the output
Constructing new variables
Running the prediction on the test sample
Another OneR example
The rules section
Constructing a decision tree using Rpart
First collect the sample
Decision tree using Rpart
Plot the tree
Running an alternative model in Python
Running a Python Decision Tree
Reading the Stop and Frisk table
Indexing the classification features
Mapping to an RDD
Specifying the decision tree model
Producing a larger tree
Visual trees
Comparing train and test decision trees
Summary