Python: Advanced Predictive Analytics (e-book)

Authors: Ashish Kumar, Joseph Babcock

Publisher: Packt Publishing

Publication date: 2017-12-27

Word count: 6.297 million

Gain practical insights by exploiting data in your business to build advanced predictive modeling applications.

About This Book

  • A step-by-step guide to predictive modeling, including lots of tips, tricks, and best practices
  • Learn how to use popular predictive modeling algorithms such as Linear Regression, Decision Trees, Logistic Regression, and Clustering
  • Master open source Python tools to build sophisticated predictive models

Who This Book Is For

This book is designed for business analysts, BI analysts, data scientists, or junior-level data analysts who are ready to move on from a conceptual understanding of advanced analytics and become experts in designing and building advanced analytics solutions using Python. If you are familiar with coding in Python (or some other programming, statistical, or scripting language) but have never used or read about predictive analytics algorithms, this book will also help you.

What You Will Learn

  • Understand the statistical and mathematical concepts behind predictive analytics algorithms and implement them using Python libraries
  • Get to know various methods for importing, cleaning, sub-setting, merging, joining, concatenating, exploring, grouping, and plotting data with pandas and NumPy
  • Master the use of Python notebooks for exploratory data analysis and rapid prototyping
  • Get to grips with applying regression, classification, clustering, and deep learning algorithms
  • Discover advanced methods to analyze structured and unstructured data
  • Visualize the performance of models and the insights they produce
  • Ensure the robustness of your analytic applications by mastering the best practices of predictive analysis

In Detail

Social Media and the Internet of Things have resulted in an avalanche of data. Data is powerful but not in its raw form; it needs to be processed and modeled, and Python is one of the most robust tools out there to do so. It has an array of packages for predictive modeling and a suite of IDEs to choose from. Using the Python programming language, analysts can use these sophisticated methods to build scalable analytic applications.

This book is your guide to getting started with predictive analytics using Python. You'll balance both statistical and mathematical concepts, and implement them in Python using libraries such as pandas, scikit-learn, and NumPy. Through case studies and code examples using popular open-source Python libraries, this book illustrates the complete development process for analytic applications. Covering a wide range of algorithms for classification, regression, and clustering, as well as cutting-edge techniques such as deep learning, this book explains how these methods work. You will learn to choose the right approach for your problem and how to develop engaging visualizations that bring the insights of predictive modeling to life. Finally, you will learn best practices in predictive modeling, as well as the different applications of predictive modeling in the modern world.

The course provides you with highly practical content from the following Packt books:

  1. Learning Predictive Analytics with Python
  2. Mastering Predictive Analytics with Python

Style and approach

This course aims to create a smooth learning path that will teach you how to effectively perform predictive analytics using Python. Through this comprehensive course, you'll learn the basics of predictive analytics and progress to predictive modeling in the modern world.

Python: Advanced Predictive Analytics

Table of Contents

Python: Advanced Predictive Analytics

Credits

Preface

What this learning path covers

What you need for this learning path

Who this learning path is for

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Module 1

1. Getting Started with Predictive Modelling

Introducing predictive modelling

Scope of predictive modelling

Ensemble of statistical algorithms

Statistical tools

Historical data

Mathematical function

Business context

Knowledge matrix for predictive modelling

Task matrix for predictive modelling

Applications and examples of predictive modelling

LinkedIn's "People also viewed" feature

What does it do?

How is it done?

Correct targeting of online ads

How is it done?

Santa Cruz predictive policing

How is it done?

Determining the activity of a smartphone user using accelerometer data

How is it done?

Sport and fantasy leagues

How was it done?

Python and its packages – download and installation

Anaconda

Standalone Python

Installing a Python package

Installing pip

Installing Python packages with pip

Python and its packages for predictive modelling

IDEs for Python

Summary

2. Data Cleaning

Reading the data – variations and examples

Data frames

Delimiters

Various methods of importing data in Python

Case 1 – reading a dataset using the read_csv method

The read_csv method

Use cases of the read_csv method

Passing the directory address and filename as variables

Reading a .txt dataset with a comma delimiter

Specifying the column names of a dataset from a list

Case 2 – reading a dataset using the open method of Python

Reading a dataset line by line

Changing the delimiter of a dataset

Case 3 – reading data from a URL

Case 4 – miscellaneous cases

Reading from an .xls or .xlsx file

Writing to a CSV or Excel file

Basics – summary, dimensions, and structure

Handling missing values

Checking for missing values

What constitutes missing data?

How missing values are generated and propagated

Treating missing values

Deletion

Imputation

Creating dummy variables

Visualizing a dataset by basic plotting

Scatter plots

Histograms

Boxplots

Summary

3. Data Wrangling

Subsetting a dataset

Selecting columns

Selecting rows

Selecting a combination of rows and columns

Creating new columns

Generating random numbers and their usage

Various methods for generating random numbers

Seeding a random number

Generating random numbers following probability distributions

Probability density function

Cumulative distribution function

Uniform distribution

Normal distribution

Using the Monte-Carlo simulation to find the value of pi

Geometry and mathematics behind the calculation of pi

Generating a dummy data frame

Grouping the data – aggregation, filtering, and transformation

Aggregation

Filtering

Transformation

Miscellaneous operations

Random sampling – splitting a dataset into training and testing datasets

Method 1 – using the Customer Churn Model

Method 2 – using sklearn

Method 3 – using the shuffle function

Concatenating and appending data

Merging/joining datasets

Inner Join

Left Join

Right Join

An example of the Inner Join

An example of the Left Join

An example of the Right Join

Summary of Joins in terms of their length

Summary

4. Statistical Concepts for Predictive Modelling

Random sampling and the central limit theorem

Hypothesis testing

Null versus alternate hypothesis

Z-statistic and t-statistic

Confidence intervals, significance levels, and p-values

Different kinds of hypothesis test

A step-by-step guide to do a hypothesis test

An example of a hypothesis test

Chi-square tests

Correlation

Summary

5. Linear Regression with Python

Understanding the maths behind linear regression

Linear regression using simulated data

Fitting a linear regression model and checking its efficacy

Finding the optimum value of variable coefficients

Making sense of result parameters

p-values

F-statistics

Residual Standard Error

Implementing linear regression with Python

Linear regression using the statsmodels library

Multiple linear regression

Multi-collinearity

Variance Inflation Factor

Model validation

Training and testing data split

Summary of models

Linear regression with scikit-learn

Feature selection with scikit-learn

Handling other issues in linear regression

Handling categorical variables

Transforming a variable to fit non-linear relations

Handling outliers

Other considerations and assumptions for linear regression

Summary

6. Logistic Regression with Python

Linear regression versus logistic regression

Understanding the math behind logistic regression

Contingency tables

Conditional probability

Odds ratio

Moving on to logistic regression from linear regression

Estimation using the Maximum Likelihood Method

Likelihood function:

Log likelihood function:

Building the logistic regression model from scratch

Making sense of logistic regression parameters

Wald test

Likelihood Ratio Test statistic

Chi-square test

Implementing logistic regression with Python

Processing the data

Data exploration

Data visualization

Creating dummy variables for categorical variables

Feature selection

Implementing the model

Model validation and evaluation

Cross validation

Model validation

The ROC curve

Confusion matrix

Summary

7. Clustering with Python

Introduction to clustering – what, why, and how?

What is clustering?

How is clustering used?

Why do we do clustering?

Mathematics behind clustering

Distances between two observations

Euclidean distance

Manhattan distance

Minkowski distance

The distance matrix

Normalizing the distances

Linkage methods

Single linkage

Complete linkage

Average linkage

Centroid linkage

Ward's method

Hierarchical clustering

K-means clustering

Implementing clustering using Python

Importing and exploring the dataset

Normalizing the values in the dataset

Hierarchical clustering using scikit-learn

K-Means clustering using scikit-learn

Interpreting the cluster

Fine-tuning the clustering

The elbow method

Silhouette Coefficient

Summary

8. Trees and Random Forests with Python

Introducing decision trees

A decision tree

Understanding the mathematics behind decision trees

Homogeneity

Entropy

Information gain

ID3 algorithm to create a decision tree

Gini index

Reduction in Variance

Pruning a tree

Handling a continuous numerical variable

Handling a missing value of an attribute

Implementing a decision tree with scikit-learn

Visualizing the tree

Cross-validating and pruning the decision tree

Understanding and implementing regression trees

Regression tree algorithm

Implementing a regression tree using Python

Understanding and implementing random forests

The random forest algorithm

Implementing a random forest using Python

Why do random forests work?

Important parameters for random forests

Summary

9. Best Practices for Predictive Modelling

Best practices for coding

Commenting the code

Defining functions for substantial individual tasks

Example 1

Example 2

Example 3

Avoid hard-coding of variables as much as possible

Version control

Using standard libraries, methods, and formulas

Best practices for data handling

Best practices for algorithms

Best practices for statistics

Best practices for business contexts

Summary

A. A List of Links

2. Module 2

1. From Data to Decisions – Getting Started with Analytic Applications

Designing an advanced analytic solution

Data layer: warehouses, lakes, and streams

Modeling layer

Deployment layer

Reporting layer

Case study: sentiment analysis of social media feeds

Data input and transformation

Sanity checking

Model development

Scoring

Visualization and reporting

Case study: targeted e-mail campaigns

Data input and transformation

Sanity checking

Model development

Scoring

Visualization and reporting

Summary

2. Exploratory Data Analysis and Visualization in Python

Exploring categorical and numerical data in IPython

Installing IPython notebook

The notebook interface

Loading and inspecting data

Basic manipulations – grouping, filtering, mapping, and pivoting

Charting with Matplotlib

Time series analysis

Cleaning and converting

Time series diagnostics

Joining signals and correlation

Working with geospatial data

Loading geospatial data

Working in the cloud

Introduction to PySpark

Creating the SparkContext

Creating an RDD

Creating a Spark DataFrame

Summary

3. Finding Patterns in the Noise – Clustering and Unsupervised Learning

Similarity and distance metrics

Numerical distance metrics

Correlation similarity metrics and time series

Similarity metrics for categorical data

K-means clustering

Affinity propagation – automatically choosing cluster numbers

k-medoids

Agglomerative clustering

Where agglomerative clustering fails

Streaming clustering in Spark

Summary

4. Connecting the Dots with Models – Regression Methods

Linear regression

Data preparation

Model fitting and evaluation

Statistical significance of regression outputs

Generalized estimating equations

Mixed effects models

Time series data

Generalized linear models

Applying regularization to linear models

Tree methods

Decision trees

Random forest

Scaling out with PySpark – predicting year of song release

Summary

5. Putting Data in its Place – Classification Methods and Analysis

Logistic regression

Multiclass logistic classifiers: multinomial regression

Formatting a dataset for classification problems

Learning pointwise updates with stochastic gradient descent

Jointly optimizing all parameters with second-order methods

Fitting the model

Evaluating classification models

Strategies for improving classification models

Separating nonlinear boundaries with support vector machines

Fitting an SVM to the census data

Boosting – combining small models to improve accuracy

Gradient boosted decision trees

Comparing classification methods

Case study: fitting classifier models in PySpark

Summary

6. Words and Pixels – Working with Unstructured Data

Working with textual data

Cleaning textual data

Extracting features from textual data

Using dimensionality reduction to simplify datasets

Principal component analysis

Latent Dirichlet Allocation

Using dimensionality reduction in predictive modeling

Images

Cleaning image data

Thresholding images to highlight objects

Dimensionality reduction for image analysis

Case Study: Training a Recommender System in PySpark

Summary

7. Learning from the Bottom Up – Deep Networks and Unsupervised Features

Learning patterns with neural networks

A network of one – the perceptron

Combining perceptrons – a single-layer neural network

Parameter fitting with back-propagation

Discriminative versus generative models

Vanishing gradients and explaining away

Pretraining belief networks

Using dropout to regularize networks

Convolutional networks and rectified units

Compressing data with autoencoder networks

Optimizing the learning rate

The TensorFlow library and digit recognition

The MNIST data

Constructing the network

Summary

8. Sharing Models with Prediction Services

The architecture of a prediction service

Clients and making requests

The GET requests

The POST request

The HEAD request

The PUT request

The DELETE request

Server – the web traffic controller

Application – the engine of the predictive services

Persisting information with database systems

Case study – logistic regression service

Setting up the database

The web server

The web application

The flow of a prediction service – training a model

On-demand and bulk prediction

Summary

9. Reporting and Testing – Iterating on Analytic Systems

Checking the health of models with diagnostics

Evaluating changes in model performance

Changes in feature importance

Changes in unsupervised model performance

Iterating on models through A/B testing

Experimental allocation – assigning customers to experiments

Deciding a sample size

Multiple hypothesis testing

Guidelines for communication

Translate terms to business values

Visualizing results

Case Study: building a reporting service

The report server

The report application

The visualization layer

Summary

Bibliography

Index
