Data Science Projects with Python (ebook)

Author: Stephen Klosterman

Publisher: Packt Publishing

Publication date: 2019-04-30

Word count: 6.58 million

Category: Imported Books > Foreign-Language Originals > Computers/Internet

Gain hands-on experience with industry-standard data analysis and machine learning tools in Python.

Key Features

  • Learn techniques to use data to identify the exact problem to be solved
  • Visualize data using different graphs
  • Identify how to select an appropriate algorithm for data extraction

Book Description

Data Science Projects with Python is designed to give you practical guidance on industry-standard data analysis and machine learning tools in Python, with the help of realistic data. The book will help you understand how you can use pandas and Matplotlib to critically examine a dataset with summary statistics and graphs, and extract the insights you seek. You will continue to build on this knowledge as you learn how to prepare data and feed it to machine learning algorithms, such as regularized logistic regression and random forest, using the scikit-learn package. You'll discover how to tune these algorithms to provide the best predictions on new, unseen data. As you delve into later chapters, you'll come to understand how these algorithms work and how to interpret their output, gaining insight not only into the predictive capabilities of the models but also into the reasons behind their predictions. By the end of this book, you will have the skills you need to confidently use various machine learning algorithms to perform detailed data analysis and extract meaningful insights from unstructured data.
What You Will Learn

  • Install the required packages to set up a data science coding environment
  • Load data into a Jupyter Notebook running Python
  • Use Matplotlib to create data visualizations
  • Fit a model using scikit-learn
  • Use lasso and ridge regression to reduce overfitting
  • Fit and tune a random forest model and compare its performance with logistic regression
  • Create visuals using the output of the Jupyter Notebook

Who This Book Is For

If you are a data analyst, data scientist, or business analyst who wants to get started with using Python and machine learning techniques to analyze data and predict outcomes, this book is for you. Basic knowledge of computer programming and data analytics is a must. Familiarity with mathematical concepts such as algebra and basic statistics will be useful.
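The core workflow the description outlines — fitting a regularized logistic regression and a random forest with scikit-learn and comparing their predictions on unseen data — can be sketched as follows. This is a minimal illustration on synthetic data; the dataset and all hyperparameter values here are illustrative assumptions, not taken from the book's case study:

```python
# Sketch of the modeling workflow described above: fit an L2-regularized
# logistic regression and a random forest, then compare them on held-out data.
# All data and hyperparameters are illustrative, not from the book.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic binary classification data standing in for the case study dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# L2 (ridge) regularized logistic regression; C controls regularization strength
logreg = LogisticRegression(penalty='l2', C=1.0, max_iter=1000)
logreg.fit(X_train, y_train)

# Random forest: an ensemble of decision trees
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Compare both models on the unseen test set using ROC AUC
for name, model in [('logistic regression', logreg), ('random forest', rf)]:
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f'{name}: test ROC AUC = {auc:.3f}')
```

The book additionally covers cross-validation for choosing hyperparameters such as `C` and `n_estimators`, rather than fixing them up front as this sketch does.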
Table of Contents

Preface

About the Book

About the Author

Objectives

Audience

Approach

Hardware Requirements

Software Requirements

Installation and Setup

Conventions

Chapter 1: Data Exploration and Cleaning

Introduction

Python and the Anaconda Package Management System

Indexing and the Slice Operator

Exercise 1: Examining Anaconda and Getting Familiar with Python

Different Types of Data Science Problems

Loading the Case Study Data with Jupyter and pandas

Exercise 2: Loading the Case Study Data in a Jupyter Notebook

Getting Familiar with Data and Performing Data Cleaning

The Business Problem

Data Exploration Steps

Exercise 3: Verifying Basic Data Integrity

Boolean Masks

Exercise 4: Continuing Verification of Data Integrity

Exercise 5: Exploring and Cleaning the Data

Data Quality Assurance and Exploration

Exercise 6: Exploring the Credit Limit and Demographic Features

Deep Dive: Categorical Features

Exercise 7: Implementing OHE for a Categorical Feature

Exploring the Financial History Features in the Dataset

Activity 1: Exploring Remaining Financial Features in the Dataset

Summary

Chapter 2: Introduction to Scikit-Learn and Model Evaluation

Introduction

Exploring the Response Variable and Concluding the Initial Exploration

Introduction to Scikit-Learn

Generating Synthetic Data

Data for a Linear Regression

Exercise 8: Linear Regression in Scikit-Learn

Model Performance Metrics for Binary Classification

Splitting the Data: Training and Testing Sets

Classification Accuracy

True Positive Rate, False Positive Rate, and Confusion Matrix

Exercise 9: Calculating the True and False Positive and Negative Rates and Confusion Matrix in Python

Discovering Predicted Probabilities: How Does Logistic Regression Make Predictions?

Exercise 10: Obtaining Predicted Probabilities from a Trained Logistic Regression Model

The Receiver Operating Characteristic (ROC) Curve

Precision

Activity 2: Performing Logistic Regression with a New Feature and Creating a Precision-Recall Curve

Summary

Chapter 3: Details of Logistic Regression and Feature Exploration

Introduction

Examining the Relationships between Features and the Response

Pearson Correlation

F-test

Exercise 11: F-test and Univariate Feature Selection

Finer Points of the F-test: Equivalence to t-test for Two Classes and Cautions

Hypotheses and Next Steps

Exercise 12: Visualizing the Relationship between Features and Response

Univariate Feature Selection: What It Does and Doesn't Do

Understanding Logistic Regression with Function Syntax in Python and the Sigmoid Function

Exercise 13: Plotting the Sigmoid Function

Scope of Functions

Why is Logistic Regression Considered a Linear Model?

Exercise 14: Examining the Appropriateness of Features for Logistic Regression

From Logistic Regression Coefficients to Predictions Using the Sigmoid

Exercise 15: Linear Decision Boundary of Logistic Regression

Activity 3: Fitting a Logistic Regression Model and Directly Using the Coefficients

Summary

Chapter 4: The Bias-Variance Trade-off

Introduction

Estimating the Coefficients and Intercepts of Logistic Regression

Gradient Descent to Find Optimal Parameter Values

Exercise 16: Using Gradient Descent to Minimize a Cost Function

Assumptions of Logistic Regression

The Motivation for Regularization: The Bias-Variance Trade-off

Exercise 17: Generating and Modeling Synthetic Classification Data

Lasso (L1) and Ridge (L2) Regularization

Cross Validation: Choosing the Regularization Parameter and Other Hyperparameters

Exercise 18: Reducing Overfitting on the Synthetic Data Classification Problem

Options for Logistic Regression in Scikit-Learn

Scaling Data, Pipelines, and Interaction Features in Scikit-Learn

Activity 4: Cross-Validation and Feature Engineering with the Case Study Data

Summary

Chapter 5: Decision Trees and Random Forests

Introduction

Decision Trees

The Terminology of Decision Trees and Connections to Machine Learning

Exercise 19: A Decision Tree in scikit-learn

Training Decision Trees: Node Impurity

Features Used for the First Splits: Connections to Univariate Feature Selection and Interactions

Training Decision Trees: A Greedy Algorithm

Training Decision Trees: Different Stopping Criteria

Using Decision Trees: Advantages and Predicted Probabilities

A More Convenient Approach to Cross-Validation

Exercise 20: Finding Optimal Hyperparameters for a Decision Tree

Random Forests: Ensembles of Decision Trees

Random Forest: Predictions and Interpretability

Exercise 21: Fitting a Random Forest

Checkerboard Graph

Activity 5: Cross-Validation Grid Search with Random Forest

Summary

Chapter 6: Imputation of Missing Data, Financial Analysis, and Delivery to Client

Introduction

Review of Modeling Results

Dealing with Missing Data: Imputation Strategies

Preparing Samples with Missing Data

Exercise 22: Cleaning the Dataset

Exercise 23: Mode and Random Imputation of PAY_1

A Predictive Model for PAY_1

Exercise 24: Building a Multiclass Classification Model for Imputation

Using the Imputation Model and Comparing It to Other Methods

Confirming Model Performance on the Unseen Test Set

Financial Analysis

Financial Conversation with the Client

Exercise 25: Characterizing Costs and Savings

Activity 6: Deriving Financial Insights

Final Thoughts on Delivering the Predictive Model to the Client

Summary

Appendix

Chapter 1: Data Exploration and Cleaning

Activity 1: Exploring Remaining Financial Features in the Dataset

Chapter 2: Introduction to Scikit-Learn and Model Evaluation

Activity 2: Performing Logistic Regression with a New Feature and Creating a Precision-Recall Curve

Chapter 3: Details of Logistic Regression and Feature Exploration

Activity 3: Fitting a Logistic Regression Model and Directly Using the Coefficients

Chapter 4: The Bias-Variance Trade-off

Activity 4: Cross-Validation and Feature Engineering with the Case Study Data

Chapter 5: Decision Trees and Random Forests

Activity 5: Cross-Validation Grid Search with Random Forest

Chapter 6: Imputation of Missing Data, Financial Analysis, and Delivery to Client

Activity 6: Deriving Financial Insights
