Python Data Mining Quick Start Guide (e-book)

Author: Nathan Greeneltch

Publisher: Packt Publishing

Publication date: 2019-04-25

Word count: 187,000

Explore the different data mining techniques using the libraries and packages offered by Python

Key Features

  • Grasp the basics of data loading, cleaning, analysis, and visualization
  • Use popular Python libraries such as NumPy, pandas, matplotlib, and scikit-learn for data mining
  • Your one-stop guide to building efficient data mining pipelines without going into too much theory

Book Description

Data mining is a necessary and predictable response to the dawn of the information age. It is typically defined as the pattern and/or trend discovery phase in the data mining pipeline, and Python is a popular choice for these tasks because it offers a wide variety of data mining tools.

This book is a quick introduction to the concept of data mining and shows how to put it to practical use with popular Python packages and libraries. You will get a hands-on demonstration of working with different real-world datasets and extracting useful insights from them using NumPy, pandas, scikit-learn, and matplotlib. You will then learn the different stages of data mining, such as data loading, cleaning, analysis, and visualization. You will also get a full conceptual description of popular data transformation, clustering, and classification techniques.

By the end of this book, you will be able to build an efficient data mining pipeline using Python.

What you will learn

  • Explore the methods for summarizing datasets and visualizing/plotting data
  • Collect and format data for analytical work
  • Assign data points to groups and visualize clustering patterns
  • Learn how to predict continuous and categorical outputs for data
  • Clean data, filter noise from it, and reduce its dimensions
  • Serialize a data processing model using scikit-learn's pipeline feature
  • Deploy the data processing model using Python's pickle module

Who this book is for

Python developers interested in getting started with data mining will love this book. Budding data scientists and data analysts looking to quickly get to grips with practical data mining with Python will also find it useful. Knowledge of Python programming is all you need to get started.
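The last two learning goals, building a scikit-learn pipeline and persisting it with pickle, can be illustrated with a short sketch. This is not code from the book. The choice of the iris dataset, the step names ("scale", "reduce", "classify"), and the specific estimators are assumptions made only for illustration.

```python
import pickle

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Example data (an assumption; any numeric feature matrix would do).
X, y = load_iris(return_X_y=True)

# Chain cleaning, transformation, and prediction into one object.
# The step names are arbitrary labels chosen for this sketch.
pipeline = Pipeline([
    ("scale", StandardScaler()),          # standardize each feature
    ("reduce", PCA(n_components=2)),      # reduce to two dimensions
    ("classify", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)

# Serialize the fitted pipeline to bytes, then restore it.
serialized = pickle.dumps(pipeline)
restored = pickle.loads(serialized)

# The restored pipeline should reproduce the original predictions.
original_predictions = pipeline.predict(X)
restored_predictions = restored.predict(X)
print(np.array_equal(original_predictions, restored_predictions))
```

In a deployment setting the bytes would typically be written to a file with `pickle.dump` and read back with `pickle.load`. Only unpickle data from a trusted source, and load the model with the same library versions that were used to save it, since pickled scikit-learn objects are not guaranteed to work across versions.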
Table of Contents

Dedication

About Packt

Why subscribe?

Packt.com

Contributors

About the author

About the reviewer

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Data Mining and Getting Started with Python Tools

Descriptive, predictive, and prescriptive analytics

What will and will not be covered in this book

Recommended readings for further explanation

Setting up Python environments for data mining

Installing the Anaconda distribution and Conda package manager

Installing on Linux

Installing on Windows

Installing on macOS

Launching the Spyder IDE

Launching a Jupyter Notebook

Installing high-performance Python distribution

Recommended libraries and how to install

Recommended libraries

Summary

Basic Terminology and Our End-to-End Example

Basic data terminology

Sample spaces

Variable types

Data types

Basic summary statistics

An end-to-end example of data mining in Python

Loading data into memory – viewing and managing with ease using pandas

Plotting and exploring data – harnessing the power of Seaborn

Transforming data – PCA and LDA with scikit-learn

Quantifying separations – k-means clustering and the silhouette score

Making decisions or predictions

Summary

Collecting, Exploring, and Visualizing Data

Types of data sources and loading into pandas

Databases

Basic Structured Query Language (SQL) queries

Disks

Web sources

From URLs

From Scikit-learn and Seaborn-included sets

Access, search, and sanity checks with pandas

Basic plotting in Seaborn

Popular types of plots for visualizing data

Scatter plots

Histograms

Jointplots

Violin plots

Pairplots

Summary

Cleaning and Readying Data for Analysis

The scikit-learn transformer API

Cleaning input data

Missing values

Finding and removing missing values

Imputing to replace the missing values

Feature scaling

Normalization

Standardization

Handling categorical data

Ordinal encoding

One-hot encoding

Label encoding

High-dimensional data

Dimension reduction

Feature selection

Feature filtering

The variance threshold

The correlation coefficient

Wrapper methods

Sequential feature selection

Transformation

PCA

LDA

Summary

Grouping and Clustering Data

Introducing clustering concepts

Location of the group

Euclidean space (centroids)

Non-Euclidean space (medoids)

Similarity

Euclidean space

The Euclidean distance

The Manhattan distance

Maximum distance

Non-Euclidean space

The cosine distance

The Jaccard distance

Termination condition

With known number of groupings

Without known number of groupings

Quality score and silhouette score

Clustering methods

Means separation

K-means

Finding k

K-means++

Mini batch K-means

Hierarchical clustering

Reuse the dendrogram to find number of clusters

Plot dendrogram

Density clustering

Spectral clustering

Summary

Prediction with Regression and Classification

Scikit-learn Estimator API

Introducing prediction concepts

Prediction nomenclature

Mathematical machinery

Loss function

Gradient descent

Fit quality regimes

Regression

Metrics of regression model prediction

Regression example dataset

Linear regression

Extension to multivariate form

Regularization with penalized regression

Regularization penalties

Classification

Classification example dataset

Metrics of classification model prediction

Multi-class classification

One-versus-all

One-versus-one

Logistic regression

Regularized logistic regression

Support vector machines

Soft-margin with C

The kernel trick

Tree-based classification

Decision trees

Node splitting with Gini

Random forest

Avoid overfitting and speed up the fits

Built-in validation with bagging

Tuning a prediction model

Cross-validation

Introduction of the validation set

Multiple validation sets with k-fold method

Grid search for hyperparameter tuning

Summary

Advanced Topics - Building a Data Processing Pipeline and Deploying It

Pipelining your analysis

Scikit-learn's pipeline object

Deploying the model

Serializing a model and storing with the pickle module

Loading a serialized model and predicting

Python-specific deployment concerns

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think
