Data Wrangling with Python (eBook)

Author: Dr. Tirthajyoti Sarkar

Publisher: Packt Publishing

Publication date: 2019-02-28

Word count: 9.947 million

Simplify your ETL processes with these hands-on data hygiene tips, tricks, and best practices.

Key Features
* Focus on the basics of data wrangling
* Study various ways to extract the most out of your data in less time
* Boost your learning curve with bonus topics like random data generation and data integrity checks

Book Description
For data to be useful and meaningful, it must be curated and refined. Data Wrangling with Python teaches you the core ideas behind these processes and equips you with knowledge of the most popular tools and techniques in the domain.

The book starts with the absolute basics of Python, focusing mainly on data structures. It then delves into the fundamental tools of data wrangling, such as the NumPy and Pandas libraries. You’ll learn why you should stay away from traditional ways of data cleaning, as done in other languages, and take advantage of the specialized pre-built routines in Python. This combination of Python tips and tricks will also demonstrate how to use the same Python backend to extract and transform data from an array of sources, including the Internet, large database vaults, and Excel financial tables.

To help you prepare for more challenging scenarios, you’ll learn how to handle missing or wrong data and reformat it to meet the requirements of the downstream analytics tool. The book will further help you grasp concepts through real-world examples and datasets. By the end of this book, you will be confident in using a diverse array of sources to extract, clean, transform, and format your data efficiently.

What you will learn
* Use and manipulate complex and simple data structures
* Harness the full potential of DataFrames and numpy.array at run time
* Perform web scraping with BeautifulSoup4 and html5lib
* Execute advanced string search and manipulation with RegEx
* Handle outliers and perform data imputation with Pandas
* Use descriptive statistics and plotting techniques
* Practice data wrangling and modeling using data generation techniques

Who this book is for
Data Wrangling with Python is designed for developers, data analysts, and business analysts who are keen to pursue a career as a full-fledged data scientist or analytics expert. Although this book is for beginners, prior working knowledge of Python is necessary to easily grasp the concepts covered here. It will also help to have rudimentary knowledge of relational databases and SQL.
Table of Contents

Preface

About the Book

About the Authors

Learning Objectives

Approach

Audience

Minimum Hardware Requirements

Software Requirements

Conventions

Installation and Setup

Installing the Code Bundle

Additional Resources

Chapter 1

Introduction to Data Wrangling with Python

Introduction

Importance of Data Wrangling

Python for Data Wrangling

Lists, Sets, Strings, Tuples, and Dictionaries

Lists

Exercise 1: Accessing the List Members

Exercise 2: Generating a List

Exercise 3: Iterating over a List and Checking Membership

Exercise 4: Sorting a List

Exercise 5: Generating a Random List

Activity 1: Handling Lists

Sets

Introduction to Sets

Union and Intersection of Sets

Creating Null Sets

Dictionary

Exercise 6: Accessing and Setting Values in a Dictionary

Exercise 7: Iterating Over a Dictionary

Exercise 8: Revisiting the Unique Valued List Problem

Exercise 9: Deleting Value from Dict

Exercise 10: Dictionary Comprehension

Tuples

Creating a Tuple with Different Cardinalities

Unpacking a Tuple

Exercise 11: Handling Tuples

Strings

Exercise 12: Accessing Strings

Exercise 13: String Slices

String Functions

Exercise 14: Split and Join

Activity 2: Analyze a Multiline String and Generate the Unique Word Count

Summary

Chapter 2

Advanced Data Structures and File Handling

Introduction

Advanced Data Structures

Iterator

Exercise 15: Introduction to the Iterator

Stacks

Exercise 16: Implementing a Stack in Python

Exercise 17: Implementing a Stack Using User-Defined Methods

Exercise 18: Lambda Expression

Exercise 19: Lambda Expression for Sorting

Exercise 20: Multi-Element Membership Checking

Queue

Exercise 21: Implementing a Queue in Python

Activity 3: Permutation, Iterator, Lambda, List

Basic File Operations in Python

Exercise 22: File Operations

File Handling

Exercise 23: Opening and Closing a File

The with Statement

Opening a File Using the with Statement

Exercise 24: Reading a File Line by Line

Exercise 25: Write to a File

Activity 4: Design Your Own CSV Parser

Summary

Chapter 3

Introduction to NumPy, Pandas, and Matplotlib

Introduction

NumPy Arrays

NumPy Array and Features

Exercise 26: Creating a NumPy Array (from a List)

Exercise 27: Adding Two NumPy Arrays

Exercise 28: Mathematical Operations on NumPy Arrays

Exercise 29: Advanced Mathematical Operations on NumPy Arrays

Exercise 30: Generating Arrays Using arange and linspace

Exercise 31: Creating Multi-Dimensional Arrays

Exercise 32: The Dimension, Shape, Size, and Data Type of the Two-dimensional Array

Exercise 33: Zeros, Ones, Random, Identity Matrices, and Vectors

Exercise 34: Reshaping, Ravel, Min, Max, and Sorting

Exercise 35: Indexing and Slicing

Conditional Subsetting

Exercise 36: Array Operations (array-array, array-scalar, and universal functions)

Stacking Arrays

Pandas DataFrames

Exercise 37: Creating a Pandas Series

Exercise 38: Pandas Series and Data Handling

Exercise 39: Creating Pandas DataFrames

Exercise 40: Viewing a DataFrame Partially

Indexing and Slicing Columns

Indexing and Slicing Rows

Exercise 41: Creating and Deleting a New Column or Row

Statistics and Visualization with NumPy and Pandas

Refresher of Basic Descriptive Statistics (and the Matplotlib Library for Visualization)

Exercise 42: Introduction to Matplotlib Through a Scatter Plot

Definition of Statistical Measures – Central Tendency and Spread

Random Variables and Probability Distribution

What Is a Probability Distribution?

Discrete Distributions

Continuous Distributions

Data Wrangling in Statistics and Visualization

Using NumPy and Pandas to Calculate Basic Descriptive Statistics on the DataFrame

Random Number Generation Using NumPy

Exercise 43: Generating Random Numbers from a Uniform Distribution

Exercise 44: Generating Random Numbers from a Binomial Distribution and Bar Plot

Exercise 45: Generating Random Numbers from Normal Distribution and Histograms

Exercise 46: Calculation of Descriptive Statistics from a DataFrame

Exercise 47: Built-in Plotting Utilities

Activity 5: Generating Statistics from a CSV File

Summary

Chapter 4

A Deep Dive into Data Wrangling with Python

Introduction

Subsetting, Filtering, and Grouping

Exercise 48: Loading and Examining a Superstore's Sales Data from an Excel File

Subsetting the DataFrame

An Example Use Case: Determining Statistics on Sales and Profit

Exercise 49: The unique Function

Conditional Selection and Boolean Filtering

Exercise 50: Setting and Resetting the Index

Exercise 51: The GroupBy Method

Detecting Outliers and Handling Missing Values

Missing Values in Pandas

Exercise 52: Filling in the Missing Values with fillna

Exercise 53: Dropping Missing Values with dropna

Outlier Detection Using a Simple Statistical Test

Concatenating, Merging, and Joining

Exercise 54: Concatenation

Exercise 55: Merging by a Common Key

Exercise 56: The join Method

Useful Methods of Pandas

Exercise 57: Randomized Sampling

The value_counts Method

Pivot Table Functionality

Exercise 58: Sorting by Column Values – the sort_values Method

Exercise 59: Flexibility for User-Defined Functions with the apply Method

Activity 6: Working with the Adult Income Dataset (UCI)

Summary

Chapter 5

Getting Comfortable with Different Kinds of Data Sources

Introduction

Reading Data from Different Text-Based (and Non-Text-Based) Sources

Data Files Provided with This Chapter

Libraries to Install for This Chapter

Exercise 60: Reading Data from a CSV File Where Headers Are Missing

Exercise 61: Reading from a CSV File where Delimiters are not Commas

Exercise 62: Bypassing the Headers of a CSV File

Exercise 63: Skipping Initial Rows and Footers when Reading a CSV File

Reading Only the First N Rows (Especially Useful for Large Files)

Exercise 64: Combining Skiprows and Nrows to Read Data in Small Chunks

Setting the skip_blank_lines Option

Read CSV from a Zip file

Reading from an Excel File Using sheet_name and Handling a Distinct sheet_name

Exercise 65: Reading a General Delimited Text File

Reading HTML Tables Directly from a URL

Exercise 66: Further Wrangling to Get the Desired Data

Exercise 67: Reading from a JSON File

Reading a Stata File

Exercise 68: Reading Tabular Data from a PDF File

Introduction to Beautiful Soup 4 and Web Page Parsing

Structure of HTML

Exercise 69: Reading an HTML file and Extracting its Contents Using BeautifulSoup

Exercise 70: DataFrames and BeautifulSoup

Exercise 71: Exporting a DataFrame as an Excel File

Exercise 72: Stacking URLs from a Document using bs4

Activity 7: Reading Tabular Data from a Web Page and Creating DataFrames

Summary

Chapter 6

Learning the Hidden Secrets of Data Wrangling

Introduction

Additional Software Required for This Section

Advanced List Comprehension and the zip Function

Introduction to Generator Expressions

Exercise 73: Generator Expressions

Exercise 74: One-Liner Generator Expression

Exercise 75: Extracting a List with Single Words

Exercise 76: The zip Function

Exercise 77: Handling Messy Data

Data Formatting

The % operator

Using the format Function

Exercise 78: Data Representation Using {}

Identify and Clean Outliers

Exercise 79: Outliers in Numerical Data

Z-score

Exercise 80: The Z-Score Value to Remove Outliers

Exercise 81: Fuzzy Matching of Strings

Activity 8: Handling Outliers and Missing Data

Summary

Chapter 7

Advanced Web Scraping and Data Gathering

Introduction

The Basics of Web Scraping and the Beautiful Soup Library

Libraries in Python

Exercise 81: Using the Requests Library to Get a Response from the Wikipedia Home Page

Exercise 82: Checking the Status of the Web Request

Checking the Encoding of the Web Page

Exercise 83: Creating a Function to Decode the Contents of the Response and Check its Length

Exercise 84: Extracting Human-Readable Text From a BeautifulSoup Object

Extracting Text from a Section

Extracting Important Historical Events that Happened on Today's Date

Exercise 85: Using Advanced BS4 Techniques to Extract Relevant Text

Exercise 86: Creating a Compact Function to Extract the "On this Day" Text from the Wikipedia Home Page

Reading Data from XML

Exercise 87: Creating an XML File and Reading XML Element Objects

Exercise 88: Finding Various Elements of Data within a Tree (Element)

Reading from a Local XML File into an ElementTree Object

Exercise 89: Traversing the Tree, Finding the Root, and Exploring all Child Nodes and their Tags and Attributes

Exercise 90: Using the text Method to Extract Meaningful Data

Extracting and Printing the GDP/Per Capita Information Using a Loop

Exercise 91: Finding All the Neighboring Countries for each Country and Printing Them

Exercise 92: A Simple Demo of Using XML Data Obtained by Web Scraping

Reading Data from an API

Defining the Base URL (or API Endpoint)

Exercise 93: Defining and Testing a Function to Pull Country Data from an API

Using the Built-In JSON Library to Read and Examine Data

Printing All the Data Elements

Using a Function that Extracts a DataFrame Containing Key Information

Exercise 94: Testing the Function by Building a Small Database of Countries' Information

Fundamentals of Regular Expressions (RegEx)

Regex in the Context of Web Scraping

Exercise 95: Using the match Method to Check Whether a Pattern matches a String/Sequence

Using the Compile Method to Create a Regex Program

Exercise 96: Compiling Programs to Match Objects

Exercise 97: Using Additional Parameters in Match to Check for Positional Matching

Finding the Number of Words in a List That End with "ing"

Exercise 98: The search Method in Regex

Exercise 99: Using the span Method of the Match Object to Locate the Position of the Matched Pattern

Exercise 100: Examples of Single Character Pattern Matching with search

Exercise 101: Examples of Pattern Matching at the Start or End of a String

Exercise 102: Examples of Pattern Matching with Multiple Characters

Exercise 103: Greedy versus Non-Greedy Matching

Exercise 104: Controlling Repetitions to Match

Exercise 105: Sets of Matching Characters

Exercise 106: The use of OR in Regex using the OR Operator

The findall Method

Activity 9: Extracting the Top 100 eBooks from Gutenberg

Activity 10: Building Your Own Movie Database by Reading an API

Summary

Chapter 8

RDBMS and SQL

Introduction

Refresher of RDBMS and SQL

How is an RDBMS Structured?

SQL

Using an RDBMS (MySQL/PostgreSQL/SQLite)

Exercise 107: Connecting to Database in SQLite

Exercise 108: DDL and DML Commands in SQLite

Reading Data from a Database in SQLite

Exercise 109: Sorting Values that are Present in the Database

Exercise 110: Altering the Structure of a Table and Updating the New Fields

Exercise 111: Grouping Values in Tables

Relation Mapping in Databases

Adding Rows in the comments Table

Joins

Retrieving Specific Columns from a JOIN query

Exercise 112: Deleting Rows

Updating Specific Values in a Table

Exercise 113: RDBMS and DataFrames

Activity 11: Retrieving Data Correctly From Databases

Summary

Chapter 9

Application of Data Wrangling in Real Life

Introduction

Applying Your Knowledge to a Real-life Data Wrangling Task

Activity 12: Data Wrangling Task – Fixing UN Data

Activity 13: Data Wrangling Task – Cleaning GDP Data

Activity 14: Data Wrangling Task – Merging UN Data and GDP Data

Activity 15: Data Wrangling Task – Connecting the New Data to the Database

An Extension to Data Wrangling

Additional Skills Required to Become a Data Scientist

Basic Familiarity with Big Data and Cloud Technologies

What Goes with Data Wrangling?

Tips and Tricks for Mastering Machine Learning

Summary

Appendix

Solution of Activity 1: Handling Lists

Solution of Activity 2: Analyze a Multiline String and Generate the Unique Word Count

Solution of Activity 3: Permutation, Iterator, Lambda, List

Solution of Activity 4: Design Your Own CSV Parser

Solution of Activity 5: Generating Statistics from a CSV File

Solution of Activity 6: Working with the Adult Income Dataset (UCI)

Solution of Activity 7: Reading Tabular Data from a Web Page and Creating DataFrames

Solution of Activity 8: Handling Outliers and Missing Data

Solution of Activity 9: Extracting the Top 100 eBooks from Gutenberg

Solution of Activity 10: Building Your Own Movie Database by Reading an API

Solution of Activity 11: Retrieving Data Correctly from Databases

Solution of Activity 12: Data Wrangling Task – Fixing UN Data

Solution of Activity 13: Data Wrangling Task – Cleaning GDP Data

Solution of Activity 14: Data Wrangling Task – Merging UN Data and GDP Data

Solution of Activity 15: Data Wrangling Task – Connecting the New Data to the Database
