Big Data Analysis with Python (e-book)

Author: Ivan Marin

Publisher: Packt Publishing

Publication date: 2019-04-10

Word count: 5.053 million

Category: Imported Books > Foreign-Language Originals > Computers/Internet

Get to grips with processing large volumes of data and presenting it as engaging, interactive insights using Spark and Python.

Key Features
* Get a hands-on, fast-paced introduction to the Python data science stack
* Explore ways to create useful metrics and statistics from large datasets
* Create detailed analysis reports with real-world data

Book Description
Processing big data in real time is challenging due to scalability, information inconsistency, and fault tolerance. Big Data Analysis with Python teaches you how to use tools that can control this data avalanche for you. With this book, you'll learn practical techniques to aggregate data into useful dimensions for subsequent analysis, extract statistical measurements, and transform datasets into features for other systems.

The book begins with an introduction to data manipulation in Python using pandas. You'll then get familiar with statistical analysis and plotting techniques. With multiple hands-on activities in store, you'll be able to analyze data that is distributed across several computers by using Dask. As you progress, you'll study how to aggregate data for plots when the entire dataset cannot fit in memory. You'll also explore Hadoop (HDFS and YARN), which will help you tackle larger datasets. The book also covers Spark and explains how it interacts with other tools. By the end of this book, you'll be able to bootstrap your own Python environment, process large files, and manipulate data to generate statistics, metrics, and graphs.

What you will learn
* Use Python to read and transform data into different formats
* Generate basic statistics and metrics using data on disk
* Work with computing tasks distributed over a cluster
* Convert data from various sources into storage or querying formats
* Prepare data for statistical analysis, visualization, and machine learning
* Present data in the form of effective visuals

Who this book is for
Big Data Analysis with Python is designed for Python developers, data analysts, and data scientists who want to get hands-on with methods to control data and transform it into impactful insights. Basic knowledge of statistical measurements and relational databases will help you to understand the various concepts explained in this book.
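
The description mentions generating statistics from data on disk and aggregating data that does not fit in memory. The following is a minimal sketch of that general idea, not code from the book: it streams a CSV one row at a time and keeps only running totals per group. The function name `streaming_group_mean`, the column names, and the sample data are all hypothetical, and only the Python standard library is used. pandas offers a comparable pattern through the `chunksize` parameter of `read_csv`.

```python
import csv
import io
from collections import defaultdict


def streaming_group_mean(lines, key_column, value_column):
    """Compute the mean of value_column per key_column, one row at a time.

    Only two small dictionaries are kept in memory, so the input can be
    far larger than RAM as long as the number of distinct keys is small.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(lines):
        key = row[key_column]
        totals[key] += float(row[value_column])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical sample data standing in for a file too large to load at once.
SAMPLE_CSV = """city,temperature
Lisbon,20.0
Oslo,4.0
Lisbon,24.0
Oslo,6.0
"""

means = streaming_group_mean(io.StringIO(SAMPLE_CSV), "city", "temperature")
print(means)
```

With a real file, an open file handle can be passed in place of the `io.StringIO` object, since `csv.DictReader` accepts any iterable of lines.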
Table of Contents

Preface

About the Book

About the Authors

Learning Objectives

Approach

Audience

Minimum Hardware Requirements

Software Requirements

Conventions

Installation and Setup

Installing the Code Bundle

Additional Resources

Chapter 1

The Python Data Science Stack

Introduction

Python Libraries and Packages

IPython: A Powerful Interactive Shell

Exercise 1: Interacting with the Python Shell Using the IPython Commands

The Jupyter Notebook

Exercise 2: Getting Started with the Jupyter Notebook

IPython or Jupyter?

Activity 1: IPython and Jupyter

NumPy

SciPy

Matplotlib

Pandas

Using Pandas

Reading Data

Exercise 3: Reading Data with Pandas

Data Manipulation

Selection and Filtering

Selecting Rows Using Slicing

Exercise 4: Data Selection and the .loc Method

Applying a Function to a Column

Activity 2: Working with Data Problems

Data Type Conversion

Exercise 5: Exploring Data Types

Aggregation and Grouping

Exercise 6: Aggregation and Grouping Data

NumPy on Pandas

Exporting Data from Pandas

Exercise 7: Exporting Data in Different Formats

Visualization with Pandas

Activity 3: Plotting Data with Pandas

Summary

Chapter 2

Statistical Visualizations

Introduction

Types of Graphs and When to Use Them

Exercise 8: Plotting an Analytical Function

Components of a Graph

Exercise 9: Creating a Graph

Exercise 10: Creating a Graph for a Mathematical Function

Seaborn

Which Tool Should Be Used?

Types of Graphs

Line Graphs

Time Series Plots

Exercise 11: Creating Line Graphs Using Different Libraries

Pandas DataFrames and Grouped Data

Activity 4: Line Graphs with the Object-Oriented API and Pandas DataFrames

Scatter Plots

Activity 5: Understanding Relationships of Variables Using Scatter Plots

Histograms

Exercise 12: Creating a Histogram of Horsepower Distribution

Boxplots

Exercise 13: Analyzing the Behavior of the Number of Cylinders and Horsepower Using a Boxplot

Changing Plot Design: Modifying Graph Components

Title and Label Configuration for Axis Objects

Exercise 14: Configuring a Title and Labels for Axis Objects

Line Styles and Color

Figure Size

Exercise 15: Working with Matplotlib Style Sheets

Exporting Graphs

Activity 6: Exporting a Graph to a File on Disk

Activity 7: Complete Plot Design

Summary

Chapter 3

Working with Big Data Frameworks

Introduction

Hadoop

Manipulating Data with the HDFS

Exercise 16: Manipulating Files in the HDFS

Spark

Spark SQL and Pandas DataFrames

Exercise 17: Performing DataFrame Operations in Spark

Exercise 18: Accessing Data with Spark

Exercise 19: Reading Data from the Local Filesystem and the HDFS

Exercise 20: Writing Data Back to the HDFS and PostgreSQL

Writing Parquet Files

Exercise 21: Writing Parquet Files

Increasing Analysis Performance with Parquet and Partitions

Exercise 22: Creating a Partitioned Dataset

Handling Unstructured Data

Exercise 23: Parsing Text and Cleaning

Activity 8: Removing Stop Words from Text

Summary

Chapter 4

Diving Deeper with Spark

Introduction

Getting Started with Spark DataFrames

Exercise 24: Specifying the Schema of a DataFrame

Exercise 25: Creating a DataFrame from an Existing RDD

Exercise 26: Creating a DataFrame Using a CSV File

Writing Output from Spark DataFrames

Exercise 27: Converting a Spark DataFrame to a Pandas DataFrame

Exploring Spark DataFrames

Exercise 28: Displaying Basic DataFrame Statistics

Activity 9: Getting Started with Spark DataFrames

Data Manipulation with Spark DataFrames

Exercise 29: Selecting and Renaming Columns from the DataFrame

Exercise 30: Adding and Removing a Column from the DataFrame

Exercise 31: Displaying and Counting Distinct Values in a DataFrame

Exercise 32: Removing Duplicate Rows and Filtering Rows of a DataFrame

Exercise 33: Ordering Rows in a DataFrame

Exercise 34: Aggregating Values in a DataFrame

Activity 10: Data Manipulation with Spark DataFrames

Graphs in Spark

Exercise 35: Creating a Bar Chart

Exercise 36: Creating a Linear Model Plot

Exercise 37: Creating a KDE Plot and a Boxplot

Activity 11: Graphs in Spark

Summary

Chapter 5

Handling Missing Values and Correlation Analysis

Introduction

Setting up the Jupyter Notebook

Missing Values

Exercise 38: Counting Missing Values in a DataFrame

Exercise 39: Counting Missing Values in All DataFrame Columns

Fetching Missing Value Records from the DataFrame

Handling Missing Values in Spark DataFrames

Exercise 40: Removing Records with Missing Values from a DataFrame

Exercise 41: Filling Missing Values with a Constant in a DataFrame Column

Correlation

Exercise 42: Computing Correlation

Activity 12: Missing Value Handling and Correlation Analysis with PySpark DataFrames

Summary

Chapter 6

Exploratory Data Analysis

Introduction

Defining a Business Problem

Problem Identification

Requirement Gathering

Data Pipeline and Workflow

Identifying Measurable Metrics

Documentation and Presentation

Translating a Business Problem into Measurable Metrics and Exploratory Data Analysis (EDA)

Data Gathering

Analysis of Data Generation

KPI Visualization

Feature Importance

Exercise 43: Identify the Target Variable and Related KPIs from the Given Data for the Business Problem

Exercise 44: Generate the Feature Importance of the Target Variable and Carry Out EDA

Structured Approach to the Data Science Project Life Cycle

Data Science Project Life Cycle Phases

Phase 1: Understanding and Defining the Business Problem

Phase 2: Data Access and Discovery

Phase 3: Data Engineering and Pre-processing

Activity 13: Carry Out Mapping to Gaussian Distribution of Numeric Features from the Given Data

Phase 4: Model Development

Summary

Chapter 7

Reproducibility in Big Data Analysis

Introduction

Reproducibility with Jupyter Notebooks

Introduction to the Business Problem

Documenting the Approach and Workflows

Explaining the Data Pipeline

Explain the Dependencies

Using Source Code Version Control

Modularizing the Process

Gathering Data in a Reproducible Way

Functionalities in Markdown and Code Cells

Explaining the Business Problem in the Markdown

Providing a Detailed Introduction to the Data Source

Explain the Data Attributes in the Markdown

Exercise 45: Performing Data Reproducibility

Code Practices and Standards

Environment Documentation

Writing Readable Code with Comments

Effective Segmentation of Workflows

Workflow Documentation

Exercise 46: Missing Value Preprocessing with High Reproducibility

Avoiding Repetition

Using Functions and Loops for Optimizing Code

Developing Libraries/Packages for Code/Algorithm Reuse

Activity 14: Carry Out Normalisation of Data

Summary

Chapter 8

Creating a Full Analysis Report

Introduction

Reading Data in Spark from Different Data Sources

Exercise 47: Reading Data from a CSV File Using the PySpark Object

Reading JSON Data Using the PySpark Object

SQL Operations on a Spark DataFrame

Exercise 48: Reading Data in PySpark and Carrying Out SQL Operations

Exercise 49: Creating and Merging Two DataFrames

Exercise 50: Subsetting the DataFrame

Generating Statistical Measurements

Activity 15: Generating Visualization Using Plotly

Summary

Appendix

Chapter 01: The Python Data Science Stack

Activity 1: IPython and Jupyter

Activity 2: Working with Data Problems

Activity 3: Plotting Data with Pandas

Chapter 02: Statistical Visualizations Using Matplotlib and Seaborn

Activity 4: Line Graphs with the Object-Oriented API and Pandas DataFrames

Activity 5: Understanding Relationships of Variables Using Scatter Plots

Activity 6: Exporting a Graph to a File on Disk

Activity 7: Complete Plot Design

Chapter 03: Working with Big Data Frameworks

Activity 8: Parsing Text

Chapter 04: Diving Deeper with Spark

Activity 9: Getting Started with Spark DataFrames

Activity 10: Data Manipulation with Spark DataFrames

Activity 11: Graphs in Spark

Chapter 05: Missing Value Handling and Correlation Analysis in Spark

Activity 12: Missing Value Handling and Correlation Analysis with PySpark DataFrames

Chapter 06: Business Process Definition and Exploratory Data Analysis

Activity 13: Carry Out Mapping to Gaussian Distribution of Numeric Features from the Given Data

Chapter 07: Reproducibility in Big Data Analysis

Activity 14: Test normality of data attributes (columns) and carry out Gaussian normalization of non-normally distributed attributes

Chapter 08: Creating a Full Analysis Report

Activity 15: Generating Visualization Using Plotly
