Frank Kane's Taming Big Data with Apache Spark and Python (eBook)


Author: Frank Kane

Publisher: Packt Publishing

Publication date: 2017-07-07

Word count: approx. 306,000

Category: Imported Books > Foreign-Language Originals > Computers/Internet


Book Description
Frank Kane's hands-on Spark training course, based on his bestselling Taming Big Data with Apache Spark and Python video, is now available as a book. Understand and analyze large data sets using Spark on a single system or on a cluster.

About This Book

  • Understand how Spark can be distributed across computing clusters
  • Develop and run Spark jobs efficiently using Python
  • A hands-on tutorial by Frank Kane, with over 15 real-world examples teaching you Big Data processing with Spark

Who This Book Is For

If you are a data scientist or data analyst who wants to learn Big Data processing using Apache Spark and Python, this book is for you. If you have some programming experience in Python and want to learn how to process large amounts of data using Apache Spark, Frank Kane's Taming Big Data with Apache Spark and Python will also help you.

What You Will Learn

  • Find out how you can identify Big Data problems as Spark problems
  • Install and run Apache Spark on your computer or on a cluster
  • Analyze large data sets across many CPUs using Spark's Resilient Distributed Datasets
  • Implement machine learning on Spark using the MLlib library
  • Process continuous streams of data in real time using the Spark Streaming module
  • Perform complex network analysis using Spark's GraphX library
  • Use Amazon's Elastic MapReduce service to run your Spark jobs on a cluster

In Detail

Frank Kane's Taming Big Data with Apache Spark and Python is your companion to learning Apache Spark in a hands-on manner. Frank will start you off by teaching you how to set up Spark on a single system or on a cluster, and you'll soon move on to analyzing large data sets using Spark RDDs and developing and running effective Spark jobs quickly using Python.

Apache Spark has emerged as the next big thing in the Big Data domain, quickly rising from an ascending technology to an established superstar in just a matter of years. Spark allows you to quickly extract actionable insights from large amounts of data on a real-time basis, making it an essential tool in many modern businesses.

Frank has packed this book with over 15 interactive, fun-filled examples relevant to the real world, and he will empower you to understand the Spark ecosystem and implement production-grade real-time Spark projects with ease.

Style and Approach

Frank Kane's Taming Big Data with Apache Spark and Python is a hands-on tutorial with over 15 real-world examples carefully explained by Frank in a step-by-step manner. The examples vary in complexity, and you can move through them at your own pace.
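To give a sense of the style of code the book teaches, here is a minimal sketch of a first PySpark RDD job along the lines of the ratings histogram example listed in the table of contents below. The input path ml-100k/u.data, the MovieLens-style field layout, and the app name are illustrative assumptions, not the book's exact code.

```python
# A minimal first Spark job in the spirit of a ratings histogram.
# Assumes a MovieLens-style input file with whitespace-separated
# fields: userID, movieID, rating, timestamp (layout is assumed
# here for illustration).
from pyspark import SparkConf, SparkContext

# Configure Spark to run locally under the name "RatingsHistogram",
# then create the SparkContext entry point.
conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
sc = SparkContext(conf=conf)

# Load the raw text file into an RDD, one element per line.
lines = sc.textFile("ml-100k/u.data")

# Transformation: extract the rating column (third field).
ratings = lines.map(lambda line: line.split()[2])

# Action: count how many times each rating value occurs.
result = ratings.countByValue()

# Sort by rating value and print the histogram.
for rating, count in sorted(result.items()):
    print(f"{rating}: {count}")
```

A script like this would typically be launched with spark-submit rather than the plain Python interpreter, so that the Spark runtime is available.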
Table of Contents

Title Page

Copyright

Frank Kane's Taming Big Data with Apache Spark and Python

Credits

About the Author

www.PacktPub.com

Why subscribe?

Customer Feedback

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Getting Started with Spark

Getting set up - installing Python, a JDK, and Spark and its dependencies

Installing Enthought Canopy

Installing the Java Development Kit

Installing Spark

Running Spark code

Installing the MovieLens movie rating dataset

Run your first Spark program - the ratings histogram example

Examining the ratings counter script

Running the ratings counter script

Summary

Spark Basics and Spark Examples

What is Spark?

Spark is scalable

Spark is fast

Spark is hot

Spark is not that hard

Components of Spark

Using Python with Spark

The Resilient Distributed Dataset (RDD)

What is the RDD?

The SparkContext object

Creating RDDs

Transforming RDDs

Map example

RDD actions

Ratings histogram walk-through

Understanding the code

Setting up the SparkContext object

Loading the data

Extract (MAP) the data we care about

Perform an action - count by value

Sort and display the results

Looking at the ratings-counter script in Canopy

Key/value RDDs and the average friends by age example

Key/value concepts - RDDs can hold key/value pairs

Creating a key/value RDD

What can Spark do with key/value data?

Mapping the values of a key/value RDD

The friends by age example

Parsing (mapping) the input data

Counting up the sum of friends and number of entries per age

Compute averages

Collect and display the results

Running the average friends by age example

Examining the script

Running the code

Filtering RDDs and the minimum temperature by location example

What is filter()?

The source data for the minimum temperature by location example

Parse (map) the input data

Filter out all but the TMIN entries

Create (station ID, temperature) key/value pairs

Find minimum temperature by station ID

Collect and print results

Running the minimum temperature example and modifying it for maximums

Examining the min-temperatures script

Running the script

Running the maximum temperature by location example

Counting word occurrences using flatMap()

map() versus flatMap()

map()

flatMap()

Code sample - count the words in a book

Improving the word-count script with regular expressions

Text normalization

Examining the use of regular expressions in the word-count script

Running the code

Sorting the word count results

Step 1 - Implement countByValue() the hard way to create a new RDD

Step 2 - Sort the new RDD

Examining the script

Running the code

Find the total amount spent by customer

Introducing the problem

Strategy for solving the problem

Useful snippets of code

Check your results and sort them by the total amount spent

Check your sorted implementation and results against mine

Summary

Advanced Examples of Spark Programs

Finding the most popular movie

Examining the popular-movies script

Getting results

Using broadcast variables to display movie names instead of ID numbers

Introducing broadcast variables

Examining the popular-movies-nicer.py script

Getting results

Finding the most popular superhero in a social graph

Superhero social networks

Input data format

Strategy

Running the script - discover who the most popular superhero is

Mapping input data to (hero ID, number of co-occurrences) per line

Adding up co-occurrence by hero ID

Flipping the (map) RDD to (number, hero ID)

Using max() and looking up the name of the winner

Getting results

Superhero degrees of separation - introducing the breadth-first search algorithm

Degrees of separation

How does the breadth-first search algorithm work?

The initial condition of our social graph

First pass through the graph

Second pass through the graph

Third pass through the graph

Final pass through the graph

Accumulators and implementing BFS in Spark

Convert the input file into structured data

Writing code to convert Marvel-Graph.txt to BFS nodes

Iteratively process the RDD

Using a mapper and a reducer

How do we know when we're done?

Superhero degrees of separation - review the code and run it

Setting up an accumulator and using the convert to BFS function

Calling flatMap()

Calling an action

Calling reduceByKey

Getting results

Item-based collaborative filtering in Spark, cache(), and persist()

How does item-based collaborative filtering work?

Making item-based collaborative filtering a Spark problem

It's getting real

Caching RDDs

Running the similar-movies script using Spark's cluster manager

Examining the script

Getting results

Improving the quality of the similar movies example

Summary

Running Spark on a Cluster

Introducing Elastic MapReduce

Why use Elastic MapReduce?

Warning - Spark on EMR is not cheap

Setting up our Amazon Web Services / Elastic MapReduce account and PuTTY

Partitioning

Using .partitionBy()

Choosing a partition size

Creating similar movies from one million ratings - part 1

Changes to the script

Creating similar movies from one million ratings - part 2

Our strategy

Specifying memory per executor

Specifying a cluster manager

Running on a cluster

Setting up to run the movie-similarities-1m.py script on a cluster

Preparing the script

Creating a cluster

Connecting to the master node using SSH

Running the code

Creating similar movies from one million ratings – part 3

Assessing the results

Terminating the cluster

Troubleshooting Spark on a cluster

More troubleshooting and managing dependencies

Troubleshooting

Managing dependencies

Summary

SparkSQL, DataFrames, and DataSets

Introducing SparkSQL

Using SparkSQL in Python

More things you can do with DataFrames

Differences between DataFrames and DataSets

Shell access in SparkSQL

User-defined functions (UDFs)

Executing SQL commands and SQL-style functions on a DataFrame

Using SQL-style functions instead of queries

Using DataFrames instead of RDDs

Summary

Other Spark Technologies and Libraries

Introducing MLlib

MLlib capabilities

Special MLlib data types

For more information on machine learning

Making movie recommendations

Using MLlib to produce movie recommendations

Examining the movie-recommendations-als.py script

Analyzing the ALS recommendations results

Why did we get bad results?

Using DataFrames with MLlib

Examining the spark-linear-regression.py script

Getting results

Spark Streaming and GraphX

What is Spark Streaming?

GraphX

Summary

Where to Go From Here? – Learning More About Spark and Data Science
