About Packt
Why subscribe?
Packt.com
Contributors
About the authors
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Installing PySpark and Setting Up Your Development Environment
An overview of PySpark
Spark SQL
Setting up Spark on Windows and PySpark
Core concepts in Spark and PySpark
SparkContext
Spark shell
SparkConf
Summary
Getting Your Big Data into the Spark Environment Using RDDs
Loading data onto Spark RDDs
The UCI machine learning repository
Getting the data from the repository to Spark
Getting data into Spark
Parallelization with Spark RDDs
What is parallelization?
Basics of RDD operation
Summary
Big Data Cleaning and Wrangling with Spark Notebooks
Using Spark Notebooks for quick iteration of ideas
Sampling/filtering RDDs to pick out relevant data points
Splitting datasets and creating some new combinations
Summary
Aggregating and Summarizing Data into Useful Reports
Calculating averages with map and reduce
Faster average computations with aggregate
Pivot tabling with key-value paired data points
Summary
Powerful Exploratory Data Analysis with MLlib
Computing summary statistics with MLlib
Using Pearson and Spearman correlations to discover correlations
The Pearson correlation
The Spearman correlation
Computing Pearson and Spearman correlations
Testing our hypotheses on large datasets
Summary
Putting Structure on Your Big Data with Spark SQL
Manipulating DataFrames with Spark SQL schemas
Using Spark DSL to build queries
Summary
Transformations and Actions
Using Spark transformations to defer computations to a later time
Avoiding transformations
Using the reduce and reduceByKey methods to calculate the results
Performing actions that trigger computations
Reusing the same RDD for different actions
Summary
Immutable Design
Delving into the Spark RDD's parent/child chain
Extending an RDD
Chaining a new RDD with the parent
Testing our custom RDD
Using RDD in an immutable way
Using DataFrame operations to transform
Immutability in the highly concurrent environment
Using the Dataset API in an immutable way
Summary
Avoiding Shuffle and Reducing Operational Expenses
Detecting a shuffle in a process
Testing operations that cause a shuffle in Apache Spark
Changing the design of jobs with wide dependencies
Using keyBy() operations to reduce shuffle
Using a custom partitioner to reduce shuffle
Summary
Saving Data in the Correct Format
Saving data in plain text format
Leveraging JSON as a data format
Tabular formats – CSV
Using Avro with Spark
Columnar formats – Parquet
Summary
Working with the Spark Key/Value API
Available actions on key/value pairs
Using aggregateByKey instead of groupBy()
Actions on key/value pairs
Available partitioners on key/value data
Implementing a custom partitioner
Summary
Testing Apache Spark Jobs
Separating logic from Spark engine – unit testing
Integration testing using SparkSession
Mocking data sources using partial functions
Using ScalaCheck for property-based testing
Testing in different versions of Spark
Summary
Leveraging the Spark GraphX API
Creating a graph from a data source
Creating the loader component
Revisiting the graph format
Loading Spark from file
Using the Vertex API
Constructing a graph using the vertex
Creating couple relationships
Using the Edge API
Constructing the graph using edge
Calculating the degree of the vertex
The in-degree
The out-degree
Calculating PageRank
Loading and reloading data about users and followers
Summary
Other Books You May Enjoy
Leave a review - let other readers know what you think