Apache Spark 2.x for Java Developers (e-book)

Authors: Sourav Gulati, Sumit Kumar

Publisher: Packt Publishing

Publication date: 2017-07-26

Word count: 421,000

Unleash the data processing and analytics capability of Apache Spark with the language of choice: Java

About This Book

  • Perform big data processing with Spark, without having to learn Scala!
  • Use the Spark Java API to implement efficient enterprise-grade applications for data processing and analytics
  • Go beyond mainstream data processing by adding querying capability, Machine Learning, and graph processing using Spark

Who This Book Is For

If you are a Java developer interested in learning to use the popular Apache Spark framework, this book is the resource you need to get started. Apache Spark developers who are looking to build enterprise-grade applications in Java will also find this book very useful.

What You Will Learn

  • Process data in different file formats such as XML, JSON, CSV, and plain and delimited text, using the Spark core library
  • Perform analytics on data from various sources such as Kafka and Flume, using the Spark Streaming library
  • Learn SQL schema creation and the analysis of structured data using various SQL functions, including windowing functions, in the Spark SQL library
  • Explore Spark MLlib APIs while implementing Machine Learning techniques to solve real-world problems
  • Get to know Spark GraphX so you understand the various graph-based analytics that can be performed with Spark

In Detail

Apache Spark is the buzzword in the big data industry right now, especially with the increasing need for real-time streaming and data processing. While Spark is built on Scala, the Spark Java API exposes all the Spark features available in the Scala version for Java developers. This book will show you how you can implement various functionalities of the Apache Spark framework in Java, without stepping out of your comfort zone.

The book starts with an introduction to the Apache Spark 2.x ecosystem, then explains how to install and configure Spark, and refreshes the Java concepts that will be useful to you when consuming Apache Spark's APIs. You will explore RDD and its associated common Action and Transformation Java APIs, set up a production-like clustered environment, and work with Spark SQL. Moving on, you will perform near-real-time processing with Spark Streaming, Machine Learning analytics with Spark MLlib, and graph processing with GraphX, all using various Java packages. By the end of the book, you will have a solid foundation in implementing components in the Spark framework in Java to build fast, real-time applications.

Style and approach

This practical guide teaches readers the fundamentals of the Apache Spark framework and how to implement components using the Java language. It is a unique blend of theory and practical examples, and is written in a way that will gradually build your knowledge of Apache Spark.
Table of Contents

Title Page

Copyright

Apache Spark 2.x for Java Developers

Credits

Foreword

About the Authors

About the Reviewer

www.PacktPub.com

Why subscribe?

Customer Feedback

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

Introduction to Spark

Dimensions of big data

What makes Hadoop so revolutionary?

Defining HDFS

NameNode

HDFS I/O

YARN

Processing the flow of application submission in YARN

Overview of MapReduce

Why Apache Spark?

RDD - the first citizen of Spark

Operations on RDD

Lazy evaluation

Benefits of RDD

Exploring the Spark ecosystem

What's new in Spark 2.X?

References

Summary

Revisiting Java

Why use Java for Spark?

Generics

Creating your own generic type

Interfaces

Static method in an interface

Default method in interface

What if a class implements two interfaces which have default methods with same name and signature?

Anonymous inner classes

Lambda expressions

Functional interface

Syntax of Lambda expressions

Lexical scoping

Method reference

Understanding closures

Streams

Generating streams

Intermediate operations

Working with intermediate operations

Terminal operations

Working with terminal operations

String collectors

Collection collectors

Map collectors

Groupings

Partitioning

Matching

Finding elements

Summary

Let Us Spark

Getting started with Spark

Spark REPL also known as CLI

Some basic exercises using Spark shell

Checking Spark version

Creating and filtering RDD

Word count on RDD

Finding the sum of all even numbers in an RDD of integers

Counting the number of words in a file

Spark components

Spark Driver Web UI

Jobs

Stages

Storage

Environment

Executors

SQL

Streaming

Spark job configuration and submission

Spark REST APIs

Summary

Understanding the Spark Programming Model

Hello Spark

Prerequisites

Common RDD transformations

Map

Filter

flatMap

mapToPair

flatMapToPair

union

Intersection

Distinct

Cartesian

groupByKey

reduceByKey

sortByKey

Join

CoGroup

Common RDD actions

isEmpty

collect

collectAsMap

count

countByKey

countByValue

Max

Min

First

Take

takeOrdered

takeSample

top

reduce

Fold

aggregate

forEach

saveAsTextFile

saveAsObjectFile

RDD persistence and cache

Summary

Working with Data and Storage

Interaction with external storage systems

Interaction with local filesystem

Interaction with Amazon S3

Interaction with HDFS

Interaction with Cassandra

Working with different data formats

Plain and specially formatted text

Working with CSV data

Working with JSON data

Working with XML Data

References

Summary

Spark on Cluster

Spark application in distributed-mode

Driver program

Executor program

Cluster managers

Spark standalone

Installation of Spark standalone cluster

Start master

Start slave

Stop master and slaves

Deploying applications on Spark standalone cluster

Client mode

Cluster mode

Useful job configurations

Useful cluster level configurations (Spark standalone)

Yet Another Resource Negotiator (YARN)

YARN client

YARN cluster

Useful job configuration

Summary

Spark Programming Model - Advanced

RDD partitioning

Repartitioning

How Spark calculates the partition count for transformations with shuffling (wide transformations)

Partitioner

Hash Partitioner

Range Partitioner

Custom Partitioner

Advanced transformations

mapPartitions

mapPartitionsWithIndex

mapPartitionsToPair

mapValues

flatMapValues

repartitionAndSortWithinPartitions

coalesce

foldByKey

aggregateByKey

combineByKey

Advanced actions

Approximate actions

Asynchronous actions

Miscellaneous actions

Shared variable

Broadcast variable

Properties of the broadcast variable

Lifecycle of a broadcast variable

Map-side join using broadcast variable

Accumulators

Driver program

Summary

Working with Spark SQL

SQLContext and HiveContext

Initializing SparkSession

Reading CSV using SparkSession

Dataframe and dataset

SchemaRDD

Dataframe

Dataset

Creating a dataset using encoders

Creating a dataset using StructType

Unified dataframe and dataset API

Data persistence

Spark SQL operations

Untyped dataset operation

Temporary view

Global temporary view

Spark UDF

Spark UDAF

Untyped UDAF

Type-safe UDAF

Hive integration

Table Persistence

Summary

Near Real-Time Processing with Spark Streaming

Introducing Spark Streaming

Understanding micro batching

Getting started with Spark Streaming jobs

Streaming sources

fileStream

Kafka

Streaming transformations

Stateless transformation

Stateful transformation

Checkpointing

Windowing

Transform operation

Fault tolerance and reliability

Data receiver stage

File streams

Advanced streaming sources

Transformation stage

Output stage

Structured Streaming

Recap of the use case

Structured streaming - programming model

Built-in input sources and sinks

Input sources

Built-in Sinks

Summary

Machine Learning Analytics with Spark MLlib

Introduction to machine learning

Concepts of machine learning

Datatypes

Machine learning work flow

Pipelines

Operations on feature vectors

Feature extractors

Feature transformers

Feature selectors

Summary

Learning Spark GraphX

Introduction to GraphX

Introduction to Property Graph

Getting started with the GraphX API

Using vertex and edge RDDs

From edges

EdgeTriplet

Graph operations

mapVertices

mapEdges

mapTriplets

reverse

subgraph

aggregateMessages

outerJoinVertices

Graph algorithms

PageRank

Static PageRank

Dynamic PageRank

Triangle counting

Connected components

Summary
