Apache Spark Quick Start Guide (e-book)

Author: Shrey Mehrotra

Publisher: Packt Publishing

Publication date: 2019-01-31

Word count: 159,000

A practical guide to solving complex data processing challenges by applying the best optimization techniques in Apache Spark.

Key Features

  • Learn about the core concepts and the latest developments in Apache Spark
  • Master writing efficient big data applications with Spark's built-in modules for SQL, streaming, machine learning, and graph analysis
  • Get introduced to a variety of optimizations based on actual experience

Book Description

Apache Spark is a flexible framework for processing both batch and real-time data. Its unified engine has made it popular for big data use cases. This book helps you get started with Apache Spark 2.0, one of the most popular big data processing frameworks, and write big data applications for a variety of use cases. Although it is intended as a getting-started guide, it also explains the core concepts.

This practical guide provides a quick start to the Spark 2.0 architecture and its components. It teaches you how to set up Spark on your local machine. You are then introduced to resilient distributed datasets (RDDs) and the DataFrame APIs, along with their corresponding transformations and actions. Next, the book covers the life cycle of a Spark application and the techniques used to debug slow-running applications. You also go through Spark's built-in modules for SQL, streaming, machine learning, and graph analysis. Finally, the book lays out the best practices and optimization techniques that are key to writing efficient Spark applications.

By the end of this book, you will have a sound fundamental understanding of the Apache Spark framework and be able to write and optimize Spark applications.

What you will learn

  • Learn core concepts such as RDDs, DataFrames, transformations, and more
  • Set up a Spark development environment
  • Choose the right APIs for your applications
  • Understand Spark's architecture and the execution flow of a Spark application
  • Explore built-in modules for SQL, streaming, ML, and graph analysis
  • Optimize your Spark jobs for better performance

Who this book is for

If you are a big data enthusiast and love processing huge amounts of data, this book is for you. If you are a data engineer looking for the best optimization techniques for your Spark applications, you will find this book helpful. It also helps data scientists who want to implement their machine learning algorithms in Spark. You need a basic understanding of at least one programming language such as Scala, Python, or Java.

Table of contents

Title Page

Copyright and Credits

Apache Spark Quick Start Guide

About Packt

Why subscribe?

Packt.com

Contributors

About the authors

About the reviewer

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Introduction to Apache Spark

What is Spark?

Spark architecture overview

Spark language APIs

Scala

Java

Python

R

SQL

Spark components

Spark Core

Spark SQL

Spark Streaming

Spark machine learning

Spark graph processing

Cluster manager

Standalone scheduler

YARN

Mesos

Kubernetes

Making the most of Hadoop and Spark

Summary

Apache Spark Installation

AWS elastic compute cloud (EC2)

Creating a free account on AWS

Connecting to your Linux instance

Configuring Spark

Prerequisites

Installing Java

Installing Scala

Installing Python

Installing Spark

Using Spark components

Different modes of execution

Spark sandbox

Summary

Spark RDD

What is an RDD?

Resilient metadata

Programming using RDDs

Transformations and actions

Transformation

Narrow transformations

map()

flatMap()

filter()

union()

mapPartitions()

Wide transformations

distinct()

sortBy()

intersection()

subtract()

cartesian()

Action

collect()

count()

take()

top()

takeOrdered()

first()

countByValue()

reduce()

saveAsTextFile()

foreach()

Types of RDDs

Pair RDDs

groupByKey()

reduceByKey()

sortByKey()

join()

Caching and checkpointing

Caching

Checkpointing

Understanding partitions

repartition() versus coalesce()

partitionBy()

Drawbacks of using RDDs

Summary

Spark DataFrame and Dataset

DataFrames

Creating DataFrames

Data sources

DataFrame operations and associated functions

Running SQL on DataFrames

Temporary views on DataFrames

Global temporary views on DataFrames

Datasets

Encoders

Internal row

Creating custom encoders

Summary

Spark Architecture and Application Execution Flow

A sample application

DAG constructor

Stage

Tasks

Task scheduler

FIFO

FAIR

Application execution modes

Local mode

Client mode

Cluster mode

Application monitoring

Spark UI

Application logs

External monitoring solution

Summary

Spark SQL

Spark SQL

Spark metastore

Using the Hive metastore in Spark SQL

Hive configuration with Spark

SQL language manual

Database

Table and view

Load data

Creating UDFs

SQL database using JDBC

Summary

Spark Streaming, Machine Learning, and Graph Analysis

Spark Streaming

Use cases

Data sources

Stream processing

Microbatch

DStreams

Streaming architecture

Streaming example

Machine learning

MLlib

ML

Graph processing

GraphX

mapVertices

mapEdges

subgraph

GraphFrames

degrees

subgraphs

Graph algorithms

PageRank

Summary

Spark Optimizations

Cluster-level optimizations

Memory

Disk

CPU cores

Project Tungsten

Application optimizations

Language choice

Structured versus unstructured APIs

File format choice

RDD optimizations

Choosing the right transformations

Serializing and compressing

Broadcast variables

DataFrame and dataset optimizations

Catalyst optimizer

Storage

Parallelism

Join performance

Code generation

Speculative execution

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think
