Scala Data Analysis Cookbook (eBook)

Price: ¥

6 currently reading | 0 reviews | Rating: 9.8

Author: Arun Manivannan

Publisher: Packt Publishing

Publication date: 2015-10-30

Word count: 2.403 million

Category: Imported Books > Foreign-Language Originals > Computers/Internet

Note: This digital product cannot be returned or exchanged; source files are not provided, and export and printing are not supported.


  • Book description
  • Table of contents
  • Reviews (0)
Navigate the world of data analysis, visualization, and machine learning with over 100 hands-on Scala recipes.

About This Book

  • Implement Scala in your data analysis using features from Spark, Breeze, and Zeppelin
  • Scale up your data analytics infrastructure with practical recipes for Scala machine learning
  • Recipes for every stage of the data analysis process, from reading and collecting data to distributed analytics

Who This Book Is For

This book shows data scientists and analysts how to leverage their existing knowledge of Scala for quality and scalable data analysis.

What You Will Learn

  • Familiarize yourself with the Breeze and Spark libraries, set them up, and use their data structures
  • Import data from a host of possible sources and create DataFrames from CSV
  • Clean, validate, and transform data using Scala to pre-process numerical and string data
  • Integrate quintessential machine learning algorithms using the Scala stack
  • Bundle and scale up Spark jobs by deploying them into a variety of cluster managers
  • Run streaming and graph analytics in Spark to visualize data, enabling exploratory analysis

In Detail

This book introduces you to the most popular Scala tools, libraries, and frameworks through practical recipes for loading, manipulating, and preparing your data. It also helps you explore and make sense of your data using stunning and insightful visualizations and machine learning toolkits.

Starting with introductory recipes on utilizing the Breeze and Spark libraries, you will get to grips with importing data from a host of possible sources and pre-processing numerical, string, and date data. Next, you will learn the concepts that help you visualize data using the Apache Zeppelin and Bokeh bindings in Scala, enabling exploratory data analysis. Discover how to program quintessential machine learning algorithms using the Spark ML library. Work through steps to scale your machine learning models and deploy them into a standalone cluster, EC2, YARN, and Mesos. Finally, dip into the powerful options presented by Spark Streaming and machine learning for streaming data, as well as utilizing Spark GraphX.

Style and Approach

This book contains a rich set of recipes that covers the full spectrum of interesting data analysis tasks and will help you revolutionize your data analysis skills using Scala and Spark.
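To give a flavor of the recipe format, here is a minimal sketch of one of the themes listed in the table of contents below: creating a Spark DataFrame from CSV via Scala case classes. It assumes Spark 1.x (contemporary with this 2015 title), a local master, and a hypothetical headerless students.csv whose lines look like "1,John"; it is an illustrative sketch, not code taken from the book.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type matching a headerless students.csv ("1,John")
case class Student(id: Int, name: String)

object CsvToDataFrame {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("csv-to-dataframe").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Parse each line into a case class, then lift the RDD into a DataFrame;
    // the schema (id: Int, name: String) is inferred from the case class fields
    val students = sc.textFile("students.csv")
      .map(_.split(","))
      .map(f => Student(f(0).trim.toInt, f(1).trim))
      .toDF()

    students.printSchema()
    students.filter($"id" > 1).show() // SQL-style column expressions

    sc.stop()
  }
}

The case-class route shown here was the common Spark 1.x idiom; the built-in CSV shortcut on the DataFrame reader only arrived later, in Spark 2.x.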
Table of Contents

Scala Data Analysis Cookbook

Table of Contents

Scala Data Analysis Cookbook

Credits

About the Author

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers, and more

Why Subscribe?

Free Access for Packt account holders

Preface

Apache Flink

Scalding

Saddle

Spire

Akka

Accord

What this book covers

What you need for this book

Who this book is for

Sections

Getting ready

How to do it…

How it works…

There's more…

See also

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Getting Started with Breeze

Introduction

Getting Breeze – the linear algebra library

How to do it...

There's more...

The org.scalanlp.breeze dependency

The org.scalanlp.breeze-natives package

Working with vectors

Getting ready

How to do it...

Creating vectors

Constructing a vector from values

Creating a zero vector

Creating a vector out of a function

Creating a vector of linearly spaced values

Creating a vector with values in a specific range

Creating an entire vector with a single value

Slicing a sub-vector from a bigger vector

Creating a Breeze Vector from a Scala Vector

Vector arithmetic

Scalar operations

Calculating the dot product of two vectors

Creating a new vector by adding two vectors together

Appending vectors and converting a vector of one type to another

Concatenating two vectors

Converting a vector of Int to a vector of Double

Computing basic statistics

Mean and variance

Standard deviation

Finding the largest value in a vector

Finding the sum, square root and log of all the values in the vector

The Sqrt function

The Log function

Working with matrices

How to do it...

Creating matrices

Creating a matrix from values

Creating a zero matrix

Creating a matrix out of a function

Creating an identity matrix

Creating a matrix from random numbers

Creating from a Scala collection

Matrix arithmetic

Addition

Multiplication

Appending and conversion

Concatenating matrices – vertically

Concatenating matrices – horizontally

Converting a matrix of Int to a matrix of Double

Data manipulation operations

Getting column vectors out of the matrix

Getting row vectors out of the matrix

Getting values inside the matrix

Getting the inverse and transpose of a matrix

Computing basic statistics

Mean and variance

Standard deviation

Finding the largest value in a matrix

Finding the sum, square root and log of all the values in the matrix

Sqrt

Log

Calculating the eigenvectors and eigenvalues of a matrix

How it works...

Vectors and matrices with randomly distributed values

How it works...

Creating vectors with uniformly distributed random values

Creating vectors with normally distributed random values

Creating vectors with random values that have a Poisson distribution

Creating a matrix with uniformly random values

Creating a matrix with normally distributed random values

Creating a matrix with random values that have a Poisson distribution

Reading and writing CSV files

How it works...

2. Getting Started with Apache Spark DataFrames

Introduction

Getting Apache Spark

How to do it...

Creating a DataFrame from CSV

How to do it...

How it works...

There's more…

Manipulating DataFrames

How to do it...

Printing the schema of the DataFrame

Sampling the data in the DataFrame

Selecting DataFrame columns

Filtering data by condition

Sorting data in the frame

Renaming columns

Treating the DataFrame as a relational table

Joining two DataFrames

Inner join

Right outer join

Left outer join

Saving the DataFrame as a file

Creating a DataFrame from Scala case classes

How to do it...

How it works...

3. Loading and Preparing Data – DataFrame

Introduction

Loading more than 22 features into classes

How to do it...

How it works...

There's more…

Loading JSON into DataFrames

How to do it…

Reading a JSON file using SQLContext.jsonFile

Reading a text file and converting it to JSON RDD

Explicitly specifying your schema

There's more…

Storing data as Parquet files

How to do it…

Load a simple CSV file, convert it to case classes, and create a DataFrame from it

Save it as a Parquet file

Install Parquet tools

Using the tools to inspect the Parquet file

Enable compression for the Parquet file

Using the Avro data model in Parquet

How to do it…

Creation of the Avro model

Generation of Avro objects using the sbt-avro plugin

Constructing an RDD of our generated object from Students.csv

Saving RDD[StudentAvro] in a Parquet file

Reading the file back for verification

Using Parquet tools for verification

Loading from RDBMS

How to do it…

Preparing data in DataFrames

How to do it...

4. Data Visualization

Introduction

Visualizing using Zeppelin

How to do it...

Installing Zeppelin

Customizing Zeppelin's server and websocket port

Visualizing data on HDFS – parameterizing inputs

Running custom functions

Adding external dependencies to Zeppelin

Pointing to an external Spark cluster

Creating scatter plots with Bokeh-Scala

How to do it...

Preparing our data

Creating Plot and Document objects

Creating a marker object

Setting the x and y axes' data range for the plot

Drawing the x and the y axes

Viewing flower species with varying colors

Adding grid lines

Adding a legend to the plot

Creating a time series MultiPlot with Bokeh-Scala

How to do it...

Preparing our data

Creating a plot

Creating a line that joins all the data points

Setting the x and y axes' data range for the plot

Drawing the axes and the grids

Adding tools

Adding a legend to the plot

Multiple plots in the document

5. Learning from Data

Introduction

Supervised and unsupervised learning

Gradient descent

Predicting continuous values using linear regression

How to do it...

Importing the data

Converting each instance into a LabeledPoint

Preparing the training and test data

Scaling the features

Training the model

Predicting against test data

Evaluating the model

Regularizing the parameters

Mini batching

Binary classification using LogisticRegression and SVM

How to do it...

Importing the data

Tokenizing the data and converting it into LabeledPoints

Factoring the inverse document frequency

Preparing the training and test data

Constructing the algorithm

Training the model and predicting the test data

Evaluating the model

Binary classification using LogisticRegression with Pipeline API

How to do it...

Importing and splitting data as test and training sets

Constructing the participants of the Pipeline

Preparing a pipeline and training a model

Predicting against test data

Evaluating a model without cross-validation

Constructing parameters for cross-validation

Constructing the cross-validator and fitting the best model

Evaluating the model with cross-validation

Clustering using K-means

How to do it...

KMeans.RANDOM

KMeans.PARALLEL

K-means++

K-means||

Max iterations

Epsilon

Importing the data and converting it into a vector

Feature scaling the data

Deriving the number of clusters

Constructing the model

Evaluating the model

Feature reduction using principal component analysis

How to do it...

Dimensionality reduction of data for supervised learning

Mean-normalizing the training data

Extracting the principal components

Preparing the labeled data

Preparing the test data

Classifying and evaluating the metrics

Dimensionality reduction of data for unsupervised learning

Mean-normalizing the training data

Extracting the principal components

Arriving at the number of components

Evaluating the metrics

6. Scaling Up

Introduction

Building the Uber JAR

How to do it...

Transitive dependency stated explicitly in the SBT dependency

Two different libraries depend on the same external library

Submitting jobs to the Spark cluster (local)

How to do it...

Downloading Spark

Running HDFS in pseudo-clustered mode

Running the Spark master and slave locally

Pushing data into HDFS

Submitting the Spark application on the cluster

Running the Spark Standalone cluster on EC2

How to do it...

Creating the AccessKey and pem file

Setting the environment variables

Running the launch script

Verifying installation

Making changes to the code

Transferring the data and job files

Loading the dataset into HDFS

Running the job

Destroying the cluster

Running the Spark Job on Mesos (local)

How to do it...

Installing Mesos

Starting the Mesos master and slave

Uploading the Spark binary package and the dataset to HDFS

Running the job

Running the Spark Job on YARN (local)

How to do it...

Installing the Hadoop cluster

Starting HDFS and YARN

Pushing Spark assembly and dataset to HDFS

Running a Spark job in yarn-client mode

Running a Spark job in yarn-cluster mode

7. Going Further

Introduction

Using Spark Streaming to subscribe to a Twitter stream

How to do it...

Using Spark as an ETL tool

How to do it...

Using StreamingLogisticRegression to classify a Twitter stream using Kafka as a training stream

How to do it...

Using GraphX to analyze Twitter data

How to do it...

Index
