Master Scala's advanced techniques to solve real-world problems in data analysis and gain valuable insights from your data

Key Features

* A beginner's guide to performing data analysis, loaded with rich, practical examples
* Access to popular Scala libraries such as Breeze and Saddle for efficient data manipulation and exploratory analysis
* Develop applications in Scala for real-time analysis and machine learning in Apache Spark

Book Description

Efficient business decisions grounded in an accurate sense of business data help deliver better performance across products and services. This book helps you leverage popular Scala libraries and tools to perform core data analysis tasks with ease.

The book begins with a quick overview of the building blocks of a standard data analysis process. You will learn to perform basic tasks such as extraction, staging, validation, cleaning, and shaping of datasets. You will then dive deep into the data exploration and visualization areas of the data analysis life cycle. You will use popular Scala libraries such as Saddle, Breeze, Vegas, and PredictionIO to process your datasets, and learn statistical methods for deriving meaningful insights from data. You will also learn to create applications for Apache Spark 2.x that perform complex data analysis in real time. You will discover traditional machine learning techniques for data analysis and be introduced to neural networks and deep learning from a data analysis standpoint.
By the end of this book, you will be capable of handling large sets of structured and unstructured data, performing exploratory analysis, and building efficient Scala applications for discovering and delivering insights.

What you will learn

* Learn techniques to determine the validity and confidence level of data
* Apply quartiles and n-tiles to datasets to see how data is distributed into many buckets
* Create data pipelines that combine multiple data lifecycle steps
* Use built-in features to gain a deeper understanding of the data
* Apply the Lasso regression analysis method to your data
* Compare the Apache Spark API with traditional Apache Spark data analysis

Who this book is for

If you are a data scientist or a data analyst who wants to learn how to perform data analysis using Scala, this book is for you. All you need is knowledge of the fundamentals of Scala programming.


About Packt

Why subscribe?



About the author

About the reviewer

Packt is searching for authors like you


Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch


Section 1: Scala and Data Analysis Life Cycle

Scala Overview

Getting started with Scala

Running Scala code online



Installing Scala on your computer

Installing command-line tools

Installing IDE

Overview of object-oriented and functional programming

Object-oriented programming using Scala

Functional programming using Scala

Scala case classes and the collection API

Scala case classes

Scala collection API




Overview of Scala libraries for data analysis

Apache Spark










Data Analysis Life Cycle

Data journey

Sourcing data

Data formats




Understanding data

Using statistical methods for data exploration

Using Scala

Other Scala tools

Using data visualization for data exploration

Using the vegas-viz library for data visualization

Other libraries for data visualization

Using ML to learn from data

Setting up Smile

Running Smile

Creating a data pipeline


Data Ingestion

Data extraction

Pull-oriented data extraction

Push-oriented data delivery

Data staging

Why is the staging important?

Cleaning and normalizing


Organizing and storing


Data Exploration and Visualization

Sampling data

Selecting the sample

Selecting samples using Saddle

Performing ad hoc analysis

Finding a relationship between data elements

Visualizing data

Vegas viz for data visualization

Spark Notebook for data visualization

Downloading and installing Spark Notebook

Creating a Spark Notebook with simple visuals

More charts with Spark Notebook

Box plot


Bubble chart


Applying Statistics and Hypothesis Testing

Basics of statistics

Summary level statistics

Correlation statistics

Vector level statistics

Random data generation

Pseudorandom numbers

Random numbers with normal distribution

Random numbers with Poisson distribution

Hypothesis testing


Section 2: Advanced Data Analysis and Machine Learning

Introduction to Spark for Distributed Data Analysis

Spark setup and overview

Spark core concepts

Spark Datasets and DataFrames

Sourcing data using Spark

Parquet file format

Avro file format

Spark JDBC integration

Using Spark to explore data


Traditional Machine Learning for Data Analysis

ML overview

Characteristics of ML

Categories or types of ML

Decision trees

Implementing decision trees

Decision tree algorithms

Implementing decision tree algorithms in our example

Evaluating the results

Using our model with a decision tree

Random forest

Random forest algorithms

Ridge and lasso regression

Characteristics of ridge regression

Characteristics of lasso regression

k-means cluster analysis

Natural language processing for data analysis

Algorithm selections


Section 3: Real-Time Data Analysis and Scalability

Near Real-Time Data Analysis Using Streaming

Overview of streaming

Spark Streaming overview

Word count using pure Scala

Word count using Scala and Spark

Word count using Scala and Spark Streaming

Deep dive into the Spark Streaming solution

Streaming a k-means clustering algorithm using Spark

Streaming linear regression using Spark


Working with Data at Scale

Working with data at scale

Cost considerations

Data storage

Data governance

Reliability considerations

Input data errors

Processing failures


