


R: Data Analysis and Visualization (e-book)


Authors: Tony Fischetti, Brett Lantz, Jaynal Abedin

Publisher: Packt Publishing


Word count: 17.06 million




Master the art of building analytical models using R

About This Book

  • Load, wrangle, and analyze your data using the world's most powerful statistical programming language
  • Build and customize publication-quality visualizations of powerful and stunning R graphs
  • Develop key skills and techniques with R to create and customize data mining algorithms
  • Use R to optimize your trading strategy and build up your own risk management system
  • Discover how to build machine learning algorithms, prepare data, and dig deep into data prediction techniques with R

Who This Book Is For

This course is for data scientists and quantitative analysts who want to learn R and take advantage of its powerful analytical design framework. It's a seamless journey toward becoming a full-stack R developer.

What You Will Learn

  • Describe and visualize the behavior of data and relationships between data
  • Gain a thorough understanding of statistical reasoning and sampling
  • Handle missing data gracefully using multiple imputation
  • Create diverse types of bar charts using the default R functions
  • Familiarize yourself with algorithms written in R for spatial data mining, text mining, and so on
  • Understand relationships between market factors and their impact on your portfolio
  • Harness the power of R to build machine learning algorithms with real-world data science applications
  • Learn specialized machine learning techniques for text mining, big data, and more

In Detail

The R learning path created for you has five connected modules, each of which is a mini-course in its own right. As you complete each one, you'll have gained key skills and be ready for the material in the next module.

This course begins with the Data Analysis with R module, which will help you navigate the R environment. You'll gain a thorough understanding of statistical reasoning and sampling. Finally, you'll be able to put best practices into effect to make your job easier and facilitate reproducibility.

The second module, R Graphs, will help you leverage powerful default R graphics and utilize advanced graphics systems such as lattice and ggplot2, the grammar of graphics. You'll learn how to produce, customize, and publish advanced visualizations using this popular and powerful framework.

With the third module, Learning Data Mining with R, you will learn how to manipulate data with R using code snippets and be introduced to mining frequent patterns, associations, and correlations while working with R programs.

The Mastering R for Quantitative Finance module pragmatically introduces both the quantitative finance concepts and their modeling in R, enabling you to build a tailor-made trading system on your own. By the end of the module, you will be well versed in various financial techniques using R and will be able to place good bets while making financial decisions.

Finally, we'll look at the Machine Learning with R module. With this module, you'll discover all the analytical tools you need to gain insights from complex data and learn how to choose the correct algorithm for your specific needs. You'll also learn to apply machine learning methods to deal with common tasks, including classification, prediction, forecasting, and so on.

Style and approach

Learn data analysis, data visualization techniques, data mining, and machine learning using R, and learn to build quantitative finance models in the same language.

R: Data Analysis and Visualization

Table of Contents

R: Data Analysis and Visualization

Meet Your Course Guide

Course Structure

Course journey

The Course Roadmap and Timeline

I. Module 1: Data Analysis with R

1. RefresheR

Navigating the basics

Arithmetic and assignment

Logicals and characters

Flow of control

Getting help in R



Vectorized functions

Advanced subsetting




Loading data into R

Working with packages

2. The Shape of Data

Univariate data

Frequency distributions

Central tendency


Populations, samples, and estimation

Probability distributions

Visualization methods

3. Describing Relationships

Multivariate data

Relationships between a categorical and a continuous variable

Relationships between two categorical variables

The relationship between two continuous variables


Correlation coefficients

Comparing multiple correlations

Visualization methods

Categorical and continuous variables

Two categorical variables

Two continuous variables

More than two continuous variables

4. Probability

Basic probability

A tale of two interpretations

Sampling from distributions


The binomial distribution

The normal distribution

The three-sigma rule and using z-tables

5. Using Data to Reason About the World

Estimating means

The sampling distribution

Interval estimation

How did we get 1.96?

Smaller samples

6. Testing Hypotheses

Null Hypothesis Significance Testing

One and two-tailed tests

When things go wrong

A warning about significance

A warning about p-values

Testing the mean of one sample

Assumptions of the one sample t-test

Testing two means

Don't be fooled!

Assumptions of the independent samples t-test

Testing more than two means

Assumptions of ANOVA

Testing independence of proportions

What if my assumptions are unfounded?

7. Bayesian Methods

The big idea behind Bayesian analysis

Choosing a prior

Who cares about coin flips

Enter MCMC – stage left

Using JAGS and runjags

Fitting distributions the Bayesian way

The Bayesian independent samples t-test

8. Predicting Continuous Variables

Linear models

Simple linear regression

Simple linear regression with a binary predictor

A word of warning

Multiple regression

Regression with a non-binary predictor

Kitchen sink regression

The bias-variance trade-off


Striking a balance

Linear regression diagnostics

Second Anscombe relationship

Third Anscombe relationship

Fourth Anscombe relationship

Advanced topics

9. Predicting Categorical Variables

k-Nearest Neighbors

Using k-NN in R

Confusion matrices

Limitations of k-NN

Logistic regression

Using logistic regression in R

Decision trees

Random forests

Choosing a classifier

The vertical decision boundary

The diagonal decision boundary

The crescent decision boundary

The circular decision boundary

10. Sources of Data

Relational Databases

Why didn't we just do that in SQL?

Using JSON


Other data formats

Online repositories

11. Dealing with Messy Data

Analysis with missing data

Visualizing missing data

Types of missing data

So which one is it?

Unsophisticated methods for dealing with missing data

Complete case analysis

Pairwise deletion

Mean substitution

Hot deck imputation

Regression imputation

Stochastic regression imputation

Multiple imputation

So how does mice come up with the imputed values?

Methods of imputation

Multiple imputation in practice

Analysis with unsanitized data

Checking for out-of-bounds data

Checking the data type of a column

Checking for unexpected categories

Checking for outliers, entry errors, or unlikely data points

Chaining assertions

Other messiness


Regular expressions


12. Dealing with Large Data

Wait to optimize

Using a bigger and faster machine

Be smart about your code

Allocation of memory


Using optimized packages

Using another R implementation

Use parallelization

Getting started with parallel R

An example of (some) substance

Using Rcpp

Be smarter about your code

13. Reproducibility and Best Practices

R Scripting


Running R scripts

An example script

Scripting and reproducibility

R projects

Version control

Communicating results

II. Module 2: R Graphs

1. R Graphics

Base graphics using the default package

Trellis graphs using lattice

Graphs inspired by Grammar of Graphics

2. Basic Graph Functions


Creating basic scatter plots

Getting ready

How to do it...

How it works...

There's more...

A note on R's built-in datasets

See also

Creating line graphs

Getting ready

How to do it...

How it works...

There's more...

See also

Creating bar charts

Getting ready

How to do it...

How it works...

There's more...

See also

Creating histograms and density plots

How to do it...

How it works...

There's more...

See also

Creating box plots

Getting ready

How to do it...

How it works...

There's more...

See also

Adjusting x and y axes' limits

How to do it...

How it works...

There's more...

See also

Creating heat maps

How to do it...

How it works...

There's more...

See also

Creating pairs plots

How to do it...

How it works...

There's more...

See also

Creating multiple plot matrix layouts

How to do it...

How it works...

There's more...

See also

Adding and formatting legends

Getting ready

How to do it...

How it works...

There's more...

See also

Creating graphs with maps

Getting ready

How to do it...

How it works...

There's more...

See also

Saving and exporting graphs

How to do it...

How it works...

There's more...

See also

3. Beyond the Basics – Adjusting Key Parameters


Setting colors of points, lines, and bars

Getting ready

How to do it...

How it works...

There's more...

See also

Setting plot background colors

Getting ready

How to do it...

How it works...

There's more...

Setting colors for text elements – axis annotations, labels, plot titles, and legends

Getting ready

How to do it...

How it works...

There's more...

Choosing color combinations and palettes

Getting ready

How to do it...

How it works...

There's more...

See also

Setting fonts for annotations and titles

Getting ready

How to do it...

How it works...

There's more...

See also

Choosing plotting point symbol styles and sizes

Getting ready

How to do it...

How it works...

There's more...

See also

Choosing line styles and width

Getting ready

How to do it...

How it works...

See also

Choosing box styles

Getting ready

How to do it...

How it works...

There's more...

Adjusting axis annotations and tick marks

Getting ready

How to do it...

How it works...

There's more...

See also

Formatting log axes

Getting ready

How to do it...

How it works...

There's more...

Setting graph margins and dimensions

Getting ready

How to do it...

How it works...

See also

4. Creating Scatter Plots


Grouping data points within a scatter plot

Getting ready

How to do it...

How it works...

There's more...

See also

Highlighting grouped data points by size and symbol type

Getting ready

How to do it...

How it works...

Labeling data points

Getting ready

How to do it...

How it works...

There's more...

Correlation matrix using pairs plots

Getting ready

How to do it...

How it works...

Adding error bars

Getting ready

How to do it...

How it works...

There's more...

Using jitter to distinguish closely packed data points

Getting ready

How to do it...

How it works...

Adding linear model lines

Getting ready

How to do it...

How it works...

Adding nonlinear model curves

Getting ready

How to do it...

How it works...

Adding nonparametric model curves with lowess

Getting ready

How to do it...

How it works...

Creating three-dimensional scatter plots

Getting ready

How to do it...

How it works...

There's more...

Creating Quantile-Quantile plots

Getting ready

How to do it...

How it works...

There's more...

Displaying the data density on axes

Getting ready

How to do it...

How it works...

There's more...

Creating scatter plots with a smoothed density representation

Getting ready

How to do it...

How it works...

There's more...

5. Creating Line Graphs and Time Series Charts


Adding customized legends for multiple-line graphs

Getting ready

How to do it...

How it works...

There's more...

See also

Using margin labels instead of legends for multiple-line graphs

Getting ready

How to do it...

How it works...

There's more...

Adding horizontal and vertical grid lines

Getting ready

How to do it...

How it works...

There's more...

See also

Adding marker lines at specific x and y values using abline

Getting ready

How to do it...

How it works...

There's more...

Creating sparklines

Getting ready

How to do it...

How it works...

Plotting functions of a variable in a dataset

Getting ready

How to do it...

How it works...

There's more...

Formatting time series data for plotting

Getting ready

How to do it...

How it works...

There's more...

Plotting the date or time variable on the x axis

Getting ready

How to do it...

How it works...

There's more...

Annotating axis labels in different human-readable time formats

Getting ready

How to do it...

How it works...

There's more...

Adding vertical markers to indicate specific time events

Getting ready

How to do it...

How it works...

There's more...

Plotting data with varying time-averaging periods

Getting ready

How to do it...

How it works...

Creating stock charts

Getting ready

How to do it...

How it works...

There's more...

6. Creating Bar, Dot, and Pie Charts


Creating bar charts with more than one factor variable

Getting ready

How to do it...

How it works...

See also

Creating stacked bar charts

Getting ready

How to do it...

How it works...

There's more...

Adjusting the orientation of bars – horizontal and vertical

Getting ready

How to do it...

How it works...

There's more...

Adjusting bar widths, spacing, colors, and borders

Getting ready

How to do it...

How it works...

There's more...

Displaying values on top of or next to the bars

Getting ready

How to do it...

How it works...

There's more...

See also

Placing labels inside bars

Getting ready

How to do it...

How it works...

There's more...

Creating bar charts with vertical error bars

Getting ready

How to do it...

How it works...

There's more...

Modifying dot charts by grouping variables

Getting ready

How to do it...

How it works...

Making better, readable pie charts with clockwise-ordered slices

Getting ready

How to do it...

How it works...

See also

Labeling a pie chart with percentage values for each slice

Getting ready

How it works...

There's more...

See also

Adding a legend to a pie chart

Getting ready

How to do it...

How it works...

There's more...

7. Creating Histograms


Visualizing distributions as count frequencies or probability densities

Getting ready

How to do it...

How it works...

There's more

Setting the bin size and the number of breaks

Getting ready

How to do it...

How it works...

There's more

Adjusting histogram styles – bar colors, borders, and axes

Getting ready

How to do it...

How it works...

There's more

Overlaying a density line over a histogram

Getting ready

How to do it...

How it works...

Multiple histograms along the diagonal of a pairs plot

Getting ready

How to do it...

How it works...

Histograms in the margins of line and scatter plots

Getting ready

How to do it...

How it works...

8. Box and Whisker Plots


Creating box plots with narrow boxes for a small number of variables

Getting ready

How to do it...

How it works...

There's more

See also

Grouping over a variable

Getting ready

How to do it...

How it works...

There's more

See also

Varying box widths by the number of observations

Getting ready

How to do it...

How it works...

Creating box plots with notches

Getting ready

How to do it...

How it works...

There's more

Including or excluding outliers

Getting ready

How to do it...

How it works...

See also

Creating horizontal box plots

Getting ready

How to do it...

How it works...

Changing the box styling

Getting ready

How to do it...

How it works...

There's more

Adjusting the extent of plot whiskers outside the box

Getting ready

How to do it...

How it works...

There's more

Showing the number of observations

Getting ready

How to do it...

How it works...

There's more

Splitting a variable at arbitrary values into subsets

Getting ready

How to do it...

How it works...

There's more

9. Creating Heat Maps and Contour Plots


Creating heat maps of a single Z variable with a scale

Getting ready

How to do it...

How it works...

There's more

See also

Creating correlation heat maps

Getting ready

How to do it...

How it works...

There's more

Summarizing multivariate data in a single heat map

Getting ready

How to do it...

How it works...

There's more

Creating contour plots

Getting ready

How to do it...

How it works...

There's more

See also

Creating filled contour plots

Getting ready

How to do it...

How it works...

There's more

See also

Creating three-dimensional surface plots

Getting ready

How to do it...

How it works...

There's more

Visualizing time series as calendar heat maps

Getting ready

How to do it...

How it works...

There's more

10. Creating Maps


Plotting global data by countries on a world map

Getting ready

How to do it...

How it works...

There's more

See also

Creating graphs with regional maps

Getting ready

How to do it...

How it works...

There's more

Plotting data on Google maps

Getting ready

How to do it...

How it works...

There's more

See also

Creating and reading KML data

Getting ready

How to do it...

How it works...

See also

Working with ESRI shapefiles

Getting ready

How to do it...

How it works...

There's more

11. Data Visualization Using Lattice


Creating bar charts

Getting ready

How to do it…

How it works…

There's more…

See also

Creating stacked bar charts

Getting ready

How to do it…

How it works…

There's more…

See also

Creating bar charts to visualize cross-tabulation

Getting ready

How to do it…

How it works…

There's more…

Creating a conditional histogram

Getting ready

How to do it…

How it works…

There's more…

See also

Visualizing distributions through a kernel-density plot

Getting ready

How to do it…

How it works…

There's more…

Creating a normal Q-Q plot

Getting ready

How to do it…

How it works…

There's more…

Visualizing an empirical Cumulative Distribution Function

Getting ready

How to do it…

How it works…

There's more…

Creating a boxplot

Getting ready

How to do it…

How it works…

There's more…

Creating a conditional scatter plot

Getting ready

How to do it…

How it works…

There's more…

12. Data Visualization Using ggplot2


Creating bar charts

Getting ready

How to do it…

How it works…

There's more…

See also

Creating multiple bar charts

Getting ready

How to do it…

How it works…

There's more…

See also

Creating a bar chart with error bars

Getting ready

How to do it…

How it works…

There's more…

Visualizing the density of a numeric variable

Getting ready

How to do it...

How it works…

There's more...

Creating a box plot

Getting ready

How to do it...

How it works…

Creating a layered plot with a scatter plot and fitted line

Getting ready

How to do it...

How it works…

There's more...

Creating a line chart

Getting ready

How to do it...

How it works…

There's more...

Graph annotation with ggplot

Getting ready

How to do it...

How it works...

13. Inspecting Large Datasets


Multivariate continuous data visualization

Getting ready

How to do it…

How it works…

There's more…

See also

Multivariate categorical data visualization

Getting ready

How to do it…

How it works…

There's more…

Visualizing mixed data

Getting ready

How to do it…

Zooming and filtering

Getting ready

How to do it...

How it works…

There's more...

14. Three-dimensional Visualizations


Three-dimensional scatter plots

Getting ready

How to do it…

How it works…

There's more…

See also...

Three-dimensional scatter plots with a regression plane

Getting ready

How to do it…

How it works…

There's more…

Three-dimensional bar charts

Getting ready

How to do it…

How it works…

Three-dimensional density plots

Getting ready

How to do it...

How it works…

15. Finalizing Graphs for Publications and Presentations


Exporting graphs in high-resolution image formats – PNG, JPEG, BMP, and TIFF

Getting ready

How to do it...

How it works...

There's more

See also

Exporting graphs in vector formats – SVG, PDF, and PS

Getting ready

How to do it...

How it works...

There's more

Adding mathematical and scientific notations (typesetting)

Getting ready

How to do it...

How it works...

There's more

Adding text descriptions to graphs

Getting ready

How to do it...

How it works...

There's more

Using graph templates

Getting ready

How to do it...

How it works...

There's more

Choosing font families and styles under Windows, Mac OS X, and Linux

Getting ready

How to do it...

How it works...

There's more

See also

Choosing fonts for PostScripts and PDFs

Getting ready

How to do it...

How it works...

There's more

III. Module 3: Learning Data Mining with R

1. Warming Up

Big data

Scalability and efficiency

Data source

Data mining

Feature extraction


The data mining process



Social network mining

Social network

Text mining

Information retrieval and text mining

Mining text for prediction

Web data mining

Why R?

What are the disadvantages of R?


Statistics and data mining

Statistics and machine learning

Statistics and R

The limitations of statistics on data mining

Machine learning

Approaches to machine learning

Machine learning architecture

Data attributes and description

Numeric attributes

Categorical attributes

Data description

Data measuring

Data cleaning

Missing values

Junk, noisy data, or outlier

Data integration

Data dimension reduction

Eigenvalues and Eigenvectors

Principal-Component Analysis

Singular-value decomposition

CUR decomposition

Data transformation and discretization

Data transformation

Normalization data transformation methods

Data discretization

Visualization of results

Visualization with R

2. Mining Frequent Patterns, Associations, and Correlations

An overview of associations and patterns

Patterns and pattern discovery

The frequent itemset

The frequent subsequence

The frequent substructures

Relationship or rules discovery

Association rules

Correlation rules

Market basket analysis

The market basket model

A-Priori algorithms

Input data characteristics and data structure

The A-Priori algorithm

The R implementation

A-Priori algorithm variants

The Eclat algorithm

The R implementation

The FP-growth algorithm

Input data characteristics and data structure

The FP-growth algorithm

The R implementation

The GenMax algorithm with maximal frequent itemsets

The R implementation

The Charm algorithm with closed frequent itemsets

The R implementation

The algorithm to generate association rules

The R implementation

Hybrid association rules mining

Mining multilevel and multidimensional association rules

Constraint-based frequent pattern mining

Mining sequence dataset

Sequence dataset

The GSP algorithm

The R implementation

The SPADE algorithm

The R implementation

Rule generation from sequential patterns

High-performance algorithms

3. Classification


Generic decision tree induction

Attribute selection measures

Tree pruning

General algorithm for the decision tree generation

The R implementation

High-value credit card customers classification using ID3

The ID3 algorithm

The R implementation

Web attack detection

High-value credit card customers classification

Web spam detection using C4.5

The C4.5 algorithm

The R implementation

A parallel version with MapReduce

Web spam detection

Web key resource page judgment using CART

The CART algorithm

The R implementation

Web key resource page judgment

Trojan traffic identification method and Bayes classification


Prior probability estimation

Likelihood estimation

The Bayes classification

The R implementation

Trojan traffic identification method

Identify spam e-mail and Naïve Bayes classification

The Naïve Bayes classification

The R implementation

Identify spam e-mail

Rule-based classification of player types in computer games and rule-based classification

Transformation from decision tree to decision rules

Rule-based classification

Sequential covering algorithm

The RIPPER algorithm

The R implementation

Rule-based classification of player types in computer games

4. Advanced Classification

Ensemble (EM) methods

The bagging algorithm

The boosting and AdaBoost algorithms

The Random forests algorithm

The R implementation

Parallel version with MapReduce

Biological traits and the Bayesian belief network

The Bayesian belief network (BBN) algorithm

The R implementation

Biological traits

Protein classification and the k-Nearest Neighbors algorithm

The kNN algorithm

The R implementation

Document retrieval and Support Vector Machine

The SVM algorithm

The R implementation

Parallel version with MapReduce

Document retrieval

Classification using frequent patterns

The associative classification


Discriminative frequent pattern-based classification

The R implementation

Text classification using sentential frequent itemsets

Classification using the backpropagation algorithm

The BP algorithm

The R implementation

Parallel version with MapReduce

5. Cluster Analysis

Search engines and the k-means algorithm

The k-means clustering algorithm

The kernel k-means algorithm

The k-modes algorithm

The R implementation

Parallel version with MapReduce

Search engine and web page clustering

Automatic abstraction of document texts and the k-medoids algorithm

The PAM algorithm

The R implementation

Automatic abstraction and summarization of document text

The CLARA algorithm

The CLARA algorithm

The R implementation


The CLARANS algorithm

The R implementation

Unsupervised image categorization and affinity propagation clustering

Affinity propagation clustering

The R implementation

Unsupervised image categorization

The spectral clustering algorithm

The R implementation

News categorization and hierarchical clustering

Agglomerative hierarchical clustering

The BIRCH algorithm

The chameleon algorithm

The Bayesian hierarchical clustering algorithm

The probabilistic hierarchical clustering algorithm

The R implementation

News categorization

6. Advanced Cluster Analysis

Customer categorization analysis of e-commerce and DBSCAN

The DBSCAN algorithm

Customer categorization analysis of e-commerce

Clustering web pages and OPTICS

The OPTICS algorithm

The R implementation

Clustering web pages

Visitor analysis in the browser cache and DENCLUE

The DENCLUE algorithm

The R implementation

Visitor analysis in the browser cache

Recommendation system and STING

The STING algorithm

The R implementation

Recommendation systems

Web sentiment analysis and CLIQUE

The CLIQUE algorithm

The R implementation

Web sentiment analysis

Opinion mining and WAVE clustering

The WAVE cluster algorithm

The R implementation

Opinion mining

User search intent and the EM algorithm

The EM algorithm

The R implementation

The user search intent

Customer purchase data analysis and clustering high-dimensional data

The MAFIA algorithm

The SURFING algorithm

The R implementation

Customer purchase data analysis

SNS and clustering graph and network data

The SCAN algorithm

The R implementation

Social networking service (SNS)

7. Outlier Detection

Credit card fraud detection and statistical methods

The likelihood-based outlier detection algorithm

The R implementation

Credit card fraud detection

Activity monitoring – the detection of fraud involving mobile phones and proximity-based methods

The NL algorithm

The FindAllOutsM algorithm

The FindAllOutsD algorithm

The distance-based algorithm

The Dolphin algorithm

The R implementation

Activity monitoring and the detection of mobile fraud

Intrusion detection and density-based methods

The OPTICS-OF algorithm

The High Contrast Subspace algorithm

The R implementation

Intrusion detection

Intrusion detection and clustering-based methods

Hierarchical clustering to detect outliers

The k-means-based algorithm

The ODIN algorithm

The R implementation

Monitoring the performance of the web server and classification-based methods

The OCSVM algorithm

The one-class nearest neighbor algorithm

The R implementation

Monitoring the performance of the web server

Detecting novelty in text, topic detection, and mining contextual outliers

The conditional anomaly detection (CAD) algorithm

The R implementation

Detecting novelty in text and topic detection

Collective outliers on spatial data

The route outlier detection (ROD) algorithm

The R implementation

Characteristics of collective outliers

Outlier detection in high-dimensional data

The brute-force algorithm

The HilOut algorithm

The R implementation

8. Mining Stream, Time-series, and Sequence Data

The credit card transaction flow and STREAM algorithm

The STREAM algorithm

The single-pass-any-time clustering algorithm

The R implementation

The credit card transaction flow

Predicting future prices and time-series analysis

The ARIMA algorithm

Predicting future prices

Stock market data and time-series clustering and classification

The hError algorithm

Time-series classification with the 1NN classifier

The R implementation

Stock market data

Web click streams and mining symbolic sequences

The TECNO-STREAMS algorithm

The R implementation

Web click streams

Mining sequence patterns in transactional databases

The PrefixSpan algorithm

The R implementation

9. Graph Mining and Network Analysis

Graph mining


Graph mining algorithms

Mining frequent subgraph patterns

The gPLS algorithm

The GraphSig algorithm

The gSpan algorithm

Rightmost path extensions and their supports

The subgraph isomorphism enumeration algorithm

The canonical checking algorithm

The R implementation

Social network mining

Community detection and the shingling algorithm

The node classification and iterative classification algorithms

The R implementation

10. Mining Text and Web Data

Text mining and TM packages

Text summarization

Topic representation

The multidocument summarization algorithm

The Maximal Marginal Relevance algorithm

The R implementation

The question answering system

Genre categorization of web pages

Categorizing newspaper articles and newswires into topics

The N-gram-based text categorization

The R implementation

Web usage mining with web logs

The FCA-based association rule mining algorithm

The R implementation

IV. Module 4: Mastering R for Quantitative Finance

1. Time Series Analysis

Multivariate time series analysis


Vector autoregressive models

VAR implementation example

Cointegrated VAR and VECM

Volatility modeling

GARCH modeling with the rugarch package

The standard GARCH model

The Exponential GARCH model (EGARCH)

The Threshold GARCH model (TGARCH)

Simulation and forecasting

References and reading list

2. Factor Models

Arbitrage pricing theory

Implementation of APT

Fama-French three-factor model

Modeling in R

Data selection

Estimation of APT with principal component analysis

Estimation of the Fama-French model


3. Forecasting Volume


The intensity of trading

The volume forecasting model

Implementation in R

The data

Loading the data

The seasonal component

AR(1) estimation and forecasting

SETAR estimation and forecasting

Interpreting the results


4. Big Data – Advanced Analytics

Getting data from open sources

Introduction to big data analysis in R

K-means clustering on big data

Loading big matrices

Big data K-means clustering analysis

Big data linear regression analysis

Loading big data

Fitting a linear regression model on large datasets


5. FX Derivatives

Terminology and notations

Currency options

Exchange options

Two-dimensional Wiener processes

The Margrabe formula

Application in R

Quanto options

Pricing formula for a call quanto

Pricing a call quanto in R


6. Interest Rate Derivatives and Models

The Black model

Pricing a cap with Black's model

The Vasicek model

The Cox-Ingersoll-Ross model

Parameter estimation of interest rate models

Using the SMFI5 package


7. Exotic Options

A general pricing approach

The role of dynamic hedging

How R can help a lot

A glance beyond vanillas

Greeks – the link back to the vanilla world

Pricing the Double-no-touch option

Another way to price the Double-no-touch option

The life of a Double-no-touch option – a simulation

Exotic options embedded in structured products


8. Optimal Hedging

Hedging of derivatives

Market risk of derivatives

Static delta hedge

Dynamic delta hedge

Comparing the performance of delta hedging

Hedging in the presence of transaction costs

Optimization of the hedge

Optimal hedging in the case of absolute transaction costs

Optimal hedging in the case of relative transaction costs

Further extensions


9. Fundamental Analysis

The basics of fundamental analysis

Collecting data

Revealing connections

Including multiple variables

Separating investment targets

Setting classification rules


Industry-specific investment


10. Technical Analysis, Neural Networks, and Logoptimal Portfolios

Market efficiency

Technical analysis

The TA toolkit


Plotting charts – bitcoin

Built-in indicators




Candle patterns: key reversal

Evaluating the signals and managing the position

A word on money management

Wrapping up

Neural networks

Forecasting bitcoin prices

Evaluation of the strategy

Logoptimal portfolios

A universally consistent, non-parametric investment strategy

Evaluation of the strategy


11. Asset and Liability Management

Data preparation

Data source at first glance

Cash-flow generator functions

Preparing the cash-flow

Interest rate risk measurement

Liquidity risk measurement

Modeling non-maturity deposits

A model of deposit interest rate development

Static replication of non-maturity deposits


12. Capital Adequacy

Principles of the Basel Accords

Basel I

Basel II

Minimum capital requirements

Supervisory review


Basel III

Risk measures

Analytical VaR

Historical VaR

Monte-Carlo simulation

Risk categories

Market risk

Credit risk

Operational risk


13. Systemic Risks

Systemic risk in a nutshell

The dataset used in our examples

Core-periphery decomposition

Implementation in R


The simulation method

The simulation

Implementation in R


Possible interpretations and suggestions


V. Module 5: Machine Learning with R

1. Introducing Machine Learning

The origins of machine learning

Uses and abuses of machine learning

Machine learning successes

The limits of machine learning

Machine learning ethics

How machines learn

Data storage




Machine learning in practice

Types of input data

Types of machine learning algorithms

Matching input data to algorithms

Machine learning with R

Installing R packages

Loading and unloading R packages

2. Managing and Understanding Data

R data structures




Data frames

Matrixes and arrays

Managing data with R

Saving, loading, and removing R data structures

Importing and saving data from CSV files

Exploring and understanding data

Exploring the structure of data

Exploring numeric variables

Measuring the central tendency – mean and median

Measuring spread – quartiles and the five-number summary

Visualizing numeric variables – boxplots

Visualizing numeric variables – histograms

Understanding numeric data – uniform and normal distributions

Measuring spread – variance and standard deviation

Exploring categorical variables

Measuring the central tendency – the mode

Exploring relationships between variables

Visualizing relationships – scatterplots

Examining relationships – two-way cross-tabulations

3. Lazy Learning – Classification Using Nearest Neighbors

Understanding nearest neighbor classification

The k-NN algorithm

Measuring similarity with distance

Choosing an appropriate k

Preparing data for use with k-NN

Why is the k-NN algorithm lazy?

Example – diagnosing breast cancer with the k-NN algorithm

Step 1 – collecting data

Step 2 – exploring and preparing the data

Transformation – normalizing numeric data

Data preparation – creating training and test datasets

Step 3 – training a model on the data

Step 4 – evaluating model performance

Step 5 – improving model performance

Transformation – z-score standardization

Testing alternative values of k

4. Probabilistic Learning – Classification Using Naive Bayes

Understanding Naive Bayes

Basic concepts of Bayesian methods

Understanding probability

Understanding joint probability

Computing conditional probability with Bayes' theorem

The Naive Bayes algorithm

Classification with Naive Bayes

The Laplace estimator

Using numeric features with Naive Bayes

Example – filtering mobile phone spam with the Naive Bayes algorithm

Step 1 – collecting data

Step 2 – exploring and preparing the data

Data preparation – cleaning and standardizing text data

Data preparation – splitting text documents into words

Data preparation – creating training and test datasets

Visualizing text data – word clouds

Data preparation – creating indicator features for frequent words

Step 3 – training a model on the data

Step 4 – evaluating model performance

Step 5 – improving model performance

5. Divide and Conquer – Classification Using Decision Trees and Rules

Understanding decision trees

Divide and conquer

The C5.0 decision tree algorithm

Choosing the best split

Pruning the decision tree

Example – identifying risky bank loans using C5.0 decision trees

Step 1 – collecting data

Step 2 – exploring and preparing the data

Data preparation – creating random training and test datasets

Step 3 – training a model on the data

Step 4 – evaluating model performance

Step 5 – improving model performance

Boosting the accuracy of decision trees

Making some mistakes more costly than others

Understanding classification rules

Separate and conquer

The 1R algorithm

The RIPPER algorithm

Rules from decision trees

What makes trees and rules greedy?

Example – identifying poisonous mushrooms with rule learners

Step 1 – collecting data

Step 2 – exploring and preparing the data

Step 3 – training a model on the data

Step 4 – evaluating model performance

Step 5 – improving model performance

6. Forecasting Numeric Data – Regression Methods

Understanding regression

Simple linear regression

Ordinary least squares estimation


Multiple linear regression

Example – predicting medical expenses using linear regression

Step 1 – collecting data

Step 2 – exploring and preparing the data

Exploring relationships among features – the correlation matrix

Visualizing relationships among features – the scatterplot matrix

Step 3 – training a model on the data

Step 4 – evaluating model performance

Step 5 – improving model performance

Model specification – adding non-linear relationships

Transformation – converting a numeric variable to a binary indicator

Model specification – adding interaction effects

Putting it all together – an improved regression model

Understanding regression trees and model trees

Adding regression to trees

Example – estimating the quality of wines with regression trees and model trees

Step 1 – collecting data

Step 2 – exploring and preparing the data

Step 3 – training a model on the data

Visualizing decision trees

Step 4 – evaluating model performance

Measuring performance with the mean absolute error

Step 5 – improving model performance

7. Black Box Methods – Neural Networks and Support Vector Machines

Understanding neural networks

From biological to artificial neurons

Activation functions

Network topology

The number of layers

The direction of information travel

The number of nodes in each layer

Training neural networks with backpropagation

Example – Modeling the strength of concrete with ANNs

Step 1 – collecting data

Step 2 – exploring and preparing the data

Step 3 – training a model on the data

Step 4 – evaluating model performance

Step 5 – improving model performance

Understanding Support Vector Machines

Classification with hyperplanes

The case of linearly separable data

The case of nonlinearly separable data

Using kernels for non-linear spaces

Example – performing OCR with SVMs

Step 1 – collecting data

Step 2 – exploring and preparing the data

Step 3 – training a model on the data

Step 4 – evaluating model performance

Step 5 – improving model performance

8. Finding Patterns – Market Basket Analysis Using Association Rules

Understanding association rules

The Apriori algorithm for association rule learning

Measuring rule interest – support and confidence

Building a set of rules with the Apriori principle

Example – identifying frequently purchased groceries with association rules

Step 1 – collecting data

Step 2 – exploring and preparing the data

Data preparation – creating a sparse matrix for transaction data

Visualizing item support – item frequency plots

Visualizing the transaction data – plotting the sparse matrix

Step 3 – training a model on the data

Step 4 – evaluating model performance

Step 5 – improving model performance

Sorting the set of association rules

Taking subsets of association rules

Saving association rules to a file or data frame

9. Finding Groups of Data – Clustering with k-means

Understanding clustering

Clustering as a machine learning task

The k-means clustering algorithm

Using distance to assign and update clusters

Choosing the appropriate number of clusters

Example – finding teen market segments using k-means clustering

Step 1 – collecting data

Step 2 – exploring and preparing the data

Data preparation – dummy coding missing values

Data preparation – imputing the missing values

Step 3 – training a model on the data

Step 4 – evaluating model performance

Step 5 – improving model performance

10. Evaluating Model Performance

Measuring performance for classification

Working with classification prediction data in R

A closer look at confusion matrices

Using confusion matrices to measure performance

Beyond accuracy – other measures of performance

The kappa statistic

Sensitivity and specificity

Precision and recall

The F-measure

Visualizing performance trade-offs

ROC curves

Estimating future performance

The holdout method


Bootstrap sampling

11. Improving Model Performance

Tuning stock models for better performance

Using caret for automated parameter tuning

Creating a simple tuned model

Customizing the tuning process

Improving model performance with meta-learning

Understanding ensembles



Random forests

Training random forests

Evaluating random forest performance

12. Specialized Machine Learning Topics

Working with proprietary files and databases

Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files

Querying data in SQL databases

Working with online data and services

Downloading the complete text of web pages

Scraping data from web pages

Parsing XML documents

Parsing JSON from web APIs

Working with domain-specific data

Analyzing bioinformatics data

Analyzing and visualizing network data

Improving the performance of R

Managing very large datasets

Generalizing tabular data structures with dplyr

Making data frames faster with data.table

Creating disk-based data frames with ff

Using massive matrices with bigmemory

Learning faster with parallel computing

Measuring execution time

Working in parallel with multicore and snow

Taking advantage of parallel with foreach and doParallel

Parallel cloud computing with MapReduce and Hadoop

GPU computing

Deploying optimized learning algorithms

Building bigger regression models with biglm

Growing bigger and faster random forests with bigrf

Training and evaluating models in parallel with caret

A. Reflect and Test Yourself Answers

Module 1: Data Analysis with R

Chapter 1: RefresheR

Chapter 2: The Shape of Data

Chapter 3: Describing Relationships

Chapter 4: Probability

Chapter 5: Using Data to Reason About the World

Chapter 6: Testing Hypotheses

Chapter 7: Bayesian Methods

Chapter 8: Predicting Continuous Variables

Chapter 9: Predicting Categorical Variables

Chapter 10: Sources of Data

Chapter 11: Dealing with Messy Data

Chapter 12: Dealing with Large Data

Module 2: R Graphs

Chapter 1: R Graphics

Chapter 2: Basic Graph Functions

Chapter 3: Beyond the Basics – Adjusting Key Parameters

Chapter 4: Creating Scatter Plots

Chapter 5: Creating Line Graphs and Time Series Charts

Chapter 6: Creating Bar, Dot, and Pie Charts

Chapter 7: Creating Histograms

Chapter 8: Box and Whisker Plots

Chapter 9: Creating Heat Maps and Contour Plots

Module 4: Mastering R for Quantitative Finance

Chapter 1: Time Series Analysis

Chapter 3: Forecasting Volume

Chapter 4: Big Data – Advanced Analytics

Chapter 5: FX Derivatives

Chapter 6: Interest Rate Derivatives and Models

Chapter 7: Exotic Options

Chapter 8: Optimal Hedging

Chapter 9: Fundamental Analysis

Module 5: Machine Learning with R

Chapter 1: Introducing Machine Learning

Chapter 2: Managing and Understanding Data

Chapter 3: Lazy Learning – Classification Using Nearest Neighbors

Chapter 4: Probabilistic Learning – Classification Using Naive Bayes

Chapter 5: Divide and Conquer – Classification Using Decision Trees and Rules

Chapter 6: Forecasting Numeric Data – Regression Methods

Chapter 7: Black Box Methods – Neural Networks and Support Vector Machines

Chapter 8: Finding Patterns – Market Basket Analysis Using Association Rules

B. Bibliography






