Clojure for Data Science (eBook)


Author: Henry Garner

Publisher: Packt Publishing

Publication date: 2015-09-03

Word count: 4.307 million


Book Description
Statistics, big data, and machine learning for Clojure programmers.

About This Book

• Write code using Clojure to harness the power of your data
• Discover the libraries and frameworks that will help you succeed
• A practical guide to understanding how the Clojure programming language can be used to derive insights from data

Who This Book Is For

This book is aimed at developers who are already productive in Clojure but who are overwhelmed by the breadth and depth of understanding required to be effective in the field of data science. Whether you're tasked with delivering a specific analytics project or simply suspect that you could be deriving more value from your data, this book will inspire you with the opportunities, and inform you of the risks, that exist in data of all shapes and sizes.

What You Will Learn

• Perform hypothesis testing and understand feature selection and statistical significance to interpret your results with confidence
• Implement the core machine learning techniques of regression, classification, clustering, and recommendation
• Understand the value of simple statistics and distributions in exploratory data analysis
• Scale algorithms efficiently to web-sized datasets using distributed programming models on Hadoop and Spark
• Apply suitable analytic approaches to text, graph, and time series data
• Interpret the terminology that you will encounter in technical papers
• Import libraries from other JVM languages such as Java and Scala
• Communicate your findings clearly and convincingly to nontechnical colleagues

In Detail

The term “data science” has been widely used to describe a new profession expected to interpret vast datasets and translate them into improved decision-making and performance. Clojure is a powerful language that combines the interactivity of a scripting language with the speed of a compiled language. Together with its rich ecosystem of native libraries and an extremely simple and consistent functional approach to data manipulation, which maps closely to mathematical formulas, it is an ideal, practical, and flexible language to meet a data scientist's diverse needs.

Taking you on a journey from simple summary statistics to sophisticated machine learning algorithms, this book shows how the Clojure programming language can be used to derive insights from data. Data scientists often forge a novel path, and you'll see how to make use of Clojure's Java interoperability to access libraries such as Mahout and MLlib for which Clojure wrappers don't yet exist. Even seasoned Clojure developers will gain a deeper appreciation of their language's flexibility.

You'll learn how to apply statistical thinking to your own data and use Clojure to explore, analyze, and visualize it in a technically and statistically robust way. You can also use Incanter for local data processing and ClojureScript to present interactive visualizations, and you'll understand how distributed platforms such as Hadoop and Spark's MapReduce and GraphX's BSP solve the challenges of data analysis at scale, and how to express algorithms using those programming models.

Above all, by following the explanations in this book, you'll learn not just how to be effective using the current state-of-the-art methods in data science, but why such methods work, so that you can continue to be productive as the field evolves.

Style and approach

This is a practical guide to data science that teaches theory by example through the libraries and frameworks accessible from the Clojure programming language.
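To give a flavor of the “simple and consistent functional approach to data manipulation” described above, here is a minimal sketch in plain Clojure. It is illustrative only, not taken from the book: the arithmetic mean written as a direct transcription of its mathematical definition, the sum of the values divided by their count.

    ;; A minimal sketch, not from the book: the mean as a direct
    ;; transcription of its definition, (sum of xs) / (count of xs).
    (defn mean [xs]
      (/ (reduce + xs) (count xs)))

    ;; Clojure returns exact rationals for integer input:
    (mean [1 2 3 4])      ;; => 5/2
    (mean [3.0 4.0 8.0])  ;; => 5.0

Incanter, which the book uses for local data processing, provides the same statistic as incanter.stats/mean.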
Contents

Clojure for Data Science

Table of Contents

Clojure for Data Science

Credits

About the Author

Acknowledgments

About the Reviewer

www.PacktPub.com

Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

1. Statistics

Downloading the sample code

Running the examples

Downloading the data

Inspecting the data

Data scrubbing

Descriptive statistics

The mean

Interpreting mathematical notation

The median

Variance

Quantiles

Binning data

Histograms

The normal distribution

The central limit theorem

Poincaré's baker

Generating distributions

Skewness

Quantile-quantile plots

Comparative visualizations

Box plots

Cumulative distribution functions

The importance of visualizations

Visualizing electorate data

Adding columns

Adding derived columns

Comparative visualizations of electorate data

Visualizing the Russian election data

Comparative visualizations

Probability mass functions

Scatter plots

Scatter transparency

Summary

2. Inference

Introducing AcmeContent

Download the sample code

Load and inspect the data

Visualizing the dwell times

The exponential distribution

The distribution of daily means

The central limit theorem

Standard error

Samples and populations

Confidence intervals

Sample comparisons

Bias

Visualizing different populations

Hypothesis testing

Significance

Testing a new site design

Performing a z-test

Student's t-distribution

Degrees of freedom

The t-statistic

Performing the t-test

Two-tailed tests

One-sample t-test

Resampling

Testing multiple designs

Calculating sample means

Multiple comparisons

Introducing the simulation

Compile the simulation

The browser simulation

jStat

B1

Scalable Vector Graphics

Plotting probability densities

State and Reagent

Updating state

Binding the interface

Simulating multiple tests

The Bonferroni correction

Analysis of variance

The F-distribution

The F-statistic

The F-test

Effect size

Cohen's d

Summary

3. Correlation

About the data

Inspecting the data

Visualizing the data

The log-normal distribution

Visualizing correlation

Jittering

Covariance

Pearson's correlation

Sample r and population rho

Hypothesis testing

Confidence intervals

Regression

Linear equations

Residuals

Ordinary least squares

Slope and intercept

Interpretation

Visualization

Assumptions

Goodness-of-fit and R-square

Multiple linear regression

Matrices

Dimensions

Vectors

Construction

Addition and scalar multiplication

Matrix-vector multiplication

Matrix-matrix multiplication

Transposition

The identity matrix

Inversion

The normal equation

More features

Multiple R-squared

Adjusted R-squared

Incanter's linear model

The F-test of model significance

Categorical and dummy variables

Relative power

Collinearity

Multicollinearity

Prediction

The confidence interval of a prediction

Model scope

The final model

Summary

4. Classification

About the data

Inspecting the data

Comparisons with relative risk and odds

The standard error of a proportion

Estimation using bootstrapping

The binomial distribution

The standard error of a proportion formula

Significance testing proportions

Adjusting standard errors for large samples

Chi-squared multiple significance testing

Visualizing the categories

The chi-squared test

The chi-squared statistic

The chi-squared test

Classification with logistic regression

The sigmoid function

The logistic regression cost function

Parameter optimization with gradient descent

Gradient descent with Incanter

Convexity

Implementing logistic regression with Incanter

Creating a feature matrix

Evaluating the logistic regression classifier

The confusion matrix

The kappa statistic

Probability

Bayes theorem

Bayes theorem with multiple predictors

Naive Bayes classification

Implementing a naive Bayes classifier

Evaluating the naive Bayes classifier

Comparing the logistic regression and naive Bayes approaches

Decision trees

Information

Entropy

Information gain

Using information gain to identify the best predictor

Recursively building a decision tree

Using the decision tree for classification

Evaluating the decision tree classifier

Classification with clj-ml

Loading data with clj-ml

Building a decision tree in clj-ml

Bias and variance

Overfitting

Cross-validation

Addressing high bias

Ensemble learning and random forests

Bagging and boosting

Saving the classifier to a file

Summary

5. Big Data

Downloading the code and data

Inspecting the data

Counting the records

The reducers library

Parallel folds with reducers

Loading large files with iota

Creating a reducers processing pipeline

Curried reductions with reducers

Statistical folds with reducers

Associativity

Calculating the mean using fold

Calculating the variance using fold

Mathematical folds with Tesser

Calculating covariance with Tesser

Commutativity

Simple linear regression with Tesser

Calculating a correlation matrix

Multiple regression with gradient descent

The gradient descent update rule

The gradient descent learning rate

Feature scaling

Feature extraction

Creating a custom Tesser fold

Creating a matrix-sum fold

Calculating the total model error

Creating a matrix-mean fold

Applying a single step of gradient descent

Running iterative gradient descent

Scaling gradient descent with Hadoop

Gradient descent on Hadoop with Tesser and Parkour

Parkour distributed sources and sinks

Running a feature scale fold with Hadoop

Running gradient descent with Hadoop

Preparing our code for a Hadoop cluster

Building an uberjar

Submitting the uberjar to Hadoop

Stochastic gradient descent

Stochastic gradient descent with Parkour

Defining a mapper

Parkour shaping functions

Defining a reducer

Specifying Hadoop jobs with Parkour graph

Chaining mappers and reducers with Parkour graph

Summary

6. Clustering

Downloading the data

Extracting the data

Inspecting the data

Clustering text

Set-of-words and the Jaccard index

Tokenizing the Reuters files

Applying the Jaccard index to documents

The bag-of-words and Euclidean distance

Representing text as vectors

Creating a dictionary

Creating term frequency vectors

The vector space model and cosine distance

Removing stop words

Stemming

Clustering with k-means and Incanter

Clustering the Reuters documents

Better clustering with TF-IDF

Zipf's law

Calculating the TF-IDF weight

k-means clustering with TF-IDF

Better clustering with n-grams

Large-scale clustering with Mahout

Converting text documents to a sequence file

Using Parkour to create Mahout vectors

Creating distributed unique IDs

Distributed unique IDs with Hadoop

Sharing data with the distributed cache

Building Mahout vectors from input documents

Running k-means clustering with Mahout

Viewing k-means clustering results

Interpreting the clustered output

Cluster evaluation measures

Inter-cluster density

Intra-cluster density

Calculating the root mean square error with Parkour

Loading clustered points and centroids

Calculating the cluster RMSE

Determining optimal k with the elbow method

Determining optimal k with the Dunn index

Determining optimal k with the Davies-Bouldin index

The drawbacks of k-means

The Mahalanobis distance measure

The curse of dimensionality

Summary

7. Recommender Systems

Download the code and data

Inspect the data

Parse the data

Types of recommender systems

Collaborative filtering

Item-based and user-based recommenders

Slope One recommenders

Calculating the item differences

Making recommendations

Practical considerations for user and item recommenders

Building a user-based recommender with Mahout

k-nearest neighbors

Recommender evaluation with Mahout

Evaluating distance measures

The Pearson correlation similarity

Spearman's rank similarity

Determining optimum neighborhood size

Information retrieval statistics

Precision

Recall

Mahout's information retrieval evaluator

F-measure and the harmonic mean

Fall-out

Normalized discounted cumulative gain

Plotting the information retrieval results

Recommendation with Boolean preferences

Implicit versus explicit feedback

Probabilistic methods for large sets

Testing set membership with Bloom filters

Jaccard similarity for large sets with MinHash

Reducing pair comparisons with locality-sensitive hashing

Bucketing signatures

Dimensionality reduction

Plotting the Iris dataset

Principal component analysis

Singular value decomposition

Large-scale machine learning with Apache Spark and MLlib

Loading data with Sparkling

Mapping data

Distributed datasets and tuples

Filtering data

Persistence and caching

Machine learning on Spark with MLlib

Movie recommendations with alternating least squares

ALS with Spark and MLlib

Making predictions with ALS

Evaluating ALS

Calculating the sum of squared errors

Summary

8. Network Analysis

Download the data

Inspecting the data

Visualizing graphs with Loom

Graph traversal with Loom

The seven bridges of Königsberg

Breadth-first and depth-first search

Finding the shortest path

Minimum spanning trees

Subgraphs and connected components

SCC and the bow-tie structure of the web

Whole-graph analysis

Scale-free networks

Distributed graph computation with GraphX

Creating RDGs with Glittering

Measuring graph density with triangle counting

GraphX partitioning strategies

Running the built-in triangle counting algorithm

Implementing triangle counting with Glittering

Step one – collecting neighbor IDs

Steps two, three, and four – aggregate messages

Step five – dividing the counts

Running the custom triangle counting algorithm

The Pregel API

Connected components with the Pregel API

Step one – map vertices

Steps two and three – the message function

Step four – update the attributes

Step five – iterate to convergence

Running connected components

Calculating the size of the largest connected component

Detecting communities with label propagation

Step one – map vertices

Step two – send the vertex attribute

Step three – aggregate value

Step four – vertex function

Step five – set the maximum iterations count

Running label propagation

Measuring community influence using PageRank

The flow formulation

Implementing PageRank with Glittering

Sort by highest influence

Running PageRank to determine community influencers

Summary

9. Time Series

About the data

Loading the Longley data

Fitting curves with a linear model

Time series decomposition

Inspecting the airline data

Visualizing the airline data

Stationarity

De-trending and differencing

Discrete time models

Random walks

Autoregressive models

Determining autocorrelation in AR models

Moving-average models

Determining autocorrelation in MA models

Combining the AR and MA models

Calculating partial autocorrelation

Autocovariance

PACF with Durbin-Levinson recursion

Plotting partial autocorrelation

Determining ARMA model order with ACF and PACF

ACF and PACF of airline data

Removing seasonality with differencing

Maximum likelihood estimation

Calculating the likelihood

Estimating the maximum likelihood

Nelder-Mead optimization with Apache Commons Math

Identifying better models with Akaike Information Criterion

Time series forecasting

Forecasting with Monte Carlo simulation

Summary

10. Visualization

Download the code and data

Exploratory data visualization

Representing a two-dimensional histogram

Using Quil for visualization

Drawing to the sketch window

Quil's coordinate system

Plotting the grid

Specifying the fill color

Color and fill

Outputting an image file

Visualization for communication

Visualizing wealth distribution

Bringing data to life with Quil

Drawing bars of differing widths

Adding a title and axis labels

Improving the clarity with illustrations

Adding text to the bars

Incorporating additional data

Drawing complex shapes

Drawing curves

Plotting compound charts

Output to PDF

Summary

Index
