Clojure for Data Science
Table of Contents
Credits
About the Author
Acknowledgments
About the Reviewer
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Statistics
Downloading the sample code
Running the examples
Downloading the data
Inspecting the data
Data scrubbing
Descriptive statistics
The mean
Interpreting mathematical notation
The median
Variance
Quantiles
Binning data
Histograms
The normal distribution
The central limit theorem
Poincaré's baker
Generating distributions
Skewness
Quantile-quantile plots
Comparative visualizations
Box plots
Cumulative distribution functions
The importance of visualizations
Visualizing electorate data
Adding columns
Adding derived columns
Comparative visualizations of electorate data
Visualizing the Russian election data
Comparative visualizations
Probability mass functions
Scatter plots
Scatter transparency
Summary
2. Inference
Introducing AcmeContent
Download the sample code
Load and inspect the data
Visualizing the dwell times
The exponential distribution
The distribution of daily means
The central limit theorem
Standard error
Samples and populations
Confidence intervals
Sample comparisons
Bias
Visualizing different populations
Hypothesis testing
Significance
Testing a new site design
Performing a z-test
Student's t-distribution
Degrees of freedom
The t-statistic
Performing the t-test
Two-tailed tests
One-sample t-test
Resampling
Testing multiple designs
Calculating sample means
Multiple comparisons
Introducing the simulation
Compile the simulation
The browser simulation
jStat
B1
Scalable Vector Graphics
Plotting probability densities
State and Reagent
Updating state
Binding the interface
Simulating multiple tests
The Bonferroni correction
Analysis of variance
The F-distribution
The F-statistic
The F-test
Effect size
Cohen's d
Summary
3. Correlation
About the data
Inspecting the data
Visualizing the data
The log-normal distribution
Visualizing correlation
Jittering
Covariance
Pearson's correlation
Sample r and population rho
Hypothesis testing
Confidence intervals
Regression
Linear equations
Residuals
Ordinary least squares
Slope and intercept
Interpretation
Visualization
Assumptions
Goodness-of-fit and R-square
Multiple linear regression
Matrices
Dimensions
Vectors
Construction
Addition and scalar multiplication
Matrix-vector multiplication
Matrix-matrix multiplication
Transposition
The identity matrix
Inversion
The normal equation
More features
Multiple R-squared
Adjusted R-squared
Incanter's linear model
The F-test of model significance
Categorical and dummy variables
Relative power
Collinearity
Multicollinearity
Prediction
The confidence interval of a prediction
Model scope
The final model
Summary
4. Classification
About the data
Inspecting the data
Comparisons with relative risk and odds
The standard error of a proportion
Estimation using bootstrapping
The binomial distribution
The standard error of a proportion formula
Significance testing proportions
Adjusting standard errors for large samples
Chi-squared multiple significance testing
Visualizing the categories
The chi-squared test
The chi-squared statistic
The chi-squared test
Classification with logistic regression
The sigmoid function
The logistic regression cost function
Parameter optimization with gradient descent
Gradient descent with Incanter
Convexity
Implementing logistic regression with Incanter
Creating a feature matrix
Evaluating the logistic regression classifier
The confusion matrix
The kappa statistic
Probability
Bayes theorem
Bayes theorem with multiple predictors
Naive Bayes classification
Implementing a naive Bayes classifier
Evaluating the naive Bayes classifier
Comparing the logistic regression and naive Bayes approaches
Decision trees
Information
Entropy
Information gain
Using information gain to identify the best predictor
Recursively building a decision tree
Using the decision tree for classification
Evaluating the decision tree classifier
Classification with clj-ml
Loading data with clj-ml
Building a decision tree in clj-ml
Bias and variance
Overfitting
Cross-validation
Addressing high bias
Ensemble learning and random forests
Bagging and boosting
Saving the classifier to a file
Summary
5. Big Data
Downloading the code and data
Inspecting the data
Counting the records
The reducers library
Parallel folds with reducers
Loading large files with iota
Creating a reducers processing pipeline
Curried reductions with reducers
Statistical folds with reducers
Associativity
Calculating the mean using fold
Calculating the variance using fold
Mathematical folds with Tesser
Calculating covariance with Tesser
Commutativity
Simple linear regression with Tesser
Calculating a correlation matrix
Multiple regression with gradient descent
The gradient descent update rule
The gradient descent learning rate
Feature scaling
Feature extraction
Creating a custom Tesser fold
Creating a matrix-sum fold
Calculating the total model error
Creating a matrix-mean fold
Applying a single step of gradient descent
Running iterative gradient descent
Scaling gradient descent with Hadoop
Gradient descent on Hadoop with Tesser and Parkour
Parkour distributed sources and sinks
Running a feature scale fold with Hadoop
Running gradient descent with Hadoop
Preparing our code for a Hadoop cluster
Building an uberjar
Submitting the uberjar to Hadoop
Stochastic gradient descent
Stochastic gradient descent with Parkour
Defining a mapper
Parkour shaping functions
Defining a reducer
Specifying Hadoop jobs with Parkour graph
Chaining mappers and reducers with Parkour graph
Summary
6. Clustering
Downloading the data
Extracting the data
Inspecting the data
Clustering text
Set-of-words and the Jaccard index
Tokenizing the Reuters files
Applying the Jaccard index to documents
The bag-of-words and Euclidean distance
Representing text as vectors
Creating a dictionary
Creating term frequency vectors
The vector space model and cosine distance
Removing stop words
Stemming
Clustering with k-means and Incanter
Clustering the Reuters documents
Better clustering with TF-IDF
Zipf's law
Calculating the TF-IDF weight
k-means clustering with TF-IDF
Better clustering with n-grams
Large-scale clustering with Mahout
Converting text documents to a sequence file
Using Parkour to create Mahout vectors
Creating distributed unique IDs
Distributed unique IDs with Hadoop
Sharing data with the distributed cache
Building Mahout vectors from input documents
Running k-means clustering with Mahout
Viewing k-means clustering results
Interpreting the clustered output
Cluster evaluation measures
Inter-cluster density
Intra-cluster density
Calculating the root mean square error with Parkour
Loading clustered points and centroids
Calculating the cluster RMSE
Determining optimal k with the elbow method
Determining optimal k with the Dunn index
Determining optimal k with the Davies-Bouldin index
The drawbacks of k-means
The Mahalanobis distance measure
The curse of dimensionality
Summary
7. Recommender Systems
Download the code and data
Inspect the data
Parse the data
Types of recommender systems
Collaborative filtering
Item-based and user-based recommenders
Slope One recommenders
Calculating the item differences
Making recommendations
Practical considerations for user and item recommenders
Building a user-based recommender with Mahout
k-nearest neighbors
Recommender evaluation with Mahout
Evaluating distance measures
The Pearson correlation similarity
Spearman's rank similarity
Determining optimum neighborhood size
Information retrieval statistics
Precision
Recall
Mahout's information retrieval evaluator
F-measure and the harmonic mean
Fall-out
Normalized discounted cumulative gain
Plotting the information retrieval results
Recommendation with Boolean preferences
Implicit versus explicit feedback
Probabilistic methods for large sets
Testing set membership with Bloom filters
Jaccard similarity for large sets with MinHash
Reducing pair comparisons with locality-sensitive hashing
Bucketing signatures
Dimensionality reduction
Plotting the Iris dataset
Principal component analysis
Singular value decomposition
Large-scale machine learning with Apache Spark and MLlib
Loading data with Sparkling
Mapping data
Distributed datasets and tuples
Filtering data
Persistence and caching
Machine learning on Spark with MLlib
Movie recommendations with alternating least squares
ALS with Spark and MLlib
Making predictions with ALS
Evaluating ALS
Calculating the sum of squared errors
Summary
8. Network Analysis
Download the data
Inspecting the data
Visualizing graphs with Loom
Graph traversal with Loom
The seven bridges of Königsberg
Breadth-first and depth-first search
Finding the shortest path
Minimum spanning trees
Subgraphs and connected components
SCC and the bow-tie structure of the web
Whole-graph analysis
Scale-free networks
Distributed graph computation with GraphX
Creating RDGs with Glittering
Measuring graph density with triangle counting
GraphX partitioning strategies
Running the built-in triangle counting algorithm
Implement triangle counting with Glittering
Step one – collecting neighbor IDs
Steps two, three, and four – aggregate messages
Step five – dividing the counts
Running the custom triangle counting algorithm
The Pregel API
Connected components with the Pregel API
Step one – map vertices
Steps two and three – the message function
Step four – update the attributes
Step five – iterate to convergence
Running connected components
Calculating the size of the largest connected component
Detecting communities with label propagation
Step one – map vertices
Step two – send the vertex attribute
Step three – aggregate value
Step four – vertex function
Step five – set the maximum iterations count
Running label propagation
Measuring community influence using PageRank
The flow formulation
Implementing PageRank with Glittering
Sort by highest influence
Running PageRank to determine community influencers
Summary
9. Time Series
About the data
Loading the Longley data
Fitting curves with a linear model
Time series decomposition
Inspecting the airline data
Visualizing the airline data
Stationarity
De-trending and differencing
Discrete time models
Random walks
Autoregressive models
Determining autocorrelation in AR models
Moving-average models
Determining autocorrelation in MA models
Combining the AR and MA models
Calculating partial autocorrelation
Autocovariance
PACF with Durbin-Levinson recursion
Plotting partial autocorrelation
Determining ARMA model order with ACF and PACF
ACF and PACF of airline data
Removing seasonality with differencing
Maximum likelihood estimation
Calculating the likelihood
Estimating the maximum likelihood
Nelder-Mead optimization with Apache Commons Math
Identifying better models with Akaike Information Criterion
Time series forecasting
Forecasting with Monte Carlo simulation
Summary
10. Visualization
Download the code and data
Exploratory data visualization
Representing a two-dimensional histogram
Using Quil for visualization
Drawing to the sketch window
Quil's coordinate system
Plotting the grid
Specifying the fill color
Color and fill
Outputting an image file
Visualization for communication
Visualizing wealth distribution
Bringing data to life with Quil
Drawing bars of differing widths
Adding a title and axis labels
Improving the clarity with illustrations
Adding text to the bars
Incorporating additional data
Drawing complex shapes
Drawing curves
Plotting compound charts
Output to PDF
Summary
Index