Mastering Text Mining with R
Table of Contents
Mastering Text Mining with R
Credits
About the Authors
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Statistical Linguistics with R
Probability theory and basic statistics
Probability space and event
Theorem of compound probabilities
Conditional probability
Bayes' formula for conditional probability
Independent events
Random variables
Discrete random variables
Continuous random variables
Probability frequency function
Probability distributions using R
Cumulative distribution function
Joint distribution
Binomial distribution
Poisson distribution
Counting occurrences
Zipf's law
Heaps' law
Lexical richness
Lexical variation
Lexical density
Lexical originality
Lexical sophistication
Language models
N-gram models
Markov assumption
Hidden Markov models
Quantitative methods in linguistics
Document term matrix
Inverse document frequency
Word similarity and edit-distance functions
Euclidean distance
Cosine similarity
Levenshtein distance
Damerau-Levenshtein distance
Hamming distance
Jaro-Winkler distance
Measuring readability of a text
Gunning fog index
R packages for text mining
OpenNLP
RWeka
RcmdrPlugin.temis
tm
languageR
koRpus
RKEA
maxent
lsa
Summary
2. Processing Text
Accessing text from diverse sources
File system
PDF documents
Microsoft Word documents
HTML
XML
JSON
HTTP
Databases
Processing text using regular expressions
Tokenization and segmentation
Word tokenization
Operations on a document-term matrix
Sentence segmentation
Normalizing texts
Lemmatization and stemming
Stemming
Lemmatization
Synonyms
Lexical diversity
Analyse lexical diversity
Calculate lexical diversity
Readability
Automated readability index
Language detection
Summary
3. Categorizing and Tagging Text
Parts of speech tagging
POS tagging with R packages
Hidden Markov Models for POS tagging
Basic definitions and notations
Implementing HMMs
Viterbi underflow
Forward algorithm underflow
OpenNLP chunking
Chunk tags
Collocation and contingency tables
Extracting co-occurrences
Surface co-occurrence
Textual co-occurrence
Syntactic co-occurrence
Co-occurrence in a document
Quantifying the relation between words
Contingency tables
Detailed analysis of textual collocations
Feature extraction
Synonymy and similarity
Multiwords, negation, and antonymy
Concept similarity
Path length
Resnik similarity
Lin similarity
Jiang-Conrath distance
Summary
4. Dimensionality Reduction
The curse of dimensionality
Distance concentration and computational infeasibility
Dimensionality reduction
Principal component analysis
Using R for PCA
Understanding the FactoMineR package
Amap package
Proportion of variance
Scree plot
Reconstruction error
Correspondence analysis
Canonical correspondence analysis
Pearson's Chi-squared test
Multiple correspondence analysis
Implementation of SVD using R
Summary
5. Text Summarization and Clustering
Topic modeling
Latent Dirichlet Allocation
Correlated topic model
Model selection
R package for topic modeling
Fitting the LDA model with the VEM algorithm
Latent semantic analysis
R package for latent semantic analysis
Illustrative example of LSA
Text clustering
Document clustering
Feature selection for text clustering
Mutual information
Chi-square statistic feature selection
Frequency-based feature selection
Sentence completion
Summary
6. Text Classification
Text classification
Document representation
Feature hashing
Classifiers – inductive learning
Tree-based learning
Bayesian classifiers: Naive Bayes classification
K-Nearest neighbors
Kernel methods
Support vector machines
Kernel trick
How to apply SVM to a real-world example?
Number of instances is significantly larger than the number of dimensions
Maximum entropy classifier
Maxent implementation in R
RTextTools: a text classification framework
Model evaluation
Confusion matrix
ROC curve
Precision-recall
Bias-variance trade-off and learning curve
Bias-variance decomposition
Learning curve
Dealing with reducible error components
Cross validation
Leave-one-out
k-Fold
Bootstrap
Stratified
Summary
7. Entity Recognition
Entity extraction
The rule-based approach
Machine learning
Sentence boundary detection
Word token annotator
Named entity recognition
Training a model with new features
Summary
Index