Natural Language Processing: Python and NLTK (e-book)

Author: Nitin Hardeniya

Publisher: Packt Publishing

Publication date: 2016-11-01

Word count: 7.259 million

Learn to build expert NLP and machine learning projects using NLTK and other Python libraries

About This Book

  • Break text down into its component parts for spelling correction, feature extraction, and phrase transformation
  • Work through NLP concepts with simple and easy-to-follow programming recipes
  • Gain insights into the current and budding research topics of NLP

Who This Book Is For

If you are an NLP or machine learning enthusiast and an intermediate Python programmer who wants to quickly master NLTK for natural language processing, then this Learning Path is for you. Students of linguistics and semantic/sentiment analysis professionals will find it invaluable.

What You Will Learn

  • The scope of natural language complexity and how it is processed by machines
  • Clean and wrangle text using tokenization and chunking to help you process data better
  • Tokenize text into sentences and sentences into words
  • Classify text and perform sentiment analysis
  • Implement string matching algorithms and normalization techniques
  • Understand and implement the concepts of information retrieval and text summarization
  • Find out how to implement various NLP tasks in Python

In Detail

Natural Language Processing is a field of computational linguistics and artificial intelligence that deals with human-computer interaction. It provides a seamless interaction between computers and human beings and gives computers the ability to understand human speech with the help of machine learning. The number of human-computer interaction instances is increasing, so it’s becoming imperative that computers comprehend all major natural languages.

The first module, NLTK Essentials, is an introduction to building systems around NLP, with a focus on creating a customized tokenizer and parser from scratch. You will learn essential concepts of NLP, be given practical insight into the open source tools and libraries available in Python, be shown how to analyze social media sites, and be given tools to deal with large-scale text. This module also draws on the capabilities of Python libraries such as NLTK, scikit-learn, pandas, and NumPy.

The second module, Python 3 Text Processing with NLTK 3 Cookbook, teaches you the essential techniques of text and language processing with simple, straightforward examples. This includes organizing text corpora, creating your own custom corpus, text classification with a focus on sentiment analysis, and distributed text processing methods.

The third module, Mastering Natural Language Processing with Python, will help you become an expert and assist you in creating your own NLP projects using NLTK. You will be guided through model development with machine learning tools, shown how to create training data, and given insight into the best practices for designing and building NLP-based applications using Python.

This Learning Path combines some of the best that Packt has to offer in one complete, curated package and is designed to help you quickly learn text processing with Python and NLTK. It includes content from the following Packt products:

  • NLTK Essentials by Nitin Hardeniya
  • Python 3 Text Processing with NLTK 3 Cookbook by Jacob Perkins
  • Mastering Natural Language Processing with Python by Deepti Chopra, Nisheeth Joshi, and Iti Mathur

Style and approach

This comprehensive course creates a smooth learning path that teaches you how to get started with Natural Language Processing using Python and NLTK. You’ll learn to create effective NLP and machine learning projects using Python and NLTK.
Table of Contents

Credits

Preface

What this learning path covers

What you need for this learning path

Who this learning path is for

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Module 1

1. Introduction to Natural Language Processing

Why learn NLP?

Let's start playing with Python!

Lists

Helping yourself

Regular expressions

Dictionaries

Writing functions

Diving into NLTK

Your turn

Summary

2. Text Wrangling and Cleansing

What is text wrangling?

Text cleansing

Sentence splitter

Tokenization

Stemming

Lemmatization

Stop word removal

Rare word removal

Spell correction

Your turn

Summary

3. Part of Speech Tagging

What is part-of-speech tagging?

Stanford tagger

Diving deep into a tagger

Sequential tagger

N-gram tagger

Regex tagger

Brill tagger

Machine learning based tagger

Named Entity Recognition (NER)

NER tagger

Your turn

Summary

4. Parsing Structure in Text

Shallow versus deep parsing

The two approaches in parsing

Why we need parsing

Different types of parsers

A recursive descent parser

A shift-reduce parser

A chart parser

A regex parser

Dependency parsing

Chunking

Information extraction

Named-entity recognition (NER)

Relation extraction

Summary

5. NLP Applications

Building your first NLP application

Other NLP applications

Machine translation

Statistical machine translation

Information retrieval

Boolean retrieval

Vector space model

The probabilistic model

Speech recognition

Text classification

Information extraction

Question answering systems

Dialog systems

Word sense disambiguation

Topic modeling

Language detection

Optical character recognition

Summary

6. Text Classification

Machine learning

Text classification

Sampling

Naive Bayes

Decision trees

Stochastic gradient descent

Logistic regression

Support vector machines

The Random forest algorithm

Text clustering

K-means

Topic modeling in text

Installing gensim

References

Summary

7. Web Crawling

Web crawlers

Writing your first crawler

Data flow in Scrapy

The Scrapy shell

Items

The Sitemap spider

The item pipeline

External references

Summary

8. Using NLTK with Other Python Libraries

NumPy

ndarray

Indexing

Basic operations

Extracting data from an array

Complex matrix operations

Reshaping and stacking

Random numbers

SciPy

Linear algebra

Eigenvalues and eigenvectors

The sparse matrix

Optimization

pandas

Reading data

Series data

Column transformation

Noisy data

matplotlib

Subplot

Adding an axis

A scatter plot

A bar plot

3D plots

External references

Summary

9. Social Media Mining in Python

Data collection

Twitter

Data extraction

Trending topics

Geovisualization

Influencers detection

Facebook

Influencer friends

Summary

10. Text Mining at Scale

Different ways of using Python on Hadoop

Python streaming

Hive/Pig UDF

Streaming wrappers

NLTK on Hadoop

A UDF

Python streaming

Scikit-learn on Hadoop

PySpark

Summary

2. Module 2

1. Tokenizing Text and WordNet Basics

Introduction

Tokenizing text into sentences

Getting ready

How to do it...

How it works...

There's more...

Tokenizing sentences in other languages

See also

Tokenizing sentences into words

How to do it...

How it works...

There's more...

Separating contractions

PunktWordTokenizer

WordPunctTokenizer

See also

Tokenizing sentences using regular expressions

Getting ready

How to do it...

How it works...

There's more...

Simple whitespace tokenizer

See also

Training a sentence tokenizer

Getting ready

How to do it...

How it works...

There's more...

See also

Filtering stopwords in a tokenized sentence

Getting ready

How to do it...

How it works...

There's more...

See also

Looking up Synsets for a word in WordNet

Getting ready

How to do it...

How it works...

There's more...

Working with hypernyms

Part of speech (POS)

See also

Looking up lemmas and synonyms in WordNet

How to do it...

How it works...

There's more...

All possible synonyms

Antonyms

See also

Calculating WordNet Synset similarity

How to do it...

How it works...

There's more...

Comparing verbs

Path and Leacock-Chodorow (LCH) similarity

See also

Discovering word collocations

Getting ready

How to do it...

How it works...

There's more...

Scoring functions

Scoring ngrams

See also

2. Replacing and Correcting Words

Introduction

Stemming words

How to do it...

How it works...

There's more...

The LancasterStemmer class

The RegexpStemmer class

The SnowballStemmer class

See also

Lemmatizing words with WordNet

Getting ready

How to do it...

How it works...

There's more...

Combining stemming with lemmatization

See also

Replacing words matching regular expressions

Getting ready

How to do it...

How it works...

There's more...

Replacement before tokenization

See also

Removing repeating characters

Getting ready

How to do it...

How it works...

There's more...

See also

Spelling correction with Enchant

Getting ready

How to do it...

How it works...

There's more...

The en_GB dictionary

Personal word lists

See also

Replacing synonyms

Getting ready

How to do it...

How it works...

There's more...

CSV synonym replacement

YAML synonym replacement

See also

Replacing negations with antonyms

How to do it...

How it works...

There's more...

See also

3. Creating Custom Corpora

Introduction

Setting up a custom corpus

Getting ready

How to do it...

How it works...

There's more...

Loading a YAML file

See also

Creating a wordlist corpus

Getting ready

How to do it...

How it works...

There's more...

Names wordlist corpus

English words corpus

See also

Creating a part-of-speech tagged word corpus

Getting ready

How to do it...

How it works...

There's more...

Customizing the word tokenizer

Customizing the sentence tokenizer

Customizing the paragraph block reader

Customizing the tag separator

Converting tags to a universal tagset

See also

Creating a chunked phrase corpus

Getting ready

How to do it...

How it works...

There's more...

Tree leaves

Treebank chunk corpus

CoNLL2000 corpus

See also

Creating a categorized text corpus

Getting ready

How to do it...

How it works...

There's more...

Category file

Categorized tagged corpus reader

Categorized corpora

See also

Creating a categorized chunk corpus reader

Getting ready

How to do it...

How it works...

There's more...

Categorized CoNLL chunk corpus reader

See also

Lazy corpus loading

How to do it...

How it works...

There's more...

Creating a custom corpus view

How to do it...

How it works...

There's more...

Block reader functions

Pickle corpus view

Concatenated corpus view

See also

Creating a MongoDB-backed corpus reader

Getting ready

How to do it...

How it works...

There's more...

See also

Corpus editing with file locking

Getting ready

How to do it...

How it works...

4. Part-of-speech Tagging

Introduction

Default tagging

Getting ready

How to do it...

How it works...

There's more...

Evaluating accuracy

Tagging sentences

Untagging a tagged sentence

See also

Training a unigram part-of-speech tagger

How to do it...

How it works...

There's more...

Overriding the context model

Minimum frequency cutoff

See also

Combining taggers with backoff tagging

How to do it...

How it works...

There's more...

Saving and loading a trained tagger with pickle

See also

Training and combining ngram taggers

Getting ready

How to do it...

How it works...

There's more...

Quadgram tagger

See also

Creating a model of likely word tags

How to do it...

How it works...

There's more...

See also

Tagging with regular expressions

Getting ready

How to do it...

How it works...

There's more...

See also

Affix tagging

How to do it...

How it works...

There's more...

Working with min_stem_length

See also

Training a Brill tagger

How to do it...

How it works...

There's more...

Tracing

See also

Training the TnT tagger

How to do it...

How it works...

There's more...

Controlling the beam search

Significance of capitalization

See also

Using WordNet for tagging

Getting ready

How to do it...

How it works...

See also

Tagging proper names

How to do it...

How it works...

See also

Classifier-based tagging

How to do it...

How it works...

There's more...

Detecting features with a custom feature detector

Setting a cutoff probability

Using a pre-trained classifier

See also

Training a tagger with NLTK-Trainer

How to do it...

How it works...

There's more...

Saving a pickled tagger

Training on a custom corpus

Training with universal tags

Analyzing a tagger against a tagged corpus

Analyzing a tagged corpus

See also

5. Extracting Chunks

Introduction

Chunking and chinking with regular expressions

Getting ready

How to do it...

How it works...

There's more...

Parsing different chunk types

Parsing alternative patterns

Chunk rule with context

See also

Merging and splitting chunks with regular expressions

How to do it...

How it works...

There's more...

Specifying rule descriptions

See also

Expanding and removing chunks with regular expressions

How to do it...

How it works...

There's more...

See also

Partial parsing with regular expressions

How to do it...

How it works...

There's more...

The ChunkScore metrics

Looping and tracing chunk rules

See also

Training a tagger-based chunker

How to do it...

How it works...

There's more...

Using different taggers

See also

Classification-based chunking

How to do it...

How it works...

There's more...

Using a different classifier builder

See also

Extracting named entities

How to do it...

How it works...

There's more...

Binary named entity extraction

See also

Extracting proper noun chunks

How to do it...

How it works...

There's more...

See also

Extracting location chunks

How to do it...

How it works...

There's more...

See also

Training a named entity chunker

How to do it...

How it works...

There's more...

See also

Training a chunker with NLTK-Trainer

How to do it...

How it works...

There's more...

Saving a pickled chunker

Training a named entity chunker

Training on a custom corpus

Training on parse trees

Analyzing a chunker against a chunked corpus

Analyzing a chunked corpus

See also

6. Transforming Chunks and Trees

Introduction

Filtering insignificant words from a sentence

Getting ready

How to do it...

How it works...

There's more...

See also

Correcting verb forms

Getting ready

How to do it...

How it works...

See also

Swapping verb phrases

How to do it...

How it works...

There's more...

See also

Swapping noun cardinals

How to do it...

How it works...

See also

Swapping infinitive phrases

How to do it...

How it works...

There's more...

See also

Singularizing plural nouns

How to do it...

How it works...

See also

Chaining chunk transformations

How to do it...

How it works...

There's more...

See also

Converting a chunk tree to text

How to do it...

How it works...

There's more...

See also

Flattening a deep tree

Getting ready

How to do it...

How it works...

There's more...

The cess_esp and cess_cat treebank

See also

Creating a shallow tree

How to do it...

How it works...

See also

Converting tree labels

Getting ready

How to do it...

How it works...

See also

7. Text Classification

Introduction

Bag of words feature extraction

How to do it...

How it works...

There's more...

Filtering stopwords

Including significant bigrams

See also

Training a Naive Bayes classifier

Getting ready

How to do it...

How it works...

There's more...

Classification probability

Most informative features

Training estimator

Manual training

See also

Training a decision tree classifier

How to do it...

How it works...

There's more...

Controlling uncertainty with entropy_cutoff

Controlling tree depth with depth_cutoff

Controlling decisions with support_cutoff

See also

Training a maximum entropy classifier

Getting ready

How to do it...

How it works...

There's more...

Megam algorithm

See also

Training scikit-learn classifiers

Getting ready

How to do it...

How it works...

There's more...

Comparing Naive Bayes algorithms

Training with logistic regression

Training with LinearSVC

See also

Measuring precision and recall of a classifier

How to do it...

How it works...

There's more...

F-measure

See also

Calculating high information words

How to do it...

How it works...

There's more...

The MaxentClassifier class with high information words

The DecisionTreeClassifier class with high information words

The SklearnClassifier class with high information words

See also

Combining classifiers with voting

Getting ready

How to do it...

How it works...

See also

Classifying with multiple binary classifiers

Getting ready

How to do it...

How it works...

There's more...

See also

Training a classifier with NLTK-Trainer

How to do it...

How it works...

There's more...

Saving a pickled classifier

Using different training instances

The most informative features

The Maxent and LogisticRegression classifiers

SVMs

Combining classifiers

High information words and bigrams

Cross-fold validation

Analyzing a classifier

See also

8. Distributed Processing and Handling Large Datasets

Introduction

Distributed tagging with execnet

Getting ready

How to do it...

How it works...

There's more...

Creating multiple channels

Local versus remote gateways

See also

Distributed chunking with execnet

Getting ready

How to do it...

How it works...

There's more...

Python subprocesses

See also

Parallel list processing with execnet

How to do it...

How it works...

There's more...

See also

Storing a frequency distribution in Redis

Getting ready

How to do it...

How it works...

There's more...

See also

Storing a conditional frequency distribution in Redis

Getting ready

How to do it...

How it works...

There's more...

See also

Storing an ordered dictionary in Redis

Getting ready

How to do it...

How it works...

There's more...

See also

Distributed word scoring with Redis and execnet

Getting ready

How to do it...

How it works...

There's more...

See also

9. Parsing Specific Data Types

Introduction

Parsing dates and times with dateutil

Getting ready

How to do it...

How it works...

There's more...

See also

Timezone lookup and conversion

Getting ready

How to do it...

How it works...

There's more...

Local timezone

Custom offsets

See also

Extracting URLs from HTML with lxml

Getting ready

How to do it...

How it works...

There's more...

Extracting links directly

Parsing HTML from URLs or files

Extracting links with XPaths

See also

Cleaning and stripping HTML

Getting ready

How to do it...

How it works...

There's more...

See also

Converting HTML entities with BeautifulSoup

Getting ready

How to do it...

How it works...

There's more...

Extracting URLs with BeautifulSoup

See also

Detecting and converting character encodings

Getting ready

How to do it...

How it works...

There's more...

Converting to ASCII

UnicodeDammit conversion

See also

A. Penn Treebank Part-of-speech Tags

3. Module 3

1. Working with Strings

Tokenization

Tokenization of text into sentences

Tokenization of text in other languages

Tokenization of sentences into words

Tokenization using TreebankWordTokenizer

Tokenization using regular expressions

Normalization

Eliminating punctuation

Conversion into lowercase and uppercase

Dealing with stop words

Calculate stopwords in English

Substituting and correcting tokens

Replacing words using regular expressions

Example of the replacement of a text with another text

Performing substitution before tokenization

Dealing with repeating characters

Example of deleting repeating characters

Replacing a word with its synonym

Example of substituting a word with its synonym

Applying Zipf's law to text

Similarity measures

Applying similarity measures using the edit distance algorithm

Applying similarity measures using Jaccard's Coefficient

Applying similarity measures using the Smith Waterman distance

Other string similarity metrics

Summary

2. Statistical Language Modeling

Understanding word frequency

Develop MLE for a given text

Hidden Markov Model estimation

Applying smoothing on the MLE model

Add-one smoothing

Good Turing

Kneser Ney estimation

Witten Bell estimation

Develop a back-off mechanism for MLE

Applying interpolation on data to get mix and match

Evaluate a language model through perplexity

Applying Metropolis-Hastings in modeling languages

Applying Gibbs sampling in language processing

Summary

3. Morphology – Getting Our Feet Wet

Introducing morphology

Understanding stemmer

Understanding lemmatization

Developing a stemmer for a non-English language

Morphological analyzer

Morphological generator

Search engine

Summary

4. Parts-of-Speech Tagging – Identifying Words

Introducing parts-of-speech tagging

Default tagging

Creating POS-tagged corpora

Selecting a machine learning algorithm

Statistical modeling involving the n-gram approach

Developing a chunker using POS-tagged corpora

Summary

5. Parsing – Analyzing Training Data

Introducing parsing

Treebank construction

Extracting Context Free Grammar (CFG) rules from Treebank

Creating a probabilistic Context Free Grammar from CFG

CYK chart parsing algorithm

Earley chart parsing algorithm

Summary

6. Semantic Analysis – Meaning Matters

Introducing semantic analysis

Introducing NER

A NER system using Hidden Markov Model

Training NER using Machine Learning Toolkits

NER using POS tagging

Generation of the synset id from WordNet

Disambiguating senses using WordNet

Summary

7. Sentiment Analysis – I Am Happy

Introducing sentiment analysis

Sentiment analysis using NER

Sentiment analysis using machine learning

Evaluation of the NER system

Summary

8. Information Retrieval – Accessing Information

Introducing information retrieval

Stop word removal

Information retrieval using a vector space model

Vector space scoring and query operator interaction

Developing an IR system using latent semantic indexing

Text summarization

Question-answering system

Summary

9. Discourse Analysis – Knowing Is Believing

Introducing discourse analysis

Discourse analysis using Centering Theory

Anaphora resolution

Summary

10. Evaluation of NLP Systems – Analyzing Performance

The need for evaluation of NLP systems

Evaluation of NLP tools (POS taggers, stemmers, and morphological analyzers)

Parser evaluation using gold data

Evaluation of IR system

Metrics for error identification

Metrics based on lexical matching

Metrics based on syntactic matching

Metrics using shallow semantic matching

Summary

B. Bibliography

Index
