Hadoop MapReduce v2 Cookbook Second Edition
Table of Contents
Hadoop MapReduce v2 Cookbook Second Edition
Credits
About the Author
Acknowledgments
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Getting Started with Hadoop v2
Introduction
Hadoop Distributed File System – HDFS
Hadoop YARN
Hadoop MapReduce
Hadoop installation modes
Setting up Hadoop v2 on your local machine
Getting ready
How to do it...
How it works...
Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode
Getting ready
How to do it...
How it works...
There's more...
See also
Adding a combiner step to the WordCount MapReduce program
How to do it...
How it works...
There's more...
Setting up HDFS
Getting ready
How to do it...
See also
Setting up Hadoop YARN in a distributed cluster environment using Hadoop v2
Getting ready
How to do it...
How it works...
See also
Setting up the Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution
Getting ready
How to do it...
There's more...
HDFS command-line file operations
Getting ready
How to do it...
How it works...
There's more...
Running the WordCount program in a distributed cluster environment
Getting ready
How to do it...
How it works...
There's more...
Benchmarking HDFS using DFSIO
Getting ready
How to do it...
How it works...
There's more...
Benchmarking Hadoop MapReduce using TeraSort
Getting ready
How to do it...
How it works...
2. Cloud Deployments – Using Hadoop YARN on Cloud Environments
Introduction
Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce
Getting ready
How to do it...
See also
Saving money using Amazon EC2 Spot Instances to execute EMR job flows
How to do it...
There's more...
See also
Executing a Pig script using EMR
How to do it...
There's more...
Starting a Pig interactive session
Executing a Hive script using EMR
How to do it...
There's more...
Starting a Hive interactive session
See also
Creating an Amazon EMR job flow using the AWS Command Line Interface
Getting ready
How to do it...
There's more...
See also
Deploying an Apache HBase cluster on Amazon EC2 using EMR
Getting ready
How to do it...
See also
Using EMR bootstrap actions to configure VMs for Amazon EMR jobs
How to do it...
There's more...
Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment
How to do it...
How it works...
See also
3. Hadoop Essentials – Configurations, Unit Tests, and Other APIs
Introduction
Optimizing Hadoop YARN and MapReduce configurations for cluster deployments
Getting ready
How to do it...
How it works...
There's more...
Shared user Hadoop clusters – using Fair and Capacity schedulers
How to do it...
How it works...
There's more...
Setting classpath precedence to user-provided JARs
How to do it...
How it works...
Speculative execution of straggling tasks
How to do it...
There's more...
Unit testing Hadoop MapReduce applications using MRUnit
Getting ready
How to do it...
See also
Integration testing Hadoop MapReduce applications using MiniYarnCluster
Getting ready
How to do it...
See also
Adding a new DataNode
Getting ready
How to do it...
There's more...
Rebalancing HDFS
See also
Decommissioning DataNodes
How to do it...
How it works...
See also
Using multiple disks/volumes and limiting HDFS disk usage
How to do it...
Setting the HDFS block size
How to do it...
There's more...
See also
Setting the file replication factor
How to do it...
How it works...
There's more...
See also
Using the HDFS Java API
How to do it...
How it works...
There's more...
Configuring the FileSystem object
Retrieving the list of data blocks of a file
4. Developing Complex Hadoop MapReduce Applications
Introduction
Choosing appropriate Hadoop data types
How to do it...
There's more...
See also
Implementing a custom Hadoop Writable data type
How to do it...
How it works...
There's more...
See also
Implementing a custom Hadoop key type
How to do it...
How it works...
See also
Emitting data of different value types from a Mapper
How to do it...
How it works...
There's more...
See also
Choosing a suitable Hadoop InputFormat for your input data format
How to do it...
How it works...
There's more...
See also
Adding support for new input data formats – implementing a custom InputFormat
How to do it...
How it works...
There's more...
See also
Formatting the results of MapReduce computations – using Hadoop OutputFormats
How to do it...
How it works...
There's more...
Writing multiple outputs from a MapReduce computation
How to do it...
How it works...
Using multiple input data types and multiple Mapper implementations in a single MapReduce application
See also
Hadoop intermediate data partitioning
How to do it...
How it works...
There's more...
TotalOrderPartitioner
KeyFieldBasedPartitioner
Secondary sorting – sorting Reduce input values
How to do it...
How it works...
See also
Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache
How to do it...
How it works...
There's more...
Distributing archives using the DistributedCache
Adding resources to the DistributedCache from the command line
Adding resources to the classpath using the DistributedCache
Using Hadoop with legacy applications – Hadoop streaming
How to do it...
How it works...
There's more...
See also
Adding dependencies between MapReduce jobs
How to do it...
How it works...
There's more...
Hadoop counters to report custom metrics
How to do it...
How it works...
5. Analytics
Introduction
Simple analytics using MapReduce
Getting ready
How to do it...
How it works...
There's more...
Performing GROUP BY using MapReduce
Getting ready
How to do it...
How it works...
Calculating frequency distributions and sorting using MapReduce
Getting ready
How to do it...
How it works...
There's more...
Plotting the Hadoop MapReduce results using gnuplot
Getting ready
How to do it...
How it works...
There's more...
Calculating histograms using MapReduce
Getting ready
How to do it...
How it works...
Calculating scatter plots using MapReduce
Getting ready
How to do it...
How it works...
Parsing a complex dataset with Hadoop
Getting ready
How to do it...
How it works...
There's more...
Joining two datasets using MapReduce
Getting ready
How to do it...
How it works...
6. Hadoop Ecosystem – Apache Hive
Introduction
Getting started with Apache Hive
How to do it...
See also
Creating databases and tables using Hive CLI
Getting ready
How to do it...
How it works...
There's more...
Hive data types
Hive external tables
Using the describe formatted command to inspect the metadata of Hive tables
Simple SQL-style data querying using Apache Hive
Getting ready
How to do it...
How it works...
There's more...
Using Apache Tez as the execution engine for Hive
See also
Creating and populating Hive tables and views using Hive query results
Getting ready
How to do it...
Utilizing different storage formats in Hive – storing table data using ORC files
Getting ready
How to do it...
How it works...
Using Hive built-in functions
Getting ready
How to do it...
How it works...
There's more...
See also
Hive batch mode – using a query file
How to do it...
How it works...
There's more...
See also
Performing a join with Hive
Getting ready
How to do it...
How it works...
See also
Creating partitioned Hive tables
Getting ready
How to do it...
Writing Hive User-defined Functions (UDFs)
Getting ready
How to do it...
How it works...
HCatalog – performing Java MapReduce computations on data mapped to Hive tables
Getting ready
How to do it...
How it works...
HCatalog – writing data to Hive tables from Java MapReduce computations
Getting ready
How to do it...
How it works...
7. Hadoop Ecosystem II – Pig, HBase, Mahout, and Sqoop
Introduction
Getting started with Apache Pig
Getting ready
How to do it...
How it works...
There's more...
See also
Joining two datasets using Pig
How to do it...
How it works...
There's more...
Accessing Hive table data in Pig using HCatalog
Getting ready
How to do it...
There's more...
See also
Getting started with Apache HBase
Getting ready
How to do it...
There's more...
See also
Data random access using Java client APIs
Getting ready
How to do it...
How it works...
Running MapReduce jobs on HBase
Getting ready
How to do it...
How it works...
Using Hive to insert data into HBase tables
Getting ready
How to do it...
See also
Getting started with Apache Mahout
How to do it...
How it works...
There's more...
Running K-means with Mahout
Getting ready
How to do it...
How it works...
Importing data to HDFS from a relational database using Apache Sqoop
Getting ready
How to do it...
Exporting data from HDFS to a relational database using Apache Sqoop
Getting ready
How to do it...
8. Searching and Indexing
Introduction
Generating an inverted index using Hadoop MapReduce
Getting ready
How to do it...
How it works...
There's more...
Outputting a randomly accessible indexed InvertedIndex
See also
Intradomain web crawling using Apache Nutch
Getting ready
How to do it...
See also
Indexing and searching web documents using Apache Solr
Getting ready
How to do it...
How it works...
See also
Configuring Apache HBase as the backend data store for Apache Nutch
Getting ready
How to do it...
How it works...
See also
Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
Getting ready
How to do it...
How it works...
See also
Elasticsearch for indexing and searching
Getting ready
How to do it...
How it works...
See also
Generating the in-links graph for crawled web pages
Getting ready
How to do it...
How it works...
See also
9. Classifications, Recommendations, and Finding Relationships
Introduction
Performing content-based recommendations
How to do it...
How it works...
There's more...
Classification using the naïve Bayes classifier
How to do it...
How it works...
Assigning advertisements to keywords using the Adwords balance algorithm
How to do it...
How it works...
There's more...
10. Mass Text Data Processing
Introduction
Data preprocessing using Hadoop streaming and Python
Getting ready
How to do it...
How it works...
There's more...
See also
De-duplicating data using Hadoop streaming
Getting ready
How to do it...
How it works...
See also
Loading large datasets into an Apache HBase data store – importtsv and bulkload
Getting ready
How to do it...
How it works...
There's more...
Data de-duplication using HBase
See also
Creating TF and TF-IDF vectors for the text data
Getting ready
How to do it...
How it works...
See also
Clustering text data using Apache Mahout
Getting ready
How to do it...
How it works...
See also
Topic discovery using Latent Dirichlet Allocation (LDA)
Getting ready
How to do it...
How it works...
See also
Document classification using Mahout Naive Bayes Classifier
Getting ready
How to do it...
How it works...
See also
Index