Learning Cascading (e-book)

Author: Michael Covert

Publisher: Packt Publishing

Publication date: 2015-05-29

Word count: 3.851 million

This book is intended for software developers, system architects and analysts, big data project managers, and data scientists who wish to deploy big data solutions using the Cascading framework. You must have a basic understanding of the big data paradigm and should be familiar with Java development techniques.
Learning Cascading

Table of Contents

Learning Cascading

Credits

Foreword

About the Authors

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

1. The Big Data Core Technology Stack

Reviewing Hadoop

Hadoop architecture

HDFS – the Hadoop Distributed File System

The NameNode

The secondary NameNode

DataNodes

MapReduce execution framework

The JobTracker

The TaskTracker

Hadoop jobs

Distributed cache

Counters

YARN – MapReduce version 2

A simple MapReduce job

Beyond MapReduce

The Cascading framework

The execution graph and flow planner

How Cascading produces MapReduce jobs

Summary

2. Cascading Basics in Detail

Understanding common Cascading themes

Data flows as processes

Understanding how Cascading represents records

Using tuples and defining fields

Using a Fields object, named field groups, and selectors

Data typing and coercion

Defining schemes

Schemes in detail

TupleEntry

Understanding how Cascading controls data flow

Using pipes

Creating and chaining

Pipe operations

Each

Splitting

GroupBy and sorting

Every

Merging and joining

The Merge pipe

The join pipes – CoGroup and HashJoin

CoGroup

HashJoin

Default output selectors

Using taps

Flow

FlowConnector

Cascades

Local and Hadoop modes

Common errors

Putting it all together

Summary

3. Understanding Custom Operations

Understanding operations

Operations and fields

The Operation class and interface hierarchy

The basic operation lifecycle

Contexts

FlowProcess

OperationCall<Context>

An operation processing sequence and its methods

Operation types

Each operations

Filters

Filter calling sequence

Built-in filters

Function

Function calling sequence

Built-in functions

Every operations

Aggregator

Aggregator calling sequence

Built-in aggregators

Buffers

Buffer calling sequence

Built-in buffers

Assertions

ValueAssertion calling sequence

GroupAssertion calling sequence

AssertionLevel

Using assertions

Built-in assertions

A note about implementing BaseOperation methods

Summary

4. Creating Custom Operations

Writing custom operations

Writing a filter

Writing a function

Writing an aggregator

Writing a custom assertion

Writing a buffer

Identifying common use cases for custom operations

Putting it all together

Summary

5. Code Reuse and Integration

Creating and using subassemblies

Built-in subassemblies

Creating a new custom subassembly

Using custom subassemblies

Using cascades

Building a complex workflow using cascades

Skipping a flow in a cascade

Intermediate file management

Dynamically controlling flows

Instrumentation and counters

Using counters to control flow

Using existing MapReduce jobs

Using fluent programming techniques

The FlowDef fluent interface

Integrating external components

Flow and cascade events

Using external JAR files

Using Cascading as insulation from big data migrations and upgrades

Summary

6. Testing a Cascading Application

Debugging a Cascading application

Getting your environment ready for debugging

Using Cascading local mode debugging

Setting up Eclipse

Remote debugging

Using assertions

The Debug() filter

Managing exceptions with traps

Checkpoints

Managing bad data

Viewing flow sequencing using DOT files

Testing strategies

Unit testing and JUnit

Mocking

Integration testing

Load and performance testing

Summary

7. Optimizing the Performance of a Cascading Application

Optimizing performance

Optimizing Cascading

Optimizing Hadoop

A note about the effective use of checkpoints

Summary

8. Creating a Real-world Application in Cascading

Project description – Business Intelligence case study on monitoring the competition

Project scope – understanding requirements

Understanding the project domain – text analytics and natural language processing (NLP)

Conducting a simple named entity extraction

Defining the project – the Cascading development methodology

Project roles and responsibilities

Conducting data analysis

Performing functional decomposition

Designing the process and components

Creating and integrating the operations

Creating and using subassemblies

Building the workflow

Building flows

Managing the context

Building the cascade

Designing the test plan

Performing a unit test

Performing an integration test

Performing a cluster test

Performing a full load test

Refining and adjusting

Software packaging and delivery to the cluster

Next steps

Summary

9. Planning for Future Growth

Finding online resources

Using other Cascading tools

Lingual

Pattern

Driven

Fluid

Load

Multitool

Support for other languages

Hortonworks

Custom taps

Cascading serializers

Java open source mock frameworks

Summary

A. Downloadable Software

Contents

Installing and using

Index
