Data Lake Development with Big Data
Table of Contents
Data Lake Development with Big Data
Credits
About the Authors
Acknowledgement
About the Reviewer
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
1. The Need for Data Lake
Before the Data Lake
Need for Data Lake
Defining Data Lake
Key benefits of Data Lake
Challenges in implementing a Data Lake
When to go for a Data Lake implementation
Data Lake architecture
Architectural considerations
Architectural composition
Architectural details
Understanding Data Lake layers
The Data Governance and Security Layer
The Information Lifecycle Management layer
The Metadata Layer
Understanding Data Lake tiers
The Data Intake tier
The Source System Zone
The Transient Zone
The Raw Zone
Batch Raw Storage
The real-time Raw Storage
The Data Management tier
The Integration Zone
The Enrichment Zone
The Data Hub Zone
The Data Consumption tier
The Data Discovery Zone
The Data Provisioning Zone
Summary
2. Data Intake
Understanding Intake tier zones
Source System Zone functionalities
Understanding connectivity processing
Understanding Intake Processing for data variety
Structured data
The need for integrating Structured Data in the Data Lake
Structured data loading approaches
Semi-structured data
The need for integrating semi-structured data in the Data Lake
Semi-structured data loading approaches
Unstructured data
The need for integrating Unstructured data in the Data Lake
Unstructured data loading approaches
Transient Landing Zone functionalities
File validation checks
File duplication checks
File integrity checks
File size checks
File periodicity checks
Data Integrity checks
Checking record counts
Checking for column counts
Schema validation checks
Raw Storage Zone functionalities
Data lineage processes
Watermarking process
Metadata capture
Deep Integrity checks
Bit Level Integrity checks
Periodic checksum checks
Security and governance
Information Lifecycle Management
Practical Data Ingestion scenarios
Architectural guidance
Structured data use cases
Semi-structured and unstructured data use cases
Big Data tools and technologies
Ingestion of structured data
Sqoop
Use case scenarios for Sqoop
WebHDFS
Use case scenarios for WebHDFS
Ingestion of streaming data
Apache Flume
Use case scenarios for Flume
Fluentd
Use case scenarios for Fluentd
Kafka
Use case scenarios for Kafka
Amazon Kinesis
Use case scenarios for Kinesis
Apache Storm
Use case scenarios for Storm
Summary
3. Data Integration, Quality, and Enrichment
Introduction to the Data Management Tier
Understanding Data Integration
Introduction to Data Integration
Prominent features of Data Integration
Loosely coupled Integration
Ease of use
Secure access
High-quality data
Lineage tracking
Practical Data Integration scenarios
The workings of Data Integration
Raw data discovery
Data quality assessment
Profiling the data
Data cleansing
Deletion of missing, null, or invalid values
Imputation of missing, null, or invalid values
Data transformations
Unstructured text transformation techniques
Structured data transformations
Data enrichment
Collect metadata and track data lineage
Traditional Data Integration versus Data Lake
Data pipelines
Addressing the limitations using Data Lake
Data partitioning
Addressing the limitations using Data Lake
Scale on demand
Addressing the limitations using Data Lake
Data ingest parallelism
Addressing the limitations using Data Lake
Extensibility
Addressing the limitations using Data Lake
Big Data tools and technologies
Syncsort
Use case scenarios for Syncsort
Talend
Use case scenarios for Talend
Pentaho
Use case scenarios for Pentaho
Summary
4. Data Discovery and Consumption
Understanding the Data Consumption tier
Data Consumption – Traditional versus Data Lake
An introduction to Data Consumption
Practical Data Consumption scenarios
Data Discovery and metadata
Enabling Data Discovery
Data classification
Classifying unstructured data
Named entity recognition
Topic modeling
Text clustering
Applications of data classification
Relation extraction
Extracting relationships from unstructured data
Feature-based methods
Understanding how feature-based methods work
Implementation
Semantic technologies
Understanding how semantic technologies work
Implementation
Extracting Relationships from structured data
Applications of relation extraction
Indexing data
Inverted index
Understanding how inverted index works
Implementation
Applications of Indexing
Performing Data Discovery
Semantic search
Word sense disambiguation
Latent Semantic Analysis
Faceted search
Fuzzy search
Edit distance
Wildcard and regular expressions
Data Provisioning and metadata
Data publication
Data subscription
Data Provisioning functionalities
Data formatting
Data selection
Data Provisioning approaches
Post-provisioning processes
Architectural guidance
Data Discovery
Big Data tools and technologies
Elasticsearch
Use case scenarios for Elasticsearch
IBM InfoSphere Data Explorer
Use case scenarios for IBM InfoSphere Data Explorer
Tableau
Use case scenarios for Tableau
Splunk
Use case scenarios for Splunk
Data Provisioning
Big Data tools and technologies
Data Dispatch
Use case scenarios for Data Dispatch
Summary
5. Data Governance
Understanding Data Governance
Introduction to Data Governance
The need for Data Governance
Governing Big Data in the Data Lake
Data Governance – Traditional versus Data Lake
Practical Data Governance scenarios
Data Governance components
Metadata management and lineage tracking
Data security and privacy
Big Data implications for security and privacy
Security issues in the Data Lake tiers
The Intake Tier
The Management Tier
The Consumption Tier
Information Lifecycle Management
Big Data implications for ILM
Implementing ILM using Data Lake
The Intake Tier
The Management Tier
The Consumption Tier
Architectural guidance
Big Data tools and technologies
Apache Falcon
Understanding how Falcon works
Use case scenarios for Falcon
Apache Atlas
Understanding how Atlas works
Use case scenarios for Atlas
IBM Big Data platform
Understanding how governance is provided in IBM Big Data platform
Use case scenarios for IBM Big Data platform
The current and future trends
Data Lake and future enterprise trajectories
Future Data Lake technologies
Summary
Index