
Learning Pentaho Data Integration 8 CE - Third Edition (eBook)


Author: María Carina Roldán

Publisher: Packt Publishing

Publication date: 2017-12-05

Word count: 495,000




Description

Get up and running with the Pentaho Data Integration tool using this hands-on, easy-to-read guide.

About This Book

  • Manipulate your data by exploring, transforming, validating, and integrating it using Pentaho Data Integration 8 CE
  • A comprehensive guide exploring the features of Pentaho Data Integration 8 CE
  • Connect to any database engine, explore your databases, and perform all kinds of operations on relational databases

Who This Book Is For

This book is a must-have for software developers, business intelligence analysts, IT students, or anyone involved or interested in developing ETL solutions. If you plan on using Pentaho Data Integration for any data manipulation task, this book will help you as well. It is also a good starting point for data warehouse designers, architects, or anyone responsible for data warehouse projects who needs to load data into them.

What You Will Learn

  • Explore the features and capabilities of Pentaho Data Integration 8 Community Edition
  • Install and get started with PDI
  • Learn the ins and outs of Spoon, the graphical designer tool
  • Get data from all kinds of data sources, such as plain files, Excel spreadsheets, databases, and XML files
  • Use Pentaho Data Integration to perform CRUD (create, read, update, and delete) operations on relational databases
  • Populate a data mart with Pentaho Data Integration
  • Use Pentaho Data Integration to organize files and folders, run daily processes, deal with errors, and more

In Detail

Pentaho Data Integration (PDI) is an intuitive graphical environment packed with drag-and-drop design and powerful Extract-Transform-Load (ETL) capabilities. This book shows and explains the new interactive features of Spoon, the revamped look and feel, and the newest features of the tool, including the Transformation and Job Executors and the invaluable Metadata Injection capability. We begin with the installation of the PDI software and then move on to cover all the key PDI concepts. Each chapter introduces new features, enabling you to gradually gain practice with the tool. First, you will learn to do all kinds of data manipulation and work with simple plain files. Then, the book teaches you how to work with relational databases inside PDI. Moreover, you will be given a primer on data warehouse concepts and learn how to load data into a data warehouse. Along the way, you will become familiar with PDI's intuitive, graphical, drag-and-drop design environment. By the end of this book, you will know everything you need to meet your data manipulation requirements. In addition, you will learn best practices and receive advice for designing and deploying your projects.

Style and approach

A step-by-step guide filled with practical, real-world scenarios and examples.
Table of Contents

Title Page

Third Edition

Copyright

Learning Pentaho Data Integration 8 CE

Third Edition

Credits

About the Author

About the Reviewers

www.PacktPub.com

Why subscribe?

Customer Feedback

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Getting Started with Pentaho Data Integration

Pentaho Data Integration and Pentaho BI Suite

Introducing Pentaho Data Integration

Using PDI in real-world scenarios

Loading data warehouses or data marts

Integrating data

Data cleansing

Migrating information

Exporting data

Integrating PDI along with other Pentaho tools

Installing PDI

Launching the PDI Graphical Designer - Spoon

Starting and customizing Spoon

Exploring the Spoon interface

Extending the PDI functionality through the Marketplace

Introducing transformations

The basics about transformations

Creating a Hello World! Transformation

Designing a Transformation

Previewing and running a Transformation

Installing useful related software

Summary

Getting Started with Transformations

Designing and previewing transformations

Getting familiar with editing features

Using the mouseover assistance toolbar

Adding steps and creating hops

Working with grids

Designing transformations

Putting the editing features in practice

Previewing and fixing errors as they appear

Looking at the results in the execution results pane

The Logging tab

The Step Metrics tab

Running transformations in an interactive fashion

Understanding PDI data and metadata

Understanding the PDI rowset

Adding or modifying fields by using different PDI steps

Explaining the PDI data types

Handling errors

Implementing the error handling functionality

Customizing the error handling

Summary

Creating Basic Task Flows

Introducing jobs

Learning the basics about jobs

Creating a Simple Job

Designing and running jobs

Revisiting the Spoon interface and the editing features

Designing jobs

Getting familiar with the job design process

Looking at the results in the Execution results window

The Logging tab

The Job metrics tab

Enriching your work by sending an email

Running transformations from a Job

Using the Transformation Job Entry

Understanding and changing the flow of execution

Changing the flow of execution based on conditions

Forcing a status with an abort Job or success entry

Changing the execution to be synchronous

Managing files

Creating a Job that moves some files

Selecting files and folders

Working with regular expressions

Summarizing the Job entries that deal with files

Customizing the file management

Knowing the basics about Kettle variables

Understanding the kettle.properties file

How and when you can use variables

Summary

Reading and Writing Files

Reading data from files

Reading a simple file

Troubleshooting reading files

Learning to read all kinds of files

Specifying the name and location of the file

Reading several files at the same time

Reading files that are compressed or located on a remote server

Reading a file whose name is known at runtime

Describing the incoming fields

Reading Date fields

Reading Numeric fields

Reading only a subset of the file

Reading the most common kinds of sources

Reading text files

Reading spreadsheets

Reading XML files

Reading JSON files

Outputting data to files

Creating a simple file

Learning to create all kinds of files and write data into them

Providing the name and location of an output file

Creating a file whose name is known only at runtime

Creating several files whose names depend on the content of the file

Describing the content of the output file

Formatting Date fields

Formatting Numeric fields

Creating the most common kinds of files

Creating text files

Creating spreadsheets

Creating XML files

Creating JSON files

Working with Big Data and cloud sources

Reading files from an AWS S3 instance

Writing files to an AWS S3 instance

Getting data from HDFS

Sending data to HDFS

Summary

Manipulating PDI Data and Metadata

Manipulating simple fields

Working with strings

Extracting parts of strings using regular expressions

Searching and replacing using regular expressions

Doing some math with Numeric fields

Operating with dates

Performing simple operations on dates

Subtracting dates with the Calculator step

Getting information relative to the current date

Using the Get System Info step

Performing other useful operations on dates

Getting the month names with a User Defined Java Class step

Modifying the metadata of streams

Working with complex structures

Working with XML

Introducing XML terminology

Getting familiar with the XPath notation

Parsing XML structures with PDI

Reading an XML file with the Get data from XML step

Parsing an XML structure stored in a field

PDI Transformation and Job files

Parsing JSON structures

Introducing JSON terminology

Getting familiar with the JSONPath notation

Parsing JSON structures with PDI

Reading a JSON file with the JSON input step

Parsing a JSON structure stored in a field

Summary

Controlling the Flow of Data

Filtering data

Filtering rows upon conditions

Reading a file and getting the list of words found in it

Filtering unwanted rows with a Filter rows step

Filtering rows by using the Java Filter step

Filtering data based on row numbers

Splitting streams unconditionally

Copying rows

Distributing rows

Introducing partitioning and clustering

Splitting the stream based on conditions

Splitting a stream based on a simple condition

Exploring PDI steps for splitting a stream based on conditions

Merging streams in several ways

Merging two or more streams

Customizing the way of merging streams

Looking up data

Looking up data with a Stream lookup step

Summary

Cleansing, Validating, and Fixing Data

Cleansing data

Cleansing data by example

Standardizing information

Improving the quality of data

Introducing PDI steps useful for cleansing data

Dealing with non-exact matches

Cleansing by doing a fuzzy search

Deduplicating non-exact matches

Validating data

Validating data with PDI

Validating and reporting errors to the log

Introducing common validations and their implementation with PDI

Treating invalid data by splitting and merging streams

Fixing data that doesn't match the rules

Summary

Manipulating Data by Coding

Doing simple tasks with the JavaScript step

Using the JavaScript language in PDI

Inserting JavaScript code using the JavaScript step

Adding fields

Modifying fields

Organizing your code

Controlling the flow using predefined constants

Testing the script using the Test script button

Parsing unstructured files with JavaScript

Doing simple tasks with the Java Class step

Using the Java language in PDI

Inserting Java code using the Java Class step

Learning to insert Java code in a Java Class step

Data types equivalence

Adding fields

Modifying fields

Controlling the flow with the putRow() function

Testing the Java Class using the Test class button

Getting the most out of the Java Class step

Receiving parameters

Reading data from additional steps

Redirecting data to different target steps

Parsing JSON structures

Avoiding coding using purpose-built steps

Summary

Transforming the Dataset

Sorting data

Sorting a dataset with the sort rows step

Working on groups of rows

Aggregating data

Summarizing the PDI steps that operate on sets of rows

Converting rows to columns

Converting row data to column data using the Row denormaliser step

Aggregating data with a Row Denormaliser step

Normalizing data

Modifying the dataset with a Row Normaliser step

Going forward and backward across rows

Picking rows backward and forward with the Analytic Query step

Summary

Performing Basic Operations with Databases

Connecting to a database and exploring its content

Connecting with Relational Database Management Systems

Exploring a database with the Database Explorer

Previewing and getting data from a database

Getting data from the database with the Table input step

Using the Table input step to run flexible queries

Adding parameters to your queries

Using Kettle variables in your queries

Inserting, updating, and deleting data

Inserting new data into a database table

Inserting or updating data with the Insert / Update step

Deleting records of a database table with the Delete step

Performing CRUD operations with more flexibility

Verifying a connection, running DDL scripts, and doing other useful tasks

Looking up data in different ways

Doing simple lookups with the Database Value Lookup step

Making a performance difference when looking up data in a database

Performing complex database lookups

Looking for data using a Database join step

Looking for data using a Dynamic SQL row step

Summary

Loading Data Marts with PDI

Preparing the environment

Exploring the Jigsaw database model

Creating the database and configuring the environment

Introducing dimensional modeling

Loading dimensions with data

Learning the basics of dimensions

Understanding dimensions' technical details

Loading a time dimension

Introducing and loading Type I slowly changing dimensions

Loading a Type I SCD with a combination lookup/update step

Introducing and loading Type II slowly changing dimensions

Loading Type II SCDs with a dimension lookup/update step

Loading a Type II SCD for the first time

Loading a Type II SCD and verifying how history is kept

Explaining and loading Type III SCD and Hybrid SCD

Loading other kinds of dimensions

Loading a mini dimension

Loading junk dimensions

Explaining degenerate dimensions

Loading fact tables

Learning the basics about fact tables

Deciding the level of granularity

Translating the business keys into surrogate keys

Obtaining the surrogate key for Type I SCD

Obtaining the surrogate key for Type II SCD

Obtaining the surrogate key for the junk dimension

Obtaining the surrogate key for a time dimension

Loading a cumulative fact table

Loading a snapshot fact table

Loading a fact table by inserting snapshot data

Loading a fact table by overwriting snapshot data

Summary

Creating Portable and Reusable Transformations

Defining and using Kettle variables

Introducing all kinds of Kettle variables

Explaining predefined variables

Revisiting the kettle.properties file

Defining variables at runtime

Setting a variable with a constant value

Setting a variable with a value unknown beforehand

Setting variables with partial or total results of your flow

Defining and using named parameters

Using variables as fields of your stream

Creating reusable Transformations

Creating and executing sub-transformations

Creating and testing a sub-transformation

Executing a sub-transformation

Introducing more elaborate sub-transformations

Making the data flow between transformations

Transferring data using the copy/get rows mechanism

Executing transformations in an iterative way

Using Transformation executors

Configuring the executors with advanced settings

Getting the results of the execution of the inner transformation

Working with groups of data

Using variables and named parameters

Continuing the flow after executing the inner transformation

Summary

Implementing Metadata Injection

Introducing metadata injection

Explaining how metadata injection works

Creating a template Transformation

Injecting metadata

Discovering metadata and injecting it

Identifying use cases to implement metadata injection

Summary

Creating Advanced Jobs

Enhancing your processes with the use of variables

Running nested jobs

Understanding the scope of variables

Using named parameters

Using variables to create flexible processes

Using variables to name jobs and transformations

Using variables to name Job and Transformation folders

Accessing copied rows for different purposes

Using the copied rows to manage files in advanced ways

Using the copied rows as parameters of a Job or Transformation

Working with filelists

Maintaining a filelist

Using the filelist for different purposes

Attaching files in an email

Copying, moving, and deleting files

Introducing other ways to process the filelist

Executing jobs in an iterative way

Using Job executors

Configuring the executors with advanced settings

Getting the results of the execution of the job

Working with groups of data

Using variables and named parameters

Capturing the result filenames

Summary

Launching Transformations and Jobs from the Command Line

Using the Pan and Kitchen utilities

Running jobs and transformations

Checking the exit code

Supplying named parameters and variables

Using command-line arguments

Deciding between the use of a command-line argument and named parameters

Sending the output of executions to log files

Automating the execution

Summary

Best Practices for Designing and Deploying a PDI Project

Setting up a new project

Setting up the local environment

Defining a folder structure for the project

Dealing with external resources

Defining and adopting a versioning system

Best practices to design jobs and transformations

Styling your work

Making the work portable

Designing and developing reusable jobs and transformations

Maximizing the performance

Analyzing Step Metrics

Analyzing performance graphs

Deploying the project in different environments

Modifying the Kettle home directory

Modifying the Kettle home in Windows

Modifying the Kettle home in Unix-like systems

Summary
