Pentaho 3.2 Data Integration: Beginner's Guide (e-book)

Author: Maria Carina Roldan

Publisher: Packt Publishing

Publication date: 2010-04-09

Word count: 2.44 million

As part of Packt's Beginner's Guide series, this book teaches by example. It walks you through every aspect of Pentaho Data Integration (PDI) with step-by-step instructions in a friendly style, so you can learn at your computer while working with the tool. The extensive use of drawings and screenshots makes learning PDI easy, and numerous tips and helpful hints that you will not find anywhere else are provided throughout. The book offers short, practical examples and also builds a small datamart from scratch, both to reinforce the concepts learned and to teach you the basics of data warehousing.

This book is for software developers, database administrators, IT students, and anyone involved or interested in developing ETL solutions or, more generally, doing any kind of data manipulation. If you have never used PDI before, this is a perfect book to start with. It is also a good starting point if you are a database administrator, data warehouse designer, architect, or anyone responsible for data warehouse projects who needs to load data into them. You don't need any prior data warehouse or database experience to read this book: fundamental database and data warehouse terms and concepts are explained in easy-to-understand language.
Pentaho 3.2 Data Integration Beginner's Guide

Table of Contents

Pentaho 3.2 Data Integration

Credits

Foreword

The Kettle Project

About the Author

About the Reviewers

Preface

How to read this book

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Errata

Piracy

Questions

1. Getting Started with Pentaho Data Integration

Pentaho Data Integration and Pentaho BI Suite

Exploring the Pentaho Demo

Pentaho Data Integration

Using PDI in real world scenarios

Loading datawarehouses or datamarts

Integrating data

Data cleansing

Migrating information

Exporting data

Integrating PDI using Pentaho BI

Pop quiz—PDI data sources

Installing PDI

Time for action—installing PDI

What just happened?

Pop quiz—PDI prerequisites

Launching the PDI graphical designer: Spoon

Time for action—starting and customizing Spoon

What just happened?

Spoon

Setting preferences in the Options window

Storing transformations and jobs in a repository

Creating your first transformation

Time for action—creating a hello world transformation

What just happened?

Directing the Kettle engine with transformations

Exploring the Spoon interface

Viewing the transformation structure

Running and previewing the transformation

Time for action—running and previewing the hello_world transformation

What just happened?

Previewing the results in the Execution Results window

Pop quiz—PDI basics

Installing MySQL

Time for action—installing MySQL on Windows

What just happened?

Time for action—installing MySQL on Ubuntu

What just happened?

Summary

2. Getting Started with Transformations

Reading data from files

Time for action—reading results of football matches from files

What just happened?

Input files

Input steps

Reading several files at once

Time for action—reading all your files at a time using a single Text file input step

What just happened?

Time for action—reading all your files at a time using a single Text file input step and regular expressions

What just happened?

Regular expressions

Troubleshooting reading files

Grids

Have a go hero—explore your own files

Sending data to files

Time for action—sending the results of matches to a plain file

What just happened?

Output files

Output steps

Some data definitions

Rowset

Streams

The Select values step

Have a go hero—extending your transformations by writing output files

Getting system information

Time for action—updating a file with news about examinations

What just happened?

Getting information by using Get System Info step

Data types

Date fields

Numeric fields

Running transformations from a terminal window

Time for action—running the examination transformation from a terminal window

What just happened?

Have a go hero—using different date formats

Have a go hero—formatting 99.55

Pop quiz—formatting data

XML files

Time for action—getting data from an XML file with information about countries

What just happened?

What is XML

PDI transformation files

Getting data from XML files

XPath

Configuring the Get data from XML step

Kettle variables

How and when you can use variables

Have a go hero—exploring XML files

Have a go hero—enhancing the output countries file

Have a go hero—documenting your work

Summary

3. Basic Data Manipulation

Basic calculations

Time for action—reviewing examinations by using the Calculator step

What just happened?

Adding or modifying fields by using different PDI steps

The Calculator step

The Formula step

Time for action—reviewing examinations by using the Formula step

What just happened?

Have a go hero—listing students and their examinations results

Pop quiz—concatenating strings

Calculations on groups of rows

Time for action—calculating World Cup statistics by grouping data

What just happened?

Group by step

Have a go hero—calculating statistics for the examinations

Have a go hero—listing the languages spoken by country

Filtering

Time for action—counting frequent words by filtering

What just happened?

Filtering rows using the Filter rows step

Have a go hero—playing with filters

Have a go hero—counting words and discarding those that are commonly used

Looking up data

Time for action—finding out which language people speak

What just happened?

The Stream lookup step

Have a go hero—counting words more precisely

Summary

4. Controlling the Flow of Data

Splitting streams

Time for action—browsing new PDI features by copying a dataset

What just happened?

Copying rows

Have a go hero—recalculating statistics

Distributing rows

Time for action—assigning tasks by distributing

What just happened?

Pop quiz—data movement (copying and distributing)

Splitting the stream based on conditions

Time for action—assigning tasks by filtering priorities with the Filter rows step

What just happened?

PDI steps for splitting the stream based on conditions

Time for action—assigning tasks by filtering priorities with the Switch / Case step

What just happened?

Have a go hero—listing languages and countries

Pop quiz—splitting a stream

Merging streams

Time for action—gathering progress and merging all together

What just happened?

PDI options for merging streams

Time for action—giving priority to Bouchard by using Append Stream

What just happened?

Have a go hero—sorting and merging all tasks

Have a go hero—trying to find missing countries

Summary

5. Transforming Your Data with JavaScript Code and the JavaScript Step

Doing simple tasks with the JavaScript step

Time for action—calculating scores with JavaScript

What just happened?

Using the JavaScript language in PDI

Inserting JavaScript code using the Modified Java Script Value step

Adding fields

Modifying fields

Turning on the compatibility switch

Have a go hero—adding and modifying fields to the contest data

Testing your code

Time for action—testing the calculation of averages

What just happened?

Testing the script using the Test script button

Have a go hero—testing the new calculation of the average

Enriching the code

Time for action—calculating flexible scores by using variables

What just happened?

Using named parameters

Using the special Start, Main, and End scripts

Using transformation predefined constants

Pop quiz—finding the 7 errors

Have a go hero—keeping the top 10 performances

Have a go hero—calculating scores with Java code

Reading and parsing unstructured files

Time for action—changing a list of house descriptions with JavaScript

What just happened?

Looking at previous rows

Have a go hero—enhancing the houses file

Have a go hero—fill gaps in the contest file

Avoiding coding by using purpose-built steps

Have a go hero—creating alternative solutions

Summary

6. Transforming the Row Set

Converting rows to columns

Time for action—enhancing a films file by converting rows to columns

What just happened?

Converting row data to column data by using the Row denormalizer step

Have a go hero—houses revisited

Aggregating data with a Row denormalizer step

Time for action—calculating total scores by performances by country

What just happened?

Using Row denormalizer for aggregating data

Have a go hero—calculating scores by skill by continent

Normalizing data

Time for action—enhancing the matches file by normalizing the dataset

What just happened?

Modifying the dataset with a Row Normalizer step

Summarizing the PDI steps that operate on sets of rows

Have a go hero—verifying the benefits of normalization

Have a go hero—normalizing the Films file

Have a go hero—calculating scores by judge

Generating a custom time dimension dataset by using Kettle variables

Time for action—creating the time dimension dataset

What just happened?

Getting variables

Time for action—getting variables for setting the default starting date

What just happened?

Using the Get Variables step

Have a go hero—enhancing the time dimension

Pop quiz—using Kettle variables inside transformations

Summary

7. Validating Data and Handling Errors

Capturing errors

Time for action—capturing errors while calculating the age of a film

What just happened?

Using PDI error handling functionality

Aborting a transformation

Time for action—aborting when there are too many errors

What just happened?

Aborting a transformation using the Abort step

Fixing captured errors

Time for action—treating errors that may appear

What just happened?

Treating rows coming to the error stream

Pop quiz—PDI error handling

Have a go hero—capturing errors while seeing who wins

Avoiding unexpected errors by validating data

Time for action—validating genres with a Regex Evaluation step

What just happened?

Validating data

Time for action—checking films file with the Data Validator

What just happened?

Defining simple validation rules using the Data Validator

Have a go hero—validating the football matches file

Cleansing data

Have a go hero—cleansing films data

Summary

8. Working with Databases

Introducing the Steel Wheels sample database

Connecting to the Steel Wheels database

Time for action—creating a connection with the Steel Wheels database

What just happened?

Connecting with Relational Database Management Systems

Pop quiz—defining database connections

Have a go hero—connecting to your own databases

Exploring the Steel Wheels database

Time for action—exploring the sample database

What just happened?

A brief word about SQL

Exploring any configured database with the PDI Database explorer

Have a go hero—exploring the sample data in depth

Have a go hero—exploring your own databases

Querying a database

Time for action—getting data about shipped orders

What just happened?

Getting data from the database with the Table input step

Using the SELECT statement for generating a new dataset

Making flexible queries by using parameters

Time for action—getting orders in a range of dates by using parameters

What just happened?

Adding parameters to your queries

Making flexible queries by using Kettle variables

Time for action—getting orders in a range of dates by using variables

What just happened?

Using Kettle variables in your queries

Pop quiz—database datatypes versus PDI datatypes

Have a go hero—querying the sample data

Sending data to a database

Time for action—loading a table with a list of manufacturers

What just happened?

Inserting new data into a database table with the Table output step

Inserting or updating data by using other PDI steps

Time for action—inserting new products or updating existent ones

What just happened?

Time for action—testing the update of existing products

What just happened?

Inserting or updating data with the Insert/Update step

Have a go hero—populating a films database

Have a go hero—creating the time dimension

Have a go hero—populating the products table

Pop quiz—Insert/Update step versus Table Output/Update steps

Pop quiz—filtering the first 10 rows

Eliminating data from a database

Time for action—deleting data about discontinued items

What just happened?

Deleting records of a database table with the Delete step

Have a go hero—deleting old orders

Summary

9. Performing Advanced Operations with Databases

Preparing the environment

Time for action—populating the Jigsaw database

What just happened?

Exploring the Jigsaw database model

Looking up data in a database

Doing simple lookups

Time for action—using a Database lookup step to create a list of products to buy

What just happened?

Looking up values in a database with the Database lookup step

Have a go hero—preparing the delivery of the products

Have a go hero—refining the transformation

Doing complex lookups

Time for action—using a Database join step to create a list of suggested products to buy

What just happened?

Joining data from the database to the stream data by using a Database join step

Have a go hero—rebuilding the list of customers

Introducing dimensional modeling

Loading dimensions with data

Time for action—loading a region dimension with a Combination lookup/update step

What just happened?

Time for action—testing the transformation that loads the region dimension

What just happened?

Describing data with dimensions

Loading Type I SCD with a Combination lookup/update step

Have a go hero—adding regions to the Region Dimension

Have a go hero—loading the manufacturers dimension

Have a go hero—loading a mini-dimension

Keeping a history of changes

Time for action—keeping a history of product changes with the Dimension lookup/update step

What just happened?

Time for action—testing the transformation that keeps a history of product changes

What just happened?

Keeping an entire history of data with a Type II slowly changing dimension

Loading Type II SCDs with the Dimension lookup/update step

Have a go hero—keeping a history just for the theme of a product

Have a go hero—loading a Type II SCD dimension

Pop quiz—loading slowly changing dimensions

Pop quiz—loading type III slowly changing dimensions

Summary

10. Creating Basic Task Flows

Introducing PDI jobs

Time for action—creating a simple hello world job

What just happened?

Executing processes with PDI jobs

Using Spoon to design and run jobs

Using the transformation job entry

Pop quiz—defining PDI jobs

Have a go hero—loading the dimension tables

Receiving arguments and parameters in a job

Time for action—customizing the hello world file with arguments and parameters

What just happened?

Using named parameters in jobs

Have a go hero—backing up your work

Running jobs from a terminal window

Time for action—executing the hello world job from a terminal window

What just happened?

Have a go hero—experiencing Kitchen

Using named parameters and command-line arguments in transformations

Time for action—calling the hello world transformation with fixed arguments and parameters

What just happened?

Have a go hero—saying hello again and again

Have a go hero—loading the time dimension from a job

Deciding between the use of a command-line argument and a named parameter

Have a go hero—analysing the use of arguments and named parameters

Running job entries under conditions

Time for action—sending a sales report and warning the administrator if something is wrong

What just happened?

Changing the flow of execution on the basis of conditions

Have a go hero—refining the sales report

Creating and using a file results list

Have a go hero—sharing your work

Summary

11. Creating Advanced Transformations and Jobs

Enhancing your processes with the use of variables

Time for action—updating a file with news about examinations by setting a variable with the name of the file

What just happened?

Setting variables inside a transformation

Have a go hero—enhancing the examination tutorial even more

Have a go hero—enhancing the jigsaw database update process

Have a go hero—executing the proper jigsaw database update process

Enhancing the design of your processes

Time for action—generating files with top scores

What just happened?

Pop quiz—using the Add Sequence step

Reusing part of your transformations

Time for action—calculating the top scores with a subtransformation

What just happened?

Creating and using subtransformations

Have a go hero—refining the subtransformation

Have a go hero—counting words more precisely (second version)

Creating a job as a process flow

Time for action—splitting the generation of top scores by copying and getting rows

What just happened?

Transferring data between transformations by using the copy/get rows mechanism

Have a go hero—modifying the flow

Nesting jobs

Time for action—generating the files with top scores by nesting jobs

What just happened?

Running a job inside another job with a job entry

Understanding the scope of variables

Pop quiz—deciding the scope of variables

Iterating jobs and transformations

Time for action—generating custom files by executing a transformation for every input row

What just happened?

Executing for each row

Have a go hero—processing several files at once

Have a go hero—building lists of products to buy

Have a go hero—e-mail students to let them know how they did

Summary

12. Developing and Implementing a Simple Datamart

Exploring the sales datamart

Deciding the level of granularity

Loading the dimensions

Time for action—loading dimensions for the sales datamart

What just happened?

Extending the sales datamart model

Have a go hero—loading the dimensions for the puzzles star model

Loading a fact table with aggregated data

Time for action—loading the sales fact table by looking up dimensions

What just happened?

Getting the information from the source with SQL queries

Translating the business keys into surrogate keys

Obtaining the surrogate key for a Type I SCD

Obtaining the surrogate key for a Type II SCD

Obtaining the surrogate key for the Junk dimension

Obtaining the surrogate key for the Time dimension

Pop quiz—modifying a star model and loading the star with PDI

Have a go hero—loading a puzzles fact table

Getting facts and dimensions together

Time for action—loading the fact table using a range of dates obtained from the command line

What just happened?

Time for action—loading the sales star

What just happened?

Have a go hero—enhancing the loading process of the sales fact table

Have a go hero—loading the puzzles sales star

Have a go hero—loading the facts once a month

Getting rid of administrative tasks

Time for action—automating the loading of the sales datamart

What just happened?

Have a go hero—creating a backup of your work automatically

Have a go hero—enhancing the automated process by sending an e-mail if an error occurs

Summary

13. Taking it Further

PDI best practices

Getting the most out of PDI

Extending Kettle with plugins

Have a go hero—listing the top 10 students by using the Head plugin step

Overcoming real world risks with some remote execution

Scaling out to overcome bigger risks

Pop quiz—remote execution and clustering

Integrating PDI and the Pentaho BI suite

PDI as a process action

PDI as a datasource

More about the Pentaho suite

PDI Enterprise Edition and Kettle Developer Support

Summary

A. Working with Repositories

Creating a repository

Time for action—creating a PDI repository

What just happened?

Creating repositories to store your transformations and jobs

Working with the repository storage system

Time for action—logging into a repository

What just happened?

Logging into a repository by using credentials

Defining repository user accounts

Creating transformations and jobs in repository folders

Creating database connections, partitions, servers, and clusters

Backing up and restoring a repository

Examining and modifying the contents of a repository with the Repository explorer

Migrating from a file-based system to a repository-based system and vice-versa

Summary

B. Pan and Kitchen: Launching Transformations and Jobs from the Command Line

Running transformations and jobs stored in files

Running transformations and jobs from a repository

Specifying command line options

Checking the exit code

Providing options when running Pan and Kitchen

Log details

Named parameters

Arguments

Variables

C. Quick Reference: Steps and Job Entries

Transformation steps

Job entries

D. Spoon Shortcuts

General shortcuts

Designing transformations and jobs

Grids

Repositories

E. Introducing PDI 4 Features

Agile BI

Visual improvements for designing transformations and jobs

Experiencing the mouse-over assistance

Time for action—creating a hop with the mouse-over assistance

What just happened?

Using the mouse-over assistance toolbar

Experiencing the sniff-testing feature

Experiencing the job drill-down feature

Experiencing even more visual changes

Enterprise features

Summary

F. Pop quiz—Answers

Chapter 1

PDI data sources

PDI prerequisites

PDI basics

Chapter 2

formatting data

Chapter 3

concatenating strings

Chapter 4

data movement (copying and distributing)

splitting a stream

Chapter 5

finding the seven errors

Chapter 6

using Kettle variables inside transformations

Chapter 7

PDI error handling

Chapter 8

defining database connections

database datatypes versus PDI datatypes

Insert/Update step versus Table Output/Update steps

filtering the first 10 rows

Chapter 9

loading slowly changing dimensions

loading type III slowly changing dimensions

Chapter 10

defining PDI jobs

Chapter 11

using the Add sequence step

deciding the scope of variables

Chapter 12

modifying a star model and loading the star with PDI

Chapter 13

remote execution and clustering

Index
