Pentaho 3.2 Data Integration Beginner's Guide
Table of Contents
Pentaho 3.2 Data Integration
Credits
Foreword
The Kettle Project
About the Author
About the Reviewers
Preface
How to read this book
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
1. Getting Started with Pentaho Data Integration
Pentaho Data Integration and Pentaho BI Suite
Exploring the Pentaho Demo
Pentaho Data Integration
Using PDI in real world scenarios
Loading datawarehouses or datamarts
Integrating data
Data cleansing
Migrating information
Exporting data
Integrating PDI using Pentaho BI
Pop quiz—PDI data sources
Installing PDI
Time for action—installing PDI
What just happened?
Pop quiz—PDI prerequisites
Launching the PDI graphical designer: Spoon
Time for action—starting and customizing Spoon
What just happened?
Spoon
Setting preferences in the Options window
Storing transformations and jobs in a repository
Creating your first transformation
Time for action—creating a hello world transformation
What just happened?
Directing the Kettle engine with transformations
Exploring the Spoon interface
Viewing the transformation structure
Running and previewing the transformation
Time for action—running and previewing the hello_world transformation
What just happened?
Previewing the results in the Execution Results window
Pop quiz—PDI basics
Installing MySQL
Time for action—installing MySQL on Windows
What just happened?
Time for action—installing MySQL on Ubuntu
What just happened?
Summary
2. Getting Started with Transformations
Reading data from files
Time for action—reading results of football matches from files
What just happened?
Input files
Input steps
Reading several files at once
Time for action—reading all your files at a time using a single Text file input step
What just happened?
Time for action—reading all your files at a time using a single Text file input step and regular expressions
What just happened?
Regular expressions
Troubleshooting reading files
Grids
Have a go hero—explore your own files
Sending data to files
Time for action—sending the results of matches to a plain file
What just happened?
Output files
Output steps
Some data definitions
Rowset
Streams
The Select values step
Have a go hero—extending your transformations by writing output files
Getting system information
Time for action—updating a file with news about examinations
What just happened?
Getting information by using Get System Info step
Data types
Date fields
Numeric fields
Running transformations from a terminal window
Time for action—running the examination transformation from a terminal window
What just happened?
Have a go hero—using different date formats
Have a go hero—formatting 99.55
Pop quiz—formatting data
XML files
Time for action—getting data from an XML file with information about countries
What just happened?
What is XML
PDI transformation files
Getting data from XML files
XPath
Configuring the Get data from XML step
Kettle variables
How and when you can use variables
Have a go hero—exploring XML files
Have a go hero—enhancing the output countries file
Have a go hero—documenting your work
Summary
3. Basic Data Manipulation
Basic calculations
Time for action—reviewing examinations by using the Calculator step
What just happened?
Adding or modifying fields by using different PDI steps
The Calculator step
The Formula step
Time for action—reviewing examinations by using the Formula step
What just happened?
Have a go hero—listing students and their examinations results
Pop quiz—concatenating strings
Calculations on groups of rows
Time for action—calculating World Cup statistics by grouping data
What just happened?
Group by step
Have a go hero—calculating statistics for the examinations
Have a go hero—listing the languages spoken by country
Filtering
Time for action—counting frequent words by filtering
What just happened?
Filtering rows using the Filter rows step
Have a go hero—playing with filters
Have a go hero—counting words and discarding those that are commonly used
Looking up data
Time for action—finding out which language people speak
What just happened?
The Stream lookup step
Have a go hero—counting words more precisely
Summary
4. Controlling the Flow of Data
Splitting streams
Time for action—browsing new PDI features by copying a dataset
What just happened?
Copying rows
Have a go hero—recalculating statistics
Distributing rows
Time for action—assigning tasks by distributing
What just happened?
Pop quiz—data movement (copying and distributing)
Splitting the stream based on conditions
Time for action—assigning tasks by filtering priorities with the Filter rows step
What just happened?
PDI steps for splitting the stream based on conditions
Time for action—assigning tasks by filtering priorities with the Switch/Case step
What just happened?
Have a go hero—listing languages and countries
Pop quiz—splitting a stream
Merging streams
Time for action—gathering progress and merging all together
What just happened?
PDI options for merging streams
Time for action—giving priority to Bouchard by using Append Stream
What just happened?
Have a go hero—sorting and merging all tasks
Have a go hero—trying to find missing countries
Summary
5. Transforming Your Data with JavaScript Code and the JavaScript Step
Doing simple tasks with the JavaScript step
Time for action—calculating scores with JavaScript
What just happened?
Using the JavaScript language in PDI
Inserting JavaScript code using the Modified Java Script Value step
Adding fields
Modifying fields
Turning on the compatibility switch
Have a go hero—adding and modifying fields to the contest data
Testing your code
Time for action—testing the calculation of averages
What just happened?
Testing the script using the Test script button
Have a go hero—testing the new calculation of the average
Enriching the code
Time for action—calculating flexible scores by using variables
What just happened?
Using named parameters
Using the special Start, Main, and End scripts
Using transformation predefined constants
Pop quiz—finding the 7 errors
Have a go hero—keeping the top 10 performances
Have a go hero—calculating scores with Java code
Reading and parsing unstructured files
Time for action—changing a list of house descriptions with JavaScript
What just happened?
Looking at previous rows
Have a go hero—enhancing the houses file
Have a go hero—fill gaps in the contest file
Avoiding coding by using purpose-built steps
Have a go hero—creating alternative solutions
Summary
6. Transforming the Row Set
Converting rows to columns
Time for action—enhancing a films file by converting rows to columns
What just happened?
Converting row data to column data by using the Row denormalizer step
Have a go hero—houses revisited
Aggregating data with a Row denormalizer step
Time for action—calculating total scores by performances by country
What just happened?
Using Row denormalizer for aggregating data
Have a go hero—calculating scores by skill by continent
Normalizing data
Time for action—enhancing the matches file by normalizing the dataset
What just happened?
Modifying the dataset with a Row Normalizer step
Summarizing the PDI steps that operate on sets of rows
Have a go hero—verifying the benefits of normalization
Have a go hero—normalizing the Films file
Have a go hero—calculating scores by judge
Generating a custom time dimension dataset by using Kettle variables
Time for action—creating the time dimension dataset
What just happened?
Getting variables
Time for action—getting variables for setting the default starting date
What just happened?
Using the Get Variables step
Have a go hero—enhancing the time dimension
Pop quiz—using Kettle variables inside transformations
Summary
7. Validating Data and Handling Errors
Capturing errors
Time for action—capturing errors while calculating the age of a film
What just happened?
Using PDI error handling functionality
Aborting a transformation
Time for action—aborting when there are too many errors
What just happened?
Aborting a transformation using the Abort step
Fixing captured errors
Time for action—treating errors that may appear
What just happened?
Treating rows coming to the error stream
Pop quiz—PDI error handling
Have a go hero—capturing errors while seeing who wins
Avoiding unexpected errors by validating data
Time for action—validating genres with a Regex Evaluation step
What just happened?
Validating data
Time for action—checking films file with the Data Validator
What just happened?
Defining simple validation rules using the Data Validator
Have a go hero—validating the football matches file
Cleansing data
Have a go hero—cleansing films data
Summary
8. Working with Databases
Introducing the Steel Wheels sample database
Connecting to the Steel Wheels database
Time for action—creating a connection with the Steel Wheels database
What just happened?
Connecting with Relational Database Management Systems
Pop quiz—defining database connections
Have a go hero—connecting to your own databases
Exploring the Steel Wheels database
Time for action—exploring the sample database
What just happened?
A brief word about SQL
Exploring any configured database with the PDI Database explorer
Have a go hero—exploring the sample data in depth
Have a go hero—exploring your own databases
Querying a database
Time for action—getting data about shipped orders
What just happened?
Getting data from the database with the Table input step
Using the SELECT statement for generating a new dataset
Making flexible queries by using parameters
Time for action—getting orders in a range of dates by using parameters
What just happened?
Adding parameters to your queries
Making flexible queries by using Kettle variables
Time for action—getting orders in a range of dates by using variables
What just happened?
Using Kettle variables in your queries
Pop quiz—database datatypes versus PDI datatypes
Have a go hero—querying the sample data
Sending data to a database
Time for action—loading a table with a list of manufacturers
What just happened?
Inserting new data into a database table with the Table output step
Inserting or updating data by using other PDI steps
Time for action—inserting new products or updating existing ones
What just happened?
Time for action—testing the update of existing products
What just happened?
Inserting or updating data with the Insert/Update step
Have a go hero—populating a films database
Have a go hero—creating the time dimension
Have a go hero—populating the products table
Pop quiz—Insert/Update step versus Table Output/Update steps
Pop quiz—filtering the first 10 rows
Eliminating data from a database
Time for action—deleting data about discontinued items
What just happened?
Deleting records of a database table with the Delete step
Have a go hero—deleting old orders
Summary
9. Performing Advanced Operations with Databases
Preparing the environment
Time for action—populating the Jigsaw database
What just happened?
Exploring the Jigsaw database model
Looking up data in a database
Doing simple lookups
Time for action—using a Database lookup step to create a list of products to buy
What just happened?
Looking up values in a database with the Database lookup step
Have a go hero—preparing the delivery of the products
Have a go hero—refining the transformation
Doing complex lookups
Time for action—using a Database join step to create a list of suggested products to buy
What just happened?
Joining data from the database to the stream data by using a Database join step
Have a go hero—rebuilding the list of customers
Introducing dimensional modeling
Loading dimensions with data
Time for action—loading a region dimension with a Combination lookup/update step
What just happened?
Time for action—testing the transformation that loads the region dimension
What just happened?
Describing data with dimensions
Loading Type I SCD with a Combination lookup/update step
Have a go hero—adding regions to the Region Dimension
Have a go hero—loading the manufacturers dimension
Have a go hero—loading a mini-dimension
Keeping a history of changes
Time for action—keeping a history of product changes with the Dimension lookup/update step
What just happened?
Time for action—testing the transformation that keeps a history of product changes
What just happened?
Keeping an entire history of data with a Type II slowly changing dimension
Loading Type II SCDs with the Dimension lookup/update step
Have a go hero—keeping a history just for the theme of a product
Have a go hero—loading a Type II SCD dimension
Pop quiz—loading slowly changing dimensions
Pop quiz—loading type III slowly changing dimensions
Summary
10. Creating Basic Task Flows
Introducing PDI jobs
Time for action—creating a simple hello world job
What just happened?
Executing processes with PDI jobs
Using Spoon to design and run jobs
Using the transformation job entry
Pop quiz—defining PDI jobs
Have a go hero—loading the dimension tables
Receiving arguments and parameters in a job
Time for action—customizing the hello world file with arguments and parameters
What just happened?
Using named parameters in jobs
Have a go hero—backing up your work
Running jobs from a terminal window
Time for action—executing the hello world job from a terminal window
What just happened?
Have a go hero—experiencing Kitchen
Using named parameters and command-line arguments in transformations
Time for action—calling the hello world transformation with fixed arguments and parameters
What just happened?
Have a go hero—saying hello again and again
Have a go hero—loading the time dimension from a job
Deciding between the use of a command-line argument and a named parameter
Have a go hero—analysing the use of arguments and named parameters
Running job entries under conditions
Time for action—sending a sales report and warning the administrator if something is wrong
What just happened?
Changing the flow of execution on the basis of conditions
Have a go hero—refining the sales report
Creating and using a file results list
Have a go hero—sharing your work
Summary
11. Creating Advanced Transformations and Jobs
Enhancing your processes with the use of variables
Time for action—updating a file with news about examinations by setting a variable with the name of the file
What just happened?
Setting variables inside a transformation
Have a go hero—enhancing the examination tutorial even more
Have a go hero—enhancing the jigsaw database update process
Have a go hero—executing the proper jigsaw database update process
Enhancing the design of your processes
Time for action—generating files with top scores
What just happened?
Pop quiz—using the Add Sequence step
Reusing part of your transformations
Time for action—calculating the top scores with a subtransformation
What just happened?
Creating and using subtransformations
Have a go hero—refining the subtransformation
Have a go hero—counting words more precisely (second version)
Creating a job as a process flow
Time for action—splitting the generation of top scores by copying and getting rows
What just happened?
Transferring data between transformations by using the copy/get rows mechanism
Have a go hero—modifying the flow
Nesting jobs
Time for action—generating the files with top scores by nesting jobs
What just happened?
Running a job inside another job with a job entry
Understanding the scope of variables
Pop quiz—deciding the scope of variables
Iterating jobs and transformations
Time for action—generating custom files by executing a transformation for every input row
What just happened?
Executing for each row
Have a go hero—processing several files at once
Have a go hero—building lists of products to buy
Have a go hero—e-mail students to let them know how they did
Summary
12. Developing and Implementing a Simple Datamart
Exploring the sales datamart
Deciding the level of granularity
Loading the dimensions
Time for action—loading dimensions for the sales datamart
What just happened?
Extending the sales datamart model
Have a go hero—loading the dimensions for the puzzles star model
Loading a fact table with aggregated data
Time for action—loading the sales fact table by looking up dimensions
What just happened?
Getting the information from the source with SQL queries
Translating the business keys into surrogate keys
Obtaining the surrogate key for a Type I SCD
Obtaining the surrogate key for a Type II SCD
Obtaining the surrogate key for the Junk dimension
Obtaining the surrogate key for the Time dimension
Pop quiz—modifying a star model and loading the star with PDI
Have a go hero—loading a puzzles fact table
Getting facts and dimensions together
Time for action—loading the fact table using a range of dates obtained from the command line
What just happened?
Time for action—loading the sales star
What just happened?
Have a go hero—enhancing the loading process of the sales fact table
Have a go hero—loading the puzzles sales star
Have a go hero—loading the facts once a month
Getting rid of administrative tasks
Time for action—automating the loading of the sales datamart
What just happened?
Have a go hero—creating a backup of your work automatically
Have a go hero—enhancing the automated process by sending an e-mail if an error occurs
Summary
13. Taking it Further
PDI best practices
Getting the most out of PDI
Extending Kettle with plugins
Have a go hero—listing the top 10 students by using the Head plugin step
Overcoming real world risks with some remote execution
Scaling out to overcome bigger risks
Pop quiz—remote execution and clustering
Integrating PDI and the Pentaho BI suite
PDI as a process action
PDI as a datasource
More about the Pentaho suite
PDI Enterprise Edition and Kettle Developer Support
Summary
A. Working with Repositories
Creating a repository
Time for action—creating a PDI repository
What just happened?
Creating repositories to store your transformations and jobs
Working with the repository storage system
Time for action—logging into a repository
What just happened?
Logging into a repository by using credentials
Defining repository user accounts
Creating transformations and jobs in repository folders
Creating database connections, partitions, servers, and clusters
Backing up and restoring a repository
Examining and modifying the contents of a repository with the Repository explorer
Migrating from a file-based system to a repository-based system and vice-versa
Summary
B. Pan and Kitchen: Launching Transformations and Jobs from the Command Line
Running transformations and jobs stored in files
Running transformations and jobs from a repository
Specifying command line options
Checking the exit code
Providing options when running Pan and Kitchen
Log details
Named parameters
Arguments
Variables
C. Quick Reference: Steps and Job Entries
Transformation steps
Job entries
D. Spoon Shortcuts
General shortcuts
Designing transformations and jobs
Grids
Repositories
E. Introducing PDI 4 Features
Agile BI
Visual improvements for designing transformations and jobs
Experiencing the mouse-over assistance
Time for action—creating a hop with the mouse-over assistance
What just happened?
Using the mouse-over assistance toolbar
Experiencing the sniff-testing feature
Experiencing the job drill-down feature
Experiencing even more visual changes
Enterprise features
Summary
F. Pop quiz—Answers
Chapter 1
PDI data sources
PDI prerequisites
PDI basics
Chapter 2
formatting data
Chapter 3
concatenating strings
Chapter 4
data movement (copying and distributing)
splitting a stream
Chapter 5
finding the 7 errors
Chapter 6
using Kettle variables inside transformations
Chapter 7
PDI error handling
Chapter 8
defining database connections
database datatypes versus PDI datatypes
Insert/Update step versus Table Output/Update steps
filtering the first 10 rows
Chapter 9
loading slowly changing dimensions
loading type III slowly changing dimensions
Chapter 10
defining PDI jobs
Chapter 11
using the Add sequence step
deciding the scope of variables
Chapter 12
modifying a star model and loading the star with PDI
Chapter 13
remote execution and clustering
Index