Thursday, March 3, 2011

4 Steps to Loading Data Marts with PDI

You have been asked to populate a new data mart, now what? Even if you are familiar with building ETL with other tools, you will want to take some time to properly design your solution before biting off more than you can chew with Spoon. In this paper, we will walk through the design steps to build your data mart using Pentaho’s Data Integration (PDI) tools. The 4 steps are as follows:
  • Identify Transformations and Sequence
  • Choose Update Strategy
  • Define Source SQL
  • Identify Steps From Source to Target
Who’s on First?
In PDI, the place to design and build your Extract, Transform, and Load (ETL) processes is Spoon. Within Spoon, you can create Jobs and Transformations. Jobs are used to sequence and orchestrate your transformations. A transformation reads data from an input, performs processing, and writes data to one or more outputs. Your first task is to identify the transformations that you will need and the order in which they will be called.
So how many transformations do you need for ETL processing?

Here is the key to determining the right number. Review your mapping documentation to see how many combinations of source systems and target tables you have. For each combination of source system and target table, create one transformation. That’s it, it’s that simple! You can often perform all of the processing that you need to populate a target table in one transformation.

Now that you have identified all of your transformations, how many jobs do you need? A similar philosophy applies here. For each combination of source and target system, create one job. If your Data Mart is sourced from one system, it can often be loaded in one job.

So what is the sequence of the transformations within your job? The answer lies in the Referential Integrity enforced or implied at the target. You will want to run the target table transformations that do not have any dependencies first. Next, you can run the target table transformations that depend on those tables, and so on. If you are loading a single fact table in a star schema data mart, you can load all of the dimensions first and then load the fact table. If you are loading multiple fact tables with conformed dimensions, load all of the conformed dimensions first, followed by the non-conformed dimensions and finally the fact tables.
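
To make the ordering concrete, here is a small JavaScript sketch that derives a load order from a table-dependency map. The table names and dependencies are hypothetical, and in a real project you express this ordering simply by how you chain job entries in Spoon; the sketch only illustrates the reasoning.

```javascript
// Hypothetical target tables and the tables each one depends on.
const dependsOn = {
  dim_date: [],
  dim_customer: [],
  dim_product: [],
  fact_sales: ['dim_date', 'dim_customer', 'dim_product'],
};

// Repeatedly pick every table whose dependencies have already been loaded.
function loadOrder(deps) {
  const loaded = new Set();
  const order = [];
  while (order.length < Object.keys(deps).length) {
    const ready = Object.keys(deps).filter(
      (table) => !loaded.has(table) && deps[table].every((d) => loaded.has(d))
    );
    if (ready.length === 0) throw new Error('Circular dependency between tables');
    ready.forEach((table) => { loaded.add(table); order.push(table); });
  }
  return order;
}

console.log(loadOrder(dependsOn));
// => [ 'dim_date', 'dim_customer', 'dim_product', 'fact_sales' ]
```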

Key to Your Future
You now know how many jobs and transformations you will have and in what order they will be executed. For each target or output, you now need to determine how you will send updates from the source system to your data mart. It is tempting to always update the information in the target tables. However, if your business users want to see historical changes, this may not be appropriate.

In updating your Data Mart, you add two types of data: facts and dimensions. New facts can be, and often are, inserted into the target table. Updates to dimensions are handled in other ways. Fortunately, you have several options available to you in PDI.

  • Table Output – This style of output is appropriate for creating new records in a database. It is assumed that old data is not in the target and is not dealt with by this step. This is most commonly used for fact table loads. Table Output is fast and has parameters to further tune its performance for loading large volumes of fact transactions.
  • Update – The opposite of Table Output, the Update step only performs updates to existing data. This may be appropriate if your dimensions primarily come from one source, but updates can be provided by another source system.
  • Insert/Update – A combination of the two above, this step allows you to insert or update a database table based on keys from the source. This is commonly used for dimensions where you only need the latest, greatest information for your dimension properties. It assumes you have a unique key that can identify one record for update processing. If the key is present in the target table, an update is performed. If the key is not present, an insert is performed.
  • Dimension Lookup / Update – This step is used to process Slowly Changing Dimensions (SCD). The most popular type of SCD is Type 2, where you insert a new record every time there is a change to your business data. Within this step, you can specify the fields to use as your business key and the fields to compare to see if there is a change within the business data.
  • Delete – Most data marts do not delete information, but instead mark it as inactive. Occasionally, though, there is a need to delete information from a database table, and this single-purpose step is right for that case.
As noted in many of these output options, it is necessary to have a key to determine how to process the data. The key is normally a business key that uniquely identifies each row based on actual data. Conversely, many data warehouses and marts have surrogate keys. Surrogate keys uniquely identify the record in the database without using source data. The surrogate key is often a random number or is generated via a sequence. The surrogate key is often defined as the primary key on the database table and is not used to determine uniqueness of business data.
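
To tie these two ideas together, here is a minimal JavaScript sketch of the decision an Insert/Update-style step makes, using an in-memory map as a stand-in for a customer dimension table. The field names and the counter-based surrogate key are purely illustrative; in PDI the step performs this check against the real target table for you.

```javascript
// Stand-in for a customer dimension, keyed by the business key (customer_code).
const dimension = new Map();
let nextSurrogateKey = 1; // stand-in for a database sequence

// Mimics an Insert/Update decision: update when the business key already
// exists in the target, insert (with a new surrogate key) when it does not.
function insertOrUpdate(row) {
  const existing = dimension.get(row.customer_code);
  if (existing) {
    // Business key found: refresh the descriptive attributes, keep the surrogate key.
    dimension.set(row.customer_code, { ...existing, ...row });
    return 'update';
  }
  dimension.set(row.customer_code, { customer_key: nextSurrogateKey++, ...row });
  return 'insert';
}

console.log(insertOrUpdate({ customer_code: 'C-100', name: 'Acme Corp', city: 'Chicago' })); // insert
console.log(insertOrUpdate({ customer_code: 'C-100', name: 'Acme Corp', city: 'Boston' }));  // update
console.log(dimension.get('C-100')); // surrogate key 1 is preserved; city is now Boston
```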

Choose Wisely
You have now broken your ETL down into multiple transformations and determined how you are going to load your data into those target tables. Next, you need to determine how to select the data from the source system most efficiently and in a way that supports the target update strategy.
The starting point is to determine the grain of data for each of your transformations. The grain is the level of detail represented by each row selected. For example, when loading a Customer dimension, one record will equate to one Customer. For your dimensions, the grain will be the business key that you determined in the last step. For facts, it is often the intersection of all dimension surrogate keys or a transactional key.

For purposes of your PDI inputs, you should limit the input query to only those tables that establish the key for each row. It is tempting to place joins to all of the tables you need in your source SQL query. However, this may result in “missing” values later on in the process. If a row is excluded during the join process, it cannot be inspected, logged, or sent to an error file.
For the reasons described above, it is best to keep the SQL select to a minimum and make heavier use of lookups later in the transformation.
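
The following JavaScript sketch illustrates the point. The table contents and field names are hypothetical; what matters is that a row failing a lookup is still in your hands to log or reroute, whereas a row dropped by an inner join in the source SQL never reaches the transformation at all.

```javascript
// Hypothetical source rows, selected at the grain of the fact (one row per order line).
const sourceRows = [
  { order_id: 1, product_code: 'SKU-1', amount: 25.0 },
  { order_id: 2, product_code: 'SKU-X', amount: 40.0 }, // product code not in the dimension
];

// Hypothetical product dimension, keyed by its business key.
const productLookup = { 'SKU-1': { product_key: 101 } };

const factRows = [];
const errorRows = [];

// Lookup-style enrichment: a failed lookup is routed to an error stream
// instead of silently disappearing, as it would with an inner join.
sourceRows.forEach((row) => {
  const match = productLookup[row.product_code];
  if (match) {
    factRows.push({ ...row, product_key: match.product_key });
  } else {
    errorRows.push({ ...row, error: 'Unknown product_code' });
  }
});

console.log(factRows);  // one enriched row, ready for the fact table
console.log(errorRows); // one row that can be inspected, logged, or sent to an error file
```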

Steps and Hops, Time to Dance!
Last, but not least, you need to identify the PDI steps that transform your source data into the target content and format. This is actually the meat of the design and development time. Although this is the largest chunk of development time, you should consider the other three elements before diving straight into this step.

In PDI, each logical operation that you perform on the data is separated into a step. You connect the steps with hops. The only question is: which steps are appropriate for your needs? It is rare, and usually bad practice, to go straight from an input step to an output step. You will therefore need one or more of the following types of transformation steps:

  • Joins – Joins take one data set and add the columns of another data set to it. Joins are appropriate when you have disparate types of data where the source cannot perform the join for you. They should be avoided when you are joining two tables from the same database.
  • Lookups – There are several types of lookups provided within PDI for your use. All follow the same philosophy: take rows of your in-stream data set and add column values from another data set. This has numerous advantages over joining everything in the database. For starters, lookups allow you to take action based on the lookup results. For example, if your sales transactions refer to a product code that does not exist in your product dimension, you will likely want to log it, error the row, and/or notify the administrator.
  • Transforms – PDI transforms come in all shapes and sizes, but they generally massage your columns of data. The list of uses, and of steps to satisfy those uses, is quite long; another paper could be written on this topic alone.
  • Scripting – Although PDI provides you with many steps to build your transformation, when all else fails, you may need to break down and write some code! For this need, there are scripting steps that let you apply your own row-level logic (see the sketch after this list).
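
To give a flavor of the last two bullets, here is a small JavaScript sketch of the kind of row-level massaging you might express in one of PDI's scripting steps. The field names are hypothetical, and inside an actual scripting step the incoming fields are exposed to you by the step itself, so treat this purely as an illustration of the logic.

```javascript
// Illustrative row-level massaging: clean up and derive fields on each row.
function transformRow(row) {
  return {
    ...row,
    // Standardize names and build a display field.
    customer_name: row.first_name.trim() + ' ' + row.last_name.trim().toUpperCase(),
    // Derive a measure from two source measures.
    margin: +(row.revenue - row.cost).toFixed(2),
    // Normalize an inconsistent status code.
    status: (row.status || 'UNKNOWN').toUpperCase(),
  };
}

console.log(transformRow({
  first_name: ' Jane ',
  last_name: 'doe',
  revenue: 120.5,
  cost: 80.25,
  status: 'active',
}));
// => { first_name: ' Jane ', ..., customer_name: 'Jane DOE', margin: 40.25, status: 'ACTIVE' }
```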

Wrap Up
Hopefully this gives you a good overview to get started designing your Pentaho Data Integration loads. Feel free to let me know if you have other flavors of designing your ETL.

Wednesday, January 19, 2011

PLA Recognized as Pentaho 2011 SI Partner of the Year!

I was shocked today at the 2011 Global Partner Summit to learn that Project Leadership Associates won the 2011 SI Partner of the year award. I couldn't be more excited about receiving this award today from Pentaho! I am not sure if it was the implementation services, community contributions, Data to Dashboards seminars, hosted Pentaho or all of the above. It does feel good.

Tuesday, December 14, 2010

Pentaho JavaScript API presentation

On this page, there is the video recording of my presentation to the Pentaho Technical group. Code is also over here.

Friday, April 9, 2010

Pentaho Solution Repository JavaScript API

Based on a previous post, I have created a JavaScript API and wrapper for the Pentaho Solution Repository.

System Integrators or others who want to integrate Pentaho into their web applications can use only JavaScript to communicate with the Pentaho BI Platform. No need to parse SOAP responses or otherwise deal with XML at all. It is all JavaScript!

Get the source here.

Enjoy!

Monday, March 29, 2010

Pentaho Analyzer - Calculations

In this short video, I demonstrate how to create new calculations that were not provided with Pentaho Analyzer.

Tuesday, March 23, 2010

Simple Pentaho Analyzer videos

I was recently asked by a client, "How do I add report totals and subtotals to my Analyzer report"? I looked around the documentation, knowledge base and wiki. I did not see any straightforward examples, so I decided to answer this and other questions with videos.

If pictures are worth 1,000 words, how many words are videos worth? Anyway, I wanted to create short and simple videos that convey one or two points. There are other, more comprehensive videos out there about Pentaho Analyzer, but these are short and sweet.

Getting Started
This is really bare bones and demonstrates how easy it is to get going.


Totals and Subtotals
Adding report-level totals is not intuitive, but easy just the same.

Monday, January 25, 2010

Pentaho Javascript API

For a Systems Integrator, implementing the Pentaho BI platform is relatively straightforward. However, when you want to customize the User Interface and hook directly into the Pentaho Web Service APIs, you need to wade through a lot of code and a little documentation. Additionally, the format of most APIs is XML. Web developers like parsing XML about as much as they like having a root canal! I have had to do much of this parsing over the last few years and have started to formulate a way out. I believe that web developers would much rather deal with JSON than XML, and use JavaScript objects and methods rather than raw XMLHttpRequest. This has taken the shape of a JavaScript API for Pentaho. The intent is that it is library agnostic; the front-end could be YUI, jQuery, dojo, Closure, or any other library.
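
As a rough sketch of the pattern (the service path, element names, and function name below are placeholders, not the actual Pentaho API), the idea is to hide the XMLHttpRequest and XML parsing inside one small wrapper and hand the caller plain JavaScript objects:

```javascript
// Hypothetical wrapper: hides the XMLHttpRequest and XML parsing from the caller.
// The service path and XML element names are placeholders, not the real Pentaho API.
function listSolutionFiles(baseUrl, callback) {
  const xhr = new XMLHttpRequest();
  xhr.open('GET', baseUrl + '/SomeRepositoryService?action=list', true);
  xhr.onload = function () {
    // Convert the XML payload into plain JavaScript objects once, in one place.
    const files = Array.prototype.map.call(
      xhr.responseXML.getElementsByTagName('file'),
      function (node) {
        return { name: node.getAttribute('name'), path: node.getAttribute('path') };
      }
    );
    callback(files); // the caller only ever sees JSON-style objects
  };
  xhr.send();
}

// Front-end code stays free of XML, whatever UI library it is built with:
// listSolutionFiles('http://localhost:8080/pentaho', function (files) {
//   files.forEach(function (f) { console.log(f.name, f.path); });
// });
```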

On this wiki page, I describe the types of things I would like the API to perform. Some of these are already available as Pentaho Servlets or JSPs, while others require more ingenuity. In the end, a front-end web developer should not need to write Java or parse XML to interact with the Pentaho BI Platform.

The current plan is to implement the API in stages. First, I have created the components for interacting with Pentaho Metadata by issuing MQL queries here. Next up is Action Sequence interaction, and then I will leverage Roland's XMLA4JS for OLAP.