Thursday, March 3, 2011

4 Steps to Loading Data Marts with PDI

You have been asked to populate a new data mart, now what? Even if you are familiar with building ETL with other tools, you will want to take some time to properly design your solution before biting off more than you can chew with Spoon. In this paper, we will walk through the design steps to build your data mart using Pentaho’s Data Integration (PDI) tools. The 4 steps are as follows:
  • Identify Transformations and Sequence
  • Choose Update Strategy
  • Define Source SQL
  • Identify Steps From Source to Target
Who’s on First?
In PDI, the place to design and build your Extract, Transform, and Load (ETL) processes is Spoon. Within Spoon, you can create Jobs and Transformations. Jobs are used to sequence and orchestrate your transformations. A transformation reads data from an input, performs processing, and writes data to one or more outputs. Your first task is to identify the transformations that you will need and the order in which they will be called.
So how many transformations do you need for ETL processing?

Here is the key to determining the right number. Review your mapping documentation to see how many combinations of source systems and target tables you have. For each combination of source system and target table, create one transformation. That's it, it's that simple! You can often perform all of the processing that you need to populate a target table in one transformation.

Now that you have identified all of your transformations, how many jobs do you need? A similar philosophy applies here. For each combination of source and target system, create one job. If your Data Mart is sourced from one system, it can often be loaded in one job.

So what is the sequence of the transformations within your job? The answer here lies in the Referential Integrity enforced or implied at the target. You will want to run the target table transformations that do not have any dependencies first. Next, you can run the target table transformations that were dependent on those tables, and so on. If you are creating one fact in a star schema data mart, you can load all of the dimensions first and then load the fact table. If you are creating multiple facts with conformed dimensions, create all of the conformed dimensions first, followed by the non-conformed dimensions and finally the fact tables.
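
As a rough illustration, here is what that inventory and ordering might look like for a simple star schema, sketched in Python. The source system, table, and transformation names are invented for the example; in practice this lives in your mapping document and in the job itself, not in code.

```python
# Rough sketch: one transformation per (source system, target table) combination,
# sequenced so that tables with no dependencies load first. All names are invented.
dependencies = {
    "dim_date":     [],
    "dim_customer": [],
    "dim_product":  [],
    "fact_sales":   ["dim_date", "dim_customer", "dim_product"],
}

def load_order(deps):
    """Order target tables so referential integrity is respected (assumes no cycles)."""
    ordered, loaded = [], set()
    while len(ordered) < len(deps):
        for table, parents in deps.items():
            if table not in loaded and all(p in loaded for p in parents):
                ordered.append(table)
                loaded.add(table)
    return ordered

# One job for the (ERP -> data mart) combination, calling one transformation per table.
for table in load_order(dependencies):
    print(f"Run transformation: erp_to_{table}")
```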

Key to Your Future
You now know how many jobs and transformations you will have and in what order they will be executed. For each target or output, you now need to determine how you will send updates from the source system to your data mart. It is tempting to always update the information in the target tables. However, if your business users want to see historical changes, this may not be appropriate.

In updating your data mart, you add two types of data: facts and dimensions. New facts are typically inserted into the target table. Updates to dimensions are handled in other ways. Fortunately, you have several options available to you in PDI.

  • Table Output – This style of output is appropriate for creating new records in a database. It is assumed that old data is not in the target and is not dealt with by this step. This is most commonly used for fact table loads. Table output is fast and has parameters to further tune its performance to load large volumes of fact transactions.
  • Update – As the opposite of Table Output, the Update step only performs updates to existing data. This may be appropriate if your dimensions primarily come from one source, but updates can be provided by another source system.
  • Insert/Update – A combination of the two above, this step allows you to insert or update a database table based on keys from the source. This is commonly used for dimensions where you only need the latest and greatest information for your dimension properties. It assumes you have a unique key that can identify one record for update processing. If the key is present in the target table, an update is performed; if the key is not present, an insert is performed (see the sketch after this list).
  • Dimension Lookup / Update – This step is used to process Slowly Changing Dimensions (SCD). The most popular type of SCD is Type 2, where you insert a new record every time there is a change to your business data. Within this step, you can specify the fields to use as your business key and the fields to compare to see if there is a change within the business data.
  • Delete – While most data marts do not delete information, preferring instead to mark it as inactive, there are times when you do need to delete information from a database table. This single-purpose step is right for that case.
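To make the difference between Insert/Update and a Type 2 slowly changing dimension concrete, here is a small Python sketch of the underlying logic. It runs against an in-memory structure with made-up keys and attributes and only illustrates the decision each strategy makes; it is not how PDI implements these steps internally.

```python
# Illustrative only: plain Python standing in for the Insert/Update and
# Dimension Lookup/Update (Type 2) strategies. All keys and attributes are made up.
from datetime import date

customer_dim = {}   # business key -> list of row versions

def insert_update(business_key, attributes):
    """Insert/Update: keep only the latest values for each business key (no history)."""
    if business_key in customer_dim:
        customer_dim[business_key][-1].update(attributes)     # overwrite in place
    else:
        customer_dim[business_key] = [dict(attributes, version=1)]

def scd_type2(business_key, attributes):
    """Dimension Lookup/Update (Type 2): insert a new version when the data changes."""
    versions = customer_dim.get(business_key)
    if versions is None:
        customer_dim[business_key] = [dict(attributes, version=1, valid_from=date.today())]
        return
    current = versions[-1]
    if any(current.get(k) != v for k, v in attributes.items()):
        versions.append(dict(attributes, version=current["version"] + 1,
                             valid_from=date.today()))

insert_update("CUST-001", {"city": "Chicago"})
insert_update("CUST-001", {"city": "Denver"})     # old value overwritten, no history kept

scd_type2("CUST-002", {"city": "Boston"})
scd_type2("CUST-002", {"city": "Boston"})         # no change -> nothing inserted
scd_type2("CUST-002", {"city": "Austin"})         # change -> second version inserted
print(customer_dim)
```

The point is simply that Insert/Update keeps one current row per business key, while Type 2 preserves history by adding versions.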
As noted in many of these output options, it is necessary to have a key to determine how to process the data. The key is normally a business key that uniquely identifies each row based on actual data. Conversely, many data warehouses and marts have surrogate keys. Surrogate keys uniquely identify the record in the database without using source data. The surrogate key is often a random number or is generated via a sequence. It is often defined as the primary key on the database table and is not used to determine uniqueness of business data.
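
Here is a tiny sketch of that distinction, again with hypothetical names: the surrogate key comes from a sequence and serves as the primary key, while matching for updates is always done on the business key.

```python
# Hypothetical customer dimension row: the surrogate key comes from a sequence and
# is the table's primary key; the business key comes from the source system.
from itertools import count

surrogate_seq = count(start=1)      # stands in for a database sequence
dim_rows = {}                       # business key -> current dimension row

def add_dimension_row(business_key, attributes):
    row = {"customer_sk": next(surrogate_seq),   # surrogate key (primary key)
           "customer_number": business_key,      # business key from the source
           **attributes}
    dim_rows[business_key] = row                 # matching is done on the business key
    return row

add_dimension_row("CUST-001", {"name": "Acme Co."})
print(dim_rows["CUST-001"])   # {'customer_sk': 1, 'customer_number': 'CUST-001', 'name': 'Acme Co.'}
```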

Choose Wisely
You have now broken down your ETL into multiple transformations and determined how you are going to load your data into those target tables. Next, you need to determine how to select the data from the source system most efficiently while supporting the target update strategy.
The starting point is to determine the grain of data for each of your transformations. The grain means the level of detail for each row selected. For example, when loading a Customer dimension, one record will equate to one Customer. For your dimensions, it will be the business key that you determined in the last step. For facts, it is often the intersection of all dimension surrogate keys or a transactional key.

For purposes of your PDI inputs, you should limit the input query to only those tables that establish the key for each row. It is tempting to place joins to all of the tables you need in your source SQL query. However, this may result in "missing" values later on in the process. If a row is excluded during the join process, it cannot be inspected, logged, or sent to an error file.
For the reasons above, it is best to keep the SQL select to a minimum and make heavier use of lookups later in the transformation.
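
As a rough illustration, compare a wide extract that joins everything at the source with a narrow extract that stays at the grain of the target and leaves the descriptive attributes for lookup steps. The SQL, schema, and column names below are hypothetical.

```python
# Hypothetical source queries for a fact_sales load; the grain is one row per order line.

# Tempting but risky: join everything at the source. A row that fails a join is
# dropped before PDI ever sees it, so it cannot be inspected, logged, or errored.
WIDE_SQL = """
SELECT o.order_id, o.line_number, o.order_date, c.customer_name, p.product_name, o.amount
FROM sales.order_lines o
JOIN sales.customers   c ON c.customer_id = o.customer_id
JOIN sales.products    p ON p.product_id  = o.product_id
"""

# Preferred: select only the table that establishes the grain, plus the keys
# needed for lookups later in the transformation.
NARROW_SQL = """
SELECT order_id, line_number, order_date, customer_id, product_id, amount
FROM sales.order_lines
WHERE order_date >= :last_load_date
"""
```

With the narrow query, a customer or product that fails to resolve still arrives in the stream, where it can be logged or sent down an error hop.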

Steps and Hops, Time to Dance!
Last, but not least, you need to identify the PDI steps that transform your source data into the target content and format. This is actually the meat of the design and development time. Although this is the largest chunk of development time, you should consider the other three elements prior to diving straight into this step.

In PDI, each logical operation that you perform on the data is separated into a step. You connect each step with hops. The only question is which steps are appropriate for your needs. It is fairly unusual, and generally bad practice, to go straight from an input step to an output step. You will therefore need one or more of the following types of transformation steps:

  • Joins – Joins are provided to take one data set and add the columns of another data set to the original. Joins are appropriate when you have disparate types of data and the source cannot perform the join for you. They should be avoided when you are joining two tables from the same database.
  • Lookups – There are several types of lookups provided within PDI for your use. All follow the same philosophy: take rows of your in-stream data set and add column values from another data set. This has numerous advantages over joining everything in the database. For starters, lookups allow you to take action based on the lookup results. For example, if your sales transactions refer to a product code that does not exist in your product dimension, you will likely want to log this, error the row, and/or notify the administrator (see the sketch after this list).
  • Transforms – PDI transforms come in all shapes and sizes, but they are generally about massaging your columns of data. The list of uses, and of steps to satisfy those uses, is quite long; another paper could be written on this topic alone.
  • Scripting – Although PDI provides you with many steps to build your transformation, when all else fails, you may need to break down and write some code! For this need, PDI provides scripting steps.
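
To tie the lookup idea together, here is a small Python sketch of the lookup-with-error-handling pattern described above: resolve a product code to its surrogate key and route unknown codes to an error stream. The dimension contents and names are made up, and the code only stands in for what a PDI lookup step and error hop would do.

```python
# Stand-in for a stream lookup with an error hop; dimension contents are made up.
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("fact_sales_load")

product_dim = {"PROD-100": 501, "PROD-200": 502}   # product code -> surrogate key

def resolve_product(row, error_rows):
    """Add the product surrogate key, or route the row to the error stream."""
    surrogate_key = product_dim.get(row["product_code"])
    if surrogate_key is None:
        log.warning("Unknown product code %s on transaction %s",
                    row["product_code"], row["transaction_id"])
        error_rows.append(row)          # could also be written to an error file
        return None                     # drop from the main stream
    row["product_sk"] = surrogate_key
    return row

rows = [{"transaction_id": 1, "product_code": "PROD-100"},
        {"transaction_id": 2, "product_code": "PROD-999"}]

good, errors = [], []
for row in rows:
    resolved = resolve_product(row, errors)
    if resolved is not None:
        good.append(resolved)

print(good)     # one resolved fact row
print(errors)   # one row routed to the error stream
```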

Wrap Up
Hopefully this gives you a good overview to get started designing your Pentaho Data Integration loads. Feel free to let me know if you have other flavors of designing your ETL.

Wednesday, January 19, 2011

PLA Recognized as Pentaho 2011 SI Partner of the Year!

I was shocked today at the 2011 Global Partner Summit to learn that Project Leadership Associates won the 2011 SI Partner of the Year award. I couldn't be more excited about receiving this award from Pentaho! I am not sure if it was the implementation services, community contributions, Data to Dashboards seminars, hosted Pentaho, or all of the above. It does feel good.

Tuesday, March 23, 2010

Simple Pentaho Analyzer videos

I was recently asked by a client, "How do I add report totals and subtotals to my Analyzer report?" I looked around the documentation, knowledge base, and wiki. I did not see any straightforward examples, so I decided to answer this and other questions with videos.

If pictures are worth 1,000 words, how many words are videos worth? Anyway, I wanted to create short and simple videos that convey one or two points. There are other, more comprehensive videos out there about Pentaho Analyzer, but these are short and sweet.

Getting Started
This is really bare bones and demonstrates how easy it is to get going.


Totals and Subtotals
Adding report-level totals is not intuitive, but easy just the same.

Tuesday, April 21, 2009

Reporting Architectures Take Two

When I wrote this whitepaper, I thought I was already a bit behind the times. Certainly after all of the years of building data warehouses and data marts, people would not go back to reporting off of replicated databases, would they? As it turns out, yes they do. To my amazement, seven years later my clients still struggle with the same issues.

So how can this happen? As far as I can tell, the answer seems to be organic growth. What started out as one thing grew into something much bigger. In a way, that is a good thing! If your systems do not grow and expand, you are probably going out of business.

Does that mean that we should continue to build reporting solutions directly off of our transaction systems? When that no longer works, build a replication system and use that? When that no longer works, build a data mart? When that no longer works, build a data warehouse? The way I see it, if you are still in business and you are moving up the scale, you are on the right path.

One obvious danger that I have recently witnessed is over-engineering a reporting system. Can you be all things to all people? I doubt it. What ends up happening is that you do most things OK and some poorly, and the people who depend on your solutions get frustrated and end up looking elsewhere for their data.

While I continue to advise my clients to architect a scalable solution, you must start small. When talking to a client the other day, I advised "Think about as much of the problem space as you can for requirements and architecture. When it comes time to implement something, do as little as you can and still be successful." This naturally translates into quicker cycle time and more cycles. The benefit of this approach is that you continue to demonstrate reliability in achieving your goals and objectives. And who would not come back to someone who continuously achieves what they said they would?

One key that I have found to be rather successful in starting small is the Logical Data Model. The Logical Data Model can take in all of those requirements and hang them together for the big picture. That is your data blueprint. Print out big copies and put them everywhere so that people know what you are building and what it will look like when it is done. Then go build something and create a Physical Data Model scoped to only what you need at the time. This says that you have an eye on the future, but are focused on the present delivery.

So go out there and deliver something! Anything!!

Wednesday, April 15, 2009

Adding Content to the ER/Studio Enterprise Portal with Pentaho Tools

At the Embarcadero Datarage last month, I provided a quick demonstration on how you can add content to the ER/Studio Enterprise Portal. Given that I helped create the v1.0 portal, it was easy to show how to add on.

Basically, it is just a matter of creating additional Pentaho content.
  • Use the Adhoc wizard
  • Use the stand-alone Report Wizard
  • Use Report Designer and Pentaho Design Studio for prompted reports
Check it out! You may have to register with Embarcadero Developers Network to view the content.

There are a ton of other great presentations from other Embarcadero Partners.

Andy

Sunday, April 5, 2009

Pentaho Partner Summit 2009

I spent the latter part of last week in beautiful Menlo Park, CA at Pentaho's Partner Summit. In addition to all of the "life is good" talk, there were some really exciting presentations.

We heard from Brian Senseman about his experience deploying Pentaho in the Cloud.

We also heard from Larry Augustin about the health of the open source business model. Given the source, that is not a surprising perspective. However, it was backed up with several sources of facts. The message was that open source software like Pentaho is here to stay.

We saw the preview of Pentaho's Dashboard Designer where a business user can create a dashboard within 90 seconds. Now that is impressive.

The moral of the story is that Pentaho's Business Intelligence Platform has had an impressive past and is poised for an even more exciting future. I, for one, am thrilled to be a part of it all.