- Identify Transformations and Sequence
- Choose Update Strategy
- Define Source SQL
- Identify Steps From Source to Target
In PDI, the place to design and build your Extract Transform and Load (ETL) is in Spoon. Within Spoon, you can create Jobs and Transformations. Jobs are used to sequence and orchestrate your transformations. A transformation reads data from an input, perform processing and write data to one or more outputs. Your first task is to identify the transformations that you will need and in which order they will be called.
So how many transformations do you need for ETL processing?
Here is the key to determining to right number. Review your mapping documentation to see how many combinations of source systems and target tables you have. For each combination of source system and target table, create one transformation. That’s it, it’s that simple! You can often perform all of the processing that you need to populate a target table in one transformation.
Now that you have identified all of your transformations, how many jobs do you need? A similar philosophy applies here. For each combination of source and target system, create one job. If your Data Mart is sourced from one system, it can often be loaded in one job.
So what is the sequence of the transformations within your job? The answer here lies in the Referential Integrity enforced or implied at the target. You will want to run the target table transformations that do not have any dependencies first. Next, you can run the target table transformations that were dependent on those tables, and so on. If you are creating one fact in a star schema data mart, you can load all of the dimensions first and then load the fact table. If you are creating multiple facts with conformed dimensions, create all of the conformed dimensions first, followed by the non-conformed dimensions and finally the fact tables.
Key to Your Future
You now know how many jobs and transformations that we will have and in what order they will be executed. For each target or output, you now need to determine how you will send updates from the source system to your data mart. It is tempting to always update the information in the target tables. However, if your business users want to see historical changes, this may not be appropriate.
In updating your Data Mart you add two types of data, facts and dimensions. New facts can and are often inserted into the target table. Updates to dimension are handled in a other ways. Fortunately, you have several options available to you in PDI.
- Table Output – This style of output is appropriate for creating new records in a database. It is assumed that old data is not in the target and is not dealt with by this step. This is most commonly used for fact table loads. Table output is fast and has parameters to further tune its performance to load large volumes of fact transactions.
- Update – As an opposite of Table Output, the Update step only performs update to existing data. This may appropriate if your dimensions primarily come from one source, but updates can be provided by another source system.
- Insert/Update – A combination of both above, this step allows you to insert or update a database table based on keys from the source. This is commonly used for dimensions where you only need the latest greatest information for your dimension properties. It assumes you have a unique key that can identify one record for update processing. If the key is present in the target table, an update is performed. If the key is not present, an insert is performed.
- Dimension Lookup / Update - This step is used to process Slowly Changing Dimensions (SCD). The most popular type of SCD’s is type 2, where you will insert a new record every time there is a change to your business data. Within this step, you can specify the fields to use as your business key and the fields to compare to see if there is a change within the business data.
- Delete – While most data marts do not delete information, but mark it as inactive. There is a need to delete information from a database table. This single purpose step is right for this case.
You now have broken down your ETL into multiple transformations and determined how you are going to load your data into those target tables. Next we need to determine how to select the data from the source system most efficiently and to support the target update strategy.
The starting point is to determine the grain of data for each of your transformations. The grain means the level of detail for each row selected. For example, when loading a Customer dimension, one record will equate to one Customer. For your dimensions, it will be the business key that you determined in the last step. For facts, it is often the intersection of all dimension surrogate keys or a transactional key.
For purposes of your PDI inputs, you should limit the input rows to only those tables that establish the key to this row. It is tempting to place joins to all of the tables you need in your source SQL query. However, this may result in “missing” values later on in the process. If a row is excluded during the join process, it cannot be inspected, logged, or sent to an error file.
For those reasons defined above, it is best to keep the SQL select to minimum and make a heavier use of lookups later in the transformation.
Steps and Hops, Time to Dance!
Last, but not least, identifying the PDI steps to transform your source data into the target content and format. This is actually the meat of the design and development time. Although this is the largest chunk of development time, you should consider the other three elements prior to diving straight to this step.
In PDI, each logical operation that you perform on the data is separated into a step. You connect each step with hops. The only question is what steps are appropriate for your needs? It is fairly odd and/or bad practice to go straight from an input step to an output step. You will therefore need one or more of the following types of transformation steps:
- Joins – Joins are provided to you to take one data set and columns of another data set to the original. The appropriate use of joins is when you have disparate types of data where the source cannot perform the join for you. They should be avoided when you are joining two tables from the same database.
- Lookups – There are several types of lookups provided within PDI for your use. All follow the same philosophy. Take rows of your in-stream data set and add column values from another data set. This has numerous advantages over joining everything in the database. For starters, the lookups allow you take action based on the lookup results. For example: If your sales transactions are referring to a product code that does not exist on your product dimension, you will likely want to log this, error the row, and or notify the administrator.
- Transforms – PDI transforms come in all shapes and sizes, but they are generally massaging your columns of data. The myriad of uses and steps to satisfy those uses are quite long and another paper could be written on just this topic.
- Scripting - Although PDI provides you with many steps to build your transformation, when all else fails, you may need to break down and write some code! For this need there are scripts
Hopefully this give you a good overview to get started designing your Pentaho Data Integration loads. Feel free to let me know if you have other flavors of designing your ETL.