Monday, January 5, 2009

Database Indexing for the Lookup SQL - A myth

I have seen people suggesting an index to improve the performance of any SQL. This suggestion is incorrect - many times. Specially when talking about indexing the condition port columns of Lookup SQL, it is far more "incorrect".

Before explaining why it is incorrect, I would try to detail the functionality of Lookup. To explain the stuff with an example, we take the usual HR schema EMP table. I have EMPNO, ENAME, SALARY as columns in EMP table.

Let us say, there is a lookup in ETL mapping that checks for a particular EMPNO and returns ENAME and SALARY from the Lookup. Now, the output ports for the Lookup are "ENAME" and "SALARY". The condition port is "EMPNO". Imagine that you are facing performance problems with this Lookup and one of the suggestion was to index the condition port.

As suggested (incorrectly) you create an index on EMPNO column in the underlying database table. Practically, the SQL the lookup executes is going to be this:

select EMPNO,
ENAME,
SALARY
from EMP
ORDER BY EMPNO,
ENAME,
SALARY;

The data resulted from this query is stored in the Lookup cache and then, each record from the source is looked up against this cache. So, the checking against the condition port column is done in the Informatica Lookup cache and "not in the database". So any index created in the database has no effect for this.

You may be wondering if we can replicate the same indexing here in Lookup Cache. You don't have to worry about it. PowerCenter create "index" cache and "data" cache for the Lookup. In this case, condition port data - "EMPNO" is indexed and hashed in "index" cache and the rest along with EMPNO is found in "data" cache.

I hope now you understand why indexing condition port columns doesn't increase performance.

Having said that, I want to take you to a different kind of lookup, where you would've disabled the caching. In this kind of Lookup, there is no cache. Everytime a row is sent into lookup, the SQL is executed against database. In this scenario, the database index "may" work. But, if the performance of the lookup is a problem, then "cache-less" lookup itself may be a problem.

I would go for cache-less lookup if my source data records is less than the number of records in my lookup table. In this case ONLY, indexing the condition ports will work. Everywhere else, it is just a mere chanse of luck, that makes the database pick up index.

I'm sure you will have many questions regarding Lookups after reading this blog article. I'm ready to answer, fire away.

Thursday, January 1, 2009

Designing the mapping/workflow efficiently - part I

I wanted to write some useful design tips for a mapping and workflow to increase the performance. As PowerCenter is a wide ocean and there can't be any exhaustive list of best practices. To address this, I want to add to this list in phases and that is the reason, this blog article is named as "part - I".

Please note that these are not any rules-of-thumb, but would lead you into a sensible decision while developing interfaces in PowerCenter. Nevertheless, the design decisions are to be driven by the architectural, organizational and business requirements.
  1. I would always suggest you to think twice before using an Update Strategy, though it adds me certain level of flexibility in the mapping. If you have a straight-through mapping which takes data from source and directly inserts all the records into the target, you wouldn’t need an update strategy.
  2. Use a pre-SQL delete statement if you wish to delete specific rows from target before loading into the target. Use truncate option in Session properties, if you wish to clean the table before loading. I would avoid a separate pipe-line in the mapping that roans before the load with update-strategy transformation.
  3. You have 3 sources and 3 targets with one-on-one mapping. If the load is independent according to business requirement, I would create 3 different mappings and 3 different session instances and they all run in parallel in my workflow after my “Start” task. I’ve observed that the workflow runtime comes down between 30-60% of serial processing.
  4. PowerCenter is built to work of high volumes of data. So let the server be completely busy. Induce parallelism as far as possible into the mapping/workflow.
  5. Needless to say, if any transformation waits for complete source data to be arrived in before it can proceed further in the mapping, use a sorter transformation to speed up the process. For eg: an Aggregator transformation, a Joiner transformation etc will perform well if sorted data is given to them and that a sorter transformation at the start will make the efficient.
  6. Push as much filtering as possible into the SQL override of the source-qualifier/lookup to make the database handle most of the filtering as databases are experts in doing that stuff than using a filter transformation.
  7. As I always say, the above task is similar to manual push-down optimization.

As written in the start of this article, we will continue to add to the list. Please make your contributions to this and we shall do our best to to make Informatica world generate efficient code for PowerCenter.

 
HTML Hit Counters