Radhakrishna Sarma

Monday, January 5, 2009

Database Indexing for the Lookup SQL - A myth

I have seen people suggesting an index to improve the performance of any SQL. This suggestion is incorrect - many times. Specially when talking about indexing the condition port columns of Lookup SQL, it is far more "incorrect".

Before explaining why it is incorrect, I would try to detail the functionality of Lookup. To explain the stuff with an example, we take the usual HR schema EMP table. I have EMPNO, ENAME, SALARY as columns in EMP table.

Let us say, there is a lookup in ETL mapping that checks for a particular EMPNO and returns ENAME and SALARY from the Lookup. Now, the output ports for the Lookup are "ENAME" and "SALARY". The condition port is "EMPNO". Imagine that you are facing performance problems with this Lookup and one of the suggestion was to index the condition port.

As suggested (incorrectly) you create an index on EMPNO column in the underlying database table. Practically, the SQL the lookup executes is going to be this:

select EMPNO,
ENAME,
SALARY
from EMP
ORDER BY EMPNO,
ENAME,
SALARY;

The data resulted from this query is stored in the Lookup cache and then, each record from the source is looked up against this cache. So, the checking against the condition port column is done in the Informatica Lookup cache and "not in the database". So any index created in the database has no effect for this.

You may be wondering if we can replicate the same indexing here in Lookup Cache. You don't have to worry about it. PowerCenter create "index" cache and "data" cache for the Lookup. In this case, condition port data - "EMPNO" is indexed and hashed in "index" cache and the rest along with EMPNO is found in "data" cache.

I hope now you understand why indexing condition port columns doesn't increase performance.

Having said that, I want to take you to a different kind of lookup, where you would've disabled the caching. In this kind of Lookup, there is no cache. Everytime a row is sent into lookup, the SQL is executed against database. In this scenario, the database index "may" work. But, if the performance of the lookup is a problem, then "cache-less" lookup itself may be a problem.

I would go for cache-less lookup if my source data records is less than the number of records in my lookup table. In this case ONLY, indexing the condition ports will work. Everywhere else, it is just a mere chanse of luck, that makes the database pick up index.

I'm sure you will have many questions regarding Lookups after reading this blog article. I'm ready to answer, fire away.

Thursday, January 1, 2009

Designing the mapping/workflow efficiently - part I

I wanted to write some useful design tips for a mapping and workflow to increase the performance. As PowerCenter is a wide ocean and there can't be any exhaustive list of best practices. To address this, I want to add to this list in phases and that is the reason, this blog article is named as "part - I".

Please note that these are not any rules-of-thumb, but would lead you into a sensible decision while developing interfaces in PowerCenter. Nevertheless, the design decisions are to be driven by the architectural, organizational and business requirements.

I would always suggest you to think twice before using an Update Strategy, though it adds me certain level of flexibility in the mapping. If you have a straight-through mapping which takes data from source and directly inserts all the records into the target, you wouldn’t need an update strategy.
Use a pre-SQL delete statement if you wish to delete specific rows from target before loading into the target. Use truncate option in Session properties, if you wish to clean the table before loading. I would avoid a separate pipe-line in the mapping that roans before the load with update-strategy transformation.
You have 3 sources and 3 targets with one-on-one mapping. If the load is independent according to business requirement, I would create 3 different mappings and 3 different session instances and they all run in parallel in my workflow after my “Start” task. I’ve observed that the workflow runtime comes down between 30-60% of serial processing.
PowerCenter is built to work of high volumes of data. So let the server be completely busy. Induce parallelism as far as possible into the mapping/workflow.
Needless to say, if any transformation waits for complete source data to be arrived in before it can proceed further in the mapping, use a sorter transformation to speed up the process. For eg: an Aggregator transformation, a Joiner transformation etc will perform well if sorted data is given to them and that a sorter transformation at the start will make the efficient.
Push as much filtering as possible into the SQL override of the source-qualifier/lookup to make the database handle most of the filtering as databases are experts in doing that stuff than using a filter transformation.
As I always say, the above task is similar to manual push-down optimization.

As written in the start of this article, we will continue to add to the list. Please make your contributions to this and we shall do our best to to make Informatica world generate efficient code for PowerCenter.

Tuesday, December 30, 2008

Why do you restrict developer?

I've been seeing some strange requests from developers these days. Here are few examples.

"I have a relational source and I want to take data for a column whose value is like '%XYZ%'. I want to accomplish this without SQL override on Source Qualifier".

"I have to create a copy of my target file in the archive location. Shall I create two target object instances in the mapping to acheive this? I can't use post-session command task".
...
...
...

It is certainly a surpirse to know as to why they want to avoid the obvious choice. All these developers have complained about development rules and guidelines imposing these restrictions at their organization level.

I have a question to the organizations.

"If you don't want to utilize the best options provided by PowerCenter to make your development/maintenance life easier, why do you opt to use PowerCenter at all?"

With the newer versions of PowerCenter, there is lot of flexibility provided to the developer to make his life lot easier than what it used to be during PowerMart times.

Imagine you code a mapping for the first developer question about '%XYZ%' and used a filter instead of overriding the SQL. If your soure table has 100 billion records and only 10 out of them satisfy '%XYZ%' condition. This will be the worst nightmare in support perspective. These are the mappings that eat into the organisations support budget and maintenance budget.

Why restrict the developer and suffer project financial and quality hammering?

There is another big problem with this kin of corporate behaviour. The inherent innovation capabilities of the developer are supressed and there is every possibility that the developer fails to use the best option when he/she moves to other organizations which give full freedom to the developers.

If this is the situation of the experienced developers, imagine some one who begin ETL PowerCenter development in these kind of restrictive-organiations. These poor developers would not even know that a certain option can be used in PowerCenter.

I hope this articles warns the organizations of a possible danger!

Friday, December 26, 2008

Avoid sequence generator

I would always advice you to avoid sequence generator while populating an ID column in the relational target table. I suggest you rather create a sequence on the target database and enable the trigger on that table to fetch the value from the database sequence.

There are many advantages by doing this way.

Less number of PowerCenter objects in the object which reduce development time and also less maintenance.
Today you may be writing the records into that target table using PowerCenter flow. What if you get another application tomorrow that also writes records into that table? Your ID column values go bizarre in this scenario.
During migration of objects from Dev to QA and also to Production further, there is no extra over-head of considering the persistent values of the sequence generator from the repository database.

In all of the above cases, a sequence created in the target database would make life lot easier for the table data maintenance and also for the PowerCenter development. In fact, databases will have specific mechanisms (focused) to deal with sequences and so you can implement manual Push-down optimization on your PowerCenter mapping design for yourself.

Performance of workflow - Most feared, but the easiest

With the newer and newer versions of PowerCenter, there is added flexibility for the developer to build better-performing workflows. Though performance tuning has been the most feared part of development, it is the easiest, if the intricacies are known. We shall discuss one major problems for performance - SQL query tuning.

SQL queries/actions occur in PowerCenter in one of the below ways.

Relational Source Qualifier
Lookup SQL Override
Stored Procedures
Relational Target

Accessing an execution plan for the query is the best way to tune it. Look for any table scans and remember "Full Table Scans are not evil", "Indexes are not always fast", and “Indexes can be slow too". Analyse the table data to see if picking up 20 records out of 20 million is best using index or using table scan. Fetching 10 records out of 15 using index is faster or using full table scan is easier.

Many times the relational target indexes create problem for the performance of the actual loading of records into the relational target. It may surprise you (some times?), the way PowerCenter has reduced the time of loading the target data if you disable the indexes and then enable them in post-SQL again.

Please note that the flow has to also consider the integrity constraints when you have disabled the indexes before loading.

FTP Connection object - Gain platform independence

If you have any files to be read as source from Windows server when your PowerCenter server is hosted on UNIX/LINUX, then makes use of FTP users on the Windows server and use File Reader with FTP Connection object. This connection object can be added as any other connection string. This gives the flexibility of platform independence. This will further reduce the over-head of having SAMBA mounts on to the Informatica UNIX boxes.

Monday, December 22, 2008

"This" and "That" of Pre/Post Session command tasks

It is a very good practice to email the success or failure status of a task, once it is done. In the same way, when a business requirement drives, make use of the Post Session Success and Failure email for proper communication. The built-in feature offers more flexibility with Session Logs as attachments and also provides other run-time data like Workflow run-instance ID
etc,.

Any archiving activities around the source and target flat files can be easily managed with-in the session using the post-session and pre-session commands. You can even build some resource-saving as well by issuing a gzip on the target file, to save space.

Radhakrishna Sarma

Monday, January 5, 2009

Database Indexing for the Lookup SQL - A myth

Thursday, January 1, 2009

Designing the mapping/workflow efficiently - part I

Tuesday, December 30, 2008

Why do you restrict developer?

Friday, December 26, 2008

Avoid sequence generator

Performance of workflow - Most feared, but the easiest

FTP Connection object - Gain platform independence

Monday, December 22, 2008

"This" and "That" of Pre/Post Session command tasks

Cricinfo International Scores

Followers

Blog Archive

About Me

Monday, January 5, 2009

Thursday, January 1, 2009

Tuesday, December 30, 2008

Friday, December 26, 2008

Monday, December 22, 2008

Cricinfo International Scores

Subscribe To

Followers

Blog Archive

About Me