Monday, January 5, 2009

Database Indexing for the Lookup SQL - A myth

I have seen people suggesting an index to improve the performance of any SQL. This suggestion is incorrect - many times. Specially when talking about indexing the condition port columns of Lookup SQL, it is far more "incorrect".

Before explaining why it is incorrect, I would try to detail the functionality of Lookup. To explain the stuff with an example, we take the usual HR schema EMP table. I have EMPNO, ENAME, SALARY as columns in EMP table.

Let us say, there is a lookup in ETL mapping that checks for a particular EMPNO and returns ENAME and SALARY from the Lookup. Now, the output ports for the Lookup are "ENAME" and "SALARY". The condition port is "EMPNO". Imagine that you are facing performance problems with this Lookup and one of the suggestion was to index the condition port.

As suggested (incorrectly) you create an index on EMPNO column in the underlying database table. Practically, the SQL the lookup executes is going to be this:

select EMPNO,
ENAME,
SALARY
from EMP
ORDER BY EMPNO,
ENAME,
SALARY;

The data resulted from this query is stored in the Lookup cache and then, each record from the source is looked up against this cache. So, the checking against the condition port column is done in the Informatica Lookup cache and "not in the database". So any index created in the database has no effect for this.

You may be wondering if we can replicate the same indexing here in Lookup Cache. You don't have to worry about it. PowerCenter create "index" cache and "data" cache for the Lookup. In this case, condition port data - "EMPNO" is indexed and hashed in "index" cache and the rest along with EMPNO is found in "data" cache.

I hope now you understand why indexing condition port columns doesn't increase performance.

Having said that, I want to take you to a different kind of lookup, where you would've disabled the caching. In this kind of Lookup, there is no cache. Everytime a row is sent into lookup, the SQL is executed against database. In this scenario, the database index "may" work. But, if the performance of the lookup is a problem, then "cache-less" lookup itself may be a problem.

I would go for cache-less lookup if my source data records is less than the number of records in my lookup table. In this case ONLY, indexing the condition ports will work. Everywhere else, it is just a mere chanse of luck, that makes the database pick up index.

I'm sure you will have many questions regarding Lookups after reading this blog article. I'm ready to answer, fire away.

25 comments:

sandeep said...

Nice Information to read.!!
Thanks Sarma..!!

~Sandeep Garg

Radhakrishna Sarma said...

Thank you Sandeep.

Cheers
Sarma.

bibendum59 said...

I was just working on a similar issue the other day and was confused as to how the caching was really being used. Thanks for the clarification.

Kal said...

Hi Radha,

Thanks for sharing wonderful tips.
well I am new to Informatica and I am trying to find out different possible Performance issues in real time and how to solve them.

Can U pls give us some scenarios with the explanation.

Thanks

kaleem.informatica@gmail.com

Radhakrishna Sarma said...

Hi Kaleem,

Thank you for viewing the article. All the above tips that I mentioned are real time scenarios only. Keep vising my blog and you will know more. If you have any specific question, you can raise it in technet.informatica.com forums or ittoolbox.com forums, I or anyone else will be able to answer.

Cheers
Sarma.

Bullet baba said...

Thanks for the recipe...I was thinking for a long time ..how to make database indexing masala with lookup curry...this article helped me a lot...Sorry i added a pinch of extra salary column into the curry and all my friends loved the recipe.
Next time please send more recipes about other informatica curries...do u sell seperate masalas also?if yes iam interested in buying it too..

Radhakrishna Sarma said...

Funny handle you have and the comment is also funny too.

BTW, I haven't really understood this masala stuff. Anyway, thanks for viewing the blog and leaving a comment on the same.

Cheers
Sarma.

Pratip Sen said...
This comment has been removed by the author.
Mahi said...

nice and informative article..

Thanks
Mahima

Kaleem said...

Nice article, however, I have quick question for you. U mentioned that Lookup with cache will not work, then what do U suggest for using the cache where we have Index cache and data cache at the informatica level and it will work as the index at the database level????

Radhakrishna Sarma said...

Hi Kaleem,

For tuning the performance of the lookup, I've written a seperate article in my blog, the link to which is below. Please go through it.

http://radhakrishnasarma.blogspot.com/2008/02/informatica-powercenter-performance.html

Cheers
Sarma.

Pramod Garre said...

HI ,

I have a source data of say 100 rows and lookup of 1 million records,why should i cache it ?i feel we should not and we should lookup from DB itself with index on where clause..your take ??

Pramod Garre said...

HI,

One more question .Coming to lookup with cache ,as you mentioned query will be as follows

select EMPNO,
ENAME,
SALARY
from EMP
ORDER BY EMPNO,
ENAME,
SALARY;


Why it is doing a order by.I want to lookup simply whether corresponding keys exists or not,then why it is doing order by or Sorting in ORACLE.I know how expensive and IO intensive a sort operation can be.I pity this is a default behaviour of Informatica :(

Radhakrishna Sarma said...

Hi Pramod,

That is what I have written this article. "I would go for cache-less lookup if my source data records is less than the number of records in my lookup table."

Cheers
Sarma.

Radhakrishna Sarma said...

Ordering the coluns should be the obvious phenomenon. Search against the ordered data is faster. At the same time, on multiple match, result out of ordered data makes a good point. :-)

Cheers
Sarma.

Nitin said...

Nice post Radhakrishna!
One quick comment: Most of the lookup overrides use filter conditions ("where" clause) for building lookup cache..index will definitely help in those cases..isn't it?

Radhakrishna Sarma said...

Hi Nitin,

Thank you for visiting my blog. As you said, Indexes will help when you have that on the columns in the WHERE clause, but my post is surrounded against the indexing of Lookup condition port columns.

Hope it helps.

Cheers
Sarma.

sowmya said...

hi sharma,

I hope a specail care need to be taken while using orderby clause in lookup override,I am confused with this concept,can you pleease explin this for me??

Radhakrishna Sarma said...

Hi Sowmya,

You are right. The special care is to appaend the SQL Override at the end with a double-hyphen. This is to disable the default ORDER BY clause generated by PowerCenter. Anything after double-hyphen is anyway ignored by the SQL engine, so your ORDER BY works. I hope you got it now.

Cheers
Sarma.

Ravz said...

Nice information... This is really a big myth that most of Powercentre techies believe in.. However I would like to ask one thing e.g. in our Lookup SQL override we are having some join or some condition where using indexing will fetch records faster for creating cache, we should def go for indexing on those columns.. I hope I am correct in this sentence.. Please confirm..

Radhakrishna Sarma said...

Hi Ravz,

As the article says, you should not reply on the condition port column index. If there is a JOIN in the SQL override, then all the rules that hold good for the SQL tuning hold good for that. So in that case, to quicken the execution, you can use Indexes on the JOIN key columns and not the condition port columns.

Both are two different issues that we are discussing here.

I hope you understand now.

Cheers
Sarma.

Abhi said...

Hi..

I am not too convinced about the indexing part..
Correct that informatica maintains its own index, but that is to avoid reading data for each input row from the lookup table.

The indexing is used to speed up the lookup override performance to create the cache..let me know what you think about it..

Radhakrishna Sarma said...

Hi Abhi,

You tell me how do you think the index on the condition ports of the lookup makes it easier for fetching the data. You don't have a predicate when you are fetching from the database. So how do you think it works better if you have an index?

Cheers
Sarma

Anonymous said...

Thanks Radha! Found your post to be very insightful and helpful...another good one for refresher:

explanation of Database index

srikanth said...

Hi Regularly follow your blog to improve my skills.can you share the integration of tableau with informatica
thank you!!
i learn informatica online course 2 months back at QEdge Technologies. is there any recommendation video tutorials for a fresher.

 
HTML Hit Counters