In few of my other blog articles, I've been mentioning about the push down optimization. In simple terms, try to push as much processing as possible into the database source or target based on the necessity. I will explain few instances where we can do so.
Using analytics
Instead of using aggregator transformation or a combination of sorter/expression/filter for occassions when they are demanded, we can make use of the anlytic functions of the database. For eg: While reading your source data from the table, you may need to identify the last record from the source. For this requirement, the first option that comes into mind is something like this:
One source pipeline picking up the data records and the second pipeline counting the number of source records and eventually you equate that number to the current record number. When they are equal, it implies that the current record is the last record from the source.
As I said earlier, people resort to a count(*) and GROUP BY from SQL or an aggregator/sorter/expression/filter combination for the same. Instead of this, the analytic functions provide you greater flexibility. The same task can be acheived with the below SQL.
select col1,
col2,
col2,
lead (col1) over (order by input_sort_criteria) next_record
from table_name
/
For the last record with the given "input_sort_criteria", the value will be NULL.
Avoid Stored Procedures
I have seen many mappings with unnecessary usage of Stored Procedures. Stored Procedures "are" performance killers. If run for every record in the source, stored procedures consume much of the workflow run-time. A database connection has to be established each time for a record and then the execution takes place in the database. As you can see, this beahviour leads to worst performing interfaces.
A general guideline is that when you can do it in PowerCenter using transformations, do not use store procedures.
Conclusion:
The two sections in this article seem to be conflicting with each other. In the first section, I've urged you to push as much processing as possible to the database and in the second section, I'm suggesting you to avoid database.
The difference here is when you connect to database using a Lookup or Source qualifier or pre/post-SQL procedures, you are really connecting to the database only ONCE for all the source records. When you connect to the database using a stored procedure transformation that is called for each record, then you are connecting to the database for as many times as the number of records in the source.
When you have a vehicle (a loading/unloading truck) at your disposal, would you take 100 bricks from one place to the other place at a time or each brick at a time in the vehicle?
Showing posts with label Performance. Show all posts
Showing posts with label Performance. Show all posts
Thursday, April 2, 2009
Saturday, February 16, 2008
Informatica PowerCenter performance - Lookups
Lookup performance
Lookup is an important and a useful transformation when used effectively.
What is a lookup transformation? It is just not another transformation which fetches you data to look against the source data. It is a transformation when used improperly, makes your flow run for ages.
I now try to explain different scenarios where you can face problems with Lookup and also how to tackle them.
Unwanted columns:
By default, when you create a lookup on a table, PowerCenter gives you all the columns in the table, but be sure to delete the unwanted columns from the lookup as they affect the lookup cache very much. You only need columns that are to be used in lookup condition and the ones that have to get returned from the lookup.
SQL query:
We will start from the database. Find the execution plan of the SQL override and see if you can add some indexes or hints to the query to make it fetch data faster. You may have to take the help of a database developer to accomplish this if you, yourself are not an SQLer.
Size of the source versus size of lookup:
Let us say, you have 10 rows in the source and one of the columns has to be checked against a big table (1 million rows). Then PowerCenter builds the cache for the lookup table and then checks the 10 source rows against the cache. It takes more time to build the cache of 1 million rows than going to the database 10 times and lookup against the table directly.
Use uncached lookup instead of building the static cache, as the number of source rows is quite less than that of the lookup.
Conditional call of lookup:
Instead of going for connected lookups with filters for a conditional lookup call, go for unconnected lookup. Is the single column return bothering for this? Go ahead and change the SQL override to concatenate the required columns into one big column. Break them at the calling side into individual columns again.
JOIN instead of Lookup:
In the same context as above, if the Lookup transformation is after the source qualifier and there is no active transformation in-between, you can as well go for the SQL over ride of source qualifier and join traditionally to the lookup table using database joins, if both the tables are in the same database and schema.
Increase cache:
If none of the above seems to be working, then the problem is certainly with the cache. The cache that you assigned for the lookup is not sufficient to hold the data or index of the lookup. Whatever data that doesn't fit into the cache is spilt into the cache files designated in $PMCacheDir. When the PowerCenter doesn't find the data you are lookingup in the cache, it swaps the data from the file to the cache and keeps doing this until it finds the data. This is quite expensive for obvious reasons being an I/O operation. Increase the cache so that the whole data resides in the memory.
What if your data is huge and your whole system cache is less than that? Don't promise PowerCenter the amount of cache that it can't be allotted during the runtime. If you promise 10 MB and during runtime, your system on which flow is running runs out of cache and can only assign 5MB. Then PowerCenter fails the session with an error.
Cachefile file-system:
In many cases, if you have cache directory in a different file-system than that of the hosting server, the cache file piling up may take time and result in latency. So with the help of your system administrator try to look into this aspect as well.
Useful cache utilities:
If the same lookup SQL is being used in someother lookup, then you have to go for shared cache or reuse the lookup. Also, if you have a table that doesn't get data updated or inserted quite often, then use the persistent cache because the consecutive runs of the flow don't have to build the cache and waste time.
After all these are tried out, sit and wait for the session to get completed - there is nothing else you can do about the lookup.
Lookup is an important and a useful transformation when used effectively.
What is a lookup transformation? It is just not another transformation which fetches you data to look against the source data. It is a transformation when used improperly, makes your flow run for ages.
I now try to explain different scenarios where you can face problems with Lookup and also how to tackle them.
Unwanted columns:
By default, when you create a lookup on a table, PowerCenter gives you all the columns in the table, but be sure to delete the unwanted columns from the lookup as they affect the lookup cache very much. You only need columns that are to be used in lookup condition and the ones that have to get returned from the lookup.
SQL query:
We will start from the database. Find the execution plan of the SQL override and see if you can add some indexes or hints to the query to make it fetch data faster. You may have to take the help of a database developer to accomplish this if you, yourself are not an SQLer.
Size of the source versus size of lookup:
Let us say, you have 10 rows in the source and one of the columns has to be checked against a big table (1 million rows). Then PowerCenter builds the cache for the lookup table and then checks the 10 source rows against the cache. It takes more time to build the cache of 1 million rows than going to the database 10 times and lookup against the table directly.
Use uncached lookup instead of building the static cache, as the number of source rows is quite less than that of the lookup.
Conditional call of lookup:
Instead of going for connected lookups with filters for a conditional lookup call, go for unconnected lookup. Is the single column return bothering for this? Go ahead and change the SQL override to concatenate the required columns into one big column. Break them at the calling side into individual columns again.
JOIN instead of Lookup:
In the same context as above, if the Lookup transformation is after the source qualifier and there is no active transformation in-between, you can as well go for the SQL over ride of source qualifier and join traditionally to the lookup table using database joins, if both the tables are in the same database and schema.
Increase cache:
If none of the above seems to be working, then the problem is certainly with the cache. The cache that you assigned for the lookup is not sufficient to hold the data or index of the lookup. Whatever data that doesn't fit into the cache is spilt into the cache files designated in $PMCacheDir. When the PowerCenter doesn't find the data you are lookingup in the cache, it swaps the data from the file to the cache and keeps doing this until it finds the data. This is quite expensive for obvious reasons being an I/O operation. Increase the cache so that the whole data resides in the memory.
What if your data is huge and your whole system cache is less than that? Don't promise PowerCenter the amount of cache that it can't be allotted during the runtime. If you promise 10 MB and during runtime, your system on which flow is running runs out of cache and can only assign 5MB. Then PowerCenter fails the session with an error.
Cachefile file-system:
In many cases, if you have cache directory in a different file-system than that of the hosting server, the cache file piling up may take time and result in latency. So with the help of your system administrator try to look into this aspect as well.
Useful cache utilities:
If the same lookup SQL is being used in someother lookup, then you have to go for shared cache or reuse the lookup. Also, if you have a table that doesn't get data updated or inserted quite often, then use the persistent cache because the consecutive runs of the flow don't have to build the cache and waste time.
After all these are tried out, sit and wait for the session to get completed - there is nothing else you can do about the lookup.
Labels:
Informatica,
Lookup,
Performance,
PowerCenter
Subscribe to:
Posts (Atom)