Tuesday, December 30, 2008

Why do you restrict developers?

I've been seeing some strange requests from developers these days. Here are a few examples.

"I have a relational source and I want to take only the rows where a column's value is like '%XYZ%'. I want to accomplish this without a SQL override on the Source Qualifier".

"I have to create a copy of my target file in the archive location. Shall I create two target object instances in the mapping to achieve this? I can't use a post-session command task".

It is certainly a surprise that they want to avoid the obvious choice. All these developers complained about development rules and guidelines imposing these restrictions at the organization level.

I have a question for these organizations.

"If you don't want to utilize the best options provided by PowerCenter to make your development/maintenance life easier, why do you opt to use PowerCenter at all?"

With the newer versions of PowerCenter, the developer is given a lot of flexibility to make his life much easier than it was in the PowerMart days.

Imagine you code the mapping for the first developer's question about '%XYZ%' and use a Filter transformation instead of overriding the SQL. If your source table has 100 billion records and only 10 of them satisfy the '%XYZ%' condition, all 100 billion rows are read and pushed through the mapping just to discard nearly everything. From a support perspective, this is the worst nightmare. These are the mappings that eat into an organization's support and maintenance budgets.

Why restrict the developer and suffer project financial and quality hammering?

There is another big problem with this kind of corporate behaviour. The inherent innovation capabilities of the developer are suppressed, and there is every possibility that the developer fails to use the best option when he/she moves to an organization that gives developers full freedom.

If this is the situation for experienced developers, imagine someone who begins ETL PowerCenter development in one of these restrictive organizations. These poor developers would not even know that a certain option exists in PowerCenter.

I hope this article warns organizations of a possible danger!

Friday, December 26, 2008

Avoid sequence generator

I would always advise you to avoid the Sequence Generator while populating an ID column in a relational target table. I suggest you instead create a sequence on the target database and enable a trigger on that table to fetch the value from the database sequence.

There are many advantages to doing it this way.

  • Fewer PowerCenter objects in the mapping, which reduces development time and maintenance effort.
  • Today you may be writing records into that target table only through a PowerCenter flow. What if you get another application tomorrow that also writes records into that table? With a Sequence Generator, your ID column values can collide or go out of sequence; a database sequence serves both writers safely.
  • During migration of objects from Dev to QA, and further to Production, there is no extra overhead of carrying over the persistent values of the Sequence Generator from the repository database.

In all of the above cases, a sequence created in the target database makes life a lot easier for the table data maintenance and also for the PowerCenter development. In fact, databases have mechanisms specifically tuned for sequences, so in effect you are applying a manual push-down optimization to your PowerCenter mapping design.
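As a sketch, the database-side approach could look like this (Oracle syntax; the sequence, trigger, table and column names here are illustrative, not from any real schema):

```sql
-- Create a database sequence, then let a trigger assign the ID on insert,
-- so every application writing to this table gets consistent ID values.
CREATE SEQUENCE customer_id_seq START WITH 1 INCREMENT BY 1 CACHE 100;

CREATE OR REPLACE TRIGGER customer_id_trg
BEFORE INSERT ON customer
FOR EACH ROW
WHEN (NEW.customer_id IS NULL)
BEGIN
  SELECT customer_id_seq.NEXTVAL INTO :NEW.customer_id FROM dual;
END;
/
```

With this in place, the PowerCenter mapping simply leaves the ID column unconnected and the database fills it in.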

Performance of workflow - Most feared, but the easiest

With each newer version of PowerCenter, the developer gets added flexibility to build better-performing workflows. Though performance tuning has been the most feared part of development, it is the easiest once the intricacies are known. We shall discuss one major contributor to performance problems - SQL query tuning.

SQL queries/actions occur in PowerCenter in any of the following places.

  • Relational Source Qualifier
  • Lookup SQL Override
  • Stored Procedures
  • Relational Target

Examining the execution plan of the query is the best way to tune it. Look for table scans, but remember: "Full table scans are not evil", "Indexes are not always fast", and "Indexes can be slow too". Analyse the table data to decide each case on its merits - picking 20 records out of 20 million is usually best served by an index, while fetching 10 records out of 15 is often faster with a full table scan.
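As a minimal sketch of how to pull an execution plan (Oracle syntax; the table and columns are illustrative stand-ins for your Source Qualifier or Lookup override query):

```sql
-- Generate and display the execution plan for a candidate override query.
EXPLAIN PLAN FOR
SELECT order_id, order_amt
FROM   orders
WHERE  order_status = 'OPEN';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

The plan output shows whether the query uses an index range scan or a full table scan, which is the information you need before adding indexes or hints.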

Many times the indexes on a relational target hamper the actual loading of records. It may surprise you how much PowerCenter's load time drops if you disable the indexes before the load and rebuild them in the post-SQL.

Please note that the flow also has to take care of integrity constraints when the indexes are disabled before loading.
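A sketch of the idea using the session's target Pre SQL and Post SQL properties (Oracle syntax; the index name is illustrative):

```sql
-- Pre SQL: mark the index unusable so the bulk load skips index maintenance,
-- and tell the session to ignore unusable indexes.
ALTER INDEX orders_ix1 UNUSABLE;
ALTER SESSION SET SKIP_UNUSABLE_INDEXES = TRUE;

-- Post SQL: rebuild the index once the load completes.
ALTER INDEX orders_ix1 REBUILD;
```

Note that this does not apply to the index enforcing a primary key or unique constraint, which must stay in place to protect integrity.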

FTP Connection object - Gain platform independence

If you have files to be read as a source from a Windows server while your PowerCenter server is hosted on UNIX/Linux, then make use of an FTP user on the Windows server and use the file reader with an FTP connection object. This connection object can be added like any other connection string. This gives you platform independence, and it further reduces the overhead of maintaining Samba mounts on the Informatica UNIX boxes.

Monday, December 22, 2008

"This" and "That" of Pre/Post Session command tasks

It is a very good practice to email the success or failure status of a task once it is done. In the same way, when a business requirement drives it, make use of the post-session success and failure emails for proper communication. This built-in feature offers extra flexibility, such as attaching the session log and including other run-time data like the workflow run-instance ID.

Any archiving activities around the source and target flat files can be easily managed within the session using the pre-session and post-session commands. You can even build in some resource savings by issuing a gzip on the target file to save space.
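A sketch of what such a post-session success command could do - archive the target file and compress the copy. The directories and file name here are demo stand-ins; in a real session you would use $PMTargetFileDir and your actual target file:

```shell
# Demo stand-ins for $PMTargetFileDir and the archive location.
TARGET_DIR=/tmp/demo_target
ARCHIVE_DIR=/tmp/demo_archive
mkdir -p "$TARGET_DIR" "$ARCHIVE_DIR"
echo "id,name" > "$TARGET_DIR/orders.csv"   # demo target file

# Copy the target file to the archive with a date stamp, then gzip it.
STAMP=$(date +%Y%m%d)
cp "$TARGET_DIR/orders.csv" "$ARCHIVE_DIR/orders_$STAMP.csv"
gzip -f "$ARCHIVE_DIR/orders_$STAMP.csv"
```

This answers the second developer question from the earlier post: no second target instance is needed when a post-session command is allowed.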

Sunday, November 23, 2008

Never leave it a mystery - Multiple sources for one SQ

When you have different tables joined in the source qualifier, it is bad practice to include a table in the SQL override without making it visible in the mapping. Though PowerCenter works fine without the source being defined in the Source Analyzer, it is good practice to define the source and then drag it into the mapping onto the common source qualifier.

Don't make the SQL override of the Source Qualifier an orphan. The actual tables used in the query's FROM clause have to be part of the mapping and should be visible in it, attached to the same source qualifier.

Tuesday, August 26, 2008

How to generate target file names dynamically from the mapping?

I've seen this question many times on the Devnet forums recently. I thought I would elaborate on the PowerCenter documentation here.

In order to generate target file names from the mapping, we should make use of the special "FileName" port in the target file. You can't create this special port with the usual New port button. There is a special button labelled "F" at the right-most corner of the target flat file when viewed in the "Target Designer".

The two screenshots below show how to create the special port in your target file.

Once this is done, the job is almost complete. When you want a file name with a timestamp attached to it, just connect a port from an Expression transformation before the target, with an output port whose expression is $$FILE_NAME || TO_CHAR(SESSSTARTTIME, 'YYYYMMDDHH24MISS') || '.csv'.

Please note that $$FILE_NAME is a mapping parameter, and I've used SESSSTARTTIME because it stays constant throughout the session run.

If you use SYSDATE instead, the value keeps changing; with hundreds of millions of records and a session that runs for an hour, a new file would get created every second.

Please note that a new file gets created, with the current value of the port, whenever the value of the port mapped to FileName changes.

Coming back to the mapping: it generates two files. One is a dummy file of zero bytes, whose name is whatever is given in the session properties under the 'Mappings' tab for the target file name. The other is the actual file, created with the desired file name and data.

When you have different sets of input data that should go to different target files, use the same target instance, but with a Transaction Control transformation that defines the boundary between the source sets. A demonstration of this is available in the PowerCenter documentation.

Wednesday, March 5, 2008

10000 BC - utter disappointment (10000 bc movie review)

10000 BC...

One of the most boring movies I have seen in recent times. The incidents in the film are neither factual nor practical.

***** SPOILER *****

The plot is quite simple. Our young hero goes in search of his stolen girlfriend and travels across mountains, rivers, deserts and what not! He finally reaches the place where she is held captive as a slave. He fights the people who hold her and frees everyone from captivity.

Most of the scenes in the movie seem ridiculous to me. With great difficulty, I can pick just two of them for the most-ridiculous category.

One - When the hero falls into a trap, he finds that a smilodon (sabre-toothed tiger) is also trapped, stuck in a wooden frame. He first tries to kill it, but finally saves it, thinking it wouldn't kill him. To our surprise, the tiger doesn't kill him and leaves the place. Later, when the hero is surrounded by a group of strangers who try to kill him, this smilodon appears, and the scene plays as if the tiger had come to save the hero.

Two - When the hero is around 12 years of age, he says this to his young girlfriend: "All the clouds move, but that brighter cloud in the middle (he is referring to the Moon) doesn't move. In the same way, my love for you doesn't go away from my heart." This is the first dialogue he utters to her when they first meet.

With some horribly disturbing characters, the film's climax looked more like the human-sacrifice scene from Apocalypto. If one can digest the gratitude of an obligate carnivore and sentimental dialogues far removed from reality, this film is a very good entertainer.

One salute if you don't watch the movie after reading this review, two if you do.

Saturday, February 16, 2008

Informatica PowerCenter performance - Lookups

Lookup performance

Lookup is an important and useful transformation when used effectively.
What is a lookup transformation? It is not just another transformation that fetches data for you to check the source data against. Used improperly, it makes your flow run for ages.

I will now try to explain different scenarios where you can face problems with the lookup, and how to tackle them.

Unwanted columns:

By default, when you create a lookup on a table, PowerCenter gives you all the columns of the table. Be sure to delete the unwanted columns from the lookup, as they bloat the lookup cache. You only need the columns used in the lookup condition and the ones that have to be returned from the lookup.

SQL query:

We start at the database. Find the execution plan of the SQL override and see if you can add indexes or hints to make the query fetch data faster. You may have to take the help of a database developer to accomplish this if you yourself are not an SQLer.

Size of the source versus size of lookup:

Let us say you have 10 rows in the source, and one of the columns has to be checked against a big table (1 million rows). PowerCenter builds the cache for the lookup table and then checks the 10 source rows against the cache. It takes more time to build the cache of 1 million rows than to go to the database 10 times and look up against the table directly.

Use an uncached lookup instead of building a static cache, as the number of source rows is far smaller than that of the lookup table.

Conditional call of lookup:

Instead of going for a connected lookup with filters to make the call conditional, go for an unconnected lookup. Is the single-column return bothering you? Go ahead and change the SQL override to concatenate the required columns into one big column, then break it back into individual columns on the calling side.
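One way to sketch the single-return-port workaround in the lookup SQL override (illustrative table and column names; the '~' delimiter is an assumption and must be a character that never occurs in the data):

```sql
-- Pack several columns into one return column for an unconnected lookup;
-- the calling Expression transformation splits them apart again with
-- SUBSTR/INSTR on the '~' delimiter.
SELECT cust_id,
       first_name || '~' || last_name || '~' || city AS packed_cols
FROM   customer_dim
```

The split on the calling side costs a few string functions per row, which is usually far cheaper than caching the lookup for rows that never needed it.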

JOIN instead of Lookup:

In the same context as above, if the Lookup transformation comes right after the source qualifier with no active transformation in between, you can just as well use the SQL override of the source qualifier and join to the lookup table with a traditional database join, provided both tables are in the same database and schema.
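A sketch of such a Source Qualifier override (all table and column names are illustrative):

```sql
-- Replace the Lookup transformation with a database join in the
-- Source Qualifier override; the left outer join preserves source rows
-- that have no match, mimicking a lookup that returns NULL.
SELECT o.order_id,
       o.cust_id,
       c.cust_name          -- the column the Lookup used to return
FROM   orders o
LEFT OUTER JOIN customer_dim c
       ON c.cust_id = o.cust_id
```

This pushes the matching work down to the database, where the optimizer can use indexes and avoids building any lookup cache at all.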

Increase cache:

If none of the above seems to work, then the problem is certainly with the cache. The cache you assigned for the lookup is not sufficient to hold the data or the index of the lookup. Whatever doesn't fit into the cache spills into the cache files designated in $PMCacheDir. When PowerCenter doesn't find the data you are looking up in the in-memory cache, it swaps data in from the files and keeps doing so until it finds it. This is quite expensive, for obvious reasons, being an I/O operation. Increase the cache so that the whole data resides in memory.

What if your data is huge and your whole system's memory is less than that? Don't promise PowerCenter an amount of cache that can't be allotted at runtime. If you promise 10 MB but the system the flow runs on can only assign 5 MB during the run, PowerCenter fails the session with an error.

Cachefile file-system:

In many cases, if the cache directory is on a different file system than the hosting server's local disks, writing the cache files may take time and add latency. So, with the help of your system administrator, look into this aspect as well.

Useful cache utilities:

If the same lookup SQL is used in some other lookup, then go for a shared cache or reuse the lookup transformation. Also, if you have a table that doesn't get data inserted or updated very often, use a persistent cache, so that consecutive runs of the flow don't have to rebuild the cache and waste time.

After all these are tried out, sit and wait for the session to get completed - there is nothing else you can do about the lookup.

Thursday, February 14, 2008

About me

I did all my education in the Kurnool district of AP, India. After finishing college, I joined Automatic Data Processing Inc (ADP, Hyderabad) in June 2003 and worked there for almost 2.5 years. I then decided to move on and joined Tata Consultancy Services in October 2005. I've since resigned from TCS and am in Singapore at the moment.
DBMS (Database Management System) was one of the subjects I hated during my engineering (because of all the Normalization rules, blah blah). Yet here I am, ending up as an Informatica PowerCenter developer with more than essential knowledge of Oracle SQL & PL/SQL. The more I learn about Oracle, the more I like it.