Radhakrishna Sarma: 2009

Wednesday, November 11, 2009

Informatica 9 Launch

Though the webinar on Informatica 9 launch is studded with Key Note, Breakouts etc., I could only grab these points from the presentation about the new version.

Logical Data objects - Something like a source, to which you can apply certain rules (set of mapplets or transformations) and you can deploy them as SQL services or web services so that you can view the data using a Browser. Note that till Informatica 8.x, you could only run transformations by creating a mapping, but in Informatica 9, you don't need a mapping, but a data object would do.
Informatica Developer tool - The only client tool I could see in the webinar is something with "D" as a task-bar icon, doesn't necessarily mean "Designer" as in 8.x.
Powerful Search tool to search the entire metadata for a port-name or even for a transformation name.
With the new Analyst tools, there is extensive flexibility to the Business user to customize the Informatica objects for his needs. This would substantially reduce the Development effort, but considering the effort/cost in training Business Users to customize the developed Informatica objects, I am not sure if it is all a cake-walk.
There are no two different objects as source or target. You can just import one object and can be used as a Source and also as a target, there by reducing the metadata storage.

Please note that I have not seen the New version yet, all the above is only from the webinar that I have attended at http://www.informatica.com/9

Wednesday, September 2, 2009

Informatica PowerCenter Repository tables

I am sure every PowerCenter developer either has an intention or necessity to know about the Informatica metadata tables and where information is stored etc. For the starters, all the objects that we create in Informatica PowerCenter - let them be sources, targets, mappings, workflows, sessions, expressions, be it anything related to PowerCenter, will get stored in a set of database tables (call them as metadata tables or OPB tables or repository tables).

* I want to know all the sessions in my folder that are calling some shell script/command in the Post-Session command task.
* I want to know how many mappings have transformations that contain "STOCK_CODE" defined as a port.
* I want to know all unused ports in my repository of 100 folders.

In repositories where you have many number of sessions or workflows or mappings, it gets difficult to achieve this with the help of Informatica PowerCenter client tools. After all, whole of this data is stored in some form in the metadata tables. So if you know the data model of these repository tables, you will be in a better position to answer these questions.

Before we proceed further, let me clearly urge for something very important. Data in the repository/metadata/OPB tables is very sensitive and that the modifications like insert or updates are to be made using the PowerCenter tools ONLY. DO NOT DIRECTLY USE UPDATE OR INSERT COMMANDS AGAINST THESE TABLES.

Please also note that there is no official documentation from Informatica Corporation on how these tables act. It is purely based on my assumption, research and experience that I am providing these details. I will not be responsible to any of the damages caused if you use any statement other than the SELECT, knowing the details from this blog article. This is my disclaimer. Let us move on to the contents now.

There around a couple of hundred OPB tables in 7.x version of PowerCenter, but in 8.x, this number crosses 400. In this regard, I am going to talk about few important tables in this articles. As such, this is not a small topic to cover in one article. I shall write few more to cover other important tables like OPB_TDS, OPB_SESSLOG etc.

We shall start with OPB_SUBJECT now.

OPB_SUBJECT - PowerCenter folders table

This table stores the name of each PowerCenter repository folder.

Usage: Join any of the repository tables that have SUBJECT_ID as column with that of SUBJ_ID in this table to know the folder name.

OPB_MAPPING - Mappings table

This table stores the name and ID of each mapping and its corresponding folder.

Usage: Join any of the repository tables that have MAPPING_ID as column with that of MAPPING_ID in this table to know the mapping name.

OPB_TASK - Tasks table like sessions, workflow etc

This table stores the name and ID of each task like session, workflow and its corresponding folder.

Usage: Join any of the repository tables that have TASK_ID as column with that of TASK_ID/SESSION_ID in this table to know the task name. Observe that the session and also workflow are stored as tasks in the repository. TASK_TYPE for session is 68 and that of the workflow is 71.

OPB_SESSION - Session & Mapping linkage table

This table stores the linkage between the session and the corresponding mapping. As informed in the earlier paragraph, you can use the SESSION_ID in this table to join with TASK_ID of OPB_TASK table.

OPB_TASK_ATTR - Task attributes tables

This is the table that stores the attribute values (like Session log name etc) for tasks.

Usage: Use the ATTR_ID of this table to that of the ATTR_ID of OPB_ATTR table to find what each attribute in this table means. You can know more about OPB_ATTR table in the next paragraphs.

OPB_WIDGET - Transformations table

This table stores the names and IDs of all the transformations with their folder details.

Usage: Use WIDGET_ID from this table to that of the WIDGET_ID of any of the tables to know the transformation name and the folder details. Use this table in conjunction with OPB_WIDGET_ATTR or OPB_WIDGET_EXPR to know more about each transformation etc.

OPB_WIDGET_FIELD - Transformation ports table

This table stores the names and IDs of all the transformation fields for each of the transformations.

Usage: Take the FIELD_ID from this table and match it against the FIELD_ID of any of the tables like OPB_WIDGET_DEP and you can get the corresponding information.

OPB_WIDGET_ATTR - Transformation properties table

This table stores all the properties details about each of the transformations.

Usage: Use the ATTR_ID of this table to that of the ATTR_ID of OPB_ATTR table to find what each attribute in this transformation means.

OPB_EXPRESSION - Expressions table

This table stores the details of the expressions used anywhere in PowerCenter.

Usage: Use this table in conjunction with OPB_WIDGET/OPB_WIDGET_INST and OPB_WIDGET_EXPR to get the expressions in the Expression transformation for a particular, mapping or a set.

OPB_ATTR - Attributes

This table has a list of attributes and their default values if any. You can get the ATTR_ID from this table and look it up against any of the tables where you can get the attribute value. You should also make a note of the ATTR_TYPE, OBJECT_TYPE_ID before you pick up the ATTR_ID. You can find the same ATTR_ID in the table, but with different ATTR_TYPE or OBJECT_TYPE_ID.

OPB_COMPONENT - Session Component

This table stores the component details like Post-Session-Success-Email, commands in Post-Session/pre-Session etc.

Usage: Match the TASK_ID with that of the SESSION_ID in OPB_SESSION table to get the SESSION_NAME and to get the shell command or batch command that is there for the session, join this table with OPB_TASK_VAL_LIST table on TASK_ID.

OPB_CFG_ATTR - Session Configuration Attributes

This table stores the attribute values for Session Object configuration like "Save Session log by", Session log path etc.

Wednesday, July 15, 2009

Error Logging in PowerCenter

In order to capture any Informatica PowerCenter errors into a flat file or database during runtime, Informatica Corporation suggests row-level error logging. The major disadvantage with this is that the performance of the workflow is affected because of row-level processing as opposed to block processing.

In order to overcome this, a simple approach can be followed that can be a common approach for all the workflows. This approach is based on the fact that $Session_Name.ErrorMsg will store NULL value if the session runs fine, otherwise stores the latest error message from the Session run.

1) Create two workflow variables - one for Error message $$ERROR_MESSAGE and the other $$SESSION_NAME to store the failed session name.

2) Create an assignment task in the workflow and create links to it from each of the sessions. Please note that the flow should be TOWARDS the assignment task from the sessions.

3) Modify the link expression for all these links to $Session_Name.PrevTaskStatus = FAILED.

4) In the assignment task, assign $Session_Name.ErrorMsg to the workflow variable $$ERROR_MESSAGE and assign Session_Name to $$SESSION_NAME.

5) You need a bit of nested iifs to achieve this.

For variable $$ERROR_MESSAGE, the expression contains

:udf.if_null_or_blank($Session_Name_1.ErrorMsg,
    :udf.if_null_or_blank($Session_Name_2.ErrorMsg,
        :udf.if_null_or_blank($Session_Name_3.ErrorMsg,
            :udf.if_null_or_blank($Session_Name_4.ErrorMsg ,
                :udf.if_null_or_blank($Session_Name_5.ErrorMsg ,
                    :udf.if_null_or_blank($Session_Name_6.ErrorMsg ,
                        :udf.if_null_or_blank($Session_Name_7.ErrorMsg ,
                            :udf.if_null_or_blank($Session_Name_8.ErrorMsg ,
                                :udf.if_null_or_blank($Session_Name_9.ErrorMsg ,
                                    :udf.if_null_or_blank($Session_Name_10.ErrorMsg ,
                                        :udf.if_null_or_blank($Session_Name_11.ErrorMsg ,
                                            :udf.if_null_or_blank($Session_Name_12.ErrorMsg ,
                                                 'A Fatal Error occurred' 
                                 ,$Session_Name_12.ErrorMsg
                                 )
                              ,$Session_Name_11.ErrorMsg
                              )
                           ,$Session_Name_10.ErrorMsg
                           )
                        ,$Session_Name_9.ErrorMsg
                        )
                     ,$Session_Name_8.ErrorMsg
                     )
                  ,$Session_Name_7.ErrorMsg
                  )
               ,$Session_Name_6.ErrorMsg
               )
            ,$Session_Name_5.ErrorMsg
            )
         ,$Session_Name_4.ErrorMsg
         )
      ,$Session_Name_3.ErrorMsg
      )
   ,$Session_Name_2.ErrorMsg
   )
,$Session_Name_1.ErrorMsg
)

:udf.if_null_or_blank(input_String_Argument, output_if_NULL_Blank, output_if_not_NULL_Blank) is a user-defined function with expression contents

iif(isnull(input_String_Argument) or length(ltrim(rtrim(input_String_Argument)))
= 0, output_if_NULL_Blank, output_if_not_NULL_Blank)

In the same way, create the expression for the $$SESSION_NAME. It should be the same expression as for the $$ERROR_MESSAGE but in the else part of the iif(), the session names should be specified instead of ErrorMsg.

6) From this assignment task take a link to a session which stores the contents of these workflow variables into a database table or a flat file. let us call this session mapping as LOG mapping.

You may question the scope of these workflow variable inside the Log mapping. If you can use the workflow variable in the source qualifier SQL override, then you can get the data from the same. Like this:

select '$$ERROR_MESSAGE', '$$SESSION_NAME'
from dual

Take 2 ports out of the source qualifier onto a expression transformation and then continue loading into a relation table target or a flat file target.

7) It is IMPORTANT to make sure that the General tab property of the Assignment task --> Treat Input Links as "OR". This makes sure if at least one session fails, the assignment task is triggered and the error is logged.

If you implement the Error Logging this way, you will be able to catch all kinds of Informatica errors.

Wednesday, April 22, 2009

Dynamically generate parameter files

I've seen this in many places, but just want to tell you the way I have been doing it for few years now. I will take the context in a bit more detailed manner to make every one understand it.

1. PowerCenter objects – Introduction:

• A repository is the highest physical entity of a project in PowerCenter.

• A folder is a logical entity in a PowerCenter project. For example, Customer_Data is a folder.

• A workflow is synonymous to a set of programs in any other programming language.

• A mapping is a single program unit that holds the logical mapping between source and target with required transformations. A mapping will just say a source table by name EMP exists with some structure. A target flat file by name EMP_FF exists with some structure. The mapping doesn’t say in which schema this EMP table exists and in which physical location this EMP_FF table going to be stored.

• A session is the physical representation of the mapping. The session defines what a maping didn’t do. The session stores the information about where this EMP table comes from. Which schema, with what username and password can we access this table in that schema. It also tells about the target flat file. In which physical location the file is going to get created.

• A transformation is a sub-program that performs a specific task with the input it gets and returns some output. It can be assumed as a stored procedure in any database. Typical examples of transformations are Filter, Lookup, Aggregator, Sorter etc.

• A set of transformations, that are reusable can be built into something called mapplet. A mapplet is a set of transformations aligned in a specific order of execution.

As with any other tool or programing language, PowerCenter also allows parameters to be passed to have flexibility built into the flow. Parameters are always passed as data in flat files to PowerCenter and that file is called the parameter file.

2. Parameter file format for PowerCenter:

For a workflow parameter which can be used by any session in the workflow, below is the format in which the parameter file has to be created.

[Folder_name:WF.Workflow_Name]
$$parameter_name1=value
$$parameter_name2=value

For a session parameter which can be used by the particular session, below is the format in which the parameter file has to be created.

[Folder_name:WF.Workflow_Name:ST.Session_Name]
$$parameter_name1=value
$$parameter_name2=value

3. Parameter handling in a data model:

• To have flexibility in maintaining the parameter files.

• To reduce the overhead for the support to change the parameter file every time a value of a parameter changes

• To ease the deployment,
all the parameters have to be maintained in Oracle or any database tables and a PowerCenter session is created to generate the parameter file in the required format automatically.

For this, 4 tables are to be created in the database:

1. FOLDER table will have entries for each folder.

2. WORKFLOWS table will have the list of each workflow but with a reference to the FOLDERS table to say which folder this workflow is created in.

3. PARAMETERS table will hold all the parameter names irrespective of folder/workflow.

4. PARAMETER_VALUES table will hold the parameter of each session with references to PARMETERS table for parameter name and WORKFLOWS table for the workflow name. When the session name is NULL, that means the parameter is a workflow variable which can be used across all the sessions in the workflow.

To get the actual names because PARAMETER_VALUES table holds only ID columns of workflow and parameter, we create a view that gets all the names for us in the required format of the parameter file. Below is the DDL for the view.

a. Parameter file view:

CREATE OR REPLACE VIEW PARAMETER_FILE
(
HEADER,
DETAIL
)
AS
select '['fol.folder_name'.WF:' wfw.workflow_name']' header
,pmr.parameter_namenvl2(dtl.logical_name, '_'dtl.logical_name, NULL)'='
dtl.value detail
from folder fol
,parameters pmr
,WORKFLOWS wfw
,PARAMETER_VALUES dtl
where fol.id = wfw.folder_id
and dtl.pmr_id = pmr.id
and dtl.wfw_id = wfw.id
and dtl.session_name is null
UNION
select '['fol.folder_name'.WF:' wfw.workflow_name'.ST:' dtl.session_name']' header
,decode(dtl.mapplet_name, NULL, NULL, dtl.mapplet_name'.')
pmr.parameter_namenvl2(dtl.logical_name, '_'dtl.logical_name, NULL)'=' dtl.value detail
from folder fol
,parameters pmr
,WORKFLOWS wfw
,PARAMETER_VALUES dtl
where fol.id = wfw.folder_id
and dtl.pmr_id = pmr.id
and dtl.wfw_id = wfw.id
and dtl.session_name is not null

b. FOLDER table

ID (NUMBER)
FOLDER_NAME (varchar50)
DESCRIPTION (varchar50)

c. WORKFLOWS table

ID (NUMBER)
WORKFLOW_NAME (varchar50)
FOLDER_ID (NUMBER) Foreign Key to FOLDER.ID
DESCRIPTION (varchar50)

d. PARAMETERS table

ID (NUMBER)
PARAMETER_NAME (varchar50)
DESCRIPTION (varchar50)

e. PARAMETER_VALUES table

ID (NUMBER)
WF_ID (NUMBER)
PMR_ID (NUMBER)
LOGICAL_NAME (varchar50)
VALUE (varchar50)
SESSION_NAME (varchar50)

• LOGICAL_NAME is a normalization initiative in the above parameter logic. For example, in a mapping if we need to use $$SOURCE_FX as a parameter and also $$SOURCE_TRANS as another mapping parameter, instead of creating 2 different parameters in the PARAMETERS table, we create one parameter $$SOURCE. Then FX and TRANS will be two LOGICAL_NAME records of the PARAMETER_VALUES table.

• m_PARAMETER_FILE is the mapping that creates the parameter file in the desired format and the corresponding session name is s_m_PARAMETER_FILE.

Thursday, April 2, 2009

Designing the mapping/workflow efficiently - part II

In few of my other blog articles, I've been mentioning about the push down optimization. In simple terms, try to push as much processing as possible into the database source or target based on the necessity. I will explain few instances where we can do so.

Using analytics

Instead of using aggregator transformation or a combination of sorter/expression/filter for occassions when they are demanded, we can make use of the anlytic functions of the database. For eg: While reading your source data from the table, you may need to identify the last record from the source. For this requirement, the first option that comes into mind is something like this:

One source pipeline picking up the data records and the second pipeline counting the number of source records and eventually you equate that number to the current record number. When they are equal, it implies that the current record is the last record from the source.

As I said earlier, people resort to a count(*) and GROUP BY from SQL or an aggregator/sorter/expression/filter combination for the same. Instead of this, the analytic functions provide you greater flexibility. The same task can be acheived with the below SQL.

select col1,
col2,
col2,
lead (col1) over (order by input_sort_criteria) next_record
from table_name
/

For the last record with the given "input_sort_criteria", the value will be NULL.

Avoid Stored Procedures

I have seen many mappings with unnecessary usage of Stored Procedures. Stored Procedures "are" performance killers. If run for every record in the source, stored procedures consume much of the workflow run-time. A database connection has to be established each time for a record and then the execution takes place in the database. As you can see, this beahviour leads to worst performing interfaces.

A general guideline is that when you can do it in PowerCenter using transformations, do not use store procedures.

Conclusion:

The two sections in this article seem to be conflicting with each other. In the first section, I've urged you to push as much processing as possible to the database and in the second section, I'm suggesting you to avoid database.

The difference here is when you connect to database using a Lookup or Source qualifier or pre/post-SQL procedures, you are really connecting to the database only ONCE for all the source records. When you connect to the database using a stored procedure transformation that is called for each record, then you are connecting to the database for as many times as the number of records in the source.

When you have a vehicle (a loading/unloading truck) at your disposal, would you take 100 bricks from one place to the other place at a time or each brick at a time in the vehicle?

Monday, January 5, 2009

Database Indexing for the Lookup SQL - A myth

I have seen people suggesting an index to improve the performance of any SQL. This suggestion is incorrect - many times. Specially when talking about indexing the condition port columns of Lookup SQL, it is far more "incorrect".

Before explaining why it is incorrect, I would try to detail the functionality of Lookup. To explain the stuff with an example, we take the usual HR schema EMP table. I have EMPNO, ENAME, SALARY as columns in EMP table.

Let us say, there is a lookup in ETL mapping that checks for a particular EMPNO and returns ENAME and SALARY from the Lookup. Now, the output ports for the Lookup are "ENAME" and "SALARY". The condition port is "EMPNO". Imagine that you are facing performance problems with this Lookup and one of the suggestion was to index the condition port.

As suggested (incorrectly) you create an index on EMPNO column in the underlying database table. Practically, the SQL the lookup executes is going to be this:

select EMPNO,
ENAME,
SALARY
from EMP
ORDER BY EMPNO,
ENAME,
SALARY;

The data resulted from this query is stored in the Lookup cache and then, each record from the source is looked up against this cache. So, the checking against the condition port column is done in the Informatica Lookup cache and "not in the database". So any index created in the database has no effect for this.

You may be wondering if we can replicate the same indexing here in Lookup Cache. You don't have to worry about it. PowerCenter create "index" cache and "data" cache for the Lookup. In this case, condition port data - "EMPNO" is indexed and hashed in "index" cache and the rest along with EMPNO is found in "data" cache.

I hope now you understand why indexing condition port columns doesn't increase performance.

Having said that, I want to take you to a different kind of lookup, where you would've disabled the caching. In this kind of Lookup, there is no cache. Everytime a row is sent into lookup, the SQL is executed against database. In this scenario, the database index "may" work. But, if the performance of the lookup is a problem, then "cache-less" lookup itself may be a problem.

I would go for cache-less lookup if my source data records is less than the number of records in my lookup table. In this case ONLY, indexing the condition ports will work. Everywhere else, it is just a mere chanse of luck, that makes the database pick up index.

I'm sure you will have many questions regarding Lookups after reading this blog article. I'm ready to answer, fire away.

Thursday, January 1, 2009

Designing the mapping/workflow efficiently - part I

I wanted to write some useful design tips for a mapping and workflow to increase the performance. As PowerCenter is a wide ocean and there can't be any exhaustive list of best practices. To address this, I want to add to this list in phases and that is the reason, this blog article is named as "part - I".

Please note that these are not any rules-of-thumb, but would lead you into a sensible decision while developing interfaces in PowerCenter. Nevertheless, the design decisions are to be driven by the architectural, organizational and business requirements.

I would always suggest you to think twice before using an Update Strategy, though it adds me certain level of flexibility in the mapping. If you have a straight-through mapping which takes data from source and directly inserts all the records into the target, you wouldn’t need an update strategy.
Use a pre-SQL delete statement if you wish to delete specific rows from target before loading into the target. Use truncate option in Session properties, if you wish to clean the table before loading. I would avoid a separate pipe-line in the mapping that roans before the load with update-strategy transformation.
You have 3 sources and 3 targets with one-on-one mapping. If the load is independent according to business requirement, I would create 3 different mappings and 3 different session instances and they all run in parallel in my workflow after my “Start” task. I’ve observed that the workflow runtime comes down between 30-60% of serial processing.
PowerCenter is built to work of high volumes of data. So let the server be completely busy. Induce parallelism as far as possible into the mapping/workflow.
Needless to say, if any transformation waits for complete source data to be arrived in before it can proceed further in the mapping, use a sorter transformation to speed up the process. For eg: an Aggregator transformation, a Joiner transformation etc will perform well if sorted data is given to them and that a sorter transformation at the start will make the efficient.
Push as much filtering as possible into the SQL override of the source-qualifier/lookup to make the database handle most of the filtering as databases are experts in doing that stuff than using a filter transformation.
As I always say, the above task is similar to manual push-down optimization.

As written in the start of this article, we will continue to add to the list. Please make your contributions to this and we shall do our best to to make Informatica world generate efficient code for PowerCenter.

Radhakrishna Sarma

Wednesday, November 11, 2009

Informatica 9 Launch

Wednesday, September 2, 2009

Informatica PowerCenter Repository tables

Wednesday, July 15, 2009

Error Logging in PowerCenter

Wednesday, April 22, 2009

Dynamically generate parameter files

Thursday, April 2, 2009

Designing the mapping/workflow efficiently - part II

Monday, January 5, 2009

Database Indexing for the Lookup SQL - A myth

Thursday, January 1, 2009

Designing the mapping/workflow efficiently - part I

Cricinfo International Scores

Followers

Blog Archive

About Me

Wednesday, November 11, 2009

Wednesday, September 2, 2009

Wednesday, July 15, 2009

Wednesday, April 22, 2009

Thursday, April 2, 2009

Monday, January 5, 2009

Thursday, January 1, 2009

Cricinfo International Scores

Subscribe To

Followers

Blog Archive

About Me