Wednesday, April 22, 2009

Dynamically generate parameter files

I've seen this covered in many places, but I want to share the approach I have been using for a few years now. I'll set up the context in some detail so that everyone can follow it.

1. PowerCenter objects – Introduction:

• A repository is the highest physical entity of a project in PowerCenter.


• A folder is a logical entity in a PowerCenter project. For example, Customer_Data is a folder.


• A workflow is analogous to a set of programs in any other programming language.


• A mapping is a single program unit that holds the logical mapping between source and target, with the required transformations. A mapping only says that a source table named EMP exists with some structure and that a target flat file named EMP_FF exists with some structure. The mapping doesn't say in which schema this EMP table exists or in which physical location this EMP_FF file is going to be stored.


• A session is the physical representation of a mapping. The session defines what the mapping didn't: where this EMP table comes from, in which schema it resides, and with what username and password we can access it. It also tells us about the target flat file, i.e. in which physical location the file is going to be created.


• A transformation is a sub-program that performs a specific task on the input it gets and returns some output. It can be thought of as a stored procedure in a database. Typical examples of transformations are Filter, Lookup, Aggregator and Sorter.


• A set of reusable transformations can be built into something called a mapplet. A mapplet is a set of transformations aligned in a specific order of execution.

As with any other tool or programming language, PowerCenter also allows parameters to be passed to build flexibility into the flow. Parameters are passed to PowerCenter as data in a flat file, and that file is called the parameter file.

2. Parameter file format for PowerCenter:

For a workflow-level parameter, which can be used by any session in the workflow, the parameter file has to be created in the format below.

[Folder_name.WF:Workflow_Name]
$$parameter_name1=value
$$parameter_name2=value

For a session-level parameter, which can be used only by that particular session, the parameter file has to be created in the format below.

[Folder_name.WF:Workflow_Name.ST:Session_Name]
$$parameter_name1=value
$$parameter_name2=value
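
For example, a parameter file for a hypothetical folder Customer_Data, workflow wf_load_customer and session s_m_load_customer could look like this (all names and values are purely illustrative):

[Customer_Data.WF:wf_load_customer]
$$LOAD_DATE=2009-04-22
[Customer_Data.WF:wf_load_customer.ST:s_m_load_customer]
$$TARGET_FILE_DIR=/data/target/customer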


3. Parameter handling in a data model:

• To have flexibility in maintaining the parameter files,


• to reduce the overhead on the support team of editing the parameter file every time a parameter value changes, and


• to ease deployment,

all the parameters are maintained in Oracle (or any other database) tables, and a PowerCenter session is created to generate the parameter file in the required format automatically.


For this, 4 tables are to be created in the database:

1. FOLDER table will have entries for each folder.


2. WORKFLOWS table will have one entry for each workflow, with a reference to the FOLDER table to indicate which folder the workflow is created in.


3. PARAMETERS table will hold all the parameter names irrespective of folder/workflow.


4. PARAMETER_VALUES table will hold the parameter values of each session, with references to the PARAMETERS table for the parameter name and the WORKFLOWS table for the workflow name. When the session name is NULL, the parameter is a workflow-level parameter that can be used across all the sessions in the workflow.

Because the PARAMETER_VALUES table holds only the ID columns of the workflow and the parameter, we create a view that resolves all the names for us in the required parameter file format. Below is the DDL for the view.


a. Parameter file view:


CREATE OR REPLACE VIEW PARAMETER_FILE
(
HEADER,
DETAIL
)
AS
-- Workflow-level parameters (no session name)
select '[' || fol.folder_name || '.WF:' || wfw.workflow_name || ']' header
      ,pmr.parameter_name || nvl2(dtl.logical_name, '_' || dtl.logical_name, NULL) || '='
       || dtl.value detail
from   folder fol
      ,parameters pmr
      ,workflows wfw
      ,parameter_values dtl
where  fol.id = wfw.folder_id
and    dtl.pmr_id = pmr.id
and    dtl.wf_id = wfw.id
and    dtl.session_name is null
UNION
-- Session-level parameters (session name present)
select '[' || fol.folder_name || '.WF:' || wfw.workflow_name || '.ST:' || dtl.session_name || ']' header
      ,decode(dtl.mapplet_name, NULL, NULL, dtl.mapplet_name || '.')
       || pmr.parameter_name || nvl2(dtl.logical_name, '_' || dtl.logical_name, NULL) || '=' || dtl.value detail
from   folder fol
      ,parameters pmr
      ,workflows wfw
      ,parameter_values dtl
where  fol.id = wfw.folder_id
and    dtl.pmr_id = pmr.id
and    dtl.wf_id = wfw.id
and    dtl.session_name is not null
/
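
With data in the four tables described below, a query against this view returns rows that are already formatted as parameter file lines. For example, with purely hypothetical names:

HEADER                                                     DETAIL
[Customer_Data.WF:wf_load_customer]                        $$SOURCE_FX=FX_DB
[Customer_Data.WF:wf_load_customer.ST:s_m_load_customer]   $$SOURCE_TRANS=TRANS_DB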


b. FOLDER table

ID (NUMBER)
FOLDER_NAME (VARCHAR2(50))
DESCRIPTION (VARCHAR2(50))


c. WORKFLOWS table

ID (NUMBER)
WORKFLOW_NAME (VARCHAR2(50))
FOLDER_ID (NUMBER) Foreign Key to FOLDER.ID
DESCRIPTION (VARCHAR2(50))


d. PARAMETERS table

ID (NUMBER)
PARAMETER_NAME (VARCHAR2(50))
DESCRIPTION (VARCHAR2(50))


e. PARAMETER_VALUES table

ID (NUMBER)
WF_ID (NUMBER)
PMR_ID (NUMBER)
LOGICAL_NAME (VARCHAR2(50))
VALUE (VARCHAR2(50))
SESSION_NAME (VARCHAR2(50))
MAPPLET_NAME (VARCHAR2(50))
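
Translating the column listings above into Oracle DDL, a minimal sketch could look like this (data types, lengths and constraints are assumptions; adjust them to your own standards):

CREATE TABLE folder
( id            NUMBER        PRIMARY KEY
, folder_name   VARCHAR2(50)
, description   VARCHAR2(50)
);

CREATE TABLE workflows
( id            NUMBER        PRIMARY KEY
, workflow_name VARCHAR2(50)
, folder_id     NUMBER        REFERENCES folder (id)
, description   VARCHAR2(50)
);

CREATE TABLE parameters
( id             NUMBER       PRIMARY KEY
, parameter_name VARCHAR2(50)
, description    VARCHAR2(50)
);

CREATE TABLE parameter_values
( id            NUMBER        PRIMARY KEY
, wf_id         NUMBER        REFERENCES workflows (id)
, pmr_id        NUMBER        REFERENCES parameters (id)
, logical_name  VARCHAR2(50)
, value         VARCHAR2(50)
, session_name  VARCHAR2(50)
, mapplet_name  VARCHAR2(50)  -- used by the view for mapplet-qualified parameters
);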

• LOGICAL_NAME is a normalization step in the above parameter model. For example, if a mapping needs $$SOURCE_FX as one mapping parameter and $$SOURCE_TRANS as another, then instead of creating 2 different parameters in the PARAMETERS table we create one parameter, $$SOURCE; FX and TRANS then become two LOGICAL_NAME records in the PARAMETER_VALUES table (sample rows are sketched below).

• m_PARAMETER_FILE is the mapping that creates the parameter file in the desired format and the corresponding session name is s_m_PARAMETER_FILE.
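
To make the model concrete, here is a minimal sketch with made-up folder, workflow, session and parameter names, showing how rows in these tables become parameter file lines, together with the kind of query the source qualifier of s_m_PARAMETER_FILE might run (both are illustrative, not the exact implementation):

insert into folder (id, folder_name, description)
values (1, 'Customer_Data', 'Customer data folder');

insert into workflows (id, workflow_name, folder_id, description)
values (10, 'wf_load_customer', 1, 'Customer load workflow');

insert into parameters (id, parameter_name, description)
values (100, '$$SOURCE', 'Source connection/schema');

-- One $$SOURCE parameter, two logical names: FX at workflow level, TRANS at session level
insert into parameter_values (id, wf_id, pmr_id, logical_name, value, session_name)
values (1000, 10, 100, 'FX', 'FX_DB', NULL);

insert into parameter_values (id, wf_id, pmr_id, logical_name, value, session_name)
values (1001, 10, 100, 'TRANS', 'TRANS_DB', 's_m_load_customer');

-- The source qualifier of s_m_PARAMETER_FILE can then read the formatted lines:
select header, detail
from   parameter_file
order  by header
/

With these rows, the view yields $$SOURCE_FX=FX_DB under the workflow header and $$SOURCE_TRANS=TRANS_DB under the session header, which m_PARAMETER_FILE only has to write out to the flat file.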

Thursday, April 2, 2009

Designing the mapping/workflow efficiently - part II

In a few of my other blog articles, I've mentioned pushdown optimization. In simple terms, try to push as much processing as possible into the source or target database, depending on the need. I'll describe a few instances where we can do so.

Using analytics

Instead of using an Aggregator transformation, or a combination of Sorter/Expression/Filter transformations on the occasions that demand them, we can make use of the analytic functions of the database. For example, while reading your source data from a table, you may need to identify the last record from the source. For this requirement, the first option that comes to mind is something like this:

One source pipeline picks up the data records while a second pipeline counts the number of source records, and eventually you compare that count with the current record number. When they are equal, the current record is the last record from the source.

As I said earlier, people resort to a count(*) with GROUP BY in SQL, or an Aggregator/Sorter/Expression/Filter combination, for this. Instead, the analytic functions give you greater flexibility. The same task can be achieved with the SQL below.

select col1,
       col2,
       col3,
       lead (col1) over (order by input_sort_criteria) next_record
from   table_name
/

For the last record under the given "input_sort_criteria", next_record will be NULL.
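
If you need the indicator as an explicit flag to act upon later in the mapping, the same idea can be wrapped in a CASE expression; this is only a sketch with assumed column names:

select col1,
       col2,
       case
         when lead (col1) over (order by input_sort_criteria) is null then 'Y'
         else 'N'
       end last_record_flag
from   table_name
/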


Avoid Stored Procedures

I have seen many mappings with unnecessary usage of Stored Procedure transformations. Stored procedures called row by row are performance killers. If run for every record in the source, stored procedures consume much of the workflow run-time: a database connection has to be established for each record, and then the execution takes place in the database. As you can see, this behaviour leads to very poorly performing interfaces.

A general guideline is that when you can do it in PowerCenter using transformations, do not use stored procedures.


Conclusion:

The two sections in this article may seem to conflict with each other. In the first section, I urged you to push as much processing as possible into the database, and in the second section, I'm suggesting that you avoid the database.

The difference is that when you connect to the database using a Lookup, a Source Qualifier, or pre/post-session SQL, you are connecting to the database only ONCE for all the source records. When you connect to the database using a Stored Procedure transformation that is called for each record, you are connecting to the database as many times as there are records in the source.

When you have a vehicle (a loading/unloading truck) at your disposal, would you take 100 bricks from one place to the other in a single trip, or carry one brick per trip?
 