how to implement oracle lead function in datastage

PBalamurugan · Post by **PBalamurugan** » Fri May 30, 2008 5:35 am

Hi,

Trying to implement following scenario in my DS job:

Field1 Date1 Date2
------ ------ -------
F1 D11 NULL
F1 D12 D11
F1 D13 D12
F1 D14 D13
F2 D21 NULL
F2 D22 D21
F3 D31 NULL

For given Field1, the Date2 is to be populated based on Date1 field.
I have more than 13million records in my source. I'm joining this source table with 6 dimension tables(some of them have 3million records).

For above requirement, I have used oracle join and lead function. The oracle join is OK interms of performance, but lead function taking longer time.

Please advice if any one has idea to improve the performance.

Is there any DS functionality (if any) to solve this problem?

Thanks

chulett · Post by **chulett** » Fri May 30, 2008 5:52 am

You haven't explained what your output needs to look like nor what the 'lead' function does. Not everyone here is fluent in Oracle and your 'problem' is unclear.

DS_SUPPORT · Post by **DS_SUPPORT** » Fri May 30, 2008 5:54 am

Could You please expalin, how you want your output? You can always do it by using Stage Variables. For ex, if the following is your requirement, i will order by salary in descending and do the calculations in the stage variables

Code: Select all

The following example uses the LEAD function to return the salary from the next row, and to calulate the difference between the salary of the current row and the following row.

SELECT empno,
       ename,
       job,
       sal,
       LEAD(sal, 1, 0) OVER (ORDER BY sal) AS sal_next,
       LEAD(sal, 1, 0) OVER (ORDER BY sal) - sal AS sal_diff
FROM   emp;

     EMPNO ENAME      JOB              SAL   SAL_NEXT   SAL_DIFF
---------- ---------- --------- ---------- ---------- ----------
      7369 SMITH      CLERK            800        950        150
      7900 JAMES      CLERK            950       1100        150
      7876 ADAMS      CLERK           1100       1250        150
      7521 WARD       SALESMAN        1250       1250          0
      7654 MARTIN     SALESMAN        1250       1300         50
      7934 MILLER     CLERK           1300       1500        200
      7844 TURNER     SALESMAN        1500       1600        100
      7499 ALLEN      SALESMAN        1600       2450        850
      7782 CLARK      MANAGER         2450       2850        400
      7698 BLAKE      MANAGER         2850       2975        125
      7566 JONES      MANAGER         2975       3000         25
      7788 SCOTT      ANALYST         3000       3000          0
      7902 FORD       ANALYST         3000       5000       2000
      7839 KING       PRESIDENT       5000          0      -5000

PBalamurugan · Post by **PBalamurugan** » Fri May 30, 2008 7:17 pm

Hi DS_SUPPORT,

Our requirement is exactly similar to SAL_NEXT in your example.

Please advice whether the usage of stage variable will affect the performance, because it sounds like the system will process all records in sequence mode (I may be wrong as I have only few weeks experience in DS). Please clarify. My source has more than 13 million records.

Thanks for support.

chulett · Post by **chulett** » Fri May 30, 2008 7:40 pm

'Exactly similar'?

Please take a moment and show us, based on the input data you posted earlier, what your output data needs to look like after the transformations are performed.

PBalamurugan · Post by **PBalamurugan** » Fri May 30, 2008 8:21 pm

Hi Chulett,

In the below example, Date2 field is calculated as follows:

Select F1,Date1,
lead(Date1,1) over(partition by Field1 order by Date1) Date2

This query will give values as:

Field1 Date1 Date2
------ ------ -------
F1 D11 NULL
F1 D12 D11
F1 D13 D12
F1 D14 D13
F2 D21 NULL
F2 D22 D21
F3 D31 NULL

Basically the field "Date2" is derived from "Field1" and previous value of "Date1".

My doubt is that if the stage variable is used, how Datastage will process the records.... is it in the row level (row by row) or table level (all rows rows in parallel) ?

Pls let me know if you have any queries.

Thanks. [/list]

ray.wurlod · Post by **ray.wurlod** » Sat May 31, 2008 2:07 am

DataStage does not really have a lookahead capability, but you can "remember" the values from the prior row using stage variables.

This is done row by row.

chulett · Post by **chulett** » Sat May 31, 2008 6:25 am

Ok, so actually no input examples, just output with the lead function. Fine. This is just bog standard 'group change detection' done with stage variables. As noted, you can't look ahead but you can easily look 'behind' and stash values from previous records.

So you'll need variables to save the current 'group' which is your Field1 and for the previous value of Date1. When the group changes, set Date2 to null otherwise while you are in the same group use the saved value of Date1 from the previous record for Date2.

It is important to note that this is possible because stage variables are evaluated in order and row by row. This is key, especially the former. And IF the 'lead' function in your source query is indeed the source of your slowness, this will alleviate that.