Issue with sort and group by

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
tej2004
Participant
Posts: 3
Joined: Fri Sep 03, 2004 8:37 am

Issue with sort and group by

Post by tej2004 »

Hi,
i want to create a datastage server job with following functionality.

Source contains columns like key, base amount.
In a lookup transformation i do lookup against key col, i get data of following sort
(key, base amount , PYEAR_NUM , PPERIOD_CD)
1 , 100 , 2003 , 01
2 , 500 , 2003 , 02
3 , 200 , 2003 , 03
4 , 1000 , 2003 , 04
5 , 100 , 2004 , 01
6 , 1000 , 2004 , 02

then i want to sort on PYEAR_NUM PPERIOD_CD
so that i will get data for 2003 in order 04,03,02,01.
and then i want to do calculation on base amounts for PPERIOD_CD 04,03 and 02 for a PYEAR_NUM i.e curr row val + (curr row-1) val + (curr row-2) val

i.e somthing like 1000+200+500 for PYEAR_NUM = 2003
and then this data is stored in tgt.

How can i do that?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Welcome aboard! :D

Obviously, from your description, you've gotten the lookup part fairly well sorted out, and now you seek to take the sum of base amount the three highest PPERIOD_CD values in each particular PYEAR_NUM.

To accomplish this you need to be able to remember the two previous values. If you search the forum, you will find techniques for remembering the previous value, but not the previous two.

It would be difficult to use stage variables to remember two rows back (though not impossible using dynamic array techniques): I believe an approach using a Routine and variables declared to be in a COMMON area of memory would be easier to implement.

All of that would require your data to be sorted - you could employ a Sort stage but these aren't all that fast.

It may, indeed, be easier (and therefore more easily maintained) to insert your unsorted data into a temporary intermediate table and to construct a three-level correlated sub-query to produce the desired results.

Another solution might involve dumping the lookup result into a text file, sorting it, and reading this in three streams, one beginning at row 1, one beginning at row 2 and the third beginning at row 3. (You can read beginning at row N using a filter on the Sequential File stage, for example using tail and wc -l).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
tej2004
Participant
Posts: 3
Joined: Fri Sep 03, 2004 8:37 am

Post by tej2004 »

Hi Ray,
Thanks for quick reply.
But i can not use temporary table option for the access problems that may arise while dropping and creating this tbl with every run, and the sequential file stage option as part of our development restriction. I think i will have to go for the Routine and variable option.But I am considerably new to datastage and i would highly appreciate if u can elaborate on this.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

A Routine of type "transform function" is invoked for every row processed, accepting one or more argument values and returning one value (though this can be a dynamic array, and other values can be returned through system variables).

A simple routine can return the sum of the current and previous two rows.

To do this, it must keep a memory of what was in the previous two rows, which is achieved by declaring the variables to hold these values to be in "COMMON" memory - somewhat akin to static or persistent variables that don't lose their values between invocations.

This example shows the technique without detecting change. You'd invoke the routine passing "base amount" as the single argument.

Code: Select all

FUNCTION SumThree(MyValue)
* Returns the sum of MyValue in this and the previous two rows

* Declare variables to hold two previous row values.
* Fortuitously, COMMON variables are automatically initialized to zero.
COMMON /PrevRows/PrevPrevMyValue, PrevMyValue

* Form the sum
Ans = PrevPrevMyValue + PrevMyValue + MyValue

* Push everything back
PrevPrevMyValue = PrevMyValue
PrevMyValue = MyValue

RETURN(Ans)
A similar approach can be taken to detecting change, and whether three values have yet been seen for the current year. Declare more variables in COMMON and remember the year (and period if you like) from the previous and previous previous rows, and incorporate logic to detect the change and, perhaps, to return 0 if fewer than three periods for the current year have been seen.

Exactly how you go about this depends on whether you've sorted the period in ascending or descending order (as the second level sort after year). If descending order, you've got what you want after the first three periods, and have to return 0 thereafter. I think this is the easier approach.

The invoking Transformer stage can constrain its output on the result of the routine being greater than zero, if this is a sensible thing to do in your design.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
andru
Participant
Posts: 21
Joined: Tue Mar 02, 2004 12:25 am
Location: Chennai

Post by andru »

If your input can be sorted in the desc order of the year, then you can do the same thing which the routine does in the transformer stage itself.

declare three stage variables as below

Derivation Stage Variable
rowproccomparewithpreviousvalue(key) keychg

if keychg then 1 else count+1 count (initial value=0)

if count<=3 then amt+currentamount base amount
else amt +0

you can route these stage variable values to your output link.
Post Reply