Retrieving MINimum value from a list of columns

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Retrieving MINimum value from a list of columns

Post by ArndW »

I have a job where I have about 20 columns containing timestamp values. I would like to output on column which contains the lowest value from those 20 columns. The only function I could find is the MIN() function, which operates on numbers. If I put the result of a derivation "Min(Min(Min...))) into a stage variable of type timestamp I get a compile conversion error, but if I use a number then I would need to convert it to timestamp, but that wouldn't work.
All I can think of doing now is to create a number of stage variables, each doing a simple compare with a column value and its predecessor but somehow I think that a simple solution is eluding me and I just can't think of one; so any suggestions would be welcome.
It would be nice to be able to use the server function, which can operate on a list, but that isn't an option in this job.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You're not able to employ a BASIC Transformer stage?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

No, not with the large amount of data going through per day; this job needs to remain a PX one and putting basic transform stages in isn't an option in this case.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

In an SMP environment the BASIC Transformer stage is capable of parallel execution. (Of course you still have the overhead of translation to and from the typeless environment.)

Short of a stage-variable-based solution, or writing your own C++ equivalent of the server function, I can't envisage any other way to work this one.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

One of the conditions here is that the finished application be ready for distributed processing, so that precludes BASIC stages. I was hoping that I'd missed some glaringly obvious solution. I've now got an ugly amount of stage variables doing the comparisons and will probably leave that instead of coding.
bandish
Participant
Posts: 41
Joined: Tue Oct 11, 2005 1:30 am

Post by bandish »

We can create a key column for each record if not already present; and use a Pivot stage and create a single column which will create 20 rows/record. After this we can sort on the Key(generated) and Timestamp field and retain the first record using "Remove Duplicates" stage.

Thanks
Bandish
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Hello Bandish - that sounds like it might work, but in this case we have a potential of millions of rows per hour, so doing a sort is not an option for performance and resource reasons.
bandish
Participant
Posts: 41
Joined: Tue Oct 11, 2005 1:30 am

Post by bandish »

Yes, Sorting millions of records would be a performance bottleneck.

Then probably, after pivoting, we can use few (i think 3 or 4) stage variables to get the min TimeStamp as the number of TimeStamp column would be just one instead of 20.
But one doubt I have here: Would pivoted data have all records with the same Key one after the other, in the same node (Else we can't use stage variables for Timestamp comparison)?
As per my understanding it should be.
JeroenDmt
Premium Member
Premium Member
Posts: 107
Joined: Wed Oct 26, 2005 7:36 am

Post by JeroenDmt »

ArndW wrote:Hello Bandish - that sounds like it might work, but in this case we have a potential of millions of rows per hour, so doing a sort is not an option for performance and resource reasons.
I think you don't need to sort on all columns, just on the timestamp column.
If you generate the record-identifier-key, so that is sorted, you can sort on [record-identifier-key] as dont-sort-previously-sorted, and timestamp. Then the sort only needs to sort within each 20 records.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Doing a sort of any type in this specific scenario will be much slower than simple declaring <n>/2 stage variables (where <n> is the number of columns to compare); then setting the result of each to the min of the 2 values, like leaves, then processing each leaf with its neighbour and so on. Not pretty but practicable.
The Server MIN() and MAX() functions can work on lists instead of just 2 values so that would have been a simple programming alernative had performance not been a significant issue.
Post Reply