Retrieving MINimum value from a list of columns

ArndW · Post by **ArndW** » Wed Aug 08, 2007 7:39 pm

I have a job with about 20 columns containing timestamp values. I would like to output one column that contains the lowest value from those 20 columns. The only function I could find is the MIN() function, which operates on numbers. If I put the result of a derivation "Min(Min(Min(...)))" into a stage variable of type timestamp I get a compile-time conversion error; if I use a numeric stage variable instead, I would then need to convert the result back to a timestamp, and that wouldn't work.
All I can think of doing now is to create a number of stage variables, each doing a simple comparison between a column value and its predecessor. Somehow I think that a simpler solution is eluding me, so any suggestions would be welcome.
It would be nice to be able to use the server function, which can operate on a list, but that isn't an option in this job.

ray.wurlod · Post by **ray.wurlod** » Wed Aug 08, 2007 8:20 pm

You're not able to employ a BASIC Transformer stage?

ArndW · Post by **ArndW** » Wed Aug 08, 2007 8:53 pm

No, not with the large amount of data going through per day; this job needs to remain a PX one, and putting BASIC Transformer stages in isn't an option in this case.

ray.wurlod · Post by **ray.wurlod** » Wed Aug 08, 2007 8:58 pm

In an SMP environment the BASIC Transformer stage is capable of parallel execution. (Of course you still have the overhead of translation to and from the typeless environment.)

Short of a stage-variable-based solution, or writing your own C++ equivalent of the server function, I can't envisage any other way to work this one.

ArndW · Post by **ArndW** » Wed Aug 08, 2007 9:04 pm

One of the conditions here is that the finished application be ready for distributed processing, so that precludes BASIC stages. I was hoping that I'd missed some glaringly obvious solution. I've now got an ugly number of stage variables doing the comparisons and will probably leave it at that instead of coding a routine.

bandish · Post by **bandish** » Fri Aug 10, 2007 2:22 am

We can create a key column for each record, if one is not already present, and use a Pivot stage to turn the 20 timestamp columns into a single column, which gives 20 rows per record. After this we can sort on the generated key and the timestamp field and retain the first record per key using a "Remove Duplicates" stage.

Thanks
Bandish
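
The pivot, sort, and keep-first steps above can be sketched outside DataStage. This is only a Python illustration of the logic, not DataStage code: the function name `min_by_pivot_and_sort` is hypothetical, and it assumes the timestamps are non-null strings in `YYYY-MM-DD HH:MM:SS` form, so that plain string comparison matches chronological order.

```python
from itertools import groupby


def min_by_pivot_and_sort(records):
    """Return the lowest timestamp of each record.

    records: list of rows, each row a list of timestamp strings
    (assumed format YYYY-MM-DD HH:MM:SS, no nulls).
    """
    # Generate a key per record and pivot: one (key, timestamp) row per column.
    pivoted = [(key, ts) for key, row in enumerate(records) for ts in row]

    # Sort on the generated key, then on the timestamp.
    pivoted.sort()

    # Keep only the first row of each key group, which is what a
    # "retain first" duplicate removal on the key would do.
    return [next(group)[1] for _, group in groupby(pivoted, key=lambda r: r[0])]
```

The sort over every pivoted row is the expensive part of this approach: each input record becomes 20 rows before sorting.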

ArndW · Post by **ArndW** » Fri Aug 10, 2007 3:22 am

Hello Bandish - that sounds like it might work, but in this case we have a potential of millions of rows per hour, so doing a sort is not an option for performance and resource reasons.

bandish · Post by **bandish** » Fri Aug 10, 2007 6:54 am

Yes, sorting millions of records would be a performance bottleneck.

Then, after pivoting, we can probably use a few (I think 3 or 4) stage variables to get the minimum timestamp, since there would be just one timestamp column instead of 20.
But I have one doubt here: would the pivoted data have all records with the same key one after the other, on the same node? (Otherwise we can't use stage variables for the timestamp comparison.)
As I understand it, they should be.
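
A rough Python sketch of the stage-variable idea after pivoting, carrying a "previous key" and a "current minimum" from row to row. It is not DataStage code, and the name `running_min_per_key` is hypothetical. It assumes exactly what the question above asks about: that all rows for one key arrive consecutively in the same partition. It also assumes non-null timestamp strings in `YYYY-MM-DD HH:MM:SS` form.

```python
def running_min_per_key(pivoted_rows):
    """Yield (key, lowest timestamp) for each run of consecutive rows sharing a key.

    pivoted_rows: iterable of (key, timestamp_string) pairs, with rows for the
    same key adjacent to each other (assumption, see the text above).
    """
    prev_key = None       # plays the role of a "previous key" stage variable
    current_min = None    # plays the role of a "running minimum" stage variable
    have_group = False    # False until the first row has been seen

    for key, ts in pivoted_rows:
        if have_group and key != prev_key:
            # Key changed: the previous group is complete, so emit its minimum.
            yield prev_key, current_min
            current_min = ts
        elif not have_group or ts < current_min:
            # First row overall, or a new lowest value within the same group.
            current_min = ts
        prev_key = key
        have_group = True

    if have_group:
        # Emit the final group, which has no following key change.
        yield prev_key, current_min
```

If rows for one key were split across partitions or interleaved with other keys, this would emit several partial minimums per key, which is why the adjacency question matters.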

JeroenDmt · Post by **JeroenDmt** » Fri Aug 10, 2007 7:22 am

ArndW wrote: Hello Bandish - that sounds like it might work, but in this case we have a potential of millions of rows per hour, so doing a sort is not an option for performance and resource reasons.

I think you don't need to sort on all columns, just on the timestamp column.
If you generate the record-identifier key so that it is already in sorted order, you can sort on [record-identifier-key] with the key mode set to "Don't Sort (Previously Sorted)", plus the timestamp. Then the sort only needs to order the rows within each group of 20 records.

ArndW · Post by **ArndW** » Fri Aug 10, 2007 9:38 pm

Doing a sort of any type in this specific scenario will be much slower than simply declaring <n>/2 stage variables (where <n> is the number of columns to compare), setting each one to the minimum of 2 values, like leaves of a tree, then comparing each leaf with its neighbour, and so on. Not pretty but practicable.
The Server MIN() and MAX() functions can work on lists instead of just 2 values, so that would have been a simple programming alternative had performance not been a significant issue.
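
The pairwise "leaf and neighbour" scheme can be sketched as below. This is a Python illustration of the comparison pattern only, not Transformer derivation syntax; `tournament_min` is a hypothetical name, and the values are assumed to be non-null timestamp strings in `YYYY-MM-DD HH:MM:SS` form so that string comparison is chronological. Each entry in `next_level` corresponds to one stage variable holding the lower of two inputs.

```python
def tournament_min(values):
    """Return (lowest value, number of pairwise comparisons performed)."""
    if not values:
        raise ValueError("need at least one value")

    level = list(values)
    comparisons = 0

    while len(level) > 1:
        next_level = []
        # Compare neighbours in pairs; each result is one "stage variable".
        for i in range(0, len(level) - 1, 2):
            a, b = level[i], level[i + 1]
            next_level.append(a if a <= b else b)
            comparisons += 1
        # An unpaired last value is carried up to the next level unchanged.
        if len(level) % 2 == 1:
            next_level.append(level[-1])
        level = next_level

    return level[0], comparisons
```

For 20 columns the first level takes 10 comparisons, and the whole tree takes 19 (one fewer than the number of columns), the same count as a simple chain of comparisons. The tree layout changes how the stage variables are organised, not how many comparisons are made.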