Hi All.
I am trying to determine the best method for obtaining the last non-null value for each column within a key (or group of keys). Could not find anything by searching the forum.
Here's a data scenario:
<pre>
ROW#  C1 (KEY)  C2      C3      C4
----  --------  ------  ------  ------
1     1000      ACME    &lt;NULL&gt;  999
2     1000      &lt;NULL&gt;  RED     111
3     1000      ZEKE    BLUE    &lt;NULL&gt;
4     1000      PLUTO   &lt;NULL&gt;  &lt;NULL&gt;
5     1000      &lt;NULL&gt;  &lt;NULL&gt;  555
</pre>
DESIRED RESULT:
<pre>
ROW#  C1 (KEY)  C2      C3      C4
----  --------  ------  ------  ------
1     1000      PLUTO   BLUE    555
</pre>
The most obvious place for this in the current product would seem to be a "Last Non-Null" and/or "First Non-Null" operation in the Aggregator stage, since it operates per column / per key, not per record / per key. But no such operation exists.
Remove Duplicates - doesn't work because it accepts or rejects an entire record.
Aggregator - doesn't have a "Last Non-Null" or equivalent. MIN and MAX will not work because the columns are not sorted individually.
Sort/Unique - like Remove Duplicates, accepts/rejects an entire record.
Join/Merge with CASE statements to choose columns will not work since, among other reasons, the number of records per key is unknown.
A custom solution using stage variables for every column is a potential solution, but there are many jobs and many columns. I am hoping there is a more efficient, more "out of the box" approach.
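To make the desired semantics concrete, here is a minimal sketch (hypothetical Python, not a DataStage feature) of "last non-null value per column within a key", where later non-null values simply overwrite earlier ones:

```python
def last_non_null_per_column(rows, key):
    """Collapse rows to one record per key value, keeping the last
    non-null value seen for every column."""
    result = {}  # key value -> collapsed record
    for row in rows:
        rec = result.setdefault(row[key], {key: row[key]})
        for col, val in row.items():
            if val is not None:
                rec[col] = val  # later non-null values overwrite earlier ones
    return list(result.values())

# The sample data from the scenario above (None standing in for <NULL>):
rows = [
    {"C1": 1000, "C2": "ACME",  "C3": None,   "C4": 999},
    {"C1": 1000, "C2": None,    "C3": "RED",  "C4": 111},
    {"C1": 1000, "C2": "ZEKE",  "C3": "BLUE", "C4": None},
    {"C1": 1000, "C2": "PLUTO", "C3": None,   "C4": None},
    {"C1": 1000, "C2": None,    "C3": None,   "C4": 555},
]
collapsed = last_non_null_per_column(rows, "C1")
# collapsed == [{"C1": 1000, "C2": "PLUTO", "C3": "BLUE", "C4": 555}]
```

This matches the desired result table: PLUTO, BLUE, 555 for key 1000.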
Appreciate any assistance!!!
Obtaining Last Non-Null Value Per Column
If you have it available, QualityStage has a survivorship feature that allows you to perform this type of operation.
Survivorship lets you create rules for choosing the best-quality data for each field across a group of records. In your case I suspect it would be quite easy to achieve 'retain last non-null value' on a per-field basis across your fields.
I suspect the alternative in DataStage will involve using a Sort stage (set to "Don't sort, previously sorted") to generate a group-change field, followed by a Transformer that carries non-null values forward into the following null-valued records. The last record of each group should then hold the information you desire; get hold of it via a Remove Duplicates stage set to Retain Last Record.
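The sort / carry-forward / retain-last design described above can be sketched in Python (a hypothetical illustration of the logic, not DataStage code). It assumes the input arrives already sorted on the key, as the Sort stage would guarantee:

```python
from itertools import groupby

def carry_forward_keep_last(sorted_rows, key):
    """Mimic the suggested job design: within each key group, walk the
    rows in order, carrying the last non-null value forward into each
    column (the Transformer with stage variables), then keep only the
    final row of the group (Remove Duplicates, Retain Last Record)."""
    out = []
    for _, group in groupby(sorted_rows, key=lambda r: r[key]):
        carried = {}
        for row in group:
            # Carry forward: a null column inherits the previous non-null value.
            carried = {c: (v if v is not None else carried.get(c))
                       for c, v in row.items()}
        out.append(carried)  # the last record of the group holds the survivors
    return out

# Sample data from the scenario, pre-sorted on the key C1:
rows = [
    {"C1": 1000, "C2": "ACME",  "C3": None,   "C4": 999},
    {"C1": 1000, "C2": None,    "C3": "RED",  "C4": 111},
    {"C1": 1000, "C2": "ZEKE",  "C3": "BLUE", "C4": None},
    {"C1": 1000, "C2": "PLUTO", "C3": None,   "C4": None},
    {"C1": 1000, "C2": None,    "C3": None,   "C4": 555},
]
survivors = carry_forward_keep_last(rows, "C1")
# survivors == [{"C1": 1000, "C2": "PLUTO", "C3": "BLUE", "C4": 555}]
```

In a parallel job you would hash-partition and sort on the key so each group lands on one partition, which is what makes the per-group carry-forward valid.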
QualityStage might not be an option, but I'll look into it for sure.
It looks like I'm stuck with the good ol' Transformer and stage variables as suspected. The Sort / Group Change trick should save some transformer logic.
The Server version of the Aggregator has First and Last operators - not sure how they would handle NULLs, and it would break the parallelism.
Thanks for the tips!
ray.wurlod wrote: In parallel jobs the Remove Duplicates stage has a First or Last option, and does not "break parallelism". ...
True, but as mentioned above, it accepts/rejects an entire record and does not examine each column to give a LAST NON-NULL value per column. I wish it did - would make life easier for me.