Obtaining Last Non-Null Value Per Column

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
chriscamp
Participant
Posts: 3
Joined: Mon Apr 12, 2004 3:38 pm

Obtaining Last Non-Null Value Per Column

Post by chriscamp »

Hi All.

I am trying to determine the best method for obtaining the last non-null value for each column within a key (or group of keys). Could not find anything by searching the forum.

Here's a data scenario:
<pre>
ROW# C1 (KEY) C2 C3 C4
---- -------- ------ ------ ------
1 1000 ACME <NULL> 999
2 1000 <NULL> RED 111
3 1000 ZEKE BLUE <NULL>
4 1000 PLUTO <NULL> <NULL>
5 1000 <NULL> <NULL> 555


</pre>

DESIRED RESULT:
<pre>
ROW# C1 (KEY) C2 C3 C4
---- -------- ------ ------ ------
1 1000 PLUTO BLUE 555


</pre>

The most obvious place in the current product would seem to be a "Last Not Null" and/or "First Not Null' operation to the Aggregator stage since it operates per column / per key, not per record / per key. But it doesn't exist.

Remove Duplicates - doesn't appear to work because it accepts or rejects an entire record

Aggregator - doesn't have a "Last non-null" or equivalent. MIN and MAX will not work because the columns are not sorted individually

Sort/Unique - like Remove Duplicates, accepts/rejects entire record

Join/Merge with "case" statements to choose columns will not work since the number of records per key is unknown amongst other reasons.

A custom solution using stage variables for every column is a potential solution, but there are many jobs and many columns. I am hoping there is a more efficient, more "out of the box" approach.


Appreciate any assistance!!!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Transformer stage with a complex set of If..Then..Else tests.

Database query (if available) using COALESCE function.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
OddJob
Participant
Posts: 163
Joined: Tue Feb 28, 2006 5:00 am
Location: Sheffield, UK

Post by OddJob »

If you have it available, Quality Stage has a survivorship feature that allows you to perform this type of operation.

Survivorship lets you create rules for choosing the best quality data for each field across a group of records. In your case I suspect it would be quite easy achieve 'retain last non-null value' on a per field basis across your fields.

I suspect the alternatve in DataStage will involve using a sort (Don't sort, previously sorted) to get a group change field, followed by a transformer that carries forward non-null values into the following null value recs. The last record of a group should then hold the information you desire, getting hold of this via a Remove Duplicates stage set to Retain Last Record.
chriscamp
Participant
Posts: 3
Joined: Mon Apr 12, 2004 3:38 pm

Post by chriscamp »

QualityStage might not be an option, but I'll look into it for sure.

It looks like I'm stuck with the good ol' Transformer and stage variables as suspected. The Sort / Group Change trick should save some transformer logic.

The Server version of Aggregator has a First and Last operator - not sure how it would work on Null and it would break the parallelism.

Thanks for the tips!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

In parallel jobs the Remove Duplicates stage has a First or Last option, and does not "break parallelism".
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chriscamp
Participant
Posts: 3
Joined: Mon Apr 12, 2004 3:38 pm

Post by chriscamp »

ray.wurlod wrote:In parallel jobs the Remove Duplicates stage has a First or Last option, and does not "break parallelism". ...
True, but as mentioned above, it accepts/rejects an entire record and does not examine each column to give a LAST NON-NULL value per column. I wish it did - would make life easier for me ;)
richdhan
Premium Member
Premium Member
Posts: 364
Joined: Thu Feb 12, 2004 12:24 am

Post by richdhan »

Hi,

A Transformer stage with 4 stage variables for C1, C2, C3, C4 should do it.

HTH
--Rich
Post Reply