Hi All.
I am trying to determine the best method for obtaining the last non-null value for each column within a key (or group of keys). Could not find anything by searching the forum.
Here's a data scenario:
<pre>
ROW#  C1 (KEY)  C2      C3      C4
----  --------  ------  ------  ------
1     1000      ACME    &lt;NULL&gt;  999
2     1000      &lt;NULL&gt;  RED     111
3     1000      ZEKE    BLUE    &lt;NULL&gt;
4     1000      PLUTO   &lt;NULL&gt;  &lt;NULL&gt;
5     1000      &lt;NULL&gt;  &lt;NULL&gt;  555
</pre>
DESIRED RESULT:
<pre>
ROW#  C1 (KEY)  C2      C3      C4
----  --------  ------  ------  ------
1     1000      PLUTO   BLUE    555
</pre>
The most obvious place for this in the current product would seem to be a "Last Non-Null" and/or "First Non-Null" operation in the Aggregator stage, since it operates per column / per key, not per record / per key. But no such operation exists.
Remove Duplicates - doesn't work because it accepts or rejects an entire record.
Aggregator - doesn't have a "Last Non-Null" or equivalent. MIN and MAX will not work because the columns are not sorted individually.
Sort/Unique - like Remove Duplicates, accepts/rejects an entire record.
Join/Merge with CASE statements to choose columns will not work since, among other reasons, the number of records per key is unknown.
A custom solution using stage variables for every column is a potential solution, but there are many jobs and many columns. I am hoping there is a more efficient, more "out of the box" approach.
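To make the desired semantics concrete, here is a minimal sketch (hypothetical Python, not a DataStage feature) of "last non-null value per column within a key", where later non-null values simply overwrite earlier ones:

```python
def last_non_null_per_column(rows, key):
    """Collapse rows to one record per key value, keeping the last
    non-null value seen for every column."""
    result = {}  # key value -> collapsed record
    for row in rows:
        rec = result.setdefault(row[key], {key: row[key]})
        for col, val in row.items():
            if val is not None:
                rec[col] = val  # later non-null values overwrite earlier ones
    return list(result.values())

# The sample data from the scenario above (None standing in for <NULL>):
rows = [
    {"C1": 1000, "C2": "ACME",  "C3": None,   "C4": 999},
    {"C1": 1000, "C2": None,    "C3": "RED",  "C4": 111},
    {"C1": 1000, "C2": "ZEKE",  "C3": "BLUE", "C4": None},
    {"C1": 1000, "C2": "PLUTO", "C3": None,   "C4": None},
    {"C1": 1000, "C2": None,    "C3": None,   "C4": 555},
]
collapsed = last_non_null_per_column(rows, "C1")
# collapsed == [{"C1": 1000, "C2": "PLUTO", "C3": "BLUE", "C4": 555}]
```

This matches the desired result table: PLUTO, BLUE, 555 for key 1000.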
Appreciate any assistance!!!
Obtaining Last Non-Null Value Per Column
If you have it available, QualityStage has a survivorship feature that allows you to perform this type of operation.
Survivorship lets you create rules for choosing the best-quality data for each field across a group of records. In your case I suspect it would be quite easy to achieve 'retain last non-null value' on a per-field basis across your fields.
I suspect the alternative in DataStage will involve using a Sort stage (set to "Don't sort, previously sorted") to generate a group-change field, followed by a Transformer that carries non-null values forward into the following null-valued records. The last record of each group should then hold the information you desire; get hold of it via a Remove Duplicates stage set to Retain Last Record.
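The sort / carry-forward / retain-last design described above can be sketched in Python (a hypothetical illustration of the logic, not DataStage code). It assumes the input arrives already sorted on the key, as the Sort stage would guarantee:

```python
from itertools import groupby

def carry_forward_keep_last(sorted_rows, key):
    """Mimic the suggested job design: within each key group, walk the
    rows in order, carrying the last non-null value forward into each
    column (the Transformer with stage variables), then keep only the
    final row of the group (Remove Duplicates, Retain Last Record)."""
    out = []
    for _, group in groupby(sorted_rows, key=lambda r: r[key]):
        carried = {}
        for row in group:
            # Carry forward: a null column inherits the previous non-null value.
            carried = {c: (v if v is not None else carried.get(c))
                       for c, v in row.items()}
        out.append(carried)  # the last record of the group holds the survivors
    return out

# Sample data from the scenario, pre-sorted on the key C1:
rows = [
    {"C1": 1000, "C2": "ACME",  "C3": None,   "C4": 999},
    {"C1": 1000, "C2": None,    "C3": "RED",  "C4": 111},
    {"C1": 1000, "C2": "ZEKE",  "C3": "BLUE", "C4": None},
    {"C1": 1000, "C2": "PLUTO", "C3": None,   "C4": None},
    {"C1": 1000, "C2": None,    "C3": None,   "C4": 555},
]
survivors = carry_forward_keep_last(rows, "C1")
# survivors == [{"C1": 1000, "C2": "PLUTO", "C3": "BLUE", "C4": 555}]
```

In a parallel job you would hash-partition and sort on the key so each group lands on one partition, which is what makes the per-group carry-forward valid.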
QualityStage might not be an option, but I'll look into it for sure.
It looks like I'm stuck with the good ol' Transformer and stage variables as suspected. The Sort / Group Change trick should save some transformer logic.
The Server version of the Aggregator has First and Last operators - not sure how they would handle NULLs, and it would break the parallelism.
Thanks for the tips!
ray.wurlod wrote: In parallel jobs the Remove Duplicates stage has a First or Last option, and does not "break parallelism". ...
True, but as mentioned above, it accepts/rejects an entire record and does not examine each column to give a LAST NON-NULL value per column. I wish it did - would make life easier for me.