Merge multiple rows into one based on key values

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


sbass1
Premium Member
Posts: 211
Joined: Wed Jan 28, 2009 9:00 pm
Location: Sydney, Australia

Merge multiple rows into one based on key values

Post by sbass1 »

Hi,

I have data from disparate sources (sequential files and database tables) and need to get frequency counts based on the keys.

Very simplified example:

Code: Select all

seq_file --> aggregator --> output_link_1

db_table --> select key1, key2, count(*) from db_table group by key1, key2 --> output_link_2

etc. ... 5 more data sources, mix of file types
Say both streams produce output links with columns like so:

output_link_1: Key1, Key2, Count1

output_link_2: Key1, Key2, Count2

Desired output is:

Key1, Key2, Count1, Count2

Is there a way to merge the rows in memory, without having to write to one or more intermediate seq files, hashed files, or db tables, then coding the merge "manually"?

Thanks,
Scott
Some people are the Michael Jordans of DataStage. I'm more like Muggsy Bogues :-)
DS_SUPPORT
Premium Member
Posts: 232
Joined: Fri Aug 04, 2006 1:20 am
Location: Bangalore

Post by DS_SUPPORT »

I don't know whether this method will work, but give it a try.

Try writing the values to the job's user status area and fetching them when required; something like the sketch below.
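
Untested, but a minimal sketch of the idea, assuming a server transform routine with one argument (the routine name, job name, and column are illustrative; DSSetUserStatus, DSAttachJob, and DSGetJobInfo are the standard DS BASIC calls):

Code: Select all

* Routine body (argument Arg1 is declared in the routine definition);
* call it from a Transformer derivation, e.g. SetUserStatus(InLink.Count1)
Call DSSetUserStatus(Arg1)   ;* publish the value in the job's user status area
Ans = Arg1                   ;* pass the value through the derivation unchanged

* In a controlling job or batch, fetch it back:
hJob = DSAttachJob("CountJob", DSJ.ERRFATAL)
TheValue = DSGetJobInfo(hJob, DSJ.USERSTATUS)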
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Link Collector stage, Transformer stage, Aggregator stage ("Last")
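
Roughly, once every source presents the same column structure (sketch only; link names are from the original post):

Code: Select all

output_link_1 (Key1, Key2, Count1, 0) --+
output_link_2 (Key1, Key2, 0, Count2) --+--> link_collector --> xfm --> aggregator --> output
                                 ...  --+

with the Aggregator grouping by Key1, Key2 and taking "Last" on each count column.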
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

One change to Ray's suggestion...

Each row will carry its value in its own source-specific count column and zeros in all the others.

For example:

source1, flatfile, 10, 0, 0, 0
source2, db, 0, 10, 0, 0
...

tx -> agg (MAX of each count column)
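
In the Transformer, that padding could be derivations along these lines (link and column names illustrative):

Code: Select all

OutLink.Count1 = If InLink.Source = "source1" Then InLink.Cnt Else 0
OutLink.Count2 = If InLink.Source = "source2" Then InLink.Cnt Else 0

then the Aggregator groups by the keys with MAX on each count column, collapsing the padded rows into one.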
sbass1
Premium Member
Posts: 211
Joined: Wed Jan 28, 2009 9:00 pm
Location: Sydney, Australia

Post by sbass1 »

Thanks for the replies.

I ran into two issues with this approach:

1) For the sequential files, I wanted to aggregate the data first, as I needed frequency counts per primary key. If I didn't aggregate first, I couldn't see how to get the frequency counts in the final Aggregator stage from multiple input sequential files.

But the Link Collector did not accept the link from the Aggregator. Sorry, I didn't save the job, so I can't supply the exact error message; it was something about not accepting active-to-active stream inputs.

2) The Link Collector requires the same column structure on all input and output links. I didn't want to add a bunch of bogus code to the DRS stages just to create zero-valued count columns to satisfy that requirement, and I couldn't see how to get the Aggregator stage to create zeros for the "unnecessary" columns either.

I ended up feeding one sequential file --> aggregator --> xfm as the primary input link, with the remaining sequential files going --> aggregator --> hashed file --> xfm, and merging the rows via lookups against the hashed files and DRS stages on the primary keys; see the sketch below.
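
In sketch form (stage names illustrative):

Code: Select all

seq_file_1 --> aggregator_1 --> xfm --> merged_output    (primary stream input)
seq_file_2 --> aggregator_2 --> hashed_file_2 --> xfm    (reference/lookup link)
seq_file_3 --> aggregator_3 --> hashed_file_3 --> xfm    (reference/lookup link)
db_table_n --> DRS (group-by select) ----------> xfm     (reference/lookup link)

with each reference link keyed on Key1, Key2.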

Thanks again for the ideas.

Scott
Some people are the Michael Jordans of DataStage. I'm more like Muggsy Bogues :-)
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Well, why didn't you mention all that up front? Dammit, we've answered what turned out to be the wrong question!!!

Probably could also have solved this "new problem" with intermediate Sequential File stages.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sbass1
Premium Member
Posts: 211
Joined: Wed Jan 28, 2009 9:00 pm
Location: Sydney, Australia

Post by sbass1 »

No, Ray, what I asked was:
Is there a way to merge the rows in memory, without having to write to one or more intermediate seq files, hashed files, or db tables, then coding the merge "manually"?
What you replied was:
Link Collector stage, Transformer stage, Aggregator stage ("Last")
Dammit, genius is in the details!!! :P

I then stated the problems I experienced with your suggested approach. Besides, how could I "mention all that up front" when I was trying that approach for the first time!!!

Given the Link Collector's limitations (it requires identical structures on all input and output links, and it would not accept both DRS and Aggregator links as input) and the Aggregator stage's limitation of not letting me create "zero" columns (at least as far as I could see), I go back to my original question:
Is there a way to merge the rows in memory, without having to write to one or more intermediate seq files, hashed files, or db tables, then coding the merge "manually"?
The only way I could figure out to solve it was my undesired use of intermediate hashed files.
Some people are the Michael Jordans of DataStage. I'm more like Muggsy Bogues :-)
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Post by Kryt0n »

Issue 1 is related to inter- vs. intra-process communication. Either shove a passive stage in between (which is what you wanted to avoid) or enable inter-process row buffering in the job properties.
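
If memory serves, that setting lives on the Performance tab of Job Properties in Server jobs, along these lines:

Code: Select all

Job Properties > Performance
   [x] Enable row buffer
       ( ) In process
       (x) Inter process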
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

More thought required...

(The answer, in the interim, is "probably", but nothing is leaping into the forebrain just yet.) The idea of an IPC stage, or simply inter-process row buffering, is probably useful.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.