Merge multiple rows into one based on key values

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


sbass1
Premium Member
Posts: 211
Joined: Wed Jan 28, 2009 9:00 pm
Location: Sydney, Australia

Merge multiple rows into one based on key values

Post by sbass1 »

Hi,

I have data from disparate sources (sequential files and database tables) and need to get frequency counts based on the keys.

Very simplified example:

Code: Select all

seq_file --> aggregator --> output_link_1

db_table --> select key1, key2, count(*) from db_table group by key1, key2 --> output_link_2

etc. ... 5 more data sources, mix of file types
Say both streams produce output links with columns like so:

output_link_1: Key1, Key2, Count1

output_link_2: Key1, Key2, Count2

Desired output is:

Key1, Key2, Count1, Count2

Is there a way to merge the rows in memory, without having to write to one or more intermediate seq files, hashed files, or db tables, then coding the merge "manually"?

Thanks,
Scott
Some people are the Michael Jordans of DataStage. I'm more like Muggsy Bogues :-)
DS_SUPPORT
Premium Member
Posts: 232
Joined: Fri Aug 04, 2006 1:20 am
Location: Bangalore

Post by DS_SUPPORT »

I don't know whether this method will work, but give it a try.

Try writing the values to the job's user status area and fetching them when required; something like the sketch below.
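
Untested, but a minimal sketch of the idea, assuming a server transform routine with one argument (the routine name, job name, and column are illustrative; DSSetUserStatus, DSAttachJob, and DSGetJobInfo are the standard DS BASIC calls):

Code: Select all

* Routine body (argument Arg1 is declared in the routine definition);
* call it from a Transformer derivation, e.g. SetUserStatus(InLink.Count1)
Call DSSetUserStatus(Arg1)   ;* publish the value in the job's user status area
Ans = Arg1                   ;* pass the value through the derivation unchanged

* In a controlling job or batch, fetch it back:
hJob = DSAttachJob("CountJob", DSJ.ERRFATAL)
TheValue = DSGetJobInfo(hJob, DSJ.USERSTATUS)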
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Link Collector stage, Transformer stage, Aggregator stage ("Last")
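
Roughly, once every source presents the same column structure (sketch only; link names are from the original post):

Code: Select all

output_link_1 (Key1, Key2, Count1, 0) --+
output_link_2 (Key1, Key2, 0, Count2) --+--> link_collector --> xfm --> aggregator --> output
                                 ...  --+

with the Aggregator grouping by Key1, Key2 and taking "Last" on each count column.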
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

One change to Ray's suggestion...

Each row will carry its value in its own source-specific count column and zeros in all the others.

For example:

source1, flatfile, 10, 0, 0, 0
source2, db, 0, 10, 0, 0
...

tx -> agg (MAX of each count column)
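
In the Transformer, that padding could be derivations along these lines (link and column names illustrative):

Code: Select all

OutLink.Count1 = If InLink.Source = "source1" Then InLink.Cnt Else 0
OutLink.Count2 = If InLink.Source = "source2" Then InLink.Cnt Else 0

then the Aggregator groups by the keys with MAX on each count column, collapsing the padded rows into one.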
sbass1
Premium Member
Posts: 211
Joined: Wed Jan 28, 2009 9:00 pm
Location: Sydney, Australia

Post by sbass1 »

Thanks for the replies.

I ran into two issues with this approach:

1) For the sequential files, I wanted to aggregate the data first, as I needed frequency counts per primary key. If I didn't aggregate first, I couldn't see how to get the frequency counts in the final Aggregator stage from multiple input sequential files.

But the Link Collector did not accept the link from the Aggregator. Sorry, I didn't save the job, so I can't supply the exact error message; it was something about not accepting active-to-active stream inputs.

2) The Link Collector requires the same column structure on all input and output links. I didn't want to add a bunch of bogus code to the DRS stages just to create zero-valued count columns to satisfy that requirement, and I couldn't see how to get the Aggregator stage to create zeros for the "unnecessary" columns either.

I ended up feeding one sequential file --> aggregator --> xfm as the primary input link, with the remaining sequential files going --> aggregator --> hashed file --> xfm, and merging the rows via lookups against the hashed files and DRS stages on the primary keys; see the sketch below.
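
In sketch form (stage names illustrative):

Code: Select all

seq_file_1 --> aggregator_1 --> xfm --> merged_output    (primary stream input)
seq_file_2 --> aggregator_2 --> hashed_file_2 --> xfm    (reference/lookup link)
seq_file_3 --> aggregator_3 --> hashed_file_3 --> xfm    (reference/lookup link)
db_table_n --> DRS (group-by select) ----------> xfm     (reference/lookup link)

with each reference link keyed on Key1, Key2.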

Thanks again for the ideas.

Scott
Some people are the Michael Jordans of DataStage. I'm more like Muggsy Bogues :-)
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Well, why didn't you mention all that up front? Dammit, we've answered what turned out to be the wrong question!!!

Probably could also have solved this "new problem" with intermediate Sequential File stages.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sbass1
Premium Member
Posts: 211
Joined: Wed Jan 28, 2009 9:00 pm
Location: Sydney, Australia

Post by sbass1 »

No, Ray, what I asked was:
Is there a way to merge the rows in memory, without having to write to one or more intermediate seq files, hashed files, or db tables, then coding the merge "manually"?
What you replied was:
Link Collector stage, Transformer stage, Aggregator stage ("Last")
Dammit, genius is in the details!!! :P

I then stated the problems I experienced with your suggested approach. Besides, how could I "mention all that up front" when I was trying that approach for the first time!!!

Given the Link Collector's limitations (it requires identical structures on all input and output links, and it would not accept both DRS and Aggregator links as input) and the Aggregator stage's limitation of not letting me create "zero" columns (at least as far as I could see), I go back to my original question:
Is there a way to merge the rows in memory, without having to write to one or more intermediate seq files, hashed files, or db tables, then coding the merge "manually"?
The only way I could figure out to solve it was my undesired use of intermediate hashed files.
Some people are the Michael Jordans of DataStage. I'm more like Muggsy Bogues :-)
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Post by Kryt0n »

Issue 1 is related to inter- vs. intra-process communication. Either shove a passive stage in between (which is what you wanted to avoid) or enable inter-process row buffering in the job properties.
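
If memory serves, that setting lives on the Performance tab of Job Properties in Server jobs, along these lines:

Code: Select all

Job Properties > Performance
   [x] Enable row buffer
       ( ) In process
       (x) Inter process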
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

More thought required...

(The answer, in the interim, is "probably", but nothing is leaping into the forebrain just yet.) The idea of an IPC stage, or simply inter-process row buffering, is probably useful.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.