Merge Stage - pure inner join speed

Post by ArndW »

I have two sequential files, each with two columns - the first column is the 'key' and the other is data. The source files are in the same order and of the same length - meaning that the n-th row of file A will have the same 'key' as the n-th row of file B.

I want to output three columns: the key, FileACol2, and FileBCol2. When I use the Merge stage with the pure inner join on this one key column I get abysmal performance (the files are several gigabytes in size) - somewhere on the order of 50 rows/second.

If I write a 10-line DataStage program that just loops, reads a row from File A and a row from File B, and outputs the merged information, I get 80,000 rows per second; but this is not the solution I would prefer to use.
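
(For illustration only, the pairing loop boils down to something like the following Python sketch; the file names and the comma delimiter are assumptions, and the real routine is a DataStage program rather than Python.)

    # Pair up the n-th rows of the two files and write key, FileACol2, FileBCol2.
    with open("fileA.txt") as fa, open("fileB.txt") as fb, open("merged.txt", "w") as out:
        for line_a, line_b in zip(fa, fb):
            key, col_a = line_a.rstrip("\n").split(",", 1)
            _, col_b = line_b.rstrip("\n").split(",", 1)
            out.write(key + "," + col_a + "," + col_b + "\n")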

I can't think of anything that I can do to speed up the merge; but perhaps I've missed something. Also, is there another approach to merging these two files (that does not involve writing interim data to disk) that could be used in a server job?

Post by kcbland »

I would use a SEQ stage to read file 1, go to an XFM, and leave the data in the column position it's in, but add a third column with a derivation of "".

I would add another SEQ stage to read file 2, go to an XFM, and map the data into the third column created just like above, but put "" for the derivation of the second column.

I'd send the output of both XFM stages to a link collector, specify the sort key, and then go to an AGG with the sort information there as well. In the AGG stage, use MAX() as the derivation. If the design works, you should get one merged row per key: where both source files contribute a row you'll have full population, and where only one file contributes you'll have a blank in the opposite data column.

This should be really fast: with the pre-sorted data the link collector should be really fast, and the aggregator will hold at most 2 rows in memory because of the sorted data.
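
(A tiny made-up illustration of the padding-plus-MAX() idea, sketched in Python rather than the actual stages; the key and data values are invented.)

    # One padded row arrives from each file; taking MAX() per data column within the
    # key group collapses the pair, because any non-blank string beats "" under MAX().
    rows = [("K1", "a-data", ""),   # from file 1: third column padded with ""
            ("K1", "", "b-data")]   # from file 2: second column padded with ""
    merged = ("K1", max(r[1] for r in rows), max(r[2] for r in rows))
    print(merged)                   # ('K1', 'a-data', 'b-data')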

Post by ArndW »

Ken,

thanks! Right after I posted the query I thought about the link collector and wrote a test job using that. Essentially the rows alternate between File A and File B, so with a stage variable in a transform it works perfectly. The speed is about 125,000 rows per second on this box (I'm using named pipes to speed things up).
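
(Roughly, the Transformer logic amounts to the following Python sketch, where one held value plays the role of the stage variable; the function name and sample rows are invented.)

    def pair_alternating(collected_rows):
        """Rows arrive A, B, A, B, ... from the round-robin collector; the File A
        value is held until the matching File B row arrives, then one merged
        row is emitted."""
        merged = []
        held_a = None                      # plays the role of the stage variable
        for i, (key, value) in enumerate(collected_rows):
            if i % 2 == 0:                 # File A row: stash its data column
                held_a = value
            else:                          # File B row: emit key, A data, B data
                merged.append((key, held_a, value))
        return merged

    print(pair_alternating([("K1", "a1"), ("K1", "b1"), ("K2", "a2"), ("K2", "b2")]))
    # [('K1', 'a1', 'b1'), ('K2', 'a2', 'b2')]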

Thanks,

Post by kcbland »

Don't trust the alternating order without using the sort to guarantee it. You might get rows out of order. I didn't suggest stage variables because the aggregator is more scalable if you have a lot of data columns; the AGG works well even with hundreds of columns. I use this technique quite often, with great success, for merging data rather than joining it and building lots of hash lookups, especially when full outer joins are required.

Post by ArndW »

Ken,

since these data streams are going to contain hundreds of millions of records I really don't want to do any sorting at all. In reality I don't have two files; I actually have just one stream that has been split in a transform, with one branch going into a very complex shared container which returns just one value, and then I wish to merge this one new value back into the original stream.

I used the 2 source file example to simplify the post. I will always have a 1:1 correspondence on a row level; thus the alternating link collector shouldn't have a chance of getting out of synch - or do you think that it might?

Post by kcbland »

The collector sort/merge reads multiple sorted input links. Since your data is sorted, use this collection type instead of round robin. Designate the sort key column, and you're all set. It should maintain the sort order when collating the link rows. In effect it will just do a round robin, but with row buffering I don't know if you can guarantee that.

The output comes out sorted, so you're not really sorting it again on the way to the aggregator. The most rows the collector ever has to sort is 2, because it reads from each link until the key changes on all the links, sorts those rows, and then outputs them. It should be fine, but at least check it out.
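
(As an analogy only: sort/merge collection behaves much like Python's heapq.merge over already-sorted inputs, buffering no more than the head row of each link; the link contents below are invented.)

    import heapq

    link_a = [(1, "a1", ""), (2, "a2", ""), (3, "a3", "")]   # padded rows from file 1
    link_b = [(1, "", "b1"), (2, "", "b2"), (3, "", "b3")]   # padded rows from file 2

    # Rows come out in key order; ties keep link order, so each key's pair stays adjacent.
    for row in heapq.merge(link_a, link_b, key=lambda r: r[0]):
        print(row)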

Post by roy »

Hi,
If you combine the data using the Agg stage there is always a chance of rows coming out off the sequence stated in the sort order, since the stage trades ordering for performance.
But you said you have a solution using a Transformer stage, so is there no such problem there?

Post by ArndW »

Ken,

I understand what you meant now - I originally assumed that it would do a complete data sort {just as that durn merge stage is obviously doing}. But do you think that the round-robin collector could ever get out of synch? I wrote a small test job and started it 50 times (thank god for multi-instance) and it worked each time, meaning the link and data order was correct. I can't think of any external factor that could affect the ordering in this case.

I know what you've said will work, but I think you can tell how I'm fighting tooth-and-nail to avoid using an aggravator stage :roll: I'll give it a try and if the performance is fast I'll stick with your recommendation.

Post by kcbland »

There's absolute metaphysical certitude that your rows are ordered correctly, and then there's everything else. Even if performance is 1/10th your current row/sec, you're still chugging along quite well.

You roll those dice. I'm fortunate in that my data is never sorted going in, so I just end up concatenating, sorting, and then aggregating. You're starting with sorted files that need subtle manipulation before collecting. You could always use 2 more SEQ files instead of the collector, concatenate, then sort. But since your data is sorted and ready to go, using the link collector with sort/merge collection seems easy.

You only have 2 data columns, so why not use stage variables instead of the aggregator? Using the Aggregator is over-engineering, but even that's not much effort. However, your stage variable technique doesn't scale to 100 columns, because you'd need 102 stage variables (one to hold the last key, one to hold a key-change indicator, and 100 to hold all of the values from the previous row), and then there's that whole pesky how-do-you-output-the-last-row thing.

You did figure out how to solve the last row issue, right? :shock:
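
(A rough Python sketch of that key-change pattern and of the last-row problem; the function name and the flush after the loop are illustrative, not a prescription for the job design.)

    def merge_on_key_change(rows):
        """rows: key-sorted (key, col1, col2) tuples padded with "" as above.
        A merged row is only emitted when the key changes, so the final key
        group has to be flushed explicitly after the loop - the 'last row' issue."""
        merged, current, last_key = [], None, None
        for key, col1, col2 in rows:
            if key != last_key:
                if current is not None:
                    merged.append(tuple(current))   # emit the previous key's row
                current, last_key = [key, "", ""], key
            if col1:
                current[1] = col1
            if col2:
                current[2] = col2
        if current is not None:
            merged.append(tuple(current))           # the last-row flush
        return merged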

Post by ArndW »

Ken, thanks for the help. The aggregator worked as you stated, but in this case the link collector and a subsequent transform with two stage variables will work for me.

Post by kcbland »

kcbland wrote: You did figure out how to solve the last row issue, right? :shock:

Post by ashokyadav »

[quote="kcbland"]I would use a SEQ stage to read file 1 and go to a XFM and leave the data in the column position it's in, but add a third column with a derivation of "".

I would add another SEQ stage to read file 2 and go to a XFM and map the data into the third column created just like above, but put "" for the derivation of the second column.

I'd send the output of both XFM stages to a link collector and specify the sort key and then to an AGG with the sort information there as well. In the AGG stage, use MAX() as the derivation. If the design works, you should get a merged row, where both source files contribute a row you'll have full population, where only one file contributes you'll have a blank in the opposite data column.


[color=brown]Ken ,

I am new to Datastage, Can you please let me know why you added the Aggregate stage in this solution.

Thanks,
Ashok[/color]




This should be really fast, because of the pre-sorted data the link collector should be really fast and the aggregator only will hold max 2 rows in memory because of sorted data.[/quote][color=brown][/color]

Post by kcbland »

Because 2 rows have to "merge" into 1. Here's the same solution written in SQL:

    select key, max(col1), max(col2) from (
        select key, col1, null as col2 from table1
        union all
        select key, null as col1, col2 from table2
    )
    group by key

You can see that the two tables are effectively doing a full outer join, just as in Arnd's situation.