Full outer join of more than two files in DataStage

kaps · Post by **kaps** » Wed Sep 27, 2006 10:02 am

All

We are looking to do a full outer join of two files in DataStage. I can do this using the Merge stage, but from reading some of the postings in this forum I gather that the performance is not good, and some people say it's ugly!
Can anyone tell me why the performance is not good when we use the Merge stage? If I do this using the Aggregator, will it be fast?

I also want to know how I can do a full outer join of more than two files in DataStage without using the Merge stage.

I appreciate any help!

Thanks

kris007 · Post by **kris007** » Wed Sep 27, 2006 10:10 am

The Merge stage is kind of picky. It's really hard for me to say why it behaves in such a way, but I have seen it behave like that. You might want to try it and see what you achieve before you come to any conclusions for yourself. You never know, it might work well for you. I am not sure how you intend to perform a full outer join using the Aggregator stage. One way you can achieve this: if one of your files doesn't have duplicates, you can load that file into a hashed file, use it as a lookup, and include all the columns in the output.

chulett · Post by **chulett** » Wed Sep 27, 2006 10:13 am

Performance is a relative thing and depends on many different factors. One man's poor is another man's just fine. It's simple enough to set up the Merge stage - give it a shot and see how it handles your two files on your system.

For multiple files, you could set up a series of Merge stages, I suppose. The first would merge two files, then land the results so that they could be read back in and merged with the next file. Lather, rinse, repeat. I'd probably use named pipes in that case to handle the 'landed' files or at least give that a try first.
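To make the "merge two, land the result, merge with the next" idea concrete, here is a minimal sketch in plain Python rather than DataStage. It assumes each file has already been read into a dict keyed on the join key, that keys are unique within a file, and that column names do not clash across files. The names `full_outer_join`, `join_all`, and the sample data are hypothetical, chosen only for illustration.

```python
from functools import reduce


def full_outer_join(left, right):
    """Full outer join of two record sets shaped as {key: {column: value}}."""
    # Collect every column seen on each side so unmatched rows can be padded.
    left_cols = {col for row in left.values() for col in row}
    right_cols = {col for row in right.values() for col in row}
    joined = {}
    for key in sorted(left.keys() | right.keys()):
        # Start with None in every column, then overlay whatever each side has.
        row = dict.fromkeys(sorted(left_cols | right_cols))
        row.update(left.get(key, {}))
        row.update(right.get(key, {}))
        joined[key] = row
    return joined


def join_all(record_sets):
    """Chain pairwise joins: (A join B), then that result join C, and so on."""
    return reduce(full_outer_join, record_sets)


# Hypothetical sample data standing in for three keyed text files.
file_a = {1: {"a": "x"}, 2: {"a": "y"}}
file_b = {2: {"b": "p"}, 3: {"b": "q"}}
file_c = {3: {"c": "m"}, 4: {"c": "n"}}

result = join_all([file_a, file_b, file_c])
```

Each intermediate result plays the role of the landed file: it is complete before the next join starts, which is why a chain of two-input joins is enough for any number of files.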

The Aggregator can't be used for this.

meena · Post by **meena** » Wed Sep 27, 2006 10:18 am

Hi,
The only way of joining more than two files is to use the Merge stage. I have never heard of the Aggregator stage being used for this scenario, because the Aggregator is used for aggregate functions, totals, etc., and it takes only one input stream (so you cannot use this stage). As for the performance, it depends.

ray.wurlod · Post by **ray.wurlod** » Wed Sep 27, 2006 3:33 pm

The Merge stage must read its sources directly from files. Therefore, to join more than two files you need more than one job.

To do it without the Merge stage you would need to load the text files into temporary tables (UV tables would do) and use the database to effect the N-way full outer join. Hashed file lookups do not support full outer joins; only left outer joins.
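A small sketch of why a lookup gives only a left outer join, again in plain Python rather than DataStage. A dict stands in for the hashed file, and `lookup_join`, `ref_value`, and the sample rows are hypothetical names used purely for illustration. The job only ever visits rows from the driving stream, so keys that exist only in the reference data are never emitted.

```python
def lookup_join(stream_rows, reference, key):
    """Left outer join: keep every stream row, attach the lookup result if any."""
    out = []
    for row in stream_rows:
        merged = dict(row)
        # A failed lookup yields None, but the stream row is still kept.
        merged["ref_value"] = reference.get(row[key])
        out.append(merged)
    return out


# Hypothetical driving stream and reference (hashed file stand-in).
stream = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
reference = {2: "x", 3: "y"}

result = lookup_join(stream, reference, "id")

# Reference keys that never appear in the output: the rows a full outer
# join would have kept and a lookup silently drops.
missing = set(reference) - {row["id"] for row in result}
```

Here key 3 exists only in the reference data and ends up in `missing`, which is the half of the full outer join that a lookup cannot produce on its own.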