Join Stage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

varshanswamy
Participant
Posts: 48
Joined: Thu Mar 11, 2004 10:32 pm

Post by varshanswamy »

I have a join that performs a Cartesian product between
two files that share a common column called DUMMY, which is defaulted to 1.
Source file: 2,000 records
Reference file: 342,066 records

I want to know how I can speed up the process, and whether the Join stage can handle this operation given the heavy volume of data, or whether I need to use some other stage.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You didn't state how you were doing the join - presumably a Join stage.
No matter what method you use, you're going to have to process a large number of rows (684,132,000, based on the figures you supplied).

There's no reason that DataStage would not be able to process this volume, provided you have the resources to support it.
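
For reference, the row count quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming the reference file holds 342,066 records (the reading that matches the quoted total); the variable names are illustrative.

```python
# Output size of a Cartesian product: every source row paired with every reference row.
source_rows = 2_000        # source file record count from the original post
reference_rows = 342_066   # reference file record count (assumed reading of the posted figure)

output_rows = source_rows * reference_rows
print(f"{output_rows:,}")  # 684,132,000
```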
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
varshanswamy
Participant
Posts: 48
Joined: Thu Mar 11, 2004 10:32 pm

Post by varshanswamy »

I used an Inner Join, since it is based on the DUMMY column created for the purpose of the Cartesian product.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Either you or I misunderstand the term Cartesian product, then. As I understand the term, it's all rows from table B for each row in table A.
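
To make the two descriptions concrete, here is a small Python sketch (not DataStage code) showing that an inner join on a constant DUMMY key yields the same pairs as a Cartesian product. The sample rows, the `id` and `code` fields, and the `inner_join` helper are made up for illustration.

```python
from itertools import product

# Tiny stand-ins for the two files; every row carries DUMMY = 1.
source = [{"id": "A1", "DUMMY": 1}, {"id": "A2", "DUMMY": 1}]
reference = [
    {"code": "B1", "DUMMY": 1},
    {"code": "B2", "DUMMY": 1},
    {"code": "B3", "DUMMY": 1},
]

def inner_join(left, right, key):
    """Naive inner join: keep every pair whose key values match."""
    return [(l["id"], r["code"]) for l in left for r in right if l[key] == r[key]]

joined = inner_join(source, reference, "DUMMY")
cartesian = [(l["id"], r["code"]) for l, r in product(source, reference)]

# Because the key is identical on every row, the join condition never filters anything.
print(joined == cartesian)  # True
print(len(joined))          # 6, i.e. 2 source rows x 3 reference rows
```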
T42
Participant
Posts: 499
Joined: Thu Nov 11, 2004 6:45 pm

Post by T42 »

Ray is correct. In fact, you probably are better off with a Lookup stage. Heavy volume? Hah. Maybe for output, but not for a lookup. 300k of reference file that does not have to be sorted? Do a lookup.

The Join stage sorts within the framework (unless you have already sorted the data beforehand). That adds time.
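
The difference can be sketched in plain Python. This is a simplified illustration of the two general techniques (an in-memory keyed lookup versus a sort-then-merge join), not a description of how DataStage implements either stage; the function names and sample rows are hypothetical.

```python
from collections import defaultdict

def hash_lookup(source, reference, key):
    """Load the reference rows into memory once, then stream the source. No sort needed."""
    table = defaultdict(list)
    for ref in reference:
        table[ref[key]].append(ref)
    for src in source:
        for ref in table.get(src[key], []):
            yield src, ref

def sort_merge_join(source, reference, key):
    """Sort both inputs on the key, then merge matching runs. The sorts are the extra cost."""
    left = sorted(source, key=lambda row: row[key])
    right = sorted(reference, key=lambda row: row[key])
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side and emit every pairing.
            i_end = i
            while i_end < len(left) and left[i_end][key] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            for l in left[i:i_end]:
                for r in right[j:j_end]:
                    yield l, r
            i, j = i_end, j_end

source = [{"id": n, "DUMMY": 1} for n in range(3)]
reference = [{"code": c, "DUMMY": 1} for c in "abcd"]

lookup_pairs = list(hash_lookup(source, reference, "DUMMY"))
merge_pairs = list(sort_merge_join(source, reference, "DUMMY"))
print(len(lookup_pairs), lookup_pairs == merge_pairs)  # 12 True
```

Both approaches emit the same pairs; the lookup version simply skips the sort step, at the cost of holding the reference rows in memory.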