JOIN STAGE takes a hell of a lot of time

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

kumarjit
Participant
Posts: 99
Joined: Fri Oct 12, 2012 7:47 am
Location: Kolkata


Post by kumarjit »

I have created a DataStage job that takes around 5 MILLION records as input. One of the columns of this dataset is ADDRESS_ID (Varchar, max length 100 characters). Later in the job, a Join stage joins the input data with a reference dataset (of comparable volume to the input) on the ADDRESS_ID column.
An example of the data in the ADDRESS_ID column:
7767657567575089/001/01
My question is: why is this Join stage taking so long to process the data through it?
From the Job Monitor screen, I have seen that the job gets stuck as soon as it reaches the Join stage.
I may be wrong, but the way I interpret it is that because the LENGTH of the key column (ADDRESS_ID) is on the higher side, and the reference dataset with so many records has to be scanned for each key value read from the input, it takes a long time to process.
If so, can there be any workaround to this problem?


Thanks,
Kumarjit.
Pain is the best teacher, but very few attend his class..
jhmckeever
Premium Member
Posts: 301
Joined: Thu Jul 14, 2005 10:27 am
Location: Melbourne, Australia

Post by jhmckeever »

Hi Kumarjit,

I suspect the reason you're seeing the delay is that your data is being sorted on ADDRESS_ID before it can be joined. It's likely that a runtime sort operator has been inserted into your OSH. Take a look at your OSH in Director to see evidence of this.

J.
<b>John McKeever</b>
Data Migrators
<b><a href="https://www.mettleci.com">MettleCI</a> - DevOps for DataStage</b>
kumarjit
Participant
Posts: 99
Joined: Fri Oct 12, 2012 7:47 am
Location: Kolkata

Post by kumarjit »

Thanks, J, for your response. But the business logic demands that a join be made on ADDRESS_ID (basically to look up the ADDRESS_ID of the input against reference data), so I can't escape the join.

What I am looking for is this:
If there is a way to encode the ADDRESS_ID column in a format such that
a) the number of characters is reduced (and it can be decoded back to its normal varchar form after all the processing is done), and
b) the encoding yields a UNIQUE value for EACH ADDRESS_ID (varchar) record,
then we can speed things up a bit.
Can't we? :wink:

Regards,
Kumarjit.
Pain is the best teacher, but very few attend his class..
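As an editorial aside: the encoding kumarjit describes can be sketched as a surrogate-key mapping. This is plain Python (not DataStage) and purely illustrative; the function and variable names are hypothetical. Note that both the input and the reference dataset would have to share one mapping for the join to still match.

```python
# Hypothetical sketch: map each distinct VARCHAR ADDRESS_ID to a compact
# integer surrogate key (unique per distinct value, losslessly reversible),
# join on the integer, then decode back to the varchar afterwards.

def encode(ids):
    """Assign a unique integer to each distinct ADDRESS_ID."""
    key_for = {}   # ADDRESS_ID -> surrogate integer
    id_for = []    # surrogate integer -> ADDRESS_ID (for decoding later)
    encoded = []
    for addr in ids:
        if addr not in key_for:
            key_for[addr] = len(id_for)
            id_for.append(addr)
        encoded.append(key_for[addr])
    return encoded, id_for

addresses = ["7767657567575089/001/01", "7767657567575089/001/02",
             "7767657567575089/001/01"]
codes, lookup = encode(addresses)
decoded = [lookup[c] for c in codes]
assert decoded == addresses   # round-trips losslessly
```

As the replies in this thread point out, shrinking the key is unlikely to be the fix here; this only shows that such an encoding is possible in principle.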
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

While there are a couple of algorithmic approaches to making your ADDRESS_ID more compact to fit in fewer bytes, the real reason for the join taking a long time has nothing at all to do with the size of the join key and is all about the partitioning and sorting of your two input links.

As jhmckeever has already stated, you need to ensure that both of your incoming links are partitioned and sorted identically and that the data is sorted on ADDRESS_ID.
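ArndW's point can be illustrated with a toy model in plain Python (hypothetical, not OSH): hash-partition both inputs on the join key, sort each partition, then merge-join partition by partition. When both links are partitioned and sorted identically, the join itself is a single linear pass.

```python
# Toy model of a parallel sorted-merge join. Names are hypothetical.

def partition(rows, nparts, key=lambda r: r[0]):
    """Hash-partition rows on the join key, then sort each partition."""
    parts = [[] for _ in range(nparts)]
    for row in rows:
        parts[hash(key(row)) % nparts].append(row)
    return [sorted(p) for p in parts]

def merge_join(left, right):
    """Single-pass merge join of two key-sorted inputs.
    Assumes keys are unique on each link, as with a reference lookup."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:                     # matching keys: emit the joined row
            out.append(left[i] + right[j][1:])
            i += 1
            j += 1
    return out

lhs = [("A/001/01", "main"), ("B/002/01", "input")]
rhs = [("A/001/01", "ref-A"), ("C/003/01", "ref-C")]
joined = []
# Same hash partitioning on both sides guarantees matching keys meet
# in the same partition, so each partition pair can be joined alone.
for lp, rp in zip(partition(lhs, 2), partition(rhs, 2)):
    joined.extend(merge_join(lp, rp))
assert joined == [("A/001/01", "main", "ref-A")]
```

If either side arrives unpartitioned or unsorted, the framework has to repartition and sort it first, which is where the time goes.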
kumarjit
Participant
Posts: 99
Joined: Fri Oct 12, 2012 7:47 am
Location: Kolkata

Post by kumarjit »

I already have the partitioning and sorting done IDENTICALLY on both links to the Join stage, on the ADDRESS_ID column, and still the job runs long enough to raise my eyebrow...
As Andrew said, can I have any of the algorithms to compact my ADDRESS_ID field size? I just need to give it a try and check whether it enhances the performance of the job.

Regards,
Kumarjit.
Pain is the best teacher, but very few attend his class..
eph
Premium Member
Posts: 110
Joined: Mon Oct 18, 2010 10:25 am

Post by eph »

Hi,

Instead of partitioning and sorting on the links, you could do it in an explicit Sort stage and try tuning its Restrict Memory Usage option to a higher value.

Eric
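Eric's suggestion rests on how a memory-restricted sort works: data is sorted in memory-sized runs which are then stream-merged, so a larger memory limit means fewer, larger runs and less merge work. A toy sketch in plain Python (hypothetical, not a DataStage Sort stage):

```python
# Toy external sort: sort fixed-size chunks in memory, then k-way merge
# the sorted runs. Raising the in-memory limit shrinks the number of runs.
import heapq

def external_sort(rows, max_rows_in_memory):
    runs = []
    for start in range(0, len(rows), max_rows_in_memory):
        # Each run fits within the (simulated) memory limit.
        runs.append(sorted(rows[start:start + max_rows_in_memory]))
    return list(heapq.merge(*runs))   # stream-merge the sorted runs

data = ["7767/003/01", "7767/001/01", "7767/002/01", "7767/000/01"]
assert external_sort(data, 2) == sorted(data)
```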
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

kumarjit wrote: As Andrew said, can I have any of the algorithms to compact my ADDRESS_ID field size? I just need to give it a try and check whether it enhances the performance of the job.
It won't. Your issue isn't about the size of the join key.
kumarjit also wrote: I already have the partitioning and sorting done IDENTICALLY on both links to the Join stage, on the ADDRESS_ID column, and still the job runs long enough to raise my eyebrow...
Does the job know that? It sure doesn't sound like it if the job gets "stuck" at that stage. Have you dumped the score to verify that it isn't adding another sort, as John suggested? If it is, it may be as simple as adding a Sort stage before the join with its key mode set to "Don't Sort (Previously Sorted)" to solve the problem.


Edit: People hardly ever bother to rate posts around here and yet mine, which couldn't be more "on topic" gets rated "off-topic/superfluous". Awesome. :roll:
Last edited by chulett on Wed Nov 14, 2012 8:16 am, edited 2 times in total.
-craig

"You can never have too many knives" -- Logan Nine Fingers
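The "Don't Sort (Previously Sorted)" setting chulett describes amounts to declaring that the link is already key-ordered, so no new sort operator needs to be inserted. A toy sketch in plain Python (hypothetical names, not DataStage internals):

```python
# Sketch of "Don't Sort (Previously Sorted)": trust the declaration that
# the input is already ordered on the join key and skip the re-sort, but
# fail loudly if the declaration is wrong rather than produce a bad join.

def prepare_for_join(rows, key=lambda r: r[0], previously_sorted=False):
    """Return rows ready for a merge join on `key`."""
    is_ordered = all(key(a) <= key(b) for a, b in zip(rows, rows[1:]))
    if previously_sorted:
        if not is_ordered:
            raise ValueError("declared sorted, but keys are out of order")
        return rows                   # no extra sort operator inserted
    # Otherwise behave like the framework: re-sort unless already ordered.
    return rows if is_ordered else sorted(rows, key=key)

rows = [("A/1", "x"), ("B/2", "y")]
assert prepare_for_join(rows, previously_sorted=True) is rows
```

Dumping the score is still the way to confirm whether the framework inserted its own tsort despite the upstream sort.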