JOIN STAGE takes a hell of a lot of time

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

kumarjit
Participant
Posts: 99
Joined: Fri Oct 12, 2012 7:47 am
Location: Kolkata


Post by kumarjit »

I have created a DataStage job that takes around 5 MILLION records as input. One of the columns of this dataset is ADDRESS_ID (Varchar, max length 100 characters). Later in the job, a Join stage joins the input data with a reference dataset (of comparable volume to the input) on the ADDRESS_ID column.
An example of the data in the ADDRESS_ID column:
7767657567575089/001/01
My question is: why is this Join stage taking so long to process the data through it?
From the Job Monitor screen, I have seen that the job gets stuck as soon as it reaches the Join stage.
I may be wrong, but the way I interpret it is that because the LENGTH of the key column (ADDRESS_ID) is on the higher side, and the reference dataset with so many records has to be scanned for each key value read from the input, it takes a long time to process.
If so, can there be any workaround to this problem?


Thanks,
Kumarjit.
Pain is the best teacher, but very few attend his class..
jhmckeever
Premium Member
Posts: 301
Joined: Thu Jul 14, 2005 10:27 am
Location: Melbourne, Australia

Post by jhmckeever »

Hi Kumarjit,

I suspect the reason you're seeing the delay is that your data is being sorted on ADDRESS_ID before it can be joined. It's likely that a runtime sort operator has been inserted into your OSH. Take a look at your OSH in Director to see evidence of this.

J.
<b>John McKeever</b>
Data Migrators
<b><a href="https://www.mettleci.com">MettleCI</a> - DevOps for DataStage</b>
kumarjit
Participant
Posts: 99
Joined: Fri Oct 12, 2012 7:47 am
Location: Kolkata

Post by kumarjit »

Thanks, J, for your response. But the business logic demands that a join be made on ADDRESS_ID (basically to look up the ADDRESS_ID of the input against reference data), so I can't escape the join.

What I am looking for is this:
If there is a way to encode the ADDRESS_ID column in a format such that
a) the number of characters is reduced (and it can be decoded back to its normal varchar form after all the processing is done), and
b) the encoding yields a UNIQUE value for EACH ADDRESS_ID (varchar) record,
then we can speed things up a bit.
Can't we? :wink:

Regards,
Kumarjit.
Pain is the best teacher, but very few attend his class..
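As an editorial aside: the encoding kumarjit describes can be sketched as a surrogate-key mapping. This is plain Python (not DataStage) and purely illustrative; the function and variable names are hypothetical. Note that both the input and the reference dataset would have to share one mapping for the join to still match.

```python
# Hypothetical sketch: map each distinct VARCHAR ADDRESS_ID to a compact
# integer surrogate key (unique per distinct value, losslessly reversible),
# join on the integer, then decode back to the varchar afterwards.

def encode(ids):
    """Assign a unique integer to each distinct ADDRESS_ID."""
    key_for = {}   # ADDRESS_ID -> surrogate integer
    id_for = []    # surrogate integer -> ADDRESS_ID (for decoding later)
    encoded = []
    for addr in ids:
        if addr not in key_for:
            key_for[addr] = len(id_for)
            id_for.append(addr)
        encoded.append(key_for[addr])
    return encoded, id_for

addresses = ["7767657567575089/001/01", "7767657567575089/001/02",
             "7767657567575089/001/01"]
codes, lookup = encode(addresses)
decoded = [lookup[c] for c in codes]
assert decoded == addresses   # round-trips losslessly
```

As the replies in this thread point out, shrinking the key is unlikely to be the fix here; this only shows that such an encoding is possible in principle.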
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

While there are a couple of algorithmic approaches to making your ADDRESS_ID more compact to fit in fewer bytes, the real reason for the join taking a long time has nothing at all to do with the size of the join key and is all about the partitioning and sorting of your two input links.

As jhmckeever has already stated, you need to ensure that both of your incoming links are partitioned and sorted identically and that the data is sorted on ADDRESS_ID.
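ArndW's point can be illustrated with a toy model in plain Python (hypothetical, not OSH): hash-partition both inputs on the join key, sort each partition, then merge-join partition by partition. When both links are partitioned and sorted identically, the join itself is a single linear pass.

```python
# Toy model of a parallel sorted-merge join. Names are hypothetical.

def partition(rows, nparts, key=lambda r: r[0]):
    """Hash-partition rows on the join key, then sort each partition."""
    parts = [[] for _ in range(nparts)]
    for row in rows:
        parts[hash(key(row)) % nparts].append(row)
    return [sorted(p) for p in parts]

def merge_join(left, right):
    """Single-pass merge join of two key-sorted inputs.
    Assumes keys are unique on each link, as with a reference lookup."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:                     # matching keys: emit the joined row
            out.append(left[i] + right[j][1:])
            i += 1
            j += 1
    return out

lhs = [("A/001/01", "main"), ("B/002/01", "input")]
rhs = [("A/001/01", "ref-A"), ("C/003/01", "ref-C")]
joined = []
# Same hash partitioning on both sides guarantees matching keys meet
# in the same partition, so each partition pair can be joined alone.
for lp, rp in zip(partition(lhs, 2), partition(rhs, 2)):
    joined.extend(merge_join(lp, rp))
assert joined == [("A/001/01", "main", "ref-A")]
```

If either side arrives unpartitioned or unsorted, the framework has to repartition and sort it first, which is where the time goes.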
kumarjit
Participant
Posts: 99
Joined: Fri Oct 12, 2012 7:47 am
Location: Kolkata

Post by kumarjit »

I already have the partitioning and sorting done IDENTICALLY on both links to the Join stage, on the ADDRESS_ID column, and still the job runs long enough to raise my eyebrow...
As Andrew said, can I have any of the algorithms to compact my ADDRESS_ID field size? I just need to give it a try and check whether it enhances the performance of the job.

Regards,
Kumarjit.
Pain is the best teacher, but very few attend his class..
eph
Premium Member
Posts: 110
Joined: Mon Oct 18, 2010 10:25 am

Post by eph »

Hi,

Instead of partitioning and sorting on the links, you could do it in an explicit Sort stage and try tuning its Restrict Memory Usage option to a higher value.

Eric
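Eric's suggestion rests on how a memory-restricted sort works: data is sorted in memory-sized runs which are then stream-merged, so a larger memory limit means fewer, larger runs and less merge work. A toy sketch in plain Python (hypothetical, not a DataStage Sort stage):

```python
# Toy external sort: sort fixed-size chunks in memory, then k-way merge
# the sorted runs. Raising the in-memory limit shrinks the number of runs.
import heapq

def external_sort(rows, max_rows_in_memory):
    runs = []
    for start in range(0, len(rows), max_rows_in_memory):
        # Each run fits within the (simulated) memory limit.
        runs.append(sorted(rows[start:start + max_rows_in_memory]))
    return list(heapq.merge(*runs))   # stream-merge the sorted runs

data = ["7767/003/01", "7767/001/01", "7767/002/01", "7767/000/01"]
assert external_sort(data, 2) == sorted(data)
```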
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

kumarjit wrote: As Andrew said, can I have any of the algorithms to compact my ADDRESS_ID field size? I just need to give it a try and check whether it enhances the performance of the job.
It won't. Your issue isn't about the size of the join key.
kumarjit also wrote: I already have the partitioning and sorting done IDENTICALLY on both links to the Join stage, on the ADDRESS_ID column, and still the job runs long enough to raise my eyebrow...
Does the job know that? It sure doesn't sound like it if the job gets "stuck" at that stage. Have you dumped the score to verify that it isn't adding another sort, as John suggested? If it is, it may be as simple as adding a Sort stage before the join with its key mode set to "Don't Sort (Previously Sorted)" to solve the problem.


Edit: People hardly ever bother to rate posts around here and yet mine, which couldn't be more "on topic" gets rated "off-topic/superfluous". Awesome. :roll:
Last edited by chulett on Wed Nov 14, 2012 8:16 am, edited 2 times in total.
-craig

"You can never have too many knives" -- Logan Nine Fingers
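The "Don't Sort (Previously Sorted)" setting chulett describes amounts to declaring that the link is already key-ordered, so no new sort operator needs to be inserted. A toy sketch in plain Python (hypothetical names, not DataStage internals):

```python
# Sketch of "Don't Sort (Previously Sorted)": trust the declaration that
# the input is already ordered on the join key and skip the re-sort, but
# fail loudly if the declaration is wrong rather than produce a bad join.

def prepare_for_join(rows, key=lambda r: r[0], previously_sorted=False):
    """Return rows ready for a merge join on `key`."""
    is_ordered = all(key(a) <= key(b) for a, b in zip(rows, rows[1:]))
    if previously_sorted:
        if not is_ordered:
            raise ValueError("declared sorted, but keys are out of order")
        return rows                   # no extra sort operator inserted
    # Otherwise behave like the framework: re-sort unless already ordered.
    return rows if is_ordered else sorted(rows, key=key)

rows = [("A/1", "x"), ("B/2", "y")]
assert prepare_for_join(rows, previously_sorted=True) is rows
```

Dumping the score is still the way to confirm whether the framework inserted its own tsort despite the upstream sort.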