Difference in Record Count in Sort/Join Stages

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
vvsaucse
Participant
Posts: 16
Joined: Thu Aug 27, 2009 11:23 pm
Location: BAngalore

Difference in Record Count in Sort/Join Stages

Post by vvsaucse »

Hi, I am using DS V 8.1

In this, there is a requirement wherein i am using the output of Sort Stage to be provided as the Right Link for Join Stage for an INNER JOIN. When the job is being run, i see that in JOB MONITOR, the output link from Sort is producing 800K rows where as the Inner Join Rec Link (Same link) is having only 700K odd rows. Any clue as to why this difference?

Eg : Sort Stage Output Link Name : lnk_srt_Join_RecCount

The record count during run time when viewed from Monitor: 800K

Join Stage Right Link Name (Inner Join) : lnk_srt_Join_RecCount

The record count during run time when viewed from Monitor: 750K.

Regards
Subbu
TCS BAngalore
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

An inner join, by definition, only outputs rows when a match is encountered between the left and other input links, and is typically driven by the left input. If there is no match to a row in the left input, there will not be a record output.

Typical reasons for what you are seeing with an inner join include:

1) Not as many actual matches as you are expecting
2) The partitioning is incorrect (matching records don't appear in the same partition)

How many rows are present in your left input link?
Is your data partitioned correctly so that all rows for a given key value will be placed into the same partition?

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
vvsaucse
Participant
Posts: 16
Joined: Thu Aug 27, 2009 11:23 pm
Location: BAngalore

Post by vvsaucse »

@James: Thanks for the quick response.

The Left Link has around 10K Records. (Comparitively very less to the reference link) in this particular case. There is no problem with the output as all the Source records have found valid matches in the reference. The partitioning also is as per requirement (HASH) for the specific columns.

However, the question remains, why there is a discrepancy in Record Count Display (Job Monitor) on the same link.

Sort Stage Output Link Name : lnk_srt_Join_RecCount

Join Stage Right Link Name (Inner Join) : lnk_srt_Join_RecCount

Since this is the same link thats connecting the Sort Stage and Join Stage.
Subbu
TCS BAngalore
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Thank you...I misunderstood the question. Have you set the APT_RECORD_COUNTS environment variable to show input/output counts for the various operators in the job logs?

One other possibility in your situation is:

Once all Left Input records have been processed, the remaining records on the right input may have been abandoned...there is no need to process them as there's nothing more to match to. As you only have 10,000 records on the left input, this is a distinct possibility.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
vvsaucse
Participant
Posts: 16
Joined: Thu Aug 27, 2009 11:23 pm
Location: BAngalore

Post by vvsaucse »

Yeps. That seems to be the reason. Got it checked. Thnks James.
Subbu
TCS BAngalore
Post Reply