Strange Problem - Join Performance Inconsistencies

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
nishadkapadia
Charter Member
Charter Member
Posts: 47
Joined: Fri Mar 18, 2005 5:59 am

Strange Problem - Join Performance Inconsistencies

Post by nishadkapadia »

Hi,

I have a simple job flow design/
Seq File(47 Mill Rec) is getting joined with Another Seq File.(20 Mill ).After doing join, there are some transformation and my target file is dataset file.
Since we are dealing with huge volumes, we have sort stage using Hash partition on join keys in both the sequential files. Also taking care of keeping the partition to Same.

We are able to execute the job in a jiffy getting throughput of 25000 rec / sec from Join stage,however today the throughput from join stage is as slow as 250 records per sec.

Nothing is changed in the job , neither the parameters.
Also have ran the job with APT_NO_SORT_INSERTION/APT_NO_PART_INSERTION params ,however without any luck in performance.

Am i missing something here.

Thanks in anticipation.
gpatton
Premium Member
Premium Member
Posts: 47
Joined: Mon Jan 05, 2004 8:21 am

Post by gpatton »

What other tasks are going on - on the server while you were running today?

Did someone change the configuration file your job is using?
jdmiceli
Premium Member
Premium Member
Posts: 309
Joined: Wed Feb 22, 2006 10:03 am
Location: Urbandale, IA

Other possibilities...

Post by jdmiceli »

Hi all!

Other things to check:

1. Target indexes - if you constantly move data into the target, then the indexes may need to be refreshed/rebuilt.

2. Source indexes - same as above

3. Index structures - if you are using user defined queries, make certain the database optimizer has one or more indexes to follow.

4. Disk constraints maybe...

5. Limitations on a login's availability for resource usage (depending on OS and DB engine)

Those are just some things that come to mind from a DBA point of view as opposed to a DS pov. Hope it helps!

Bestest!
Bestest!

John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services


"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

What's your hardware setup? A single SMP box? Use glance, vmstat, memstat, top, prstat, whatever you can to monitor cpu, disk, and memory utilization. If something else is running, you're completing for resources.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
amitava.ghosh@gmail.com
Participant
Posts: 1
Joined: Wed Aug 23, 2006 6:15 pm

Post by amitava.ghosh@gmail.com »

Hi,

You are joining 2 sequential files with huge data. As DataSatge is reading from sequential files, this is anyway serial. In this case the conversion is Sequential to Parallel.

Insert a "Column Import" Stage after each sequential file and turn RCP on. If you have 8 nodes then you can expect a lot of performance benefit by inserting the column import stage. In this case the conversion will be from parallel to parallel.

Turn the "APT_DUMP_SCORE" on and check the job log with and without Column Import Stage. You will see the difference.

Regards
Amitava G
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Let me take the call behalf of Ray this time, "Nothing changed" is not the right answere. You need to find out what is actully changed between the two run.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
sanjay
Premium Member
Premium Member
Posts: 203
Joined: Fri Apr 23, 2004 2:22 am

Post by sanjay »

Hi
what turn RCP on will do in this case.

Thanks
Sanjay

amitava.ghosh@gmail.com wrote:Hi,

You are joining 2 sequential files with huge data. As DataSatge is reading from sequential files, this is anyway serial. In this case the conversion is Sequential to Parallel.

Insert a "Column Import" Stage after each sequential file and turn RCP on. If you have 8 nodes then you can expect a lot of performance benefit by inserting the column import stage. In this case the conversion will be from parallel to parallel.

Turn the "APT_DUMP_SCORE" on and check the job log with and without Column Import Stage. You will see the difference.

Regards
nishadkapadia
Charter Member
Charter Member
Posts: 47
Joined: Fri Mar 18, 2005 5:59 am

Post by nishadkapadia »

Thank you all for pointers.
Will try and will post on this thread my findings.

sanjay wrote:Hi
what turn RCP on will do in this case.

Thanks
Sanjay

amitava.ghosh@gmail.com wrote:Hi,

You are joining 2 sequential files with huge data. As DataSatge is reading from sequential files, this is anyway serial. In this case the conversion is Sequential to Parallel.

Insert a "Column Import" Stage after each sequential file and turn RCP on. If you have 8 nodes then you can expect a lot of performance benefit by inserting the column import stage. In this case the conversion will be from parallel to parallel.

Turn the "APT_DUMP_SCORE" on and check the job log with and without Column Import Stage. You will see the difference.

Regards
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

That nothing has changed in the job does not mean that nothing has changed on the system. For example it will be faster to insert 50 million rows into an empty table than into a table that already has very many rows. Newly added indexes, triggers and constraints will also slow load performance. Query performance might be slowed by the reverse; removed or invalidated indexes, removed ability to use precomputed aggregates, removed indexes on join columns, and so on.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
nishadkapadia
Charter Member
Charter Member
Posts: 47
Joined: Fri Mar 18, 2005 5:59 am

Post by nishadkapadia »

Hi,

Yes, different executions do change the system. I need to be more careful.

There was some background process runnning on the AIX box, resulting in the box. Got it sorted out.

Thanks all once again.
Post Reply