
Strange Problem - Join Performance Inconsistencies

Posted: Wed Aug 23, 2006 9:50 am
by nishadkapadia
Hi,

I have a simple job flow design:
A sequential file (47 million records) is joined with another sequential file (20 million records). After the join there are some transformations, and the target is a Data Set.
Since we are dealing with huge volumes, we have a Sort stage using Hash partitioning on the join keys for both sequential files, and we take care to keep the downstream partitioning set to Same.

Normally we can execute the job in a jiffy, getting a throughput of 25,000 records/sec from the Join stage; today, however, the throughput from the Join stage is as slow as 250 records/sec.

Nothing has changed in the job, nor in the parameters.
I have also run the job with the APT_NO_SORT_INSERTION and APT_NO_PART_INSERTION environment variables set, but with no luck on performance.
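
For reference, this is roughly how we ran that test (a sketch only: MyProject and JoinJob are placeholder names, and both variables must first be added to the job as environment-variable parameters):

    # Placeholder project/job names; the two APT variables are assumed
    # to be defined as environment-variable job parameters.
    dsjob -run \
        -param \$APT_NO_SORT_INSERTION=True \
        -param \$APT_NO_PART_INSERTION=True \
        -jobstatus MyProject JoinJob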

Am I missing something here?

Thanks in anticipation.

Posted: Wed Aug 23, 2006 11:05 am
by gpatton
What other tasks were going on on the server while you were running today?

Did someone change the configuration file your job is using?
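
For comparison, a minimal one-node configuration file looks something like this (hostname and paths are placeholders); if the file your job points at was swapped from a multi-node one to something like this, throughput would drop sharply:

    {
        node "node1" {
            fastname "your_hostname"
            pools ""
            resource disk "/ds/data" {pools ""}
            resource scratchdisk "/ds/scratch" {pools ""}
        }
    }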

Other possibilities...

Posted: Wed Aug 23, 2006 1:48 pm
by jdmiceli
Hi all!

Other things to check:

1. Target indexes - if you constantly move data into the target, then the indexes may need to be refreshed/rebuilt.

2. Source indexes - same as above

3. Index structures - if you are using user-defined queries, make certain the database optimizer has one or more indexes to follow.

4. Disk constraints, maybe (see the quick check after this list)...

5. Limitations on a login's availability for resource usage (depending on OS and DB engine)
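
Re point 4, a quick disk sanity check (a sketch assuming standard AIX/Unix utilities; the sampling interval is arbitrary):

    df -k            # free space - watch the scratch and dataset filesystems
    iostat -d 5 3    # per-disk activity, to spot a saturated disk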

Those are just some things that come to mind from a DBA point of view as opposed to a DataStage one. Hope it helps!

Bestest!

Posted: Wed Aug 23, 2006 2:08 pm
by kcbland
What's your hardware setup? A single SMP box? Use glance, vmstat, memstat, top, prstat, whatever you can to monitor CPU, disk, and memory utilization. If something else is running, you're competing for resources.
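
For example, a quick pass with the usual tools (sampling intervals are arbitrary; substitute whichever utilities your platform has):

    vmstat 5 5                          # CPU run queue and memory pressure
    ps aux | sort -rn -k3 | head -10    # top CPU consumers right now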

Posted: Wed Aug 23, 2006 6:33 pm
by amitava.ghosh@gmail.com
Hi,

You are joining two sequential files with huge data volumes. As DataStage reads a sequential file serially, that read is serial in any case; here the conversion is sequential to parallel.

Insert a Column Import stage after each sequential file and turn runtime column propagation (RCP) on. If you have eight nodes, you can expect a significant performance benefit from the Column Import stage, because the conversion then becomes parallel to parallel.

Turn APT_DUMP_SCORE on and compare the job log with and without the Column Import stage; you will see the difference.
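
For example (a sketch: it assumes $APT_DUMP_SCORE has been added to the job as an environment-variable parameter, and MyProject/JoinJob are placeholder names):

    dsjob -run -param \$APT_DUMP_SCORE=True -jobstatus MyProject JoinJob
    dsjob -logsum MyProject JoinJob     # the score appears under "main_program:"

The score lists each operator, how many nodes it runs on, and any tsort or partitioner operators the engine inserted.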

Regards

Posted: Wed Aug 23, 2006 8:12 pm
by kumar_s
Let me take the call on behalf of Ray this time: "Nothing changed" is not the right answer. You need to find out what actually changed between the two runs.

Posted: Wed Aug 23, 2006 8:16 pm
by sanjay
Hi
What will turning RCP on do in this case?

Thanks
Sanjay

amitava.ghosh@gmail.com wrote:Hi,

You are joining two sequential files with huge data volumes. As DataStage reads a sequential file serially, that read is serial in any case; here the conversion is sequential to parallel.

Insert a Column Import stage after each sequential file and turn runtime column propagation (RCP) on. If you have eight nodes, you can expect a significant performance benefit from the Column Import stage, because the conversion then becomes parallel to parallel.

Turn APT_DUMP_SCORE on and compare the job log with and without the Column Import stage; you will see the difference.

Regards

Posted: Wed Aug 23, 2006 11:45 pm
by nishadkapadia
Thank you all for the pointers.
I will try them and post my findings on this thread.

sanjay wrote:Hi
What will turning RCP on do in this case?

Thanks
Sanjay

Posted: Thu Aug 24, 2006 12:39 am
by ray.wurlod
That nothing has changed in the job does not mean that nothing has changed on the system. For example, it will be faster to insert 50 million rows into an empty table than into a table that already holds very many rows. Newly added indexes, triggers and constraints will also slow load performance. Query performance might be slowed by the reverse: removed or invalidated indexes, loss of the ability to use precomputed aggregates, removed indexes on join columns, and so on.

Posted: Thu Aug 24, 2006 9:00 am
by nishadkapadia
Hi,

Yes, the system can change between executions. I need to be more careful.

There was a background process running on the AIX box, resulting in the box being overloaded. Got it sorted out.

Thanks all once again.