Strange Problem - Join Performance Inconsistencies
Moderators: chulett, rschirm, roy
-
- Charter Member
- Posts: 47
- Joined: Fri Mar 18, 2005 5:59 am
Hi,
I have a simple job flow design:
A Sequential File stage (47 million records) is joined with another Sequential File stage (20 million records). After the join there are some transformations, and my target is a Data Set stage.
Since we are dealing with huge volumes, we have a Sort stage using Hash partitioning on the join keys for both sequential files, and we take care to keep the partitioning set to Same downstream.
We used to execute the job in a jiffy, getting a throughput of 25,000 records/sec from the Join stage; today, however, the throughput from the Join stage is as slow as 250 records/sec.
Nothing has changed in the job, nor in the parameters.
I have also run the job with the APT_NO_SORT_INSERTION and APT_NO_PART_INSERTION parameters set, but without any luck on performance.
Am I missing something here?
Thanks in anticipation.
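(Editor's note: for reference, a minimal sketch of setting the two variables mentioned above in the job's environment before a run; the echo lines are only there to confirm the values, and the actual job invocation is omitted.)

```shell
# Minimal sketch: disable DataStage's automatically inserted tsort and
# partitioner operators, so the job relies only on the explicit Sort
# stages and Hash partitioning defined in the design.
export APT_NO_SORT_INSERTION=1
export APT_NO_PART_INSERTION=1

# Confirm the settings before invoking the job (invocation omitted).
echo "APT_NO_SORT_INSERTION=$APT_NO_SORT_INSERTION"
echo "APT_NO_PART_INSERTION=$APT_NO_PART_INSERTION"
```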
Other possibilities...
Hi all!
Other things to check:
1. Target indexes - if you constantly move data into the target, the indexes may need to be refreshed or rebuilt.
2. Source indexes - same as above.
3. Index structures - if you are using user-defined queries, make certain the database optimizer has one or more indexes to follow.
4. Possible disk constraints.
5. Limitations on a login's resource usage (depending on the OS and DB engine).
Those are just some things that come to mind from a DBA point of view as opposed to a DataStage point of view. Hope it helps!
Bestest!
John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services
"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"
What's your hardware setup? A single SMP box? Use glance, vmstat, memstat, top, prstat, whatever you can to monitor CPU, disk, and memory utilization. If something else is running, you're competing for resources.
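(Editor's note: a minimal sketch of the kind of spot checks meant here, run while the slow job is executing. Exact tools vary by platform, glance on HP-UX, prstat on Solaris; the /tmp path is just an example, check the disk/scratchdisk paths in your APT configuration file.)

```shell
# Is the box busier than usual? Load averages are a quick first signal.
uptime

# Free space on the scratch/resource disk; a nearly full scratch disk
# slows tsort dramatically. Path is an example only.
df -k /tmp
```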
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
-
- Participant
- Posts: 1
- Joined: Wed Aug 23, 2006 6:15 pm
Hi,
You are joining two sequential files with huge data volumes. As DataStage reads from sequential files, the read is serial, so the conversion at the join is sequential-to-parallel.
Insert a Column Import stage after each Sequential File stage and turn RCP (runtime column propagation) on. If you have 8 nodes, you can expect a significant performance benefit from the Column Import stage, because the conversion becomes parallel-to-parallel.
Turn APT_DUMP_SCORE on and check the job log with and without the Column Import stage. You will see the difference.
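(Editor's note: a minimal sketch of enabling the score dump from the environment before a run; the variable can equally be added per-job or per-project in Administrator.)

```shell
# APT_DUMP_SCORE=1 makes the parallel framework write the job "score"
# to the log: which operators run, on how many nodes, and where any
# sorts or partitioners were inserted. Comparing the score with and
# without the Column Import stage shows whether the import runs in
# parallel.
export APT_DUMP_SCORE=1
echo "APT_DUMP_SCORE=$APT_DUMP_SCORE"
```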
Regards
Amitava G
Hi
What will turning RCP on do in this case?
Thanks
Sanjay
-
- Charter Member
- Posts: 47
- Joined: Fri Mar 18, 2005 5:59 am
Thank you all for the pointers.
Will try them and post my findings on this thread.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
That nothing has changed in the job does not mean that nothing has changed on the system. For example, it will be faster to insert 50 million rows into an empty table than into a table that already has very many rows. Newly added indexes, triggers and constraints will also slow load performance. Query performance might be slowed by the reverse: removed or invalidated indexes, removed ability to use precomputed aggregates, removed indexes on join columns, and so on.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Charter Member
- Posts: 47
- Joined: Fri Mar 18, 2005 5:59 am