Lengthy start up time for 90% of datastage jobs in 11.5.0.2
Re: Lengthy start up time for 90% of datastage jobs in 11.5.
It's a bit buried in the original post...
arvind_ds wrote: Workload management (WLM) is disabled in our environment.
-craig
"You can never have too many knives" -- Logan Nine Fingers
Hi Arvind_ds,
Wondering if you got to resolve the issue? A lot of us could learn from this situation.
We have a similar environment with 11.5.0.1 and Oracle 11g for XMETA, and we are planning to apply Fix Pack 2 and migrate the schemas to Oracle 12c soon, mainly to provide the new Governance Catalog features to our Governance people. So we are trying to find out what the root cause of your issues was, and whether it was related to Oracle 12c and/or Fix Pack 2.
Please let us know.
Julio Rodriguez
ETL Developer by choice
"Sure we have lots of reasons for being rude - But no excuses"
Problem still NOT resolved. I will make sure to update this post once we get a permanent solution.
Sev 1 PMR in place. We are following up closely with IBM Customer Support and have exchanged a lot of log files with them over the last two weeks.
They suggested re-configuring the disks on the AIX server where the DS engine is installed, making the layout similar to what we had in the old 9.1.2 environment.
We have done the disk re-configuration and the situation has improved slightly (a 20% gain in performance).
Now Customer Support is suggesting that we increase the CPUs on the DataStage engine by 50%. This is in progress.
Will keep you all posted.
Arvind
So... as a scope check, is startup time the only issue? Meaning, once a job actually gets going, disk access isn't a problem? I only ask because the one time in the past when we had similar issues that required 'reconfiguring the file system' we were using, disk access was crap all around. And everything sprang back to life once the file system settings were set correctly.
-craig
"You can never have too many knives" -- Logan Nine Fingers
Adding more CPU is nuts, and the IBM guy who is recommending that is a newb. Not to mention you've just signed up for 50% more licensing cost because of a product shortcoming. This is not a load-based issue. I bet if you run nmon on your box and look at the CPU load during your slowness, you'll prove that.
Thumbs down on the increase in your cores.
Well, based upon your described symptoms, you slow down over time. You did not indicate that there were a lot of jobs running concurrently at the time. That implies that your CPU should not be pushing its limits.
Run nmon on your box and capture the stats every X minutes. "Reset" your environment to make it fast again... then let it slow down over time. Afterwards, look at the CPU consumption of the box during that timeframe and determine whether you need more cores.
I suspect that you will not need more cores.
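The capture described above can be scripted. A minimal sketch, assuming nmon is installed on the engine host and that a 5-minute interval, a 24-hour window, and /tmp/nmon as the output directory are acceptable choices:

```shell
#!/bin/sh
# Sketch: record nmon snapshots so CPU load during the slow window can
# be compared with the fast window right after a "reset". Output
# directory, interval, and duration are assumptions -- adjust to taste.
OUTDIR=/tmp/nmon                       # hypothetical output directory
INTERVAL=300                           # seconds between snapshots (5 min)
HOURS=24                               # total capture window
COUNT=$(( HOURS * 3600 / INTERVAL ))   # number of snapshots (288)
mkdir -p "$OUTDIR"
# -f: spreadsheet-format output file, -s: interval in seconds,
# -c: snapshot count, -m: directory to write the .nmon file into.
# Guarded so the snippet is harmless where nmon is absent.
if command -v nmon >/dev/null 2>&1; then
    nmon -f -s "$INTERVAL" -c "$COUNT" -m "$OUTDIR"
else
    echo "nmon not found; install it or run this on the engine host"
fi
```

The resulting .nmon file can then be charted (e.g. with the nmon analyser spreadsheet) to compare CPU use in the fast and slow periods.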
One thing that would be helpful is to ensure we are talking about a common view of what you are describing as slow startup time.
Please detail that interpretation, and be descriptive.
- Did DSD.RUN start?
- Did the osh start up?
- What is the log saying?
- Any database connections linked yet?
- etc...
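The first two questions above can be answered on a box that is mid-slowdown with a quick process check. A sketch; the process names are the standard engine processes, and the grep patterns are illustrative:

```shell
#!/bin/sh
# Count processes whose command line matches a pattern. The [b]racket
# trick keeps the grep process itself out of its own results.
count_procs() {
    ps -ef | grep -c "$1"
}

# During a slow startup, check how far the run sequence has progressed:
echo "DSD.RUN (job control) processes: $(count_procs '[D]SD\.RUN')"
echo "osh (parallel engine) processes: $(count_procs '[o]sh ')"
```

If DSD.RUN is present but no osh processes appear for a long time, the delay is in startup rather than in the parallel job itself.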
You might also consider enabling reporting environment variables such as APT_STARTUP_STATUS and APT_PM_PLAYER_TIMING to capture some figures about how long things are taking and about resource consumption.
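A minimal sketch of enabling these reports for a single run from the command line. The project and job names are placeholders, and the dsjob invocation is guarded so the snippet does nothing where the CLI is absent:

```shell
#!/bin/sh
# Print progress messages for each phase of parallel job startup
export APT_STARTUP_STATUS=1
# Emit a per-player CPU / elapsed-time report at job end
export APT_PM_PLAYER_TIMING=1

# MyProject and MyJob are placeholder names -- substitute your own.
if command -v dsjob >/dev/null 2>&1; then
    dsjob -run -jobstatus MyProject MyJob
else
    echo "dsjob not on PATH; source the DSEngine environment first"
fi
```

Setting the variables in the wrapper (rather than at project level) keeps the extra logging scoped to the run being diagnosed.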
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Thank you all for your valuable inputs. We tried setting the below variables at project level, as advised by the PMR support engineers.
APT_DEBUG_CLEANUP=1
APT_NO_JOBMON=1
APT_SHOW_COMPONENT_CALLS=1
APT_PM_PLAYER_TIMING=1
APT_NO_PM_SIGNAL_HANDLERS=1
CORE_NAMING=true
APT_DUMP_SCORE=true
APT_PM_SHOW_PIDS=true
APT_STARTUP_STATUS=true
CC_MSG_LEVEL=2
APT_DISABLE_COMBINATION=true
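For reference, the whole list can be applied to a project in one pass. The sketch below only prints the dsadmin commands rather than executing them, because the exact `dsadmin -envset` syntax should be verified against your Information Server version first; the project name is a placeholder:

```shell
#!/bin/sh
PROJECT=MyProject   # placeholder project name
set -- APT_DEBUG_CLEANUP=1 APT_NO_JOBMON=1 APT_SHOW_COMPONENT_CALLS=1 \
       APT_PM_PLAYER_TIMING=1 APT_NO_PM_SIGNAL_HANDLERS=1 CORE_NAMING=true \
       APT_DUMP_SCORE=true APT_PM_SHOW_PIDS=true APT_STARTUP_STATUS=true \
       CC_MSG_LEVEL=2 APT_DISABLE_COMBINATION=true
for kv in "$@"; do
    name=${kv%%=*}      # variable name
    value=${kv#*=}      # variable value
    # Print, don't run: confirm the dsadmin syntax for your version first.
    echo "dsadmin -envset $name -value $value $PROJECT"
done
```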
The problem is still not resolved completely. After doing multiple tests and sharing the log files (Director logs and stack trace logs) with PMR support, the issue points more towards the disk configuration across all file systems used in the Engine tier.
The first round of disk re-configuration is complete and we have observed a 30 to 40% improvement in job performance. Jobs are no longer hanging post re-configuration, BUT they are still slow compared to the 9.1.2 environment. Both environments are now identical with respect to capacity.
We are targeting to further fine-tune the disk configuration, aiming to set up non-shared HDD disks for the different file systems on the engine tier (listed below).
(1) Scratch : Scratch file system
(2) DataSets : datasets file system
(3) TMPDIR : File system corresponding to TMPDIR variable
(4) Project Plus : This one is used to store application specific data files and scripts.
(5) Project : DS projects are created under this file system
(6) /opt/IBM/InformationServer : DataStage binaries live on this one.
At present, some of the file systems are using shared disks underneath. E.g. the datasets and scratch file systems share the same set of HDDs (~10 disks, each 500 GB in size). Similarly, the Project Plus, Project, and TMPDIR file systems share another set of disks (different from the scratch and datasets disks).
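One way to confirm which hdisks back which file system on AIX is to walk the volume groups with the standard LVM commands. A sketch, guarded so it degrades gracefully off AIX; the shared-disk overlap is then read off the lsvg/lspv output:

```shell
#!/bin/sh
# Sketch: list logical volumes (with mount points) per volume group,
# then the hdisk -> volume group mapping, to spot file systems such as
# Scratch and DataSets that sit on the same physical disks.
show_fs_disks() {
    if command -v lsvg >/dev/null 2>&1; then
        for vg in $(lsvg -o); do      # active volume groups
            echo "== volume group: $vg =="
            lsvg -l "$vg"             # LVs with their mount points
        done
        lspv                          # hdisk -> VG assignments
    else
        echo "AIX LVM tools not found; run this on the engine host"
    fi
}
show_fs_disks
```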
In addition, we are also targeting to replace HDD with SSD.
Will keep you posted.
Arvind
If we leave the system as is at the time of slowness, the jobs run to completion; the only issue is that a job which is supposed to finish in, e.g., 30 minutes (the baseline 9.1.2 run of the same job against the same data volume) takes 10X more time in 11.5.
Yes, at the time of slowness multiple jobs are running in parallel, and the slowness builds up over time. Now, after the disk re-configuration, the slowness is still there, BUT it has reduced from 10X to 2X with respect to time.
No jobs get aborted when the system experiences slowness; they just take more time to finish.
Another thing is that whenever it goes slow, the jobs still appear in the RUNNING state, but to end users it looks as if the system is hung, because the job monitor does not show any progress for a long time.
During slowness the jobs eventually finish after taking longer, and when only 1 or 2 jobs are left in the batch they complete nicely. These last 1 or 2 jobs in the batch don't experience any slowness.
We tried browsing through IGC at the time of slowness and queried XMETA with all possible options in IGC; no slowness was observed there. We also executed the ISALite general health check (at the time of slowness); it finished fine within 10 minutes, and no issues were reported in the ISALite report either.
Arvind