Lengthy start up time for 90% of datastage jobs in 11.5.0.2

IBM Analytics Champion 2009 - 2020 · Post by **asorrell** » Tue Aug 01, 2017 12:12 pm

Do you have the new Work Load Manager turned off?

chulett · Post by **chulett** » Tue Aug 01, 2017 12:15 pm

It's a bit buried in the original post...

arvind_ds wrote:Work load management(WLM) is disabled in our environment.

JRodriguez · Post by **JRodriguez** » Thu Aug 10, 2017 10:03 am

Hi Arvind_ds,
Wondering if you get to resolved the issue? A lot of us could learn from this situation ....

We have a similar environment with 11.5.0.1 and Oracle 11g for XMETA, and we are planning to apply Fix Patch 2 and migrating the schemas to Oracle 12 c soon, just to provide the new features on the Governance Catalog to our Governance people.... so trying to find out what was the root cause of your issues and if it was related to the oracle 12c and/or Fix Patch 2

Please let us know

arvind_ds · Post by **arvind_ds** » Sat Aug 12, 2017 3:23 am

Problem still NOT resolved. I will make sure to update this post once we get a permanent solution.

Sev 1 PMR in place. Following up closely with IBM Customer Support. Exchanged lot of log files in last 2 weeks with them.

They suggested to do disk re-configuration on the AIX server where DS engine is installed, making it similar to what we had in the old 9.1.2 environment.

We have done the disk re-configuration and the situation has improved slightly(20% gain in performance).

Now the Customer support is suggesting to increase the CPUs on the datastage engine by 50%. This is in progress.

Will keep you all posted.

chulett · Post by **chulett** » Sat Aug 12, 2017 1:56 pm

So... as a scope check is just startup time the issue? Meaning, once it actually gets going disk access isn't an issue? I only ask because the one time in the past when we had similar issues which required 'reconfiguring the file system' we were using, disk access was crap all around. And everything sprang back to life once the file system settings were set correctly.

PaulVL · Post by **PaulVL** » Mon Aug 14, 2017 7:52 am

Adding more CPU is nuts and the IBM guy who is recommending that is a newb. Not to mention you've just signed up for 50% more licensing cost because of a product shortcoming. This is not LOAD based issue. I bet if you nmon your box and look that the CPU load during your slowness you'll prove that.

Thumbs down on the increase in your cores.

arvind_ds · Post by **arvind_ds** » Mon Aug 14, 2017 12:49 pm

Ok PaulVL. So do you think that its a product short coming? Kindly share some details if you have also observed similar issue in the upgraded version of InfoSphere Information Server 11.5.0.2

We are literally struggling. Appreciate your inputs.

PaulVL · Post by **PaulVL** » Mon Aug 14, 2017 2:27 pm

Well, based upon your described symptoms, you slow down over time. You did not indicate that there were a lot of concurrently running jobs running at the time. That implies that your CPU should not be pushing it's limits.

Run NMON on your box and capture the stats every X amount of minutes. "Reset" your environment to make it fast again... then let it slow down over time. Afterwards you look at the CPU consumption of the box during that timeframe and determine if you need more cores.

I suspect that you will not need more cores.

One thing that would be helpful is to ensure we are talking about a common view of what you are describing as slow startup time.

Please detail that interpretation, and be descriptive.

- did the DSD.RUN start?
- Did the osh start up?
- What is the log saying?
- Any database connections linked yet?
- etc...

ray.wurlod · Post by **ray.wurlod** » Thu Aug 17, 2017 12:35 am

You might also consider enabling reporting environment variables such as APT_STARTUP_STATUS and APT_PLAYER_TIMINGS to capture some figures about how long things are taking and resource consumption.

arvind_ds · Post by **arvind_ds** » Thu Aug 17, 2017 12:49 pm

Thank you all for your valuable inputs. We tried settings below variables at project level, as advised by PMR support engineers.

APT_DEBUG_CLEANUP=1
APT_NO_JOBMON=1
APT_SHOW_COMPONENT_CALLS=1
APT_PM_PLAYER_TIMING=1
APT_NO_PM_SIGNAL_HANDLERS=1
CORE_NAMING=true
APT_DUMP_SCORE=true
APT_PM_SHOW_PIDS=true
APT_STARTUP_STATUS=true
CC_MSG_LEVEL=2
APT_DISABLE_COMBINATION=true

The problem is still not resolved completely. After doing multiple tests and sharing the log files(director logs and stack trace logs) with PMR support, the issue is pointing more towards disk configuration across all file systems used in the Engine tier.

First round of disk re-configuration is completed and we have ovserved 30 to 40 % improvement in performance of the jobs. Jobs are no more hanging now post disk re-configuration BUT yes they are still slow when compared to 9.1.2 environment. Both the environments are identical wrt capacity now.

We are targeting to further fine tune the disk re-configuration. We are aiming to setup the non shareable HDD disks across different file systems on the engine tier(below file systems).

(1) Scratch : Scratch file system
(2) DataSets : datasets file system
(3) TMPDIR : File system corresponding to TMPDIR variable
(4) Project Plus : This one is used to store application specific data files and scripts.
(5) Project : DS projects are created under this file system
(6) /opt/IBM/InformationServer : Datastage binaries on this one.

At present, some of the file systems are using shared disks underneath. Eg datasets and scratch file systems are sharing same set of HDD disks(~10 disks each of size 500 GB). Similarly Project plus and Project and TMPDIR file systems are sharing another set of disks(But different from scratch and datasets disks).

In addition, we are also targeting to replace HDD with SSD.

Will keep you posted.

PaulVL · Post by **PaulVL** » Thu Aug 17, 2017 3:42 pm

I still don't think that is the issue. You said your slowness happens over time.

Are you sure that at the time of slowness you don't have multiple jobs running?

Meaning... once you go slow.. .if you only had one job running... would he be still be slow?

arvind_ds · Post by **arvind_ds** » Fri Aug 18, 2017 6:45 am

If we leave the system as such at the time of slowness, the jobs used to run till completion, the only issue is that if any job which is supposed to finish in eg 30 minutes(baseline 9.1.2 run of same job against same data volume), it takes 10X more time in 11.5

Yes, at the time of slowness - multiple jobs are running in parallel, slowness happens with time BUT now after disk re-configuration, the slowness is still there BUT it has reduced from 10X to 2X wrt time.
No jobs gets aborted when the system experiences slowness, it just take more time to finish.

Another thing is that whenever it goes slow, the jobs will still appear in RUNNING stage but it will appear to end users as if the system is in HUNG stage because the job monitor will not show any progress for a long time.

At slowness, the jobs will finish after taking longer time and when only 1 or 2 jobs are left in the batch then they will complete nicely. These 1 or 2 jobs in the end of the batch don't experience any more slowness.

We tried to browse through IGC at the time of slowness, we queried XMETA with all possible options in IGC, no slowness observed there. We executed ISALite general health check also(at the time of slowness), it finished fine within 10 minutes. No issues reported in the ISALite report either.

PaulVL · Post by **PaulVL** » Fri Aug 18, 2017 8:05 am

Are there a lot of tourists in the environment at the time of slowness? (dsapi slaves from your ops folks all seeking to eyeball the slow jobs)

attu · Post by **attu** » Fri Aug 18, 2017 9:56 pm

How many instances of a job are you running? Have you captured the I/O when the problem occurs?

Thanks

arvind_ds · Post by **arvind_ds** » Fri Aug 18, 2017 11:21 pm

Tourist count at the time of slowness is ~15 and total number of jobs running in parallel at that time is around 15 to 20 with 25% of the jobs being multiple instance jobs.

DSXchange

Lengthy start up time for 90% of datastage jobs in 11.5.0.2

Re: Lengthy start up time for 90% of datastage jobs in 11.5.