
Fault type is 11 - Aggregator Stage Issue

Posted: Mon Sep 20, 2010 1:42 pm
by sohasaid
Dears,

We have a fresh DataStage v8.1.0.1 (Server Edition) installation on AIX v6. Jobs were exported from Windows Server 2003. After moving to the AIX environment, all imported jobs work fine except those that contain an Aggregator stage.

After reading around 8 million records from the source table (the whole table holds 9 million records), the job aborts and logs the following error after a reset in Director:

From previous run
DataStage Job 65 Phantom 20574
Abnormal termination of DataStage.
Fault type is 11. Layer type is BASIC run machine.
Fault occurred in BASIC program DSD.WriteLog at address 20


Job design is as follows:

Code:

DRS stage\Read --> Aggregator --> Transformer --> DRS stage\Write
In this Aggregator stage, I'm grouping by 5 columns and taking the maximum of a sixth column.

Here are our troubleshooting attempts:
1- After putting a Sort stage before the Aggregator stage, the job worked fine.
2- When grouping by only 2 or 3 columns, the job worked fine.

After these attempts we thought it was a memory issue, but the server has 16 GB of RAM! So we switched the DataStage user that runs the jobs from 'dsadm' to 'root' to ensure there's no memory allocation limit on that user, and the job still aborts.
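(In case it helps anyone checking the same thing, this is how per-user limits can be inspected on AIX; the user name is just ours:)

Code:

# On AIX, per-user resource limits live in /etc/security/limits;
# query a single user's data (heap) limit like this:
lsuser -a data dsadm     # data=-1 means unlimited
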

We also noticed another symptom in a different job:

Code:

DRS stage\Read --> Transformer --> DRS stage\Write
When the 'Array size' is set to a large value such as 5000 instead of 1, the job aborts. After resetting it to 1, the job works fine, which points the finger at memory again.

I still think it's a memory issue after these trials, but something is still missing.

Any help will be appreciated, and sorry for the long post.

P.S. I've searched the forum for 'Fault Type 11', but none of the posts helped.

Regards.

Posted: Mon Sep 20, 2010 2:15 pm
by ray.wurlod
It's not array size. Your Aggregator stage is running out of memory. That's why sorting the data removes the problem: sorted data lets the Aggregator stage use a more efficient memory-consumption strategy.
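
To see why, here's a quick shell analogy (not DataStage internals, just a sketch assuming pipe-delimited data with the five grouping columns first):

Code:

# With unsorted input, an aggregator must keep one entry in memory per
# distinct group until end-of-input. With sorted input, it only ever
# needs the current group:
sort -t'|' -k1,5 input.txt |
awk -F'|' '
  {
    key = $1 FS $2 FS $3 FS $4 FS $5
    if (NR > 1 && key != prev) print prev FS max   # group ended: emit it
    if (NR == 1 || key != prev) max = $6           # new group: reset the max
    else if ($6 + 0 > max + 0) max = $6            # same group: keep the max
    prev = key
  }
  END { if (NR > 0) print prev FS max }            # emit the final group
'

Memory use stays constant no matter how many groups there are, which is what the Sort stage buys you.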

Posted: Mon Sep 20, 2010 5:14 pm
by chulett
Not only do you have to sort your incoming data to relieve the memory pressure on the Aggregator stage, you also need to assert that sorted order in the stage itself. And lastly, you have to sort in an order that supports the grouping you are doing. Only if all three steps are done properly can the Aggregator handle pretty much any amount of input data.

Posted: Wed Sep 22, 2010 4:40 am
by sohasaid
@Ray, we've discovered that it's a heap allocation error on AIX. This IBM technote says that you don't have to change the 'LDR_CNTRL' parameter in the 'dsenv' file for Server Edition, as opposed to the parallel edition:

http://www-01.ibm.com/support/docview.w ... wg21411997

We even tried setting 'LDR_CNTRL=MAXDATA=0', which should allow a heap larger than 3 GB, and also tried the value '0x80000000@DSA', which gives a 2 GB heap, but still no good news.
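
For reference, the setting in our dsenv looks like this (the value shown is just one of the ones we tried):

Code:

# In $DSHOME/dsenv -- example value only.
# MAXDATA caps the data (heap) segment of processes started from this
# environment; the @DSA suffix enables dynamic segment allocation.
LDR_CNTRL=MAXDATA=0x80000000@DSA
export LDR_CNTRL
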

@chulett, I just want to make use of the whole 16 GB of memory instead of changing the job design.

[Edit] When we maximize the value of 'LDR_CNTRL', I can't log in to DataStage at all; I get a broken connection error.

Any ideas?

Regards.

Posted: Wed Sep 22, 2010 7:11 am
by chulett
sohasaid wrote: @chulett, I just want to make use of the whole 16 GB of memory instead of changing the job design.
I honestly don't believe you will have any other choice. We shall see, I suppose.

Posted: Fri Oct 01, 2010 4:40 pm
by sohasaid
Back with the solution. :D

The main problem was that DataStage was limited to using only 1.5 GB of the server's 16 GB of memory. After opening a ticket with IBM support, the problem is fixed.

The whole issue was the start-up sequence of the WAS and DataStage services combined with the 'LDR_CNTRL' parameter in the dsenv file.

So, the solution steps are as follows (a command-level sketch follows the list):
1- Comment out the 'LDR_CNTRL' parameter in the dsenv file so it is not set.
2- Start up the WAS service.
3- Uncomment the 'LDR_CNTRL' parameter in the dsenv file and set it to the memory limit you want. I think the default is 1.5 GB; in our case we set it to 64 GB.
4- Start up the DataStage service.
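
In shell terms the sequence is roughly this (the WAS path and profile/server names are from our box; yours may differ):

Code:

# 1- Comment out LDR_CNTRL in $DSHOME/dsenv:
#    # LDR_CNTRL=MAXDATA=...@DSA ; export LDR_CNTRL

# 2- Start WebSphere first, so it is not started under the heap limit:
/usr/IBM/WebSphere/AppServer/profiles/default/bin/startServer.sh server1

# 3- Uncomment LDR_CNTRL in dsenv and set the limit you want.

# 4- Start the DataStage engine with dsenv sourced:
cd $DSHOME
. ./dsenv
bin/uv -admin -start
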

The IBM consultant said that there's something wrong between WAS and 'LDR_CNTRL'; honestly, I don't know what it is! But we also noticed that the 'LDR_CNTRL' value overrides the memory limit assigned to a specific user at the operating system level. For example, 'dsadm' is the user we run DataStage jobs with. Running 'ulimit -a' at the AIX command line reports (data(kbytes) unlimited), but running the same command from a before-job subroutine reports (data(kbytes) 1566720 --> about 1.5 GB), which is the 'LDR_CNTRL' value (set there in hexadecimal).
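
If you want to reproduce that check, compare the two environments like this (the output file path is just an example):

Code:

# From an ordinary dsadm login shell:
ulimit -a                            # here: data(kbytes) unlimited

# From a job's before-job ExecSH routine, i.e. inside the environment
# DataStage spawns jobs from, capture the same output:
ulimit -a > /tmp/ds_job_ulimit.out 2>&1
# here the file showed: data(kbytes) 1566720  (about 1.5 GB)
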

He did emphasise, though, redesigning the jobs to include a Sort stage before each Aggregator stage, as Craig advised above.
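
So the first job's design becomes:

Code:

DRS stage\Read --> Sort --> Aggregator --> Transformer --> DRS stage\Write
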

Now we can use our full memory.

Cheers :)

Posted: Fri Oct 01, 2010 4:57 pm
by chulett
Cool. Luckily for the rest of us, that LDR_CNTRL issue is specific to AIX from what I recall. :wink: