Fault type is 11 - Aggregator Stage Issue

sohasaid
Premium Member
Posts: 115
Joined: Tue May 20, 2008 3:02 am
Location: Cairo, Egypt

Post by sohasaid »

Dear all,

We have a fresh DataStage v8.1.0.1 (Server Edition) installation on AIX v6. The jobs were exported from a Windows Server 2003 installation. After the move to AIX, all imported jobs work fine except those that contain an Aggregator stage.

After reading about 8 million of the 9 million records in the source table, the job aborts. After a reset in Director, the log shows the following error:

From previous run
DataStage Job 65 Phantom 20574
Abnormal termination of DataStage.
Fault type is 11. Layer type is BASIC run machine.
Fault occurred in BASIC program DSD.WriteLog at address 20


Job design is as follows:

DRS stage\Read --> Aggregator --> Transformer --> DRS stage\Write
In the Aggregator stage, I'm grouping by 5 columns and taking the maximum of a sixth column.

Here is what we have tried so far:
1- After putting a Sort stage before the Aggregator stage, the job worked fine.
2- When grouping by only 2 or 3 columns, the job worked fine.

After these trials we suspected a memory issue, but the server has 16 GB of RAM! So we switched the user that runs the jobs from 'dsadm' to 'root' to make sure no per-user memory allocation limit applies, and the job still aborts.

We noticed something similar in another job:

DRS stage\Read --> Transformer --> DRS stage\Write
When the 'Array size' is set to a large value such as '5000' instead of '1', the job aborts. After setting it back to '1', the job works fine, which again points to memory.

After all these trials I still think it's a memory issue, but something is still missing.

Any help will be appreciated, and sorry for the long post.

P.S. I've searched the forum for 'Fault Type 11', but none of the posts helped.

Regards.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It's not array size. Your Aggregator stage is running out of memory. That's why sorting the data removes the problem - sorted data enables a more efficient memory-consumption strategy to be used in the Aggregator stage.
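As a rough illustration of that difference (a Python sketch only, not how the Aggregator stage is actually implemented; the function names are made up for the example): with unsorted input every distinct group has to be held until the last row has been read, whereas with input sorted on the grouping columns each group can be finished and released as soon as the key changes.

```python
from itertools import groupby

def max_unsorted(rows, key_len):
    """Hash-style aggregation: one dictionary entry per distinct group."""
    best = {}
    for row in rows:
        key, value = row[:key_len], row[key_len]
        if key not in best or value > best[key]:
            best[key] = value
    # Every group is still in memory when the last row has been read.
    return best, len(best)

def max_sorted(rows, key_len):
    """Streaming aggregation: assumes rows arrive sorted on the grouping key."""
    for key, group in groupby(rows, key=lambda r: r[:key_len]):
        # Only the current group is held; it is emitted when the key changes.
        yield key, max(r[key_len] for r in group)

# Five grouping columns followed by the column to maximise.
rows = [
    ("a", 1, 1, 1, 1, 5),
    ("b", 1, 1, 1, 1, 7),
    ("a", 1, 1, 1, 1, 9),
    ("b", 1, 1, 1, 1, 2),
]
totals, groups_held = max_unsorted(rows, 5)
streamed = list(max_sorted(sorted(rows), 5))
```

Both approaches give the same maxima; the difference is how many groups have to be kept at once, which grows with the number of distinct key combinations in the unsorted case.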
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Not only do you have to sort your incoming data to relieve the memory pressure on the Aggregator stage, you need to assert that sorted order in the stage itself. And lastly you have to sort in an order that supports the grouping you are doing. Only if all three steps are done properly can the Aggregator handle pretty much any amount of input data.
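To show why the sort order has to support the grouping, here is a small Python sketch of a generic streaming aggregator. It is an assumption-laden illustration, not the stage's real behaviour, and `stream_max` is a hypothetical name. When the input is sorted on something other than the grouping columns, the same key turns up in more than one run.

```python
from itertools import groupby

def stream_max(rows, key_len):
    """Emit (key, max) each time the grouping key changes in the input."""
    return [
        (key, max(r[key_len] for r in grp))
        for key, grp in groupby(rows, key=lambda r: r[:key_len])
    ]

# Two grouping columns followed by the value column.
rows = [("x", 1, 30), ("y", 1, 20), ("x", 1, 10)]

# Sorted on the value column only: this order does not support the grouping,
# so the key ("x", 1) is split across two separate runs.
wrong = stream_max(sorted(rows, key=lambda r: r[2]), key_len=2)

# Sorted on the grouping columns: each key arrives as one contiguous run.
right = stream_max(sorted(rows, key=lambda r: r[:2]), key_len=2)
```

Here `wrong` contains the key `("x", 1)` twice with different maxima, while `right` contains each key once.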
-craig

"You can never have too many knives" -- Logan Nine Fingers

Post by sohasaid »

@Ray, we've discovered that it's a heap allocation error on AIX. This post from IBM says that, unlike with the Parallel Edition, you don't have to change the 'LDR_CNTRL' parameter in the 'dsenv' file for the Server Edition:

http://www-01.ibm.com/support/docview.w ... wg21411997

We even tried raising the 'LDR_CNTRL' value: first 'LDR_CNTRL=MAXDATA=0', which we understood to allow a heap larger than 3 GB, and then '0x80000000@DSA', which we understood to give a 2.5 GB heap, but neither helped.

@chulett, I just want to make use of the whole memory (16 GB) instead of changing the job design.

[Edit] When I set 'LDR_CNTRL' to its maximum value, I can't log in to DataStage; the client reports a broken connection error.

Any ideas?

Regards.

Post by chulett »

sohasaid wrote: @chulett, I just want to make use of the whole memory (16 GB) instead of changing the job design.
I honestly don't believe you will have any other choice. We shall see, I suppose.

Post by sohasaid »

Back with the solution. :D

The main problem was that DataStage was limited to only 1.5 GB of the server's 16 GB of memory. We opened a ticket with IBM support and the problem is now fixed.

The whole issue came down to the order in which the WAS and DataStage services are started, together with the 'LDR_CNTRL' parameter in the dsenv file.

So, the solution steps are as follows:
1- Comment out the 'LDR_CNTRL' parameter in the dsenv file so that it is not set.
2- Start the WAS service.
3- Uncomment the 'LDR_CNTRL' parameter in the dsenv file and set it to the memory limit you want. I think the default is 1.5 GB; in our case we set it to 64 GB.
4- Start the DataStage service.

The IBM consultant said there is some bad interaction between WAS and 'LDR_CNTRL'; honestly, I don't know what it is! We also noticed that the value of 'LDR_CNTRL' overrides the memory limit assigned to a user at the operating system level. For example, 'dsadm' is the user we use to run DataStage jobs. Running 'ulimit -a' as that user at the AIX command line reports (data(kbytes) unlimited), but running the same command from a before-job subroutine reports (data(kbytes) 1566720, roughly 1.5 GB), which is the same value as 'LDR_CNTRL' once that is written in hexadecimal.
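The comparison between the 'ulimit' figure and a hexadecimal 'LDR_CNTRL' value can be checked with a little arithmetic. A minimal Python sketch, assuming 'ulimit' reports the data limit in units of 1024 bytes; the helper names are illustrative only and not part of DataStage or AIX:

```python
def kbytes_to_hex_bytes(kbytes):
    """Convert a ulimit-style kilobyte figure to a hexadecimal byte count."""
    return hex(kbytes * 1024)

def kbytes_to_gib(kbytes):
    """Convert a ulimit-style kilobyte figure to GiB (1 GiB = 1024 * 1024 KB)."""
    return kbytes / (1024 * 1024)

limit_kb = 1566720  # data(kbytes) value seen from the before-job subroutine
limit_hex = kbytes_to_hex_bytes(limit_kb)
limit_gib = round(kbytes_to_gib(limit_kb), 2)
```

For the figure above this gives 0x5fa00000 bytes, about 1.49 GiB, which is a little under 0x60000000 (exactly 1.5 GiB).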

He still stressed, though, that the jobs should be redesigned to include a Sort stage before each Aggregator stage, as Craig advised above.

Now we can use our full memory.

Cheers :)

Post by chulett »

Cool. Luckily for the rest of us, that LDR_CNTRL issue is specific to AIX from what I recall. :wink: