Heap size error in Generic job

Posted: Wed Aug 18, 2010 12:22 pm
by aj
Hi DS Gurus,

Even though this has been discussed many times, I still couldn't find an exact answer related to my issue.

I have a generic parallel job in IIS 8.0 which takes data from a dataset and FastLoads it into Teradata. As this is a generic job, we have RCP enabled.
The job design is simple, as below:
Dataset --> Column Generator (running in parallel mode, propagate partitioning) --> Transformer stage, adding 3 columns for date etc. (preserve sort order, auto partition) --> Teradata Enterprise stage (round robin)

It's on AIX with a 2-server, 8-node configuration.
$ ulimit -a
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) unlimited
memory(kbytes) unlimited
coredump(blocks) 2097151
nofiles(descriptors) 10240

I can see we have enough space on the resource (1300 GB) and scratch (247 GB) disks.

The input dataset has 47+ million records, with an overall size of 12.07 GB.

My job is failing with a heap size error and throws other errors, as below:
APT_ParallelSortMergeOperator,0: Unbalanced input from partition 1: 10000 records buffered [parallelsortmerge/parallelsortmerge.C:781]

APT_ParallelSortMergeOperator,0: The current soft limit on the data segment (heap) size (2147483645) is less than the hard limit (2147483647), consider increasing the heap size limit

APT_ParallelSortMergeOperator,0: Fatal Error: Throwing exception: APT_BadAlloc: Heap allocation failed. [error_handling/exception.C:132]

APT_CombinedOperatorController,3: Fatal Error: Unable to allocate communication resources [iomgr/iomgr.C:227]
node_node1: Player 1 terminated unexpectedly. [processmgr/player.C:160]
APT_CombinedOperatorController,1: Fatal Error: Unable to allocate communication resources [iomgr/iomgr.C:227]

One of the suggestions we got was to split the data and try the load, or to change the job design. I don't think DataStage should fail because of this much volume.
Moreover, the same job runs successfully in another environment with 46.5+ million records and an 11.55 GB dataset.

Can you please help me with this?

Regards,
Aj

Posted: Wed Aug 18, 2010 6:20 pm
by ray.wurlod
At first glance it appears that you are using a Sort/Merge collector (on the input to the Teradata stage?) and have wildly different row counts coming from the different partitions. You claim to be using Round Robin, but there's an APT_ParallelSortMergeOperator featuring prominently in the error messages. Perhaps dumping the score will give you a better idea just what is going on.
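If you haven't dumped a score before, one way to do it (a sketch, assuming you can add environment variables to the job or to dsenv) is:

# Enable the score dump so the job log shows which operators,
# partitioners and collectors PX actually inserted at run time.
APT_DUMP_SCORE=True
export APT_DUMP_SCORE

Then rerun the job and look for the score entry near the top of the Director log; it will show any sort/merge operators (such as the APT_ParallelSortMergeOperator in your errors) that were inserted without you asking for them.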

Re: Heap size error in Generic job

Posted: Thu Aug 19, 2010 12:32 am
by ghila
Hello,

We ran into similar trouble. Some heap allocation limitations seem to be due to the setting of the AIX environment variable LDR_CNTRL.
You might check how it is set in your "dsenv" file.

This link can also be useful:
http://www-01.ibm.com/support/docview.w ... wg21411997
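For reference, the entry in dsenv usually looks something like the sketch below; the MAXDATA value is only illustrative, so check the technote for what suits your configuration:

# Hedged sketch of an LDR_CNTRL entry in dsenv (AIX only).
# MAXDATA controls how much of the 32-bit address space a process
# may use for its data (heap) segment.
LDR_CNTRL=MAXDATA=0x80000000
export LDR_CNTRL

Note that dsenv changes only take effect for engine processes started after the environment is reloaded, e.g. after restarting the DataStage services.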

Re: Heap size error in Generic job

Posted: Thu Aug 19, 2010 10:11 pm
by aj
We do have LDR_CNTRL set in the job. The value is set to 0x80000000, as suggested on the IBM support site.

Ray,
Yes, I think the round robin used in the Teradata stage is causing DS to insert a sort/merge internally during processing. At the moment I cannot change this and try rerunning the job, as it is in prod...
And the same job definitely runs fine in a different environment with 46.5+ million records.

Re: Heap size error in Generic job

Posted: Thu Aug 26, 2010 12:10 pm
by prasanna_anbu
aj wrote:We do have LDR_CNTRL set in the job. The value is set to 0x80000000, as suggested on the IBM support site.

Ray,
Yes, I think the round robin used in the Teradata stage is causing DS to insert a sort/merge internally during processing. At the moment I cannot change this and try rerunning the job, as it is in prod...
And the same job definitely runs fine in a different environment with 46.5+ million records.
Please check these settings:
APT_DEFAULT_TRANSPORT_BLOCK_SIZE = 131072
APT_PHYSICAL_DATASET_BLOCK_SIZE = NULL
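As a sketch of how you might apply these for a single run (assuming you set them as job-level environment variables or in the shell that invokes the job; whether they help depends on your record width and partition count):

# Use a 128 KB transport block; 131072 is the value suggested above.
APT_DEFAULT_TRANSPORT_BLOCK_SIZE=131072
export APT_DEFAULT_TRANSPORT_BLOCK_SIZE
# Leave the physical dataset block size unset (NULL) so PX uses its default.
unset APT_PHYSICAL_DATASET_BLOCK_SIZE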