##F TFIO 000153 18:58:22(003) <input repartition(1),0> Fatal Error: Unable to allocate communication resources
##E TFPM 000000 18:58:22(001) <node_edwdev2> operator [{natural="/u001/DataStage_work/at/at_dmnd_dep_tran_final.ds", synthetic="input repartition(1)"}], partition 0 of 8, processID 8,212,494 on edwdev2, player 2 terminated unexpectedly.
##E TFPM 000338 18:58:22(001) <main_program> Unexpected exit status 1
##E TOFN 000001 18:58:22(000) <funnel,2> Failure during execution of operator logic.
##I TOFN 000163 18:58:22(001) <funnel,2> Input 0 consumed 63525 records.
##I TOFN 000163 18:58:22(002) <funnel,2> Input 1 consumed 86123 records.
##I TOFN 000094 18:58:22(003) <funnel,2> Output 0 produced 149648 records.
##F TFOR 000151 18:58:22(004) <funnel,2> Fatal Error: APT_SYSselect returned error status -1 and no inputs reached EOF.
Also, one of the input datasets is pretty large: DS1 is 48 million records, DS2 is about 500,000. However, I ran against a much smaller set (about 3.9 million and 0 records) and got the same error. This is a generic process that deals with input datasets of any size and writes to a parameterized DB2 table via db2write.
Hmmm, good point. The datasets that it is using are already partitioned, so we really don't need the hash in front of each. Let me try it without the hash and see what happens.
On one hand, hopefully it will work and I can move on. On the other hand, then we still don't really know why it failed in the first place.
Okay, I updated the program to not re-hash the datasets going into the funnel stage and that eliminated the issue. However, the fact that it works does not give me warm fuzzy feelings when I don't know or understand why it failed to begin with.
The exact same code works in our production environment just fine. Why does it fail in dev? I am guessing that it is something environmental (not necessarily code related). I have tried the process with varying input sizes and concluded that input record count does not matter - it fails with both large and small volumes.
Does anyone have suggestions about what might be causing the error? This is a generic process used by dozens of production jobs. I am loath to update a production process without a true understanding of why this failure is occurring - especially when the production process is running just fine.
Any help would be greatly appreciated!
Brad.
ps. I am NOT flagging this with a workaround. I don't know about the rest of you, but I tend to ignore entries that are resolved or marked as workarounds.
Just out of curiosity:
1. Are you running the same server version on both the Production and Development environments?
2. Are you running the same job with the same node configuration on both systems?
3. Are the environment variables under the Parallel branch exactly the same on both systems?
DataStage version is the same on dev and prod, as are environment variables/settings. The node configuration is different, but only in terms of the number of nodes. The way the nodes are configured is the same.
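Since the code is identical, diffing the engine-level settings captured on each host is a quick way to back up the "same environment variables" claim. A minimal sketch, assuming you can save the output of `env` (or a `dsenv` dump) to a file on each server; the file paths and helper names here are hypothetical, not part of any DataStage tooling:

```python
def load_env_dump(path):
    """Parse a KEY=value dump, e.g. the output of `env` captured on one host."""
    settings = {}
    with open(path) as fh:
        for line in fh:
            if "=" in line:
                key, _, value = line.rstrip("\n").partition("=")
                settings[key] = value
    return settings

def diff_env(dev, prod):
    """Report keys present on only one host, and keys whose values differ."""
    only_dev = sorted(set(dev) - set(prod))
    only_prod = sorted(set(prod) - set(dev))
    changed = sorted(k for k in set(dev) & set(prod) if dev[k] != prod[k])
    return only_dev, only_prod, changed

# Example with in-memory dumps instead of files:
dev = {"APT_CONFIG_FILE": "/opt/dev/config.apt", "APT_DUMP_SCORE": "1"}
prod = {"APT_CONFIG_FILE": "/opt/prod/config.apt", "APT_MONITOR_SIZE": "100000"}
only_dev, only_prod, changed = diff_env(dev, prod)
print("dev only:", only_dev)
print("prod only:", only_prod)
print("differing values:", changed)
```

Anything that shows up in the "differing values" list (beyond expected path differences) is a candidate for the environmental cause.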
Continuous, sort/merge or sequential Funnel? I suspect the second, and that the process that watches all inputs simultaneously to figure out which is next to preserve sorted order has taken some kind of abort. It's not totally clear why - but that's the line of investigation I'd be following, even unto re-instating the original partitioning to see whether it's reproducible. Maybe production has the "preserve partitioning" flag set to Clear but the development has "Set" or "Propagate"?
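To illustrate the failure mode described above: APT_SYSselect is presumably the engine's wrapper around the POSIX select() call that watches all funnel inputs at once, and select() returns -1 when one of the descriptors it is watching becomes invalid (e.g. the producing player dies and its pipe is torn down). A minimal Python sketch, not DataStage code, just demonstrating that behavior:

```python
import os
import select

# A pipe stands in for one of the funnel's input links.
r, w = os.pipe()

# Normal case: select() reports the descriptor once data is waiting on it.
os.write(w, b"record")
ready, _, _ = select.select([r], [], [], 1.0)
assert ready == [r]

# Failure case: if the descriptor is torn down while still being watched,
# the underlying C select() returns -1 with EBADF; Python surfaces that
# as an OSError instead of a -1 return code.
os.close(r)
os.close(w)
try:
    select.select([r], [], [], 1.0)
    failed = False
except OSError:
    failed = True
print("select failed after descriptor loss:", failed)
```

That would match the log: the player for `input repartition(1)` terminated unexpectedly first, and the funnel's select() then failed before any input had reached EOF.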
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Just a quick update. Turns out it was probably an IBM DataStage patch that caused our problems.
On development, this patch was installed in May. However, the job we are working with runs very rarely in dev, so we never hit the error until very recently and did not make the connection between the patch and our error.
On the other hand, this patch was just installed into our production environment and lo and behold the same job failed instantly with the same error message.
The patch has been backed out and we are now awaiting a fix from IBM to resolve the issue.
When I get the patch identification information and the related fix, I will post as much info as I can.
We are also having the same issue, and our environment has the same patches installed that you mentioned. Since you mentioned in the topic that you have a workaround, could you please specify what that workaround is until we get a fix for the patches?