Switching from Windows to Unix issues


taylor.hermann
Premium Member
Posts: 32
Joined: Wed Aug 20, 2014 11:17 am


Post by taylor.hermann »

So we are in the process of moving all of our jobs from Windows over to Unix, and we are experiencing some strange errors.

Our main sequence jobs each consist of about 8 small parallel jobs. We had set the smaller parallel jobs to use 1 node, since most of them are simple jobs that write a single row to a table (for audit and control purposes).
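For reference, our one-node configuration file is along these lines (the hostname and resource paths here are just placeholders, not our real ones):

    {
        node "node1"
        {
            fastname "dsengine-host"
            pools ""
            resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
            resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
        }
    }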

All of these big sequence jobs and parallel jobs work 100% on Windows, but some of the small parallel jobs are breaking on Unix. All of the sequence jobs use the same parallel jobs; we just made them multi-instance.

Now the funky part... some of these small one-node parallel jobs are failing with "Parallel job reports failure (code 256)". That's the only error, and it only happens for seemingly random sequence jobs, even though they all use the same settings and just move different data.
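(My guess, and it is only a guess: 256 looks like a raw Unix wait status rather than a real exit code. The child's exit code lives in the high byte of that 16-bit status, so a raw status of 256 would just mean "the process exited with code 1", i.e. a generic failure with no extra detail. A quick Python check of that decoding, purely for illustration:

    import os

    raw_status = 256                     # the raw wait status the message may be reporting
    print(os.WIFEXITED(raw_status))      # True -> the process exited normally (no signal)
    print(os.WEXITSTATUS(raw_status))    # 1    -> the actual exit code

If that's right, the 256 itself doesn't tell us much and the real cause would have to be somewhere in the job log.)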

The good news is that I found a workaround while troubleshooting: if I switch the jobs that are breaking from one node to two nodes, they work. But since we are new to Unix, we suspect there may be a bigger underlying problem that we don't know about, and we don't have warm and fuzzy feelings about moving these jobs to production on a workaround without understanding why they break on 1 node or why they work on 2.

Is there anything you can think of that might be causing these random jobs to break on Unix? Or can someone explain why some jobs randomly give error code 256 while most of them don't?

Thanks,
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Try doing an exact search here for "code 256" and see if any of the previous posts with that error help, if you haven't already.
-craig

"You can never have too many knives" -- Logan Nine Fingers
taylor.hermann
Premium Member
Posts: 32
Joined: Wed Aug 20, 2014 11:17 am

Post by taylor.hermann »

I have tried that :/
Most of those posts involve things like job core dumps, which we are not seeing, or DB2, which we are not using.
Most of the failing jobs use nothing more than an Oracle Connector, a Transformer, and Sequential File stages.
But again, it's completely random which multi-instance job fails.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Do the same jobs always abort, or are the errors indeterminate as well?

Off-hand I can't think of a cause for the problems as you've described.

Can you think of a common denominator for the failing jobs - do they use the same tables or some stage not used in other jobs? Do the jobs contain partitioning/repartitioning that isn't present in other jobs?
taylor.hermann
Premium Member
Posts: 32
Joined: Wed Aug 20, 2014 11:17 am

Post by taylor.hermann »

Yes, the only consistent thing is that the same main sequence jobs fail, always with the same error while the parallel job is set to use 1 node, and they seem to always fail at the same parallel job. But as soon as I set those parallel jobs to use 2 nodes, they work. Meanwhile, 20+ other sequence jobs run the same exact parallel job on 1 node and it works; only a select few cannot.

That has been the issue: there is no common denominator that I can tell. Every other job uses the same exact stages and just moves different data, and they all write to the same table. Because the failing jobs are multi-instance, the only real differences are the invocation ID and the data being moved.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Your best bet is to take a job that fails, make a copy of it, and successively remove stages from the tail end (replacing them with Peek stages as necessary) until the error goes away; that will point to the part of the job that might be the culprit. I can't think of anything better at the moment - but once it is known which stage is triggering the errors, it might be possible to make an educated guess at the underlying issue.