Switching from Windows to Unix issues
Posted: Wed Jan 14, 2015 10:04 am
So we are in the process of moving all our jobs from Windows over to Unix and are experiencing some strange errors.
Our main sequence jobs each consist of about 8 small parallel jobs. We had set the smaller parallel jobs to use 1 node, as most of them are simple jobs that write a single line to a table (for audit and control purposes).
Now all these big sequence jobs and parallel jobs work 100% on Windows, but some of the small parallel jobs are breaking on Unix. All the sequence jobs use the same parallel jobs; we just made them multi-instance.
Now the funky part... Some of these small one-node parallel jobs are failing with "Parallel job reports failure (code 256)". That's the only error. And it only happens in some of the sequence jobs, seemingly at random, even though they all use the same settings and just move different data.
The good news is I found a workaround while troubleshooting: if I switch the jobs that are breaking from one node to two nodes, they work. But since we are new to Unix, we think there may be a bigger underlying problem that we don't know about. And we don't have warm and fuzzy feelings about moving these jobs to production on a workaround without understanding why they break or why they work with 2 nodes.
Is there something you think might be causing these random jobs to break on Unix? Or can someone explain why some jobs randomly give error code 256 while most of the jobs don't?
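In case it helps narrow things down, here is a small sketch of how the number might decode. It assumes (we have not confirmed this) that the engine is passing along a raw Unix wait status using the traditional encoding, where the low 7 bits hold the terminating signal and the next byte holds the exit code. The function name `decode_wait_status` is just something made up for illustration.

```python
def decode_wait_status(status):
    """Split a traditional Unix wait status into (exit_code, signal_number).

    Assumption: 'status' is a raw wait status, not an already-decoded exit code.
    """
    signal_number = status & 0x7F      # low 7 bits: terminating signal, 0 if none
    exit_code = (status >> 8) & 0xFF   # next byte: the value the process exited with
    return exit_code, signal_number


exit_code, signal_number = decode_wait_status(256)
print(exit_code, signal_number)  # prints "1 0"
```

If that assumption holds, code 256 would just mean that some process exited with a generic status of 1 and was not killed by a signal, which would explain why the message itself tells us so little.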
Thanks,