
Do data sets for Link Collector links need to be the same size?

Posted: Tue Sep 09, 2008 5:22 pm
by jpr196
Hey All,

Hopefully an easy question. We have 2 sequential files (same structure) loading to one table. We were using the Link Collector stage, but our job has failed a few times after running for a period of time. The files have different amounts of data (the first has 50 million rows and the second has 70 million). Would this cause a timeout error with the Link Collector stage's round-robin algorithm? Is the simplest solution to break it up into 2 jobs and process each file separately? Thanks in advance!

Posted: Tue Sep 09, 2008 7:54 pm
by chulett
What kind of failure, a timeout? Basically yes, they should be of a similar volume, but I don't think that's a hard and fast requirement.

Posted: Tue Sep 09, 2008 8:30 pm
by ray.wurlod
The Link Collector stage is notorious for this behaviour. It takes Round Robin to mean "wait", rather than "skip if not ready". It does not process the "end of data" token gracefully.

You could cat the files together in a Filter command in the Sequential File stage and then, within the job if you want to, use Link Partitioner and Link Collector stages in concert to get parallel processing. But try it without these stages first; I think the speed of the Sequential File stage will be more than adequate.
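
For example, a minimal sketch of the Filter approach (the paths are hypothetical, and this assumes the stage pipes the named file to the filter on stdin): point the Sequential File stage at the first file and set its Filter command to append the second, so the downstream link sees one combined stream:

    # hypothetical: stage reads /data/src1.txt; filter passes it
    # through from stdin (-) and then appends the second file
    cat - /data/src2.txt

With that in place the job reads both files through a single input link, so no Link Collector is needed at all.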

Posted: Tue Sep 09, 2008 11:10 pm
by chulett
That probably explains why I don't use it. As noted, I prefer concatenation, either before the job or dynamically via the Filter command.
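
If you concatenate before the job instead, a one-liner at the shell (or in a before-job ExecSH call) does it; again, these paths are hypothetical:

    # combine both sources into one file the job can read normally
    cat /data/src1.txt /data/src2.txt > /data/combined.txt

Then the job just points its Sequential File stage at the combined file.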

Posted: Wed Sep 10, 2008 8:50 am
by jpr196
Thanks for the responses and suggestions. I need to filter data from each file before loading (and this is a one-time load), so I think I'll process each file separately to work around the Link Collector.