Hey All,
Hopefully an easy question. We have two sequential files (same structure) loading into one table. We were using a Link Collector stage, but our job has failed a few times after running for a while. The files contain different amounts of data (the first has 50 million rows, the second 70 million). Could this cause a timeout error with the Link Collector stage's round-robin algorithm? Is the simplest solution to break it into two jobs and process each file separately? Thanks in advance!
Do the data sets for Link Collector links need to be the same size?
The Link Collector stage is notorious for this behaviour. It takes Round Robin to mean "wait", rather than "skip if not ready". It does not process the "end of data" token gracefully.
You could cat the files together in a Filter command then, within the job if you want to, use Link Partitioner and Link Collector stages in concert to cause parallel processing. But try it without these stages first - I think that the speed of the Sequential File stage will be more than adequate.
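The concatenation idea above can be sketched in plain shell. The file names here are stand-ins for the real extracts, not actual paths from the job; the point is simply that a single `cat` stream preserves every row of both files, so one Sequential File stage can read them as one input instead of two Link Collector links.

```shell
# Illustrative sketch only - file_a.txt and file_b.txt stand in for the
# real 50M-row and 70M-row extracts (same record structure assumed).
printf 'row1\nrow2\n'       > file_a.txt   # stand-in for the first file
printf 'row3\nrow4\nrow5\n' > file_b.txt   # stand-in for the second file

# This is effectively what the stage's Filter command would run:
# the output is one continuous stream, so no link ever "runs dry"
# ahead of the other.
cat file_a.txt file_b.txt > combined.txt

# Row counts simply add: 2 + 3 = 5, nothing dropped or duplicated.
wc -l < combined.txt
```

Because the sizes just add, the 50M/70M imbalance that trips up the round-robin collector never comes into play.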
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.