Sequential File Stage

Nagac · Post by **Nagac** » Sat Aug 04, 2012 12:11 pm

Hi

Will there be any difference in performance between (a) reading the file as a single column using the Sequential File stage and then using a Column Import stage to split it into multiple columns, and (b) reading the file as multiple columns directly in the Sequential File stage, before doing the rest of the transformation process?

Thanks
Naga

chulett · Post by **chulett** » Sat Aug 04, 2012 4:03 pm

Probably, yes maybe. No clue if it would be faster / better / slower / worser or even if it would be all that different as there are too many variables at play. Honestly, the only way to properly answer the question would be to try both ways on your system with your data and see. And hopefully the volume of data would be large enough to make any metrics statistically meaningful.

ray.wurlod · Post by **ray.wurlod** » Sun Aug 05, 2012 1:03 am

Probably, because you are parsing in parallel mode. But, as Craig notes, you won't notice much of a difference with small data volumes.

chandra.shekhar@tcs.com · Mon Aug 06, 2012 1:52 am

Ray and Craig are correct: there will be better performance when you define multiple columns in the source itself.
I had the same situation in one of my jobs, but I used the Field function in the Transformer to split the columns.
The surprising part was that the change in performance was almost negligible. With multiple columns in the source, the job finished in 200 sec, while it took 230 sec with only one column.
Half a minute is not a big deal.

ArndW · Post by **ArndW** » Mon Aug 06, 2012 1:58 am

Usually such jobs are limited by I/O speed and not CPU, and in both cases the same amount of I/O is being done. Parsing the columns directly in the stage should be somewhat more efficient than doing it with a Field() function, so the results that Chandra has seen are what I would expect; the increase in time is due to the higher CPU load.
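
To illustrate why per-column Field() calls can cost more CPU than a single parse, here is a rough Python analogy. It is not DataStage code: `field`, `parse_with_field` and `parse_single_pass` are hypothetical helpers, and the sketch assumes that each Field-style call scans the record from the beginning, which may not match how the Transformer actually implements it.

```python
def field(s: str, delim: str, n: int) -> str:
    """Return the n-th (1-based) delimited field of s.

    Hypothetical stand-in for a Field()-style call: every call starts
    scanning at the beginning of the string.
    """
    start = 0
    for _ in range(n - 1):
        pos = s.find(delim, start)
        if pos == -1:
            return ""  # fewer than n fields in the record
        start = pos + len(delim)
    end = s.find(delim, start)
    return s[start:] if end == -1 else s[start:end]


def parse_with_field(record: str, delim: str, ncols: int) -> list[str]:
    # One call per output column, so the record is re-scanned ncols times.
    return [field(record, delim, i) for i in range(1, ncols + 1)]


def parse_single_pass(record: str, delim: str) -> list[str]:
    # One pass over the record yields every column at once.
    return record.split(delim)
```

Both functions read the same bytes from disk, so the I/O cost is identical; only the amount of string scanning per record differs, and it grows with the number of columns in the per-field version.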

Nagac · Post by **Nagac** » Tue Aug 14, 2012 1:51 am

Thanks Everyone.

ray.wurlod · Post by **ray.wurlod** » Tue Aug 14, 2012 3:09 pm

It would be more interesting had Chandra advised the data volume on which the test was done, and whether the Column Import stage had been used rather than a Transformer stage for the parsing.

chandra.shekhar@tcs.com · Thu Aug 16, 2012 5:12 am

@Ray,
The job which I have mentioned had around 17 million records.
I didn't use the Column Import stage; testing was done for reading as a single column and as multiple columns.
The single-column parsing used the Field function only.
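
For scale, the figures quoted in this thread (around 17 million records, 200 sec versus 230 sec) can be turned into throughput and per-record overhead. This is a small arithmetic sketch only; the variable names are made up, and the record count is treated as exactly 17 million even though it was given as approximate.

```python
records = 17_000_000   # "around 17 million records" (approximate)
t_multi = 200.0        # seconds, multiple columns defined in the source
t_single = 230.0       # seconds, single column split with the Field function

# Rows processed per second in each variant.
throughput_multi = records / t_multi
throughput_single = records / t_single

# Relative slowdown and extra time spent per record.
slowdown_pct = (t_single - t_multi) / t_multi * 100
extra_us_per_record = (t_single - t_multi) / records * 1e6

print(f"multi-column:  {throughput_multi:,.0f} rows/sec")
print(f"single-column: {throughput_single:,.0f} rows/sec")
print(f"slowdown: {slowdown_pct:.1f}%, extra {extra_us_per_record:.2f} microseconds per record")
```

On these numbers the single-column job is 15% slower, which works out to a little under 2 microseconds of extra work per record.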