Sequential file vs dataset in performance

Posted: Wed Jan 17, 2007 10:28 pm
by vij
Hi all,

I have two jobs that both use the same sequential file (about 100 million records) as input. Since both jobs read the same file, I thought performance might be better if I load the sequential file into a dataset once and then use that dataset as the input to both jobs. Am I right? Please advise.

Thanks in advance!

Posted: Thu Jan 18, 2007 4:34 am
by kumar_s
You are trying to optimize two sequential reads. If you convert the file to a dataset and read the generated dataset instead, you still have to do one sequential read (to convert the file into a dataset). The other sequential read that you save is replaced by the two dataset reads. Dataset access won't be 100% efficient either; it also consumes some I/O. The access rate depends on the number of partitions, CPU utilization at the time of the read, network congestion, etc.
So it would be more realistic to do a test run at your site and determine the difference yourself.
You can post the stats here if they turn out to be interesting.
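
The trade-off above can be put into rough numbers. This is only a back-of-the-envelope sketch in Python, not DataStage code: the function name and all throughput figures are hypothetical placeholders, the dataset write during conversion is counted as its own pass, and the model simply adds up I/O passes without accounting for pipelining or partition parallelism. Only the 100 million record count comes from the question.

```python
def compare(records, seq_read_rate, ds_write_rate, ds_read_rate):
    """Return (option_a_seconds, option_b_seconds) for a naive I/O model.

    Rates are in records per second and are assumptions to be replaced
    with figures measured at your own site.
    """
    # Option A: each of the two jobs reads the sequential file directly.
    option_a = 2 * records / seq_read_rate

    # Option B: one sequential read plus one dataset write (conversion),
    # followed by two dataset reads (one per job).
    option_b = (records / seq_read_rate
                + records / ds_write_rate
                + 2 * records / ds_read_rate)
    return option_a, option_b


RECORDS = 100_000_000  # record count from the original question

# Hypothetical throughputs (records/second), chosen only for illustration.
a, b = compare(RECORDS, seq_read_rate=200_000,
               ds_write_rate=400_000, ds_read_rate=800_000)
print(f"two sequential reads: {a:.0f} s, via dataset: {b:.0f} s")
```

With these made-up rates the two options come out the same, which is the point: the dataset route only wins if the dataset reads are fast enough to pay back the extra conversion pass, so the real answer has to come from a measured test run.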

Posted: Thu Jan 18, 2007 7:43 pm
by ray.wurlod
If the data are in a sequential file, you have to read the sequential file (even to get the data into a persistent Data Set). So there's no "either/or" about it.

Investigate the "multiple readers per node" property of the Sequential File stage.
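
The general idea of several readers sharing one file can be illustrated outside DataStage. The Python sketch below is a generic technique, not the stage's actual implementation: it assumes newline-delimited records, and the helper names `reader_ranges` and `read_range` are made up for this example. Each reader gets a contiguous byte range that starts on a record boundary, so the ranges could be read concurrently.

```python
import os
import tempfile


def reader_ranges(path, readers):
    """Split a newline-delimited file into `readers` contiguous byte ranges,
    each beginning at the start of a record."""
    size = os.path.getsize(path)
    starts = [0]
    with open(path, "rb") as f:
        for i in range(1, readers):
            f.seek(size * i // readers)   # approximate split point
            f.readline()                  # advance to the next record start
            starts.append(min(f.tell(), size))
    ends = starts[1:] + [size]
    return list(zip(starts, ends))


def read_range(path, start, end):
    """Read the records that lie inside one byte range."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).splitlines()


# Small demonstration on a throwaway file with ten records.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"".join(b"record-%d\n" % i for i in range(10)))
demo_path = tmp.name

ranges = reader_ranges(demo_path, 3)
records = [rec for start, end in ranges
           for rec in read_range(demo_path, start, end)]
os.remove(demo_path)
```

Here the ranges are read one after another for simplicity; in a parallel setting each range would go to its own reader, and every record is still seen exactly once.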