Dear All,
I have a server job in which, from one sequential file, I am splitting the data into many sequential files; the records are then looked up against several hashed files (four of them).
The rows/sec rate I am getting is very low, so the job's total timespan increases and it eventually takes a long time to complete.
1. What are the ways I can increase the speed of this job in terms of rows/sec?
2. What are the performance tuning parameters of the Sequential File stage that affect the speed? Is there any document on that I can refer to?
Please suggest.
Regards:
Hemant Krishnatrey
Sequential file performance issue
You could be overloading the machine. How many CPUs are there?
What is the basis for splitting the rows? Is this being done in a single Transformer stage? How complex are the business rules? Are the rows in the sequential file improbably large?
There are no tunables for the Sequential File stage.
The Sequential File stage is VERY fast. To prove this, construct the following job.
Code: Select all
SeqFile -----> Transformer -----> SeqFile

Make the output constraint on the Transformer stage the system variable @FALSE. Now run the job. This will give you some idea of how well the Sequential File stage can perform. Then go and discover where the real problem lies.
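The same baseline idea can be sketched outside DataStage (a rough illustration only, with a hypothetical file path): read the file and discard every row, so the timing reflects pure read speed rather than any downstream work, much like the @FALSE output constraint does.

```python
import time

def baseline_read_speed(path):
    """Read a sequential file and discard every row, mimicking an
    @FALSE output constraint: the timing reflects pure read speed."""
    rows = 0
    start = time.perf_counter()
    with open(path) as f:
        for _ in f:
            rows += 1  # the row is read, then thrown away
    elapsed = time.perf_counter() - start
    return rows, rows / elapsed if elapsed > 0 else float("inf")
```

If this baseline rate is high, the bottleneck is downstream (the Transformer or the hashed file lookups), not the Sequential File stage.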
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
A hunch: the culprit is the Transformer. Let me guess: you have one Transformer and 20 to 40 output files, so for every input row all of the constraints have to be evaluated, and this takes time.
In some cases it even makes sense to split such a job into a couple of jobs and have the file processed several times by different jobs.
DataStage is a tool, not a magic bullet; most jobs I write have at most 4 to 6 stages (including maybe one lookup) and no fancy stages (only database stages and Sequential Files).
With DataStage, less is more speed, and you need all the speed you can get.
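To illustrate the cost described above (hypothetical constraints and file names, not DataStage code): if every row must be tested against every output link's constraint, the work per row grows with the number of outputs, whereas routing on a single key field is one lookup per row.

```python
# Two routing strategies for splitting rows across output files
# (illustrative sketch only; names are made up).

def route_by_constraints(row, constraints):
    """Evaluate every constraint, as a Transformer with many output
    links must: the cost per row grows with the number of outputs."""
    return [name for name, test in constraints if test(row)]

def route_by_key(row, key_to_output):
    """Route on a single key field: one dictionary lookup per row."""
    return key_to_output.get(row["region"])

constraints = [
    ("north.txt", lambda r: r["region"] == "N"),
    ("south.txt", lambda r: r["region"] == "S"),
    ("east.txt",  lambda r: r["region"] == "E"),
    ("west.txt",  lambda r: r["region"] == "W"),
]
key_to_output = {"N": "north.txt", "S": "south.txt",
                 "E": "east.txt", "W": "west.txt"}
```

With 20 to 40 output links, the difference between evaluating every constraint and a single keyed decision adds up quickly over millions of rows.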
Ogmios
Just some ideas: a single job writing out to four sequential files is no faster than a job writing to a single file. You achieve extra performance if you process the source data using four parallel jobs and you have the CPUs available to service each parallel job.
One thing you can try with sequential files is writing out to files on different disk partitions.
There are a few tunables on hash files which have been mentioned on quite a few previous threads. Have a look at the memory settings on your hash file stages.
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn: Vincent McBurney
Just to clarify, Vince's point about using separate disks really means different drives (spindles). Different partitions that are slices on the same spindle don't result in any gain; indeed, they can result in contention and therefore reduced throughput.
Four parallel streams all reading the same source file will mean that three of those streams will read from cache. If they are writing to separate output files, this will be very fast. The separate output files can later be combined very fast with cat (UNIX) or type or copy (DOS). For example
Code: Select all
cat file2 file3 file4 >> file1
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.