Dear All,
I have a server job in which, from one sequential file, I am splitting the data into many sequential files; the records are then looked up against several hashed files (four of them).
The rows/sec rate I am getting is very low, so the job's total timespan increases and it eventually takes a long time to complete.
1. What are the ways I can increase the speed of this job in terms of rows/sec?
2. What are the performance tuning parameters of the Sequential File stage that affect the speed? Is there any document on that I can refer to?
Please suggest.
Regards:
Hemant Krishnatrey
Sequential file performance issue
You could be overloading the machine. How many CPUs are there?
What is the basis for splitting the rows? Is this being done in a single Transformer stage? How complex are the business rules? Are the rows in the sequential file improbably large?
There are no tunables for the Sequential File stage.
The Sequential File stage is VERY fast. To prove this, construct the following job.
Code: Select all
SeqFile -----> Transformer -----> SeqFile

Make the output constraint on the Transformer stage the system variable @FALSE. Now run the job. This will give you some idea of how well the Sequential File stage can perform. Then go and discover where the real problem lies.
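The same baseline idea can be sketched outside DataStage (a rough illustration only, with a hypothetical file path): read the file and discard every row, so the timing reflects pure read speed rather than any downstream work, much like the @FALSE output constraint does.

```python
import time

def baseline_read_speed(path):
    """Read a sequential file and discard every row, mimicking an
    @FALSE output constraint: the timing reflects pure read speed."""
    rows = 0
    start = time.perf_counter()
    with open(path) as f:
        for _ in f:
            rows += 1  # the row is read, then thrown away
    elapsed = time.perf_counter() - start
    return rows, rows / elapsed if elapsed > 0 else float("inf")
```

If this baseline rate is high, the bottleneck is downstream (the Transformer or the hashed file lookups), not the Sequential File stage.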
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
A hunch: the culprit is the Transformer. Let me guess: you have one Transformer and 20 to 40 output files, so for every input row all of the constraints have to be evaluated, and this takes time.
In some cases it even makes sense to split such a job into a couple of jobs and have the file processed several times by different jobs.
DataStage is a tool, not a magic bullet; most jobs I write have at most 4 to 6 stages (including maybe one lookup) and no fancy stages (only database stages and Sequential Files).
With DataStage, less is more speed, and you need all the speed you can get.
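To illustrate the cost described above (hypothetical constraints and file names, not DataStage code): if every row must be tested against every output link's constraint, the work per row grows with the number of outputs, whereas routing on a single key field is one lookup per row.

```python
# Two routing strategies for splitting rows across output files
# (illustrative sketch only; names are made up).

def route_by_constraints(row, constraints):
    """Evaluate every constraint, as a Transformer with many output
    links must: the cost per row grows with the number of outputs."""
    return [name for name, test in constraints if test(row)]

def route_by_key(row, key_to_output):
    """Route on a single key field: one dictionary lookup per row."""
    return key_to_output.get(row["region"])

constraints = [
    ("north.txt", lambda r: r["region"] == "N"),
    ("south.txt", lambda r: r["region"] == "S"),
    ("east.txt",  lambda r: r["region"] == "E"),
    ("west.txt",  lambda r: r["region"] == "W"),
]
key_to_output = {"N": "north.txt", "S": "south.txt",
                 "E": "east.txt", "W": "west.txt"}
```

With 20 to 40 output links, the difference between evaluating every constraint and a single keyed decision adds up quickly over millions of rows.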
Ogmios
Just some ideas: a single job writing out to four sequential files is no faster than a job writing to a single file. You achieve extra performance if you process the source data using four parallel jobs and you have the CPUs available to service each parallel job.
One thing you can try with sequential files is writing out to files on different disk partitions.
There are a few tunables on hash files which have been mentioned on quite a few previous threads. Have a look at the memory settings on your hash file stages.
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn: Vincent McBurney
Just to clarify, Vince's point about using separate disks really means different drives (spindles). Different partitions that are slices on the same spindle don't result in any gain; indeed, they can result in contention and therefore reduced throughput.
Four parallel streams all reading the same source file will mean that three of those streams will read from cache. If they are writing to separate output files, this will be very fast. The separate output files can later be combined very fast with cat (UNIX) or type or copy (DOS). For example
Code: Select all
cat file2 file3 file4 >> file1
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.