Job using dataset files is slower than sequential files
I created two sets of jobs. The two sets are identical, and both run inside a sequence loop. Here is the loop:
StartLoop --> ExecCmd1 --> JobActivity --> ExecCmd2 --> EndLoop
There is another link from the EndLoop back to the StartLoop. The job in the JobActivity stage is:
SeqFile --> Modify --> SurrogateKeyGenerator --> Transformer --> TargetFile
The only difference between the two sets is that TargetFile is a sequential file in one and a data set in the other. The set that writes to the sequential file is significantly faster than the set that writes to the data set. I would have expected the reverse. Is there anything I can change in the config file to fix this?
Is there any chance your problem is related to the following post?
http://dsxchange.com/viewtopic.php?t=97 ... e5057e602f
Is the environment idle in both cases? Your server might be loaded in the latter case.
It is also possible that the additional node is not as easily accessible for the dataset to write its data to. As a test, try a single-node configuration that uses only the node where the sequential file is created.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Your server might be busy with other work when you test with the dataset and comparatively idle when you process the sequential file. That would make the dataset job run slower. You can measure the CPU usage in both cases.
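One rough way to compare the two cases, sketched in Python: time each run and record the 1-minute load average before and after it. The helper name `timed_run` and the stand-in command are hypothetical, and load average is only a coarse proxy for CPU usage (on platforms where it is unavailable the sketch records `None`).

```python
import os
import subprocess
import sys
import time


def load_1min():
    """Return the 1-minute load average, or None if the platform lacks it."""
    try:
        return os.getloadavg()[0]
    except (AttributeError, OSError):
        return None


def timed_run(cmd):
    """Run cmd and return its exit code, wall time, and load before/after."""
    before = load_1min()
    start = time.perf_counter()
    result = subprocess.run(cmd, check=False)
    elapsed = time.perf_counter() - start
    return {
        "returncode": result.returncode,
        "elapsed_s": elapsed,
        "load_before": before,
        "load_after": load_1min(),
    }


if __name__ == "__main__":
    # Stand-in command; replace it with however you launch each job set.
    print(timed_run([sys.executable, "-c", "pass"]))
```

If the load before the dataset run is clearly higher than before the sequential file run, the comparison is not like for like and should be repeated on an idle server.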
As balajisr mentioned, you might be running sequentially instead of in parallel. Turn on APT_DUMP_SCORE to see what is really happening at runtime. A dataset, even in sequential mode, should run at about the same speed as a sequential file for what you've described, so something is definitely amiss.
These are the dump score outputs for the sequential file and dataset versions. For 20 source files, the sequential file version takes 48 secs and the dataset version takes 68 secs.
Sequential File version(2 nodes):
main_program: This step has 2 datasets:
ds0: {op0[1p] (sequential Seq_File)
eAny<>eCollectAny
op1[2p] (parallel APT_CombinedOperatorController:sk_Add_SrcID)}
ds1: {op1[2p] (parallel APT_CombinedOperatorController:APT_TransformOperatorImplV2S0_MyJob_xfm1 in xfm1)
>>eCollectAny
op2[1p] (sequential APT_RealFileExportOperator in MasterFile)}
It has 3 operators:
op0[1p] {(sequential Seq_File)
on nodes (
node1[op0,p0]
)}
op1[2p] {(parallel APT_CombinedOperatorController:
(sk_Add_SrcID)
(APT_TransformOperatorImplV2S0_MyJob_xfm1 in xfm1)
) on nodes (
node1[op1,p0]
node2[op1,p1]
)}
op2[1p] {(sequential APT_RealFileExportOperator in MasterFile)
on nodes (
node2[op2,p0]
)}
It runs 4 processes on 2 nodes.
Dataset file version (2 nodes):
main_program: This step has 2 datasets:
ds0: {op0[1p] (sequential Seq_File)
eAny<>eCollectAny
op1[2p] (parallel APT_TransformOperatorImplV6S3_MyJob_xfm1 in xfm1)}
ds1: {op1[2p] (parallel APT_TransformOperatorImplV6S3_MyJob_xfm1 in xfm1)
=>
/Fld1/Fld2/Fld3/MyDS.ds}
It has 2 operators:
op0[1p] {(sequential Seq_File)
on nodes (
node1[op0,p0]
)}
op1[2p] {(parallel APT_TransformOperatorImplV6S3_MyJob_xfm1 in xfm1)
on nodes (
node1[op1,p0]
node2[op1,p1]
)}
It runs 3 processes on 2 nodes.
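To compare the two scores side by side, here is a minimal Python sketch that pulls the operator headers out of each score and totals the partitions. The operator sections are copied from the output above. The regular expression and the names `summarize` and `process_count` are my own assumptions about this layout, not part of any DataStage tooling.

```python
import re

# Operator sections copied from the two scores above.
SEQ_SCORE = """\
It has 3 operators:
op0[1p] {(sequential Seq_File)
on nodes (
node1[op0,p0]
)}
op1[2p] {(parallel APT_CombinedOperatorController:
(sk_Add_SrcID)
(APT_TransformOperatorImplV2S0_MyJob_xfm1 in xfm1)
) on nodes (
node1[op1,p0]
node2[op1,p1]
)}
op2[1p] {(sequential APT_RealFileExportOperator in MasterFile)
on nodes (
node2[op2,p0]
)}
It runs 4 processes on 2 nodes.
"""

DS_SCORE = """\
It has 2 operators:
op0[1p] {(sequential Seq_File)
on nodes (
node1[op0,p0]
)}
op1[2p] {(parallel APT_TransformOperatorImplV6S3_MyJob_xfm1 in xfm1)
on nodes (
node1[op1,p0]
node2[op1,p1]
)}
It runs 3 processes on 2 nodes.
"""

# Matches operator header lines such as "op1[2p] {(parallel SomeOperator)".
OP_HEADER = re.compile(
    r"^(op\d+)\[(\d+)p\] \{\((sequential|parallel) (.+)$", re.MULTILINE
)


def summarize(score):
    """Return a list of (operator id, partitions, mode, name) tuples."""
    ops = []
    for op_id, parts, mode, name in OP_HEADER.findall(score):
        # Drop the trailing ")" or ":" left over from the header line.
        ops.append((op_id, int(parts), mode, name.rstrip("):")))
    return ops


def process_count(ops):
    """Total partitions across operators, i.e. the number of processes."""
    return sum(parts for _, parts, _, _ in ops)


for label, score in (("sequential file", SEQ_SCORE), ("dataset", DS_SCORE)):
    ops = summarize(score)
    print(label, "->", process_count(ops), "processes")
    for op in ops:
        print("  ", op)
```

The partition totals come out to 4 and 3, matching the "It runs N processes" lines. Both versions run the Transformer 2-way parallel, so neither job is falling back to a fully sequential run. The operator names also differ between the two scores (sk_Add_SrcID appears only in the sequential file version, and the Transformer implementation names differ), which may mean the two compiled jobs are not exactly identical.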
My APT_CONFIG_FILE is located in /home/dsadm/Ascential/DataStage/Configurations. Both versions output the file to the same folder.
This does not make sense to me. I would think that performance would at least be the same. I am starting to doubt the need for datasets. I am not seeing any benefit other than being able to store larger files across my multiple disks.
Arnd, even in sequential mode, shouldn't a dataset be quicker than a sequential file, at least theoretically? A dataset is written in native format, so no conversion to ASCII is necessary.
splayer, moreover, a benchmark on data that takes only a few seconds to process will not give an exact result. Check the startup time and production time for each case, because both will be on the order of seconds.
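To put numbers on that, here is a small Python sketch of the per-run arithmetic. It assumes the loop runs the job once per source file (20 runs), which the thread implies but does not state. The totals are the ones reported above.

```python
def per_run_seconds(total_seconds, runs):
    """Average wall time per job run."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return total_seconds / runs


RUNS = 20            # assumption: one job run per source file
SEQ_TOTAL_S = 48.0   # reported total, sequential file version
DS_TOTAL_S = 68.0    # reported total, dataset version

seq_per_run = per_run_seconds(SEQ_TOTAL_S, RUNS)
ds_per_run = per_run_seconds(DS_TOTAL_S, RUNS)
extra_per_run = per_run_seconds(DS_TOTAL_S - SEQ_TOTAL_S, RUNS)

print(f"sequential file: {seq_per_run:.1f} s per run")
print(f"dataset:         {ds_per_run:.1f} s per run")
print(f"difference:      {extra_per_run:.1f} s per run")
```

Under that assumption, each run takes only a few seconds and the gap is about one second per run, so fixed per-run overhead could plausibly account for it. The startup and production times in the job log are what would confirm or rule that out.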