Job using dataset files is slower than sequential files

splayer · Post by **splayer** » Wed Feb 07, 2007 11:43 am

I created 2 sets of jobs. Both sets are exactly identical. Both sets have loops. Here is the loop code:

StartLoop --> ExecCmd1 --> JobActivity --> ExecCmd2 --> EndLoop

There is another link from the EndLoop to StartLoop. The job in the JobActivity stage is:
SeqFile --> Modify --> SurrogateKeyGenerator --> Transformer --> TargetFile

The difference between the 2 sets is that the TargetFile is a sequential file in one and a data set file in the other. The set of jobs with the sequential file is significantly faster than the set with data set files. I would think that just the reverse should be true. Is there any config file manipulation that I can do?

us1aslam1us · Post by **us1aslam1us** » Wed Feb 07, 2007 12:24 pm

Use job monitoring to check whether the move from the Transformer to the target file (DataSet) is taking more time, or the overall process. Is it running in sequential mode?

patonp · Post by **patonp** » Wed Feb 07, 2007 12:49 pm

Is there any chance your problem is related to the following post?

http://dsxchange.com/viewtopic.php?t=97 ... e5057e602f

splayer · Post by **splayer** » Wed Feb 07, 2007 10:49 pm

I did job monitoring. There is nothing specific that I can find there. I tried changing the config file from a 2 node file to a 4 node file. It does not split the transformer processing into 4 nodes, which is kind of strange.

kumar_s · Post by **kumar_s** » Wed Feb 07, 2007 11:07 pm

Is the environment idle in both cases? Your server might be loaded in the latter case.
There may be a chance that the additional node might not be easily accessible for the dataset to write its data to. For testing, you can try specifying a single node, the one where the sequential file is created.

splayer · Post by **splayer** » Thu Feb 08, 2007 12:27 am

kumar_s, can you elaborate a little bit? What does "environment idle" mean? Mine is a dev environment and I have just one box but 4 processors, from what I know. If I use 4 nodes, shouldn't I see 4 instances for the transformer in job monitor window?

balajisr · Post by **balajisr** » Thu Feb 08, 2007 12:32 am

Are you by any chance running the transformer in sequential mode?

kumar_s · Post by **kumar_s** » Thu Feb 08, 2007 12:48 am

Your server might be busy with other work when you are testing with the dataset and comparatively idle when you process the sequential file. This would make your dataset job run slower. You can measure the CPU usage in both cases.

ArndW · Post by **ArndW** » Thu Feb 08, 2007 3:55 am

As balajisr mentioned, you might be running sequentially instead of in parallel. Turn on APT_DUMP_SCORE to see what is really happening at runtime. A dataset, even in sequential mode, should run at about the same speed as a sequential file for what you've described, so something is definitely amiss.

splayer · Post by **splayer** » Thu Feb 08, 2007 10:20 am

No, I am running in parallel as I am seeing xfm x 2 in the monitor where xfm is the transformer. I set APT_DUMP_SCORE to True. I don't see anything additional in the log. Shouldn't I see the output in the log?

splayer · Post by **splayer** » Thu Feb 08, 2007 1:26 pm

These are the outputs from the score dump for the sequential and dataset file versions. For 20 source files, the sequential file version takes 48 secs and the dataset file version takes 68 secs.

Sequential file version (2 nodes):

main_program: This step has 2 datasets:
ds0: {op0[1p] (sequential Seq_File)
eAny<>eCollectAny
op1[2p] (parallel APT_CombinedOperatorController:sk_Add_SrcID)}
ds1: {op1[2p] (parallel APT_CombinedOperatorController:APT_TransformOperatorImplV2S0_MyJob_xfm1 in xfm1)
>>eCollectAny
op2[1p] (sequential APT_RealFileExportOperator in MasterFile)}
It has 3 operators:
op0[1p] {(sequential Seq_File)
on nodes (
node1[op0,p0]
)}
op1[2p] {(parallel APT_CombinedOperatorController:
(sk_Add_SrcID)
(APT_TransformOperatorImplV2S0_MyJob_xfm1 in xfm1)
) on nodes (
node1[op1,p0]
node2[op1,p1]
)}
op2[1p] {(sequential APT_RealFileExportOperator in MasterFile)
on nodes (
node2[op2,p0]
)}
It runs 4 processes on 2 nodes.

Dataset file version (2 nodes):

main_program: This step has 2 datasets:
ds0: {op0[1p] (sequential Seq_File)
eAny<>eCollectAny
op1[2p] (parallel APT_TransformOperatorImplV6S3_MyJob_xfm1 in xfm1)}
ds1: {op1[2p] (parallel APT_TransformOperatorImplV6S3_MyJob_xfm1 in xfm1)
=>
/Fld1/Fld2/Fld3/MyDS.ds}
It has 2 operators:
op0[1p] {(sequential Seq_File)
on nodes (
node1[op0,p0]
)}
op1[2p] {(parallel APT_TransformOperatorImplV6S3_MyJob_xfm1 in xfm1)
on nodes (
node1[op1,p0]
node2[op1,p1]
)}
It runs 3 processes on 2 nodes.
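
For anyone comparing the two dumps side by side, here is a minimal Python sketch that pulls the operator headers out of a score dump and adds up the partition counts. The helper names `summarize_score` and `total_processes` are made up for illustration, and the regular expression only assumes the `opN[Mp] {(sequential|parallel ...` header layout seen in these two dumps; other DataStage versions may format the score differently.

```python
import re

# Operator sections copied from the two score dumps in this post.
SEQFILE_SCORE = """\
It has 3 operators:
op0[1p] {(sequential Seq_File)
on nodes (
node1[op0,p0]
)}
op1[2p] {(parallel APT_CombinedOperatorController:
(sk_Add_SrcID)
(APT_TransformOperatorImplV2S0_MyJob_xfm1 in xfm1)
) on nodes (
node1[op1,p0]
node2[op1,p1]
)}
op2[1p] {(sequential APT_RealFileExportOperator in MasterFile)
on nodes (
node2[op2,p0]
)}
"""

DATASET_SCORE = """\
It has 2 operators:
op0[1p] {(sequential Seq_File)
on nodes (
node1[op0,p0]
)}
op1[2p] {(parallel APT_TransformOperatorImplV6S3_MyJob_xfm1 in xfm1)
on nodes (
node1[op1,p0]
node2[op1,p1]
)}
"""

# Assumed header layout: "op1[2p] {(parallel SomeOperatorName ...".
# Lines in the dataset section ("ds0: {op0[1p] (...") do not match,
# because they either start with "dsN:" or lack the "{(" after "]".
OP_HEADER = re.compile(
    r"^(op\d+)\[(\d+)p\]\s*\{\((sequential|parallel)\s+(.*)$",
    re.MULTILINE,
)


def summarize_score(score_text):
    """Return one dict per operator header found in the score text."""
    ops = []
    for match in OP_HEADER.finditer(score_text):
        name, partitions, mode, rest = match.groups()
        ops.append({
            "op": name,
            "partitions": int(partitions),
            "mode": mode,
            # Drop the trailing ")" or ":" left over from the header line.
            "label": rest.strip().rstrip(":)"),
        })
    return ops


def total_processes(ops):
    """Sum the partition counts, i.e. one process per operator partition."""
    return sum(op["partitions"] for op in ops)


for title, text in (("sequential file", SEQFILE_SCORE), ("dataset", DATASET_SCORE)):
    ops = summarize_score(text)
    print(f"{title}: {total_processes(ops)} processes")
    for op in ops:
        print(f"  {op['op']} x{op['partitions']} {op['mode']:<10} {op['label']}")
```

The totals agree with the "It runs 4 processes" and "It runs 3 processes" lines in the dumps. The listing also makes the structural difference easy to spot: the sequential file version has an extra sequential export operator (op2), while in the dataset version the two transformer partitions are the last operators in the step.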

ArndW · Post by **ArndW** » Thu Feb 08, 2007 4:15 pm

What happens to the speed of the sequential version if you output it to the same directory as your dataset data files as specified in the APT_CONFIG file? If it slows down then it might be related to the disk partition and not directly to the DS job.

splayer · Post by **splayer** » Thu Feb 08, 2007 4:27 pm

My APT_CONFIG_FILE is located in /home/dsadm/Ascential/DataStage/Configurations. Both versions output the file to the same folder.

This does not make sense to me. I would think that performance would at least be the same. I am having doubts about the need for datasets now. I am not seeing any benefit other than being able to store larger files on my multiple disks.

balajisr · Post by **balajisr** » Thu Feb 08, 2007 11:06 pm

What is your partitioning type when you load into dataset?

kumar_s · Post by **kumar_s** » Thu Feb 08, 2007 11:16 pm

Arnd, even in sequential mode, shouldn't a Dataset be quicker than a Sequential file, at least theoretically? A Dataset is written in native format, so there is no need to convert to ASCII.
splayer, moreover, benchmarking a workload that finishes within a few seconds will not give an exact result. Check the startup time and production time for each case, because these will be in terms of seconds.
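
To put the startup-time point in numbers, here is a rough Python sketch using the totals splayer reported (48 secs and 68 secs for 20 source files). It assumes the loop starts the job once per source file, which the thread implies but does not state outright; the names `SOURCE_FILES`, `TOTAL_SECONDS` and `per_run_seconds` are made up for illustration.

```python
# Totals reported earlier in the thread for 20 source files.
SOURCE_FILES = 20  # assumption: one job run per source file
TOTAL_SECONDS = {"sequential file": 48, "dataset": 68}


def per_run_seconds(total_seconds, runs):
    """Average wall-clock time of a single job run."""
    return total_seconds / runs


per_run = {
    name: per_run_seconds(total, SOURCE_FILES)
    for name, total in TOTAL_SECONDS.items()
}

# Extra time the dataset version spends per run, on average.
extra_per_run = (
    TOTAL_SECONDS["dataset"] - TOTAL_SECONDS["sequential file"]
) / SOURCE_FILES

for name, seconds in per_run.items():
    print(f"{name}: {seconds:.1f} s per run")
print(f"dataset overhead: {extra_per_run:.1f} s per run")
```

Under that assumption each run lasts only a few seconds, and the gap between the two versions is about one second per run. A difference that small is in the range where startup time, rather than the time spent moving rows, could account for it, which is why comparing startup and production time separately is worthwhile.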