Job using dataset files is slower than sequential files


splayer
Charter Member
Posts: 502
Joined: Mon Apr 12, 2004 5:01 pm

Post by splayer »

I created 2 sets of jobs. Both sets are exactly identical. Both sets have loops. Here is the loop code:

StartLoop --> ExecCmd1 --> JobActivity --> ExecCmd2 --> EndLoop

There is another link from the EndLoop to StartLoop. The job in the JobActivity stage is:
SeqFile --> Modify --> SurrogateKeyGenerator --> Transformer --> TargetFile

The difference between the 2 sets is that the TargetFile is a sequential file in one and data set file in the other. The set of jobs with sequential file is significantly faster than the set with data set files. I would think that just the reverse should be true. Is there any config file manipulation that I can do?
us1aslam1us
Charter Member
Posts: 822
Joined: Sat Sep 17, 2005 5:25 pm
Location: USA

Post by us1aslam1us »

Use job monitoring to check whether the time is going into the link from the Transformer to the target file (the dataset) or into the overall process. Is the job running in sequential mode?
I haven't failed, I've found 10,000 ways that don't work.
Thomas Alva Edison(1847-1931)
patonp
Premium Member
Posts: 110
Joined: Thu Mar 11, 2004 7:59 am
Location: Toronto, ON

Post by patonp »

Is there any chance your problem is related to the following post?

http://dsxchange.com/viewtopic.php?t=97 ... e5057e602f
splayer
Charter Member
Posts: 502
Joined: Mon Apr 12, 2004 5:01 pm

Post by splayer »

I did job monitoring. There is nothing specific that I can find there. I tried changing the config file from a 2 node file to a 4 node file. It does not split the transformer processing into 4 nodes, which is kind of strange.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Is the environment idle in both cases? Your server might be more heavily loaded in the latter case.
How easily the dataset can reach the additional node to write its data may also play a part. As a test, you can try a single-node configuration that uses the node where the sequential file is created.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
splayer
Charter Member
Posts: 502
Joined: Mon Apr 12, 2004 5:01 pm

Post by splayer »

kumar_s, can you elaborate a little bit? What does "environment idle" mean? Mine is a dev environment and I have just one box but 4 processors, from what I know. If I use 4 nodes, shouldn't I see 4 instances for the transformer in job monitor window?
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

Are you by any chance running the transformer in sequential mode?
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Your server might be busy with other work when you test with the dataset and comparatively idle when you process the sequential file. That would make the dataset version run slower. You can measure the CPU usage in both cases.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

As balajisr mentioned, you might be running sequentially instead of in parallel. Turn on APT_DUMP_SCORE to see what is really happening at runtime. A dataset, even in sequential mode, should run at about the same speed as a sequential file for what you've described, so something is definitely amiss.
splayer
Charter Member
Posts: 502
Joined: Mon Apr 12, 2004 5:01 pm

Post by splayer »

No, I am running in parallel as I am seeing xfm x 2 in the monitor where xfm is the transformer. I set APT_DUMP_SCORE to True. I don't see anything additional in the log. Shouldn't I see the output in the log?
splayer
Charter Member
Posts: 502
Joined: Mon Apr 12, 2004 5:01 pm

Post by splayer »

These are the outputs from the dump score for the sequential file and dataset versions. For 20 source files, the sequential file version takes 48 secs and the dataset version takes 68 secs.

Sequential File version(2 nodes):

main_program: This step has 2 datasets:
ds0: {op0[1p] (sequential Seq_File)
eAny<>eCollectAny
op1[2p] (parallel APT_CombinedOperatorController:sk_Add_SrcID)}
ds1: {op1[2p] (parallel APT_CombinedOperatorController:APT_TransformOperatorImplV2S0_MyJob_xfm1 in xfm1)
>>eCollectAny
op2[1p] (sequential APT_RealFileExportOperator in MasterFile)}
It has 3 operators:
op0[1p] {(sequential Seq_File)
on nodes (
node1[op0,p0]
)}
op1[2p] {(parallel APT_CombinedOperatorController:
(sk_Add_SrcID)
(APT_TransformOperatorImplV2S0_MyJob_xfm1 in xfm1)
) on nodes (
node1[op1,p0]
node2[op1,p1]
)}
op2[1p] {(sequential APT_RealFileExportOperator in MasterFile)
on nodes (
node2[op2,p0]
)}
It runs 4 processes on 2 nodes.


Dataset file version (2 nodes):

main_program: This step has 2 datasets:
ds0: {op0[1p] (sequential Seq_File)
eAny<>eCollectAny
op1[2p] (parallel APT_TransformOperatorImplV6S3_MyJob_xfm1 in xfm1)}
ds1: {op1[2p] (parallel APT_TransformOperatorImplV6S3_MyJob_xfm1 in xfm1)
=>
/Fld1/Fld2/Fld3/MyDS.ds}
It has 2 operators:
op0[1p] {(sequential Seq_File)
on nodes (
node1[op0,p0]
)}
op1[2p] {(parallel APT_TransformOperatorImplV6S3_MyJob_xfm1 in xfm1)
on nodes (
node1[op1,p0]
node2[op1,p1]
)}
It runs 3 processes on 2 nodes.
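
To compare the two score dumps above side by side, a rough Python sketch like the following can pull out each operator, its partition count, and the process summary. The regular expressions are an assumption based only on the two dumps in this thread, not a documented score format, and `summarize_score` is a hypothetical helper name.

```python
import re

# Operator header lines in the dumps above look like:
#   op1[2p] {(parallel APT_CombinedOperatorController:
# Both patterns are assumptions derived from the two dumps in this thread.
OP_LINE = re.compile(r"^(op\d+)\[(\d+)p\]\s*\{\((sequential|parallel)\s*(.*)$")
SUMMARY = re.compile(r"It runs (\d+) processes on (\d+) nodes")


def summarize_score(score_text):
    """Return (operators, processes, nodes) parsed from a score dump."""
    operators = []
    processes = nodes = None
    for raw in score_text.splitlines():
        line = raw.strip()
        m = OP_LINE.match(line)
        if m:
            name, partitions, mode, rest = m.groups()
            operators.append((name, int(partitions), mode, rest.rstrip(")")))
            continue
        m = SUMMARY.search(line)
        if m:
            processes, nodes = int(m.group(1)), int(m.group(2))
    return operators, processes, nodes


# Operator sections copied from the two dumps above.
SEQ_SCORE = """\
It has 3 operators:
op0[1p] {(sequential Seq_File)
on nodes (
node1[op0,p0]
)}
op1[2p] {(parallel APT_CombinedOperatorController:
(sk_Add_SrcID)
(APT_TransformOperatorImplV2S0_MyJob_xfm1 in xfm1)
) on nodes (
node1[op1,p0]
node2[op1,p1]
)}
op2[1p] {(sequential APT_RealFileExportOperator in MasterFile)
on nodes (
node2[op2,p0]
)}
It runs 4 processes on 2 nodes.
"""

DS_SCORE = """\
It has 2 operators:
op0[1p] {(sequential Seq_File)
on nodes (
node1[op0,p0]
)}
op1[2p] {(parallel APT_TransformOperatorImplV6S3_MyJob_xfm1 in xfm1)
on nodes (
node1[op1,p0]
node2[op1,p1]
)}
It runs 3 processes on 2 nodes.
"""

seq_ops, seq_procs, seq_nodes = summarize_score(SEQ_SCORE)
ds_ops, ds_procs, ds_nodes = summarize_score(DS_SCORE)

# Operators that appear only in the sequential file version.
ds_names = {op[0] for op in ds_ops}
extra = [op for op in seq_ops if op[0] not in ds_names]

print("sequential file version:", seq_procs, "processes on", seq_nodes, "nodes")
print("dataset version:", ds_procs, "processes on", ds_nodes, "nodes")
print("only in sequential file version:", extra)
```

The only structural difference it reports is the extra sequential export operator (op2) in the sequential file version; both versions run the transformer on two partitions.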
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

What happens to the speed of the sequential version if you output it to the same directory as your dataset data files, as specified in the APT_CONFIG_FILE configuration file? If it slows down, then it might be related to the disk partition and not directly to the DS job.
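
One rough way to check the disk side independently of DataStage is to time a plain write into each directory. The following is a minimal Python sketch under stated assumptions: `time_write` is a hypothetical helper, and the directory list is a placeholder to be replaced with the sequential file's target directory and the dataset data directory named in the configuration file.

```python
import os
import tempfile
import time


def time_write(directory, size_mb=16, chunk_mb=1):
    """Write size_mb of random bytes to a temp file in `directory`; return seconds."""
    # Random data avoids filesystems that optimise away runs of zeros.
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    start = time.perf_counter()
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(size_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # push the data to the device so it is included in the timing
    return time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder only: substitute the two directories being compared.
    for directory in [tempfile.gettempdir()]:
        print(directory, round(time_write(directory), 3), "s")
```

A single run like this is noisy, so repeating it a few times per directory gives a fairer picture.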
splayer
Charter Member
Posts: 502
Joined: Mon Apr 12, 2004 5:01 pm

Post by splayer »

My APT_CONFIG_FILE is located in /home/dsadm/Ascential/DataStage/Configurations. Both versions output the file to the same folder.

This does not make sense to me. I would think that performance would at least be the same. I am having doubts about the need for datasets now. I am not seeing any benefit other than being able to store larger files on my multiple disks.
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

What is your partitioning type when you load into dataset?
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Arnd, even in sequential mode, shouldn't a dataset be quicker than a sequential file, at least theoretically? A dataset is written in native format, so there is no need to convert the data to ASCII.
splayer, moreover, benchmarking a run that processes its data within a few seconds will not give an exact result. Check the startup time and the production run time for each case, because these will be in terms of seconds.
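
As rough arithmetic on the figures posted earlier (20 source files, 48 secs versus 68 secs), the per-run difference can be worked out as below. This assumes the loop runs the job once per source file and that the totals cover the whole sequence, neither of which is confirmed in the thread.

```python
runs = 20          # assumed: one job run per source file
seq_total_s = 48   # sequential file version, total seconds
ds_total_s = 68    # dataset version, total seconds

seq_per_run = seq_total_s / runs                   # seconds per run, sequential file
ds_per_run = ds_total_s / runs                     # seconds per run, dataset
extra_per_run = (ds_total_s - seq_total_s) / runs  # extra seconds per run with the dataset

print(f"sequential file: {seq_per_run:.1f} s per run")
print(f"dataset:         {ds_per_run:.1f} s per run")
print(f"difference:      {extra_per_run:.1f} s per run")
```

A gap of about one second per run is small enough to come from startup overhead rather than from the rate at which rows are written, which is why the startup and production times need to be looked at separately.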