process big data

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

wuruima
Participant
Posts: 65
Joined: Mon Nov 04, 2013 10:15 pm

process big data

Post by wuruima »

Hi,

I have a problem: my sequence job takes too long to run.

It runs many parallel jobs in order, and because the input files are very big, the temporary files landed between jobs are also very big, so the jobs spend a long time reading and writing them.

In this situation, what should I do to tune the jobs so they run faster?

I am thinking about combining the parallel jobs into one parallel job to reduce the read/write, but I'm not sure this is a good idea.

Please kindly advise me what to do. Thanks, experts.
walter/
wuruimao
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

This question is almost impossible to answer without knowing a lot more about your jobs and your environment setup.

But a starting place would be to ensure that your job designs avoid unnecessary re-partitioning and sorting of data.
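To see why re-partitioning is expensive, here is a toy sketch outside DataStage (hypothetical Python, not DataStage syntax): when two consecutive keyed stages use the same hash partitioning, every record is already on the right partition and nothing moves; keying the second stage on a different column forces most records to be reshuffled across partitions.

```python
# Toy illustration of hash partitioning: records hashed on a key stay in
# the same partition, so consecutive stages keyed alike need no data movement.

def hash_partition(records, key, n_parts):
    """Assign each record to a partition by hashing the given key."""
    parts = [[] for _ in range(n_parts)]
    for rec in records:
        parts[hash(rec[key]) % n_parts].append(rec)
    return parts

records = [{"cust": c, "region": c % 3} for c in range(100)]

by_cust = hash_partition(records, "cust", 4)

# Next stage keyed on the SAME column: every record is already in place.
moved_same_key = sum(
    1 for i, part in enumerate(by_cust)
    for rec in part if hash(rec["cust"]) % 4 != i
)

# Next stage keyed on a DIFFERENT column: most records must move.
moved_new_key = sum(
    1 for i, part in enumerate(by_cust)
    for rec in part if hash(rec["region"]) % 4 != i
)

print(moved_same_key)  # 0 -- no repartitioning needed
print(moved_new_key)   # a large fraction of the 100 records must be reshuffled
```

The same logic is why Shane's advice matters: each avoidable repartition (and the sort that often gets inserted with it) is a full pass over the data.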
rkashyap
Premium Member
Posts: 532
Joined: Fri Dec 02, 2011 12:02 pm
Location: Richmond VA

Re: process big data

Post by rkashyap »

What are the constraints that make combining the parallel jobs into one parallel job not a good idea?

When you land data in temporary files [for use by subsequent jobs], you lose the advantage of pipeline parallelism. Where possible, combine the jobs to minimize the use of temporary files. If temporary files are unavoidable, use datasets, as this helps avoid repartitioning and re-sorting the data (as Shane advised above). Also see this blog.
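The pipeline-parallelism point generalises beyond DataStage: when each stage streams rows straight to the next, the downstream stage starts working before the upstream one finishes, whereas landing to a file forces stage 2 to wait until stage 1 has written every row, and every intermediate row is paid for twice (one write plus one read). A toy Python sketch (hypothetical, not DataStage syntax):

```python
import os
import tempfile

def extract():
    """Stage 1: produce rows one at a time (a stream)."""
    for i in range(5):
        yield i

def transform(rows):
    """Stage 2: consume rows as they arrive -- pipeline parallelism."""
    for r in rows:
        yield r * 10

# Pipelined: transform() sees row 0 while extract() still has rows to emit;
# no intermediate data ever touches disk.
pipelined = list(transform(extract()))

# Landed: stage 1 must finish writing the temporary file before stage 2
# can even open it, and the whole intermediate set is written then re-read.
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt") as f:
    for r in extract():
        f.write(f"{r}\n")
    path = f.name

with open(path) as f:
    landed = [int(line) * 10 for line in f]
os.remove(path)

print(pipelined == landed)  # True -- same result, but the landed version
                            # adds a full write/read of the intermediate data
```

This is the trade-off behind combining jobs: one parallel job keeps the stream alive end to end, while a job boundary always lands the data.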