
process big data

Posted: Mon Oct 19, 2015 2:05 am
by wuruima
Hi,

I have a problem: my job sequence takes too long to run.

Many parallel jobs run in order, and because the input files are very large, the temporary files landed between jobs become so big that the jobs need a long time to read and write them.

In this situation, what should I do to tune the jobs so that they process faster?

I am thinking about combining the parallel jobs into one parallel job so that I can reduce the read/write overhead, but this does not seem like a good idea.

Please kindly advise me what to do. Thanks, experts.
walter/

Posted: Mon Oct 19, 2015 2:31 am
by ShaneMuir
This question is almost impossible to answer without knowing a lot more about your jobs and your environment setup.

But a starting place would be to ensure that your job designs are avoiding unnecessary re-partitioning and sorting of data.


Posted: Mon Oct 19, 2015 10:12 am
by rkashyap
What are the constraints that make combining the parallel jobs into one parallel job not a good idea?

When you land data in temporary files [for use by subsequent jobs], you lose the advantage of pipeline parallelism. When possible, combine the jobs to minimize the use of temporary files. If temporary files are unavoidable, use datasets, as this helps avoid repartitioning and re-sorting the data (as Shane advised above). Also see this blog.
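The pipeline-parallelism point can be sketched outside DataStage. In a staged design, each job writes its full output to disk before the next job starts; in a pipelined design, records stream through all stages concurrently. A minimal Python sketch using generators (the stage functions here are hypothetical stand-ins for parallel-job stages, not DataStage APIs):

```python
# Sketch: pipelined stages vs. landing intermediate results to disk.
# extract/transform/load are hypothetical stand-ins for job stages.

def extract():
    # Source stage: yields records one at a time (streaming).
    for i in range(1_000_000):
        yield i

def transform(records):
    # Intermediate stage: processes each record as it arrives,
    # without waiting for the full upstream output to be written.
    for r in records:
        yield r * 2

def load(records):
    # Final stage: consumes the stream; nothing is landed to disk.
    return sum(records)

# Pipelined: all three stages overlap over the record stream.
result = load(transform(extract()))

# Staged equivalent (what landing temp files between jobs amounts to):
# staged = list(extract())          # write everything out...
# staged = [r * 2 for r in staged]  # ...read it all back, write again...
# result = sum(staged)              # ...read it all back once more.
```

The pipelined version never materializes the full intermediate result, which is the same saving you get inside a single parallel job compared with landing files between jobs.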