Hi,
I have a problem: a sequence of jobs takes too long to run. There are many parallel jobs that must run in order, and because the input files are very large, the temporary files landed between jobs become so big that reading and writing them takes most of the run time.
In this situation, what should I do to tune the jobs so they process faster?
I was thinking of combining the parallel jobs into one parallel job to reduce the read/write, but I don't think this is a good idea.
Please kindly advise me how to proceed. Thanks, experts.
walter/
process big data
Re: process big data
What are the constraints that make combining the parallel jobs into one parallel job not a good idea?
When you land data in temporary files [for use by subsequent jobs], you lose the advantage of pipeline parallelism. When possible, combine the jobs to minimize the use of temporary files. If temporary files are unavoidable, use datasets rather than flat files, as this helps avoid repartitioning and re-sorting the data (as Shane advised above). Also see this blog.
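To illustrate what losing pipeline parallelism means, here is a minimal conceptual sketch in plain Python (not DataStage code; the stage names are made up for illustration). In the "landed" flow, each stage materializes its full output, like writing a temporary file, before the next stage starts; in the pipelined flow, records stream through all stages as they are produced, so downstream stages overlap with upstream ones and no intermediate copy of the data is written:

```python
# Conceptual sketch only: contrasting a landed flow (full intermediate
# materialized between stages) with a pipelined flow (stages overlap).
# Stage names (extract/transform/load) are illustrative, not DataStage APIs.

def extract():
    # Source stage: produce rows one at a time.
    for i in range(5):
        yield i

def transform(rows):
    # Middle stage: consume and emit rows as they arrive.
    for r in rows:
        yield r * 2

def load(rows):
    # Target stage: collect the final rows.
    return list(rows)

# Landed: the whole intermediate is built first (like a temp file),
# then re-read by the next stage -- two full passes over the data.
landed_intermediate = list(transform(extract()))
landed_result = load(iter(landed_intermediate))

# Pipelined: no intermediate is materialized; each row flows straight
# through all three stages.
pipelined_result = load(transform(extract()))

print(landed_result)     # [0, 2, 4, 6, 8]
print(pipelined_result)  # [0, 2, 4, 6, 8]
```

Both flows produce the same result; the difference is that the landed version pays for a full write and re-read of the intermediate, which is exactly the cost walter is seeing with large temporary files.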