Is sorting preserved across multiple stages in parallel jobs?

ssunda6 · Post by **ssunda6** » Tue Mar 25, 2008 7:38 am

Hi All,

My job needs to check whether employees' job activities are valid.
For example, an employee's first task must be a particular job code, not 'OUT' or 'Lunch'.
The last task must always be 'OUT'. There are many more conditions like these.

To implement this, after reading the file in a parallel job, I sort on the key columns (employee number, business date, activity timestamp) using an explicit Sort stage and then pass the data on to other stages. Is it guaranteed that the sort order will be preserved across multiple stages?

I apply these conditions in a Transformer a few stages downstream. In all the intermediate stages, the sort is preserved and partitioning is left at the default (propagate). My whole logic depends on the data being sorted, so I want to be sure this is guaranteed to work.

One more question: I can implement the job in two ways.
First, I can copy the whole data set (1 million rows) to three output links from a Transformer, apply some conditions on the first link, others on the second and the rest on the third, and then funnel the data to the output.
Alternatively, I can handle all the conditions in a single Transformer instead of routing the data to three output links, but the conditions become somewhat complex. Would there be a significant difference in performance (run time) between these two approaches?

Please let me know your thoughts.

Regards,
Ssunda.

ray.wurlod · Post by **ray.wurlod** » Tue Mar 25, 2008 3:24 pm

Sorting is guaranteed to be preserved unless you repartition the data.
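As a rough illustration, here is a toy model in plain Python (not DataStage code). The helper names `hash_partition`, `round_robin` and `partitions_per_employee`, and the sample rows, are invented for this sketch. It assumes the sorted data is hash-partitioned on employee number, passed through a row-by-row stage, and then repartitioned round robin. Real repartitioning behaviour depends on the method chosen, so treat this only as a picture of why order within a key can be lost.

```python
# Toy model of partitioned data flow; not DataStage code.
# Rows are (employee_number, timestamp, task), already sorted on the first two.
rows = [
    (101, "08:00", "JOB_A"),
    (101, "12:00", "LUNCH"),
    (101, "17:00", "OUT"),
    (102, "09:00", "JOB_B"),
    (102, "13:00", "LUNCH"),
    (102, "18:00", "OUT"),
]

def hash_partition(rows, n):
    """Send all rows of one employee to the same partition, in arrival order."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[0]) % n].append(row)
    return parts

def round_robin(rows, n):
    """Deal rows out one at a time, ignoring the key."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def partitions_per_employee(parts):
    """Map each employee number to the set of partitions holding its rows."""
    seen = {}
    for index, part in enumerate(parts):
        for row in part:
            seen.setdefault(row[0], set()).add(index)
    return seen

# Sorted data, hash-partitioned on the employee number.
hashed = hash_partition(rows, 2)

# A row-by-row stage that keeps the partitioning leaves the order untouched.
passed_through = [[(emp, ts, task.lower()) for emp, ts, task in part] for part in hashed]

# Repartitioning on something other than the key scatters each employee's rows.
flat = [row for part in passed_through for row in part]
scattered = round_robin(flat, 2)

print(partitions_per_employee(passed_through))
print(partitions_per_employee(scattered))
```

In the first printout every employee maps to a single partition, so a per-partition check sees that employee's first and last task together. In the second, at least one employee is spread over both partitions.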

Your first option would require a Join stage rather than a Funnel stage, otherwise you'll get three copies of each source row. Your second approach (all in one Transformer) may be quite efficient - make your derivation expressions as efficient as possible.
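For the all-in-one approach, the logic of a single pass over sorted data could look roughly like the sketch below. It is written in Python purely to show the idea and is not Transformer derivation syntax. The function name `validate`, the row layout and the sample rows are assumptions made for this example, and only the two conditions mentioned in the question are covered.

```python
from itertools import groupby

# Tasks that may not open an employee's day (from the question: 'OUT' or 'Lunch').
INVALID_FIRST = {"OUT", "LUNCH"}

def validate(rows):
    """Check first/last task per (employee, business date) in one pass.

    rows: (employee, business_date, timestamp, task) tuples, already sorted
    on the first three fields. groupby relies on that sort order.
    """
    problems = []
    for (emp, day), group in groupby(rows, key=lambda r: (r[0], r[1])):
        tasks = [r[3] for r in group]
        if tasks[0].upper() in INVALID_FIRST:
            problems.append((emp, day, "first task is " + tasks[0]))
        if tasks[-1].upper() != "OUT":
            problems.append((emp, day, "last task is not OUT"))
    return problems

# Made-up sample rows, sorted on employee, business date, timestamp.
sample = [
    (101, "2008-03-24", "08:00", "JOB_A"),
    (101, "2008-03-24", "12:00", "Lunch"),
    (101, "2008-03-24", "17:00", "OUT"),
    (102, "2008-03-24", "08:30", "Lunch"),
    (102, "2008-03-24", "17:00", "JOB_B"),
]

for problem in validate(sample):
    print(problem)
```

The grouping only works because rows for one employee and date arrive together and in timestamp order, which is the same dependency on sorted, correctly partitioned input described in the question.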

ssunda6 · Post by **ssunda6** » Tue Mar 25, 2008 10:39 pm

Hi Ray,

Thanks for the reply.
I was worried because it is a parallel job, but your answer has cleared up my doubt.
I forgot to mention that when using three Transformers and a Funnel, I also use a Remove Duplicates stage, so I will not get three copies of the data.

Thanks again.
Ssunda.

ray.wurlod · Post by **ray.wurlod** » Tue Mar 25, 2008 11:29 pm

You may still get multiple copies if your data are not partitioned on the keys specified in the Remove Duplicates stage.
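A small Python model of why that happens (not DataStage code; `partition`, `remove_duplicates` and the sample rows are invented for this sketch). It assumes duplicates are removed independently inside each partition, so copies of the same row that land in different partitions never meet.

```python
# Toy model; not DataStage code. Rows are (employee_number, timestamp).
link = [(101, "08:00"), (102, "09:00"), (103, "10:00")]
funnelled = link + link + link  # three identical links funnelled together

def partition(rows, n, choose):
    """Split rows into n partitions; choose(i, row) picks the partition."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[choose(i, row) % n].append(row)
    return parts

def remove_duplicates(parts):
    """Sort and de-duplicate each partition on its own, then collect the results."""
    collected = []
    for part in parts:
        previous = None
        for row in sorted(part):
            if row != previous:
                collected.append(row)
            previous = row
    return collected

# Partition by row position: the key is ignored, so copies get separated.
by_position = partition(funnelled, 2, lambda i, row: i)

# Partition by a hash of the employee number: copies stay together.
by_key = partition(funnelled, 2, lambda i, row: hash(row[0]))

print(len(remove_duplicates(by_position)))
print(len(remove_duplicates(by_key)))
```

With key-based partitioning the result has one row per source row. With position-based partitioning some rows survive once per partition.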