Partition and Dataset

mavrick21 · Post by **mavrick21** » Thu Jan 17, 2008 7:19 am

Hi,

I have 2 jobs. Output of first job is a Dataset which is the input to the 2nd job.

The 2nd job looks like this

                              -------
                              | DB2 |
                              -------
                                 |
                             ---------
                             | Sort2 |
                             ---------
                                 |
| DS of 1 | --> | Sort1 | --> | join | --> | output |

In the 2nd job shown above, the Sort1 stage hash partitions and sorts the data on the join keys.

My question is: can we avoid the Sort1 stage in the 2nd job?
If, in my 1st job, I hash partition and sort the data on the keys used by the 2nd job while writing to the output Dataset, can I avoid the Sort1 stage in the 2nd job?

1. Will the partitioning and sort order be maintained in a Dataset?
2. Will the partitioning and sort order be maintained in a Dataset when it is read with a different Configuration file?

Thanks in advance for your comments

throbinson · Post by **throbinson** » Thu Jan 17, 2008 8:19 am

1. Yes.
2. No.

The Dataset *.ds file records the configuration used to build it. When the Dataset is used as a source, this information is passed on to the second job. This makes sense: if a 4-node Configuration file is used to create a Dataset, there will generally be 4 data files that make up the Dataset. If the second job uses a 2-node Configuration file, the data has to be re-partitioned to get it out of the 4 data files and into 2 partitions.
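
To illustrate the 4-node versus 2-node point, here is a minimal Python sketch (not DataStage code). It assumes a simple hash-modulo partitioner; the names `hash_partition`, `partition_rows` and `repartition` are made up for the illustration, and the engine's real hash function will differ.

```python
import zlib

def hash_partition(key, num_partitions):
    """Map a key to a partition number (illustrative hash, not the engine's)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_rows(rows, num_partitions):
    """Distribute rows into num_partitions lists by hashing the 'key' field."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash_partition(row["key"], num_partitions)].append(row)
    return partitions

def repartition(partitions, num_partitions):
    """Re-hash rows from an existing layout into a new number of partitions."""
    all_rows = [row for part in partitions for row in part]
    return partition_rows(all_rows, num_partitions)

# A Dataset written with a (hypothetical) 4-node layout...
rows = [{"key": f"K{i}"} for i in range(8)]
four_node = partition_rows(rows, 4)

# ...has to be redistributed before a 2-node job can use it.
two_node = repartition(four_node, 2)

print([len(p) for p in four_node], [len(p) for p in two_node])
```

No rows are lost in the move, but the layout written by the first job is no longer the layout the second job works with.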

mavrick21 · Post by **mavrick21** » Thu Jan 17, 2008 11:44 pm

Thanks throbinson for your reply.

I removed the Sort1 stage and ran the 2nd job, but the number of records after the join was different.

Does this mean that a Dataset maintains partitioning only, but not sort order?

Maveric · Post by **Maveric** » Fri Jan 18, 2008 12:01 am

Sort order does not affect the number of records coming out of the join. Partitioning does. What partitioning did you use in job 1 when creating the Dataset, and what partitioning are you using on the join?
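
As a rough illustration of how partitioning can change the record count, the Python sketch below (not DataStage code) joins two inputs partition by partition, the way a parallel join does on each node. The functions `partition` and `partitionwise_join`, the sample rows, and both partitioning rules are assumptions made up for the example.

```python
def partition(rows, num_partitions, part_fn):
    """Split (key, value) rows into partitions using part_fn(key)."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[part_fn(key) % num_partitions].append((key, value))
    return parts

def partitionwise_join(left_parts, right_parts):
    """Inner join where partition i of the left only sees partition i of the right."""
    out = []
    for left, right in zip(left_parts, right_parts):
        lookup = {}
        for key, value in right:
            lookup.setdefault(key, []).append(value)
        for key, value in left:
            for other in lookup.get(key, []):
                out.append((key, value, other))
    return out

left_rows = [(k, f"L{k}") for k in range(8)]
right_rows = [(k, f"R{k}") for k in range(8)]

# Both inputs partitioned with the same rule on the same key.
good = partitionwise_join(
    partition(left_rows, 4, lambda k: k),
    partition(right_rows, 4, lambda k: k),
)

# Right input partitioned with a different rule: matching keys end up
# in different partitions and never meet.
bad = partitionwise_join(
    partition(left_rows, 4, lambda k: k),
    partition(right_rows, 4, lambda k: k // 2),
)

print(len(good), len(bad))
```

With identical partitioning every key finds its match; with mismatched partitioning the join silently returns fewer rows.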

mavrick21 · Post by **mavrick21** » Fri Jan 18, 2008 12:33 am

In the 1st job, before writing to the Dataset, I used a Sort stage with hash partitioning on the keys to sort the records, and on the Dataset I used Same partitioning.

In the 2nd job, I have used Same partitioning on the input link of the join.

throbinson · Post by **throbinson** » Fri Jan 18, 2008 6:50 am

The join stage requires that all input links are partitioned and sorted alike. Sorting is VERY important to getting correct results. From the documentation...

The data sets input to the Join stage must be key partitioned and
sorted. This ensures that rows with the same key column values are
located in the same partition and will be processed by the same node.
It also minimizes memory requirements because fewer rows need to
be in memory at any one time.

Why not explicitly sort and partition the inputs and run the job? If you get correct results, then you know the problem lies in your assumption that the input was partitioned and sorted the way you thought it was. If there is no change in your output, then you know the problem is not the partitioning or sorting of the inputs.
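
To show why sort order matters to a join that expects sorted input, here is a simplified single-pass merge join in Python. It is only a sketch, not the Join stage's actual implementation: `merge_join` is a made-up helper, and it assumes unique keys on the right input.

```python
def merge_join(left, right):
    """Inner join of (key, value) rows in one forward pass.

    Assumes both inputs are sorted ascending by key and that keys on
    the right are unique. Rows are never revisited once passed.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1          # left row has no match yet, skip it
        elif left_key > right_key:
            j += 1          # advance the right side to catch up
        else:
            out.append((left_key, left[i][1], right[j][1]))
            i += 1          # keep j so duplicate left keys still match
    return out

right = [(1, "w"), (2, "x"), (3, "y"), (4, "z")]
left_sorted = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
left_unsorted = [(3, "c"), (1, "a"), (4, "d"), (2, "b")]

print(len(merge_join(left_sorted, right)))    # all four keys match
print(len(merge_join(left_unsorted, right)))  # some matches are missed
```

The unsorted input contains exactly the same rows, yet the join returns fewer records because it has already moved past the matching keys.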

mavrick21 · Post by **mavrick21** » Fri Jan 18, 2008 7:54 am

The 2nd job is taking 5 hours to complete because it has 2 Transformers with a few complex transforms, and I wanted to reduce the number of stages in this job. So I thought of partitioning and sorting in the previous job and storing the result in a Dataset.

mavrick21 · Post by **mavrick21** » Fri Jan 18, 2008 8:07 am

Sorry for 3 posts above. Problem with my browser.

ravibabu · Post by **ravibabu** » Fri Jan 18, 2008 8:07 am

Hi,

Make sure the keys are in the same order on both inputs when you partition the data, for example A and then B on each link:

                                            -------
                                            | DB2 |
                                            -------
                                               |  order like A and B
                                           ---------
                                           | Sort2 |
                                           ---------
                                               |
| DS of 1 | --> (A and B) --> | Sort1 | --> | join | --> | output |
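
A small Python sketch of the key-order point, under the assumption that the partitioner hashes the key columns in the order they are listed (whether the real engine behaves this way is not confirmed here). `partition_for` and the sample rows are hypothetical.

```python
import hashlib

def partition_for(values, num_partitions):
    """Hash the key values in the given order (illustrative hash only)."""
    joined = "|".join(values).encode("utf-8")
    digest = hashlib.sha256(joined).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

rows = [{"A": f"a{i}", "B": f"b{i}"} for i in range(100)]

# Same rows, same key columns, but listed in a different order.
by_a_then_b = [partition_for((r["A"], r["B"]), 4) for r in rows]
by_b_then_a = [partition_for((r["B"], r["A"]), 4) for r in rows]

mismatches = sum(1 for x, y in zip(by_a_then_b, by_b_then_a) if x != y)
print(f"{mismatches} of {len(rows)} rows land in different partitions")
```

If one link hashed on (A, B) and the other on (B, A) with a partitioner like this one, matching rows could end up on different nodes.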

ray.wurlod · Post by **ray.wurlod** » Fri Jan 18, 2008 3:08 pm

mavrick21 wrote:Sorry for 3 posts above. Problem with my browser.

See this post

ravibabu, please enclose job designs in Code tags so that indenting is preserved.