Partition and Dataset

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

mavrick21
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Partition and Dataset

Post by mavrick21 »

Hi,

I have 2 jobs. The output of the first job is a Dataset, which is the input to the 2nd job.

The 2nd job looks like this:


                              -------
                              | DB2 |
                              -------
                                 |
                             ---------
                             | Sort2 |
                             ---------
                                 |
| DS of 1 | --> | Sort1 | --> | join | --> | output |

In the 2nd job shown above, we hash partition and sort on the keys in the Sort1 stage.

My question is: can we avoid the Sort1 stage in the 2nd job?
If, in my 1st job, I hash partition and sort the data on the keys used in the 2nd job while writing to the output Dataset, can I avoid the Sort1 stage in the 2nd job?

1. Will the partitioning and sort order be maintained in a Dataset?
2. Will the partitioning and sort order be maintained in a Dataset when it is read with a different configuration file?

Thanks in advance for your comments.
throbinson
Charter Member
Posts: 299
Joined: Wed Nov 13, 2002 5:38 pm
Location: USA

Post by throbinson »

1. Yes.
2. No.

The Dataset *.ds file records the configuration file used to build it. When the Dataset is used as a source, this information is passed on to the second job. It only makes sense: if you use a 4-node configuration file to create a Dataset, there will generally be 4 data files that make up the Dataset. If the second job uses a 2-node configuration file, the data naturally has to be re-partitioned to get it out of 4 data files and into 2 partitions.
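To make the 4-node versus 2-node point concrete, here is a small Python sketch. It is not DataStage code: the modulus on the key is only a stand-in for the engine's hash function, and the `fold` step is a hypothetical way of reading 4 files on 2 nodes, assumed purely for illustration.

```python
def hash_partition(keys, num_partitions):
    """Place each key in a partition; key % n stands in for a real hash."""
    parts = [[] for _ in range(num_partitions)]
    for k in keys:
        parts[k % num_partitions].append(k)
    return parts

keys = list(range(8))              # already sorted: 0..7

four = hash_partition(keys, 4)     # layout written with a 4-node config
two = hash_partition(keys, 2)      # layout a 2-node job would expect

# Hypothetical shortcut: concatenate files 0 and 1 onto one node
# instead of re-partitioning. Each file was sorted, the result is not.
fold = four[0] + four[1]

print(four)   # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(two)    # [[0, 2, 4, 6], [1, 3, 5, 7]]
print(fold)   # [0, 4, 1, 5]
```

The 4-way layout cannot simply be reused as a 2-way layout: the rows land in different groups, and naively merging files loses the sort order within a partition.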
mavrick21
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Post by mavrick21 »

Thanks, throbinson, for your reply.

I removed the Sort1 stage and ran the 2nd job, but the number of records after the join was different.

Does this mean that a Dataset maintains the partitioning but not the sort order?
Maveric
Participant
Posts: 388
Joined: Tue Mar 13, 2007 1:28 am

Post by Maveric »

Sort order does not affect the number of records coming out of the join; partitioning does. What partitioning did you use in job 1 when creating the Dataset, and what partitioning are you using in the join?
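A rough Python sketch of why partitioning changes the join count. This is not DataStage code; the function names, the modulus used as a hash, and the sample rows are all assumptions made for illustration. Each partition is joined independently, the way each node only sees its own rows.

```python
def partition_by_key(rows, n):
    """Key partitioning: rows with the same key go to the same partition."""
    parts = [[] for _ in range(n)]
    for key, value in rows:
        parts[key % n].append((key, value))   # key % n stands in for a hash
    return parts

def partition_round_robin(rows, n):
    """Round robin: rows are dealt out in turn, ignoring the key."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def join_per_partition(left_parts, right_parts):
    """Inner join where partition i of the left only sees partition i of the right."""
    out = []
    for lp, rp in zip(left_parts, right_parts):
        lookup = {}
        for key, value in rp:
            lookup.setdefault(key, []).append(value)
        for key, value in lp:
            for rv in lookup.get(key, []):
                out.append((key, value, rv))
    return out

left = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
right = [(1, "z"), (2, "y"), (3, "x"), (4, "w")]

good = join_per_partition(partition_by_key(left, 2), partition_by_key(right, 2))
bad = join_per_partition(partition_by_key(left, 2), partition_round_robin(right, 2))

print(len(good))  # 4
print(len(bad))   # 0
```

When both inputs are key partitioned the same way, every key finds its match. When one side is partitioned differently, matching rows sit on different nodes and silently drop out of the result.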
mavrick21
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Post by mavrick21 »

In the 1st job, before writing to the Dataset, I used a Sort stage to hash partition on the keys and sort the records, and on the Dataset I used Same partitioning.

In the 2nd job, I used Same partitioning on the input link of the join.
throbinson
Charter Member
Posts: 299
Joined: Wed Nov 13, 2002 5:38 pm
Location: USA

Post by throbinson »

The Join stage requires that all input links be partitioned and sorted alike. Sorting is VERY important to getting correct results. From the documentation:

The data sets input to the Join stage must be key partitioned and
sorted. This ensures that rows with the same key column values are
located in the same partition and will be processed by the same node.
It also minimizes memory requirements because fewer rows need to
be in memory at any one time.

Why not explicitly sort and partition the inputs and run the job? If you get correct results, then you know the problem lies in your assumption that the input was partitioned and sorted the way you thought it was. If there is no change in your output, then you know the problem is not the partitioning or sorting of the inputs.
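As an illustration of why sorted input matters, here is a generic sort-merge join in Python. It is a textbook sketch, not the Join stage's actual implementation, and it assumes unique keys on both sides to keep the code short. The sample rows are made up.

```python
def merge_join(left, right):
    """Inner join of (key, value) lists; assumes both are sorted by key
    and that keys are unique on each side."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk == rk:
            out.append((lk, left[i][1], right[j][1]))
            i += 1
            j += 1
        elif lk < rk:
            i += 1      # left key too small, it can never match later
        else:
            j += 1      # right key too small, skip it
    return out

right = [(1, "w"), (2, "x"), (3, "y"), (4, "z")]
sorted_left = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
unsorted_left = [(3, "c"), (1, "a"), (4, "d"), (2, "b")]

print(len(merge_join(sorted_left, right)))    # 4
print(len(merge_join(unsorted_left, right)))  # 2
```

A join of this kind only looks forward, so rows that arrive out of order are skipped without any error. That is one way a missing sort could show up as a different record count, which is why comparing the output with and without the explicit sort is a useful test.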
mavrick21
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Post by mavrick21 »

The 2nd job takes 5 hours to complete because it has 2 Transformers with a few complex transformations, and I wanted to reduce the number of stages in this job. So I thought of partitioning and sorting in the previous job and storing the result in a Dataset.
mavrick21
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Post by mavrick21 »

Sorry for 3 posts above. Problem with my browser.
ravibabu
Participant
Posts: 39
Joined: Tue Feb 13, 2007 12:18 am
Location: vijayawada

Partition and Dataset

Post by ravibabu »

Hi,

Make sure the keys are in the same order on both inputs when partitioning the data.


                                        -------
                                        | DB2 |
                                        -------
                                           |  order like A and B
                                       ---------
                                       | Sort2 |
                                       ---------
                                           |
| DS of 1 | --> (A and B) | Sort1 | --> | join | --> | output |
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

mavrick21 wrote: Sorry for 3 posts above. Problem with my browser.
See this post

ravibabu, please enclose job designs in Code tags so that indenting is preserved.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.