Pre partitioned - sorted Dataset read Very slow

rohanf · Post by **rohanf** » Mon Jun 01, 2009 3:23 am

Hi,

I have Partitioned and sorted data in a dataset with approximately 26 Million records.
When using the same dataset as a source in another job and not changing the Partition or sort order, i.e. using same partinioning as below:-

Dataset(Set) ------>Join(Same Partitioning)

For a very long time the Number or records read from this source dataset is "0". I was expecting that using a pre-partitioned and sorted dataset as source will Increase performance. But there seem to be no difference as the dataset was read after a very very long time by the datastage.

Is there some setting required for faster reads.

Thanks in Advance
Rohan

miwinter · Post by **miwinter** » Mon Jun 01, 2009 3:27 am

What other stages exist after the join stage? I'm certain this isn't your whole job flow and the issue could likely lie elsewhere...

How long is a "very very long time" for it to read the rows into the join stage? Bear in mind it is loading these up into memory, so the issue could also be contention there or contention in the filesystem the dataset resides in.

rohanf · Post by **rohanf** » Mon Jun 01, 2009 3:37 am

The stages after join in order is as below:

1: Filter (For capturing error failures)
2: Lookup
3: Transfomer
4: Surrogate key stage
5: Target dataset

Tell me if more info is needed.

Thanks & Regards
Rohan[/img]

mahadev.v · Post by **mahadev.v** » Mon Jun 01, 2009 3:52 am

DataStage puts in a Sort stage before the join. And probably the DataSet and Sort are combined onto one operator. Look at the score of the job to determine if this is the case. I suggest you put in a sort stage and add the property Sort Key Mode to "Don't sort Previously Sorted". This might help.

miwinter · Post by **miwinter** » Mon Jun 01, 2009 3:57 am

The same end can be achieved on the input tab of your join, where you partition same. Check to sort and check stable to ensure it doesn't re-sort your dataset and negate the need to adding a new stage.

rohanf · Post by **rohanf** » Mon Jun 01, 2009 4:05 am

I assume the inline stable sort option still utilize the scratch disk space. I have restrictions on the scratch space size as well. This is the reason I had sorted at Oracle side in my extraction step to generate this Partitioned and sorted dataset.

However I assume the "Don't sort previously sorted" option does not use the Scratch disk space.

Please correct me If I am wrong.

Regards
Rohan

miwinter · Post by **miwinter** » Mon Jun 01, 2009 4:25 am

So having ordered by in Oracle in your previous job, you then partitioned inline on the join keys before the dataset write?

rohanf · Post by **rohanf** » Mon Jun 01, 2009 4:30 am

Correct!

In the previous extraction job below are the steps performed:-

1: Sorting at oracle side using order by( on say K1)
2: Partitioning on datastage side on the same key K1
3: Write to dataset.

Now in the next job my join key is "K1".

Regards
Rohan

miwinter · Post by **miwinter** » Mon Jun 01, 2009 4:36 am

OK, so you go with the sort stage using don't sort, previously sorted on your sort/join key.

You could also add APT_NO_SORT_INSERTION to your job, but this will disable at job-level, so you'd need to take care of any other sorting requirements manually for the other stages in your second job.

Let us know which way you go and what effect it produces.