Hi,

A job is causing me a lot of grief because of the sorts it uses. Let me explain:

The job reads an input with a huge number of rows, already ordered by col1. It then performs an inner join and an aggregation on col1,col2,col3,col4, so I use a Sort stage (sort keys col1,col2,col3,col4) with col1 marked as previously sorted. At that point the data is sorted by col1,col2,col3,col4.

After that, I need to sort by col2,col3,col4 only. Is there a way to do this without breaking the dataflow? Do I have to land all the data in a dataset and then sort it?

Thanks,
Martin.
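A side note on why the existing order cannot be reused: once the leading key col1 is dropped, rows sorted on col1,col2,col3,col4 are in general no longer sorted on col2,col3,col4, so a full re-sort is unavoidable. A tiny coreutils sketch (the two-column data is invented purely for illustration):

```shell
# Two rows sorted on col1 (and therefore on col1,col2): the order breaks as
# soon as col1 is ignored, so checking the order on col2 alone fails.
printf 'a,2\nb,1\n' |
  sort -t, -k2,2 -c 2>/dev/null && echo "still sorted on col2" \
                                || echo "not sorted on col2: full re-sort needed"
```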
sort key problem
For a better design of the job:

1. There is a Join stage in the job. With huge unsorted data you could use a Lookup stage instead (of course, the reference link must hold few enough rows to fit in physical memory, or performance will degrade again).
2. Do not use a stable sort; it is much more expensive.
3. Use the restrict-memory option on the Sort stage, which can improve performance.

You mentioned writing the data to a dataset and then sorting; are you using database stages currently?
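As a rough coreutils analogy to point 1 (not DataStage itself, and the file names are invented): a Join stage behaves like `join`, which needs both inputs sorted on the key, while a Lookup stage behaves like an awk hash, which accepts unsorted input but must hold the whole reference file in memory:

```shell
# Hash-style lookup: build a table from the reference file, then probe it with
# the (unsorted) input rows. No sort needed, but r[] lives entirely in memory.
printf '1,x\n2,y\n' > ref.csv          # reference: key,value
printf '2,b\n1,a\n' > in.csv           # input: key,data (unsorted)
awk -F, 'NR==FNR { r[$1] = $2; next } { print $0 "," r[$1] }' ref.csv in.csv
# prints: 2,b,y  then  1,a,x
rm -f ref.csv in.csv
```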
-------
MK
Martin,

This might not be the answer you're looking for, but whenever we have to sort and rearrange huge amounts of data, we do it via a UNIX script rather than the ETL tool. UNIX handles these operations quickly and efficiently, and we have never had any memory issues. The only catch is that you will have to add a few extra steps to your Sequence, such as calling the shell script from the master Sequence.

After all, no ETL tool is built to handle every situation efficiently.
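A minimal sketch of the kind of script such a Sequence step might call, using GNU coreutils sort (the file contents, delimiter, and memory/temp settings are assumptions, not from this thread):

```shell
#!/bin/sh
# Sort a comma-separated extract on col2,col3,col4 (fields 2-4).
# -S caps the in-memory buffer so sort spills to disk gracefully;
# -T points the spill files at a directory with enough space.
set -e
tmp=$(mktemp -d)
printf '1,b,x,2\n1,a,y,1\n2,a,x,3\n' > "$tmp/in.csv"   # toy stand-in for the real extract
sort -t, -k2,2 -k3,3 -k4,4 -S 64M -T "$tmp" "$tmp/in.csv" > "$tmp/out.csv"
cat "$tmp/out.csv"
rm -rf "$tmp"
```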
Datawarehouse Consultant
Thanks for your response, mk_ds09. Replying to your points one by one:
1. The join key in my Join stage is col1,col2, and since my inputs are already sorted on col1,col2, the dataflow is not broken.
2. I don't use a stable sort; in fact, I have never run into a situation that needed one.
3. I must admit I hadn't considered that parameter; I always leave it at the default value of 20 MB. Can you tell me how you size it?

No, I don't use a database, except to extract the data.