Page 1 of 2

Problem with SORT

Posted: Mon Jul 02, 2012 3:47 pm
by vishu19aug
Hi,

I am doing the sort on two key columns, but seems like it is doing the sort on the combination of these columns. What i want is - it should sort on first column and in case duplicate records for first column then sort on second column.

Can somebody help please?

Thanks,
Vishal

Posted: Mon Jul 02, 2012 4:26 pm
by Kryt0n
That's exactly what sorting on two columns does...

or are you trying to say

Code: Select all

1   20
12  0
are both being sorted as 120?

Posted: Mon Jul 02, 2012 4:46 pm
by chulett
Yah, was going to chime in to say that both of your scenarios are equivalent in my mind. Whether it "sorts on a combination" of the two columns or "sorts on the first then the second" the end result should be the same. Now, string sorts can confuse people if that's what you are seeing or doing.

Can you post some examples with real data and let us know the data types involved? For example, how they come in, how they are being sorted and how you'd like them sorted. That would help us help you.

some example

Posted: Mon Jul 02, 2012 5:21 pm
by vishu19aug
My data is something like ---

Field1 field2
18834 *
18834 0
18832 *
18832 0

the result is coming as -
18832 *
18834 *
18832 0
18834 0

but i want the result as -
18832 *
18832 0
18834 *
18834 0

the first field is 'Double' and the second one is varchar.

Posted: Mon Jul 02, 2012 6:04 pm
by jwiles
Craig asks
Can you post some examples with real data
Vishu replies
My data is something like ---
:?: :roll:

Are you properly partitioning your data?
In which order are you specifying your sort keys in the sort stage?
Is the output from a Peek or a sequential file?

Regards,

Posted: Mon Jul 02, 2012 7:04 pm
by vishu19aug
Here are some more details -
1. Input is a Data set
2. Then a filter to get only these two fields.
3. Did not change anything in partitioning (Auto)
4. the order in sort stage is 1. field1 2. field2
5. output is going to a data set via a Copy state.
6. I tried to change the input partition to use Hash partition for these two keys in sort stage but no success.

Posted: Mon Jul 02, 2012 7:44 pm
by SURA
jwiles wrote:Is the output from a Peek or a sequential file?

Regards,
Consider this!

Posted: Mon Jul 02, 2012 8:26 pm
by vishu19aug
I am new to Datastage.. couldn't quite understand this..

Posted: Mon Jul 02, 2012 8:28 pm
by Kryt0n
vishu19aug wrote:Here are some more details -
5. output is going to a data set via a Copy state.
6. I tried to change the input partition to use Hash partition for these two keys in sort stage but no success.
Try run in sequential mode if the entire output needs to be sorted.

Not sure about viewing order in a dataset but when viewing the dataset, read from only one partition as you may be getting the writes from multiple partitions playing with your display (of course, assuming you have been running on multiple nodes)

Posted: Mon Jul 02, 2012 9:39 pm
by jwiles
How are you viewing the data? The View Data button or menu entry, or maybe orchadmin command or the Dataset Management Tool?

Regards,

Posted: Tue Jul 03, 2012 7:49 am
by vishu19aug
There is no problem with viewing the data.. I can see the problem is with the records where field2 is '*'. Is there any special handling of * in datastage?

Posted: Tue Jul 03, 2012 7:56 am
by ArndW
What data type is field 2?

If the data type is varchar() then most collating sequences will sort "*" (ASCII Character 42) after the digits (ASCII Characters 48-57).

If the "*" denotes a NULL value in that field, datastage defaults to sorting that before other values.

Could you be inadvertantly sorting on "Field2" then "Field1"?

Posted: Tue Jul 03, 2012 7:58 am
by vishu19aug
It is varchar()

Posted: Tue Jul 03, 2012 8:00 am
by ArndW
In that case it would seem that you are sorting by Field2 then Field1. Do you have other sample data which can help show the problem?

Posted: Tue Jul 03, 2012 9:05 am
by jwiles
There is no problem with viewing the data
Reread my question...that is neither what I asked nor implied.

Have you run the job with a single-node configuration file rather than the typical default 2-node configuration to see if there are any differences in the results? If so, that usually points to partitioning issues, but could also be the results of a collection operation.

You are writing to a parallel dataset using the Dataset stage, correct? Is the Dataset stage set to run in sequential or parallel mode?

The evidence you've provided so far points to a couple of possibilities:

1) Sort key order is backwards from what you claim
2) Your job is running in parallel with a target file/dataset that is sequential/single node. The collector will by default run in round-robin mode and could cause the issue you see.

Regards,