Regarding Aggregation and Remove Duplicates Stage

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

pkothana
Participant
Posts: 50
Joined: Tue Oct 14, 2003 6:12 am

Regarding Aggregation and Remove Duplicates Stage

Post by pkothana »

Hi,
I'm having two very unusual problems.

Aggregator Problem:

There is a very simple requirement. I have two fields
A Int, B Date
I have to group by A and find the maximum value of the date. While running the job, it gives an error something like "cannot convert date to dfloat". I checked everything for correctness but failed to find the cause.
After this I changed the date to a Julian date (integer) and did the aggregation. This time the job ran without aborting, but I got the same output as the input, i.e. without aggregation.

In the input data I have only 4 rows, with two distinct values of field A.

Remove Duplicates Problem:

I am trying to remove duplicate records from a file. I first sort it using the Sort stage and then pass it to the Remove Duplicates stage, but the output gives me the same records as in my input file, without the duplicate records removed. Here again I have only two fields.

Any prompt help in this regard is highly appreciated.

Best Regards
Pinkesh
Amos.Rosmarin
Premium Member
Posts: 385
Joined: Tue Oct 07, 2003 4:55 am

Post by Amos.Rosmarin »

Pinkesh,

Two ideas:


First, instead of a Julian date, use the DataStage Iconv function and stop worrying about date formats; use Oconv before you write it to the target.
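For example (a sketch only; the format string assumes B arrives as a YYYY-MM-DD string, and InLink/OutLink are placeholder link names, so adjust to your job):

   * Read side: convert the string to an internal day number so that
   * Max() and comparisons work as plain arithmetic.
   InternalB = Iconv(InLink.B, "D-YMD[4,2,2]")

   * Write side: convert back to display format for the target.
   OutLink.B = Oconv(InternalB, "D-YMD[4,2,2]")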

Second, instead of using the Aggregator, use a Sort stage into a hashed file.
If you have 2 columns, A Int and B Date, sort the file by A asc, B asc
and write it to a hashed file keyed on A. Use the hashed file's behaviour of maintaining key uniqueness: let the hashed file filter the entries, keeping A and max(B).

If you can get your source already sorted, you can save a lot of processing time (in case that's relevant).
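In DS BASIC terms the hashed file trick looks like this (a sketch only; the file name MAX_B_HASH is made up, and in the real job the Hashed File stage does this write for you because A is its key):

   OPEN "MAX_B_HASH" TO HashFile ELSE STOP "Cannot open hashed file"
   * Rows arrive sorted by A asc, B asc, so for each value of A the
   * last WRITE is the one with the largest B and overwrites the rest.
   WRITE B ON HashFile, A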


HTH,
Amos
Rahul
Participant
Posts: 19
Joined: Wed Oct 29, 2003 10:21 pm

Post by Rahul »

Pinkesh,

Make it a rule of thumb to convert dates to internal format using Iconv and then carry out your arithmetic operation (Max in your case). This is the best approach for any work on dates. You can convert back from the internal format to the display format using Oconv.
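For instance (a sketch; the format string assumes YYYY-MM-DD input):

   D1 = Iconv("2003-10-14", "D-YMD[4,2,2]")   ;* internal day number
   D2 = Iconv("2003-12-01", "D-YMD[4,2,2]")
   IF D2 > D1 THEN MaxD = D2 ELSE MaxD = D1   ;* Max is now plain arithmetic
   CRT Oconv(MaxD, "D-YMD[4,2,2]")            ;* prints 2003-12-01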

Hope this helps,

Rahul
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

Rahul

I totally disagree. I always keep dates and timestamps in varchar format and only change them to internal format when I need to do math with them. The processing time you lose by converting to internal format and back is wasted 90 percent of the time. Most of the time data is just moved from one database to another. If you have 9 date fields, then maybe only one is used to build summary data, like the sales date; you waste all those computer cycles by having date or timestamp in your metadata. It is also a lot easier to debug when it is a varchar, because DataStage does nothing to a varchar but pass it along.

Kim.
Mamu Kim
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

Remember, this is a Parallel Job.

For 6.0, Iconv and Oconv are Server-only routines.

The Aggregator stage can only handle a limited set of datatypes. I do not believe this is documented anywhere, but someone here was having issues getting the Aggregator to handle integer fields properly, so he went ahead and created a BuildOp stage to do the aggregation. Instead of using a sledgehammer to drive a nail, I would suggest converting field A to a decimal. I think you will see the correct results then.

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
bigpoppa
Participant
Posts: 190
Joined: Fri Feb 28, 2003 11:39 am

Regarding Aggregation and Remove Duplicates Stage

Post by bigpoppa »

As with almost all parallel stages that work on key fields, you must first hash-partition the data and then sort it by the key fields before you pass it to Aggregator or Remove Duplicates. Are you HASHing and SORTing by the right keys before you use Aggregator or Remove Duplicates?

Thanks,
BP
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Re: Regarding Aggregation and Remove Duplicates Stage

Post by Teej »

bigpoppa wrote: As with almost all parallel stages that work on key fields, you must first hash-partition the data and then sort it by the key fields before you pass it to Aggregator or Remove Duplicates. Are you HASHing and SORTing by the right keys before you use Aggregator or Remove Duplicates?

Thanks,
BP
Hashing is the only critical part. Sorting depends on what you want to do inside the stage. If you MUST get the first of something, or the last of something, then sorting will guarantee that the data is in order for your logic to work. If you don't care which record you output as long as the task is done, then just use hash partitioning, which guarantees that all rows sharing the same key values are grouped in the same partition.
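A made-up illustration: with two partitions and key A, hash partitioning sends every A = 1 row to one partition and every A = 2 row to one partition, so a per-key operation like Remove Duplicates never has to compare rows across partitions. Which partition a given key lands in is not defined; only the grouping is.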

A little tidbit: the Lookup stage should have hash partitioning on both the input and reference links in order to go fully parallel; it is only partly parallel if you don't do that. There are costs and benefits to both options, so do some performance testing. 7.0.1 should fix this by defaulting Lookup to Hash.

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
bigpoppa
Participant
Posts: 190
Joined: Fri Feb 28, 2003 11:39 am

Regarding Aggregation and Remove Duplicates Stage

Post by bigpoppa »

Sorting is required for Remove Duplicates because it looks at adjacent rows to find key matches.
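A quick made-up illustration with a single key column A:

   Unsorted input        Sorted input
   A = 1                 A = 1
   A = 2                 A = 1   <- adjacent duplicate, removed
   A = 1                 A = 2

Without the sort, the two A = 1 rows are never adjacent, so Remove Duplicates passes both through, which is exactly the symptom you describe.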

- BP