Regarding Aggregation and Remove Duplicates Stage

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

pkothana
Participant
Posts: 50
Joined: Tue Oct 14, 2003 6:12 am

Regarding Aggregation and Remove Duplicates Stage

Post by pkothana »

Hi,
I'm having two very unusual problems.

Aggregator Problem:

There is a very simple requirement. I have two fields
A Int, B Date
I have to group by A and find the maximum value of the date. While running the job, it gives an error something like "cannot convert date to dfloat". I checked everything for correctness but failed to find the cause.
After this I changed the date to a Julian date (integer) and did the aggregation. This time the job ran without aborting, but I got the same output as the input, i.e. without aggregation.

In the input data I have only 4 rows, with two distinct values of field A.

Remove Duplicates Problem:

I am trying to remove duplicate records from a file. I first sort it using the Sort stage and then pass it to the Remove Duplicates stage, but the output gives me the same records as in my input file, without the duplicate records removed. Here again I have only two fields.

Any prompt help in this regard is highly appreciated.

Best Regards
Pinkesh
Amos.Rosmarin
Premium Member
Posts: 385
Joined: Tue Oct 07, 2003 4:55 am

Post by Amos.Rosmarin »

Pinkesh,

Two ideas:


First, instead of a Julian date, use the DataStage Iconv function and stop worrying about date formats; use Oconv before you write it to the target.
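For example (a sketch only; the format string assumes B arrives as a YYYY-MM-DD string, and InLink/OutLink are placeholder link names, so adjust to your job):

   * Read side: convert the string to an internal day number so that
   * Max() and comparisons work as plain arithmetic.
   InternalB = Iconv(InLink.B, "D-YMD[4,2,2]")

   * Write side: convert back to display format for the target.
   OutLink.B = Oconv(InternalB, "D-YMD[4,2,2]")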

Second, instead of using the Aggregator, use a Sort stage into a hashed file.
If you have 2 columns, A Int and B Date, sort the file by A asc, B asc
and write it to a hashed file keyed on A. Use the hashed file's behaviour of maintaining key uniqueness: let the hashed file filter the entries, keeping A and max(B).

If you can get your source already sorted, you can save a lot of processing time (in case that's relevant).
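In DS BASIC terms the hashed file trick looks like this (a sketch only; the file name MAX_B_HASH is made up, and in the real job the Hashed File stage does this write for you because A is its key):

   OPEN "MAX_B_HASH" TO HashFile ELSE STOP "Cannot open hashed file"
   * Rows arrive sorted by A asc, B asc, so for each value of A the
   * last WRITE is the one with the largest B and overwrites the rest.
   WRITE B ON HashFile, A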


HTH,
Amos
Rahul
Participant
Posts: 19
Joined: Wed Oct 29, 2003 10:21 pm

Post by Rahul »

Pinkesh,

Make it a rule of thumb to convert dates to internal format using Iconv and then carry out your arithmetic operation (Max in your case). This is the best approach for any work on dates. You can convert back from the internal format to the display format using Oconv.
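For instance (a sketch; the format string assumes YYYY-MM-DD input):

   D1 = Iconv("2003-10-14", "D-YMD[4,2,2]")   ;* internal day number
   D2 = Iconv("2003-12-01", "D-YMD[4,2,2]")
   IF D2 > D1 THEN MaxD = D2 ELSE MaxD = D1   ;* Max is now plain arithmetic
   CRT Oconv(MaxD, "D-YMD[4,2,2]")            ;* prints 2003-12-01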

Hope this helps,

Rahul
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

Rahul

I totally disagree. I always keep dates and timestamps in varchar format and only change them to internal format when I need to do math with them. The processing time you lose by converting to internal format and back is wasted 90 percent of the time. Most of the time data is just moved from one database to another. If you have 9 date fields, then maybe only one is used to build summary data, like the sales date; you waste all those computer cycles by having date or timestamp in your metadata. It is also a lot easier to debug when it is a varchar, because DataStage does nothing to a varchar but pass it along.

Kim.
Mamu Kim
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

Remember, this is a Parallel Job.

For 6.0, Iconv and Oconv are Server-only routines.

The Aggregator stage can only handle a limited set of datatypes. I do not believe this is documented anywhere, but someone here was having issues getting the Aggregator to handle integer fields properly, so he went ahead and created a BuildOp stage to do the aggregation. Instead of using a sledgehammer to drive a nail, I would suggest converting field A to a decimal. I think you will see the correct results then.

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
bigpoppa
Participant
Posts: 190
Joined: Fri Feb 28, 2003 11:39 am

Regarding Aggregation and Remove Duplicates Stage

Post by bigpoppa »

As with almost all parallel stages that work on key fields, you must first hash-partition the data and then sort it by the key fields before you pass it to Aggregator or Remove Duplicates. Are you HASHing and SORTing by the right keys before you use Aggregator or Remove Duplicates?

Thanks,
BP
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Re: Regarding Aggregation and Remove Duplicates Stage

Post by Teej »

bigpoppa wrote: As with almost all parallel stages that work on key fields, you must first hash-partition the data and then sort it by the key fields before you pass it to Aggregator or Remove Duplicates. Are you HASHing and SORTing by the right keys before you use Aggregator or Remove Duplicates?

Thanks,
BP
Hashing is the only critical part. Sorting depends on what you want to do inside the stage. If you MUST get the first of something, or the last of something, then sorting will guarantee that the data is in order for your logic to work. If you don't care which record you output as long as the task is done, then just use hash partitioning, which guarantees that all rows sharing the same key values are grouped in the same partition.
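A made-up illustration: with two partitions and key A, hash partitioning sends every A = 1 row to one partition and every A = 2 row to one partition, so a per-key operation like Remove Duplicates never has to compare rows across partitions. Which partition a given key lands in is not defined; only the grouping is.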

A little tidbit: the Lookup stage should have hash partitioning on both the input and reference links in order to go fully parallel; it is only partly parallel if you don't do that. There are costs and benefits to both options, so do some performance testing. 7.0.1 should fix this by defaulting Lookup to Hash.

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
bigpoppa
Participant
Posts: 190
Joined: Fri Feb 28, 2003 11:39 am

Regarding Aggregation and Remove Duplicates Stage

Post by bigpoppa »

Sorting is required for Remove Duplicates because it looks at adjacent rows to find key matches.
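A quick made-up illustration with a single key column A:

   Unsorted input        Sorted input
   A = 1                 A = 1
   A = 2                 A = 1   <- adjacent duplicate, removed
   A = 1                 A = 2

Without the sort, the two A = 1 rows are never adjacent, so Remove Duplicates passes both through, which is exactly the symptom you describe.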

- BP