Problem in Deduping

Posted: Thu Mar 23, 2006 4:04 am
by Archana
Hi,
I am trying to dedup a sequential file on a combination of 3 keys. The output is a hashed file with those three fields as keys.
But the file is not getting deduped; I am finding records with duplicate values of those keys. The parameters of the hashed file all look fine.
Please tell me the solution.

Thanks

Posted: Thu Mar 23, 2006 4:29 am
by satheesh_color
Hi,

Why don't you try simple SQL with the DISTINCT keyword? It also works fine with a hashed file.

Regards,
Satheesh.Rajan

Posted: Thu Mar 23, 2006 4:53 am
by ArndW
Archana, a hashed file key is unique. If you specified all 3 key columns when you wrote to the hashed file you cannot have duplicates. Are you certain you specified all 3 columns as keys for the hashed file? If you are certain, could you show an example of a duplicate record?

Posted: Thu Mar 23, 2006 5:34 am
by Archana
ArndW wrote:Archana, a hashed file key is unique. If you specified all 3 key columns when you wrote to the hashed file you cannot have duplicates. Are you certain you specified all 3 columns as keys for the hashed file? If you are certain, could you show an example of a duplicate record?

Yes, I have specified the three columns as key columns.

The duplicate records are:

DOYLESTOWN}PA}18901}ADDR_DIM
DOYLESTOWN}PA}18901}ADDR_DIM

The first 3 columns are specified as key columns.

Posted: Thu Mar 23, 2006 5:42 am
by kumar_s
Keys in a hashed file cannot have duplicates.
Are the records shown from the hashed file?
What is the datatype of the key fields?

Posted: Thu Mar 23, 2006 5:48 am
by parag.s.27
ArndW wrote:Archana, a hashed file key is unique. If you specified all 3 key columns when you wrote to the hashed file you cannot have duplicates. Are you certain you specified all 3 columns as keys for the hashed file? If you are certain, could you show an example of a duplicate record?
Hi Archana,

I also had the same problem, but later I found out that the data had extra spaces in it. I just trimmed the data and tried again, and this time it worked fine. Try trimming the data before the transformation as well.

Maybe this will help.
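
For illustration only (the link and column names below are made up, not taken from your job), the Transformer derivations for the three key columns could simply wrap each input column in Trim() before the hashed file stage:

Code: Select all

Trim(InLink.CITY)
Trim(InLink.STATE)
Trim(InLink.ZIP_CODE)

In DataStage BASIC, Trim() with a single argument removes leading and trailing spaces and collapses runs of internal spaces, so two keys that differ only in padding become identical.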

Posted: Thu Mar 23, 2006 5:59 am
by ArndW
Archana wrote:Yes, I have specified the three columns as key columns.

The duplicate records are

DOYLESTOWN}PA}18901}ADDR_DIM
DOYLESTOWN}PA}18901}ADDR_DIM

The first 3 columns are specified as key columns.
Hashed files cannot have duplicate keys, so either you have embedded spaces or non-displayable characters in the 3 key columns, or you have not declared all 3 columns as "Key" in your hashed file stage.
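
If you want to prove it, a quick check at the shell is to dump the suspect lines byte by byte; this is only a sketch, and "keys.txt" is a hypothetical export holding just the 3 key columns:

Code: Select all

grep 'DOYLESTOWN' keys.txt | od -c | head

od -c prints every character, so trailing spaces and control characters that are otherwise invisible on screen show up immediately.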

Posted: Thu Mar 23, 2006 6:09 am
by Archana
kumar_s wrote:Keys in a hashed file cannot have duplicates.
Are the records shown from the hashed file?
What is the datatype of the key fields?
Yes, the records are from the hashed file.
The datatype is varchar for all the fields.

Posted: Thu Mar 23, 2006 6:18 am
by Archana
parag.s.27 wrote:
ArndW wrote:Archana, a hashed file key is unique. If you specified all 3 key columns when you wrote to the hashed file you cannot have duplicates. Are you certain you specified all 3 columns as keys for the hashed file? If you are certain, could you show an example of a duplicate record?
Hi Archana,

I also had the same problem, but later I found out that the data had extra spaces in it. I just trimmed the data and tried again, and this time it worked fine. Try trimming the data before the transformation as well.

Maybe this will help.

No, there cannot be extra spaces, as the file is a delimited file and the delimiter is '}'.

Posted: Thu Mar 23, 2006 6:22 am
by parag.s.27
Archana wrote:No, there cannot be extra spaces, as the file is a delimited file and the delimiter is '}'.

Hi,

My source file was also '~' delimited, but when you define the metadata, if there are spaces in the data between the delimiters, those spaces come in with the data.

Posted: Thu Mar 23, 2006 6:25 am
by ArndW
1. Write a temporary copy of your DS job that writes just the 3 key columns to a sequential file called myfile.txt.
2. wc -l myfile.txt to get the number of lines.
3. sort -u myfile.txt > otherfile.txt
4. wc -l otherfile.txt

Are the counts from steps 2 and 4 the same? If you look at otherfile.txt, do you see what look like identical "keys"? (You can do a cat -v to display non-displayable characters.) A combined sketch of these commands is shown below.
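
A rough combined version of the steps above (the file names are the same hypothetical ones; adjust them to your job):

Code: Select all

wc -l myfile.txt                        # total key rows
sort -u myfile.txt > otherfile.txt      # keep one copy of each distinct key row
wc -l otherfile.txt                     # distinct key rows
cat -v otherfile.txt | grep DOYLESTOWN  # non-printing characters show up as ^X or M- sequences

If the two counts match but otherfile.txt still shows lines that look identical, the difference lies in characters you cannot see, such as trailing spaces.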

Posted: Thu Mar 23, 2006 6:31 am
by kumar_s
To do a unique sort on the specific columns, you can use
cut -d '}' -f1,2,3 yourfile.txt | sort -u > Filename.txt
if 1, 2 and 3 are the key columns and yourfile.txt is your source sequential file (the file name here is just a placeholder).
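
A variation on the same idea, again assuming the source sequential file is called yourfile.txt: uniq -d lists only the key combinations that occur more than once at byte level.

Code: Select all

cut -d '}' -f1,2,3 yourfile.txt | sort | uniq -d

If this prints nothing although the hashed file appears to hold duplicates, the keys differ only in characters you cannot see.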

Posted: Thu Mar 23, 2006 6:42 am
by satheesh_color
Hi,

Finally, I have the answer. When you define the sequential file metadata there are three delimiter options: Tab, Space and Comma. But your delimiter is '}', so you have to specify '}' in the Other Delimiter textbox; then the varchar key data will also be deduplicated correctly.

Regards,
Satheesh.Rajan

Posted: Thu Mar 23, 2006 6:55 am
by kumar_s
Hi Satheesh,

'No, there cannot be extra spaces, as the file is a delimited file and the delimiter is '}'.'
I hope you noticed this?

Posted: Thu Mar 23, 2006 7:02 am
by parag.s.27
kumar_s wrote:Hi Satheesh,

'No, there cannot be extra spaces, as the file is a delimited file and the delimiter is '}'.'
I hope you noticed this?
Hi Kumar,

Even in a delimited file, spaces can appear. All the source files I handle are '~' delimited. Below is a small sample of my data, where the defined length of the second column is 8.

Code: Select all

DEV~S0001          ~STORE~S0001     ~20030902~43.00000~.00000~.00000~.00000~.00000~ ~40.00000~ ~80.00000~.00000
If the source file is generated from SQL Server, where the data was imported from an Excel sheet, then spaces can appear.

Also, spaces can still appear in a delimited file; the only advantage is that, since the data is delimited, the metadata will not fail on the length of the field.
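
For illustration only (the file name is made up, and the field layout follows the sample above), printing a field between brackets makes any padding obvious:

Code: Select all

awk -F'~' '{ printf "[%s]\n", $2 }' source.txt | sort -u

A value such as [S0001          ] shows that the spaces between the delimiters travel with the data, so two rows can look identical on screen yet carry different key values.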