Problem in Deduping

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Archana
Participant
Posts: 15
Joined: Thu Mar 23, 2006 3:55 am
Location: Hyderabad
Contact:

Problem in Deduping

Post by Archana »

Hi,
I am trying to dedup a sequential file on a combination of 3 keys. The output is a hashed file with those three fields as keys.
But the file is not getting deduped; I am finding records with duplicate values of those keys. The parameters in the hashed file stage all look fine.
Please tell me the solution.

Thanks
Archana
satheesh_color
Participant
Posts: 182
Joined: Thu Jun 16, 2005 2:05 am

Post by satheesh_color »

Hi,

Why don't you try simple SQL using the DISTINCT keyword? Deduplication also works fine using a hashed file.

Regards,
Satheesh.Rajan
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Archana, a hashed file key is unique. If you specified all 3 key columns when you wrote to the hashed file you cannot have duplicates. Are you certain you specified all 3 columns as keys for the hashed file? If you are certain, could you show an example of a duplicate record?
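ArndW's point that a hashed file keeps exactly one row per key (later writes overwrite earlier ones) can be emulated outside DataStage with awk. This is a sketch only; the sample data and field positions are assumptions:

```shell
# Emulate hashed-file "destructive overwrite" semantics on a '}'-delimited
# file: keep only the LAST record seen for each 3-column key.
# (Sketch only -- sample data and field positions are assumptions.)
printf '%s\n' \
  'DOYLESTOWN}PA}18901}ADDR_DIM' \
  'DOYLESTOWN}PA}18901}ADDR_DIM_V2' \
  'NEWTOWN}PA}18940}ADDR_DIM' > sample.txt

# Build the key from fields 1-3; remember the last row per key, then print.
awk -F'}' '{ rows[$1 FS $2 FS $3] = $0 }
           END { for (k in rows) print rows[k] }' sample.txt | sort
```

With all 3 columns as keys, the two DOYLESTOWN rows collapse to one, which is why genuine duplicates in the hashed file should be impossible.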
Archana
Participant
Posts: 15
Joined: Thu Mar 23, 2006 3:55 am
Location: Hyderabad
Contact:

Post by Archana »

ArndW wrote:Archana, a hashed file key is unique. If you specified all 3 key columns when you wrote to the hashed file you cannot have duplicates. Are you certain you specified all 3 columns as keys for the hashed file? If you are certain, could you show an example of a duplicate record?

Yes, I have specified the three columns as key columns.

The duplicate records are

DOYLESTOWN}PA}18901}ADDR_DIM
DOYLESTOWN}PA}18901}ADDR_DIM

The first 3 columns are specified as key columns.
Archana
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Keys in a hashed file cannot have duplicates.
Are the records shown from the hashed file?
What is the datatype of the key fields?
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
parag.s.27
Participant
Posts: 221
Joined: Fri Feb 17, 2006 3:38 am
Location: India
Contact:

Post by parag.s.27 »

ArndW wrote:Archana, a hashed file key is unique. If you specified all 3 key columns when you wrote to the hashed file you cannot have duplicates. Are you certain you specified all 3 columns as keys for the hashed file? If you are certain, could you show an example of a duplicate record?
Hi Archana,

I also had the same problem, but later I found out that the data had extra spaces in it. So I just trimmed the data and tried again, and this time it worked fine. You could also trim the data before the transformation.

Maybe this will help.
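Parag's trimming suggestion can also be applied before the job ever reads the file, for example with sed. A sketch only; the filenames and sample values are assumptions:

```shell
# Strip spaces immediately before/after each '}' delimiter and at line ends,
# so "DOYLESTOWN }PA}18901" and "DOYLESTOWN}PA}18901" collapse to one key.
# (Sketch only -- filenames and sample data are assumptions.)
printf '%s\n' \
  'DOYLESTOWN }PA}18901}ADDR_DIM' \
  'DOYLESTOWN}PA}18901}ADDR_DIM' > raw.txt

sed -e 's/ *} */}/g' -e 's/ *$//' raw.txt | sort -u > clean.txt
```

After trimming, the two rows become byte-identical and sort -u (or the hashed file) sees them as one.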
Thanks & Regards
Parag Saundattikar
Certified for Infosphere DataStage v8.0
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Archana wrote:Yes, I have specified the three columns as key columns.

The duplicate records are

DOYLESTOWN}PA}18901}ADDR_DIM
DOYLESTOWN}PA}18901}ADDR_DIM

The first 3 columns are specified as key columns.
Hashed files cannot have duplicate keys, so either you have embedded spaces or non-displayable characters in the 3 key columns, or you have not declared all 3 columns as "Key" in your hashed file stage.
Archana
Participant
Posts: 15
Joined: Thu Mar 23, 2006 3:55 am
Location: Hyderabad
Contact:

Post by Archana »

kumar_s wrote:Keys in a hashed file cannot have duplicates.
Are the records shown from the hashed file?
What is the datatype of the key fields?
Yes, the records are from the hashed file.
The datatype is Varchar for all the fields.
Archana
Archana
Participant
Posts: 15
Joined: Thu Mar 23, 2006 3:55 am
Location: Hyderabad
Contact:

Post by Archana »

parag.s.27 wrote:
ArndW wrote:Archana, a hashed file key is unique. If you specified all 3 key columns when you wrote to the hashed file you cannot have duplicates. Are you certain you specified all 3 columns as keys for the hashed file? If you are certain, could you show an example of a duplicate record?
Hi Archana,

I also had the same problem, but later I found out that the data had extra spaces in it. So I just trimmed the data and tried again, and this time it worked fine. You could also trim the data before the transformation.

Maybe this will help.

No, there cannot be extra spaces, as the file is delimited and the delimiter is '}'.
Archana
parag.s.27
Participant
Posts: 221
Joined: Fri Feb 17, 2006 3:38 am
Location: India
Contact:

Post by parag.s.27 »

Archana wrote:No, there cannot be extra spaces, as the file is delimited and the delimiter is '}'.

Hi,

My source file was also '~' delimited, but when you define the metadata, any spaces present in the data between the delimiters come through as part of the data.
Thanks & Regards
Parag Saundattikar
Certified for Infosphere DataStage v8.0
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

1. Write a temporary copy of your DS job that writes just the 3 key columns to a sequential file called myfile.txt.
2. Run "wc -l myfile.txt" to get the number of lines.
3. Run "sort -u myfile.txt > otherfile.txt".
4. Run "wc -l otherfile.txt".

Are the counts in 2 and 4 the same? If you look at otherfile.txt, do you see what look like identical "keys"? (You can use "cat -v" to display non-printable characters.)
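The four steps above can be sketched as one script. The filenames follow ArndW's post; the sample data standing in for the DS job's output is an assumption:

```shell
# Steps 2-4 from the post: count raw lines, unique-sort, count again.
# Equal counts combined with identical-looking lines in otherfile.txt
# point at hidden characters (here, a trailing space) in the key columns.
# (Sample data standing in for the DS job's output is an assumption.)
printf '%s\n' \
  'DOYLESTOWN}PA}18901' \
  'DOYLESTOWN}PA}18901 ' \
  'NEWTOWN}PA}18940' > myfile.txt        # note trailing space on line 2

total=$(wc -l < myfile.txt)              # step 2
sort -u myfile.txt > otherfile.txt       # step 3
unique=$(wc -l < otherfile.txt)          # step 4

echo "total=$total unique=$unique"
cat -v otherfile.txt                     # reveal non-printable characters
```

In this sketch the counts match (3 and 3) even though two keys look the same on screen, which is exactly the symptom of invisible padding.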
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

To make a unique sort on the specific columns (if 1, 2 and 3 are the key columns), you can use:
cut -d'}' -f1,2,3 myfile.txt | sort -u > Filename.txt
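The cut | sort -u pipeline above reports only the distinct key values. If you want to keep the whole record while dropping later duplicates, a one-pass awk variant works too. A sketch; the filenames and sample data are assumptions:

```shell
# Keep the FIRST full record seen for each combination of fields 1-3,
# discarding later duplicates -- unlike cut|sort -u, the non-key
# fields survive. (Filenames and sample data are assumptions.)
printf '%s\n' \
  'DOYLESTOWN}PA}18901}ADDR_DIM' \
  'DOYLESTOWN}PA}18901}ADDR_DIM' \
  'NEWTOWN}PA}18940}ADDR_DIM' > input.txt

# seen[key]++ is 0 (falsy) on first sight, so the row prints once.
awk -F'}' '!seen[$1 FS $2 FS $3]++' input.txt > deduped.txt
```

This keeps first-in-wins semantics, whereas a hashed file keeps the last row written for each key.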
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
satheesh_color
Participant
Posts: 182
Joined: Thu Jun 16, 2005 2:05 am

Post by satheesh_color »

Hi,

Finally I got the answer. While defining the sequential file metadata there are three predefined delimiters: Tab, Space and Comma. Your delimiter is '}', so you have to specify '}' in the 'Other' delimiter textbox. Then the Varchar data will also get deduplicated.

Regards,
Satheesh.Rajan
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi Satheesh,

'No, there cannot be extra spaces, as the file is delimited and the delimiter is '}'.'
I hope you noticed this?
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
parag.s.27
Participant
Posts: 221
Joined: Fri Feb 17, 2006 3:38 am
Location: India
Contact:

Post by parag.s.27 »

kumar_s wrote:Hi Satheesh,

'No, there cannot be extra spaces, as the file is delimited and the delimiter is '}'.'
I hope you noticed this?
Hi Kumar,

Even in a delimited file, spaces can occur. All the source files I handle are '~' delimited. Below is a small sample of my data, where the length of the second column is 8.

Code: Select all

DEV~S0001          ~STORE~S0001     ~20030902~43.00000~.00000~.00000~.00000~.00000~ ~40.00000~ ~80.00000~.00000
If the source file is generated from SQL Server, where the data was imported from an Excel sheet, then spaces can creep in.

Spaces can also occur in a delimited file; the only advantage is that, since the data is delimited, the metadata will not fail on the length of the field.
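A quick way to confirm whether such padding is present before blaming the hashed file is to grep for spaces butting up against a delimiter. A sketch; the filename is an assumption and the sample record is modelled on the one above:

```shell
# Flag any line where a space sits next to a '~' delimiter or at line end.
# (Filename is an assumption; sample record modelled on the post above.)
printf '%s\n' 'DEV~S0001          ~STORE~S0001     ~20030902' > src.txt

# -n prints the offending line numbers; a match means the fields carry
# trailing padding that will leak into hashed-file keys.
grep -n -E ' ~|~ | $' src.txt
```

If this prints anything, trim the key columns (or fix the extract) before writing to the hashed file.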
Thanks & Regards
Parag Saundattikar
Certified for Infosphere DataStage v8.0
Post Reply