Problem in Deduping

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Archana
Participant
Posts: 15
Joined: Thu Mar 23, 2006 3:55 am
Location: Hyderabad
Contact:

Problem in Deduping

Post by Archana »

Hi,
I am trying to dedup a sequential file on a combination of 3 keys. The output is a hashed file with those three fields as keys.
But the file is not getting deduped; I am finding records with duplicate values of those keys. The parameters in the hashed file stage all look fine.
Please tell me the solution.

Thanks
Archana
satheesh_color
Participant
Posts: 182
Joined: Thu Jun 16, 2005 2:05 am

Post by satheesh_color »

Hi,

Why don't you try simple SQL using the DISTINCT keyword? Deduplication also works fine using a hashed file.

Regards,
Satheesh.Rajan
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Archana, a hashed file key is unique. If you specified all 3 key columns when you wrote to the hashed file you cannot have duplicates. Are you certain you specified all 3 columns as keys for the hashed file? If you are certain, could you show an example of a duplicate record?
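ArndW's point that a hashed file keeps exactly one row per key (later writes overwrite earlier ones) can be emulated outside DataStage with awk. This is a sketch only; the sample data and field positions are assumptions:

```shell
# Emulate hashed-file "destructive overwrite" semantics on a '}'-delimited
# file: keep only the LAST record seen for each 3-column key.
# (Sketch only -- sample data and field positions are assumptions.)
printf '%s\n' \
  'DOYLESTOWN}PA}18901}ADDR_DIM' \
  'DOYLESTOWN}PA}18901}ADDR_DIM_V2' \
  'NEWTOWN}PA}18940}ADDR_DIM' > sample.txt

# Build the key from fields 1-3; remember the last row per key, then print.
awk -F'}' '{ rows[$1 FS $2 FS $3] = $0 }
           END { for (k in rows) print rows[k] }' sample.txt | sort
```

With all 3 columns as keys, the two DOYLESTOWN rows collapse to one, which is why genuine duplicates in the hashed file should be impossible.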
Archana
Participant
Posts: 15
Joined: Thu Mar 23, 2006 3:55 am
Location: Hyderabad
Contact:

Post by Archana »

ArndW wrote:Archana, a hashed file key is unique. If you specified all 3 key columns when you wrote to the hashed file you cannot have duplicates. Are you certain you specified all 3 columns as keys for the hashed file? If you are certain, could you show an example of a duplicate record?

Yes, I have specified the three columns as key columns.

The duplicate records are

DOYLESTOWN}PA}18901}ADDR_DIM
DOYLESTOWN}PA}18901}ADDR_DIM

The first 3 columns are specified as key columns.
Archana
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Keys in a hashed file cannot have duplicates.
Are the records shown from the hashed file?
What is the datatype of the key fields?
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
parag.s.27
Participant
Posts: 221
Joined: Fri Feb 17, 2006 3:38 am
Location: India
Contact:

Post by parag.s.27 »

ArndW wrote:Archana, a hashed file key is unique. If you specified all 3 key columns when you wrote to the hashed file you cannot have duplicates. Are you certain you specified all 3 columns as keys for the hashed file? If you are certain, could you show an example of a duplicate record?
Hi Archana,

I also had the same problem, but later I found out that the data had extra spaces in it. So I just trimmed the data and tried again, and this time it worked fine. You could also trim the data before the transformation.

Maybe this will help.
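Parag's trimming suggestion can also be applied before the job ever reads the file, for example with sed. A sketch only; the filenames and sample values are assumptions:

```shell
# Strip spaces immediately before/after each '}' delimiter and at line ends,
# so "DOYLESTOWN }PA}18901" and "DOYLESTOWN}PA}18901" collapse to one key.
# (Sketch only -- filenames and sample data are assumptions.)
printf '%s\n' \
  'DOYLESTOWN }PA}18901}ADDR_DIM' \
  'DOYLESTOWN}PA}18901}ADDR_DIM' > raw.txt

sed -e 's/ *} */}/g' -e 's/ *$//' raw.txt | sort -u > clean.txt
```

After trimming, the two rows become byte-identical and sort -u (or the hashed file) sees them as one.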
Thanks & Regards
Parag Saundattikar
Certified for Infosphere DataStage v8.0
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Archana wrote:Yes, I have specified the three columns as key columns.

The duplicate records are

DOYLESTOWN}PA}18901}ADDR_DIM
DOYLESTOWN}PA}18901}ADDR_DIM

The first 3 columns are specified as key columns.
Hashed files cannot have duplicate keys, so either you have embedded spaces or non-displayable characters in the 3 key columns, or you have not declared all 3 columns as "Key" in your hashed file stage.
Archana
Participant
Posts: 15
Joined: Thu Mar 23, 2006 3:55 am
Location: Hyderabad
Contact:

Post by Archana »

kumar_s wrote:Keys in a hashed file cannot have duplicates.
Are the records shown from the hashed file?
What is the datatype of the key fields?
Yes, the records are from the hashed file.
The datatype is Varchar for all the fields.
Archana
Archana
Participant
Posts: 15
Joined: Thu Mar 23, 2006 3:55 am
Location: Hyderabad
Contact:

Post by Archana »

parag.s.27 wrote:
ArndW wrote:Archana, a hashed file key is unique. If you specified all 3 key columns when you wrote to the hashed file you cannot have duplicates. Are you certain you specified all 3 columns as keys for the hashed file? If you are certain, could you show an example of a duplicate record?
Hi Archana,

I also had the same problem, but later I found out that the data had extra spaces in it. So I just trimmed the data and tried again, and this time it worked fine. You could also trim the data before the transformation.

Maybe this will help.

No, there cannot be extra spaces, as the file is delimited and the delimiter is '}'.
Archana
parag.s.27
Participant
Posts: 221
Joined: Fri Feb 17, 2006 3:38 am
Location: India
Contact:

Post by parag.s.27 »

Archana wrote:No, there cannot be extra spaces, as the file is delimited and the delimiter is '}'.

Hi,

My source file was also '~' delimited, but when you define the metadata, any spaces present in the data between the delimiters come through as part of the data.
Thanks & Regards
Parag Saundattikar
Certified for Infosphere DataStage v8.0
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

1. Write a temporary copy of your DS job that writes just the 3 key columns to a sequential file called myfile.txt.
2. Run "wc -l myfile.txt" to get the number of lines.
3. Run "sort -u myfile.txt > otherfile.txt".
4. Run "wc -l otherfile.txt".

Are the counts in 2 and 4 the same? If you look at otherfile.txt, do you see what look like identical "keys"? (You can use "cat -v" to display non-printable characters.)
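The four steps above can be sketched as one script. The filenames follow ArndW's post; the sample data standing in for the DS job's output is an assumption:

```shell
# Steps 2-4 from the post: count raw lines, unique-sort, count again.
# Equal counts combined with identical-looking lines in otherfile.txt
# point at hidden characters (here, a trailing space) in the key columns.
# (Sample data standing in for the DS job's output is an assumption.)
printf '%s\n' \
  'DOYLESTOWN}PA}18901' \
  'DOYLESTOWN}PA}18901 ' \
  'NEWTOWN}PA}18940' > myfile.txt        # note trailing space on line 2

total=$(wc -l < myfile.txt)              # step 2
sort -u myfile.txt > otherfile.txt       # step 3
unique=$(wc -l < otherfile.txt)          # step 4

echo "total=$total unique=$unique"
cat -v otherfile.txt                     # reveal non-printable characters
```

In this sketch the counts match (3 and 3) even though two keys look the same on screen, which is exactly the symptom of invisible padding.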
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

To make a unique sort on the specific columns (if 1, 2 and 3 are the key columns), you can use:
cut -d'}' -f1,2,3 myfile.txt | sort -u > Filename.txt
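The cut | sort -u pipeline above reports only the distinct key values. If you want to keep the whole record while dropping later duplicates, a one-pass awk variant works too. A sketch; the filenames and sample data are assumptions:

```shell
# Keep the FIRST full record seen for each combination of fields 1-3,
# discarding later duplicates -- unlike cut|sort -u, the non-key
# fields survive. (Filenames and sample data are assumptions.)
printf '%s\n' \
  'DOYLESTOWN}PA}18901}ADDR_DIM' \
  'DOYLESTOWN}PA}18901}ADDR_DIM' \
  'NEWTOWN}PA}18940}ADDR_DIM' > input.txt

# seen[key]++ is 0 (falsy) on first sight, so the row prints once.
awk -F'}' '!seen[$1 FS $2 FS $3]++' input.txt > deduped.txt
```

This keeps first-in-wins semantics, whereas a hashed file keeps the last row written for each key.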
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
satheesh_color
Participant
Posts: 182
Joined: Thu Jun 16, 2005 2:05 am

Post by satheesh_color »

Hi,

Finally I got the answer. While defining the sequential file metadata there are three predefined delimiters: Tab, Space and Comma. Your delimiter is '}', so you have to specify '}' in the 'Other' delimiter textbox. Then the Varchar data will also get deduplicated.

Regards,
Satheesh.Rajan
kumar_s
Charter Member
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi Satheesh,

'No, there cannot be extra spaces, as the file is delimited and the delimiter is '}'.'
I hope you noticed this?
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
parag.s.27
Participant
Posts: 221
Joined: Fri Feb 17, 2006 3:38 am
Location: India
Contact:

Post by parag.s.27 »

kumar_s wrote:Hi Satheesh,

'No, there cannot be extra spaces, as the file is delimited and the delimiter is '}'.'
I hope you noticed this?
Hi Kumar,

Even in a delimited file, spaces can occur. All the source files I handle are '~' delimited. Below is a small sample of my data, where the length of the second column is 8.

Code: Select all

DEV~S0001          ~STORE~S0001     ~20030902~43.00000~.00000~.00000~.00000~.00000~ ~40.00000~ ~80.00000~.00000
If the source file is generated from SQL Server, where the data was imported from an Excel sheet, then spaces can creep in.

Spaces can also occur in a delimited file; the only advantage is that, since the data is delimited, the metadata will not fail on the length of the field.
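A quick way to confirm whether such padding is present before blaming the hashed file is to grep for spaces butting up against a delimiter. A sketch; the filename is an assumption and the sample record is modelled on the one above:

```shell
# Flag any line where a space sits next to a '~' delimiter or at line end.
# (Filename is an assumption; sample record modelled on the post above.)
printf '%s\n' 'DEV~S0001          ~STORE~S0001     ~20030902' > src.txt

# -n prints the offending line numbers; a match means the fields carry
# trailing padding that will leak into hashed-file keys.
grep -n -E ' ~|~ | $' src.txt
```

If this prints anything, trim the key columns (or fix the extract) before writing to the hashed file.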
Thanks & Regards
Parag Saundattikar
Certified for Infosphere DataStage v8.0
Post Reply