Problem in Deduping
Moderators: chulett, rschirm, roy
Hi,
I am trying to dedup a sequential file on a combination of three keys. The output is a hashed file with those three fields as keys.
But the file is not getting deduped; I am finding records with duplicate values of those keys. The parameters in the hashed file are all fine.
Please tell me the solution.
Thanks
Archana
-
- Participant
- Posts: 182
- Joined: Thu Jun 16, 2005 2:05 am
Archana, a hashed file key is unique. If you specified all 3 key columns when you wrote to the hashed file you cannot have duplicates. Are you certain you specified all 3 columns as keys for the hashed file? If you are certain, could you show an example of a duplicate record?
ArndW wrote:Archana, a hashed file key is unique. If you specified all 3 key columns when you wrote to the hashed file you cannot have duplicates. Are you certain you specified all 3 columns as keys for the hashed file? If you are certain, could you show an example of a duplicate record?
Yes, I have specified all three columns as key columns.
The duplicate records are:
DOYLESTOWN}PA}18901}ADDR_DIM
DOYLESTOWN}PA}18901}ADDR_DIM
The first 3 columns are specified as key columns.
Archana
-
- Participant
- Posts: 221
- Joined: Fri Feb 17, 2006 3:38 am
- Location: India
- Contact:
Hi Archana,ArndW wrote:Archana, a hashed file key is unique. If you specified all 3 key columns when you wrote to the hashed file you cannot have duplicates. Are you certain you specified all 3 columns as keys for the hashed file? If you are certain, could you show an example of a duplicate record?
I also had the same problem, but later I found out that the data had extra spaces in it. I just trimmed the data and tried again, and this time it worked fine. Trim the data before the transformation.
Maybe this will help.
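Parag's point can be sketched at the shell level. This is a hypothetical sample (not Archana's actual data): two rows whose keys differ only by a trailing space look identical when printed, so a dedup on the raw values keeps both, while trimming first collapses them into one.

```shell
# Hypothetical sample: the second row has a trailing space inside the first field
printf 'DOYLESTOWN}PA}18901}ADDR_DIM\n'  >  addr.txt
printf 'DOYLESTOWN }PA}18901}ADDR_DIM\n' >> addr.txt

# Dedup on the raw values: both rows survive, because
# "DOYLESTOWN" and "DOYLESTOWN " are different keys
sort -u addr.txt | wc -l

# Trim spaces around the '}' delimiter first, then dedup: one row survives
sed 's/ *} */}/g; s/ *$//' addr.txt | sort -u | wc -l
```

The first count is 2, the second is 1: the "duplicates" were never equal until the spaces were stripped.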
Thanks & Regards
Parag Saundattikar
Certified for Infosphere DataStage v8.0
Archana wrote:Yes, I have specified all three columns as key columns.
The duplicate records are:
DOYLESTOWN}PA}18901}ADDR_DIM
DOYLESTOWN}PA}18901}ADDR_DIM
The first 3 columns are specified as key columns.
Hashed files cannot have duplicate keys, so either you have embedded spaces or non-displayable characters in the 3 key columns, or you have not declared all 3 columns as "Key" in your hashed file stage.
parag.s.27 wrote:Hi Archana,ArndW wrote:Archana, a hashed file key is unique. If you specified all 3 key columns when you wrote to the hashed file you cannot have duplicates. Are you certain you specified all 3 columns as keys for the hashed file? If you are certain, could you show an example of a duplicate record?
I also had the same problem, but later I found out that the data had extra spaces in it. I just trimmed the data and tried again, and this time it worked fine. Trim the data before the transformation.
Maybe this will help
No, there cannot be extra spaces, as the file is a delimited file and the delimiter is '}'.
Archana
Archana wrote:No, there cannot be extra spaces, as the file is a delimited file and the delimiter is '}'.
Hi,
My source file was also '~' delimited, but when you define the metadata, if there are spaces in the data between the delimiters, those spaces come in with the data.
Thanks & Regards
Parag Saundattikar
Certified for Infosphere DataStage v8.0
1. Write a temporary copy of your DataStage job that writes just the 3 key columns to a sequential file called myfile.txt.
2. wc -l myfile.txt to get the number of lines.
3. sort -u myfile.txt > otherfile.txt
4. wc -l otherfile.txt
Are the counts in 2 & 4 the same? If you look at otherfile.txt, do you see what look like identical "keys"? (You can do a cat -v to display non-printing characters.)
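The check above also catches keys that differ only by a non-printing character. A hypothetical example, reusing the file names from the steps above, with a stray carriage return in one of the two "identical" rows:

```shell
# Two visually identical keys; the first carries a trailing carriage return (\r)
printf 'DOYLESTOWN}PA}18901\r\n' >  myfile.txt
printf 'DOYLESTOWN}PA}18901\n'   >> myfile.txt

wc -l myfile.txt            # 2 lines in
sort -u myfile.txt | wc -l  # still 2: the \r makes the "duplicates" distinct

# cat -v renders the carriage return visibly as ^M at the end of line 1
cat -v myfile.txt
```

Because the counts from steps 2 and 4 match here, the rows are not true duplicates, and cat -v shows exactly which invisible character is responsible.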
Hi Kumar,
kumar_s wrote:Hi Satheesh,
'No, there cannot be extra spaces, as the file is a delimited file and the delimiter is '}'.'
Hope you have noticed this?
Spaces can appear even in a delimited file. All the source files I handle are '~' delimited. This is a small part of my data, where the length of the second column is 8.
Code: Select all
DEV~S0001 ~STORE~S0001 ~20030902~43.00000~.00000~.00000~.00000~.00000~ ~40.00000~ ~80.00000~.00000
And also, for a delimited file, spaces can come in; the only advantage is that since the data is delimited, the metadata will not fail on the length of the field.
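The sample line above shows it directly: extracting field 2 on the '~' delimiter keeps the padding spaces, so the value is 8 characters, not 5. A minimal shell sketch using cut (the line is copied from the sample, trimmed to its first fields):

```shell
# First fields of the '~'-delimited sample line above
line='DEV~S0001   ~STORE~S0001   ~20030902'

# cut splits on the delimiter but does not trim: the padding travels with the field
f2=$(printf '%s' "$line" | cut -d'~' -f2)
printf '[%s]\n' "$f2"   # [S0001   ]
echo "${#f2}"           # 8
```

So a hashed-file key built from this field is "S0001   ", and it will not match an otherwise identical key without the padding unless the data is trimmed first.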
Thanks & Regards
Parag Saundattikar
Certified for Infosphere DataStage v8.0