Remove Duplicates from Sequential File

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

waitnsee
Participant
Posts: 23
Joined: Tue Jul 06, 2004 10:20 am

Remove Duplicates from Sequential File

Post by waitnsee »

How can we remove duplicates from a sequential file? I know we can load the data into a hashed file so that, based on the specified key, duplicates are removed.
Is there anything called a CRC32 transform that can remove duplicates?
If so, please let me know.

Thanks,
WNS
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Yes, but it's not totally reliable. By its very nature, CRC32 introduces a 1 in 2**32 risk of falsely identifying a duplicate (two different rows can produce the same CRC).
A hashed file or some form of searchable list (in a Routine) are the more common ways of removing duplicates. Or you could invest in PX and use the RemoveDuplicates stage. Or, within reason in a server job, an Aggregator stage.
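For what it's worth, the reason the hashed file works is that writes with the same key simply overwrite one another, so only one row per key survives. A rough sketch of that logic, in Python rather than a server job, purely to illustrate the idea (the file name, delimiter, and key position are made up):

# Illustration of hashed-file de-dup semantics: rows written with the same key
# overwrite each other, so the last occurrence per key survives.
rows_by_key = {}

with open("input.txt") as src:            # hypothetical pipe-delimited file
    for line in src:
        key = line.split("|")[0]          # assume the first column is the key
        rows_by_key[key] = line           # same key overwrites -> last row wins

with open("deduped.txt", "w") as dst:
    dst.writelines(rows_by_key.values())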
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

Also, to expand on what Ray mentioned regarding CRC32: I would not want you to avoid CRC32 for the wrong reasons, such as perceived unreliability. The statistics behind CRC32 are published, and I will summarize some of that research here.

If you process 4 billion rows of data in a given run and generate a CRC32 for each row, the 1 in 4 billion does not mean that you will have a failure. Rather, it means that each row carries a 1 in 4 billion chance of a failure, and you could well process 8 billion rows and not see one.
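As a rough back-of-the-envelope check of what that per-row chance works out to, treating each row as an independent 1 in 2**32 trial (a simplification, sketched in Python just to show the arithmetic):

import math

p = 1 / 2**32                    # chance of a false match on any single row (simplified)
rows = 4_000_000_000             # rows processed in the run

expected = rows * p                       # roughly 0.93 expected failures
prob_none = math.exp(-rows * p)           # Poisson approximation: ~39% chance of none

print(f"expected failures: {expected:.2f}")
print(f"chance of seeing none at all: {prob_none:.0%}")

In other words, even at 4 billion rows there is a reasonable chance of sailing through with no false match at all.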

The algorithm used in Ascential's version of CRC32 is the same algorithm used in Ethernet, FDDI, AAL5, PKZIP, hard disks, etc., and I don't see people running from those because they are unreliable.

Say you process some 30 million rows per day (365 days per year) and generate a CRC for each row; by that reckoning, your failure rate would be about one (1) failure every seven (7) years.

Pretty reliable stuff

I do agree with Ray that there might be a more efficient way to de-dup the sequential file, but you would not be wrong if you used CRC32 - just different.

Regards,
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

By duplicate row, do you mean a repeated key with different attributes, or do you mean the entire row is a duplicate?

If it is just a repeating key, is the data sorted in the order you like, and do you need either the first or last occurrence of the row, or something trickier?

How much data are we talking about here: a 100 million row file at 100 chars per row, or a 10 million row file at 1,000 chars per row?


Given the situation, there are many recommendations, and we need more information to give you the correct answer. "sort -u yourfilename > newfilename" will give you exactly one occurrence of every row using a char-for-char match, but you don't want to do that on a wide row, on hundreds of millions of rows, or on rows with columns you don't care to compare. I can give you 10 more options based on row count, row width, matching/duplicate criteria, sorted/unsorted, etc. You need to fully describe your situation.
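If the goal is the same char-for-char whole-row match but without re-ordering the file the way sort -u does, the logic amounts to something like this (a Python sketch, not a server-job design; the file names just follow the example above):

# Char-for-char de-dup that keeps the first occurrence and preserves input order.
seen = set()

with open("yourfilename") as src, open("newfilename", "w") as dst:
    for line in src:
        if line not in seen:          # whole-row comparison, every column counts
            seen.add(line)
            dst.write(line)

Whether that is sensible depends, as above, on row count and width, since every distinct row is held in memory.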
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You might be able to use a UNIX command, such as sort -u (if my memory serves), to remove duplicates.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

First solution Ken mentioned, Ray - attention to detail, lad! :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
waitnsee
Participant
Posts: 23
Joined: Tue Jul 06, 2004 10:20 am

Post by waitnsee »

kcbland wrote:By duplicate row, do you mean a repeated key with different attributes, or do you mean the entire row is a duplicate?

If it is just a repeating key, is the data sorted in the order you like, and do you need either the first or last occurrence of the row, or something trickier?

How much data are we talking about here: a 100 million row file at 100 chars per row, or a 10 million row file at 1,000 chars per row?


Given the situation, there are many recommendations, and we need more information to give you the correct answer. "sort -u yourfilename > newfilename" will give you exactly one occurrence of every row using a char-for-char match, but you don't want to do that on a wide row, on hundreds of millions of rows, or on rows with columns you don't care to compare. I can give you 10 more options based on row count, row width, matching/duplicate criteria, sorted/unsorted, etc. You need to fully describe your situation.

By duplicate data I mean the entire row is repeated. There is no key field. Where do I find CRC32? Please let me know how I would use it in my job.

Thanks,
WNS
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

As Ken asked - how large is your file? How wide? Sure, you can use CRC32 for this, but what a painful way to accomplish something simple. :?

Leverage your operating system. As others have mentioned, check out the sort -u command. Do it 'before job' or (better yet) in the Filter option of the sequential file stage.
-craig

"You can never have too many knives" -- Logan Nine Fingers
waitnsee
Participant
Posts: 23
Joined: Tue Jul 06, 2004 10:20 am

Post by waitnsee »

chulett wrote:As Ken asked - how large is your file? How wide? Sure, you can use CRC32 for this, but what a painful way to accomplish something simple. :?

Leverage your operating system. As others have mentioned, check out the sort -u command. Do it 'before job' or (better yet) in the Filter option of the sequential file stage.
The file has 5,000 rows. I have to know how to use CRC32, as I have never used it before; that's the reason I am being so specific about it.

Thanks,
WNS
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

WNS,

I believe you would be better off using the OS or some other method to remove duplicates (as others have outlined). You certainly could use CRC32, but as Craig points out, it would likely involve the most moving parts. With CRC32 you would have to store the CRC and then be able to look up that value as each row is streamed. That is not very efficient for removing duplicates, but it is very efficient for SCD processing. There is a download on the ADN of a DSX containing a job stream that implements CRC32, which I posted over a year ago. If you download that DSX, I'm sure you will see that CRC32 is probably overkill for what you want to accomplish.
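To make that store-and-look-up pattern concrete, here is roughly what it boils down to (a Python sketch using the standard zlib.crc32 rather than the DataStage transform, just to show the shape of it; the file names are made up):

import zlib

# Keep the CRC of every row seen so far and look it up as each new row streams in.
# Caveat from earlier in the thread: two different rows can share a CRC, so a
# collision would silently drop a row that is not really a duplicate.
seen_crcs = set()

with open("input.txt", "rb") as src, open("deduped.txt", "wb") as dst:
    for line in src:
        crc = zlib.crc32(line)            # 32-bit checksum of the whole row
        if crc not in seen_crcs:
            seen_crcs.add(crc)
            dst.write(line)

For 5,000 rows, keeping the whole lines in the set would be just as cheap, which is why the OS-level sort -u keeps being suggested.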

Regards,
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

CRC32 is a technique that is optimally used on extremely large data sets. You've got 5,000 rows; a "sort -u yourfilename > newfilename" will probably take two seconds. That is BY FAR the simplest and fastest solution for your volume.

When you hit 50 million rows in your file, then we can talk about building the necessary components to do that in an optimal fashion, which would not be "sort -u".
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

chulett wrote:First solution Ken mentioned, Ray - attention to detail, lad! :wink:
Ken's post wasn't there when I posted - sometimes it takes a long time for my posts to get through. :cry:
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Sorry, just itching for a chance to say that back at you. :D
ray.wurlod wrote:sometimes it takes a long time for my posts to get through.
Odd... yours was posted what seems to be over an hour later. :? That may also explain why other people sometimes come in later and repeat almost the same thing the previous poster said. Not sure what might be causing something like that, unless your posts are being held up in Customs...
-craig

"You can never have too many knives" -- Logan Nine Fingers
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

You never know, I might have gone back in later and added that paragraph in. :twisted:
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Now that's funny... and opens the door for all kinds of mayhem. 8)
-craig

"You can never have too many knives" -- Logan Nine Fingers