Remove Duplicates from Sequential File

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

waitnsee
Participant
Posts: 23
Joined: Tue Jul 06, 2004 10:20 am

Remove Duplicates from Sequential File

Post by waitnsee »

How can we remove duplicates from a sequential file? I know we can load the data into a hashed file so that, based on the specified key, duplicates are removed.
Is there anything called a CRC32 transform that can remove duplicates?
If so, please let me know.

Thanks,
WNS
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Yes, but it's not totally reliable. By its very nature, CRC32 introduces a 1 in 2**32 risk of falsely identifying a duplicate (two different rows can produce the same CRC).
A hashed file or some form of searchable list (in a Routine) are the more common ways of removing duplicates. Or you could invest in PX and use the RemoveDuplicates stage. Or, within reason in a server job, an Aggregator stage.
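For what it's worth, the reason the hashed file works is that writes with the same key simply overwrite one another, so only one row per key survives. A rough sketch of that logic, in Python rather than a server job, purely to illustrate the idea (the file name, delimiter, and key position are made up):

# Illustration of hashed-file de-dup semantics: rows written with the same key
# overwrite each other, so the last occurrence per key survives.
rows_by_key = {}

with open("input.txt") as src:            # hypothetical pipe-delimited file
    for line in src:
        key = line.split("|")[0]          # assume the first column is the key
        rows_by_key[key] = line           # same key overwrites -> last row wins

with open("deduped.txt", "w") as dst:
    dst.writelines(rows_by_key.values())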
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

Also, to expand on what Ray mentioned regarding CRC32: I would not want you to avoid CRC32 for the wrong reasons, such as perceived unreliability. The statistics behind CRC32 are published, and I will summarize some of that research here.

If you process 4 billion rows of data in a given run and generate a CRC32 for each row, the 1 in 4 billion does not mean that you will have a failure. Rather, it means that each row carries a 1 in 4 billion chance of a failure, and you could well process 8 billion rows and not see one.
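As a rough back-of-the-envelope check of what that per-row chance works out to, treating each row as an independent 1 in 2**32 trial (a simplification, sketched in Python just to show the arithmetic):

import math

p = 1 / 2**32                    # chance of a false match on any single row (simplified)
rows = 4_000_000_000             # rows processed in the run

expected = rows * p                       # roughly 0.93 expected failures
prob_none = math.exp(-rows * p)           # Poisson approximation: ~39% chance of none

print(f"expected failures: {expected:.2f}")
print(f"chance of seeing none at all: {prob_none:.0%}")

In other words, even at 4 billion rows there is a reasonable chance of sailing through with no false match at all.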

The algorithm used in Ascential's version of CRC32 is the same algorithm used in Ethernet, FDDI, AAL5, PKZIP, hard disks, etc., and I don't see people running from those because they are unreliable.

Say you process some 30 million rows per day (365 days per year) and generate a CRC for each row; by that reckoning, your failure rate would be about one (1) failure every seven (7) years.

Pretty reliable stuff

I do agree with Ray that there might be a more efficient way to de-dup the sequential file, but you would not be wrong if you used CRC32 - just different.

Regards,
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

By duplicate row, do you mean a repeated key with different attributes, or do you mean the entire row is a duplicate?

If it is just a repeating key, is the data sorted in the order you like, and do you need either the first or last occurrence of the row, or something trickier?

How much data are we talking about here: a 100 million row file at 100 chars per row, or a 10 million row file at 1,000 chars per row?


Given the situation, there are many recommendations, and we need more information to give you the correct answer. "sort -u yourfilename > newfilename" will give you exactly one occurrence of every row using a char-for-char match, but you don't want to do that on a wide row, on hundreds of millions of rows, or on rows with columns you don't care to compare. I can give you 10 more options based on row count, row width, matching/duplicate criteria, sorted/unsorted, etc. You need to fully describe your situation.
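If the goal is the same char-for-char whole-row match but without re-ordering the file the way sort -u does, the logic amounts to something like this (a Python sketch, not a server-job design; the file names just follow the example above):

# Char-for-char de-dup that keeps the first occurrence and preserves input order.
seen = set()

with open("yourfilename") as src, open("newfilename", "w") as dst:
    for line in src:
        if line not in seen:          # whole-row comparison, every column counts
            seen.add(line)
            dst.write(line)

Whether that is sensible depends, as above, on row count and width, since every distinct row is held in memory.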
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You might be able to use a UNIX command, such as sort -u (if my memory serves), to remove duplicates.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

First solution Ken mentioned, Ray - attention to detail, lad! :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
waitnsee
Participant
Posts: 23
Joined: Tue Jul 06, 2004 10:20 am

Post by waitnsee »

kcbland wrote:By duplicate row, do you mean a repeated key with different attributes, or do you mean the entire row is a duplicate?

If it is just a repeating key, is the data sorted in the order you like, and do you need either the first or last occurrence of the row, or something trickier?

How much data are we talking about here: a 100 million row file at 100 chars per row, or a 10 million row file at 1,000 chars per row?


Given the situation, there are many recommendations, and we need more information to give you the correct answer. "sort -u yourfilename > newfilename" will give you exactly one occurrence of every row using a char-for-char match, but you don't want to do that on a wide row, on hundreds of millions of rows, or on rows with columns you don't care to compare. I can give you 10 more options based on row count, row width, matching/duplicate criteria, sorted/unsorted, etc. You need to fully describe your situation.

By duplicate data I mean the entire row is repeated. There is no key field. Where do I find CRC32? Please let me know how I would use it in my job.

Thanks,
WNS
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

As Ken asked - how large is your file? How wide? Sure, you can use CRC32 for this, but what a painful way to accomplish something simple. :?

Leverage your operating system. As others have mentioned, check out the sort -u command. Do it 'before job' or (better yet) in the Filter option of the sequential file stage.
-craig

"You can never have too many knives" -- Logan Nine Fingers
waitnsee
Participant
Posts: 23
Joined: Tue Jul 06, 2004 10:20 am

Post by waitnsee »

chulett wrote:As Ken asked - how large is your file? How wide? Sure, you can use CRC32 for this, but what a painful way to accomplish something simple. :?

Leverage your operating system. As others have mentioned, check out the sort -u command. Do it 'before job' or (better yet) in the Filter option of the sequential file stage.
The file has 5,000 rows. I have to know how to use CRC32, as I have never used it before; that's the reason I am being so specific about it.

Thanks,
WNS
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

WNS,

I believe you would be better off using the OS or some other method to remove duplicates (as others have outlined). You certainly could use CRC32, but as Craig points out, it would likely involve the most moving parts. With CRC32 you would have to store the CRC and then be able to look up that value as each row is streamed. That is not very efficient for removing duplicates, but it is very efficient for SCD processing. There is a download on the ADN of a DSX containing a job stream that implements CRC32, which I posted over a year ago. If you download that DSX, I'm sure you will see that CRC32 is probably overkill for what you want to accomplish.
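To make that store-and-look-up pattern concrete, here is roughly what it boils down to (a Python sketch using the standard zlib.crc32 rather than the DataStage transform, just to show the shape of it; the file names are made up):

import zlib

# Keep the CRC of every row seen so far and look it up as each new row streams in.
# Caveat from earlier in the thread: two different rows can share a CRC, so a
# collision would silently drop a row that is not really a duplicate.
seen_crcs = set()

with open("input.txt", "rb") as src, open("deduped.txt", "wb") as dst:
    for line in src:
        crc = zlib.crc32(line)            # 32-bit checksum of the whole row
        if crc not in seen_crcs:
            seen_crcs.add(crc)
            dst.write(line)

For 5,000 rows, keeping the whole lines in the set would be just as cheap, which is why the OS-level sort -u keeps being suggested.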

Regards,
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

CRC32 is a technique that is optimally used on extremely large data sets. You've got 5,000 rows; a "sort -u yourfilename > newfilename" will probably take two seconds. That is BY FAR the simplest and fastest solution for your volume.

When you hit 50 million rows in your file, then we can talk about building the necessary components to do that in an optimal fashion, which would not be "sort -u".
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

chulett wrote:First solution Ken mentioned, Ray - attention to detail, lad! :wink:
Ken's post wasn't there when I posted - sometimes it takes a long time for my posts to get through. :cry:
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Sorry, just itching for a chance to say that back at you. :D
ray.wurlod wrote:sometimes it takes a long time for my posts to get through.
Odd... yours was posted what seems to be over an hour later. :? That may also explain why other people sometimes come in later and repeat almost the same thing the previous poster said. Not sure what might be causing something like that, unless your posts are being held up in Customs...
-craig

"You can never have too many knives" -- Logan Nine Fingers
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

You never know, I might have gone back in later and added that paragraph in. :twisted:
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Now that's funny... and opens the door for all kinds of mayhem. 8)
-craig

"You can never have too many knives" -- Logan Nine Fingers