How to Select distinct records from Files

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

itsme_rp
Participant
Posts: 5
Joined: Thu Oct 28, 2004 2:29 am

How to Select distinct records from Files

Post by itsme_rp »

Hi All,

I have to select distinct records from a file and load them into another file.
How can I fetch the distinct records from the file?
This is an urgent requirement, so please send me the logic ASAP.
Thanks in advance.
Titto
Participant
Posts: 148
Joined: Tue Jun 21, 2005 7:49 am

Post by Titto »

Read your source file and write it to a hashed file, with the columns that must be distinct defined as its key. Writes to a hashed file destructively overwrite any existing row with the same key, so only distinct records remain.
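
A minimal job design sketch for this approach (stage and file names here are just placeholders):

Pass 1: SeqFile_Source --> Transformer --> HashedFile_Distinct (key = the columns that must be distinct)
Pass 2: HashedFile_Distinct --> Transformer --> SeqFile_Target

Because of the destructive overwrite on the key, the second pass reads back exactly one row per key.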
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

Read the FAQ "How do I remove duplicate rows from a data source?" It shows several options.
itsme_rp
Participant
Posts: 5
Joined: Thu Oct 28, 2004 2:29 am

Thanks

Post by itsme_rp »

Titto wrote: Read your source file and write it to a hashed file ...
Thanks a lot. This is working.
yaminids
Premium Member
Posts: 387
Joined: Mon Oct 18, 2004 1:04 pm

Re: How to Select distinct records from Files

Post by yaminids »

Hello there,

I think you cannot use the Hashed File stage if the source has many records. We accomplished the task by using stage variables. The only requirement for our approach is that the data has to be sorted before it arrives at the Transformer stage.
For example, if the source file has 4 columns and you want to select distinct records based on the first two columns:
1) Create 3 stage variables (ChkDup, A and B) -- be careful about the order of the variables: ChkDup must come first, so that A and B still hold the previous row's values when it is evaluated.
2) Map the two distinct columns to the variables A and B.
3) In the derivation of stage variable ChkDup, put the following:
If (InLink.Col1 <> A) Or (InLink.Col2 <> B) Then 'Y' Else 'N'
4) In the constraint of the output link, put ChkDup = 'Y'.
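
As a sketch, the stage variable grid would look like this (evaluation is top to bottom, which is why ChkDup must come first; InLink and the column names are placeholders from the example above):

ChkDup : If (InLink.Col1 <> A) Or (InLink.Col2 <> B) Then 'Y' Else 'N'
A : InLink.Col1
B : InLink.Col2

Output link constraint: ChkDup = 'Y'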

Hope this helps
Yamini
loveojha2
Participant
Posts: 362
Joined: Thu May 26, 2005 12:59 am

Re: How to Select distinct records from Files

Post by loveojha2 »

yaminids wrote: I think you cannot use the Hashed File stage if the source has many records.
Hey,
Can you share with us what the problem was with the large number of records? How many records are you talking about? What size was the hashed file?
Success consists of getting up just one more time than you fall.
yaminids
Premium Member
Posts: 387
Joined: Mon Oct 18, 2004 1:04 pm

Re: How to Select distinct records from Files

Post by yaminids »

If I am not wrong, the size of a hashed file cannot be more than 2GB. We were dealing with 50 million rows and the size of the hashed file reached that limit. Hence it's always advisable to keep the hashed file as small as possible by storing only the required columns.

Yamini
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Right... so it's not about the number of records but the total size of the hashed file. And if you need to cross the 2GB barrier, create the hashed file as a 64-bit hashed file. At that point, you are basically limited by the amount of disk space you have on hand. :wink:
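
From memory (this is TCL of the underlying UniVerse engine, so verify the syntax against your release's documentation), an existing hashed file can be converted with:

RESIZE MyHashedFile * * * 64BIT

where MyHashedFile is a placeholder name and the asterisks keep the current type, modulus and separation.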
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

... or 19 million TB (approximately), whichever comes first! 8)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dls
Premium Member
Posts: 96
Joined: Tue Sep 09, 2003 5:15 pm

Post by dls »

Have you tried using the UNIX sort command with the -u option?

DataStage may not need to be part of the solution if the utility suffices.
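
For example (file names are placeholders; the options are standard sort flags):

sort -u input.txt > distinct.txt

or, to keep one row per unique value of the first two fields of a comma-delimited file:

sort -t, -u -k1,2 input.txt > distinct.txt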
anil kumar
Participant
Posts: 8
Joined: Sat Jun 25, 2005 8:03 am
Location: bangalore

Re: How to Select distinct records from Files

Post by anil kumar »

yaminids wrote: 3) In the derivation of stage variable ChkDup, put the following:
If (InLink.Col1 <> A) Or (InLink.Col2 <> B) Then 'Y' Else 'N'
4) In the constraint of the output link, put ChkDup = 'Y'.

Hi,

Can you explain the above logic clearly: how does it work to remove duplicate records from the input file?
I suspect the ChkDup variable will always be 'N', so none of the records will be moved out of the Transformer.

Can you please clarify this doubt?

Thanks in advance.
yaminids
Premium Member
Posts: 387
Joined: Mon Oct 18, 2004 1:04 pm

Re: How to Select distinct records from Files

Post by yaminids »

Anil,

The stage variables store the data that came from the link on the previous row. So when a new row arrives, its values are compared with the stage variables before a decision is made. The trick is the evaluation order: ChkDup is evaluated first, while A and B still hold the previous row's values; only then are A and B updated with the current row's values.

For example:

If in the first row Col1 = 20 and Col2 = 30, then when this row passes through the Transformer, the stage variables (being mapped to the columns) pick up those values: in our case A = 20 and B = 30.
If in the next row Col1 = 20 and Col2 = 40, these values are compared with A and B (If (InLink.Col1 <> A) Or (InLink.Col2 <> B) Then 'Y' Else 'N').
Since the values do not match, ChkDup = 'Y'. Similarly, if the values do match the stage variables, ChkDup will be 'N'.
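
A short trace over a sorted sample (values invented for illustration):

Row 1: Col1=20, Col2=30; A,B = empty,empty -> ChkDup = 'Y' (row passes)
Row 2: Col1=20, Col2=30; A,B = 20,30 -> ChkDup = 'N' (duplicate dropped)
Row 3: Col1=20, Col2=40; A,B = 20,30 -> ChkDup = 'Y' (row passes)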

Hope this is helpful. Let me know if you need more information
Yamini
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

This is an extremely fast dedupe method if the data is sorted. But remember that sorting is one of the slowest operations in server jobs, so a Unix sort will be faster, and you may as well use its dedupe option while you are at it! The hashed file method doesn't require sorting, so it may be the faster DataStage-only method.

Anyone feel free to add the Unix dedupe syntax to the FAQ on removing duplicates.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi,

Unix sort -u also has its own limitation on size. 8)

-Kumar
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Kumar,

What is the size limit for the UNIX sort command? I have sorted some large files, and it seems to be constrained only by the amount of space available for temporary file creation.