
How to Select distinct records from Files

Posted: Tue Dec 13, 2005 10:35 pm
by itsme_rp
Hi All,

I have to select distinct records from a file and load them into another file.
How can I fetch distinct records from a file?
This is an urgent requirement, so please send me the logic ASAP.
Thanks in advance.

Posted: Tue Dec 13, 2005 10:41 pm
by Titto
Read your source file and write it to a hashed file, keyed on the columns that define a distinct record. A write with a duplicate key overwrites the earlier row, so the hashed file ends up holding only the distinct records.
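
For what it's worth, the same idea (a structure keyed on the record, where a repeated key collapses onto an existing entry) can be sketched outside DataStage with awk. This is just an illustration; source.txt and distinct.txt are made-up names:

    # Print a row only the first time its full contents are seen.
    # The associative array plays the role of the hashed file's key lookup.
    awk '!seen[$0]++' source.txt > distinct.txt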

Posted: Tue Dec 13, 2005 10:45 pm
by vmcburney
Read the FAQ "How do I remove duplicate rows from a data source?" It shows several options.

Thanks

Posted: Wed Dec 14, 2005 12:36 am
by itsme_rp
Titto wrote: Read your source file and write it to a hashed file, keyed on the columns that define a distinct record.
Thanks a lot. This works.

Re: How to Select distinct records from Files

Posted: Wed Dec 14, 2005 12:43 am
by yaminids
Hello there,

I think you cannot use the Hashed File stage if the source has many records. We accomplished the task by using stage variables. The only requirement for our approach is that the data has to be sorted before it arrives at the Transformer stage.
For example, if the source file has 4 columns and you want to select distinct records based on the first two columns:
1) Create 3 stage variables (ChkDup, A and B). Be careful of the order of the variables: ChkDup must come first, so that it is evaluated against the previous row's values before A and B are overwritten.
2) Map the two columns to the variables A and B.
3) In the derivation of stage variable ChkDup put the following:
If (InLink.Col1 <> A or InLink.Col2 <> B) THEN 'Y' ELSE 'N'
4) In the constraint of the output link put ChkDup = 'Y' (a command-line sketch of the same idea follows below).
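
As a cross-check of the logic, here is the same compare-with-the-previous-row idea sketched on the command line for a sorted file, deduplicating on the first two columns. The file names and the pipe delimiter are assumptions for illustration:

    # Sort on the first two fields, then let prev1/prev2 play the role
    # of stage variables A and B: emit a row only when its key differs
    # from the previous row's key.
    sort -t '|' -k1,2 input.txt |
    awk -F '|' '$1 != prev1 || $2 != prev2 { print } { prev1 = $1; prev2 = $2 }' > distinct.txt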

Hope this helps
Yamini

Re: How to Select distinct records from Files

Posted: Wed Dec 14, 2005 12:58 am
by loveojha2
yaminids wrote: I think you cannot use the Hashed File stage if the source has many records.
Hey,
Can you share with us what the problem was with a large number of records? How many records are you talking about, and what size was the hashed file?

Re: How to Select distinct records from Files

Posted: Wed Dec 14, 2005 1:05 am
by yaminids
If I am not wrong, the size of a hashed file cannot be more than 2GB by default. We were dealing with 50 million rows and the hashed file reached that limit. Hence it's always advisable to keep the hashed file as small as possible by loading only the required columns.

Yamini

Posted: Wed Dec 14, 2005 1:20 am
by chulett
Right... so it's not about the number of records but the total size of the hashed file. And if you need to cross the 2GB barrier, create the hashed file as a 64BIT hashed file. At that point, you are basically limited by the amount of disk space you have on hand. :wink:

Posted: Wed Dec 14, 2005 3:11 pm
by ray.wurlod
... or 19 million TB (approximately), whichever comes first! 8)

Posted: Wed Dec 14, 2005 3:26 pm
by dls
Have you tried using the UNIX sort command with the -u option?

DS may not be part of the solution if the utility will suffice.
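
For reference, the basic syntax is below; the file names are placeholders, and the keyed form assumes a pipe-delimited file:

    # Whole-line dedupe: -u keeps one copy of each identical line.
    sort -u input.txt > distinct.txt

    # Dedupe on the first two pipe-delimited fields only: with -u,
    # sort compares just the key fields and keeps the first row per key.
    sort -t '|' -k1,2 -u input.txt > distinct.txt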

Re: How to Select distinct records from Files

Posted: Tue Dec 20, 2005 4:29 am
by anil kumar
yaminids wrote: We accomplished the task by using stage variables. The only requirement for our approach is that the data has to be sorted before it arrives at the Transformer stage. [...]
3) In the derivation of stage variable ChkDup put the following:
If (InLink.Col1 <> A or InLink.Col2 <> B) THEN 'Y' ELSE 'N'
4) In the constraint of the output link put ChkDup = 'Y'


Hi

Can you explain the above logic clearly, and how it works to remove duplicate records from the input file? I suspect the ChkDup variable will always be 'N', so none of the records will be moved out of the Transformer.

Can you please clarify this doubt?

thanks in advance

Re: How to Select distinct records from Files

Posted: Tue Dec 20, 2005 4:08 pm
by yaminids
Anil,

The stage variables store the data that came from the link on the previous row. So when a new row arrives, the data on the link is compared with the stage variables before a decision is made.

For example

Say the first row has Col1 = 20 and Col2 = 30. When this row passes through the Transformer, the stage variables pick up the data from the columns because the columns are mapped to them; in our case A = 20 and B = 30.
If the next row has Col1 = 20 and Col2 = 40, these values are compared with A and B (If (InLink.Col1 <> A or InLink.Col2 <> B) THEN 'Y' ELSE 'N').
Since the values do not match, ChkDup = 'Y'. Similarly, if the values match the stage variables, ChkDup will be 'N' and the constraint drops the row.
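
To make this concrete, here is the same two-row example (plus a genuine duplicate third row, added here for illustration) pushed through the awk equivalent sketched earlier in the thread:

    # Rows 1 and 2 differ from their predecessors on the first two fields
    # and are printed; row 3 matches row 2 (ChkDup = 'N') and is dropped.
    printf '20|30|a|b\n20|40|a|b\n20|40|a|b\n' |
    awk -F '|' '$1 != prev1 || $2 != prev2 { print } { prev1 = $1; prev2 = $2 }'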

Hope this is helpful. Let me know if you need more information
Yamini

Posted: Tue Dec 20, 2005 5:05 pm
by vmcburney
This is an extremely fast dedupe method if the data is sorted. But remember that sorting is one of the slowest functions in server jobs, so a Unix sort will be faster, and you may as well use the dedupe option of Unix sort while you are at it! The hashed file method doesn't require sorting, so it may be the faster DataStage-only method.

Anyone feel free to add the Unix dedupe syntax to the FAQ on removing duplicates.

Posted: Tue Dec 20, 2005 8:50 pm
by kumar_s
Hi,

UNIX sort -u also has its own size limitations 8)

-Kumar

Posted: Wed Dec 21, 2005 2:14 am
by ArndW
Kumar,

What is the size limit for the UNIX sort command? I have sorted some large files and it seemed to be constrained only by the amount of space available for temporary file creation.
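
For what it's worth, a minimal sketch of working around the temporary-space constraint, assuming a sort (GNU, BSD and most others) that supports the -T option; the paths are placeholders:

    # Point sort's temporary files at a filesystem with more room.
    sort -T /bigdisk/tmp -u input.txt > distinct.txt

    # GNU sort also honours the TMPDIR environment variable.
    TMPDIR=/bigdisk/tmp sort -u input.txt > distinct.txt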