How to Select distinct records from Files

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

itsme_rp
Participant
Posts: 5
Joined: Thu Oct 28, 2004 2:29 am

How to Select distinct records from Files

Post by itsme_rp »

Hi All,

I have to select distinct records from a file and load them into another file.
How can I fetch the distinct records from the file?
This is an urgent requirement, so please send me the logic ASAP.
Thanks in advance.
Titto
Participant
Posts: 148
Joined: Tue Jun 21, 2005 7:49 am

Post by Titto »

Read your source file and write it to a hashed file, with the columns that must be distinct defined as its key. Writes to a hashed file destructively overwrite any existing row with the same key, so only distinct records remain.
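
A minimal job design sketch for this approach (stage and file names here are just placeholders):

Pass 1: SeqFile_Source --> Transformer --> HashedFile_Distinct (key = the columns that must be distinct)
Pass 2: HashedFile_Distinct --> Transformer --> SeqFile_Target

Because of the destructive overwrite on the key, the second pass reads back exactly one row per key.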
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

Read the FAQ "How do I remove duplicate rows from a data source?" It shows several options.
itsme_rp
Participant
Posts: 5
Joined: Thu Oct 28, 2004 2:29 am

Thanks

Post by itsme_rp »

Titto wrote: Read your source file and write it to a hashed file ...
Thanks a lot. This is working.
yaminids
Premium Member
Posts: 387
Joined: Mon Oct 18, 2004 1:04 pm

Re: How to Select distinct records from Files

Post by yaminids »

Hello there,

I think you cannot use the Hashed File stage if the source has many records. We accomplished the task by using stage variables. The only requirement for our approach is that the data has to be sorted before it arrives at the Transformer stage.
For example, if the source file has 4 columns and you want to select distinct records based on the first two columns:
1) Create 3 stage variables (ChkDup, A and B) -- be careful about the order of the variables: ChkDup must come first, so that A and B still hold the previous row's values when it is evaluated.
2) Map the two distinct columns to the variables A and B.
3) In the derivation of stage variable ChkDup, put the following:
If (InLink.Col1 <> A) Or (InLink.Col2 <> B) Then 'Y' Else 'N'
4) In the constraint of the output link, put ChkDup = 'Y'.
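
As a sketch, the stage variable grid would look like this (evaluation is top to bottom, which is why ChkDup must come first; InLink and the column names are placeholders from the example above):

ChkDup : If (InLink.Col1 <> A) Or (InLink.Col2 <> B) Then 'Y' Else 'N'
A : InLink.Col1
B : InLink.Col2

Output link constraint: ChkDup = 'Y'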

Hope this helps
Yamini
loveojha2
Participant
Posts: 362
Joined: Thu May 26, 2005 12:59 am

Re: How to Select distinct records from Files

Post by loveojha2 »

yaminids wrote: I think you cannot use the Hashed File stage if the source has many records.
Hey,
Can you share with us what the problem was with the large number of records? How many records are you talking about? What size was the hashed file?
Success consists of getting up just one more time than you fall.
yaminids
Premium Member
Posts: 387
Joined: Mon Oct 18, 2004 1:04 pm

Re: How to Select distinct records from Files

Post by yaminids »

If I am not wrong, the size of a hashed file cannot be more than 2GB. We were dealing with 50 million rows and the size of the hashed file reached that limit. Hence it's always advisable to keep the hashed file as small as possible by storing only the required columns.

Yamini
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Right... so it's not about the number of records but the total size of the hashed file. And if you need to cross the 2GB barrier, create the hashed file as a 64-bit hashed file. At that point, you are basically limited by the amount of disk space you have on hand. :wink:
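
From memory (this is TCL of the underlying UniVerse engine, so verify the syntax against your release's documentation), an existing hashed file can be converted with:

RESIZE MyHashedFile * * * 64BIT

where MyHashedFile is a placeholder name and the asterisks keep the current type, modulus and separation.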
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

... or 19 million TB (approximately), whichever comes first! 8)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dls
Premium Member
Posts: 96
Joined: Tue Sep 09, 2003 5:15 pm

Post by dls »

Have you tried using the UNIX sort command with the -u option?

DataStage may not need to be part of the solution if the utility suffices.
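
For example (file names are placeholders; the options are standard sort flags):

sort -u input.txt > distinct.txt

or, to keep one row per unique value of the first two fields of a comma-delimited file:

sort -t, -u -k1,2 input.txt > distinct.txt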
anil kumar
Participant
Posts: 8
Joined: Sat Jun 25, 2005 8:03 am
Location: bangalore

Re: How to Select distinct records from Files

Post by anil kumar »

yaminids wrote: 3) In the derivation of stage variable ChkDup, put the following:
If (InLink.Col1 <> A) Or (InLink.Col2 <> B) Then 'Y' Else 'N'
4) In the constraint of the output link, put ChkDup = 'Y'.

Hi,

Can you explain the above logic clearly: how does it work to remove duplicate records from the input file?
I suspect the ChkDup variable will always be 'N', so none of the records will be moved out of the Transformer.

Can you please clarify this doubt?

Thanks in advance.
yaminids
Premium Member
Posts: 387
Joined: Mon Oct 18, 2004 1:04 pm

Re: How to Select distinct records from Files

Post by yaminids »

Anil,

The stage variables store the data that came from the link on the previous row. So when a new row arrives, its values are compared with the stage variables before a decision is made. The trick is the evaluation order: ChkDup is evaluated first, while A and B still hold the previous row's values; only then are A and B updated with the current row's values.

For example:

If in the first row Col1 = 20 and Col2 = 30, then when this row passes through the Transformer, the stage variables (being mapped to the columns) pick up those values: in our case A = 20 and B = 30.
If in the next row Col1 = 20 and Col2 = 40, these values are compared with A and B (If (InLink.Col1 <> A) Or (InLink.Col2 <> B) Then 'Y' Else 'N').
Since the values do not match, ChkDup = 'Y'. Similarly, if the values do match the stage variables, ChkDup will be 'N'.
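
A short trace over a sorted sample (values invented for illustration):

Row 1: Col1=20, Col2=30; A,B = empty,empty -> ChkDup = 'Y' (row passes)
Row 2: Col1=20, Col2=30; A,B = 20,30 -> ChkDup = 'N' (duplicate dropped)
Row 3: Col1=20, Col2=40; A,B = 20,30 -> ChkDup = 'Y' (row passes)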

Hope this is helpful. Let me know if you need more information
Yamini
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

This is an extremely fast dedupe method if the data is sorted. But remember that sorting is one of the slowest operations in server jobs, so a Unix sort will be faster, and you may as well use its dedupe option while you are at it! The hashed file method doesn't require sorting, so it may be the faster DataStage-only method.

Anyone feel free to add the Unix dedupe syntax to the FAQ on removing duplicates.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi,

Unix sort -u also has its own limitation on size. 8)

-Kumar
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Kumar,

What is the size limit for the UNIX sort command? I have sorted some large files, and it seems to be constrained only by the amount of space available for temporary file creation.