How to Select distinct records from Files
Moderators: chulett, rschirm, roy
Hi All,
I have to select distinct records from a file and load them into another file.
How can I fetch distinct records from a file?
This is an urgent requirement; please send me the logic as soon as possible.
Thanks in advance.
Read the FAQ "How do I remove duplicate rows from a data source?" It shows several options.
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
Re: How to Select distinct records from Files
Hello there,
I don't think you can use the Hashed File stage if the source has many records. We accomplished the task by using stage variables. The only requirement for our approach is that the data has to be sorted before it arrives at the Transformer stage.
For example, if the source file has 4 columns and you want to select distinct records based on the first two columns:
1) Create 3 stage variables (ChkDup, A and B) -- be careful about the order of the variables (ChkDup must be evaluated first)
2) Map the (distinct) key columns to the variables A and B
3) In the derivation of stage variable ChkDup, put the following:
If (InLink.Col1 <> A or InLink.Col2 <> B) THEN 'Y' ELSE 'N'
4) In the constraint of the output link, put ChkDup = 'Y'
Hope this helps
Yamini
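Since the stage-variable trick above boils down to comparing each row's key columns with the previous row's, the same idea can be sketched outside DataStage. Here is a minimal Python sketch (the file paths and the four-column, comma-delimited layout are assumptions for illustration, not anything from the original job):

```python
import csv

def distinct_rows(in_path, out_path):
    """Keep only the first row of each run of consecutive rows that
    share the same values in the first two columns. Like the stage-variable
    approach, this assumes the input is already sorted on those columns."""
    prev_key = None  # plays the role of stage variables A and B
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            key = (row[0], row[1])
            if key != prev_key:       # ChkDup = 'Y': new key, pass the row
                writer.writerow(row)
            prev_key = key            # update A and B for the next row
```

Because only the previous key is held in memory, this handles arbitrarily large (sorted) files, which is the whole point of avoiding the hashed file.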
Re: How to Select distinct records from Files
Hey,
yaminids wrote: I think you cannot use the Hashed File stage if the source has many records.
Can you share with us what the problem was with a large number of records? How many records are you talking about? What was the size of the hashed file?
Success consists of getting up just one more time than you fall.
Re: How to Select distinct records from Files
If I am not wrong, the size of a hashed file cannot be more than 2GB. We were dealing with 50 million rows and the size of the hashed file reached that limit. Hence it is always advisable to keep the size of a hashed file as small as possible by extracting only the required columns.
Yamini
Right... so it's not about the number of records but the total size of the hashed file. And if you need to cross the 2GB barrier, create the hashed file as a 64BIT hashed file. At that point, you are basically limited by the amount of disk space you have on hand.
-craig
"You can never have too many knives" -- Logan Nine Fingers
Re: How to Select distinct records from Files
yaminids wrote: [the stage-variable approach described above]
Hi,
Can you explain how the above logic works to remove duplicate records from the input file? I suspect the ChkDup variable will always be 'N', so none of the records will be moved out of the Transformer.
Can you please clarify this doubt?
Thanks in advance.
Re: How to Select distinct records from Files
Anil,
The stage variables store the data from the previous row on the link, so when a new row arrives the data on the link is compared with the stage variables before making a decision. Because stage variables are evaluated top to bottom, ChkDup is derived before A and B are updated, so it always compares the incoming row with the previous one (this is why the order of the variables matters).
For example, if in the first row Col1 = 20 and Col2 = 30, then when this row passes through the Transformer the stage variables take their values from the columns they are mapped to; in our case A = 20 and B = 30.
If in the next row Col1 = 20 and Col2 = 40, these values are compared with A and B (If (InLink.Col1 <> A or InLink.Col2 <> B) THEN 'Y' ELSE 'N'). Since the values do not match, ChkDup = 'Y'. Similarly, if the values match the stage variables, ChkDup will be 'N'.
Hope this is helpful. Let me know if you need more information
Yamini
This is an extremely fast dedupe method if the data is sorted. But remember that sorting is one of the slowest functions in server jobs, so a Unix sort will be faster, and you may as well use the dedupe option of the Unix sort while you are at it! The hashed file method doesn't require sorting, so it may be the faster DataStage-only method.
Anyone feel free to add the Unix dedupe syntax to the FAQ on removing duplicates.
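To get that FAQ addition started, a Unix sort-based dedupe might look like the following (file names and the comma-delimited layout are made up for the example; -u keeps one line per distinct key, and -t/-k restrict the comparison to the first two fields):

```shell
# Sample input: the third line duplicates the first two fields of the second
printf 'b,2,x\na,1,y\na,1,z\n' > input.csv

# Whole-record dedupe: sort and keep one copy of each fully identical line
sort -u input.csv > distinct_all.csv

# Dedupe on the first two comma-delimited fields only:
# -t, sets the field delimiter, -k1,2 compares fields 1-2, -u keeps one row per key
sort -t, -k1,2 -u input.csv > distinct.csv
```

Note that with -u and a key specification, sort compares only the key fields, so which of several lines sharing a key survives is up to the implementation.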
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
Kumar,
what is the size limit for a UNIX sort command? I have sorted some large files and it seems to be constrained by the amount of space available for temporary file creation.