Duplicate records in Source
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 892
- Joined: Thu Oct 16, 2003 5:18 am
Duplicate records in Source
Hi All,
I am joining 5 tables(using primary key columns of the 5 tables) to get the source data. What are the options you would use within this stage to ensure that this is correctly done i.e (without any duplicate data or cross product). I am presently using
"distinct" to avoid duplicate data but using "distinct" degrades the performance of the query.
Regards
I am joining 5 tables(using primary key columns of the 5 tables) to get the source data. What are the options you would use within this stage to ensure that this is correctly done i.e (without any duplicate data or cross product). I am presently using
"distinct" to avoid duplicate data but using "distinct" degrades the performance of the query.
Regards
Re: Duplicate records in Source
u can use duplicate stage which will removes dulpicate records instaed of distinct and lookup stage which will have one primary input and mutiple reference input
and also check for null condition if any column has .
Regards
Sanjay
and also check for null condition if any column has .
Regards
Sanjay
Sreenivasulu wrote:Hi All,
I am joining 5 tables(using primary key columns of the 5 tables) to get the source data. What are the options you would use within this stage to ensure that this is correctly done i.e (without any duplicate data or cross product). I am presently using
"distinct" to avoid duplicate data but using "distinct" degrades the performance of the query.
Regards
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
What stage type are you using? Which database?
Are the columns participating in the joins indexed? (If so, then DISTINCT should work very well.)
PS If this really is a question about PX (parallel jobs) can you please post it on the Parallel Forum?
Are the columns participating in the joins indexed? (If so, then DISTINCT should work very well.)
PS If this really is a question about PX (parallel jobs) can you please post it on the Parallel Forum?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 892
- Joined: Thu Oct 16, 2003 5:18 am
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
It is not a general query regardless of engine, because the solution is different depending on whether you're using a server job or a parallel job.
As Sanjay points out (in highly annoying text messaging abbreviations - what does your DataStage documentation look like, Sanjay?!!), there is a Remove Duplicates stage available on the parallel canvas. In server jobs you need a different technique, possibly using stage variables, possibly using a hashed file.
As Sanjay points out (in highly annoying text messaging abbreviations - what does your DataStage documentation look like, Sanjay?!!), there is a Remove Duplicates stage available on the parallel canvas. In server jobs you need a different technique, possibly using stage variables, possibly using a hashed file.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 892
- Joined: Thu Oct 16, 2003 5:18 am
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Your original post specified parallel job. It is this that confused us.
If you're doing inner joins between the five tables based on the primary keys, you should not get duplicates (because primary key columns should have a uniqueness, or at least joint-uniqueness, property).
Can you explain exactly what you want to achieve? Indexing any foreign key columns supporting the joins will improve the performance of the query that is extracting the data.
If you're doing inner joins between the five tables based on the primary keys, you should not get duplicates (because primary key columns should have a uniqueness, or at least joint-uniqueness, property).
Can you explain exactly what you want to achieve? Indexing any foreign key columns supporting the joins will improve the performance of the query that is extracting the data.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 892
- Joined: Thu Oct 16, 2003 5:18 am
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
The idea of this Forum is to be a sharing place. If your hunch proves to be correct, please post the details of your solution so that others may benefit. I'm sure you'll post again if your hunch turns out not to be correct! ![Laughing :lol:](./images/smilies/icon_lol.gif)
![Laughing :lol:](./images/smilies/icon_lol.gif)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 892
- Joined: Thu Oct 16, 2003 5:18 am