Page 1 of 1

Duplicate records in Source

Posted: Tue Jun 08, 2004 12:57 am
by Sreenivasulu
Hi All,

I am joining 5 tables(using primary key columns of the 5 tables) to get the source data. What are the options you would use within this stage to ensure that this is correctly done i.e (without any duplicate data or cross product). I am presently using
"distinct" to avoid duplicate data but using "distinct" degrades the performance of the query.

Regards

Re: Duplicate records in Source

Posted: Tue Jun 08, 2004 3:09 am
by sanjay
u can use duplicate stage which will removes dulpicate records instaed of distinct and lookup stage which will have one primary input and mutiple reference input
and also check for null condition if any column has .

Regards
Sanjay


Sreenivasulu wrote:Hi All,

I am joining 5 tables(using primary key columns of the 5 tables) to get the source data. What are the options you would use within this stage to ensure that this is correctly done i.e (without any duplicate data or cross product). I am presently using
"distinct" to avoid duplicate data but using "distinct" degrades the performance of the query.

Regards

Posted: Tue Jun 08, 2004 5:58 pm
by ray.wurlod
What stage type are you using? Which database?
Are the columns participating in the joins indexed? (If so, then DISTINCT should work very well.)

PS If this really is a question about PX (parallel jobs) can you please post it on the Parallel Forum?

Posted: Tue Jun 08, 2004 11:57 pm
by Sreenivasulu
Ray,

This is a general query regardless of type of jobs(server or parallel).
Database used is Oracle 9i.

Sanjay: I do not find a Duplicate Stage in Datastage 7

Regards

Posted: Wed Jun 09, 2004 12:20 am
by sanjay
In parallel job
u have remove duplicate stage .

Sanjay

Posted: Wed Jun 09, 2004 12:50 am
by ray.wurlod
It is not a general query regardless of engine, because the solution is different depending on whether you're using a server job or a parallel job.

As Sanjay points out (in highly annoying text messaging abbreviations - what does your DataStage documentation look like, Sanjay?!!), there is a Remove Duplicates stage available on the parallel canvas. In server jobs you need a different technique, possibly using stage variables, possibly using a hashed file.

Posted: Wed Jun 09, 2004 1:10 am
by Sreenivasulu
Its a server job.
We have to remove duplicates from the source query. Is there a way in which the source stage does not process duplicate records(by using keys in the source stage) instead of using a hashed file with stage variables.

Regards

Posted: Wed Jun 09, 2004 1:59 am
by ray.wurlod
Your original post specified parallel job. It is this that confused us.

If you're doing inner joins between the five tables based on the primary keys, you should not get duplicates (because primary key columns should have a uniqueness, or at least joint-uniqueness, property).

Can you explain exactly what you want to achieve? Indexing any foreign key columns supporting the joins will improve the performance of the query that is extracting the data.

Posted: Wed Jun 09, 2004 5:28 am
by Sreenivasulu
Its fine. I got a hunch of how to solve this problem
Thanks a lot for the help

Regards

Posted: Wed Jun 09, 2004 4:59 pm
by ray.wurlod
The idea of this Forum is to be a sharing place. If your hunch proves to be correct, please post the details of your solution so that others may benefit. I'm sure you'll post again if your hunch turns out not to be correct! :lol:

Posted: Wed Jun 09, 2004 11:08 pm
by Sreenivasulu
Sure Ray!!!

I would post the solution as soon i as solve this.

Regards