Duplicate records in Source

Sreenivasulu · Post by **Sreenivasulu** » Tue Jun 08, 2004 12:57 am

Hi All,

I am joining 5 tables (using the primary key columns of the 5 tables) to get the source data. What options would you use within this stage to ensure that this is done correctly, i.e. without any duplicate data or cross product? I am presently using "distinct" to avoid duplicate data, but using "distinct" degrades the performance of the query.

Regards

sanjay · Post by **sanjay** » Tue Jun 08, 2004 3:09 am

u can use duplicate stage, which will remove duplicate records, instead of distinct, and lookup stage, which will have one primary input and multiple reference inputs.
Also check for a null condition if any column has nulls.

Regards
Sanjay

ray.wurlod · Post by **ray.wurlod** » Tue Jun 08, 2004 5:58 pm

What stage type are you using? Which database?
Are the columns participating in the joins indexed? (If so, then DISTINCT should work very well.)

PS If this really is a question about PX (parallel jobs) can you please post it on the Parallel Forum?

Sreenivasulu · Post by **Sreenivasulu** » Tue Jun 08, 2004 11:57 pm

Ray,

This is a general query regardless of the type of job (server or parallel).
Database used is Oracle 9i.

Sanjay: I do not find a Duplicate Stage in DataStage 7.

Regards

sanjay · Post by **sanjay** » Wed Jun 09, 2004 12:20 am

In a parallel job u have the Remove Duplicates stage.

Sanjay

ray.wurlod · Post by **ray.wurlod** » Wed Jun 09, 2004 12:50 am

It is not a general query regardless of engine, because the solution is different depending on whether you're using a server job or a parallel job.

As Sanjay points out (in highly annoying text messaging abbreviations - what does your DataStage documentation look like, Sanjay?!!), there is a Remove Duplicates stage available on the parallel canvas. In server jobs you need a different technique, possibly using stage variables, possibly using a hashed file.
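As a rough illustration of the two server-job ideas mentioned above, here is a minimal sketch in Python (not DataStage BASIC, and not an actual job design). The function names `dedupe_sorted` and `dedupe_keyed` are hypothetical. The first mimics a stage variable that remembers the previous key on input sorted by that key; the second mimics writing rows to a keyed store such as a hashed file, on the assumption that a later row with the same key overwrites the earlier one.

```python
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple

Row = Dict[str, object]


def dedupe_sorted(rows: Iterable[Row], key: Sequence[str]) -> Iterator[Row]:
    """Stage-variable style: the input must already be sorted by `key`.

    A row is passed on only when its key differs from the previous row's key,
    so the first row of each key group survives.
    """
    previous: object = object()  # sentinel that never equals a real key
    for row in rows:
        current: Tuple[object, ...] = tuple(row[col] for col in key)
        if current != previous:
            yield row
        previous = current


def dedupe_keyed(rows: Iterable[Row], key: Sequence[str]) -> List[Row]:
    """Keyed-store style: each row is written under its key.

    Assumption: a later write with the same key replaces the earlier one,
    so the last row of each key survives. Input order does not matter.
    """
    store: Dict[Tuple[object, ...], Row] = {}
    for row in rows:
        store[tuple(row[col] for col in key)] = row
    return list(store.values())


sample = [
    {"id": 1, "v": "a"},
    {"id": 1, "v": "b"},
    {"id": 2, "v": "c"},
]
first_wins = list(dedupe_sorted(sample, ["id"]))
last_wins = dedupe_keyed(sample, ["id"])
```

The two approaches differ in which duplicate survives (first versus last) and in whether the input has to be sorted, which is worth deciding before picking one.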

Sreenivasulu · Post by **Sreenivasulu** » Wed Jun 09, 2004 1:10 am

It's a server job.
We have to remove duplicates from the source query. Is there a way for the source stage to avoid processing duplicate records (by using keys in the source stage) instead of using a hashed file with stage variables?

Regards

ray.wurlod · Post by **ray.wurlod** » Wed Jun 09, 2004 1:59 am

Your original post specified parallel job. It is this that confused us.

If you're doing inner joins between the five tables based on the primary keys, you should not get duplicates (because primary key columns should have a uniqueness, or at least joint-uniqueness, property).
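The point about uniqueness can be checked on a toy schema. The sketch below uses Python's built-in `sqlite3` module rather than Oracle, and the tables `orders`, `shipment` and `order_item` are invented for illustration. It shows that a join on a complete primary key returns one row per order, while a join that covers only part of a composite key fans out and produces the repeated rows that DISTINCT would otherwise be hiding.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE shipment (order_id INTEGER PRIMARY KEY, carrier TEXT);
    -- Composite key: order_id alone is NOT unique in this table.
    CREATE TABLE order_item (
        order_id INTEGER,
        line_no  INTEGER,
        sku      TEXT,
        PRIMARY KEY (order_id, line_no)
    );
    INSERT INTO orders VALUES (1, 'acme'), (2, 'globex');
    INSERT INTO shipment VALUES (1, 'air'), (2, 'sea');
    INSERT INTO order_item VALUES (1, 1, 'A'), (1, 2, 'B'), (2, 1, 'C');
    """
)

# Join on a full primary key on both sides: at most one match per order.
one_to_one = conn.execute(
    """
    SELECT o.order_id, o.customer
    FROM orders o
    JOIN shipment s ON s.order_id = o.order_id
    ORDER BY o.order_id
    """
).fetchall()

# Join on only part of order_item's composite key: order 1 matches twice,
# so the selected order columns come back duplicated.
fan_out = conn.execute(
    """
    SELECT o.order_id, o.customer
    FROM orders o
    JOIN order_item i ON i.order_id = o.order_id
    ORDER BY o.order_id
    """
).fetchall()

conn.close()
```

If duplicates appear in a real five-table join, comparing row counts per join in this way is one possible route to finding which join condition does not cover a full key.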

Can you explain exactly what you want to achieve? Indexing any foreign key columns supporting the joins will improve the performance of the query that is extracting the data.

Sreenivasulu · Post by **Sreenivasulu** » Wed Jun 09, 2004 5:28 am

It's fine. I have a hunch about how to solve this problem.
Thanks a lot for the help.

Regards

ray.wurlod · Post by **ray.wurlod** » Wed Jun 09, 2004 4:59 pm

The idea of this Forum is to be a sharing place. If your hunch proves to be correct, please post the details of your solution so that others may benefit. I'm sure you'll post again if your hunch turns out not to be correct!

Sreenivasulu · Post by **Sreenivasulu** » Wed Jun 09, 2004 11:08 pm

Sure Ray!!!

I will post the solution as soon as I solve this.

Regards
