Track duplicate rows in the source data

ArndW · Post by **ArndW** » Fri Aug 25, 2006 6:37 am

The answers you get depend upon whether you are working in Server or in Parallel jobs. Even better answer will come when you explain what type of source data you have (sequential files don't have keys, and sometimes DB level filtering is more efficient than doing it in DataStage)

DSguru2B · Post by **DSguru2B** » Fri Aug 25, 2006 7:28 am

If you have an oracle source or any DB table then you can filter within the select statement.

To find if there are any duplicate values in TABLE_B alone (if there is a
primary key then there will be no duplicates) type the following:

Code: Select all

SELECT COUNT(*), COLUMN_A, COLUMN_B
FROM TABLE_B
GROUP BY COLUMN_A, COLUMN_B
HAVING COUNT(*)>1;

datastage1 · Post by **datastage1** » Fri Aug 25, 2006 7:50 am

DSguru2B wrote:If you have an oracle source or any DB table then you can filter within the select statement.

To find if there are any duplicate values in TABLE_B alone (if there is a
primary key then there will be no duplicates) type the following:
Code: Select all
SELECT COUNT(*), COLUMN_A, COLUMN_B
FROM TABLE_B
GROUP BY COLUMN_A, COLUMN_B
HAVING COUNT(*)>1;

Thanks for clearing my doubt .what if the case was sequential file. Is there any way to give points for the correct n helpful answers in these forums as i am new here.

DSguru2B · Post by **DSguru2B** » Fri Aug 25, 2006 7:54 am

There are many ways done in Server to resolve the Duplicacy issue. If you give a search you should be able to find routines written by the posters. Click HERE

In parallel there is a remove duplicate stage.

Good luck

chulett · Post by **chulett** » Fri Aug 25, 2006 7:54 am

datastage1 wrote:Is there any way to give points for the correct n helpful answers in these forums as i am new here.

That's coming... currently being tested by a small group of people.