Track duplicate rows in the source data
Moderators: chulett, rschirm, roy
The answers you get depend upon whether you are working in Server or in Parallel jobs. Even better answer will come when you explain what type of source data you have (sequential files don't have keys, and sometimes DB level filtering is more efficient than doing it in DataStage)
<a href=http://www.worldcommunitygrid.org/team/ ... TZ9H4CGVP1 target="WCGWin">
</a>
</a>
If you have an oracle source or any DB table then you can filter within the select statement.
To find if there are any duplicate values in TABLE_B alone (if there is a
primary key then there will be no duplicates) type the following:
To find if there are any duplicate values in TABLE_B alone (if there is a
primary key then there will be no duplicates) type the following:
Code: Select all
SELECT COUNT(*), COLUMN_A, COLUMN_B
FROM TABLE_B
GROUP BY COLUMN_A, COLUMN_B
HAVING COUNT(*)>1;
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
-
- Participant
- Posts: 5
- Joined: Mon Aug 14, 2006 2:48 pm
Thanks for clearing my doubt .what if the case was sequential file. Is there any way to give points for the correct n helpful answers in these forums as i am new here.DSguru2B wrote:If you have an oracle source or any DB table then you can filter within the select statement.
To find if there are any duplicate values in TABLE_B alone (if there is a
primary key then there will be no duplicates) type the following:
Code: Select all
SELECT COUNT(*), COLUMN_A, COLUMN_B FROM TABLE_B GROUP BY COLUMN_A, COLUMN_B HAVING COUNT(*)>1;
There are many ways done in Server to resolve the Duplicacy issue. If you give a search you should be able to find routines written by the posters. Click HERE
In parallel there is a remove duplicate stage.
Good luck
In parallel there is a remove duplicate stage.
Good luck
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.