Capturing full duplicates into a sequential file

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

dsx999
Participant
Posts: 29
Joined: Mon Aug 11, 2008 3:40 am

Capturing full duplicates into a sequential file

Post by dsx999 »

I know this question has been asked many times before, but I think my requirement is slightly different.
I have to capture only the FULL duplicates, i.e. only when the data in ALL the columns is identical.
I need an optimal solution, as this job has to handle more than 10 million records.

Any suggestions?
anbu
Premium Member
Posts: 596
Joined: Sat Feb 18, 2006 2:25 am
Location: india

Post by anbu »

Sort the data with all the fields as keys and set Create Key Change Column to true.

Then, in a Transformer, set the constraint to:

Code: Select all

KeyChange = 0
You are the creator of your destiny - Swami Vivekananda
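
For anyone reading later, here is a rough Python sketch of the same logic, not DataStage code: sort on every column, then keep each occurrence after the first within a group of identical rows, which is what the KeyChange = 0 constraint passes through. The function name and in-memory row list are only for illustration.

Code: Select all

# Illustration of the sort / Key Change idea, not DataStage code.
# Rows identical in every column sort next to each other; everything after
# the first row of such a group is a full duplicate (KeyChange = 0).
from itertools import groupby, islice

def capture_full_duplicates(rows):
    duplicates = []
    for _, group in groupby(sorted(rows)):        # sort on all columns
        duplicates.extend(islice(group, 1, None)) # keep every copy after the first
    return duplicates

# Example: only the second ("a", 1, "x") is captured.
print(capture_full_duplicates([("a", 1, "x"), ("b", 2, "y"), ("a", 1, "x")]))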
dsx999
Participant
Posts: 29
Joined: Mon Aug 11, 2008 3:40 am

Post by dsx999 »

But performance is my concern. Will performance be OK with 60-column input data of around 10 million records?
Are there any better solutions?
laknar
Participant
Posts: 162
Joined: Thu Apr 26, 2007 5:59 am
Location: Chennai

Post by laknar »

Code a user-defined query that returns only the required records.
This will minimize the processing time.
Regards
LakNar
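
laknar does not post the query itself; assuming the source were a database table (hypothetically named src with columns col1, col2, col3), one common way to return only the fully duplicated records is to group on every column and keep the groups with more than one row. A minimal sqlite3 sketch, purely for illustration:

Code: Select all

# Hypothetical table and columns, purely to illustrate the "user-defined query"
# idea: group on every column and return only groups that occur more than once.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src (col1 TEXT, col2 TEXT, col3 TEXT)")
con.executemany("INSERT INTO src VALUES (?, ?, ?)",
                [("a", "1", "x"), ("a", "1", "x"), ("b", "2", "y")])

dups = con.execute("""
    SELECT col1, col2, col3, COUNT(*) AS copies
    FROM src
    GROUP BY col1, col2, col3
    HAVING COUNT(*) > 1
""").fetchall()
print(dups)   # [('a', '1', 'x', 2)]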
dsx999
Participant
Posts: 29
Joined: Mon Aug 11, 2008 3:40 am

Post by dsx999 »

Forget it. I wouldn't have come to DSXchange if that were possible. :wink:
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

dsx999 wrote:But performance is my concern. Will performance be OK with 60-column input data of around 10 million records?
Are there any better solutions?
You have not mentioned anything about your hardware configuration. If you have huge memory plus more than 100 TB of disk for each node and are running a 32-node configuration, you can easily fit them.

On the other hand, if you have 2 MB of RAM and 4 MB of disk space, you will have difficulty even unzipping the file.

Try the sort option mentioned and see what happens.
Sreenivasulu
Premium Member
Posts: 892
Joined: Thu Oct 16, 2003 5:18 am

Post by Sreenivasulu »

You need to find an optimal solution that is suitable for the project. Doing it in DataStage when that is not the optimal solution does not serve anyone's purpose.

Regards
Sreeni
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Change Capture stage with "All keys and All columns" should detect absolute duplicates happily.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
laknar
Participant
Posts: 162
Joined: Thu Apr 26, 2007 5:59 am
Location: Chennai

Post by laknar »

Ray, a single source cannot be used with the Change Capture stage.
Either a query, a Sort stage, or a Remove Duplicates stage can be used for this solution.
Regards
LakNar
bollinenik
Participant
Posts: 111
Joined: Thu Jun 01, 2006 5:12 am
Location: Detroit

Post by bollinenik »

It would be great if you could give more details.
What are you going to do with the duplicate records: are you going to delete them, or do something else with them?

And what else is happening at the same time: are you only trying to find the duplicates in that data, or is there some other process involved?

If you can give that info, you will get more optimal solutions for your requirement.
dsx999
Participant
Posts: 29
Joined: Mon Aug 11, 2008 3:40 am

Post by dsx999 »

[quote="Sainath.Srinivasan

You have not mentioned anything about your hardware configuration. If you are one who has huge memory plus more than 100 TB disk for each node and running in 32 node config, you can easily fit them.

On the other hand, if you have 2Mb RAM and 4Mb disk space, you will have difficulty even in unzipping the file.

Try the sort option mentioned and see what happens.[/quote]

Hmm. Is it really the right approach to design your jobs "MAINLY" around your hardware configuration? OK, what would you suggest in each of the above cases?
And why should any company hire experienced consultants? Instead they could simply make a one-time investment in setting up 100 TB of disk, blah..blah..blah. In that case, any imbecile could do the work, right?

Hang on... I am not challenging your skills... or your answer... but this is my opinion, and maybe it is time to change or get more support.
dsx999
Participant
Posts: 29
Joined: Mon Aug 11, 2008 3:40 am

Post by dsx999 »

bollinenik wrote:It would be great if you could give more details.
What are you going to do with the duplicate records: are you going to delete them, or do something else with them?

And what else is happening at the same time: are you only trying to find the duplicates in that data, or is there some other process involved?

If you can give that info, you will get more optimal solutions for your requirement.
Sorry gentleman, I didn't really get your concern. I may be missing something here. Can you please elaborate? I am wondering how it would affect the job design.
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

With more processing power you can go for the Sort stage; 10 million records are nothing for a good server (enough processing power/RAM, configured for optimal performance). Hopefully you are not running it on your laptop :wink:.

Even if you think about using stage variables, the data still needs to be sorted first, so go ahead and try the approach suggested above (see the sketch after this post).
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
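
A rough Python sketch of what the stage-variable idea boils down to on already-sorted input, with a hypothetical helper name and sample rows, is below: hold the previous record, and flag the current record as a full duplicate when every column matches. This is an illustration of the logic, not DataStage syntax.

Code: Select all

# Illustration of the stage-variable approach on sorted input: remember the
# previous record and flag the current one when all columns are identical.
def flag_full_duplicates(sorted_rows):
    prev = None                        # plays the role of the stage variable
    for row in sorted_rows:
        yield row, row == prev         # True => full duplicate of the previous row
        prev = row

# Example: the second ("a", 1) is flagged; the others are not.
for row, is_dup in flag_full_duplicates([("a", 1), ("a", 1), ("b", 2)]):
    print(row, is_dup)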