Capturing full duplicates into a sequential file

Posted: Tue Jun 29, 2010 12:35 pm
by dsx999
I know this question has been asked many times before, but I think my requirement is slightly different.
I have to capture only the FULL duplicates (rows where the data in ALL columns is identical).
I need an optimal solution, as this job has to handle more than 10 million records.

Any suggestions?

Posted: Tue Jun 29, 2010 12:40 pm
by anbu
Sort the data with all the fields as sort keys and set Create Key Change Column to true.

Next, in a Transformer stage, set the constraint to

Code: Select all

KeyChange = 0
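
To see the same logic outside DataStage, here is a minimal Python sketch of what the sorted key-change test captures (illustrative only; the rows and column layout are made up):

Code: Select all

# Minimal sketch (not DataStage) of the sort + KeyChange approach.
# The rows and column layout are hypothetical.
rows = [
    ("A", 1, "x"),
    ("B", 2, "y"),
    ("A", 1, "x"),   # full duplicate
]

rows.sort()  # sort on all columns, like a Sort stage keyed on every field

duplicates = []
previous = None
for row in rows:
    # KeyChange = 0 in the Transformer means "same as the previous row",
    # which is a full duplicate once every column is a sort key.
    if row == previous:
        duplicates.append(row)
    previous = row

print(duplicates)   # [('A', 1, 'x')]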

Posted: Tue Jun 29, 2010 12:51 pm
by dsx999
But performance is my concern. Will the performance be OK on 60-column input data of around 10 million records?
Any better solutions?

Posted: Tue Jun 29, 2010 1:37 pm
by laknar
Code a user-defined query to select only the required records.
This will minimize the processing time.
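
If the source is a database (an assumption; the thread does not say), the user-defined query would group on every column and keep only groups with more than one row. A rough sketch using sqlite3 purely for illustration, with a hypothetical table and column names:

Code: Select all

# Sketch of the user-defined-query idea; sqlite3 is used only for
# illustration. Table name (src) and columns (col1..col3) are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (col1 TEXT, col2 INTEGER, col3 TEXT)")
conn.executemany("INSERT INTO src VALUES (?, ?, ?)",
                 [("A", 1, "x"), ("A", 1, "x"), ("B", 2, "y")])

# Group on every column; HAVING COUNT(*) > 1 keeps only the full
# duplicates, so the database filters the rows before the job sees them.
sql = """
    SELECT col1, col2, col3, COUNT(*) AS dup_count
    FROM src
    GROUP BY col1, col2, col3
    HAVING COUNT(*) > 1
"""
for row in conn.execute(sql):
    print(row)   # ('A', 1, 'x', 2)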

Posted: Tue Jun 29, 2010 2:51 pm
by dsx999
Forget it. I wouldn't have come to DSXchange if that were possible. :wink:

Posted: Tue Jun 29, 2010 2:53 pm
by Sainath.Srinivasan
dsx999 wrote:But performance is my concern. Will the performance be OK on 60-column input data of around 10 million records?
Any better solutions?
You have not mentioned anything about your hardware configuration. If you have huge memory plus more than 100 TB of disk for each node and are running in a 32-node config, you can easily fit them.

On the other hand, if you have 2 MB of RAM and 4 MB of disk space, you will have difficulty even unzipping the file.

Try the sort option mentioned and see what happens.

Posted: Tue Jun 29, 2010 4:14 pm
by Sreenivasulu
You need to find an optimal solution that suits the project. Doing it in DataStage when that is not the optimal solution does not serve anyone's purpose.

Regards
Sreeni

Posted: Tue Jun 29, 2010 4:41 pm
by ray.wurlod
Change Capture stage with "All keys and All columns" should detect absolute duplicates happily.

Posted: Tue Jun 29, 2010 4:54 pm
by laknar
Ray, a single source cannot be used with the Change Capture stage.
Either a query, a Sort, or a Remove Duplicates stage can be used for this solution.

Posted: Tue Jun 29, 2010 5:09 pm
by bollinenik
It would be great if you could give more details.
What are you going to do with the duplicate records? Are you going to delete them, or do something else with them?

And what else is the job doing at the same time? Are you just trying to find the duplicates in that data, or is there some other process involved?

If you can give that info, you will get more optimal solutions for your requirement.

Posted: Tue Jun 29, 2010 8:21 pm
by dsx999
[quote="Sainath.Srinivasan

You have not mentioned anything about your hardware configuration. If you are one who has huge memory plus more than 100 TB disk for each node and running in 32 node config, you can easily fit them.

On the other hand, if you have 2Mb RAM and 4Mb disk space, you will have difficulty even in unzipping the file.

Try the sort option mentioned and see what happens.[/quote]

Hmm. Is it really the right approach to design your jobs "MAINLY" around your hardware configuration? OK, what would you suggest in each of the above cases?
And why should any company hire experienced consultants? Instead, they could simply make a one-time investment in 100 TB of disk, blah..blah..blah. In that case, any imbecile could do the work, right?

Hang on... I am not challenging your skills... or your answer... but this is my opinion, and maybe it's time to change or get more support.

Posted: Tue Jun 29, 2010 8:26 pm
by dsx999
bollinenik wrote:It would be great if you could give more details.
What are you going to do with the duplicate records? Are you going to delete them, or do something else with them?

And what else is the job doing at the same time? Are you just trying to find the duplicates in that data, or is there some other process involved?

If you can give that info, you will get more optimal solutions for your requirement.
Sorry, gentleman, I didn't really get your concern. I may be missing something here. Can you please elaborate? I am wondering how it would affect the job design.

Posted: Wed Jun 30, 2010 9:54 am
by priyadarshikunal
With more processing power you can go for the Sort stage; 10 million records are nothing for a good server (enough processing power/RAM, configured for optimal performance). Hopefully you are not running it on your laptop :wink: .

Even if you think about using stage variables, the data still needs to be sorted first, as in the sketch below. So go ahead and try the approach suggested above.
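
For reference, the stage-variable version of the same test looks roughly like this (a sketch only: three illustrative columns on a hypothetical input link stand in for all 60, the delimiter must not occur in the data, and the variables must be declared in this order so that svPrevRow still holds the prior row when svIsDup is evaluated):

Code: Select all

svCurrRow = InLink.Col1 : '|' : InLink.Col2 : '|' : InLink.Col3
svIsDup   = If svCurrRow = svPrevRow Then 1 Else 0
svPrevRow = svCurrRow

with the output constraint svIsDup = 1, on input that has already been sorted on all columns.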