Capturing full duplicates into a sequential file

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

dsx999
Participant
Posts: 29
Joined: Mon Aug 11, 2008 3:40 am

Capturing full duplicates into a sequential file

Post by dsx999 »

I know this question has been asked many times before, but I think my requirement is slightly different.
I have to capture only the FULL duplicates, i.e. only when the data in ALL the columns is identical.
I need an optimal solution, as this job has to handle more than 10 million records.

Any suggestions?
anbu
Premium Member
Posts: 596
Joined: Sat Feb 18, 2006 2:25 am
Location: india

Post by anbu »

Sort the data with all the fields as keys and set Create Key Change Column to true.

Then, in a Transformer, set the constraint to:

Code: Select all

KeyChange = 0
You are the creator of your destiny - Swami Vivekananda
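
For anyone reading later, here is a rough Python sketch of the same logic, not DataStage code: sort on every column, then keep each occurrence after the first within a group of identical rows, which is what the KeyChange = 0 constraint passes through. The function name and in-memory row list are only for illustration.

Code: Select all

# Illustration of the sort / Key Change idea, not DataStage code.
# Rows identical in every column sort next to each other; everything after
# the first row of such a group is a full duplicate (KeyChange = 0).
from itertools import groupby, islice

def capture_full_duplicates(rows):
    duplicates = []
    for _, group in groupby(sorted(rows)):        # sort on all columns
        duplicates.extend(islice(group, 1, None)) # keep every copy after the first
    return duplicates

# Example: only the second ("a", 1, "x") is captured.
print(capture_full_duplicates([("a", 1, "x"), ("b", 2, "y"), ("a", 1, "x")]))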
dsx999
Participant
Posts: 29
Joined: Mon Aug 11, 2008 3:40 am

Post by dsx999 »

But performance is my concern. Will performance be OK with 60-column input data of around 10 million records?
Are there any better solutions?
laknar
Participant
Posts: 162
Joined: Thu Apr 26, 2007 5:59 am
Location: Chennai

Post by laknar »

Code a user-defined query that returns only the required records.
This will minimize the processing time.
Regards
LakNar
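
laknar does not post the query itself; assuming the source were a database table (hypothetically named src with columns col1, col2, col3), one common way to return only the fully duplicated records is to group on every column and keep the groups with more than one row. A minimal sqlite3 sketch, purely for illustration:

Code: Select all

# Hypothetical table and columns, purely to illustrate the "user-defined query"
# idea: group on every column and return only groups that occur more than once.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src (col1 TEXT, col2 TEXT, col3 TEXT)")
con.executemany("INSERT INTO src VALUES (?, ?, ?)",
                [("a", "1", "x"), ("a", "1", "x"), ("b", "2", "y")])

dups = con.execute("""
    SELECT col1, col2, col3, COUNT(*) AS copies
    FROM src
    GROUP BY col1, col2, col3
    HAVING COUNT(*) > 1
""").fetchall()
print(dups)   # [('a', '1', 'x', 2)]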
dsx999
Participant
Posts: 29
Joined: Mon Aug 11, 2008 3:40 am

Post by dsx999 »

Forget it. I wouldn't have come to DSXchange if that were possible. :wink:
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

dsx999 wrote:But performance is my concern. Will performance be OK with 60-column input data of around 10 million records?
Are there any better solutions?
You have not mentioned anything about your hardware configuration. If you have huge memory plus more than 100 TB of disk for each node and are running a 32-node configuration, you can easily fit them.

On the other hand, if you have 2 MB of RAM and 4 MB of disk space, you will have difficulty even unzipping the file.

Try the sort option mentioned and see what happens.
Sreenivasulu
Premium Member
Posts: 892
Joined: Thu Oct 16, 2003 5:18 am

Post by Sreenivasulu »

You need to find an optimal solution that is suitable for the project. Doing it in DataStage when that is not the optimal solution does not serve anyone's purpose.

Regards
Sreeni
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Change Capture stage with "All keys and All columns" should detect absolute duplicates happily.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
laknar
Participant
Posts: 162
Joined: Thu Apr 26, 2007 5:59 am
Location: Chennai

Post by laknar »

Ray, a single source cannot be used with the Change Capture stage.
Either a query, a Sort stage, or a Remove Duplicates stage can be used for this solution.
Regards
LakNar
bollinenik
Participant
Posts: 111
Joined: Thu Jun 01, 2006 5:12 am
Location: Detroit

Post by bollinenik »

It would be great if you could give more details.
What are you going to do with the duplicate records: are you going to delete them, or do something else with them?

And what else is happening at the same time: are you only trying to find the duplicates in that data, or is there some other process involved?

If you can give that info, you will get more optimal solutions for your requirement.
dsx999
Participant
Posts: 29
Joined: Mon Aug 11, 2008 3:40 am

Post by dsx999 »

[quote="Sainath.Srinivasan

You have not mentioned anything about your hardware configuration. If you are one who has huge memory plus more than 100 TB disk for each node and running in 32 node config, you can easily fit them.

On the other hand, if you have 2Mb RAM and 4Mb disk space, you will have difficulty even in unzipping the file.

Try the sort option mentioned and see what happens.[/quote]

Hmm. Is it really the right approach to design your jobs "MAINLY" around your hardware configuration? OK, what would you suggest in each of the above cases?
And why should any company hire experienced consultants? Instead they could simply make a one-time investment in setting up 100 TB of disk, blah..blah..blah. In that case, any imbecile could do the work, right?

Hang on... I am not challenging your skills... or your answer... but this is my opinion, and maybe it is time to change or get more support.
dsx999
Participant
Posts: 29
Joined: Mon Aug 11, 2008 3:40 am

Post by dsx999 »

bollinenik wrote:It would be great if you could give more details.
What are you going to do with the duplicate records: are you going to delete them, or do something else with them?

And what else is happening at the same time: are you only trying to find the duplicates in that data, or is there some other process involved?

If you can give that info, you will get more optimal solutions for your requirement.
Sorry gentleman, I didn't really get your concern. I may be missing something here. Can you please elaborate? I am wondering how it would affect the job design.
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

With more processing power you can go for the Sort stage; 10 million records are nothing for a good server (enough processing power/RAM, configured for optimal performance). Hopefully you are not running it on your laptop :wink:.

Even if you think about using stage variables, the data still needs to be sorted first, so go ahead and try the approach suggested above (see the sketch after this post).
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
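
A rough Python sketch of what the stage-variable idea boils down to on already-sorted input, with a hypothetical helper name and sample rows, is below: hold the previous record, and flag the current record as a full duplicate when every column matches. This is an illustration of the logic, not DataStage syntax.

Code: Select all

# Illustration of the stage-variable approach on sorted input: remember the
# previous record and flag the current one when all columns are identical.
def flag_full_duplicates(sorted_rows):
    prev = None                        # plays the role of the stage variable
    for row in sorted_rows:
        yield row, row == prev         # True => full duplicate of the previous row
        prev = row

# Example: the second ("a", 1) is flagged; the others are not.
for row, is_dup in flag_full_duplicates([("a", 1), ("a", 1), ("b", 2)]):
    print(row, is_dup)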