
Eliminating Duplicates in an Excel file

Posted: Mon May 09, 2005 7:24 pm
by Bram
Hi,

I have an Excel file which contains duplicates. My target is also an Excel file, with no duplicates in it. How can I do that? Basically, I am trying to remove the duplicates from the source (Excel file) and load the result into the target (Excel file).

Also when i click on target Sequential_File_8, view data, its giving me the following error:

Test..Sequential_File_8.DSLink9: DSD.SEQOpen Failed to open DSLink9
STATUS = -1.

Any help is greatly appreciated,
Bram

Posted: Mon May 09, 2005 7:57 pm
by chulett
DSLink9? Sequential_File_8? You need to start renaming your default link and stage names to something more meaningful. :wink:

The name of the sequential file to write to (or read from) defaults to the link name. Right now it seems like yours wants to read from a file called DSLink9 and the error means it doesn't exist. If you haven't run your job at least once then the target file you are planning on creating does not yet exist.

Explain what you mean by 'excel file'. Do you mean literally a spreadsheet with a .xls extension that you pushed onto your UNIX server somehow? Or a .csv file created from Excel? And what are you trying to create? Yet another spreadsheet or a csv that can be automatically opened and processed by Excel? Not enough details to provide much concrete help quite yet.

And there's nothing magical about removing duplicates from this file over any other. You'd need to either sort it and check for group changes based on your sort keys or write to a hash and let the natural destructive overwrite a hash brings to the table do the work for you. There are other ways, I'm sure - those are the top two.
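For the sort route, a minimal command-line sketch (assuming a comma-delimited file with the key in the first column; the file names are only placeholders):

sort -t',' -k1,1 -u source.csv > deduped.csv

With -u restricted to the key field, sort keeps only the first row of each key group, which is the 'check for group changes' idea in one command. The hashed file route differs in that destructive overwrite keeps the last row per key instead.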

Posted: Tue May 10, 2005 3:08 am
by Sunshine2323
Hi,

There are many ways duplicates can be removed:

1) You can pass the rows through an Aggregator and group on the keys

2) You can read and write the same hash file and check the constraint IsNull(Key) in the Transformer (a command-line analogue of this idea is sketched after this list)

                    Hash File (reference lookup)
                           |
Input File ------> Transformer ------> Output File
                           |
                    Hash File (write back)

3) Sort on the key and then use the transform RowProcCompareWithPreviousValue to compare with the previous key value and decide whether the row is a duplicate or not

Hope this helps.
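As a command-line analogue of the hash file approach in 2) (not the DataStage job itself, just the same first-row-per-key-wins logic), assuming a comma-delimited file with the key in the first column:

awk -F',' '!seen[$1]++' InputFile > OutputFile

The associative array seen plays the part of the hash file: the first row for each key passes through and is remembered, and every later row with the same key is dropped.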

Posted: Tue May 10, 2005 3:11 am
by Sainath.Srinivasan
As you are using DS on Unix, how do you intend to bring your Excel file across? If you bring it as a CSV, you can use sort -u (with other options as necessary) to make a unique list.
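At its simplest (the file names here are only placeholders):

sort -u source.csv > target.csv

This drops only rows that are duplicated in their entirety.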

Posted: Tue May 10, 2005 5:35 am
by Prashantoncyber
I think the approach mentioned by Sainath.Srinivasan is best from a performance point of view.

Posted: Tue May 10, 2005 5:56 am
by Sunshine2323
Nice way of earning points without any strong inputs. :P

Posted: Tue May 10, 2005 8:41 am
by Prashantoncyber
I leave this matter to the senior participants of this group for comments..... :shock:

Posted: Tue May 10, 2005 1:11 pm
by Bram
Hi Sainath,

I am trying to remove duplicates using the sort -u option, but I do not have any idea how to do that at the Unix level.

Thanks,
Ram

Posted: Tue May 10, 2005 1:17 pm
by chulett
In the Filter command of the sourcing Sequential file stage. Note that doing it this way will only remove 'duplicates' where the entire row is duplicated. If that's the case here, then that is pretty simple to set up.

Posted: Tue May 10, 2005 4:44 pm
by Bram
Hi Chulett,

Can you please elaborate on that a bit more?

Thanks,
Ram

Posted: Wed May 11, 2005 1:14 am
by Sunshine2323
Hi Ram,

If you use sort -u, then only those records in which the values for all the columns are duplicated will be considered duplicates. You can check the details of this command by doing man sort.

For example, if your input file looks like this:

1,2,3,4
1,1,1,1
4,4,4,4
1,1,1,1

then

sort -u InputFile

will give:

1,1,1,1
1,2,3,4
4,4,4,4

(sort also reorders the rows, so the output comes back sorted)

If this is your requirement, then you can use this command in the Filter Command box on the Sequential File stage.

You can enable the Filter Command box by checking the Stage uses Filter Commands check box on the General tab of the Sequential File stage.
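The stage pipes the data through whatever command you put in that box, so the filter command itself can be as simple as (a sketch, assuming whole-row duplicates are what you want to drop):

sort -u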

Posted: Wed May 11, 2005 3:37 am
by Sainath.Srinivasan
You can also use uniq to remove duplicates based on specified cols.
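Bear in mind that uniq only compares adjacent lines and can only skip leading fields (-f) or leading characters (-s) when comparing, so the data has to be sorted first and the comparison key ends up being the trailing columns. A sketch, assuming whitespace-delimited data where everything after the first field is the key:

sort -k2 InputFile | uniq -f 1 > OutputFile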

Posted: Tue Jul 19, 2005 11:33 pm
by alraaayeq
Prashantoncyber wrote: I think the approach mentioned by Sainath.Srinivasan is best from a performance point of view.

Did you mention performance! <--- this is my baby

Well, well, we have more than one option here, but which one gives the highest performance? <-- that is what I am looking for.

I believe such performance questions need trial and error according to the record counts and file sizes, so we could share the knowledge here.

I have a similar job (big daddy job 8) ) that costs us 22 hours of CPU processing/utilization per day (22 minutes per one million records); then a friend of mine used some magic and it came down to 7 hours. This topic was concerning us a bit; 4 hours would be the target record for us.

Can anybody help me with this?
Note: we are using hash file stages to get rid of duplicated records, and these records should be logged somewhere. The duplication ratio is 2% of the XYZ million records.

:?:

Posted: Tue Jul 19, 2005 11:59 pm
by vmcburney
The transformer stage method with stage variables is the fastest method if you can get your input data sorted quickly. It will also let you output discarded duplicates quite easily. See the remove duplicates FAQ.
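Outside the Transformer, the same compare-with-previous idea (including logging the rejected duplicates, which is what alraaayeq needs) can be sketched at the command line; the delimiter, key column and file names below are only assumptions:

sort -t',' -k1,1 input.csv |
awk -F',' '$1 == prev { print > "duplicates.csv"; next }
           { print > "unique.csv"; prev = $1 }'

Inside the job, the equivalent is a stage variable that remembers the previous key plus a constraint that sends matching rows down a reject link and everything else to the output.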