Page 1 of 1

Uniqueness of data in a table

Posted: Wed Mar 02, 2005 11:54 am
by yaminids
Hello friends,

I am trying to load unique data into a table. The source file contains duplicate data. The formula for uniqueness is the combination of 3 columns in the source file.
My intension is to load the data from source file into two tables A and B in the target.

I am trying to apply the following rules:

1) All the rows from the source go into table A, but only unique rows must go into table B.
2) For every unique row in table B, I have to generate a sequence number which I have to enter in Table A.

Can anyone help me with this.
Thanks a lot in advance

-Yamini

Posted: Wed Mar 02, 2005 12:01 pm
by sumitgulati
Is the data coming from the source file sorted by the three columns that define uniqueness of a row? Could you also please give approx number of records you expect to process using this job.

Regards,
-Sumit

Uniqueness of data in a table

Posted: Wed Mar 02, 2005 1:19 pm
by yaminids
Hello Sumit,

Yes the data in the source is sorted and there are approximately 40 million rows in the source, which I think rules out the use of Hash file stage.

Thanks
-Yamini

Posted: Wed Mar 02, 2005 1:46 pm
by Sainath.Srinivasan
Use 3 stage variables in the transformer following the source stage.

1.) newKeyCombination (default 'N')
2.) prevKeyCombination (default '')
3.) seqNo (default 0)

newKeyCombination <- if link.newKey <> prevKeyCombination Or IsNULL(prevKeyCombination) then 'Y' Else 'N'
prevKeyCombination <- link.newKey
seqNo <- if newKeyCombination = 'Y' then seqNo + 1 Else seqNo

For A, move all values

For B, have a constraint saying newKeyCombination = 'Y'

Re: Uniqueness of data in a table

Posted: Wed Mar 02, 2005 1:49 pm
by sumitgulati
Well since the data is sorted that solves the problem.

Declare four staging variables say UniqueCnt, ColAvar, ColBVar and ColCVar in the same order. Initialize all the stagevariables to 0. Map ColAvar, ColBVar and ColCVar to ColA, ColB and ColC from the source.

In the derivation of UniqueCnt have the following code:
If SourceLink.ColA<>ColAVar OR SourceLink.ColB<>ColBVar OR SourceLink.ColC<>ColCVar Then UniqueCnt+1 Else UniqueCnt
Remember the order in which you define the stage variables is important because they get executed in the same order. Since Stage variable ColAvar, ColBVar and ColCVar are defined below stage variable UniqueCnt they hold the previous rows values when used in the derivation of UniqueCnt

Posted: Wed Mar 02, 2005 1:52 pm
by sumitgulati
Addition to the previous reply.

Using the same logic you can define a flag. Use this flag to push the values to Table B only when a new combination comes.

Push all the records to Table A

Use the value of UniqueCnt in both TableA and Table B

Posted: Wed Mar 02, 2005 3:03 pm
by Aravind
But if the source data contains a row which got loaded to Table B by lletz say a previous days load.. Then I doubt if with the above logic
"For every unique row in table B, I have to generate a sequence number which I have to enter in Table A." this requirement will me met or not. I am sorry if I am wrong

Posted: Wed Mar 02, 2005 3:32 pm
by sumitgulati
You are right Arvind but we have no informaion on
1) the frequency of load
2) if the source file contains new set of data everytime or not

Regards,
-Sumit

Uniqueness of data in a table

Posted: Wed Mar 02, 2005 7:03 pm
by yaminids
Hello all,

Thank you very much for your valuable suggestions. This is a one time load so the question of next load does not arise.
Thank you all once again

-Yamini

Posted: Thu Mar 03, 2005 11:37 am
by clshore
Why not let the target DB engine do the heavy lifting?
First load all the rows into table A.
Then select unique from A into B, generating your B sequence number as you go according to the capability of your DB (Oracle sequence, etc.)

Carter

Uniqueness of data in a table

Posted: Thu Mar 03, 2005 6:29 pm
by yaminids
Hello Sumit,

I used the logic you described to define a flag inorder the load the data into table 'B'. The flag is only working when the input columns does not have any null values. Can you explain to me how to deal with null values in this scenario.

Thanks in advance
-Yamini

Posted: Fri Mar 04, 2005 3:15 am
by Sainath.Srinivasan
Use a IsNULL() function to replace the input link value to some value which will never occur in the input.

Posted: Fri Mar 04, 2005 2:13 pm
by sumitgulati
Hi Yamini,

The <> does not work with Null values. You need to do null handling to solve this.

Follow this approach:

Declare seven staging variables say ColA, ColB, ColC, UniqueCnt, ColAvar, ColBVar and ColCVar in the same order. Initialize all the stagevariables to 0. Map ColAvar and ColA to
If IsNull(SourceLink.ColA) Then ' ' Else SourceLink.ColA

Similarly map ColB, ColC, ColBVar and ColCVar.

Now use the following expression in the derivation of UniqueCnt:
If ColA<>ColAVar OR ColB<>ColBVar OR ColC<>ColCVar Then UniqueCnt+1 Else UniqueCnt
This should work for you. Again do not forget the sequence of the sequence variables.

Regards,
-Sumit

Posted: Fri Mar 04, 2005 2:44 pm
by Sainath.Srinivasan
Yamini,

Did you try the solution mentioned by me earlier in this post? I assume that will give you a simple and straightforward solution.

Uniqueness of data in a table

Posted: Mon Mar 07, 2005 12:46 pm
by yaminids
Hello Sumit and Sainath,

Thank you for your help. I got the solution, for which I have been trying for a long time. Sainath, I did try your logic and I think its very easy to implement.

-Yamini