Page 1 of 1

Duplicate count

Posted: Wed Jan 30, 2008 6:16 pm
by krishna81
Hi,

I have a requirement to get a count of duplicates in target.please find out below i/p and o/p data.

I/P data:

No Name
----------------
1 XXX
2 XXX
3 XXX
4 YYY
5 YYY
6 ZZZ

O/P Data:

No Name DuplicateCount
-----------------------------------
1 XXX 1
2 XXX 2
3 XXX 3
4 YYY 1
5 YYY 2
6 ZZZ 1

Please post logic to achieve this requirement.

Thanks in advance

Krishna.

Posted: Wed Jan 30, 2008 6:23 pm
by ShaneMuir
Investigate the use of stage variables.

Basically you want to set up 3 variables, one that holds the value of the previous record, on that compares variable one to the current record and another that acts as a counter depending on whether there is a match or not.

Posted: Wed Jan 30, 2008 9:05 pm
by DS_SUPPORT
make sure, the input data is sorted , before applying the above said logic.

Re: Duplicate count

Posted: Wed Jan 30, 2008 9:12 pm
by thiruiitd
Hi Krish,

Create three stage variables as suggested by Shane as follows
(sort the record first)

Stage Variables and derivations (use the same order of stage variable)

SVCurrentCount --If SVPreviousName is NULL or SVPreviousName <> DSLink.Name Then 1 Else SVPreviousCount +1

SVPreviousCount --SVCurrentCount

SVPreviousName --DSLink.Name



Mapping

No :DSLink.No

Name :DSLink.Name

Duplicatecount : SVCurrentCount



Regards
Thiru















krishna81 wrote:Hi,

I have a requirement to get a count of duplicates in target.please find out below i/p and o/p data.

I/P data:

No Name
----------------
1 XXX
2 XXX
3 XXX
4 YYY
5 YYY
6 ZZZ

O/P Data:

No Name DuplicateCount
-----------------------------------
1 XXX 1
2 XXX 2
3 XXX 3
4 YYY 1
5 YYY 2
6 ZZZ 1

Please post logic to achieve this requirement.

Thanks in advance

Krishna.

Posted: Wed Jan 30, 2008 10:39 pm
by ShaneMuir
DS_SUPPORT wrote:make sure, the input data is sorted , before applying the above said logic.
Good Point!

Posted: Thu Jan 31, 2008 4:36 am
by sachin1
yes very true the logic works perfectly as mentioned by thiruiitd .

Posted: Thu Jan 31, 2008 8:56 am
by MOHAMMAD.ISSAQ
You can also do by using a hash look up and using the same lookup job as reference in the same job itself.

Posted: Thu Jan 31, 2008 9:02 am
by chulett
Yup, then sort order is not an issue. :wink:

Posted: Thu Jan 31, 2008 3:32 pm
by krishna81
Hi Guys,

Sorry for late responce,
i tried which u suggestesd above and i finally got the result.
thanks alot for posting this result and quick response.

still i have doubt about below
but earlier when i test with the reverse order of your stage variables order i got below result .
Order of stage variables:


SVPreviousName --DSLink.Name

SVPreviousCount --SVCurrentCount

SVCurrentCount --If SVPreviousName is NULL or SVPreviousName <> DSLink.Name Then 1 Else SVPreviousCount +1

No Name DuplicateCount
-----------------------------------
1 XXX 1
2 XXX 1
3 XXX 1
4 YYY 2
5 YYY 2
6 ZZZ 3

could any one Please clarify me how could we follow the order of Stage Variables.

once again thanks for previous reply.

Regards
Krishna

Posted: Thu Jan 31, 2008 6:19 pm
by ShaneMuir
krishna81 wrote:i tried which u suggestesd above and i finally got the result.
thanks alot for posting this result and quick response.

still i have doubt about below
but earlier when i test with the reverse order of your stage variables order i got below result .

...

could any one Please clarify me how could we follow the order of Stage Variables.
Sorry you have lost me now Krishna. So you have got the process working? But you are just wondering why when you reverse the order of the stage variables you get a different result?

The order of the stage variables is very important as they are processed top to bottom for each incoming record. This means that variables lower down the list can contain information from the previous record until the current record is processed for all the variables above it.
This is why svPreviousName has to be the last one, so it will hold the last value passed and so that svCurrentCount can compare that previous value to the new incoming value. If it was first on the list you would you would be comparing the incoming record to itself.

Posted: Thu Jan 31, 2008 6:48 pm
by jhmckeever
For completeness, here's the evaluation order of elements in a transformer (as I understand them):

Code: Select all

Evaluate each stage variable's initial value
For each input row:
	Evaluate each stage variable derivation value, top to bottom
	For each output link:
		Evaluate each column derivation value
		Evaluate output constraint
		Write the output record, if appropriate
	Next output link
Next input row
J.

Posted: Thu Jan 31, 2008 9:50 pm
by ray.wurlod
In server there's a little more that might happen. There's an ancient and long post by me that explains all the gory detail. You left out:
  • before any rows are processed execute before-stage subroutine if any

    for each reference input link evaluate the key expression on each Key column and request matching rows from the link setting link variables appropriately

    for each output link use Put() function to send row (if constraint expression is satisfied), detect failures and set link variables appropriately

    after all rows have been processed execute after-stage subroutine if any

Posted: Fri Feb 01, 2008 11:44 am
by shepli
I think a easier way is to use a stage variable, e.g. svDupCnt, and routine RowProcCompareWithPreviousValue (precondition is the Names are sorted).

Derivation to svDupCnt:

Code: Select all

If RowProcCompareWithPreviousValue(Input.Name) Then svDupCnt+1 Else 1
This way, you don't need to worry about the order of stage variables.

shepli