Duplicate count

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
krishna81
Premium Member
Premium Member
Posts: 78
Joined: Tue May 16, 2006 8:01 am
Location: USA

Duplicate count

Post by krishna81 »

Hi,

I have a requirement to get a count of duplicates in target.please find out below i/p and o/p data.

I/P data:

No Name
----------------
1 XXX
2 XXX
3 XXX
4 YYY
5 YYY
6 ZZZ

O/P Data:

No Name DuplicateCount
-----------------------------------
1 XXX 1
2 XXX 2
3 XXX 3
4 YYY 1
5 YYY 2
6 ZZZ 1

Please post logic to achieve this requirement.

Thanks in advance

Krishna.
ShaneMuir
Premium Member
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

Investigate the use of stage variables.

Basically you want to set up 3 variables, one that holds the value of the previous record, on that compares variable one to the current record and another that acts as a counter depending on whether there is a match or not.
DS_SUPPORT
Premium Member
Premium Member
Posts: 232
Joined: Fri Aug 04, 2006 1:20 am
Location: Bangalore

Post by DS_SUPPORT »

make sure, the input data is sorted , before applying the above said logic.
thiruiitd
Participant
Posts: 4
Joined: Thu Mar 09, 2006 1:06 am
Location: Bangalore

Re: Duplicate count

Post by thiruiitd »

Hi Krish,

Create three stage variables as suggested by Shane as follows
(sort the record first)

Stage Variables and derivations (use the same order of stage variable)

SVCurrentCount --If SVPreviousName is NULL or SVPreviousName <> DSLink.Name Then 1 Else SVPreviousCount +1

SVPreviousCount --SVCurrentCount

SVPreviousName --DSLink.Name



Mapping

No :DSLink.No

Name :DSLink.Name

Duplicatecount : SVCurrentCount



Regards
Thiru















krishna81 wrote:Hi,

I have a requirement to get a count of duplicates in target.please find out below i/p and o/p data.

I/P data:

No Name
----------------
1 XXX
2 XXX
3 XXX
4 YYY
5 YYY
6 ZZZ

O/P Data:

No Name DuplicateCount
-----------------------------------
1 XXX 1
2 XXX 2
3 XXX 3
4 YYY 1
5 YYY 2
6 ZZZ 1

Please post logic to achieve this requirement.

Thanks in advance

Krishna.
Thirunavukkarasu
Bangalore
ShaneMuir
Premium Member
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

DS_SUPPORT wrote:make sure, the input data is sorted , before applying the above said logic.
Good Point!
sachin1
Participant
Posts: 325
Joined: Wed May 30, 2007 7:42 am
Location: india

Post by sachin1 »

yes very true the logic works perfectly as mentioned by thiruiitd .
MOHAMMAD.ISSAQ
Participant
Posts: 78
Joined: Fri Mar 02, 2007 4:54 am
Location: CHENNAI

Post by MOHAMMAD.ISSAQ »

You can also do by using a hash look up and using the same lookup job as reference in the same job itself.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Yup, then sort order is not an issue. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
krishna81
Premium Member
Premium Member
Posts: 78
Joined: Tue May 16, 2006 8:01 am
Location: USA

Post by krishna81 »

Hi Guys,

Sorry for late responce,
i tried which u suggestesd above and i finally got the result.
thanks alot for posting this result and quick response.

still i have doubt about below
but earlier when i test with the reverse order of your stage variables order i got below result .
Order of stage variables:


SVPreviousName --DSLink.Name

SVPreviousCount --SVCurrentCount

SVCurrentCount --If SVPreviousName is NULL or SVPreviousName <> DSLink.Name Then 1 Else SVPreviousCount +1

No Name DuplicateCount
-----------------------------------
1 XXX 1
2 XXX 1
3 XXX 1
4 YYY 2
5 YYY 2
6 ZZZ 3

could any one Please clarify me how could we follow the order of Stage Variables.

once again thanks for previous reply.

Regards
Krishna
ShaneMuir
Premium Member
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

krishna81 wrote:i tried which u suggestesd above and i finally got the result.
thanks alot for posting this result and quick response.

still i have doubt about below
but earlier when i test with the reverse order of your stage variables order i got below result .

...

could any one Please clarify me how could we follow the order of Stage Variables.
Sorry you have lost me now Krishna. So you have got the process working? But you are just wondering why when you reverse the order of the stage variables you get a different result?

The order of the stage variables is very important as they are processed top to bottom for each incoming record. This means that variables lower down the list can contain information from the previous record until the current record is processed for all the variables above it.
This is why svPreviousName has to be the last one, so it will hold the last value passed and so that svCurrentCount can compare that previous value to the new incoming value. If it was first on the list you would you would be comparing the incoming record to itself.
jhmckeever
Premium Member
Premium Member
Posts: 301
Joined: Thu Jul 14, 2005 10:27 am
Location: Melbourne, Australia
Contact:

Post by jhmckeever »

For completeness, here's the evaluation order of elements in a transformer (as I understand them):

Code: Select all

Evaluate each stage variable's initial value
For each input row:
	Evaluate each stage variable derivation value, top to bottom
	For each output link:
		Evaluate each column derivation value
		Evaluate output constraint
		Write the output record, if appropriate
	Next output link
Next input row
J.
<b>John McKeever</b>
Data Migrators
<b><a href="https://www.mettleci.com">MettleCI</a> - DevOps for DataStage</b>
<a href="http://www.datamigrators.com/"><img src="https://www.datamigrators.com/assets/im ... l.png"></a>
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

In server there's a little more that might happen. There's an ancient and long post by me that explains all the gory detail. You left out:
  • before any rows are processed execute before-stage subroutine if any

    for each reference input link evaluate the key expression on each Key column and request matching rows from the link setting link variables appropriately

    for each output link use Put() function to send row (if constraint expression is satisfied), detect failures and set link variables appropriately

    after all rows have been processed execute after-stage subroutine if any
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
shepli
Participant
Posts: 79
Joined: Fri Dec 17, 2004 9:56 am

Post by shepli »

I think a easier way is to use a stage variable, e.g. svDupCnt, and routine RowProcCompareWithPreviousValue (precondition is the Names are sorted).

Derivation to svDupCnt:

Code: Select all

If RowProcCompareWithPreviousValue(Input.Name) Then svDupCnt+1 Else 1
This way, you don't need to worry about the order of stage variables.

shepli
Post Reply