How do I compare?

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

aladap12
Participant
Posts: 60
Joined: Fri Jul 20, 2007 1:15 pm
Location: NO

How do I compare?

Post by aladap12 »

DS Gurus,

I have a problem in my job.

I have three input columns: ID, date, seq.

My requirement is that I have to check the ID and date columns for duplicate entries. If there are duplicate entries on ID and date, then I have to check seq.

Further, I have to write "xx" to a new column in the output for the greater seq number.

How do I achieve this in DataStage?


Thanks
Kr
sreddy
Participant
Posts: 144
Joined: Sun Oct 21, 2007 9:13 am

Re: How do I compare?

Post by sreddy »

Aladap12

This is not a DataStage issue; you can write a SQL query to remove the duplicates at the source database stage (user-defined SQL).

-- Oracle: delete every row that has a higher ROWID than some other row
-- with the same col1/col2, keeping one row per duplicate group
DELETE FROM table_name A
WHERE A.ROWID > ANY (
    SELECT B.ROWID FROM table_name B
    WHERE A.col1 = B.col1
      AND A.col2 = B.col2
);
-----------------------------------------------------
-- Equivalent form: keep the row with the smallest ROWID per key
DELETE FROM table_name A
WHERE ROWID > (
    SELECT MIN(B.ROWID) FROM table_name B
    WHERE A.key_values = B.key_values
);




You can try these.





aladap12 wrote: DS Gurus,

I have a problem in my job.

I have three input columns: ID, date, seq.

My requirement is that I have to check the ID and date columns for duplicate entries. If there are duplicate entries on ID and date, then I have to check seq.

Further, I have to write "xx" to a new column in the output for the greater seq number.

How do I achieve this in DataStage?

Thanks
Kr
SReddy
dwpractices@gmail.com
Analyzing Performance
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Use sorted data, and stage variables in a Transformer stage to detect duplicates. Search the forum - this technique has been described before.
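
For example (a minimal sketch only; link and column names are assumptions), with the input sorted on ID and date, and the stage variables declared in this order:

   svKey = InLink.ID : "|" : InLink.DateCol
   svIsDup = (COMPARE(svKey, svPrevKey) = 0)
   svPrevKey = svKey

svIsDup is then available in your output column derivations.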

Please ignore the advice from sreddy; it is too narrowly focussed. For example there is no guarantee that the source is a database table.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Never mind the fact that there's no indication of a need to delete anything, just detect.
-craig

"You can never have too many knives" -- Logan Nine Fingers
aladap12
Participant
Posts: 60
Joined: Fri Jul 20, 2007 1:15 pm
Location: NO

Thanks for your response, Ray

Post by aladap12 »

Thanks, Ray. I have searched the forum but did not find any answer to my problem. Anyway, I would like to explain my issue more elaborately.

My source data contains several records.

My job design is like this: one driving table, a few lookup tables, and I am writing to a hashed file.

We have a set of jobs like this. As per my business requirement, I am processing all the records from the driving table and getting some fields from the lookup tables in order to make complete records.

My input data is like:

col1    col2      col3
------------------------
123 ,   6/24/08 , 237
124 ,   6/24/08 , 238
123 ,   6/24/08 , 239
124 ,   6/23/08 , 240



Right now I am selecting all the records and writing them to the hashed file.

My client asked me to send records like this:

If the values in col1 and col2 are the same (both fields have duplicate entries), then I have to check column 3. Further, I have to create a new column in the hashed file and send "xx" to that field for the greater value in col3.

Can we achieve this in DataStage Server jobs? Please let me know.

Appreciate your help 8)
Thanks
Kr



ray.wurlod wrote:Use sorted data, and stage variables in a Transformer stage to detect duplicates. Search the forum - this technique has been described before.

Please ignore the advice from sreddy; it is too narro ...
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Give us an example of what the output data should look like, given your input data example.
-craig

"You can never have too many knives" -- Logan Nine Fingers
jdmiceli
Premium Member
Posts: 309
Joined: Wed Feb 22, 2006 10:03 am
Location: Urbandale, IA

Post by jdmiceli »

Also, if the data is going to a hashed file, then the key value rows are simply going to overwrite each other as they occur. That is most likely not the desired result, but it is something to be aware of just in case.
Bestest!

John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services


"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"
aladap12
Participant
Posts: 60
Joined: Fri Jul 20, 2007 1:15 pm
Location: NO

Thanks for your reply

Post by aladap12 »

chulett wrote:Give us an example of what the output data should look like, given your input data example. ...
Thanks, Ray. I have searched the forum but did not find any answer to my problem. Anyway, I would like to explain my issue more elaborately.

My source data contains several records.

My job design is like this: one driving table, a few lookup tables, and I am writing to a hashed file.

We have a set of jobs like this. As per my business requirement, I am processing all the records from the driving table and getting some fields from the lookup tables in order to make complete records.

My input data is like:

col1    col2      col3
------------------------
123 ,   6/24/08 , 237
124 ,   6/24/08 , 238
123 ,   6/24/08 , 239
124 ,   6/23/08 , 240



Right now I am selecting all the records and writing them to the hashed file. In addition to this, my client asked me to send records like this:

If the values in col1 and col2 are the same (both fields have duplicate entries), then I have to check column 3. Further, I have to create a new column in the hashed file and send "xx" to that field for the greater value in col3.


The output should be like:

col1    col2      col3    col4
--------------------------------
123 ,   6/24/08 , 237 ,   000
124 ,   6/24/08 , 238 ,   000
123 ,   6/24/08 , 239 ,   45
124 ,   6/23/08 , 240 ,   000


And in my hashed file I have around 70 other columns, of which around 6 are unique (key) columns.

Hope you understand my point.

Can we achieve this in DataStage Server jobs? Please let me know.

Appreciate your help
Thanks
Kr
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Exactly as I posted earlier. Use stage variables to check for "changed" (Col1 and Col2) and "larger" (Col3) in data sorted by Col1, Col2 and Col3.
Derive your output columns accordingly.
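In outline (a sketch only; this assumes Col3 is sorted descending within each Col1/Col2 group, so that the first row of a group carries the largest Col3), reusing the svIsDup stage variable from my earlier post:

   Col4 derivation: If Not(svIsDup) Then "45" Else "000"

(Note this marks the single row of a non-duplicated key as well.)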
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK

Re: Thanks for your reply

Post by PhilHibbs »

aladap12 wrote: My input data is like:

col1 col2 col3
------------------------------
123 , 6/24/08 , 237
124 , 6/24/08 , 238
123 , 6/24/08 , 239
124 , 6/23/08 , 240

The output should be like:

col1 col2 col3 Col4
------------------------------
123 , 6/24/08 , 237 , 000
124 , 6/24/08 , 238 , 000
123 , 6/24/08 , 239 , 45
124 , 6/23/08 , 240 , 000
What is the 45 value? Is this just an indicator that that row is a duplicate key and that the col3 value is greater? Is it indicating that the col3 value is the highest for that col1/col2 key pair? What if this was your input:

col1 col2 col3
------------------------------
123 , 6/24/08 , 237
123 , 6/24/08 , 239
123 , 6/24/08 , 241

Would you want the first two occurrences of 123,6/24/08 to both have 000 in the output and the third have 45? If so, you could process your data once to find the maximum value for each key (sort it and write to a hashed file, for example), and then do a lookup to check whether the current record is the highest value for the key. Alternatively, sort it by the key and by descending col3, and on the first occurrence of a key, write out the record and the 45 value to indicate the highest value, and write zero for subsequent occurrences of the same key.

Key change logic is implemented by having three Stage Variables like this:

svKey = inputlink.col1 : inputlink.col2
svKeyChanged = COMPARE( svKey, svOldKey ) <> 0
svOldKey = svKey

This works by comparing the new value of svKey with the old value of svOldKey, because the derivations are performed in the order that they are in the Stage Variables box, so svOldKey will have the value "left over" from the previous record. Make sure you init your SV's! The COMPARE function is important since "0" and "000" compare as equal when using the = operator.
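
To illustrate (a quick sketch of the behaviour):

   If "0" = "000" Then ...              ;* true: both operands look numeric, so = compares them as numbers
   If COMPARE("0", "000") = 0 Then ...  ;* false: COMPARE works character by character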

In order to do key change logic, your data needs to be sorted on the key, which your example is not. So, whatever your solution, you need to process the data twice, even if the first is just a sort on the key.
Phil Hibbs | Capgemini
Technical Consultant
aladap12
Participant
Posts: 60
Joined: Fri Jul 20, 2007 1:15 pm
Location: NO

Thanks, Ray

Post by aladap12 »

ray.wurlod wrote:Exactly as I posted earlier. Use stage variables to check for "changed" (Col1 and Col2) and "larger" (Col3) in data sorted by Col1, Col2 and Col3.
Derive your output columns accordingly. ...


Thanks, Ray.

Could you please explain in detail what I have to do to achieve this?

Appreciate your time and help


Thanks
Kr
aladap12
Participant
Posts: 60
Joined: Fri Jul 20, 2007 1:15 pm
Location: NO

Thanks for your reply

Post by aladap12 »

PhilHibbs wrote: What is the 45 value? Is this just an indicator that that row is a duplicate key and that the col3 value is greater? Is it indicating that the col3 value is the highest for that col1/col2 key pair? What if this was your input:

col1 col2 col3
------------------------------
123 , 6/24/08 , 237
123 , 6/24/08 , 239
123 , 6/24/08 , 241

Would you want the first two occurrences of 123,6/24/08 to both have 000 in the output and the third have 45? If so, you could process your data once to find the maximum value for each key (sort it and write to a hashed file, for example), and then do a lookup to check whether the current record is the highest value for the key. Alternatively, sort it by the key and by descending col3, and on the first occurrence of a key, write out the record and the 45 value to indicate the highest value, and write zero for subsequent occurrences of the same key.


Let me answer your question:

Value 45 is an indicator that tells the user this record has the highest seq number.
If the values in col1 and col2 have duplicate entries, then I have to check col3 for the higher base_row_id, and further I have to pass an indicator in the output file with a value like 45 (a new column in the output file).

Secondly
col1 col2 col3
------------------------------
123 , 6/24/08 , 237
123 , 6/24/08 , 239
123 , 6/24/08 , 241

If my input is like the above, then my output should be like:

col1    col2      col3    col4
--------------------------------
123 ,   6/24/08 , 237 ,   000
123 ,   6/24/08 , 239 ,   000
123 ,   6/24/08 , 241 ,   45

Hope you understand my problem

Thanks
Kr
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK

Post by PhilHibbs »

OK. The answer is the same: either you process the file once and keep a record (in a hashed file) of the highest col3 value for every key pair, or you process the file in descending order of col1, col2, col3 and use key change logic to detect a new key pair (which will always be the highest value for that key due to the descending sort).
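
A sketch of that second approach (untested; link and column names are assumptions), with the input sorted by col1 and col2 ascending and col3 descending, and stage variables in this order:

   svKey = InLink.col1 : "|" : InLink.col2
   svNewKey = (COMPARE(svKey, svPrevKey) <> 0)
   svPrevKey = svKey

col4 derivation: If svNewKey Then "45" Else "000"

Note this flags the single row of a non-duplicated key as well; if those rows must stay "000", take the first approach and also keep a per-key row count in the hashed file, flagging only where the count is greater than 1 and col3 equals the stored maximum.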
Phil Hibbs | Capgemini
Technical Consultant