Finding Similar records & assigning Groups

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

DSFreddie
Participant
Posts: 130
Joined: Wed Nov 25, 2009 2:16 pm

Finding Similar records & assigning Groups

Post by DSFreddie »

Hi All,

Requesting your thoughts/expertise to arrive in a breakthrough for one of my project requirement in DataStage.

The requirement is as follows,

I have 2 files A (Master File) & B (New Record/Update to existing records file).

I need to combine both these files(I will use Join Stage here) & based on certain Key fields(like SSN etc), I need to group the records & assign a specific group ID to each of the distinct record set.

For e.g :
File A
-----------
100 ABC 1001
200 ASD 2001
300 3001

File B
------------
100 ACE 2002
200 ASD 2003
400 XYZ 2005

Output should be as follows,
File C
------------
100 ABC 1001 G1 (group ID)
100 ACE 2002 G1
200 ASD 2001 G2
200 ASD 2003 G2
300 3001 G3
400 XYZ 2005 G4

Can you pls shed some light on how we can accomplish this in DataStage. Your thoughts will be really helpful to take it forward.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Funnel the sources then Sort on the key field generating a Key Change column. In a downstream transformer stage use the Key Change field to trigger an increment in the Group ID.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
DSFreddie
Participant
Posts: 130
Joined: Wed Nov 25, 2009 2:16 pm

Post by DSFreddie »

Hi Ray,

Thanks for your reply & input. I tried the solution you gave. This is what I got.

File C
------------
100 ABC 1001 1 (keyChange)
100 ACE 2002 1
200 ASD 2001 1
200 ASD 2003 0
300 3001 1
400 XYZ 2005 1
400 XYZ 2005 0
500 DEF 2000 1
500 DEF 2000 0

I somehow couldnt group the records using a transformer stage. (since the keychange values are 1/0. Can you pls be more specific how i can handle it through Transformer ?

Also, this process will run once in a week & in-case of a similar key record, it should be grouped to the same Group the key records were grouped previously.

Thanks in advance for your valuable inputs.
Freddie
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

That key change column is a boolean. When it is true (1) you are at the beginning of a new group and you should increment your "group id". While it is false (0) you should continue to use the current group id. Use stage variables to manage that.
-craig

"You can never have too many knives" -- Logan Nine Fingers
DSFreddie
Participant
Posts: 130
Joined: Wed Nov 25, 2009 2:16 pm

Post by DSFreddie »

Hi Chulett,

Thanks for your reply. Unfortunately, I am not able to read your full message. ( & i know a premium membership is required to view it :(

Can someone posts the full message (or) post your thoughts for this issue.

Thank You,
Freddie
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Why not get a premium membership? It's less than 30c per day, and the revenue helps to keep DSXchange alive.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
SURA
Premium Member
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

Nope, that is not fare. The solution is already provided to you.

1. Funnel it
2. sort it based on key
3. Create stage variable count,sv1,sv2, sv3 in order
4. count should be int and initial value as 0
5.Pass the sv2 to sv1
5. keycol = sv2
6. sv1 = sv2
7. In count variable, if sv3 then count + 1 else 0


In the TFM you can make your condition to get the expected result.

Let me know, if i missed / something wrong.

DS User
DSFreddie
Participant
Posts: 130
Joined: Wed Nov 25, 2009 2:16 pm

Logic Issue - Grouping distinct records (thru Transformer)

Post by DSFreddie »

Sure Ray. I am planning to upgrade my membership in the near future.

While trying to use the stage variables to check the Key Change values & assigning Group_ID's, I could assign distinct values to all 1's & 0's repectively. But my logic is not working in the scenario where i should be asssigning distinct group _Id's for each of the groups.

Can you pls help me resolve this logic.

Thanks in advance
Freddie
DSFreddie
Participant
Posts: 130
Joined: Wed Nov 25, 2009 2:16 pm

Post by DSFreddie »

Thanks a lot Sura for your inputs.

I did the first 2 steps. In the transformer, I didnt quite get what you explained using the stage variables.

I am creating 4 stage variables in the order COUNT,SV1,SV2 & SV3.

COUNT (first Stage variable) value = 0
What should I be doing for the rest ? (SV1,SV2 & SV3)

In the end, you stated to make the conditions to get the expected result. Can you pls give me more detail on it ?
Sorry, if i am asking basic questions here.

Thanks,
Freddie
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Sure.

Show us what you've tried. And explain in what way is does not produce the expected outcome.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Just as a wee bit of a personal rant - anything using stage variables with names like "sv1" would never pass a peer review. Give them meaningful names, people!
-craig

"You can never have too many knives" -- Logan Nine Fingers
SURA
Premium Member
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

You are right. The aim is to give a hint . Anyhow i will change my approach.

DS User
DSFreddie
Participant
Posts: 130
Joined: Wed Nov 25, 2009 2:16 pm

Post by DSFreddie »

Hi Sura/Ray,

Thanks for your inputs. I tried the following in the stage variable,

StageVar : if not(DSLink3.keyChange) then (StageVar) else StageVar+1

In the new field GroupID, i passed StageVar. But it is not giving me the expected output. My output should look as follows,

output
------------
100 ABC 1001 1 (keyChange) G1 (Group ID)
100 ACE 2002 1 G1
200 ASD 2001 1 G2
200 ASD 2003 0 G2
300 3001 1 G3
400 XYZ 2005 1 G4
400 XYZ 2005 0 G4
500 DEF 2000 1 G5
500 DEF 2000 0 G5

Thanks for your help in resolving this logic issue.
Freddie
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

What IS it giving you? Are you deriving GroupID as

Code: Select all

"G" : StageVar
?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

Try experimenting using different combinations of columns for the sort key.
You'll be amazed how quickly the answer will make itself known when you don't wait for people to tell you the answer.
Post Reply