DSXchange

Posted: **Tue Oct 30, 2012 1:09 am**

Hi there ,
I am having a set of data with three columns :
1. MEMBER (varchar)
2. PHONE (varchar)
3. UPDATE_TIME (timestamp)
The data looks something like this below :
M1 P1 T1
M1 P1 T7
M1 P1 T1
M1 P1 T1
M2 P2 T8
M2 P2 T2
M2 P2 T6
M2 P2 T2
where M's stand for MEMBER, P's stand for PHONE and T's stand for UPDATE_TIME and T1>T2>T6>T7>T8
What is expected is to create a Datastage job(no other tool like PL/SQL etc. will be accepted by business) that will implement the following logic:
1. For EACH MEMBER-PHONE combination, it will get ALL Records which has UPDATE_TIME=MAX of UPDATE_TIME for that MEMBER-PHONE combination, and insert these records in one table
2. Any other records thereafter, will be considered as DUPLICATE's , and for each of these records there will be another column called DUPCHECK which will be set to "DUPLICATE"

The outcome for the above mentioned data will be something like

Good records:
M1 P1 T1
M1 P1 T1
M1 P1 T1
M2 P2 T2
M2 P2 T2

Duplicate records:
M1 P1 T7 DUPLICATE
M2 P2 T6 DUPLICATE
M2 P2 T8 DUPLICATE

Please note, we needs to capture all records, none of them could be dropped/rejected, and I have already used the Remove Duplicate stage with Join stage to implement this logic,
but it is taking a hell lot of time.
1.Cant this logic be acheived using SORT stage only
2.There will be millions of data , so a very optimized approach will be highly appreciated.

Thanks
Kumarjit.

Posted: **Tue Oct 30, 2012 1:29 am**

Do a hash partition using MEMBER & PHONE and perform a descending sort using the time-stamp column.In the sort stage, create the KEY CHANGE COLUMN. Your records should be grouped and sorted and will look something like this.

Code: Select all

M1 P1 T1 1 
M1 P1 T1 0
M1 P1 T1 0
M1 P1 T7 1
M2 P2 T2 1
M2 P2 T2 0
M2 P2 T6 1
M2 P2 T8 1

Code your transformer logic to filter out all the records in a group (except the first record with value 1) which have KEY CHANGE value as 1.

The output should be as expected

Posted: **Tue Oct 30, 2012 2:27 am**

Thanks for the reply Rajan.

I understood the first part, but things start to mess up from the point where you have mentioned "Code your transformer logic to filter out all the records in a group....".
Could you please elaborate ?

Thanks in advance,
Kumarjit.

Posted: **Tue Oct 30, 2012 2:38 am**

Introduce a stage variable in the transformer that will hold a sum. Initialize the value of this stage variable(svCounter) to 0. For a given group (combination of MEMBER & PHONE), increment svCounter by 1 when there's a record with KEY_CHANGE=1. When svCounter=2, then increment svCounter for any value of KEY_CHANGE in that group.

Code: Select all

M   P   T  Key_Chg  svCounter
M1 P1 T1 1                    1
M1 P1 T1 0                    1
M1 P1 T1 0                    1
M1 P1 T7 1                    2
M2 P2 T2 1                    1
M2 P2 T2 0                    1
M2 P2 T6 1                    2
M2 P2 T8 1                    3
M2 P2 T8 0                    4

Add a constraint which filters out all records which have svCounter>1.
Hope it helps

Posted: **Tue Oct 30, 2012 3:07 am**

Thanks again. But, the things which I didn't understand are :

a) How could you increment the value of the stage variable svCounter for EACH GROUP (MEMBER-PHONE) ? The reason fo asking so is , how could a Transformer stage identify each unique group of MEMBER-PHONE ?
b) Why should I increment the value of svCounter to any value of KEY_CHANGE col for that group when svCounter=2 ? What if I just go on incrementing svCounter with the value of KEY_CHANGE col even after svCounter=2 ?

Thanks for your patience,
Kumarjit.

Posted: **Tue Oct 30, 2012 3:21 am**

The answer to your first question is elementary. You'd do better to figure out the answer yourself (Hint : Key Break or better still 'Looping' if you use 8.5 or above). If you can't, search the forum. It would be redundant if I repeat something as basic as that here.

To answer your second question, the idea is to filter out the data. You can increment the counter by 1, the key_change value or by the speed of light!

Posted: **Tue Oct 30, 2012 7:39 am**

Kumarjit - please start using the Reply to topic 'button' rather than Reply with quote and quoting everything every time you post. Quote when you need to, just not all the time by default.

Thanks.

Posted: **Tue Oct 30, 2012 8:11 am**

Sorry , my bad ......

..... Probably because I'm a first time user ..... But, I dont really see the Reply to topic button in my login ........

Anyways , thanks for the info.....

Posted: **Tue Oct 30, 2012 8:24 am**

No worries, I see you've found it now.

DSXchange

Get records with MAXIMUM value in a timestamp column

Get records with MAXIMUM value in a timestamp column