Filter data based on emp_id

pavan_test
Premium Member
Posts: 263
Joined: Fri Sep 23, 2005 6:49 am

Filter data based on emp_id

Post by pavan_test »

Hi All,

I have an incoming file in this format:

Duplicate keyIDs are allowed, and within each keyID the emp_id values are unique.

keyID emp_id col3

101 1043 A
101 1059 B
101 1068 D
101 1069 E

102 1042 R
102 1041 T

103 1043 Y
103 1000 U

104 1059 O
104 1071 L
104 1072 K
104 1073 J

My expected output should remove the duplicate keyIDs and filter the data based on emp_id. I only care about the emp_ids whose values are 1043 and 1059.

My expected output should look like this:

keyID emp_id_1 emp_id_2 col3_1 col3_2

101 1043 1059 A B
102 - - - -
103 1043 - Y -
104 - 1059 - O

In my output the duplicate keyIDs are removed, and the emp_id_1 and emp_id_2 values are only ever null, 1043 or 1059.

Can anyone please suggest how to achieve this in DataStage?

Thanks
Mark
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Build the output rows in a Transformer stage and use an output constraint of emp_id = 1043 or emp_id = 1059. (I've assumed your input is sorted as shown - if not, sort it.)

Some of these rows will be partially incomplete. The final row in each group (that is, for each keyID) will be complete. Downstream of the Transformer stage use a Remove Duplicates stage to select the Last member of each group.
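
A rough Python sketch of that logic, purely to illustrate the idea (DataStage jobs are built in the Designer, so this is not DataStage syntax; the column names and sample values are taken from the post above):

[code]
# Sorted input: (keyID, emp_id, col3), all read as strings from the flat file.
rows = [
    ("101", "1043", "A"), ("101", "1059", "B"), ("101", "1068", "D"), ("101", "1069", "E"),
    ("102", "1042", "R"), ("102", "1041", "T"),
    ("103", "1043", "Y"), ("103", "1000", "U"),
    ("104", "1059", "O"), ("104", "1071", "L"), ("104", "1072", "K"), ("104", "1073", "J"),
]

built = []                       # rows leaving the "Transformer"
prev_key = None
for key, emp, col3 in rows:
    if key != prev_key:          # new keyID group: reset the carried values
        e1 = e2 = c1 = c2 = None
        prev_key = key
    if emp == "1043":
        e1, c1 = emp, col3
    elif emp == "1059":
        e2, c2 = emp, col3
    if emp in ("1043", "1059"):  # the output constraint described above
        built.append((key, e1, e2, c1, c2))

# "Remove Duplicates, keep Last": the last row built for a keyID is the most complete one.
last = {}
for row in built:
    last[row[0]] = row
for row in last.values():
    print(row)
[/code]

Note that with this constraint a keyID that has neither emp_id (102 in the sample) produces no output row at all; that case is picked up later in the thread.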
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pavan_test
Premium Member
Posts: 263
Joined: Fri Sep 23, 2005 6:49 am

Filter data based on emp_id

Post by pavan_test »

[quote="ray.wurlod"]Build the output rows in a Transformer stage and use an output constraint of emp_id = 1043 or emp_id = 1059. (I've assumed your input is sorted as shown - if not, sort it.)

Some of these rows will ...[/quote]

Yes, the input data is sorted in ascending order based on keyID.

Thanks
Mark
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Pass the sorted data into a Transformer. Have two output links: one for emp_id = 1043 and the other for emp_id = 1059.
Specify the same 5 columns in each link:
keyID emp_id_1 emp_id_2 col3_1 col3_2
In the first link map keyID, emp_id_1 and col3_1, and set the remaining fields to null. In the second link map keyID, emp_id_2 and col3_2, and set the rest to nulls/empty strings.
Pass it to an Aggregator, group by keyID and select Max for the rest of the fields.
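
An illustrative Python sketch of this split / funnel / aggregate idea (again not DataStage syntax; empty strings stand in for the nulls here so that Max has something to compare):

[code]
from collections import defaultdict

# Sorted input: (keyID, emp_id, col3) as strings.
rows = [
    ("101", "1043", "A"), ("101", "1059", "B"), ("101", "1068", "D"), ("101", "1069", "E"),
    ("102", "1042", "R"), ("102", "1041", "T"),
    ("103", "1043", "Y"), ("103", "1000", "U"),
    ("104", "1059", "O"), ("104", "1071", "L"), ("104", "1072", "K"), ("104", "1073", "J"),
]

# Two Transformer output links, each carrying the 5 output columns.
link_1 = [(k, e, "", c, "") for k, e, c in rows if e == "1043"]   # emp_id = 1043
link_2 = [(k, "", e, "", c) for k, e, c in rows if e == "1059"]   # emp_id = 1059

# Funnel the links together, then "Aggregator": group by keyID, Max of the rest.
groups = defaultdict(lambda: ["", "", "", ""])
for key, *values in link_1 + link_2:
    groups[key] = [max(old, new) for old, new in zip(groups[key], values)]

for key in sorted(groups):
    print(key, groups[key])
[/code]

As written, keyID 102 never reaches either link, which is the gap raised further down the thread.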
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Remove Duplicates selecting Last (as described in my earlier post in this thread) would be more efficient than Aggregator.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Last would select the last record, wouldn't it? But here the last record won't have all the field values.
I guess I am missing something.
Forgot to mention, the two links from the Transformer should be fed into a Funnel stage and then passed to the Aggregator, or into a Remove Duplicates stage as Ray mentioned.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
pavan_test
Premium Member
Posts: 263
Joined: Fri Sep 23, 2005 6:49 am

Filter data based on emp_id

Post by pavan_test »

[quote="kumar_s"]Last would select the last record isn't? But here last record wont have all the field values.
I guess I am missing some where.
Forgot to mention, the two links of the Transformer should been be Funn ...[/quote]

Hi Guys,

Thank you for the suggestions. I have one more question. When I use a Transformer with a constraint such as
emp_id = 1043 or emp_id = 1059, I will be dropping the record with keyID = 102, because this keyID does not have an emp_id of 1043 or 1059.

My expected output is:

keyID emp_id_1 emp_id_2 col3_1 col3_2

101 1043 1059 A B
102 - - - -
103 1043 - Y -
104 - 1059 - O

I still have to retain the record with keyID 102, but emp_id_1 and emp_id_2 for this keyID can be null/blank, since the emp_id for this keyID (i.e. 102) is neither 1043 nor 1059.

There will be many such records in my actual file, where the emp_id for a keyID may or may not be 1043 or 1059. However, I still have to retain those keyIDs; the emp_id_1 and emp_id_2 values have to be null/blank for them.

Any suggestions?

Thanks & Regards
Mark
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

So extend it with a third link for the other cases (not 1043 or 1059) and perform the same funnel and group by.
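
A small extension of the earlier sketch with that third link (illustrative Python again), so a keyID such as 102 still reaches the Funnel and Aggregator as an all-blank row:

[code]
from collections import defaultdict

rows = [
    ("101", "1043", "A"), ("101", "1059", "B"), ("101", "1068", "D"),
    ("102", "1042", "R"), ("102", "1041", "T"),
    ("103", "1043", "Y"), ("103", "1000", "U"),
    ("104", "1059", "O"), ("104", "1071", "L"),
]

link_1 = [(k, e, "", c, "") for k, e, c in rows if e == "1043"]
link_2 = [(k, "", e, "", c) for k, e, c in rows if e == "1059"]
link_3 = [(k, "", "", "", "") for k, e, c in rows if e not in ("1043", "1059")]  # other cases

groups = defaultdict(lambda: ["", "", "", ""])
for key, *values in link_1 + link_2 + link_3:     # Funnel, then group by keyID / Max
    groups[key] = [max(old, new) for old, new in zip(groups[key], values)]

for key in sorted(groups):
    print(key, groups[key])    # 102 now comes out as ['', '', '', '']
[/code]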
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
pavan_test
Premium Member
Posts: 263
Joined: Fri Sep 23, 2005 6:49 am

Filter data based on emp_id

Post by pavan_test »

[quote="kumar_s"]So extend a third link which is is for other cases (Not 1043 1059) and perform the same funnel and group by. ...[/quote]

So I will have 3 output links coming out of my Transformer, and then I will use a Funnel stage to combine those 3 links into a single output (I want only 1 output coming out of this job).

Each record in my output will then have a unique keyID. My idea is also to remove the duplicate keyIDs, but based on the emp_id values.

Thanks a lot.

Regards
Mark
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

As I already specified in my previous post, you need to use an Aggregator stage to group by the key and find the Max for the rest of the fields.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
pavan_test
Premium Member
Posts: 263
Joined: Fri Sep 23, 2005 6:49 am

Filter data based on emp_id

Post by pavan_test »

[quote="kumar_s"]As i already specified in my previous post, you need to use Aggregator stage to group by the Key and find the Max for the rest of the fields. ...[/quote]

Thanks I will do that.

Regards
Mark
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You do not need an Aggregator stage at all. Given sorted data, gradually building each row in a Transformer stage, and taking the last of each group (sort key value) in a Remove Duplicates stage, is a sufficient solution to the original problem (which did not drop row 102). To drop rows with no non-key values, extend the Transformer constraint expression so that they do not get passed.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

OK, so stage variables need to be used to carry the previous record's not-null values forward so that the last record in each group is fully populated, followed by a Remove Duplicates stage.
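
A minimal Python sketch of that stage-variable idea (illustrative only): carry the 1043/1059 values seen so far within the current keyID group, build a row for every input record, then keep the Last row per keyID, which is what the Remove Duplicates stage does:

[code]
# Sorted input: (keyID, emp_id, col3) as strings.
rows = [
    ("101", "1043", "A"), ("101", "1059", "B"), ("101", "1068", "D"), ("101", "1069", "E"),
    ("102", "1042", "R"), ("102", "1041", "T"),
    ("103", "1043", "Y"), ("103", "1000", "U"),
    ("104", "1059", "O"), ("104", "1071", "L"), ("104", "1072", "K"), ("104", "1073", "J"),
]

out = {}                         # keyID -> last row built for that keyID
prev_key = None
for key, emp, col3 in rows:
    if key != prev_key:          # new group: reset the "stage variables"
        e1 = e2 = c1 = c2 = ""
        prev_key = key
    if emp == "1043":
        e1, c1 = emp, col3
    elif emp == "1059":
        e2, c2 = emp, col3
    out[key] = (key, e1, e2, c1, c2)   # later rows in the group overwrite earlier ones

for row in out.values():
    print(row)                   # keyID 102 is retained with blank emp_id / col3 columns
[/code]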
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
pavan_test
Premium Member
Posts: 263
Joined: Fri Sep 23, 2005 6:49 am

Filter data based on emp_id

Post by pavan_test »

Thanks a bunch guys for all the suggestions.

I ran the job and am able to see the desired output.

Regards
Mark.