Filter data based on emp_id

pavan_test
Premium Member
Posts: 263
Joined: Fri Sep 23, 2005 6:49 am

Filter data based on emp_id

Post by pavan_test »

Hi All,

I have an incoming file in this format:

Duplicate keyIDs are allowed, and within each keyID the emp_id values are unique.

keyID emp_id col3

101 1043 A
101 1059 B
101 1068 D
101 1069 E

102 1042 R
102 1041 T

103 1043 Y
103 1000 U

104 1059 O
104 1071 L
104 1072 K
104 1073 J

My expected output should remove the duplicate keyIDs and filter the data based on emp_id. I only care about the emp_ids whose values are 1043 and 1059.

My expected output should look like this:

keyID emp_id_1 emp_id_2 col3_1 col3_2

101 1043 1059 A B
102 - - - -
103 1043 - Y -
104 - 1059 - O

In my output the duplicate keyIDs are removed, and the emp_id_1 and emp_id_2 values are only ever null, 1043 or 1059.

Can anyone please suggest how to achieve this in DataStage?

Thanks
Mark
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Build the output rows in a Transformer stage and use an output constraint of emp_id = 1043 or emp_id = 1059. (I've assumed your input is sorted as shown - if not, sort it.)

Some of these rows will be partially incomplete. The final row in each group (that is, for each keyID) will be complete. Downstream of the Transformer stage use a Remove Duplicates stage to select the Last member of each group.
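
A rough Python sketch of that logic, purely to illustrate the idea (DataStage jobs are built in the Designer, so this is not DataStage syntax; the column names and sample values are taken from the post above):

[code]
# Sorted input: (keyID, emp_id, col3), all read as strings from the flat file.
rows = [
    ("101", "1043", "A"), ("101", "1059", "B"), ("101", "1068", "D"), ("101", "1069", "E"),
    ("102", "1042", "R"), ("102", "1041", "T"),
    ("103", "1043", "Y"), ("103", "1000", "U"),
    ("104", "1059", "O"), ("104", "1071", "L"), ("104", "1072", "K"), ("104", "1073", "J"),
]

built = []                       # rows leaving the "Transformer"
prev_key = None
for key, emp, col3 in rows:
    if key != prev_key:          # new keyID group: reset the carried values
        e1 = e2 = c1 = c2 = None
        prev_key = key
    if emp == "1043":
        e1, c1 = emp, col3
    elif emp == "1059":
        e2, c2 = emp, col3
    if emp in ("1043", "1059"):  # the output constraint described above
        built.append((key, e1, e2, c1, c2))

# "Remove Duplicates, keep Last": the last row built for a keyID is the most complete one.
last = {}
for row in built:
    last[row[0]] = row
for row in last.values():
    print(row)
[/code]

Note that with this constraint a keyID that has neither emp_id (102 in the sample) produces no output row at all; that case is picked up later in the thread.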
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pavan_test
Premium Member
Posts: 263
Joined: Fri Sep 23, 2005 6:49 am

Filter data based on emp_id

Post by pavan_test »

[quote="ray.wurlod"]Build the output rows in a Transformer stage and use an output constraint of emp_id = 1043 or emp_id = 1059. (I've assumed your input is sorted as shown - if not, sort it.)

Some of these rows will ...[/quote]

Yes, the input data is sorted in ascending order based on keyID.

Thanks
Mark
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Pass the sorted data into a Transformer. Have two output links: one for emp_id = 1043 and the other for emp_id = 1059.
Specify the same 5 columns in each link:
keyID emp_id_1 emp_id_2 col3_1 col3_2
In the first link map keyID, emp_id_1 and col3_1, and set the remaining fields to null. In the second link map keyID, emp_id_2 and col3_2, and set the rest to nulls/empty strings.
Pass it to an Aggregator, group by keyID and select Max for the rest of the fields.
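
An illustrative Python sketch of this split / funnel / aggregate idea (again not DataStage syntax; empty strings stand in for the nulls here so that Max has something to compare):

[code]
from collections import defaultdict

# Sorted input: (keyID, emp_id, col3) as strings.
rows = [
    ("101", "1043", "A"), ("101", "1059", "B"), ("101", "1068", "D"), ("101", "1069", "E"),
    ("102", "1042", "R"), ("102", "1041", "T"),
    ("103", "1043", "Y"), ("103", "1000", "U"),
    ("104", "1059", "O"), ("104", "1071", "L"), ("104", "1072", "K"), ("104", "1073", "J"),
]

# Two Transformer output links, each carrying the 5 output columns.
link_1 = [(k, e, "", c, "") for k, e, c in rows if e == "1043"]   # emp_id = 1043
link_2 = [(k, "", e, "", c) for k, e, c in rows if e == "1059"]   # emp_id = 1059

# Funnel the links together, then "Aggregator": group by keyID, Max of the rest.
groups = defaultdict(lambda: ["", "", "", ""])
for key, *values in link_1 + link_2:
    groups[key] = [max(old, new) for old, new in zip(groups[key], values)]

for key in sorted(groups):
    print(key, groups[key])
[/code]

As written, keyID 102 never reaches either link, which is the gap raised further down the thread.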
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Remove Duplicates selecting Last (as described in my earlier post in this thread) would be more efficient than Aggregator.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Last would select the last record, wouldn't it? But here the last record won't have all the field values.
I guess I am missing something.
Forgot to mention, the two links from the Transformer should be fed into a Funnel stage and then passed to the Aggregator, or into a Remove Duplicates stage as Ray mentioned.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
pavan_test
Premium Member
Posts: 263
Joined: Fri Sep 23, 2005 6:49 am

Filter data based on emp_id

Post by pavan_test »

[quote="kumar_s"]Last would select the last record isn't? But here last record wont have all the field values.
I guess I am missing some where.
Forgot to mention, the two links of the Transformer should been be Funn ...[/quote]

Hi Guys,

Thank you for the suggestions. I have one more question. When I use a Transformer with a constraint such as
emp_id = 1043 or emp_id = 1059, I will be dropping the record with keyID = 102, because this keyID does not have an emp_id of 1043 or 1059.

My expected output is:

keyID emp_id_1 emp_id_2 col3_1 col3_2

101 1043 1059 A B
102 - - - -
103 1043 - Y -
104 - 1059 - O

I still have to retain the record with keyID 102, but emp_id_1 and emp_id_2 for this keyID can be null/blank, since the emp_id for this keyID (i.e. 102) is neither 1043 nor 1059.

There will be many such records in my actual file, where the emp_id for a keyID may or may not be 1043 or 1059. However, I still have to retain those keyIDs; the emp_id_1 and emp_id_2 values have to be null/blank for them.

Any suggestions?

Thanks & Regards
Mark
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

So extend it with a third link for the other cases (not 1043 or 1059) and perform the same funnel and group by.
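
A small extension of the earlier sketch with that third link (illustrative Python again), so a keyID such as 102 still reaches the Funnel and Aggregator as an all-blank row:

[code]
from collections import defaultdict

rows = [
    ("101", "1043", "A"), ("101", "1059", "B"), ("101", "1068", "D"),
    ("102", "1042", "R"), ("102", "1041", "T"),
    ("103", "1043", "Y"), ("103", "1000", "U"),
    ("104", "1059", "O"), ("104", "1071", "L"),
]

link_1 = [(k, e, "", c, "") for k, e, c in rows if e == "1043"]
link_2 = [(k, "", e, "", c) for k, e, c in rows if e == "1059"]
link_3 = [(k, "", "", "", "") for k, e, c in rows if e not in ("1043", "1059")]  # other cases

groups = defaultdict(lambda: ["", "", "", ""])
for key, *values in link_1 + link_2 + link_3:     # Funnel, then group by keyID / Max
    groups[key] = [max(old, new) for old, new in zip(groups[key], values)]

for key in sorted(groups):
    print(key, groups[key])    # 102 now comes out as ['', '', '', '']
[/code]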
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
pavan_test
Premium Member
Posts: 263
Joined: Fri Sep 23, 2005 6:49 am

Filter data based on emp_id

Post by pavan_test »

[quote="kumar_s"]So extend a third link which is is for other cases (Not 1043 1059) and perform the same funnel and group by. ...[/quote]

So I will have 3 output links coming out of my Transformer, and then I will use a Funnel stage to combine those 3 links into a single output (I want only 1 output coming out of this job).

Each record in my output will then have a unique keyID. My idea is also to remove the duplicate keyIDs, but based on the emp_id values.

Thanks a lot.

Regards
Mark
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

As I already specified in my previous post, you need to use an Aggregator stage to group by the key and find the Max for the rest of the fields.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
pavan_test
Premium Member
Posts: 263
Joined: Fri Sep 23, 2005 6:49 am

Filter data based on emp_id

Post by pavan_test »

[quote="kumar_s"]As i already specified in my previous post, you need to use Aggregator stage to group by the Key and find the Max for the rest of the fields. ...[/quote]

Thanks I will do that.

Regards
Mark
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You do not need an Aggregator stage at all. Given sorted data, gradually building each row in a Transformer stage, and taking the last of each group (sort key value) in a Remove Duplicates stage, is a sufficient solution to the original problem (which did not drop row 102). To drop rows with no non-key values, extend the Transformer constraint expression so that they do not get passed.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

OK, so stage variables need to be used to carry the previous record's not-null values forward so that the last record in each group is fully populated, followed by a Remove Duplicates stage.
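
A minimal Python sketch of that stage-variable idea (illustrative only): carry the 1043/1059 values seen so far within the current keyID group, build a row for every input record, then keep the Last row per keyID, which is what the Remove Duplicates stage does:

[code]
# Sorted input: (keyID, emp_id, col3) as strings.
rows = [
    ("101", "1043", "A"), ("101", "1059", "B"), ("101", "1068", "D"), ("101", "1069", "E"),
    ("102", "1042", "R"), ("102", "1041", "T"),
    ("103", "1043", "Y"), ("103", "1000", "U"),
    ("104", "1059", "O"), ("104", "1071", "L"), ("104", "1072", "K"), ("104", "1073", "J"),
]

out = {}                         # keyID -> last row built for that keyID
prev_key = None
for key, emp, col3 in rows:
    if key != prev_key:          # new group: reset the "stage variables"
        e1 = e2 = c1 = c2 = ""
        prev_key = key
    if emp == "1043":
        e1, c1 = emp, col3
    elif emp == "1059":
        e2, c2 = emp, col3
    out[key] = (key, e1, e2, c1, c2)   # later rows in the group overwrite earlier ones

for row in out.values():
    print(row)                   # keyID 102 is retained with blank emp_id / col3 columns
[/code]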
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
pavan_test
Premium Member
Posts: 263
Joined: Fri Sep 23, 2005 6:49 am

Filter data based on emp_id

Post by pavan_test »

Thanks a bunch guys for all the suggestions.

I ran the job and am able to see the desired output.

Regards
Mark.