lookup duplicates

harryhome · Post by **harryhome** » Mon Oct 03, 2011 3:54 pm

The lookup input is a distinct combination of key1 and key2. Both key1 and key2 are matched against the same column of the same reference dataset.

key1 has no duplicates, but key2 does. When the lookup is performed, I get a different output record count on every run.
The reference data has no duplicates.
How can I get the correct output?

ray.wurlod · Post by **ray.wurlod** » Mon Oct 03, 2011 5:37 pm

Can you give us an example? I'm not quite clear from your description.

harryhome · Post by **harryhome** » Mon Oct 03, 2011 8:05 pm

example
input
key1 key2
A Q
B B
C B
D A

Lookup
A 1
B 2
C 3
D 4
Q 5

Expected output
1 5
2 2
3 2
4 1
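
For readers following along, the expected result above can be sketched outside DataStage as two independent lookups against one reference. This is only an illustration of the intended logic, not the job itself; the names `reference`, `stream` and `lookup_both` are made up for the example.

```python
# Reference data from the example: one key column, one value column.
reference = {"A": 1, "B": 2, "C": 3, "D": 4, "Q": 5}

# Stream rows (key1, key2) from the example.
stream = [("A", "Q"), ("B", "B"), ("C", "B"), ("D", "A")]


def lookup_both(rows, ref):
    """Resolve key1 and key2 against the same reference.

    Produces exactly one output row per input row; a missing key
    yields None instead of dropping the row.
    """
    out = []
    for key1, key2 in rows:
        out.append((ref.get(key1), ref.get(key2)))
    return out


result = lookup_both(stream, reference)
for row in result:
    print(*row)
```

Because each key is resolved on its own, duplicates in key2 do not change the row count: four rows in, four rows out.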

SURA · Post by **SURA** » Mon Oct 03, 2011 9:08 pm

Use a Copy stage on the reference and take two outputs from it (pass both as reference inputs to the Lookup stage), then do the lookup.

DS User

harryhome · Post by **harryhome** » Mon Oct 03, 2011 9:29 pm

I am already using a Copy stage on the reference, but I still get the same problem: a different output count on every run.

harryhome · Post by **harryhome** » Mon Oct 03, 2011 10:00 pm

To add, I use hash partitioning and perform a unique sort on the input and on the two reference links.

SURA · Post by **SURA** » Mon Oct 03, 2011 10:25 pm

Going by your sample data, if you use Auto partitioning you should not have this problem.
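
A rough sketch of why partitioning might matter here. It assumes a two-partition run, a toy hash function, a hypothetical reference with keys A to H, and that rows failing either lookup are dropped; none of these are confirmed in this thread. It compares giving every partition the full reference (which, as far as I know, is what Auto normally does for Lookup reference links) with hash partitioning the reference on its own key while the stream is hashed on key2.

```python
NUM_PARTITIONS = 2  # assumption: a two-node configuration


def part(key):
    # Toy deterministic hash; the engine's real hash function is different.
    return sum(ord(c) for c in key) % NUM_PARTITIONS


# Stream rows (key1, key2) taken from the later post in this thread.
stream = [("A", "A"), ("B", "A"), ("C", "A"), ("D", "H"), ("E", "G"), ("F", "A")]

# Hypothetical reference: keys A..H mapped to 1..8.
reference = {k: i for i, k in enumerate("ABCDEFGH", start=1)}


def run(entire):
    """Simulate the lookup per partition, dropping rows on any lookup miss."""
    out = []
    for p in range(NUM_PARTITIONS):
        # Stream is hash partitioned on key2.
        rows = [r for r in stream if part(r[1]) == p]
        if entire:
            ref = reference  # every partition sees the whole reference
        else:
            # Reference hash partitioned on its own key.
            ref = {k: v for k, v in reference.items() if part(k) == p}
        for key1, key2 in rows:
            if key1 in ref and key2 in ref:
                out.append((ref[key1], ref[key2]))
    return sorted(out)


print(len(run(entire=True)), len(run(entire=False)))
```

In this toy setup the key2 lookup always finds its row, but the key1 lookup can miss because the matching reference row sits in another partition. The sketch only shows lost rows; a count that changes from run to run would additionally need some nondeterministic partitioning, which is not modelled here.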

DS User

ray.wurlod · Post by **ray.wurlod** » Mon Oct 03, 2011 11:31 pm

Please confirm that your job design looks like this:

               +-------+
               |       |
               |  Ref. |
               |       |
               +---+---+
                   |
                   V
               +-------+
               |       |
               |  Copy |
               |       |
               +-+---+-+
                 |   |
           ref1  |   |  ref2
                 V   V
               +-------+
               |       |
     ------>   |Lookup |  ------->
     stream    |       | 
               +-------+

harryhome · Post by **harryhome** » Tue Oct 04, 2011 11:49 am

Yes Ray, it looks exactly like that: one reference, one Copy, two reference links and the Lookup.

Now, when I give the lookup the following input:
key1 key2
A A
B A
C A
D H
E G
F A

I get different output rows.

I am hash partitioning and sorting the stream on key column key2.

ray.wurlod · Post by **ray.wurlod** » Tue Oct 04, 2011 3:07 pm

So, what output ARE you getting?

How are the columns mapped on the output of the Lookup stage?