Lookup stage outputs duplicate rows

ASU_ETL_DEV · Post by **ASU_ETL_DEV** » Sun Mar 16, 2014 6:24 pm

Hello,
I am transitioning from the server to the parallel world so bear with me...

I have two oracle tables: one is used as stream and the other as reference. I need to find out which rows from the stream are NOT in the reference and output them to a target file.

I am using the Oracle connector for both stream and reference (row count on each is under 400) and I am using a Lookup stage with a reject link to capture the failed lookups. I coded for a sparse lookup in the reference Oracle connector.
Since the Lookup stage does not allow me to code only the reject link I have also an output stream link to a flat file for the lookup hits.
What I am seeing is that the job outputs 200k+ rows to the output stream link. In looking at this data I find that there are lots of duplicates (500+) for each input stream row.

Can you please explain this behavior of the Lookup stage?

Thanks.
Marco

ray.wurlod · Post by **ray.wurlod** » Sun Mar 16, 2014 8:02 pm

Duplicates in the stream input link.
Duplicates in the reference input link.
Both of the above.

Why are you using sparse lookup for such a small volume?

ASU_ETL_DEV · Post by **ASU_ETL_DEV** » Sun Mar 16, 2014 9:04 pm

Hi Ray,
There were duplicates in the lookup table. I changed the SELECT to DISTINCT and also changed the lookup type to normal instead of sparse and now it is working correctly.
Thanks.
Marco

yugee · Post by **yugee** » Tue Mar 18, 2014 9:33 pm

Though your reference link has duplicates, unless you specify "multiple rows returned from link" for the reference link, you should not get the duplicates. If your main link has duplicates, then it is possible to get the duplicates.