Interesting Problem: Join / Look up on partial keys

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
JonJon
Participant
Posts: 3
Joined: Mon Dec 03, 2007 7:49 pm

Interesting Problem: Join / Look up on partial keys

Post by JonJon »

Hi there

I have a Master data set with which I do comparisons with a Reference data set.

The issue I'm having is that the Ref data set contains only substring info of the keys used in the Master Data set.

Assuming that I'm only showing Key values for the two Data Sets:

Master Data Set
------------
Aabc
Adh
Defg



Ref Data Set
------------
ab
de
fg


Ideally I want the the 1st and 3rd row in the Master Data Set to be found.

This is because 'ab' from Aabc and 'fg' from Defg are found in the Ref dataset.

AFAIK, DS doesn't allow substring key matches (joins and lookups asks you to specific Keys which basically translates to ='s in SQL, what I want are LIKE or MATCH statements).

One way that I have thought of doing this is to do a Cross join (as I'm writing this, I don't think this is possible in DS?!) between Master and Ref, so end result looks like:

===========
Aabc ab
Aabc de
Aabc fg
Adh ab
Adh de
Adh fg
Defg ab
Defg de
Defg fg
===========

And then perform Index search between the columns, which will work (bolded columns above)

However, the volume of the data in Master (about 30Million) plus a Ref file which at this stage is say 100 rows, will generate a resultant data set of 300 Million rows - not something for the faint of Memory.

Another way to do this is to load into an Oracle database and use regexp_like() on the data.

But I'm working with flat files with no access to DB (at the moment), and DS doesn't natively support regular expressions that I know of.

Does anyone know of any other way to do this. Maybe there's Stage that I haven't encountered which will solve this?

Thanks in advance.
rleishman
Premium Member
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia
Contact:

Post by rleishman »

You should be able to achieve a cross-join by adding an extra column to each dataset - using a constant value on every row - and then join on that column.
Ross Leishman
Post Reply