Interesting Problem: Join / Look up on partial keys

JonJon · Post by **JonJon** » Tue Dec 11, 2007 1:58 am

Hi there

I have a Master data set with which I do comparisons with a Reference data set.

The issue I'm having is that the Ref data set contains only substring info of the keys used in the Master Data set.

Assuming that I'm only showing Key values for the two Data Sets:

Master Data Set
------------
Aabc
Adh
Defg

Ref Data Set
------------
ab
de
fg

Ideally I want the the 1st and 3rd row in the Master Data Set to be found.

This is because 'ab' from Aabc and 'fg' from Defg are found in the Ref dataset.

AFAIK, DS doesn't allow substring key matches (joins and lookups asks you to specific Keys which basically translates to ='s in SQL, what I want are LIKE or MATCH statements).

One way that I have thought of doing this is to do a Cross join (as I'm writing this, I don't think this is possible in DS?!) between Master and Ref, so end result looks like:

===========
Aabc ab
Aabc de
Aabc fg
Adh ab
Adh de
Adh fg
Defg ab
Defg de
Defg fg
===========

And then perform Index search between the columns, which will work (bolded columns above)

However, the volume of the data in Master (about 30Million) plus a Ref file which at this stage is say 100 rows, will generate a resultant data set of 300 Million rows - not something for the faint of Memory.

Another way to do this is to load into an Oracle database and use regexp_like() on the data.

But I'm working with flat files with no access to DB (at the moment), and DS doesn't natively support regular expressions that I know of.

Does anyone know of any other way to do this. Maybe there's Stage that I haven't encountered which will solve this?

Thanks in advance.

rleishman · Post by **rleishman** » Tue Dec 11, 2007 11:56 pm

You should be able to achieve a cross-join by adding an extra column to each dataset - using a constant value on every row - and then join on that column.