Common module to access reference data

Post questions here relating to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
Well, you could build a shared container that takes attr1 and x as input, passing them as key columns to an OCI stage which, via user-defined SQL, gets the data you need with relatively simple SQL.
You might need to add handling for the case where attr1 is not found, etc.
This would be the simplest way, off the top of my head.
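For illustration, the user-defined SQL on the reference link could be as simple as something like this (just a sketch using the generic attr1/x names from your post; the :1 and :2 markers are the bind placeholders the OCI stage normally substitutes for the key columns, in order):

    -- :1 = attr1, :2 = x (the value to test against the range)
    SELECT t2.attr2,
           t2.attr3
      FROM table2 t2
     WHERE t2.attr1 = :1
       AND :2 BETWEEN t2.attr2 AND t2.attr3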

Others will probably have other ideas.

There is always the option of unloading that table into a hash file and getting it via a lookup, but I don't know whether that works performance-wise in your specific case of a common module.

Good Luck
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting :)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

I think I understand your requirement, but if you could state it using your business rules and actual table and column names, it might help. Visualizing variables a, b, c, d, e, etc. is difficult, so real terms will help.

I don't know that the solution will end up looking the way you have presented it; there may actually be a simpler way of doing what you want. Please state a more complete functional requirement rather than a technical one.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
johnno
Participant
Posts: 50
Joined: Wed Mar 05, 2003 5:33 am

Post by johnno »

Sorry guys, I didn't want to confuse the issue, but never mind. This job processes music played by radio stations.

Table1 is our BROADCASTERS Oracle table with attributes TP_SURR_ID, DN_TP_NAME, STATION_ID, GC_TRANS_TYPE_TABLE, GCTRANS_TYPE_CODE and FULL_SAMPLE. Table2 is our BROADCASTERS_SAMPLE_DATES Oracle table with attributes SURR_ID, BRO_TP_SURR_ID, START_DATE, END_DATE, USED_FOR_PURPOSE.

BROADCASTERS.TP_SURR_ID links to BROADCASTERS_SAMPLE_DATES.BRO_TP_SURR_ID.

Basically we validate the station id early in the process, but when we want to load the usage we need to make sure it is for a valid sample date. If BROADCASTERS.FULL_SAMPLE = "Y" we don't need to check the sample dates, as we process all music played on that station. If BROADCASTERS.FULL_SAMPLE = "N" we need to check that the date we are currently processing is a valid sample date (we only process certain dates for these stations). As there can be many dates in a single file, we only want to make this check on a change in date value in the input file.
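In SQL terms, the rule I am trying to implement is roughly this (just to illustrate the logic, not how the job is built; :1 stands for the station's TP_SURR_ID and :2 for the date currently being processed):

    SELECT 'Y'
      FROM BROADCASTERS b
     WHERE b.TP_SURR_ID = :1
       AND (b.FULL_SAMPLE = 'Y'
            OR EXISTS (SELECT 1
                         FROM BROADCASTERS_SAMPLE_DATES d
                        WHERE d.BRO_TP_SURR_ID = b.TP_SURR_ID
                          AND :2 BETWEEN d.START_DATE AND d.END_DATE))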

Hope this clears things up.

Roy, I thought of using a shared container, but how do you control the flow to and from it so that you only perform the check when you need to?

Thanks
Johnno
trobinson
Participant
Posts: 208
Joined: Thu Apr 11, 2002 6:02 am
Location: Saint Louis

Post by trobinson »

A shared container is one solution, although be aware that it brings some baggage with it. For example, two developers cannot edit the shared container at the same time. A minor problem if handled correctly, but still annoying if both are trying to debug or compile their jobs.

1. Create a shared container with an OCI stage.
2. Use input/output links defined as reference links.
3. Define user-defined SQL with Attr1 as a key field, returning Attr2 and Attr3.
4. The input reference link takes Attr1 from the job.
5. A transform will determine whether the value from the job falls between the returned Attr2/Attr3. If so, return 'Y' down the output reference link, else 'N'. Alternatively you can do the "between" in the SQL itself (see the sketch below), although the loss of really good metadata is not something I would publicly recommend.
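If you did push the "between" into the SQL, the user-defined statement might look something like this (a sketch only, using johnno's real table names; :1 and :2 are the key-column placeholders the OCI stage fills in from the input reference link):

    -- :1 = BRO_TP_SURR_ID, :2 = the date being processed
    SELECT 'Y' AS VALID_SAMPLE
      FROM BROADCASTERS_SAMPLE_DATES d
     WHERE d.BRO_TP_SURR_ID = :1
       AND :2 BETWEEN d.START_DATE AND d.END_DATE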
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

There are a couple of approaches you could try. I don't know your data volumes, so I don't know whether you need to divide your dataset to eliminate an unnecessary hash lookup.

You might consider a hash file for when BROADCASTERS.FULL_SAMPLE = "Y" and a second hash file for those with "N". The first hash file just has to be keyed on your identifier. The second hash file could simply be generated with a row for each date in the range for that identifier; the key would be your identifier plus the date. You should be able to generate this second lookup with a SQL query and load it into a hash file.
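For example, something along these lines would explode each range into one row per date (a sketch; the CONNECT BY LEVEL row generator and the 3700-day cap on range length are my assumptions about your environment):

    -- one row per (broadcaster, sample date) for the partial-sample stations
    SELECT d.BRO_TP_SURR_ID,
           d.START_DATE + n.OFFSET AS SAMPLE_DATE
      FROM BROADCASTERS_SAMPLE_DATES d,
           (SELECT LEVEL - 1 AS OFFSET
              FROM DUAL
            CONNECT BY LEVEL <= 3700) n
     WHERE d.START_DATE + n.OFFSET <= d.END_DATE
       AND EXISTS (SELECT 1
                     FROM BROADCASTERS b
                    WHERE b.TP_SURR_ID = d.BRO_TP_SURR_ID
                      AND b.FULL_SAMPLE = 'N')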

Your job design will do two hash lookups. If the row is found in the first hash file, you don't need to check the result of the second lookup; the job will still perform both lookups, and it is up to you to use the result you require. If the row isn't found in the first lookup, then look at the result of the second one.
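The pick-the-result logic then boils down to a single derivation in the transformer, something like this (a sketch; FullSample and SampleDate stand for whatever you name the two reference links, and NOTFOUND is the link variable the Server transformer provides for each lookup):

    If Not(FullSample.NOTFOUND) Then "Y"
    Else If Not(SampleDate.NOTFOUND) Then "Y"
    Else "N"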

Now, if you want to "containerize" the transformer with the two hash lookups and then use that container as a reusable object, you can: just rubber-band the design logic and create a shared container from it.

The overhead of the double hit probably won't be too bad. You would have to give me the row counts, but I suspect you can do this.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
johnno
Participant
Posts: 50
Joined: Wed Mar 05, 2003 5:33 am

Post by johnno »

Thanks a lot guys. I will have a go at doing this today and let you know the results.