Dynamic Metadata

dsdev750
Charter Member
Posts: 16
Joined: Sat Jun 04, 2005 10:19 am

Dynamic Metadata

Post by dsdev750 »

Hi,

I have a requirement to load a file containing dynamic metadata. For one thing, the column names are included in the same file that holds the data. Thus, the sequential file (info.20060710) looks like:

FILE_START
FILE_NAME=info.20060710
REC_TYPE=NEW_EMPLOYEE

FIELDS_START
ID
NAME
DOB
CITY
#Comment

#
MARITAL_STATUS
FIELDS_END

DATA_START
1|John Her|19710101|PENN| |
REC_COUNT=1
DATA_END


FILE_END

The problem is that tomorrow the file info.20060711 could have the CITY column missing. The next day MARITAL_STATUS could be missing. This could happen to any field except the ID and NAME fields.

I understand that it is difficult to handle dynamic metadata in DataStage. Is it possible to handle such a file in DataStage? Is it advisable to do such jobs in DataStage?


Thanks.
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

DataStage is metadata driven, or so I have heard time and again. So it would be advisable to do this kind of manipulation outside DataStage, to make the file ready for DataStage to pick up.
Maybe a shell script that detects which columns are missing and then adds them accordingly. This way you can create a job that has static metadata, and even if the file changes, the script can force it back to that static layout.
Might not be the best way, but it is one of the ways.
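
Something along these lines, sketched in Python rather than shell just to show the logic. The full column list and the default value are placeholders that you would have to agree with the business:

Code:

#!/usr/bin/env python3
# Rough sketch of the pre-processing idea above, in Python rather than shell.
# EXPECTED and DEFAULT are placeholders -- the real values are a business decision.

EXPECTED = ["ID", "NAME", "DOB", "CITY", "MARITAL_STATUS"]  # full static layout
DEFAULT = ""                                                # value for any missing column

def normalise(in_path, out_path):
    with open(in_path) as f:
        lines = [line.rstrip("\n") for line in f]

    # Collect the column names listed between FIELDS_START and FIELDS_END,
    # skipping comments (#...) and blank lines.
    fields, in_fields = [], False
    for line in lines:
        if line == "FIELDS_START":
            in_fields = True
        elif line == "FIELDS_END":
            in_fields = False
        elif in_fields and line.strip() and not line.startswith("#"):
            fields.append(line.strip())

    # Rewrite the data rows so every EXPECTED column is present, in order.
    with open(out_path, "w") as out:
        in_data = False
        for line in lines:
            if line == "DATA_START":
                in_data = True
            elif line == "DATA_END":
                in_data = False
            elif in_data and "|" in line:
                values = dict(zip(fields, line.split("|")))
                line = "|".join(values.get(col, DEFAULT) for col in EXPECTED)
            out.write(line + "\n")

normalise("info.20060710", "info.20060710.fixed")

The fixed-up file then feeds a job with ordinary static metadata.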
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The quote from Chuck Reid in DSGuru2B's post is entirely apposite here.

Can you create a specification of how you might determine what is present and what is missing? If so, you might be able to determine techniques from that.

How would it affect your source-to-target data mapping if this were the case? What business rules apply? Would you need to notify end users that their LOCATION dimension is missing CITY? How do you think they would react?

What you really need to do is to beat up on the data providers so that data are never missing - only whole rows are ever missing. DataStage can deal extremely efficiently with those! :lol:
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
anujgarg
Participant
Posts: 38
Joined: Sun Jun 26, 2005 11:17 pm

Post by anujgarg »

You can use Runtime Column Propagation (RCP) for this.
But you first need to separate the metadata from the records, then create a record schema file from the metadata and use that schema file for the dataset.
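
For example, something along these lines could build the schema file from the FIELDS_START/FIELDS_END block. This is only a rough Python sketch; the property clause and the all-string types are assumptions, so compare it against a schema file exported from one of your own jobs:

Code:

# Rough sketch: build a record schema file from the FIELDS_START/FIELDS_END block,
# treating every column as a nullable string. The property clause and the type
# choices are assumptions -- check them against a schema exported from your own job.

def extract_fields(path):
    """Return the column names listed between FIELDS_START and FIELDS_END."""
    fields, in_fields = [], False
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if line == "FIELDS_START":
                in_fields = True
            elif line == "FIELDS_END":
                break
            elif in_fields and line and not line.startswith("#"):
                fields.append(line)
    return fields

def write_schema(fields, schema_path):
    with open(schema_path, "w") as out:
        out.write("record\n  {final_delim=end, delim='|'}\n(\n")
        for name in fields:
            out.write(f"  {name}: nullable string[max=255];\n")
        out.write(")\n")

write_schema(extract_fields("info.20060710"), "info.20060710.schema")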
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Not sure how RCP would help here; the problem is that the column itself is missing from the data (if I read the question correctly). RCP is for when you don't want to physically include the column definition in the metadata for a link.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
thompsonp
Premium Member
Posts: 205
Joined: Tue Mar 01, 2005 8:41 am

Post by thompsonp »

It is possible to deal with this type of problem in DataStage but it is preferable to have fixed metadata.

I'm not sure about the format of your file. Is there a header row with the file details, then a row stating what the columns are, then data rows?

We have written jobs that read in a file and, based on the list of supplied columns, map the data rows to a fixed-format output. This fixed-format file is then used in subsequent jobs. We hold the mapping of possible columns to the fixed format in the database and perform the mapping in a BuildOp.

Obviously the columns that may appear in the file one day and not the next would have to be nullable in the fixed format, or at least you would need some rules about what value to default them to.
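
Purely as an illustration of the per-row logic (the real thing is a BuildOp, and the column lists come from the database), here is a rough Python sketch in which FIXED_LAYOUT and its defaults stand in for that database table:

Code:

# Illustration only: the real mapping runs in a BuildOp, with the column mapping
# held in a database table. FIXED_LAYOUT and its defaults are placeholders
# standing in for that table.

FIXED_LAYOUT = {              # target column -> default when the file omits it
    "ID":             None,   # mandatory, no default
    "NAME":           None,   # mandatory, no default
    "DOB":            "",
    "CITY":           "",
    "MARITAL_STATUS": "",
}

def map_row(supplied_columns, raw_row, delim="|"):
    """Map one delimited data row onto the fixed output layout."""
    values = dict(zip(supplied_columns, raw_row.split(delim)))
    out = []
    for col, default in FIXED_LAYOUT.items():
        if col in values:
            out.append(values[col].strip())
        elif default is not None:
            out.append(default)
        else:
            raise ValueError(f"mandatory column {col} missing from input")
    return delim.join(out)

# A day when CITY is absent from the file:
print(map_row(["ID", "NAME", "DOB", "MARITAL_STATUS"], "1|John Her|19710101|S"))
# -> 1|John Her|19710101||S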
dsdev750
Charter Member
Posts: 16
Joined: Sat Jun 04, 2005 10:19 am

Post by dsdev750 »

Thanks everybody for responding.
I guess the final answer is to somehow find out the missing columns and update the file with default values for those columns. Once the file has been updated, design the DS job in such a way that it reads the data section with the complete static metadata. Right?

Thanks again.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Static metadata certainly makes the DataStage stuff much easier to do.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Or read the file and write a schema file for its structure. Use an RCP-enabled job, with the schema file that changes every day as the source.

Or define the metadata as ID, NAME and the rest all varchar(225). Have a transformer detect the missing columns and distribute the values to the right columns accordingly.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'