Read a sequential file. Metadata not known at design time

A forum for discussing DataStage® basics. If you're not sure where your question goes, start here.

hsahay
Premium Member
Posts: 175
Joined: Wed Mar 21, 2007 9:35 am

Read a sequential file. Metadata not known at design time

Post by hsahay »

Hi

I need to load data from about 20-25 sequential files with different structures into one Oracle table. Later more sequential files may be added, and they will all have different structures.

Normally I would write a DataStage job for each sequential file, in which I will have -
1. a sequential file stage
2. a transformer and
3. the target Oracle stage

and end up with 20-25 different DataStage jobs.

But I was wondering if it is possible to store the file structure in an Oracle table and then write just one DataStage job which, at run time, gets the metadata required by the sequential file stage and the mapping required by the transformer from the database, and somehow reads the file and loads the target table.


For example, I'd pass the filename, location and filetype as parameters to the job.

The job will then retrieve the metadata for the given filetype from an Oracle table whose structure is as below, and then read the file -

tblMetadata
----------------

filetype  fieldname  fieldtype  fieldsize  fieldprecision
----------------------------------------------------------
1         name       varchar2   50
1         age        numeric    2
1         salary     double     8          2

Which field of filetype 1 maps to which field of the target Oracle table (sometable) is stored in another table, as below -

tblMapping
------------

filetype  srcField  tgtField
---------------------------------
1         name      performerName
1         age       performerAge
1         salary    performerSalary

Any ideas how this can be accomplished, or if it can be done at all? Server or parallel, I don't care.
vishal
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You can always read a sequential file by defining its "records" (lines) as a single string (VarChar data type). But you will need to know the metadata before you can do anything sensible such as parse the line into individual columns.

If you were using a parallel job, the metadata could be dynamic and established at run time through use of a schema file. But you still need to know the metadata to construct the schema file. This mechanism is not available for server jobs.
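
To tie this back to the original idea: the schema file itself could be generated at run time from the tblMetadata table described in the first post. The following is only a rough, untested sketch - the connection string, the fieldorder column (you would need some column recording field order, since position matters in a flat file) and the blanket STRING typing are all assumptions, not a finished implementation.

build_schema.sh (sketch)
------------------------
#!/bin/ksh
# Build an osh schema file for one filetype from tblMetadata.
# Every field is read as a bounded string for simplicity; refine the type
# mapping if you need real dates or decimals at this point.
FILETYPE=$1
SCHEMA_FILE=/tmp/filetype_${FILETYPE}.sch

{
  echo "record"
  echo "{final_delim=end,delim=','}"
  echo "("
  sqlplus -s scott/tiger@orcl <<EOF
set heading off feedback off pagesize 0
select upper(fieldname) || ':STRING[max=255];'
from   tblMetadata
where  filetype = ${FILETYPE}
order  by fieldorder;
EOF
  echo ")"
} > ${SCHEMA_FILE}

The generic job would then point its sequential file stage at ${SCHEMA_FILE}, passed in as a job parameter.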
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
hsahay
Premium Member
Posts: 175
Joined: Wed Mar 21, 2007 9:35 am

Post by hsahay »

Thank you so much Ray.

I was able to use the schema file concept to load files. It works, but only up to a point. I am not sure whether that is because of a limitation of the tool or a limitation of my understanding :-)

This is what I did.

testfile1.dat
---------------------
abc1, 38, 10000
bbc1, 32, 15000
cbc1, 31, 17000

testfile1.sch
--------------------
record
{final_delim=end,delim=','}
(
NAME:STRING[max=255];
AGE:STRING[max=255];
SALARY:STRING[max=255];
)

Oracle table EMP
-------------------
name varchar2(30)
age numeric(2)
salary numeric(8,2)
ssn numeric(9)

I created a parallel job with a sequential file stage and an Oracle stage. Set the sequential file stage to read the metadata from the schema file (testfile1.sch). Turned on runtime column propagation. Did not have to define any metadata for the Oracle stage. Set the Oracle stage to the LOAD method and voila ... I was able to load the file's data into the table. Great so far.
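
For what it's worth, in this setup the data file and schema file names are job parameters (as outlined in the first post), so the job gets started from the command line roughly like this - the project, job and parameter names below are just placeholders, not the real ones:

dsjob -run \
      -param pDataFile=/data/in/testfile1.dat \
      -param pSchemaFile=/data/in/testfile1.sch \
      -jobstatus MyProject GenericFileLoad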

Then I created another file in which I introduced a field (the third field) whose value should be discarded during the load; the Oracle table does not need it.

testfile2.dat
---------------------
abc2, 38, xxx, 10000
bbc2, 32, yyy, 15000
cbc2, 31, zzz, 17000

testfile2.sch
--------------------
record
{final_delim=end,delim=','}
(
NAME:STRING[max=255];
AGE:STRING[max=255];
JUNK:STRING[max=255];
SALARY:STRING[max=255];
)

This is where I ran into a problem. Since, I think, DataStage maps input columns to target fields by their names (or position?), the job aborted saying the describe (for the Oracle stage) failed because of a missing column "JUNK".

hmmmmm

So the next thing I did was add the metadata from table EMP into the Oracle stage. This time it was able to load the testfile2 data. It gave a warning about the missing SSN field but did not complain about the JUNK field.

Okay great.

Now I created a third file and added an SSN field to it, with 9-character values 111111111, 222222222, and 333333333.

testfile3.dat
---------------------
abc3, 38, xxx, 10000,111111111
bbc3, 32, yyy, 15000, 222222222 (notice the space after the last comma)
cbc3, 31, zzz, 17000, 333333333 (notice the space after the last comma)

testfile3.sch
--------------------
record
{final_delim=end,delim=','}
(
NAME:STRING[max=255];
AGE:STRING[max=255];
JUNK:STRING[max=255];
SALARY:STRING[max=255];
SSN:STRING[max=255];
)

This file failed on the 2nd and 3rd records because of the space between the last comma and the beginning of the SSN value: apparently the value read from the sequential file becomes 10 characters long, but the field is defined as only 9 characters in the Oracle table.

To fix this issue I modified the job to use UPSERT instead of LOAD, changed it to user-defined insert and update, and used the TRIM function on the SSN field. With this change I was able to load the third file as well.

However, this version of the job fails to load testfile1, because now it says the SSN field is not found in the input data. Pffffft !!!!

So basically, there is no one way that works on all three kinds of data files that I may get.

To summarize -

1. I can get a file with three fields, all of which must be loaded into the table, as in testfile1.

2. I can get a file with 4 fields, out of which fields 1, 2 and 4 should be loaded into the table.

3. I can get a file where the values may have spaces around them.

How do I write a job that can handle all three scenarios mentioned above?
vishal
hsahay
Premium Member
Posts: 175
Joined: Wed Mar 21, 2007 9:35 am

Post by hsahay »

Hi Ray

I kinda made the third version of the job work by parameterizing the insert and the update statements. I am not sure how much value it adds to what I am doing, though.

So basically, I can trim values only if I have the metadata for the Oracle stage defined at design time, use user-generated inserts, and then change those inserts to trim the incoming values.

But if all the fields defined for the Oracle stage are not present in the input file (for example, testfile1 does not have the SSN field), it aborts.

But if I parameterize the inserts so that the insert statement only has the fields that are present in the incoming file, then I can get around the abort.

The inserts can be stored in a configuration table in Oracle and sourced at run time by the Unix script that invokes this job; a rough sketch of such a wrapper follows the SQL examples below.

If you can figure out a better way please do let me know.

For testfile1
--------------
INSERT INTO test (NAME, AGE, SALARY) VALUES (trim(ORCHESTRATE.NAME), trim(ORCHESTRATE.AGE), trim(ORCHESTRATE.SALARY))

UPDATE test SET NAME = trim(ORCHESTRATE.NAME), AGE = trim(ORCHESTRATE.AGE), SALARY = trim(ORCHESTRATE.SALARY) WHERE 1=2

For testfile2
--------------
INSERT INTO test (NAME, AGE, SALARY) VALUES (trim(ORCHESTRATE.NAME), trim(ORCHESTRATE.AGE), trim(ORCHESTRATE.SALARY))
UPDATE test SET NAME = trim(ORCHESTRATE.NAME), AGE = trim(ORCHESTRATE.AGE), SALARY = trim(ORCHESTRATE.SALARY) WHERE 1=2


For testfile3
--------------
INSERT INTO test (NAME, AGE, SALARY, SSN) VALUES (ORCHESTRATE.NAME, ORCHESTRATE.AGE, ORCHESTRATE.SALARY, trim(nvl(ORCHESTRATE.SSN,0)))

UPDATE test SET NAME = ORCHESTRATE.NAME, AGE = ORCHESTRATE.AGE, SALARY = ORCHESTRATE.SALARY, SSN = ORCHESTRATE.SSN WHERE 1=2
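
A rough sketch of the wrapper idea - the config table (tblLoadSQL), its columns, the connection string and the job/parameter names are all made up for illustration, so treat this as a starting point rather than a working script:

load_file.sh (sketch)
---------------------
#!/bin/ksh
# Look up the user-defined INSERT and UPDATE statements for a filetype in a
# config table and hand them to the generic job as parameters.
FILETYPE=$1
DATAFILE=$2

INS_SQL=$(sqlplus -s scott/tiger@orcl <<EOF
set heading off feedback off pagesize 0 linesize 4000
select insert_sql from tblLoadSQL where filetype = ${FILETYPE};
EOF
)

UPD_SQL=$(sqlplus -s scott/tiger@orcl <<EOF
set heading off feedback off pagesize 0 linesize 4000
select update_sql from tblLoadSQL where filetype = ${FILETYPE};
EOF
)

dsjob -run \
      -param pDataFile=${DATAFILE} \
      -param pInsertSQL="${INS_SQL}" \
      -param pUpdateSQL="${UPD_SQL}" \
      -jobstatus MyProject GenericFileLoad

In the Oracle stage the user-defined insert and update properties would then just reference the #pInsertSQL# and #pUpdateSQL# job parameters.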
vishal
thupallireddy
Participant
Posts: 16
Joined: Wed Apr 30, 2014 1:27 am
Location: Bengaluru

Post by thupallireddy »

Vishal -- did you get a solution for the above requirement?