Flat file loading on UNIX

rkumar28 · Post by **rkumar28** » Mon Jun 14, 2004 9:29 pm

Hi All,

I have a situation where I am doing the load from flat file to teradata tables using Datastage. I have more than 300,000 rows in the flat file that has about 16 columns. The flat file is pipe dilimited and is on UNIX. This flat file is created from two different source with same metadata.

The problem is third column(10 digit character) got switched with the fourth column(three digit character) somewhere in the middle of the flatfile i.e. somewhere in the file the fourth column became the third column and vice versa. This probably happened when the data from two sources are merged. I am trying to put all the rows with three digit character in fourth column into one file and ten digit character rows in another file. Is there a unix script/command that I can use to divide this file into two different file so that I can load all data sucessfully. Due to some problem I am not able to generate the file again.

I will really appreciate any help in this regards.

Thanks
Raj

chulett · Post by **chulett** » Mon Jun 14, 2004 9:58 pm

Do you really need to create two files? I would think you could work with the file as is, if the mix-up is as simple as it sounds.

Declare both the 3rd and 4th fields as 10 characters on the input side. Check the LENgth of the actual data and swap fields if the data in the 3rd column is 3 digits long. Do this in the middle of your processing job and either continue on to Teradata or write out a new (corrected) file that can then be bulk loaded.

rsrikant · Post by **rsrikant** » Mon Jun 14, 2004 10:43 pm

Hi Craig,

What if the 10 digit character in the 3rd column has a value of length <=3?

If all the rows have a value of length >3 in the 10 digit character column, probably what you said can work out.

By what raj mentioned it's not clear whether all the rows are of length >3 in the 10 digit character column or not.

Correct me if I am wrong.

Thanks,
Srikanth

chulett · Post by **chulett** » Mon Jun 14, 2004 11:03 pm

We shall see, when Raj next visits.

That's why I prefaced things like I did - if the fields are of a consistant length, 10 versus 3, it should be easy to correct.

After that, it will depend on exactly what is in the fields. '10 digit character' doesn't tell me specifically if it is 10 digits or 10 characters, meaning numeric or text or alphanumeric data, or if (as you mentioned) the data is actually of some size other than exactly 3 or 10.

ray.wurlod · Post by **ray.wurlod** » Tue Jun 15, 2004 12:32 am

There's an easy DataStage job solution. Read the file and write another, with four output columns.
Field001 and Field002 are directly derived from input.
Field003 is derived as

Code: Select all

If Len(inlink.Field003) = 10 Then inlink.Field003 Else inlink.Field004

and Field004 is derived as

Code: Select all

If Len(inlink.Field004) = 3 Then inlink.Field004 Else inlink.Field003

A more efficient design would use a stage variable (called, say, svTest) that determines (and is derived as)

Code: Select all

(Len(inlink.Field003) = 10)

Then, output.Field003 is derived as

Code: Select all

If svTest Then inlink.Field003 Else inlink.Field004

and output.Field004 is derived as

Code: Select all

If svTest Then inlink.004 Else inlink.Field003

rsrikant · Post by **rsrikant** » Tue Jun 15, 2004 1:59 am

Vow!!! that's simply a great solution.

Thanks Ray.. I am adding to my favorites. :D

Srikanth

chulett · Post by **chulett** » Tue Jun 15, 2004 6:27 am

Wow Ray? That's exactly what I said. (where's a big pouty face when you need one?)

Ray just provided the actual code to go along with it.

rkumar28 · Post by **rkumar28** » Tue Jun 15, 2004 2:42 pm

Thanks Ray, Craig and Srikant...for providing the info. This really helped.

Also, wanted to share....I found something today to do this in Unix using awk utility:

awk -F'|' '{if(length($3==3)f="outfile1;else f="outfile2;print > f}' INPUT_Source_File

Its look for 3 digit number in column 3 and put data in outfile1 and everything else goes into the outfile2. But I feel doing it with datastage as Ray suggested is more simpler than creating two different files.

chulett wrote:Wow Ray? That's exactly what I said. (where's a big pouty face when you need one?) Ray just provided the actual code to go along with it.

ray.wurlod · Post by **ray.wurlod** » Tue Jun 15, 2004 5:50 pm

You could have used print with $3 and $4 in awk, thereby only needing one output file. But, heck, you've got DataStage.