Stripping header and trailer records from input files

wjfitzgerald · Post by **wjfitzgerald** » Thu Jun 04, 2009 9:50 am

Hi,

There are many topics out there which almost answer my question, but none which actually do. Hopefully someone will be good enough to give me a hand.

I have a job that processes a number of files on each run. The job currently starts with a sequential file read by file pattern. Each file consists of a header record (record number 1) and a trailer record (the last record). All other records are data records. The data records are comma delimited.

When I run the job, the read rejects the headers and trailers as they do not match the metadata. This currently writes a number of warnings to the logs which, if I am processing enough files, blows the warning limit and aborts the job.

I can reject the header and trailer of course, but this still writes the warnings to the logs. Is there some way of stopping the warnings when writing to the logs?

Alternatively I can preprocess the file to remove the header and trailer. Unfortunately, as the read is by file pattern, the file filter option is not available and so I cannot use a sed command to remove the first and last record.

I also tried to run the data through a filter stage; that led to further data rejects in the initial read.

Could anyone save me from going bald by giving me a pointer or two please?

Thanks, as always.

John Fitz

nagarjuna · Post by **nagarjuna** » Thu Jun 04, 2009 11:05 am

If the number of files being read through the file pattern is equal to the number of nodes used, then you can use the row number and file name options in the Sequential File stage and thereby filter the header and trailer in a Transformer stage. Otherwise it's better to write a unix script and call it from a before-job routine.
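
The unix script route could look roughly like the sketch below. It assumes the header really is line 1 and the trailer really is the last line of every file; the function name, file names and sample records are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print every record of each file except its first and last line.
strip_first_and_last() {
    for f in "$@"; do
        # One sed run per file, so "1" and "$" mean this file's own
        # first and last line rather than those of the combined stream.
        sed -e '1d' -e '$d' "$f"
    done
}

# Demo on two throwaway files (contents are invented for illustration).
dir=$(mktemp -d)
printf 'HR,20090604\n1,a\n2,b\nTR,2\n' > "$dir/in_1.csv"
printf 'HR,20090604\n3,c\nTR,1\n' > "$dir/in_2.csv"
strip_first_and_last "$dir"/in_*.csv > "$dir/combined.csv"
cat "$dir/combined.csv"
rm -rf "$dir"
```

The job would then read the combined file in place of the file pattern. Looping per file matters here: handing all the files to a single sed call would treat them as one stream and strip only the very first and very last line overall.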

nagarjuna · Post by **nagarjuna** » Thu Jun 04, 2009 11:11 am

Apart from being the first and last row in a file, are there any identifiers to find the header and trailer in a file?

mail2hfz · Post by **mail2hfz** » Thu Jun 04, 2009 1:09 pm

Maybe you can read the whole record as a single field and filter the header/trailer records downstream.

nagarjuna · Post by **nagarjuna** » Thu Jun 04, 2009 1:21 pm

Yeah, this is the reason I asked about a header or trailer identifier. If there is no identifier other than being the first and last row of a file, we cannot filter them out in a transformer.

ray.wurlod · Post by **ray.wurlod** » Thu Jun 04, 2009 4:19 pm

Gain some information ahead of processing, particularly the line count in the file (wc -l command). You can use that in a Transformer stage (executing in sequential mode or in a server job) to filter on @INROWNUM.

Code:

@INROWNUM <> 1 And @INROWNUM <> paramLineCount
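
A minimal sketch of getting the line count ahead of the run, assuming a single input file. The function name and the sample file are hypothetical, and how the value reaches the job as paramLineCount is not shown.

```shell
#!/bin/sh
# Hypothetical helper: print the number of lines in a file as a bare number.
line_count() {
    # Reading from stdin stops wc from printing the file name;
    # tr strips the padding some wc implementations add.
    wc -l < "$1" | tr -d ' '
}

# Demo on a throwaway file (contents are invented for illustration).
demo=$(mktemp)
printf 'HR,20090604\n1,a\n2,b\nTR,2\n' > "$demo"
line_count "$demo"
rm -f "$demo"
```

Note that wc -l counts newline characters, so a file whose trailer has no terminating newline reports one line fewer than it has records. A single count also only identifies the trailer of one file, so with a file pattern read each file would need its own count.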

wjfitzgerald · Post by **wjfitzgerald** » Fri Jun 05, 2009 1:18 am

Morning,

Thanks to all for the responses.
To answer a few of the queries raised:

1. The header is marked with HR in the first 2 characters and the trailer is marked with TR.
2. I have tried to read it as a single field, but the read is rejecting all the data records.

Is it possible to turn off all the warning messages when writing to a reject file?

Regards,

John Fitz

arvind_ds · Post by **arvind_ds** » Fri Jun 05, 2009 3:21 am

You can set the warning limit to 999999, or select no limit, and try again.

wjfitzgerald · Post by **wjfitzgerald** » Fri Jun 05, 2009 3:25 am

Hi,

I created a separate job to preprocess the files: read each record in as one field, pass the data through a transformer to isolate the first 2 characters, then use this new field in a filter stage to identify headers and trailers (the job might be processing multiple input files), and finally write the data records to a new sequential file.

I then modified the original job to read the new sequential file instead of the multiple input files.

This works, but I cannot help but think that it is fairly inefficient, what with having to create a new file and subsequently delete the same file as part of the process.

Any thoughts on this workaround would be gratefully received.

Regards,

John Fitz

nagarjuna · Post by **nagarjuna** » Fri Jun 05, 2009 5:05 am

Read all the files with the file pattern option, and in the Transformer constraints specify: if field[1,2]='HR' or field[1,2]='TR' then pass the row to one link, else to the output link. Now you have all the files as a single stream without headers and trailers. Hope this is what you are looking for.

wjfitzgerald · Post by **wjfitzgerald** » Fri Jun 05, 2009 5:08 am

Thanks for coming back to me.
That is more efficient in that it would save me the use of the filter stage.

Thanks for the suggestion.

Regards,

John Fitz

Sainath.Srinivasan · Post by **Sainath.Srinivasan** » Fri Jun 05, 2009 5:35 am

Code:

egrep -v '^HR|^TR' yourFileName
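
For several files at once, a sketch along these lines may help. It assumes the HR and TR markers from earlier in the thread and that no data record starts with either pair of letters; the function name, file names and sample records are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: concatenate the given files and drop lines starting with HR or TR.
strip_marked_records() {
    # cat first, so grep sees a single stream and does not
    # prefix each output line with the name of its source file.
    cat "$@" | grep -Ev '^(HR|TR)'
}

# Demo on two throwaway files (contents are invented for illustration).
dir=$(mktemp -d)
printf 'HR,file1\n1,a\n2,b\nTR,2\n' > "$dir/in_1.csv"
printf 'HR,file2\n3,c\nTR,1\n' > "$dir/in_2.csv"
strip_marked_records "$dir"/in_*.csv
rm -rf "$dir"
```

Unlike a positional approach, this does not depend on where the header and trailer sit in each file, but it would also drop any data record that happened to begin with HR or TR.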

ray.wurlod · Post by **ray.wurlod** » Fri Jun 05, 2009 3:26 pm

To read as a single line set both the delimiter and quote characters to "none".