Stripping header and trailer record from input files

Posted: Thu Jun 04, 2009 9:50 am
by wjfitzgerald
Hi,

There are many topics out there which almost answer my question, but none which actually do. Hopefully someone will be good enough to give me a hand.

I have a job that processes a number of files on each run. The job currently starts with a sequential file read by file pattern. Each file consists of a header record (record number 1), a trailer record (the last record), and data records in between. The data records are comma delimited.

When I run the job, the read rejects the headers and trailers as they do not match the metadata. This currently writes a number of warnings to the log which, if I am processing enough files, exceeds the warning limit and aborts the job.

I can reject the header and trailer of course, but this still writes the warnings to the log. Is there some way of stopping these warnings from being written to the log?

Alternatively, I can preprocess the file to remove the header and trailer. Unfortunately, as the read is by file pattern, the file filter option is not available, and so I cannot use a sed command to remove the first and last records.

I also tried to run the data through a Filter stage, but that led to further data rejects in the initial read.

Could anyone save me from going bald by giving me a pointer or two please?

Thanks, as always.

John Fitz

Posted: Thu Jun 04, 2009 11:05 am
by nagarjuna
If the number of files being read through the file pattern is equal to the number of nodes used, then you can use the row number and file name options in the Sequential File stage and thereby filter out the header and trailer in a Transformer stage. Otherwise it's better to write a Unix script and call it from a before-job routine.
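
A minimal sketch of the kind of Unix script meant here, assuming the header is always the first line and the trailer always the last line of each file. The function name `strip_header_trailer`, the `in_*.csv` naming, and the sample records are hypothetical, made up for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: print a file without its first and last line.
strip_header_trailer() {
    # sed '1d;$d' deletes line 1 and the final line and prints the rest.
    sed '1d;$d' "$1"
}

# Demo on a throwaway file (sample data is invented).
tmpdir=$(mktemp -d)
printf 'HR20090604\na,b,c\nd,e,f\nTR0002\n' > "$tmpdir/in_1.csv"

# Write a cleaned copy next to each input file matching the pattern.
for f in "$tmpdir"/in_*.csv; do
    strip_header_trailer "$f" > "${f%.csv}.clean"
done

cat "$tmpdir/in_1.clean"
```

The job would then read the `.clean` files by file pattern instead of the originals. This relies purely on position, so it would also work if the header and trailer had no identifying marker.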

Posted: Thu Jun 04, 2009 11:11 am
by nagarjuna
Apart from being the first and last rows in a file, are there any identifiers to find the header and trailer?

Posted: Thu Jun 04, 2009 1:09 pm
by mail2hfz
Maybe you can read the whole record as a single field and filter the header/trailer records downstream.

Posted: Thu Jun 04, 2009 1:21 pm
by nagarjuna
Yes, this is the reason I asked about a header or trailer identifier. If there is no identifier other than being the first and last row of a file, we cannot filter them out in a Transformer.
mail2hfz wrote:May be you can read the whole record as a single field and filter the header/trailer records downstream.

Posted: Thu Jun 04, 2009 4:19 pm
by ray.wurlod
Gain some information ahead of processing, particularly the line count in the file (wc -l command). You can use that in a Transformer stage (executing in sequential mode or in a server job) to filter on @INROWNUM.

Code: Select all

@INROWNUM <> 1 And @INROWNUM <> paramLineCount
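
As a rough sketch of gathering the line count ahead of the run, assuming a single input file whose last record ends with a newline (`wc -l` counts newline characters, so a missing final newline would make the count one too low). The helper name `count_lines` and the sample data are hypothetical; how the value reaches `paramLineCount` depends on how the job is invoked.

```shell
#!/bin/sh
# Hypothetical helper: print the number of lines in a file as a bare integer.
count_lines() {
    # Reading from stdin makes wc print only the number; the arithmetic
    # expansion strips the leading padding that some wc builds add.
    echo $(( $(wc -l < "$1") ))
}

# Demo on a throwaway file (sample data is invented).
tmpfile=$(mktemp)
printf 'HR20090604\na,b,c\nd,e,f\nTR0002\n' > "$tmpfile"

line_count=$(count_lines "$tmpfile")
echo "paramLineCount=$line_count"

rm -f "$tmpfile"
```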

Posted: Fri Jun 05, 2009 1:18 am
by wjfitzgerald
Morning,

Thanks to all for the responses.
To answer a few of the queries raised:

1. The header is marked with HR in the first 2 characters and the trailer is marked with TR.
2. I have tried to read it as a single field, but the read is rejecting all the data records.

Is it possible to turn off all the warning messages when writing to a reject file?

Regards,

John Fitz

Posted: Fri Jun 05, 2009 3:21 am
by arvind_ds
You can set the warning limit to 999999, or select "No limit", and try again.

Posted: Fri Jun 05, 2009 3:25 am
by wjfitzgerald
Hi,

I created a separate job to preprocess the files. It reads each record in as one field, passes the data through a Transformer to isolate the first 2 characters, then uses this new field in a Filter stage to identify headers and trailers (it might be processing multiple input files), and finally writes the data records to a new sequential file.

I then modified the original job to read the new sequential file instead of the multiple input files.

This works, but I cannot help thinking that it is fairly inefficient, what with having to create a new file and subsequently delete it as part of the process.

Any thoughts on this workaround would be gratefully received.

Regards,

John Fitz

Posted: Fri Jun 05, 2009 5:05 am
by nagarjuna
Read all the files with the file pattern option, and in the Transformer constraints specify: if field[1,2]='HR' or field[1,2]='TR' then pass the row to one link, else pass it to the output link. You then have all the files combined in a single file without headers and trailers. Hope this is what you are looking for.

Posted: Fri Jun 05, 2009 5:08 am
by wjfitzgerald
Thanks for coming back to me.
That is more efficient in that it would save me the use of the filter stage.

Thanks for the suggestion.

Regards,

John Fitz

Posted: Fri Jun 05, 2009 5:35 am
by Sainath.Srinivasan

Code: Select all

egrep -v '^HR|^TR' yourFileName
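
A possible extension of the one-liner above to several input files at once, offered only as a sketch. The function name `drop_marked_records` and the sample files are hypothetical. Note the assumption that no data record begins with HR or TR, since any such record would be dropped as well.

```shell
#!/bin/sh
# Hypothetical helper: concatenate the given files and drop every line
# that starts with HR or TR.
drop_marked_records() {
    # cat merges the files first, so grep never prefixes output with file names.
    cat "$@" | grep -Ev '^(HR|TR)'
}

# Demo on two throwaway files (sample data is invented).
tmpdir=$(mktemp -d)
printf 'HR001\n1,2,3\nTR001\n' > "$tmpdir/a.txt"
printf 'HR002\n4,5,6\nTR002\n' > "$tmpdir/b.txt"

drop_marked_records "$tmpdir"/*.txt > "$tmpdir/combined.out"
cat "$tmpdir/combined.out"
```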

Posted: Fri Jun 05, 2009 3:26 pm
by ray.wurlod
To read as a single line set both the delimiter and quote characters to "none".