Ignore columns from a sequential file

Posted: Tue Jul 21, 2015 1:39 pm
by rmcclure
Hi,

I am trying to read records from a sequential file to process and write to a table.

The problem is that the sequential file is a global company file, generated by one group for use by multiple divisions. I only need the first 70 columns of about 90. When I run the job I get a warning:
"Import consumed only 1968 bytes of the record's 2073 bytes (no further warnings will be generated from this partition)"
I could add 20 extra columns to make this warning go away, but the number of columns in this file could change if another division requests additional columns.

How can I read a sequential file and just take the first 70 columns without getting warnings?

The file is:
Delimiter = comma
Null field value = ''
Quote = double

Posted: Tue Jul 21, 2015 1:57 pm
by chulett
Worst case you can use a Server Sequential File stage in a Server Shared Container as it has an "Ignore Truncation" option from what I recall. Unless there is something similar in the Parallel version somewhere?

Posted: Tue Jul 21, 2015 2:31 pm
by rkashyap
Another option could be to use External Source stage with Source Program

Code:

cut -d',' -f1-70 <INFILE>

Posted: Tue Jul 21, 2015 2:50 pm
by chulett
... or that in the Filter option of the Sequential File stage.

Posted: Tue Jul 21, 2015 3:30 pm
by ray.wurlod
... or the Drop on Import property for columns in the Sequential File stage itself.

Posted: Tue Jul 21, 2015 3:50 pm
by chulett
But wouldn't you have to define them so it knows what to drop? Or does it drop any not defined?

Posted: Tue Jul 21, 2015 6:21 pm
by ray.wurlod
chulett wrote:But wouldn't you have to define them so it knows what to drop? Or does it drop any not defined?
Yes, you do have to define them.

Posted: Tue Jul 21, 2015 6:28 pm
by chulett
Ah... which they don't want to do.

Posted: Tue Jul 21, 2015 9:52 pm
by ray.wurlod
The columns could be generically named for the purposes of this exercise.

Posted: Mon Jul 27, 2015 1:28 pm
by rmcclure
The cut command did not work because the sequential file has descriptions in quotes, and some of those descriptions contain commas. Others do not.

I ended up adding the generic columns. If they add more columns in the future I will need to modify my job.

Posted: Mon Jul 27, 2015 1:33 pm
by chulett
You didn't try the Server stage? It would not require any modifications when / if the file changes.

Posted: Mon Jul 27, 2015 9:12 pm
by rkashyap
Another option ... use a Unix command to replace the delimiting commas with another delimiter that does not occur in the data (while ignoring commas within quotes), and then use that replacement delimiter to extract the first seventy columns.

See example of using nawk (on Solaris) to replace delimiting commas with pipe(|) and extracting first 70 columns below:

Code:

nawk -F\" 'BEGIN{OFS=FS} {for(i=1;i<=NF;i+=2){gsub(/,/,"|",$i)} print $0}' <infile> | awk -F"|" '{for(i=1;i<=70;i++) printf("%s|", $i); printf("\n")}'
It is possible to merge the nawk and awk commands given above.

I believe that a much simpler/elegant solution can be implemented using Perl.
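For illustration, here is a minimal sketch of that scripting approach in Python rather than Perl (the idea is the same): the standard csv module honors double quotes when splitting fields, which is exactly where cut falls down. The filename and column count are assumptions for the example.

```python
import csv
import sys

# Keep only the first 70 columns of a comma-delimited, double-quoted file.
# csv.reader treats commas inside quoted descriptions as data, not as
# delimiters, so rows keep their shape regardless of embedded commas.
KEEP = 70

def trim_columns(infile, outfile, keep=KEEP):
    reader = csv.reader(infile, delimiter=',', quotechar='"')
    writer = csv.writer(outfile, delimiter=',', quotechar='"')
    for row in reader:
        writer.writerow(row[:keep])

if __name__ == "__main__":
    # e.g.  python trim_columns.py < INFILE > OUTFILE
    trim_columns(sys.stdin, sys.stdout)
```

A script like this could feed the job via the External Source stage or the Filter option, as suggested earlier in the thread; extra columns added later are simply dropped without any job change.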

Posted: Mon Jul 27, 2015 11:15 pm
by chulett
And I believe that a much simpler/elegant solution can be implemented using the Server version of the Sequential File stage. :wink:

Suppress row truncation warnings. If the sequential file being read contains more columns than you have defined, you will normally receive warnings about over-long rows when the job is run. If you want to suppress these messages (for example, you might only be interested in the first three columns and be happy to ignore the rest), select this check box.

Posted: Tue Jul 28, 2015 6:14 am
by rkashyap
I agree. A Server job would also be more platform-independent and may align better with the existing skillset of the shop than awk/Perl ... and thus have a lower lifetime cost.

Though if the OP wants to use a Parallel job, then Unix/Perl offers a viable solution.

Posted: Tue Jul 28, 2015 6:36 am
by chulett
Didn't mean a Server job in this particular case but, as mentioned earlier, a Server Sequential File stage in a Server Shared Container in their otherwise Parallel job.