Erroneous data goes undetected in a Flat File Stage



goffinw
Participant
Posts: 27
Joined: Thu Nov 18, 2004 6:50 am
Location: Belgium

Post by goffinw »

When you read from a file using a flat file stage, it will generate errors when the data cannot be parsed according to the specified schema definition and other format specifications.
Records that don't satisfy the specification can be sent to a rejects link.
This is behaviour we appreciate.
But some data that doesn't correspond to the specifications does not result in rejected records, and that is a problem for us, because some erroneous data goes undetected.
For example, with a fixed-length record type, if you specify that a field has type 'integer' and size 5, then 5 bytes are read and scanned into the integer field as follows:

Flat File field data       -> DataStage integer value received
--------------------          ---------------------------------------
12345                         12345        Correct
123AA                         123          Error for me
 12                           12           Could be seen as an error. Depends.
1A1A1                         1            Error for me
Does anyone know of a way to have stronger data format checking when reading through a Flat File Stage, so that the above examples would be detected as errors?

Thanks in advance,
Wim
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

I think that interpreting "123AA" as an integer is incorrect as well. There are options, though. You can edit the attributes and specify an "In_format" (I can't check now, but "nnnnn" should do the trick). Another alternative is to read these in as CHAR and then use IsValid() or other functions in a transform stage.
Nonetheless I do think that this should be given to your support provider as a case for IBM to fix or at least comment on.
goffinw
Participant
Posts: 27
Joined: Thu Nov 18, 2004 6:50 am
Location: Belgium

Post by goffinw »

ArndW wrote:I think that interpreting "123AA" as an integer is incorrect as well. ... Nonetheless I do think that this should be given to your support provider as a case for IBM to fix ...
This would be consistent with a DataStage strategy of detecting and rejecting any erroneous data. But you can't find such a statement in the manuals, can you? So I'd guess this is not a DataStage error.
ArndW wrote:There are options, though. You can edit the attributes and specify an "In_format" (I can't check now, but "nnnnn" should do the trick).
The "In_format" property, or a similar one, would indeed be the ideal means to specify what you do and don't expect on input. I had looked into it. But I don't think that the two properties currently available, In_format and C_format, can provide the solution here. They specify the format argument of the C function 'sscanf': not something of the kind 'nnnnn', but something like '%5d'.
The functionality of sscanf is exactly the reason why DataStage is unable to detect this error and why these two properties don't help: a conversion such as '%5d' stops at the first character that is not part of a number and still reports a successful conversion, so it is impossible to specify in the sscanf format that a data string '123AA' is to be handled as incorrect.
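To make the behaviour described above concrete, here is a minimal sketch in Python that imitates what a '%5d' scan does with a 5-byte field. It is only an emulation for illustration, not DataStage or Orchestrate code; the function name lenient_int and the regular expression are assumptions of mine, chosen to mirror the documented sscanf rules (skip leading whitespace, then read an optional sign and digits up to the field width, stopping at the first other character).

```python
import re

def lenient_int(field, width=5):
    """Imitate sscanf(field, "%5d", ...): hypothetical helper, not a DataStage API.

    Leading whitespace is skipped, then an optional sign and digits are read
    (at most `width` characters). Scanning stops silently at the first
    character that cannot be part of the number.
    """
    candidate = field.lstrip()[:width]
    match = re.match(r"[+-]?\d+", candidate)
    # No digits at all corresponds to sscanf reporting zero conversions.
    return int(match.group()) if match else None

# The sample fields from the first post; " 12  " is padded to 5 bytes here.
for raw in ["12345", "123AA", " 12  ", "1A1A1", "AAAAA"]:
    print(repr(raw), "->", lenient_int(raw))
```

Only the last field, which contains no leading digits at all, fails to convert in this emulation; the trailing letters in the other fields are dropped without any error.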
ArndW wrote:Another alternative is to read these in as CHAR and then use IsValid() or other functions in a transform stage.
This is probably the only way to solve my problem. But isn't it too bad that I need to add this complexity, just to reach this simple goal?
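The kind of check I would want after reading the field as CHAR could be sketched as follows. This is Python used purely to show the logic, not Transformer syntax, and strict_int and its allow_leading_blanks switch are hypothetical names I made up for the illustration: the whole field must consist of digits, optionally preceded by blanks, and anything else counts as a reject.

```python
def strict_int(field, width=5, allow_leading_blanks=False):
    """Return the integer value of a fixed-width field, or None to reject it.

    Hypothetical helper for illustration only. Every character of the field
    must be an ASCII digit; leading blanks are tolerated only on request.
    """
    if len(field) != width:
        return None
    body = field.lstrip(" ") if allow_leading_blanks else field
    # isascii() guards against non-ASCII digit characters accepted by isdigit().
    if body and body.isascii() and body.isdigit():
        return int(body)
    return None

for raw in ["12345", "123AA", "   12", " 12  ", "1A1A1"]:
    print(repr(raw), "->", strict_int(raw), "/", strict_int(raw, allow_leading_blanks=True))
```

With the switch off, only "12345" is accepted; with it on, "   12" is accepted as well, while a field with trailing blanks or embedded letters is still rejected.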

Thanks for your reactions,
Wim
mdan
Charter Member
Posts: 46
Joined: Mon Apr 28, 2003 4:21 am
Location: Brussels

Post by mdan »

goffinw wrote: Does anyone know of a way to have stronger data format checking when reading through a Flat File Stage, so that the above examples would be detected as errors?

Thanks in advance,
Wim
Hi,
If you use decimal instead of integer (decimal[10,0]), it will work. It looks like the issue comes from the fact that Orchestrate is using sscanf and all the other strto... functions to convert. I'm still looking for a way to enforce the format, but decimal is working (I already did a test).

Dan
goffinw
Participant
Posts: 27
Joined: Thu Nov 18, 2004 6:50 am
Location: Belgium

Post by goffinw »

mdan wrote:If you use decimal instead of integer (decimal[10,0]), it will work. It looks like the issue comes from the fact that Orchestrate is using sscanf and all the other strto... functions to convert. I'm still looking for a way to enforce the format, but decimal is working (I already did a test).
Dan
Dan,
I confirm. The records that contain an alphabetic character are rejected. The ones that contain blanks are still accepted, but that may be acceptable.
This looks like a very attractive solution.

Regards,
Wim
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'm assuming A-F are being considered as hex values and thus valid in the Integer data type? :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
mdan
Charter Member
Posts: 46
Joined: Mon Apr 28, 2003 4:21 am
Location: Brussels

Post by mdan »

chulett wrote:I'm assuming A-F are being considered as hex values and thus valid in the Integer data type? :?
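No, they are dropped: scanning stops at the first non-digit character, which is why "1A1A1" came through as 1 in the example above. If you want them to be interpreted as hex, then you should specify this in C_format with %x.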
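For comparison, here is a rough Python sketch of how the same 5 bytes would come out under a hexadecimal scan. It assumes that a C_format of '%5x' follows the usual C rules for hexadecimal input, which I have not verified inside DataStage; lenient_hex is a hypothetical name, and the optional "0x" prefix that C also accepts is left out to keep the sketch short.

```python
import re

def lenient_hex(field, width=5):
    """Imitate a '%5x' scan: hypothetical helper, not a DataStage API.

    Reads an optional sign and hexadecimal digits after leading whitespace,
    stopping at the first character outside 0-9, A-F, a-f.
    The optional "0x" prefix accepted by C is not handled here.
    """
    candidate = field.lstrip()[:width]
    match = re.match(r"[+-]?[0-9A-Fa-f]+", candidate)
    return int(match.group(), 16) if match else None

for raw in ["123AA", "123GZ", "GZZZZ"]:
    print(repr(raw), "->", lenient_hex(raw))
```

In this emulation "123AA" is read in full as a hexadecimal number, while letters beyond F end the scan in the same way that any letter ends a decimal scan.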

Dan
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Ok. Just curious - what happens when the alpha characters are outside of that range? Are they still just stripped or are they rejected then?
-craig
