Erroneous data goes undetected in a Flat File Stage

goffinw · Post by **goffinw** » Thu Aug 07, 2008 3:24 pm

When you read from a file using a flat file stage, it generates errors when the data cannot be parsed according to the specified schema definition and other format specifications.
Records that don't satisfy the specification can be sent to a rejects link.
This is behaviour we appreciate.
But some data that doesn't correspond to the specifications does not result in rejected records, and that is a problem for us, because some erroneous data goes undetected.
For example: if you have a fixed-length record type and specify that a field has type 'integer' and size 5, then 5 bytes will be read and scanned into the integer field as follows:


Flat File field data    DataStage integer value    Remark
--------------------    -----------------------    ------
12345                   12345                      Correct
123AA                   123                        Error for me
 12                     12                         Could be seen as an error. Depends.
1A1A1                   1                          Error for me

Does anyone know of a way to have stronger data format checking when reading through a Flat File Stage, so that the above examples would be detected as errors?

Thanks in advance,
Wim

ArndW · Post by **ArndW** » Fri Aug 08, 2008 1:01 am

I think that interpreting "123AA" as an integer is incorrect as well. There are options, though. You can edit the attributes and specify an "In_format" (I can't check now, but "nnnnn" should do the trick). Another alternative is to read these in as CHAR and then use IsValid() or other functions in a transform stage.
Nonetheless, I do think that this should be given to your support provider as a case for IBM to fix or at least comment on.

goffinw · Post by **goffinw** » Fri Aug 08, 2008 2:58 am

ArndW wrote:I think that interpreting "123AA" as an integer is incorrect as well. ... Nonetheless I do think that this should be given to your support provider as a case for IBM to fix ...

This would be consistent with a DataStage strategy of detecting and rejecting any erroneous data. But you can't find such a statement in the manuals, can you? So I'd guess this is not a DataStage error.

ArndW wrote:There are options, though. You can edit the attributes and specify an "In_format" (I can't check now, but "nnnnn" should do the trick).

The "In_format" property or a similar one would indeed be the means par excellence to specify what you do and don't expect on input. I had looked into it. But I don't think that the two properties currently available, In_format and C_format, can provide the solution here. They specify the format argument of the C function 'sscanf', so the value is not something of the kind 'nnnnn' but something like '%5d'.
The functionality of sscanf is exactly the reason why DataStage is unable to detect this error and why these two properties don't help: it is impossible to specify in the sscanf format that a data string '123AA' is to be handled as incorrect.
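
To illustrate the point about sscanf, here is a rough Python emulation of what a '%5d' conversion does with the sample fields from the first post. This is only a sketch: the function name `lenient_scan_int` is made up for the illustration, it is not DataStage or Orchestrate code, and it assumes the ' 12' sample is really ' 12  ' padded to 5 bytes.

```python
import re


def lenient_scan_int(field: str, width: int = 5):
    """Rough emulation of C's sscanf(field, "%5d", &value).

    Returns the converted integer, or None when no digits could be read.
    """
    # %d skips leading whitespace; skipped blanks do not count against the width.
    stripped = field.lstrip()
    # The field width caps how many characters the conversion may consume.
    candidate = stripped[:width]
    # The conversion takes the longest valid prefix (optional sign, then digits)
    # and silently leaves everything after it unread.
    match = re.match(r"[+-]?[0-9]+", candidate)
    return int(match.group()) if match else None


for sample in ["12345", "123AA", " 12  ", "1A1A1"]:
    print(repr(sample), "->", lenient_scan_int(sample))
```

The conversion stops at the first character that cannot belong to an integer and still reports success, so '123AA' yields 123 and '1A1A1' yields 1. Only a field with no leading digits at all fails to convert.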

ArndW wrote:Another alternative is to read these in as CHAR and then use IsValid() or other functions in a transform stage.

This is probably the only way to solve my problem. But isn't it too bad that I need to add this complexity, just to reach this simple goal?
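
The check that the read-as-CHAR approach has to perform is small. Here is a minimal sketch of the logic in Python, not actual transform stage code: the function name `strict_int_field` and the `allow_padding` switch are hypothetical, and whether blank padding counts as valid is left as a choice because it "depends".

```python
def strict_int_field(raw: str, width: int = 5, allow_padding: bool = False):
    """Return the integer value of a fixed-width field, or None to reject it.

    Assumption for this sketch: a valid field is made of ASCII digits only
    (no sign), optionally surrounded by blanks when allow_padding is True.
    """
    # A fixed-length field must have exactly the declared size.
    if len(raw) != width:
        return None
    # Optionally tolerate leading/trailing blanks; blanks inside the number
    # are never accepted because they survive the strip.
    text = raw.strip(" ") if allow_padding else raw
    # Every remaining character must be a digit; an empty string fails too.
    if not (text.isascii() and text.isdigit()):
        return None
    return int(text)


for sample in ["12345", "123AA", " 12  ", "1A1A1"]:
    print(repr(sample), "->", strict_int_field(sample))
```

With the default settings only '12345' is accepted. Calling `strict_int_field(" 12  ", allow_padding=True)` accepts the blank-padded value as 12, while '123AA' and '1A1A1' are rejected either way.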

Thanks for your reactions,
Wim

mdan · Post by **mdan** » Fri Aug 08, 2008 7:23 am

goffinw wrote: Does anyone know of a way to have stronger data format checking when reading through a Flat File Stage, so that the above examples would be detected as errors?

Thanks in advance,
Wim

Hi,
If you use decimal instead of integer (decimal[10,0]), it will work. It looks like the issue comes from the fact that Orchestrate is using sscanf and all the other strto... functions to convert. I'm still looking for a way to enforce the format, but decimal is working (I already did a test).

Dan

goffinw · Post by **goffinw** » Fri Aug 08, 2008 8:08 am

mdan wrote:If you use decimal instead of integer (decimal[10,0]), it will work. It looks like the issue comes from the fact that Orchestrate is using sscanf and all the other strto... functions to convert. I'm still looking for a way to enforce the format, but decimal is working (I already did a test).
Dan

Dan,
I confirm. The records that contain an alphabetic character are rejected. The ones that contain blanks are still accepted, but that may be acceptable.
This looks like a very attractive solution.

Regards,
Wim

chulett · Post by **chulett** » Fri Aug 08, 2008 8:20 am

I'm assuming A-F are being considered as hex values and thus valid in the Integer data type?

mdan · Post by **mdan** » Fri Aug 08, 2008 8:24 am

chulett wrote:I'm assuming A-F are being considered as hex values and thus valid in the Integer data type?

No, they are stripped out. If you want them to be interpreted as hex, then you should specify %x in C_format.

Dan

chulett · Post by **chulett** » Fri Aug 08, 2008 8:38 am

Ok. Just curious - what happens when the alpha characters are outside of that range? Are they still just stripped or are they rejected then?
