More problems with Pesky Floats

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

blewip
Participant
Posts: 81
Joined: Wed Nov 10, 2004 10:55 am
Location: London

More problems with Pesky Floats

Post by blewip »

Is anybody else getting unexplained results from Floats?

We have a database with Floats in it and we are finding it difficult to get any consistent results. We have generated a test case and I will explain the results we are getting. Basically, we generate 1 row of data, then hard-code a Float and a Numeric field to the value 1234.5678. In all of the cases we pass the data through another transformer, and then finally to a Seq File.

When we view the Numeric through no conversion we get:

Code: Select all

01234.5678000000
Which is what we expect; indeed, this is what we would expect all our outputs to be, yet it is the only one that is correct!

If we take the Float through no conversion we see:

Code: Select all

1234.57
If we take the Numeric then convert it to a Float using DecimalToFloat, we also get:

Code: Select all

1234.57
Which is also what we get if we take our Float and output it to a VarChar(20) using no defined conversion.

If we take the Float and convert it to a Numeric using DFloatToNumeric we get:

Code: Select all

01234.5677490234
If we take the Float and convert it to a VarChar(20) using DFloatToStringNoExp, we get the following, which is also what we get if we convert the Float to a Numeric using DFloatToDecimal:

Code: Select all

1234.56774902344
I believe all of these should have delivered the same result, namely what we input: 1234.5678. The fact that we get so many different results is frankly scary. Anyone got any views?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Blewip,

the way that floating point numbers are stored is with a sign, an exponent and a mantissa. A floating point number is internally represented as mantissa x 2^exponent. It is highly unlikely that any given decimal number can be exactly expressed with a fixed-length binary mantissa and exponent, so a floating point number will usually only be a "close" representation of what it is meant to equal. The more bits you use for the mantissa the closer this approximation will be, and for most purposes you will be close enough to the value you want; you might not be able to exactly represent the value 2.333333333333333 in a short floating point number, but you will come within a tiny fraction of it. Good for calculating, but really bad for banking applications.

So back to your PX floating point dilemma - the first question would be: how are you defining your float? Is it single precision (not good) or double precision (better)?

The conversion itself will implicitly introduce an error - the trick in all of these cases is controlling and understanding the errors. In your case you are seeing two different issues: the conversion itself, plus an implicit conversion when displaying these values. Implicit conversions to/from float are trouble waiting to happen, since you don't know what method they are going to use.

You will almost never get identical results using floating point numbers and conversions, but you will get very close approximations. I think what you need to do (apart from going to double or even quad precision) is to minimize the number of conversions to/from or to even go away from floating point numbers if you need identical results. It really does depend upon what you wish to do with the numbers.
blewip
Participant
Posts: 81
Joined: Wed Nov 10, 2004 10:55 am
Location: London

Post by blewip »

Well, the first thing is that we are a bank, and therefore we are looking to get a good representation of the number. I guess we do accept that with a Float we are not going to get an exact number all the time, but I just think the difference is too great. The difference between the entered value and the received value is 0.0022 - surely this is not good enough.

We have tried with Double instead of Floats and we get the same results.

The other problem is that our Server jobs don't have this issue. All the developers want to go back to using Server to develop the jobs.

Also, this doesn't explain why converting the Float to a Numeric is better than just leaving it as a Float and displaying the result. The Float to Numeric has a difference of only 0.0000509766 (not too bad :? )
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

A bank financial application never uses floating point representation to store financial information. The books will never balance. I remember doing a project (years ago) for the Dresdner bank international currency trading division and they were using quad precision and still coming up with daily trading discrepancies of several thousand US$; those problems went away when we went (back to) their fixed representation.

The reason that Server does well in this respect is that it uses a string representation of the real number and never converts to an internal floating point unless a mathematical operation is performed on the number. PX, on the other hand, uses strict data typing and needs to perform conversions in many cases.

I agree that rounding of 0.0022 is unacceptable, and that the best solution if you have a float input is to leave it as a floating point number. I think that you might be keeping good numeric values but having conversion formatting issues. If you take your number, multiply it by 1000 and then display that result using the same output method, do you get "back" those 3 digits of precision? If that is the case, then ensure you have the correct number of digits by doing an INT(number*10^n)/10^n (but not with a floating point resultant data type, of course).
blewip
Participant
Posts: 81
Joined: Wed Nov 10, 2004 10:55 am
Location: London

Post by blewip »

I shall make no comments about the application (I didn't write it).

You say
the best solution if you have a float input is to leave it as a floating point number
However, this is what I am doing when I get the 0.0022 error. It is actually better when I take the Float and convert it to a Numeric; the error is only 0.0000509766.

Would you say this is acceptable?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Blewip,

I would say that 5 digits of precision is completely unacceptable. Something is going seriously wrong. Single-precision floating point uses a 24-bit mantissa (32 bits in total) and this should allow for much more accuracy - around 7 decimal digits.

Please take a sampling of a few (perhaps 5) different real numbers; it might be that you have hit upon a test value that just doesn't lend itself to a good floating point mantissa.

I don't understand the 0.0022 error when not doing a conversion, though. How are you determining this error? Using a displayed value introduces an error of its own, as it uses some method to convert the data, so you would need to do a mathematical comparison to really know what error is present.

I would suggest you write a job that does an actual math computation of the differences between the two expected results in an integer format and displays that, this way you would not be introducing a display error into the result. Take the floating point number, convert to number format and then convert back to floating point, then compute (before - after)*10000000 into an integer result and display that.
blewip
Participant
Posts: 81
Joined: Wed Nov 10, 2004 10:55 am
Location: London

Post by blewip »

ArndW

I tried some other numbers, all pretty bad; the worst I found in my short testing was this one:
10234.5683
I am sure there are worse ones; however, if you assign this value to a float and then display the result, it is 10234.6. This is an error of 0.0317.

I also tried to do some Math with FLOAT2DECI set to 1234.5678

Code: Select all

If (DSLink2.FLOAT2DECI=1234.5678) Then "Correct" Else "Wrong"
and also

Code: Select all

If (DSLink2.FLOAT2DECI=1234.57) Then "Correct" Else "Wrong"
This showed that internally DataStage did think that the value was 1234.5678, and not the 1234.57 it was displaying.

Although this may be of use to some people, I do need to produce an extract file, so I will still get the wrong amount in my file. I think I might as well raise a call with Ascential.

Cheers Blewip
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

blewip,

in your transformer stage expression, please remember that your literal 1234.5678 is also being converted into a float during the transform (an implicit conversion, most likely to single precision) - and chances are quite low that one float will exactly equal another.

Before raising a call, please try something along the lines of what I suggested earlier; that way you can be 100% sure that the problem is not at the display level but at the data level. Do only explicit computational conversions and no display conversions; compute the difference into an integer, which doesn't get converted on display.
blewip
Participant
Posts: 81
Joined: Wed Nov 10, 2004 10:55 am
Location: London

Post by blewip »

It seems that the problem may be with the conversion from Float to display; internally the data may be okay. However, as I actually want the Float to be output to a file, this isn't very helpful for me.

I understand that the conversion process can be a source of problems, but I think it should at least be consistent. I will give another example. I have data in a Sybase table. A float in a particular row is shown as 11366.1399999 if I use SQL Advantage. However, if I view the data in DataStage's Sybase stage, it thinks the data is 11366.14 (which may actually be the most accurate). I can see that DataStage has to convert the Float to display it on the screen. But why, when it goes to a DataSet (still as a Float), does it display as 11366.1? Why is there no consistency between what is effectively the same conversion?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

The Precision attribute will influence your display, what is that set to?
blewip
Participant
Posts: 81
Joined: Wed Nov 10, 2004 10:55 am
Location: London

Post by blewip »

If I change the precision (the length) it does not seem to have any effect. However I have been playing around with something that does appear to have an effect.

I did say that we had tried Double and that it had no effect; however, it turns out that this is not entirely true. If I change the type from Float to Double and then view the data as a Double, there is no effect. However, if the Double is then converted to a String using DFloatToStringNoExp or DecimalToString (in fact, DFloatToNumeric also gives the correct result as a Numeric), you actually get the correct result (or maybe just a better one).

This is confusing, as DataStage itself set the datatype of the column to Float, and now we are having to change it to Double to get better results. We are also only getting better results when we convert the Double datatype to a different datatype.

Also of interest is the fact that for DFloatToStringNoExp (not documented) the second argument is supposedly scale; however, it appears to be precision, not scale.

Arnd, thanks for your help, we appear to have a way of dealing with our Floats now.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

blewip - thanks for the concise post at the end of this thread - I think that not only I but others will learn from the results! (I hate it when a problem posting just ends with an "I solved it" and no description of the solution.)

The floating point problems are truly universal - I've seen them in various aspects in every single application that uses them; I think it really does stem from most of us assuming that what-you-see-is-what-is-stored.

Luckily I haven't come up against floating point in this project [yet] :)

Thanks for the update,
bcarlson
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

Can I add a final note? We are a banking institution as well and all our numbers are stored as decimal.

For performance and Torrent conversion issues, we do not use Transforms; we use buildops. If we do calculations, we convert temporarily to float and then back to decimal. As a very simple example, the following code converts a percentage interest rate (e.g. 22.99%) to a decimal (0.2299).

Code: Select all

// divide the percentage rate by 100.0 as a dfloat, then assign back to
// the decimal output field with an explicit rounding mode
APT_DFloat tmpFl = 100.0;
out.int_rt.assignFromDFloat(in.X_NW_DDA_INTEREST_RATE.asDFloat() / tmpFl, APT_Decimal::eRoundInf);
We had the float issues mentioned in this thread, but they went away when we forced rounding. In the buildops there is an enumerated type called RoundMode that can be applied to assignment functions, such as the one listed above. RoundMode is defined as follows (from decimal.h):

Code: Select all

  /**
     Enumeration for rounding method.
   */
  enum RoundMode
  {
    /**
       Discard material to the right of the surviving digit, regardless of
       sign.  This corresponds to the COBOL INTEGER-PART function.
       This is the default rounding mode.  Examples: 1.6 -> 1, -1.6 -> -1
     */
    eTruncZero,

    /**
       Round towards nearest representable value, breaking ties by rounding
       towards plus infinity (positive) or minus infinity (negative).  This
       corresponds to the COBOL ROUNDED phrase.
       Examples: 1.4 -> 1,   1.5 -> 2, -1.4 -> -1, -1.5 -> -2
     */
    eRoundInf,          

    /**
       Truncate towards minus infinity.
       This corresponds to the IEEE 754 Round Down mode.
       Examples: 1.6 -> 1, -1.4 -> -2
     */
    eFloor,

    /**
       Truncate towards positive infinity.
       This corresponds to the IEEE 754 Round Up mode.
       Examples: 1.4 -> 2, -1.6 -> -1
     */
    eCeil

    /* not implemented: IEEE 754 Round to Nearest mode (eRoundEven) */
  };
If you are using buildops, you should be able to take advantage of this. Do you know if rounding methods are available in the Transformer? If so, you may want to try that out.

HTH
pneumalin
Premium Member
Posts: 125
Joined: Sat May 07, 2005 6:32 am

For performance and Torrent conversion issues

Post by pneumalin »

bcarlson,
Doing buildops is cool - I like your method.
But your statement also raises my curiosity:
"For performance and Torrent conversion issues, we do not use Transforms"

Many threads in this forum talk about avoiding the Transformer for performance reasons, since there is a statement in the Advanced Guide saying that we should try to avoid using a Transformer in favour of Copy, Modify, Filter, etc.

I thought the PX engine translates a Transformer into generated C++ code and passes it to an external compiler, creating a .so that is known and registered by the engine, and that this .so should be usable in parallel when the engine is set up to do so.
If my speculation is true, then why does DataStage discourage users from deploying the Transformer for performance reasons?
I can see that compilation takes longer when you use more Transformers in a PX job, but I don't see the run-time performance drawback.

Furthermore, if users are forced to use a BuildOp because of the Transformer's performance issues, then why would users go with a DataStage solution instead of a C++ solution in the first place?

Please comment on it; I would really like to hear your opinion on this.
Thanks!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Is there not an optional "Decimal Type" property where you can set precision and scale?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.