Common Anti-Patterns: String Usage for Number Manipulation

This is a short post about a topic which is frustrating to me from time to time. The topic is about number manipulation by utilizing string operations.

I do not know why, but from time to time I see code and talk to people who have some tasks to do with numbers and they do not have a better idea than printing them into a string or character stream and applying string manipulation functionality to get their needed results. For me, it is strange to think about something like that, because I think it is against common sense to solve number issue with strings. Mathematics is the way to do that.

I have to extreme examples here in this post which also caused major issues in the developed software. There were other examples, but the two were selected as examples.

Example 1: Rounding of Numbers

I have seen this twice in my career. Once implemented in C and once again in Java. Once it was needed to round to a plain integer and the other time with a give number of digits for a needed precision. In both cases the implementation was done the same way:

  1. The original floating point number was printed into a string.
  2. For the decimal point the position was looked up.
  3. The string was cut at the decimal point and number in front of the point was taken and saved into a new variable.
  4. For checking whether to round up or down, the digit after the point was checked to be ‘5’ or larger (by character comparison) to know whether to round up or down.
  5. The number in front of the point was converted back into an integer.
  6. If it was to be rounded up, 1 was added.

For the arbitrary precision case, the position of the decimal point was shifted by the number of digits related to the precision.

When I see something like that, I get frustrated. It is a waste of calculation power, because string operations are much more expensive than mathematical calculations and the procedure has several flaws:

  1. What happens if there is an exponential notion of the number? We need to cut at the ‘E’, too and take care of the number behind it.
  2. If localization is involved, what happens then? I live in Germany and we use a comma as decimal sign and the point is used for thousands.
  3. It is quite hard to assure correct behavior is arbitrary precision is to be applied.

The most simple solution for this issue is always to have a look for an already distributed function like Math.round in Java. A general procedure can be in C code:

double num = ...;
int n = (num < 0) ? (num - 0.5) : (num + 0.5)

It is assumed for that algorithm that a cast just truncates the digits right from the point. This rounding procedure is much fast than string manipulation.

To add a functionality for arbitrary precision rounding, it can be easiest achieved by multiplying the precision. For example assume the function:

double round(double d) {
    return (d < 0) ? (d - 0.5) : (d + 0.5)
}

To round num to an arbitrary precision we can do the following for two digits:

double num = ... ;
double rounded = round(num * 100.0) / 100.0;

We have rounded now to two digits.

Attention: I see the discussion coming already, but I do not take floating point number effects into account here. I know, that floating point numbers are not 100% correct due to internal representation, but this is not solved by string manipulation either as soon as the numbers are put back into doubles for example.

2. Example: Calculating Mantissa and Exponent

The task is here to extract the mantissa and the exponent from floating point number into a 4 byte signed integer for the mantissa and a signed byte for the exponent to send everything over the network wire to systems which might have different binary floating point representation, no floating point representation due to usage of fixed decimal values or no exponential representation in strings. This may happen in integration projects. The separation into integers is not a bad idea and the target system can calculate its internal representation as needed.

The procedure I have seen in C code was:

  1. Print the number into a string (via sprintf with “%e”).
  2. Cut the string into two pieces at the ‘e’ to separate mantissa and exponent.
  3. Convert the exponent into an integer assuming it fits into the byte.
  4. The mantissa string’s dot is removed.
  5. The length of the mantissa string is checked for length and shortened to 9 characters to later fit into the 4 byte integer.
  6. The mantissa is converted into the final 4 byte integer.
  7. The exponent is subtracted by the length of the mantissa string – 1 to get the correct exponent for the large mantissa.

There also some flaws except of the performance impact especially in applications which deal heavily with number which was/is the case in the observed situation:

  1. Again, localization: What happens in Germany with the dot?
  2. It is assumed that the exponential string representation has always one digit in front of the decimal point.
  3. The exponent may become greater than 127 or smaller than -128.

It is much better to do some math here. I have to admit, the math here is not trivial any more because the exponential and logarithm functions need to be applied and the mathematical rules need to be present, but this is assumed to be know by software developers and software engineers. They do not need to know them by heart, but they should know they are there and where to look them up. Here is the mathematical algorithm in words:

  1.  Calculate the logarithm of base 10 of the number to be converted. Remember: log_10 (x)  = log(x) / log(10). Most programming languages only offer the natural logarithm with base e.
  2. Round the exponent from 1. up to an integer. Up rounding is needed later to not exceed the capacity of the 4 byte integer mantissa.
  3. Calculate the mantissa by dividing the number by 10 up to the power of the exponent got from 2.
  4. The number from 3. is guaranteed to be smaller than 1 due to the up-rounding of the exponent in 2.
  5. Multiply the mantissa by 1 billion to get the maximum precision of our two integer exponent representation and round it into our 4 byte integer.
  6. Subtract 9 from the exponent in 2. to get the correct exponent to the new mantissa.

That’s it. It is maybe not that easy to understand mathematically, but it is optimized for computers. A computer is better in computing numbers than in manipulating string.

Conclusion

In our daily work we are confronted with a lot of tasks which are not so easy to solve on first glance. We always should look out for the best solution for the computer to perform, for the developer to implement correctly and for other developers to understand.

Since computers are optimized for computations, mathematical solutions are to be preferred for performance reasons. It can also be assumed that software developers and software engineers have some understanding of mathematics to find correct solutions. If not, there are also forums to ask people for help. It is not very probable that a problem at hand was not solved by someone else, yet. The most problems occurred already somewhere else and a proofed solution is always better than something new.

Leave a Reply