Pedro Sobota

An index for number series integrity

Pedro Sobota
March/2019


With this algorithm, it is aimed to implement a simple and efficient index over a series of data to detect changes in any number in the series.

Follows first a mathematical description of the problem.

Thinking about representations of all numbers in a numeric series, it can be observed that the mean is not a sufficient parameter to use for this purpose. It is subject to increasing one value in one position in the series, and decreasing another value by the same amount in another position. Such that

\[mean \{7.5, 2.98, 6.72, 0.89, 3.74, 2.87\} = \\ mean \{6.5, 3.98, 6.72, 0.89, 3.74, 2.87\}.\]

In this respect, the standard deviation is an improvement. The value would be different for the same change above:

\[stddev \{7.5, 2.98, 6.72, 0.89, 3.74, 2.87\} = 2.5149, \\ stddev \{6.5, 3.98, 6.72, 0.89, 3.74, 2.87\} = 2.21737.\]

The standard deviation, however, is susceptible to a simple value swap operation between two positions. Exchanging the position of one value with another ends up having no effect on the series' dispersion (or its mean value, either).

As an example, considering

\[A1=\{7.5, 2.98, 6.72, 0.89, 3.74, 2.87\},\\ A2=\{2.98, 7.5, 6.72, 0.89, 3.74, 2.87\}\]

(elements 1-2 swapped),

\[stddev(A1) = stddev(A2), \\ mean(A1) = mean(A2).\]

What is needed is an operation on the numbers which is not commutative, so the order of elements going into the operation determines its result.

As a non-commutative operation over the real numbers, exponentiation suggests itself immediately.

\[2 ^ 3 \neq 3 ^ 2, \\ \left(2 ^ 3\right) ^ 4 \neq 2 ^ \left(3 ^ 4\right).\]

(We note that

\[2^4=4^2,\]

, but these are the only pair of natural numbers which commute under exponentiation1.)

Let's reference the operation of consecutive exponentation with the symbol \(\bigwedge\), and each exponentiation with \(\wedge\).

Now,

\[\bigwedge \{7.5, 2.98, 6.72, 0.89, 3.74, 2.87\} = \\ 7.5 \wedge 2.98 \wedge 6.72 \wedge 0.89 \wedge 3.34 \wedge 2.87 = \\ 436.491\]

and

\[\bigwedge \{2.98, 7.5, 6.72, 0.89, 3.74, 2.87\} = \\ 4348.3,\]

such that ordering is now satisfied.

An obvious problem with consecutive exponentiation, though, is that it is too easy to overflow the representation capacity with big numbers; let's note that the consecutive exponentiation of \(\{5,6,4.6\}\), at such low number of small elements, has a magnitude of \(10 ^{2654}\).

Our solution to this problem will be to reduce the magnitude on every number to \(1\). This requires the extraction of the decimal exponent of the number and subsequent division by that. Now,

\[\bigwedge \{5, 6, 4006\} = \\ \left(5 \times \frac{1}{10^1}\right) \wedge \left(6 \times \frac{1}{10^1}\right) \wedge \left(4006 \times \frac{1}{10^4}\right) = \\ 0.5 \wedge 0.6 \wedge 0.4006 = \\ 0.568431.\]

Notating normalized numbers as \(\overline{x}\), the above is

\[\overline{5} \wedge \overline{6} \wedge \overline{4006} = 0.568431.\]

The result now sits within a controlled range of between \(0\) and \(1\).

For every solution there are corner cases, though, and this solution opens up the possibility of magnitude manipulations (since we take them out of every number).

Thus changes like

\[\bigwedge \{50,.6,46,0.63,.0005,4300,52\} = 0.596399 \]

still produce the same value, or even, for a single element change,

\[\bigwedge \{50,6,4.6,6.3,5,4.3,5.2\} = 0.596399. \]

However, we observe that those same changes produce a change in the mean parameter:

\[mean \{5,6,4.6,6.3,5,4.3,5.2\} = 5.2, \\ mean \{50,6,4.6,6.3,5,4.3,5.2\} = 11.6286. \]

If the mean is inserted into the computation, then those results could be discerned. We'll prepend it to the list of numbers, with it ending up the first exponent in the calculation:

\[\bigwedge \{5,6,4.6,6.3,5,4.3,5.2\} = \\ \overline{5} \wedge (\overline{6} \wedge (\overline{4.6} \wedge (\overline{6.3} \wedge (\overline{5} \wedge (\overline{4.3} \wedge (\overline{5.2} \wedge mean)))))) = \\ \overline{5} \wedge (\overline{6} \wedge (\overline{4.6} \wedge (\overline{6.3} \wedge (\overline{5} \wedge (\overline{4.3} \wedge (\overline{5.2} \wedge \overline{5.2})))))) = \\ 0.595317 \\ \neq \\ \bigwedge \{50,6,4.6,6.3,5,4.3,5.2\} = \\ \overline{11.6286} \wedge (\overline{50} \wedge (\overline{6} \wedge (\overline{4.6} \wedge (\overline{6.3} \wedge (\overline{5} \wedge (\overline{4.3} \wedge (\overline{5.2}))))))) = \\ 0.595047.\]

Because this mean will end up normalized (like any other value), a general change in the whole series by factors of 10 will not produce different outcomes. Such a scenario is considered uninteresting for a modifying agent and this limitation is left untreated.

Properties related to exponentiation

The choice of exponentiation as an operator presents relevant properties.

The first one is related to zeroes. Any zero base will "reset" the algorithm to a 0 value; this, per se, is not a blocking condition; a more serious issue is that a leading zero, being the last exponent to be calculated, will produce an invariable zero result.

To eliminate both conditions, we exchange zeroes for the normalized mean. This guarantees also no stabilization of value between zeroes.

A special case is of all-zero series — which yield invariably the \(0^0\) exponentiation and halt the algorithm. This is not possible to avoid without identifying the series as such, so we do it beforehand, using the mean and standard deviation to determine it. If it is the case, we return zero.

Another halting condition is related to negative inputs. Obviously, exponentiating by a negative number results in a fraction which, taken an exponent to the next number in the list, which could be negative as well, ensues an even root operation over a negative number, producing a complex number.

The trivial solution in this case, that of taking absolutes of the inputs, works finely, because changes of signal will change the mean. Like in the case of the magnitude manipulation, the mean is used as a fallback.

Another condition is that of swapping two values which are multiples of ten between themselves. This is a combination of the swapping and magnitude behaviors. This is the second condition we don't develop a solution for, due to hypothesising a low probability of occurrence of such numbers in an organic number series.

Summary

This index should detect all changes in a series of numbers, except for the following cases:

Visualization

In figures 1-2, we see intermediate value generation over one random data series. The result is the last point of the lines.


Fig. 1: intermediate values over one
random 10-sized series

Fig. 2: intermediate values over one
random 100-sized series

Performance

The execution of the algorithm involves one pass over the entire list, for the algorithm itself, and two more for the mean and standard deviation. It falls within the category of a linear algorithm.

Benchmark: Mathematica implementation; Intel i5-8250 @ 1.60 GHz, 8,00 GB, Windows 10 machine.

Random set size 1000 10000 50000 250000 1000000
Time 0.016 s 0.12 s 0.574 s 2.81 s 11.3 s

Implementations

With tests.

References

1. Structure of integer pairs which commute under exponentiation (Stack Overflow)