Adding up to 100

What could be easier? Still, percentages in polls often add up to all sorts of numbers other than 100. For example, suppose exactly one third of the participants votes for option A, another third for option B, and another third for option C. All options have exactly one third of the votes, or 33.333…%. Rounding obviously gives 33% for all options, which adds up to 99%, rather than 100%. In a similar situation with six options, every option gets 16.666…% of the votes, rounding gives 17%, and 6 times 17 equals 102!

When presenting poll results, it might make sense to ensure that the percentages always add up to 100, but there are also some more subtle problems. Suppose there are only two options and only 4 out of 1000 participants vote for option A. Rounding would result in 0% for option A and 100% for option B. This does add up to 100%, but the ratio 4:996 is quite different from 0:100. Here we will give a method for finding the most likely assignment of percentages (or integer fractions of some other number than 100). Why? Because it’s fun :) Also, if you ever happen to be founding a democracy, the method used here could be considered to give a more representative result than the method currently in use (in the Netherlands at least). Oh, and their might also be some connections to (Huffman) entropy coding (for the connaisseurs: it’s possible to show that we’re actually minimizing the Kullback-Leibler divergence).

So how can we translate a set of votes into a partition of 100 (or any other number for that matter)? One option is to consider the most likely partition, given the counted votes. The “likelihood” of a partition, given the data, is the probability of the data given the partition. If you’re a statistician, that last sentence probably made perfect sense, if you’re a mere mortal, your head might have come off. The idea is really simple. If we wish to evaluate how likely it is that it rains on 50% of the days in the Netherlands (which is apparently reasonably accurate, according to the royal Dutch meteorological institute), then we just go outside each day for a year and see whether it rains. After a year we have a series of measurements: rain, no rain, rain, no rain, etc. Now we can compute the probability of this sequence given our model that each day you have a 50% chance of rain, this probability (of the data) is the likelihood of the model.

First, lets suppose that we have a number of options: \mathcal{O}={A, B, …}. The number of votes on option A is denoted by n(A), and similarly for the other options. Say that we also assign probabilities pA, pB, etc. to each option, then the probability of the data for two options A and B equals

\displaystyle\binom{n}{n(A)}\,p_A^{n(A)}p_B^{n(B)}\text{ with }p_A+p_B=1.

This is the well-know Binomial distribution.
We will equate the computed percentages with the probabilities in our basic model, so everything stays fixed, except for pA and pB. We can thus ignore the binomial and, in general, maximize the following (with \mathcal{O} being the set of options):

\displaystyle\prod_{A\in\mathcal{O}} p_A^{n(A)}\text{ subject to }\sum_{A\in\mathcal{O}} p_A=1.

If desired, the logarithm can be taken to yield:

\displaystyle\sum_{A\in\mathcal{O}} n(A)\,\log(p_A)\text{ subject to }\sum_{A\in\mathcal{O}} p_A=1.

However, the idea is that the probabilities stem from fractions with some denominator N (typically N equals 100), so each probability has the form num/N. This makes it an integer progamming problem, and a non-convex, non-linear one at that. Perhaps contrary to expectation, integer optimization problems are typically harder than their continuous counterparts, so this is not a good thing. Luckily (in this case) there is a way out.

We can rely on what is quite possibly the most useful technique for making seemingly intractable problems tractable: dynamic programming. First of all, let’s define the following score function, using numA to denote the (integer) numerator for option A:

\displaystyle S(\mathcal{O},M)=\underset{num}{\rm{maximize}}\prod_{A\in\mathcal{O}} (num_A/N)^{n(A)}\text{ subject to }\sum_{A\in\mathcal{O}} num_A=M.

It should be clear that we are interested in the partition that gives us S(\mathcal{O},N).

We now proceed as follows:

  • First create an array S with N+1 elements, fill index zero with the value 1 and fill the rest of the array with zeroes. You can think of this as stating that if there are no options, then the only possibility is to have a total of 0%.
  • Now execute the following for the first option A (assuming n(A)>0):
    for total=N to 0 (backwards)
       S[total] = 0;
       for numerator=1 to total
          newS = S[total-numerator]*pow(numerator/N, n(A))
          if newS > S[total] then
             S[total] = newS
    

    Now we have S[M] = S({A},M). Note that the code never overwrites a value of S that we need later1.

  • Now we repeat the previous step for all options, and after having processed the last option S[N] will equal S(\mathcal{O},N). The trick is that at each step we do not care about how we got to the sum in S[total-numerator], as this does not affect the factor we multiply it with (pow(numerator/N, n)). So it suffices to remember only the best possible way of getting to total-numerator.

We glossed over the fact that we now just have a procedure for getting the value of S(\mathcal{O},N), while we are more interested in the partition that gave us this value. This is typical in dynamic programming, and has a really simple solution, we just keep track of the decisions we made. Practically speaking this means that we would have to keep an array of links around that keeps track of the configurations that gave the values in S (which we update each time we update S).

I implemented the above algorithm in a Fiddle, where you can easily try it out for yourself. The code was kept simple, but basically works fine as long as the number of classes does not exceed N. In comparison with the algorithm above, I eliminated the division by N in the score function, and used the logarithm of the score function to avoid raising numbers to insane powers. Some possible enhancements and optimizations include:

  • Ensuring that there is a monotone mapping from counts to numerators (that is, a larger count always gets a larger numerator, and equal counts get equal numerators). Currently a list of counts like 1,1,1 results in the numerators 34,33,33 (although 33,34,33 is equally likely, more on that later), which feels a bit odd (although at least it adds up to 100%). Also, if the number of options exceeds N, weird things can and do happen.
  • One may want to detect all partitions that give the maximal score, and then average them or output ranges for example. The idea is that this would solve problems like outputting 34,33,33 when it would have been just as valid to output 33,34,33 or 33,33,34. However, the resulting numerators would no longer be integers, so I’m not sure it’s a good idea. Also, apparently even when assigning seats in the parliament do they eventually fall back to letting fate decide. (At least, that’s what it seems to say in §P.2 of the Dutch “kieswet”.)
  • Only evaluate total=100 for the last class. Similarly, the first class technically does not need to store links (they should all point to zero).
  • As effectively our only problem is rounding fractions, it might make sense to enforce that we either round each numerator up or down. This can restrict the problem space enormously and speed up the algorithm quite a bit, especially for larger values of N. I have not checked whether this will ever change the end result, but if it does the changes will probably be minor.
  • Various optimizations of a purely technical nature (like using typed arrays, which is often faster, although I have not checked that it is so in this case).

Finally, there are also other ways of defining “sensible” partitions. For example, one could look for a partition so that the ratios between the numerators most closely resemble the ratios between fractions. “Most closely resemble” could be defined in terms of differences between logarithms for example. For some purposes such a scheme might make more sense than the approach discussed here.

1 In the inner loop, S[total] is the first value we access, and the outer loop is going backwards (while the inner loop does not go beyond the current total), so after having updated S[total], we never access it again.

Advertisements
This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s