I needed some valid SWF files with constrained character sets for an injection PoC. Putting them here in case someone else needs some.
PHP has finally taken the plunge into the 1960s by implementing GOTO. Here's an implementation for Java.
A patch for a serious Java bug. No longer needed as of June 16.
Max & Michael have written a Max/MSP driver based on the multitouch code.

My CSS-fu is weak; please use a recent browser.

Some rights reserved.

Random, semi-related image by thaths.

Bit Tricks, Part III: Fast Vertical Counter

Carry-save adders can reduce the run time of a vertical counter from log(w) to amortized constant time.

This is part 3 of a series of posts on bit-twiddling tricks and micro-optimizations. This one looks at ways to try to optimize the following function:

"Plain"
void updateCounters(long bitset) { for (int i=0; i < 64; i++) if ((bitset >> i)&1) table[i]++; }

What? Why?

Some algorithms have an inner loop like this:

  for each record r {
    bucket *b = somefunction(r);
    if (predicate1(r)) b->counter1++;
    if (predicate2(r)) b->counter2++;
    if (predicate3(r)) b->counter3++;
    if (predicate4(r)) b->counter4++;
    ...
  }

If we can organize our data so that we can calculate all the predicates at once, we can replace that code with this:

  for each record r {
    bucket *b = somefunction(r);
    long bitmap = computeAllThePredicates(r);
    b->updateCounters(bitmap);
  }

  ...

  void updateCounters(long bitset) {
    for (int i=0; i < 64; i++)
      if ((bitset >> i)&1) table[i]++;
  }

The rest of this post is about that 'updateCounters' method.

Simple improvements

We can improve the original version by avoiding the branch instruction.

"Branchless"
for (int i=0;i<64;i++) count[i] += ((bitmap[i] >> i) & 1);

Or the (possibly) faster operation using the excellent DeBruijn trick (this will deliver the counts out of order, but that can be fixed in postprocessing)...

"DeBruijn"
for (int t; (t = mask&-mask); mask ^= t) count[(i * DEBRUIJN_CONSTANT) >> 58]++;

The ordinary vertical counter

A vertical counter lets you do all those counting operations at the same time.

Consider a normal array of integers, where we read the bits horizontally:

       msb<-->lsb
  x[0]  00000010  = 2
  x[1]  00000001  = 1
  x[2]  00000101  = 5

A vertical counter stores the numbers, as the name implies, vertically; that is, a k-bit counter is stored across k words, with a single bit in each word.

  x[0]  00000110   lsb ↑
  x[1]  00000001       |
  x[2]  00000100       |
  x[3]  00000000       |
  x[4]  00000000   msb ↓

5

With the numbers stored like this, we can use bitwise operations to increment any subset of them all at once.

We create a bitmap with a 1 bit in the positions corresponding to the counters we want to increment, and loop through the array from LSB up, updating the bits as we go. The "carries" from one addition becomes the input for the next element of the array.

  input  sum

A B C S 0 0 0 0 0 1 0 1 sum = a ^ b 1 0 0 1 carry = a & b 1 1 1 1
"Vertical"
carry = input; long *p = buffer; while (carry) { a = *p; b = carry; *p++ = a ^ b; carry = a & b; }

For 64-bit words the loop will run 6-7 times on average -- the number of iterations is determined by the longest chain of carries.

Redundant encoding

A redundant positional number system is one where the individual digits can meet or exceed the radix. In English, this would be like allowing "twenty-ten" as a synonym "thirty".

If we extend the binary system to allow the digit 2, we can use the extra wiggling room to write what I'll for lack of a better word call a "lazy" binary counter.

ordinary    lazy
  001011  001011 = 11
  001100  001012 = 12
  001101  001021 = 13
  001110  001022 = 14
  001111  001111 = 15
  010000  001112 = 16
  010001  001121 = 17
  010010  001122 = 18

The digit 2 is only allowed in some positions -- we'll color those red. The rules are:

  • Blue digits are always 0 or 1. Red digits may also be 2.
  • In the beginning, all digits are blue.
  • When incrementing a red digit, the result is at most 3. We leave the sum mod 2 as a blue digit, and take a carry to the next bit.
  • When incrementing a blue digit, the result is at most 2. We leave it as a red digit, and stop.
  • As a result, each time the algorithm is run, some number of red digits are turned blue, and [at most] one blue digit is turned red.

A simple accounting argument shows that the average cost of each run will be at most 2. Since the control flow is independent of the value of the counter, we can run it in parallel.

Implementation

We use two bits to store each digit. Blue digits are stored as 00 or 01; red digits may be 00, 01, 10 or 11.

To turn a blue digit red, we merely deposit the incoming bits in the left half:

   0 + 00 -> 00
   1 + 00 -> 10
   0 + 01 -> 01
   1 + 01 -> 11

To turn a red digit blue, we use a full adder:

  input  output

A B C S 0 00 0 00 0 01 0 01 t = b0 ^ b1 0 10 0 01 0 11 1 00 carry = (a & b) ^ (t & c) 1 00 0 01 sum1 = 0 1 01 1 00 sum0 = a ^ t 1 10 1 00 1 11 1 01

We don't actually need to store the red/blue status -- if all 64 "left" bits are 0, we consider the digit blue.

"Lazy"
word *p = data; long a, b, t; while ((a = *p)) { b = p[1]; t = a ^ b; *p++ = 0; *p++ = t ^ carry; carry = (a & b) ^ (t & carry); } *p = carry;

We can also keep the state as a separate register. This is slightly faster, presumably since the branch is easier to predict.

"Scheduled"
int p = 0, t = ++time; while ((t & 1) == 0) { long a = d[p], b = d[p + 1], s = a ^ b; d[p + 1] = s ^ c; c = (a & b) ^ (c & s); t >>= 1; p += 2; } d[p] = c;

This last version has an interesting property: It is the only non-contrived program I can think of where you can invert the condition of the while clause, and the code works just as well.

Benchmarks

This graph shows the runtime of the various algorithms for 64MB of random input of various densities. The runtime of the original one is (again) completely dominated by branch mispredictions.

Here's the same graph, showing just the vertical counters. For 64 bit registers, the O(1) version is about twice as fast.

Comments