|
Posted February 1 2009 I needed some valid SWF files with constrained character sets for an injection PoC. Putting them here in case someone else needs some. PHP has finally taken the plunge into the 1960s by implementing GOTO. Here's an implementation for Java. A patch for a serious Java bug. No longer needed as of June 16. Max & Michael have written a Max/MSP driver based on the multitouch code.
My CSS-fu is weak; please use a recent browser. Random, semi-related image by thaths. |
Bit Tricks, Part III: Fast Vertical CounterCarry-save adders can reduce the run time of a vertical counter from log(w) to amortized constant time. This is part 3 of a series of posts on bit-twiddling tricks and micro-optimizations. This one looks at ways to try to optimize the following function:
What? Why?Some algorithms have an inner loop like this:
for each record r {
bucket *b = somefunction(r);
if (predicate1(r)) b->counter1++;
if (predicate2(r)) b->counter2++;
if (predicate3(r)) b->counter3++;
if (predicate4(r)) b->counter4++;
...
}
If we can organize our data so that we can calculate all the predicates at once, we can replace that code with this:
for each record r {
bucket *b = somefunction(r);
long bitmap = computeAllThePredicates(r);
b->updateCounters(bitmap);
}
...
void updateCounters(long bitset) {
for (int i=0; i < 64; i++)
if ((bitset >> i)&1) table[i]++;
}
The rest of this post is about that 'updateCounters' method. Simple improvementsWe can improve the original version by avoiding the branch instruction.
Or the (possibly) faster operation using the excellent DeBruijn trick (this will deliver the counts out of order, but that can be fixed in postprocessing)...
The ordinary vertical counterA vertical counter lets you do all those counting operations at the same time. Consider a normal array of integers, where we read the bits horizontally:
msb<-->lsb
x[0] 00000010 = 2
x[1] 00000001 = 1
x[2] 00000101 = 5
A vertical counter stores the numbers, as the name implies, vertically; that is, a k-bit counter is stored across k words, with a single bit in each word. x[0] 00000110 lsb ↑ x[1] 00000001 | x[2] 00000100 | x[3] 00000000 | x[4] 00000000 msb ↓ With the numbers stored like this, we can use bitwise operations to increment any subset of them all at once. We create a bitmap with a 1 bit in the positions corresponding to the counters we want to increment, and loop through the array from LSB up, updating the bits as we go. The "carries" from one addition becomes the input for the next element of the array. input sum
For 64-bit words the loop will run 6-7 times on average -- the number of iterations is determined by the longest chain of carries. Redundant encodingA redundant positional number system is one where the individual digits can meet or exceed the radix. In English, this would be like allowing "twenty-ten" as a synonym "thirty". If we extend the binary system to allow the digit 2, we can use the extra wiggling room to write what I'll for lack of a better word call a "lazy" binary counter. ordinary lazy 001011 001011 = 11 001100 001012 = 12 001101 001021 = 13 001110 001022 = 14 001111 001111 = 15 010000 001112 = 16 010001 001121 = 17 010010 001122 = 18 The digit 2 is only allowed in some positions -- we'll color those red. The rules are:
A simple accounting argument shows that the average cost of each run will be at most 2. Since the control flow is independent of the value of the counter, we can run it in parallel. ImplementationWe use two bits to store each digit. Blue digits are stored as 00 or 01; red digits may be 00, 01, 10 or 11. To turn a blue digit red, we merely deposit the incoming bits in the left half: 0 + 00 -> 00 1 + 00 -> 10 0 + 01 -> 01 1 + 01 -> 11 To turn a red digit blue, we use a full adder: input output We don't actually need to store the red/blue status -- if all 64 "left" bits are 0, we consider the digit blue.
We can also keep the state as a separate register. This is slightly faster, presumably since the branch is easier to predict.
This last version has an interesting property: It is the only non-contrived program I can think of where you can invert the condition of the while clause, and the code works just as well. BenchmarksThis graph shows the runtime of the various algorithms for 64MB of random input of various densities. The runtime of the original one is (again) completely dominated by branch mispredictions. Here's the same graph, showing just the vertical counters. For 64 bit registers, the O(1) version is about twice as fast. Comments |