How to enumerate bits quickly using DeBruijn sequences.
These are the slides from a tech talk I gave at the NWERC programming competition.
Topics barely covered: cache-friendly memory layouts, bit-parallelism, SIMD-within-a-register, vertical storage, DeBruijn sequences.
Next permutation
The example in the background was from Hacker's Delight:
unsigned snoob(unsigned x) {
unsigned smallest, ripple, ones;
// x = xxx0 1111 0000
smallest = x & -x; // 0000 0001 0000
ripple = x + smallest; // xxx1 0000 0000
ones = x ^ ripple; // 0001 1111 0000
ones = (ones >> 2)/smallest; // 0000 0000 0111
return ripple | ones; // xxx1 0000 0111
}
Given a subset (represented as a bit map), it finds the lexicographically next subset with the same number of elements. Equivalently, it finds the next integer with the same number of one-bits. Work it out, it's nice.
Slides
Some random images have been removed.
Slide 1 of ℵ0
[image missing]
Slide 2 of ℵ0
Erling Ellingsen
[email address removed]
Slide 3 of ℵ0
"know when not to optimize"
Slide 4 of ℵ0
not the topic today
Slide 5 of ℵ0
how to blatantly
overoptimize
Slide 6 of ℵ0
sometimes useful
Slide 7 of ℵ0
task:
Slide 8 of ℵ0
Foo Bar Baz 1 2 3 1 4 1 2 4 3 5 3 2 3 5 6
Slide 9 of ℵ0
(foo == 1) && (bar == 2)
Slide 10 of ℵ0
postgres can do 1 M records/second
Slide 11 of ℵ0
Slide 12 of ℵ0
[image missing]
Slide 13 of ℵ0
Foo Bar Baz
1 2 3
1 4 1
2 4 3
5 3 2
3 5 6
--------------
struct rec {
byte foo,bar,baz;
};
rec *data;
Slide 14 of ℵ0
for(o on data) {
if(o->foo == 1 &&
o->bar == 2) {
...
}
}
Slide 15 of ℵ0
80 M/s
Slide 16 of ℵ0
two cache lines per record
Slide 17 of ℵ0
store each field
separately
Slide 18 of ℵ0
Foo Bar Baz 1 2 3 1 4 1 2 4 3 5 3 2 3 5 6 -------------- foo: [ 1, 1, 2, 5, .. ] bar: [ 2, 4, 4, 3, .. ] baz: [ 3, 1, 3, 2, .. ]
Slide 19 of ℵ0
[image missing]
Slide 20 of ℵ0
Foo Bar Baz
1 2 3
1 4 1
2 4 3
5 3 2
3 5 6
--------------
foo[0] = 0x11253....
bar[0] = 0x24435....
Slide 21 of ℵ0
for(i) {
a = awords[i];
b = bwords[i];
for(ofs : [0,4,8..]) {
if((a>>ofs) & 0xf == 1 &&
(b>>ofs) & 0xf == 2) {
}
}
}
Slide 22 of ℵ0
3.75x
300 M/s
300 M/s
Slide 23 of ℵ0
[image missing]
Slide 24 of ℵ0
for(i) {
a = awords[i];
b = bwords[i];
if((a>> 0) & 0xf == 1 && (b>>0) & 0xf == 2) ...
if((a>> 4) & 0xf == 1 && (b>>4) & 0xf == 2) ...
if((a>> 8) & 0xf == 1 && (b>>8) & 0xf == 2) ...
}
Slide 25 of ℵ0
5.8x
465 M/s
465 M/s
Slide 26 of ℵ0
[image missing]
Slide 27 of ℵ0
^ and &
are our friends
are our friends
Slide 28 of ℵ0
32-bit words
8 4-bit values
8 4-bit values
Slide 29 of ℵ0
0101 1100 1001 0101 ....
Slide 30 of ℵ0
we want to
search for
0101
Slide 31 of ℵ0
0101 1100 1001 0101 values
xor
0101 0101 0101 0101 search key
equals
0000 1001 1100 0000 wrong bits
Slide 32 of ℵ0
0101 1100 1001 0101 values
xor
1010 1010 1010 1010 NOT search key
equals
1111 0110 0011 1111 matching bits
---- ----
Slide 33 of ℵ0
reduced problem:
find all nibbles
that are = 1111
Slide 34 of ℵ0
1111 0110 0011 1111
x = x & (x>>1);
0111 0010 0001 1111
x = x & (x>>2);
0001 0000 0000 0011
x = x & 0x11111...;
0001 0000 0000 0001
Slide 35 of ℵ0
8 comparisons at once
Slide 36 of ℵ0
3 2 1 0 0000 0001 0000 0000 // locate the 1 bits
Slide 37 of ℵ0
nice trick in JDK
Slide 38 of ℵ0
// java.lang.Integer:
public static int numberOfTrailingZeros(int i) {
int y;
if (i == 0) return 32;
int n = 31;
y = i <<16; if (y != 0) { n = n -16; i = y; }
y = i << 8; if (y != 0) { n = n - 8; i = y; }
y = i << 4; if (y != 0) { n = n - 4; i = y; }
y = i << 2; if (y != 0) { n = n - 2; i = y; }
return n - ((i << 1) >>> 31);
}
Slide 39 of ℵ0
2.8x
1300 M/s
1300 M/s
Slide 40 of ℵ0
[image missing]
Slide 41 of ℵ0
numberOfTrailingZeros
is general
Slide 42 of ℵ0
general != fast
Slide 43 of ℵ0
the 1 bits can only
be in positions 0, 4, 8...
Slide 44 of ℵ0
let's make a faster one
Slide 45 of ℵ0
first, we need to find the lowest bit
Slide 46 of ℵ0
X & -X
Slide 47 of ℵ0
000011010100110000 X
111100101011001111 ~ X
111100101011010000 (~ X) + 1
111100101011010000 -X
000000000000010000 X & -X
Slide 48 of ℵ0
next:
from 24n to n
from 24n to n
Slide 49 of ℵ0
0x01234567 *
0x00000100 =
0x23456700
0x23456700 >> 28 == 2
Slide 50 of ℵ0
while(x != 0) {
int bit = x&-x;
matched( ofs*8 + ((bit*0x01234567) >>> 28));
x -= bit;
}
Slide 51 of ℵ0
18.7x
1500 M/s
1500 M/s
Slide 52 of ℵ0
[image missing]
Slide 53 of ℵ0
before:
x[0] = 0001 0111 0101 0111 1001 ....
x[1] = ....
after:
x[0] = 000011.
x[1] = 011100.
x[2] = 010101.
x[3] = 111111.
Slide 54 of ℵ0
EGA format!
Slide 55 of ℵ0
32 matches at once
Slide 56 of ℵ0
for(ofs) {
int matches = x[ofs] & ~x[ofs+1] & x[ofs+2] & ~x[ofs+3]
& ~y[ofs] & y[ofs+1] & ~y[ofs+2] & y[ofs+3];
while(matches != 0) {
matched( ofs*32 + numberOfTrailingZeros(v) );
matches &= v-1;
}
}
Slide 57 of ℵ0
24.3x
1950 M/s
1950 M/s
Slide 58 of ℵ0
more magic numbers
Slide 59 of ℵ0
int evilBitPos(x) {
int bit = x & -x;
return table[ (bit * 2510000032) >>> 27 ];
}
Slide 60 of ℵ0
10010101100110111000111110100000
Slide 61 of ℵ0
(000) 10010101100110111000111110100000
--- -- ----- -----
all 5-bit substrings are different
Slide 62 of ℵ0
31x
2500 M/s
2500 M/s
Slide 63 of ℵ0
2G of RAM =
25M rows
25M rows
10 ms