SUFFIX ARRAYS
musings on the data structure, applications and implementation in Java
Michał Nowak, Dawid Weiss
Notation
• Let Σ be an alphabet;
• Σ is finite and non-empty;
• symbols x_a ∈ Σ, where a = 1 . . . |Σ|, are ordered;
• O(x_b − x_c) = 1 (comparing two symbols takes constant time);
• $ is a special symbol smaller than any x ∈ Σ.
• Let X be a sequence of n symbols, where X[i] ∈ Σ and i = 0 . . . n − 1.
• Let S(i), or simply i, be the sub-sequence (suffix) of X starting at position i and ending at n − 1.
Examples
• Zero-terminated US-ASCII strings (Latin letters).
• DNA sequences.
• Integer-coded sequences of words.
(recall)
suffix S(i)      i
c a c a o $      0
a c a o $        1
c a o $          2
a o $            3
o $              4
$                5
Compacting (pocket trees)
1. Move labels to edges,
Properties of suffix trees
• Maximum 2n nodes (!).
• Elegant to program with.
?
alibaba.taliban.
What do you think,
do geese see God?
whatdoyouthinkdogeeseseegod.dogeeseseegodkniht...
There are many more ST applications.
Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press.
Problems with suffix trees
• Alphabet size.
Ordered depth-first ST traversal.
$
a · c a o $
a · o $
c a · c a o $
c a · o $
o $
suffix S(i)      i
c a c a o $      0
a c a o $        1
c a o $          2
a o $            3
o $              4
$                5
→ sort →
suffix S(i)      SA[j]
$                5
a · c a o $      1
a · o $          3
c a · c a o $    0
c a · o $        2
o $              4
Conclusions:
• A suffix array is an array of indices of sorted suffixes.
• A suffix array is always a permutation of suffix indices.
• Memory consumption: sizeof(index) × n.
• Suffix arrays and suffix trees can simulate each other (Abouelhoda et al., 2004).
• Suffix sorting can be done in O(n);
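The "sort" step for the cacao$ example can be sketched directly in Java. This is only a minimal illustration, not the project's code: the class name NaiveSuffixArray and its methods are made up for this example, and the sketch assumes that '$' sorts before every other symbol in the input (true for lower-case letters in Unicode).

```java
import java.util.Arrays;

public class NaiveSuffixArray {
    /** Returns the start indices of all suffixes of text, in lexicographic order. */
    public static int[] build(String text) {
        final int n = text.length();
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Comparison sort over suffix start positions; no suffix is ever copied.
        Arrays.sort(order, (a, b) -> compareSuffixes(text, a, b));
        int[] sa = new int[n];
        for (int j = 0; j < n; j++) {
            sa[j] = order[j];
        }
        return sa;
    }

    /** Symbol-by-symbol comparison of S(a) and S(b); a shorter suffix that is a prefix sorts first. */
    static int compareSuffixes(String text, int a, int b) {
        final int n = text.length();
        while (a < n && b < n) {
            char ca = text.charAt(a);
            char cb = text.charAt(b);
            if (ca != cb) {
                return Character.compare(ca, cb);
            }
            a++;
            b++;
        }
        return Integer.compare(n - a, n - b);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(build("cacao$"))); // prints [5, 1, 3, 0, 2, 4]
    }
}
```

The result holds one int per input symbol, which is the sizeof(index) × n memory figure from the conclusions.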
(selected)
Naïve suffix sorting
Algorithms:
• Three-way quicksort (Bentley, McIlroy).
• Radix sort and combinations.
Problems due to:
• suffix comparisons are not O(1) (their cost depends on the average LCP), e.g. “aaaaaaa...a$”.
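To see why the average LCP matters, the sketch below measures it for adjacent suffixes in sorted order. It is an illustration under assumptions, not part of the original material: AverageLcp, lcp and averageLcp are hypothetical names, and the suffixes are sorted with plain string comparison for brevity.

```java
import java.util.Arrays;
import java.util.Comparator;

public class AverageLcp {
    /** Length of the longest common prefix of the suffixes starting at i and j. */
    static int lcp(String text, int i, int j) {
        int len = 0;
        while (i + len < text.length() && j + len < text.length()
                && text.charAt(i + len) == text.charAt(j + len)) {
            len++;
        }
        return len;
    }

    /** Mean LCP of lexicographically adjacent suffixes (0 for fewer than two suffixes). */
    public static double averageLcp(String text) {
        final int n = text.length();
        if (n < 2) {
            return 0.0;
        }
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Naive sort: each comparison may scan up to LCP + 1 symbols.
        Arrays.sort(order, Comparator.comparing((Integer i) -> text.substring(i)));
        long sum = 0;
        for (int j = 1; j < n; j++) {
            sum += lcp(text, order[j - 1], order[j]);
        }
        return (double) sum / (n - 1);
    }

    public static void main(String[] args) {
        char[] run = new char[1000];
        Arrays.fill(run, 'a');
        System.out.println(averageLcp("cacao$"));
        System.out.println(averageLcp(new String(run) + "$"));
    }
}
```

For an input of m copies of 'a' followed by '$' the adjacent LCPs are 0, 1, ..., m − 1, so the average is (m − 1)/2 and grows linearly with the input, which is what makes each suffix comparison expensive on such data.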
SACA goals
• Possibly Θ(n).
• Fast in practice (real data).
• Lightweight (computation memory).
O(n log n): qsufsort
• Jesper Larsson, Kunihiko Sadakane; 1999.
• Prefix doubling.
          0  1  2  3  4  5  6  7  8  9 10 11
x    = [  a  b  e  a  c  a  d  a  b  e  a  $ ]
SA1  = [ 11 (0 3 5 7 10) (1 8) 4 6 (2 9) ]
ISA1 = [  5  7 11  5  8  5  9  5  7 11  5  0 ]
             ↑        ↑     ↑     ↑        ↑
             1        4     6     8       11
i      S(i)
0      abeacadabea$
3      acadabea$
5      adabea$
7      abea$
10     a$
→ ordered by
i + h      S(i)
0 + 1      b·eacadabea$
3 + 1      c·adabea$
5 + 1      d·abea$
7 + 1      b·ea$
10 + 1     $
          0  1  2  3  4  5  6  7  8  9 10 11
x    = [  a  b  e  a  c  a  d  a  b  e  a  $ ]
SA1  = [ 11 (0 3 5 7 10) (1 8) 4 6 (2 9) ]
ISA1 = [ 5 7 11 5 8 5 9 5 7 11 5 0 ]
L    = [ -1 5 2 -2 2 ]
SA2  = [ 11 10 (0 7) 3 5 (1 8) 4 6 (2 9) ]
ISA2 = [ 3 7 11 4 8 5 9 3 7 11 2 0 ]
L    = [ -2 2 -2 2 -2 2 ]
SA4  = [ 11 10 (0 7) 3 5 8 1 4 6 9 2 ]
ISA4 = [ 3 7 11 4 8 5 9 3 6 10 1 0 ]
L    = [ -2 2 -8 ]
SA8  = [ 11 10 7 0 3 5 8 1 4 6 9 2 ]
ISA8 = [ 3 7 11 4 8 5 9 2 6 10 1 0 ]
L    = [ -12 ]
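The doubling idea in the trace can be sketched as follows. This is a deliberately simplified variant and not qsufsort itself: it re-sorts the whole array with a comparison sort in every round (roughly O(n log² n)), whereas qsufsort only refines unsorted groups and tracks them with the L array. Ranks here are dense 0-based group numbers rather than the "last position in group" values shown as ISA above, and the first round already sorts by two symbols. The class name PrefixDoublingSketch is an assumption made for this example.

```java
import java.util.Arrays;
import java.util.Comparator;

public class PrefixDoublingSketch {
    /** Builds the suffix array by sorting on (rank[i], rank[i + h]) for h = 1, 2, 4, ... */
    public static int[] build(String text) {
        final int n = text.length();
        if (n == 0) {
            return new int[0];
        }
        Integer[] sa = new Integer[n];
        int[] rank = new int[n];
        int[] next = new int[n];
        for (int i = 0; i < n; i++) {
            sa[i] = i;
            rank[i] = text.charAt(i); // initial rank: the first symbol
        }
        for (int h = 1; ; h *= 2) {
            final int step = h;
            final int[] r = rank;
            // S(a) < S(b) on the first 2h symbols iff (r[a], r[a+h]) < (r[b], r[b+h]).
            Comparator<Integer> byPair = (a, b) -> {
                if (r[a] != r[b]) {
                    return Integer.compare(r[a], r[b]);
                }
                int ra = a + step < n ? r[a + step] : -1; // past the end sorts first
                int rb = b + step < n ? r[b + step] : -1;
                return Integer.compare(ra, rb);
            };
            Arrays.sort(sa, byPair);
            // Renumber: suffixes that are still tied share a rank.
            next[sa[0]] = 0;
            for (int j = 1; j < n; j++) {
                int bump = byPair.compare(sa[j - 1], sa[j]) < 0 ? 1 : 0;
                next[sa[j]] = next[sa[j - 1]] + bump;
            }
            int[] swap = rank;
            rank = next;
            next = swap;
            if (rank[sa[n - 1]] == n - 1) {
                break; // all ranks distinct: fully sorted
            }
        }
        int[] result = new int[n];
        for (int j = 0; j < n; j++) {
            result[j] = sa[j];
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(build("abeacadabea$")));
    }
}
```

On the input abeacadabea$ the sketch ends with the same order as SA8 in the trace.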
O(n) solution: skew
Divide-and-conquer:
• Bucket-sorting.
• Problem splitting and recursion.
• Cheap merge from partial SAs.
Induced copying
• Determine different types of suffixes.
• Sort a single type of suffixes.
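The slides do not say which suffix typing is meant. As one possible reading, the sketch below uses the S/L classification known from Ko and Aluru's algorithm: suffix i is S-type if S(i) < S(i + 1) and L-type otherwise. Other algorithms in this family may split suffixes differently, and the names SuffixTypes and classify are made up for this example.

```java
public class SuffixTypes {
    /** Returns one flag per suffix: 'S' if S(i) < S(i + 1), 'L' if S(i) > S(i + 1). */
    public static String classify(String text) {
        final int n = text.length();
        if (n == 0) {
            return "";
        }
        char[] type = new char[n];
        type[n - 1] = 'S'; // assumption: the last suffix (the terminator) is S-type by convention
        // A single right-to-left scan suffices: equal symbols inherit the type to their right.
        for (int i = n - 2; i >= 0; i--) {
            char c = text.charAt(i);
            char d = text.charAt(i + 1);
            if (c < d) {
                type[i] = 'S';
            } else if (c > d) {
                type[i] = 'L';
            } else {
                type[i] = type[i + 1];
            }
        }
        return new String(type);
    }

    public static void main(String[] args) {
        System.out.println(classify("cacao$")); // prints LSLSLS
    }
}
```

Once the suffixes of one type are sorted, the order of the remaining ones can be derived ("induced") from them by scanning the partially filled suffix array, which is where the name comes from.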
About the project
• Michał Nowak.
• Suffix arrays in Java.
• Different algorithms.
Algorithms:
Algorithm         Authors                       Complexity        Memory
skew              P. Sanders, J. Kärkkäinen     O(n)              10-13n
qsufsort          J. Larsson, K. Sadakane       O(n log n)        8n
deep shallow      G. Manzini                    O(n² log n)       5n
two-stage         H. Itoh, H. Tanaka            O(n² log n)       5n
impr. two-stage   S. Puglisi, M. Maniscalco     O(n² log n)       5n
bpr               K. B. Schürmann               O(n² (log n)⁻¹)   9-10n
quicksort (naïve algorithm)
JVMs (in their newest versions):
• SUN
• IBM
• JRockit (Oracle)
• Apache Harmony
• gcj, initially only.
Code conversion problems
• Pointers → indexed arrays.
• Memory reused for different types → φ (no direct Java equivalent).
• Stack-allocated structures and arrays → φ (no direct Java equivalent).
• Boundary checks penalty.
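The first item can be illustrated with a small, made-up routine (it is not taken from any of the ported algorithms): a C function that walks a pointer range becomes a Java method taking the array plus start and end indices. The names PointerToIndex and sum are hypothetical.

```java
public class PointerToIndex {
    // C original, for comparison:
    //   long sum(const int *p, const int *end) {
    //       long s = 0;
    //       while (p < end) s += *p++;
    //       return s;
    //   }

    /** Java counterpart: the pointer pair (p, end) becomes (array, from, to). */
    public static long sum(int[] array, int from, int to) {
        long s = 0;
        for (int p = from; p < to; p++) {
            // *p becomes array[p]; the access is bounds-checked unless the JIT can prove it safe.
            s += array[p];
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[] {1, 2, 3, 4, 5}, 1, 4)); // prints 9
    }
}
```

Every pointer parameter in the C sources therefore tends to turn into two parameters (array and offset), and each dereference into an indexed access that may carry a bounds check.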
JIT code dumping
private static byte bump(byte v) {
    return (byte) (v + 1);
}

public static void doLoop(byte[] array) {
    for (int i = 0; i < array.length; i++) {
        array[i] = bump(array[i]);
    }
}
Equivalent C code (gcc -S -O3):
...
.L4:
        leaq 1048576(%rsp), %rdx   // The loop’s end address.
        addb $1, (%rax)            // bump byte
        addq $1, %rax              // increase loop counter.
        cmpq %rdx, %rax
        jne .L4                    // repeat until cond. true.
Java code compiled with gcj -O3 -S Test.java:
...
.L20:
        .p2align 4,,5
        jbe .L31                   // array index out of bounds (AOOB) if max <= i
.L23:
        movslq %edi,%rax           // store current i
        addl $1, %edi              // increase loop counter.
        addb $1, 12(%rbx,%rax)     // bump byte inside the array
        cmpl %edi, %edx            // loop condition check.
        .p2align 4,,2
Java code; JIT-compiled, Sun’s HotSpot, -server:
        mov 0x10(%rsi),%r9d            // get array length
        test %r9d,%r9d                 // check if empty array.
        jle L_OUT
        xor %r11d,%r11d                // i = 0
L1:     cmp %r9d,%r11d
        jae L_OUT                      // L_OUT if i >= max
        movslq %r11d,%r10              // store current i
        movsbl 0x18(%rsi,%r10,1),%r8d
        inc %r11d                      // bump i
        inc %r8d                       // bump value at array
        mov %r8b,0x18(%rsi,%r10,1)
        cmp $0x1,%r11d                 // jump always (?)
        jl L1
But also:
        movslq %ecx,%rcx               // (unrolled loop)
        movsbl 0x19(%rbx,%rcx,1),%r9d
        inc %r9d
        mov %r9b,0x19(%rbx,%rcx,1)
        movsbl 0x1a(%rbx,%rcx,1),%r9d
        inc %r9d
        mov %r9b,0x1a(%rbx,%rcx,1)
        movsbl 0x1b(%rbx,%rcx,1),%r9d
        inc %r9d
        ...
Quiz
/** Do I look evil to you? */
public class Example10 {
    private static boolean ready;

    public static void startThread() {
        new Thread() {
            public void run() {
                try { sleep(2000); } catch (Exception e) { /* ignore */ }
                ready = true;
                System.out.println("Setting ready to true.");
            }
        }.start();
    }

    public static void main(String[] args) {
        startThread();
        while (!ready) {
            // Do nothing.
        }
        System.out.println("I’m ready.");
    }
}
> gcj -O3 -S Example10.java
...
        cmpb $0, _ZN3com10dawidweiss9debugging6simple9Example105readyE(%rip)
        jne .L13
.L16:                                  // buhu! :)
        jmp .L16
        .p2align 4,,7
.L13:
...
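The compiled loop spins forever because the flag is read once and nothing tells the compiler that another thread may change it. One conventional repair under the Java memory model is to declare the flag volatile. The sketch below is an assumption about how the quiz could be fixed, not code from the slides; Example10Fixed and waitUntilReady are made-up names.

```java
public class Example10Fixed {
    // volatile: every read in the spin loop must observe writes made by other threads,
    // so the compiler may not hoist the check out of the loop.
    private static volatile boolean ready;

    public static void startThread(final long delayMillis) {
        new Thread(() -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            ready = true;
        }).start();
    }

    /** Spins until the background thread sets the flag, then returns its value. */
    public static boolean waitUntilReady(long delayMillis) {
        ready = false;
        startThread(delayMillis);
        while (!ready) {
            // Busy-wait; the volatile read is repeated on every iteration.
        }
        return ready;
    }

    public static void main(String[] args) {
        System.out.println(waitUntilReady(2000) ? "I'm ready." : "unreachable");
    }
}
```

Busy-waiting is kept only to stay close to the quiz; a latch or a join would avoid burning a core while waiting.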
Other things of interest
• Minimizing GC activity.
Evaluation
• Random input (alphabets 4, 100, 255, variable length).
• Gauntlet and Manzini’s corpora.
• Multiple runs, warmup rounds removed.
• Wall time measured (no parallelism other than the GC).
[Figure: time on constant input, varying alphabet. Axes: alphabet size [symbols], time [s]. Series: BPR, DEEP-SHALLOW, DIVSUFSORT, NS, QSUFSORT, SKEW.]
[Figure: memory on random input, alphabet size = 255. Axes: input size [millions elements], memory [MB]. Series: BPR, DEEP-SHALLOW, DIVSUFSORT, NS, QSUFSORT, SKEW.]
[Figure: time on random input, alphabet size = 255. Axes: input size [millions elements], time [s]. Series: BPR, DEEP-SHALLOW, DIVSUFSORT, NS, QSUFSORT, SKEW.]
[Figure: time on random input, alphabet size = 4. Axes: input size [millions elements], time [s]. Series: BPR, DEEP-SHALLOW, DIVSUFSORT, NS, QSUFSORT, SKEW.]
[Figure: time [s] for SKEW, DIVSUFSORT, BPR, QSUFSORT.]
[Figure: time [s] per JVM (sun, ibm, jrockit, harmony) for BPR, DIVSUFSORT, QSUFSORT, SKEW.]
Summary
• Suffix arrays are worth remembering.
•