SUFFIX ARRAYS musings on the data structure, applications and implementation in Java

(1)

Michał Nowak, Dawid Weiss

(2)

Notation

• Let Σ be an alphabet;
• Σ is finite and non-empty;
• symbols x_a ∈ Σ, where a = 1 … |Σ|, are ordered;
• O(x_b − x_c) = 1 (two symbols can be compared in constant time);
• $ is a special symbol smaller than any x ∈ Σ.
• Let X be a sequence of n symbols, where X[i] ∈ Σ and i = 0 … n − 1.
• Let S(i), or simply i, denote the sub-sequence (suffix) of X starting at position i and ending at n − 1.

(4)

Examples

• Zero-terminated US-ASCII strings (Latin letters).
• DNA sequences.
• Integer-coded sequences of words.

(5)

(recall)

(6)

suffix S(i)      i
c a c a o $      0
a c a o $        1
c a o $          2
a o $            3
o $              4
$                5

(8)

(9)

(10)

(11)

(12)

Compacting (pocket trees)

1. Move labels to edges,

(13)

(14)

Properties of suffix trees

• Maximum 2n nodes (!).
• Elegant to program with.

(15)

(16)

(17)

(18)

?

(19)

alibaba.taliban.

(20)

(21)

What do you think,

do geese see God?

(22)

whatdoyouthinkdogeeseseegod.dogeeseseegodkniht...

(23)

There are many more ST applications.

Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press.

(24)

Problems with suffix trees

• Alphabet size.

(25)

(26)

(27)

Ordered depth-first ST traversal:

$
a · c a o $
a · o $
c a · c a o $
c a · o $
o $

(28)

suffix S(i)      i
c a c a o $      0
a c a o $        1
c a o $          2
a o $            3
o $              4
$                5

→ sort →

suffix S(i)      SA[j]
$                5
a · c a o $      1
a · o $          3
c a · c a o $    0
c a · o $        2
o $              4

Conclusions:

• A suffix array is an array of indices of sorted suffixes.
• A suffix array is always a permutation of suffix indices.
• Memory consumption: sizeof(index) × n.
• Suffix arrays and suffix trees can simulate each other (Abouelhoda et al., 2004).
• Suffix sorting can be done in O(n).
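
The definition above can be tried out directly. The following is a minimal sketch, not one of the algorithms discussed in this talk: it simply sorts the suffix start indices with a symbol-by-symbol comparator. The class and method names (`NaiveSuffixArray`, `build`, `compareSuffixes`) are made up for this illustration.

```java
import java.util.Arrays;

/** Illustrative only: builds a suffix array by sorting suffix start indices. */
final class NaiveSuffixArray {

    /** Returns SA such that the suffix starting at SA[j] is the j-th smallest suffix. */
    static int[] build(String text) {
        int n = text.length();
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> compareSuffixes(text, a, b));
        int[] sa = new int[n];
        for (int j = 0; j < n; j++) {
            sa[j] = order[j];
        }
        return sa;
    }

    /** Compares the suffixes starting at a and b symbol by symbol. */
    static int compareSuffixes(String text, int a, int b) {
        int n = text.length();
        while (a < n && b < n) {
            int diff = text.charAt(a) - text.charAt(b);
            if (diff != 0) {
                return diff;
            }
            a++;
            b++;
        }
        // One suffix ran out of symbols: the shorter suffix is the smaller one.
        return b - a;
    }

    public static void main(String[] args) {
        // '$' has a smaller character code than the Latin letters, as the notation requires.
        System.out.println(Arrays.toString(build("cacao$"))); // prints [5, 1, 3, 0, 2, 4]
    }
}
```

The result is a permutation of 0 … n − 1 and matches the SA[j] column of the table above.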

(29)

(30)

(33)

(34)

(35)

(36)

(selected)

(37)

Naïve suffix sorting

Algorithms:

• Three-way quicksort (Bentley, McIlroy).
• Radix sort and combinations.

Problems due to:

• suffix comparisons that are not O(1): their cost depends on the average LCP, e.g. on inputs such as “aaaaaaa...a$”.
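
To make the comparison cost concrete, the sketch below counts how many symbol comparisons a plain comparison sort spends on a highly repetitive input versus a random one of the same length. It is an illustration under assumptions: it uses `Arrays.sort` rather than a three-way quicksort, and the class name, helper names, input length and seed are all hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative only: counts symbol comparisons made while sorting suffixes naively. */
final class SuffixComparisonCost {
    static long symbolComparisons;

    static int compareSuffixes(byte[] x, int a, int b) {
        int n = x.length;
        while (a < n && b < n) {
            symbolComparisons++;
            if (x[a] != x[b]) {
                return x[a] - x[b];
            }
            a++;
            b++;
        }
        return b - a; // the shorter suffix is smaller
    }

    /** Sorts all suffixes of x and returns the number of symbol comparisons used. */
    static long countForInput(byte[] x) {
        symbolComparisons = 0;
        Integer[] order = new Integer[x.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> compareSuffixes(x, a, b));
        return symbolComparisons;
    }

    /** Builds "aaa...a$" of length n. */
    static byte[] repetitive(int n) {
        byte[] x = new byte[n];
        Arrays.fill(x, (byte) 'a');
        x[n - 1] = (byte) '$';
        return x;
    }

    /** Builds n - 1 random lowercase letters followed by '$'. */
    static byte[] random(int n, long seed) {
        Random rnd = new Random(seed);
        byte[] x = new byte[n];
        for (int i = 0; i < n - 1; i++) {
            x[i] = (byte) ('a' + rnd.nextInt(26));
        }
        x[n - 1] = (byte) '$';
        return x;
    }

    public static void main(String[] args) {
        int n = 2000;
        System.out.println("repetitive: " + countForInput(repetitive(n)));
        System.out.println("random:     " + countForInput(random(n, 42L)));
    }
}
```

On the repetitive input almost every comparison has to scan a long common prefix before it finds a difference, so the count grows roughly quadratically with n, while on random input most comparisons stop after a few symbols.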

(39)

SACA goals

• Possibly Θ(n).
• Fast in practice (real data).
• Lightweight (computation memory).

(40)

(41)

O(n log n): qsufsort

• Jesper Larsson, Kunihiko Sadakane; 1999.
• Prefix doubling.

(42)

          0  1  2  3  4  5  6  7  8  9 10 11
x    = [  a  b  e  a  c  a  d  a  b  e  a  $ ]
SA₁  = [ 11 (0 3 5 7 10) (1 8) 4 6 (2 9) ]
ISA₁ = [  5  7 11  5  8  5  9  5  7 11  5  0 ]
             ↑        ↑     ↑     ↑        ↑
             1        4     6     8       11

i     S(i)
0     abeacadabea$
3     acadabea$
5     adabea$
7     abea$
10    a$

→ ordered by

i + h     S(i)
0 + 1     b·eacadabea$
3 + 1     c·adabea$
5 + 1     d·abea$
7 + 1     b·ea$
10 + 1    $

(43)

(44)

(45)

(46)

          0  1  2  3  4  5  6  7  8  9 10 11
x    = [  a  b  e  a  c  a  d  a  b  e  a  $ ]

SA₁  = [ 11 (0 3 5 7 10) (1 8) 4 6 (2 9) ]
ISA₁ = [ 5 7 11 5 8 5 9 5 7 11 5 0 ]
L    = [ -1 5 2 -2 2 ]

SA₂  = [ 11 10 (0 7) 3 5 (1 8) 4 6 (2 9) ]
ISA₂ = [ 3 7 11 4 8 5 9 3 7 11 2 0 ]
L    = [ -2 2 -2 2 -2 2 ]

SA₄  = [ 11 10 (0 7) 3 5 8 1 4 6 9 2 ]
ISA₄ = [ 3 7 11 4 8 5 9 3 6 10 1 0 ]
L    = [ -2 2 -8 ]

SA₈  = [ 11 10 7 0 3 5 8 1 4 6 9 2 ]
ISA₈ = [ 3 7 11 4 8 5 9 2 6 10 1 0 ]
L    = [ -12 ]
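
The prefix-doubling idea traced above can be sketched in a few lines of Java. This is a simplified illustration, not qsufsort itself: it re-sorts the whole array in every round with `Arrays.sort` instead of tracking sorted and unsorted groups in an L array, and it numbers groups by their first position rather than their last. The class name `PrefixDoublingSketch` and its method are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Illustrative only: suffix sorting by prefix doubling with a general-purpose sort. */
final class PrefixDoublingSketch {

    static int[] build(String text) {
        final int n = text.length();
        if (n == 0) {
            return new int[0];
        }
        Integer[] order = new Integer[n];
        int[] rank = new int[n]; // rank[i]: group of suffix i by its first h symbols
        int[] next = new int[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
            rank[i] = text.charAt(i);
        }
        for (int h = 1; ; h *= 2) {
            final int step = h;
            final int[] r = rank;
            // Sort key for suffix i: (rank[i], rank[i + h]); a missing second part sorts first.
            Comparator<Integer> byPair = (a, b) -> {
                if (r[a] != r[b]) {
                    return Integer.compare(r[a], r[b]);
                }
                int ra = a + step < n ? r[a + step] : -1;
                int rb = b + step < n ? r[b + step] : -1;
                return Integer.compare(ra, rb);
            };
            Arrays.sort(order, byPair);
            // Renumber groups: suffixes with equal keys stay in the same group.
            next[order[0]] = 0;
            for (int j = 1; j < n; j++) {
                int bump = byPair.compare(order[j - 1], order[j]) < 0 ? 1 : 0;
                next[order[j]] = next[order[j - 1]] + bump;
            }
            int[] swap = rank;
            rank = next;
            next = swap;
            if (rank[order[n - 1]] == n - 1) {
                break; // every suffix is in its own group: fully sorted
            }
        }
        int[] sa = new int[n];
        for (int j = 0; j < n; j++) {
            sa[j] = order[j];
        }
        return sa;
    }

    public static void main(String[] args) {
        // prints [11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2], the SA₈ row of the trace
        System.out.println(Arrays.toString(build("abeacadabea$")));
    }
}
```

Each round doubles the number of symbols the ranks account for, so at most about log₂ n rounds are needed; re-sorting everything in each round is what makes this sketch slower than the O(n log n) bound of qsufsort.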

(47)

O(n) solution: skew

Divide-and-conquer:

• Bucket-sorting.
• Problem splitting and recursion.
• Cheap merge from partial SAs.

(48)

(49)

(50)

Induced copying

• Determine different types of suffixes.
• Sort a single type of suffixes.
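
The first step can be illustrated with one common classification, which is an assumption here since the slide does not say which types are meant: suffix i is S-type if S(i) < S(i + 1) and L-type otherwise, as used for example in Ko and Aluru's algorithm. The class name `SuffixTypes` is hypothetical, and only the classification step is shown, not the sorting or inducing steps.

```java
/** Illustrative only: classifies every suffix as S-type or L-type in one right-to-left pass. */
final class SuffixTypes {

    /** Returns one mark per suffix: 'S' if S(i) < S(i + 1), 'L' if S(i) > S(i + 1). */
    static String classify(String text) {
        int n = text.length();
        if (n == 0) {
            return "";
        }
        char[] type = new char[n];
        type[n - 1] = 'S'; // convention: the last suffix (the sentinel) is S-type
        for (int i = n - 2; i >= 0; i--) {
            char current = text.charAt(i);
            char following = text.charAt(i + 1);
            if (current < following) {
                type[i] = 'S';
            } else if (current > following) {
                type[i] = 'L';
            } else {
                type[i] = type[i + 1]; // equal symbols: the next suffix decides
            }
        }
        return new String(type);
    }

    public static void main(String[] args) {
        System.out.println(classify("cacao$")); // prints LSLSLS
    }
}
```

The pass needs only one symbol comparison per position, which is why the classification itself is cheap; the real work lies in sorting one type and deriving the order of the other from it.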

(51)

(52)

(53)

(54)

About the project

• Michał Nowak.
• Suffix arrays in Java.
• Different algorithms.

(55)

Algorithms:

Algorithm          Authors                      Complexity         Memory
skew               P. Sanders, J. Kärkkäinen    O(n)               10-13n
qsufsort           J. Larsson, K. Sadakane      O(n log n)         8n
deep shallow       G. Manzini                   O(n² log n)        5n
two-stage          H. Itoh, H. Tanaka           O(n² log n)        5n
impr. two-stage    S. Puglisi, M. Maniscalco    O(n² log n)        5n
bpr                K. B. Schürmann              O(n² (log n)⁻¹)    9-10n
quicksort          (naïve algorithm)

JVMs (in their newest versions):

• SUN
• IBM
• JRockit (Oracle)
• Apache Harmony
• gcj, initially only.

(56)

Code conversion problems

• Pointers → indexed arrays.
• Memory reused for different types → φ (no direct Java equivalent).
• Stack-allocated structures and arrays → φ (no direct Java equivalent).
• Boundary checks penalty.
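
As a small illustration of the first and last points, here is how a typical C pointer loop might look after conversion. The C fragment in the comment and the Java names (`PointerToIndex`, `commonPrefixLength`) are made up for this sketch and are not taken from any of the ported algorithms.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: a C pointer walk rewritten with a base array and integer indices. */
final class PointerToIndex {

    /*
     * C original, for comparison (relies on the terminator to stop):
     *
     *   const unsigned char *start = p;
     *   while (*p == *q) { p++; q++; }
     *   return p - start;
     */
    static int commonPrefixLength(byte[] x, int p, int q) {
        int start = p;
        // Each x[p] and x[q] access is bounds-checked unless the JIT can prove it safe.
        while (p < x.length && q < x.length && x[p] == x[q]) {
            p++;
            q++;
        }
        return p - start;
    }

    public static void main(String[] args) {
        byte[] x = "abeacadabea$".getBytes(StandardCharsets.US_ASCII);
        System.out.println(commonPrefixLength(x, 0, 7)); // prints 4 ("abea")
    }
}
```

Every pointer becomes a pair of (array reference, index), so functions that took one pointer in C tend to take two arguments in Java.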

(62)

(63)

JIT code dumping

    private static byte bump(byte v) {
      return (byte) (v + 1);
    }

    public static void doLoop(byte[] array) {
      for (int i = 0; i < array.length; i++) {
        array[i] = bump(array[i]);
      }
    }

(64)

Equivalent C code (gcc -S -O3):

    ...
    .L4:
      leaq 1048576(%rsp), %rdx     // the loop’s end address
      addb $1, (%rax)              // bump byte
      addq $1, %rax                // increase loop counter
      cmpq %rdx, %rax
      jne .L4                      // repeat until cond. true

Java code compiled with gcj -O3 -S Test.java:

    ...
    .L20:
      .p2align 4,,5
      jbe .L31                     // array index out of bounds if max <= i
    .L23:
      movslq %edi,%rax             // store current i
      addl $1, %edi                // increase loop counter
      addb $1, 12(%rbx,%rax)       // bump byte inside the array
      cmpl %edi, %edx              // loop condition check
      .p2align 4,,2

(65)

Java code; JIT-compiled, SUN’s HotSpot, -server:

      mov 0x10(%rsi),%r9d          // get array length
      test %r9d,%r9d               // check if empty array
      jle L_OUT
      xor %r11d,%r11d              // i = 0
    L1:
      cmp %r9d,%r11d
      jae L_OUT                    // L_OUT if i >= max
      movslq %r11d,%r10            // store current i
      movsbl 0x18(%rsi,%r10,1),%r8d
      inc %r11d                    // bump i
      inc %r8d                     // bump value at array
      mov %r8b,0x18(%rsi,%r10,1)
      cmp $0x1,%r11d               // jump always (?)
      jl L1

But also:

      movslq %ecx,%rcx             // (unrolled loop)
      movsbl 0x19(%rbx,%rcx,1),%r9d
      inc %r9d
      mov %r9b,0x19(%rbx,%rcx,1)
      movsbl 0x1a(%rbx,%rcx,1),%r9d
      inc %r9d
      mov %r9b,0x1a(%rbx,%rcx,1)
      movsbl 0x1b(%rbx,%rcx,1),%r9d
      inc %r9d
      ...

(66)

Quiz

    /** Do I look evil to you? */
    public class Example10 {
      private static boolean ready;

      public static void startThread() {
        new Thread() {
          public void run() {
            try { sleep(2000); } catch (Exception e) { /* ignore */ }
            ready = true;
            System.out.println("Setting ready to true.");
          }
        }.start();
      }

      public static void main(String[] args) {
        startThread();
        while (!ready) {
          // Do nothing.
        }
        System.out.println("I’m ready.");
      }
    }

(67)

    > gcj -O3 -S Example10.java

    ...
      cmpb $0, _ZN3com10dawidweiss9debugging6simple9Example105readyE(%rip)
      jne .L13
    .L16:                          // buhu! :)
      jmp .L16
      .p2align 4,,7
    .L13:
    ...
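
The compiler is allowed to do this: `ready` is an ordinary field, so the check may be hoisted out of the loop and the flag is read only once. One common remedy, shown here as a sketch rather than as the talk's own answer, is to declare the flag `volatile`. The class name `Example10Volatile` and the `delayMillis` parameter are hypothetical.

```java
/** Illustrative only: the quiz program with a volatile flag, so the loop terminates. */
final class Example10Volatile {
    // volatile: each read in the loop must observe the most recent write to the flag.
    private static volatile boolean ready;

    static void startThread(long delayMillis) {
        new Thread(() -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            ready = true;
        }).start();
    }

    /** Spins until the other thread sets the flag, then returns its value. */
    static boolean waitUntilReady() {
        while (!ready) {
            // Busy-wait; the volatile read cannot be hoisted out of the loop.
        }
        return ready;
    }

    public static void main(String[] args) {
        startThread(2000);
        waitUntilReady();
        System.out.println("I'm ready.");
    }
}
```

Busy-waiting is kept only to stay close to the quiz; a `CountDownLatch` or `Thread.join()` would avoid burning a CPU core while waiting.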

(69)

Other things of interest

• Minimizing GC activity.

(70)

(71)

Evaluation

• Random input (alphabets 4, 100, 255, variable length).
• Gauntlet and Manzini’s corpora.
• Multiple runs, warmup rounds removed.
• Wall time measured (no parallelism other than the GC).
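
The measurement scheme in the last two points can be sketched as follows. This is an assumed harness for illustration, not the one used in the project; the class name, method name and round counts are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative only: wall-time measurement with discarded warmup rounds. */
final class WallTimeBenchmark {

    /** Runs the task warmupRounds + measuredRounds times; returns mean wall time in ms. */
    static double averageMillis(Runnable task, int warmupRounds, int measuredRounds) {
        for (int i = 0; i < warmupRounds; i++) {
            task.run(); // lets the JIT compile hot code; timing is discarded
        }
        long totalNanos = 0;
        for (int i = 0; i < measuredRounds; i++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1e6 / measuredRounds;
    }

    public static void main(String[] args) {
        int[] input = new Random(1).ints(1_000_000).toArray();
        // Placeholder workload: sort a fresh copy of the input in every round.
        double ms = averageMillis(() -> Arrays.sort(input.clone()), 5, 10);
        System.out.printf("average: %.2f ms%n", ms);
    }
}
```

Dropping the warmup rounds matters on a JVM because the first executions run interpreted or partially compiled code and would otherwise dominate the average.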

(72)

[Figure: time on constant input, varying alphabet. x-axis: alphabet size [symbols]; y-axis: time [s]. Series: BPR, DEEP-SHALLOW, DIVSUFSORT, NS, QSUFSORT, SKEW.]

(73)

[Figure: memory on random input, alphabet size = 255. x-axis: input size [millions elements]; y-axis: memory [MB]. Series: BPR, DEEP-SHALLOW, DIVSUFSORT, NS, QSUFSORT, SKEW.]

(74)

[Figure: time on random input, alphabet size = 255. x-axis: input size [millions elements]; y-axis: time [s]. Series: BPR, DEEP-SHALLOW, DIVSUFSORT, NS, QSUFSORT, SKEW.]

(75)

[Figure: time on random input, alphabet size = 4. x-axis: input size [millions elements]; y-axis: time [s]. Series: BPR, DEEP-SHALLOW, DIVSUFSORT, NS, QSUFSORT, SKEW.]

(76)

(77)

[Figure: time [s] for SKEW, DIVSUFSORT, BPR and QSUFSORT.]

(78)

[Figure: time [s] for SKEW, DIVSUFSORT, BPR and QSUFSORT.]

(79)

[Figure: time [s] per JVM (sun, ibm, jrockit, harmony). Series: BPR, DIVSUFSORT, QSUFSORT, SKEW.]

(80)

Summary

• Suffix arrays are worth remembering.

•