SUFFIX ARRAYS musings on the data structure, applications and implementation in Java

Academic year: 2021

(1)

SUFFIX ARRAYS

musings on the data structure, applications and implementation in Java

Michał Nowak, Dawid Weiss

(2)

Notation

Let Σ be an alphabet:

Σ is finite and non-empty;

symbols x_a ∈ Σ, where a = 1 . . . |Σ|, are ordered;

O(x_b − x_c) = 1, i.e. two symbols are compared in constant time;

$ is a special symbol smaller than any x ∈ Σ.

Let X be a sequence of n symbols, where X[i] ∈ Σ and i = 0 . . . n − 1.

Let S(i), or simply i, be the sub-sequence (suffix) of X starting at position i and ending at n − 1.

(4)

Examples

Zero-terminated US-ASCII strings (Latin letters).

DNA sequences.

Integer-coded sequences of words.

(5)

(recall)

(6)

suffix S(i)      i
c a c a o $      0
a c a o $        1
c a o $          2
a o $            3
o $              4
$                5

(8)
(9)
(10)
(11)
(12)

Compacting (pocket trees)

1. Move labels to edges,

(13)
(14)

Properties of suffix trees

Maximum 2n nodes (!).

Elegant to program with.

(15)
(16)
(17)
(18)

?

(19)

alibaba.taliban.

(20)
(21)

What do you think,

do geese see God?

(22)

whatdoyouthinkdogeeseseegod.dogeeseseegodkniht...

(23)

There are many more ST applications.

Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press.

(24)

Problems with suffix trees

Alphabet size.

(25)
(26)
(27)

Ordered depth-first ST traversal.

$
a · c a o $
a · o $
c a · c a o $
c a · o $
o $

(28)

suffix S(i)      i
c a c a o $      0
a c a o $        1
c a o $          2
a o $            3
o $              4
$                5

→ sort →

suffix S(i)      SA[j]
$                5
a · c a o $      1
a · o $          3
c a · c a o $    0
c a · o $        2
o $              4

Conclusions:

• A suffix array is an array of indices of sorted suffixes.
• A suffix array is always a permutation of suffix indices.
• Memory consumption: sizeof(index) × n.
• Suffix arrays and suffix trees can simulate each other (Abouelhoda et al., 2004).
• Suffix sorting can be done in O(n).
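The sorting step above can be reproduced with a few lines of Java. This is only a minimal sketch of the definition, not one of the algorithms discussed later: it sorts suffix indices with the library sort and compares suffixes as substrings. The class name `NaiveSuffixArray` and the method `build` are made up for this illustration, and the sketch assumes the terminator `$` sorts below every other symbol, which holds for ASCII letters.

```java
import java.util.Arrays;

final class NaiveSuffixArray {
    /** Returns the start indices of all suffixes of x in lexicographic order. */
    static int[] build(String x) {
        Integer[] order = new Integer[x.length()];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Each comparison materializes two substrings and may inspect O(n) symbols.
        Arrays.sort(order, (a, b) -> x.substring(a).compareTo(x.substring(b)));

        int[] sa = new int[order.length];
        for (int j = 0; j < sa.length; j++) {
            sa[j] = order[j];
        }
        return sa;
    }

    public static void main(String[] args) {
        // '$' (0x24) is smaller than any lowercase letter, as the notation requires.
        System.out.println(Arrays.toString(build("cacao$"))); // [5, 1, 3, 0, 2, 4]
    }
}
```

The result is a permutation of 0 . . . n − 1 and takes one index per input symbol, matching the conclusions listed above.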

(36)

(selected)

(37)

Naïve suffix sorting

Algorithms:

Three-way quicksort (Bentley, McIlroy).

Radix sort and combinations.

Problems due to:

suffix comparisons are not O(1) (their cost grows with the average LCP),

degenerate inputs such as “aaaaaaa...a$”.
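To make the comparison cost concrete, the sketch below counts how many symbols have to be inspected before two suffixes can be ordered. The class and method names are hypothetical; the point is only that the count equals the length of the common prefix plus one, so an input made of a single repeated symbol is the worst case.

```java
final class SuffixCompareCost {
    /** Number of symbol comparisons needed to order suffixes i and j of x (i != j). */
    static int symbolComparisons(String x, int i, int j) {
        int count = 0;
        while (i < x.length() && j < x.length()) {
            count++;
            if (x.charAt(i) != x.charAt(j)) {
                break; // first mismatch decides the order
            }
            i++;
            j++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(symbolComparisons("aaaaaaa$", 0, 1)); // 7: common prefix of 6, then 'a' vs '$'
        System.out.println(symbolComparisons("cacao$", 0, 1));   // 1: 'c' vs 'a' differ immediately
    }
}
```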

(39)

SACA goals

Possibly Θ(n).

Fast in practice (real data).

Lightweight (computation memory).

(40)
(41)

O(n log n): qsufsort

Jesper Larsson, Kunihiko Sadakane; 1999.

Prefix doubling.

(42)

index = 0 1 2 3 4 5 6 7 8 9 10 11
x     = [ a b e a c a d a b e a $ ]
SA1   = [ 11 (0 3 5 7 10) (1 8) 4 6 (2 9) ]
ISA1  = [ 5 7 11 5 8 5 9 5 7 11 5 0 ]
↑ marks ISA1 at positions 1, 4, 6, 8, 11

i     S(i)
0     abeacadabea$
3     acadabea$
5     adabea$
7     abea$
10    a$

→ ordered by

i + h     S(i)
0 + 1     b·eacadabea$
3 + 1     c·adabea$
5 + 1     d·abea$
7 + 1     b·ea$
10 + 1    $

(46)

index = 0 1 2 3 4 5 6 7 8 9 10 11
x     = [ a b e a c a d a b e a $ ]

SA1   = [ 11 (0 3 5 7 10) (1 8) 4 6 (2 9) ]
ISA1  = [ 5 7 11 5 8 5 9 5 7 11 5 0 ]
L     = [ -1 5 2 -2 2 ]

SA2   = [ 11 10 (0 7) 3 5 (1 8) 4 6 (2 9) ]
ISA2  = [ 3 7 11 4 8 5 9 3 7 11 2 0 ]
L     = [ -2 2 -2 2 -2 2 ]

SA4   = [ 11 10 (0 7) 3 5 8 1 4 6 9 2 ]
ISA4  = [ 3 7 11 4 8 5 9 3 6 10 1 0 ]
L     = [ -2 2 -8 ]

SA8   = [ 11 10 7 0 3 5 8 1 4 6 9 2 ]
ISA8  = [ 3 7 11 4 8 5 9 2 6 10 1 0 ]
L     = [ -12 ]
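The doubling idea traced above can be sketched in Java as follows. This is a simplified textbook variant, not qsufsort itself: it re-sorts the whole array with a comparison sort in every round (O(n log² n) overall) and does not maintain the group-length array L or skip already sorted groups. The class name `PrefixDoubling` is an assumption made for the example.

```java
import java.util.Arrays;
import java.util.Comparator;

final class PrefixDoubling {
    /** Suffix array by prefix doubling: ranks for 2h symbols are built from ranks for h symbols. */
    static int[] suffixArray(String x) {
        final int n = x.length();
        if (n == 0) {
            return new int[0];
        }
        Integer[] sa = new Integer[n];
        int[] rank = new int[n];
        int[] next = new int[n];
        for (int i = 0; i < n; i++) {
            sa[i] = i;
            rank[i] = x.charAt(i); // initial rank: the first symbol only
        }
        for (int h = 1; ; h *= 2) {
            final int step = h;
            final int[] r = rank;
            // Order by the pair (rank of S(i), rank of S(i + h)); a missing second half sorts first.
            Comparator<Integer> byPair = (a, b) -> {
                if (r[a] != r[b]) {
                    return Integer.compare(r[a], r[b]);
                }
                int ra = a + step < n ? r[a + step] : -1;
                int rb = b + step < n ? r[b + step] : -1;
                return Integer.compare(ra, rb);
            };
            Arrays.sort(sa, byPair);

            // Suffixes that still compare equal share a rank (an unsorted group).
            next[sa[0]] = 0;
            for (int j = 1; j < n; j++) {
                next[sa[j]] = next[sa[j - 1]] + (byPair.compare(sa[j - 1], sa[j]) < 0 ? 1 : 0);
            }
            int[] swap = rank;
            rank = next;
            next = swap;
            if (rank[sa[n - 1]] == n - 1) {
                break; // all ranks distinct: every group has size one
            }
        }
        int[] result = new int[n];
        for (int j = 0; j < n; j++) {
            result[j] = sa[j];
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(suffixArray("abeacadabea$")));
        // [11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2], the same order as SA8 above
    }
}
```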

(47)

O(n) solution: skew

Divide-and-conquer:

Bucket-sorting.

Problem splitting and recursion.

Cheap merge from partial SAs.

(48)
(49)
(50)

Induced copying

Determine different types of suffixes.

Sort a single type of suffixes.
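The slides do not say which suffix types are meant. One common choice, used here purely as an assumption, is the S/L classification known from Ko and Aluru's work: suffix i is S-type if it is smaller than suffix i + 1 and L-type if it is larger. The sketch below (hypothetical class `SuffixTypes`) computes the types in a single right-to-left pass; sorting one type and inducing the order of the other is not shown.

```java
final class SuffixTypes {
    /**
     * Returns 'S' or 'L' for every position of x.
     * Assumes x ends with a unique terminator smaller than all other symbols.
     */
    static char[] classify(String x) {
        int n = x.length();
        char[] type = new char[n];
        if (n == 0) {
            return type;
        }
        type[n - 1] = 'S'; // the terminator suffix is S-type by convention
        for (int i = n - 2; i >= 0; i--) {
            char c = x.charAt(i);
            char next = x.charAt(i + 1);
            if (c < next) {
                type[i] = 'S';
            } else if (c > next) {
                type[i] = 'L';
            } else {
                type[i] = type[i + 1]; // equal symbols: the next suffix decides
            }
        }
        return type;
    }

    public static void main(String[] args) {
        System.out.println(new String(classify("cacao$"))); // LSLSLS
    }
}
```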

(51)
(52)
(53)
(54)

About the project

Michał Nowak.

Suffix arrays in Java.

Different algorithms.

(55)

Algorithms:

Algorithm          Authors                      Complexity          Memory
skew               P. Sanders, J. Kärkkäinen    O(n)                10-13n
qsufsort           J. Larsson, K. Sadakane      O(n log n)          8n
deep shallow       G. Manzini                   O(n² log n)         5n
two-stage          H. Itoh, H. Tanaka           O(n² log n)         5n
impr. two-stage    S. Puglisi, M. Maniscalco    O(n² log n)         5n
bpr                K. B. Schürmann              O(n² (log n)⁻¹)     9-10n
quicksort          (naïve algorithm)

JVMs (in their newest versions):

• SUN
• IBM
• JRockit (Oracle)
• Apache Harmony
• gcj, initially only.

(56)

Code conversion problems

Pointers → indexed arrays.

Memory reused for different types → φ.

Stack-allocated structures and arrays → φ.

Boundary checks penalty.
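As a small illustration of the first point, a C routine that walks two pointers through a buffer becomes a Java method that takes the array plus two indices. The C fragment in the comment and the class `PointerToIndex` are invented for this example and are not taken from any of the converted algorithms.

```java
final class PointerToIndex {
    // Hypothetical C original:
    //   void reverse(int *first, int *last) {
    //       while (first < last) { int t = *first; *first++ = *last; *last-- = t; }
    //   }

    /** Reverses a[first..last] in place; the two pointers become indices into a. */
    static void reverse(int[] a, int first, int last) {
        while (first < last) {
            int t = a[first];
            a[first++] = a[last]; // every access is bounds-checked unless the JIT proves it safe
            a[last--] = t;
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5};
        reverse(a, 1, 3);
        System.out.println(java.util.Arrays.toString(a)); // [1, 4, 3, 2, 5]
    }
}
```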

(62)
(63)

JIT code dumping

private static byte bump(byte v) {
    return (byte) (v + 1);
}

public static void doLoop(byte[] array) {
    for (int i = 0; i < array.length; i++) {
        array[i] = bump(array[i]);
    }
}

(64)

Equivalent C code (gcc -S -O3):

...
.L4:
    leaq 1048576(%rsp), %rdx    // The loop’s end address.
    addb $1, (%rax)             // bump byte
    addq $1, %rax               // increase loop counter.
    cmpq %rdx, %rax
    jne .L4                     // repeat until cond. true.

Java code compiled with gcj -O3 -S Test.java:

...
.L20:
    .p2align 4,,5
    jbe .L31                    // AOOB if max <= i
.L23:
    movslq %edi,%rax            // store current i
    addl $1, %edi               // increase loop counter.
    addb $1, 12(%rbx,%rax)      // bump byte inside the array
    cmpl %edi, %edx             // loop condition check.
    .p2align 4,,2

(65)

Java code; JIT-compiled, SUN’s HotSpot, -server:

    mov 0x10(%rsi),%r9d             // get array length
    test %r9d,%r9d                  // check if empty array.
    jle L_OUT
    xor %r11d,%r11d                 // i = 0
L1: cmp %r9d,%r11d
    jae L_OUT                       // L_OUT if i >= max
    movslq %r11d,%r10               // store current i
    movsbl 0x18(%rsi,%r10,1),%r8d
    inc %r11d                       // bump i
    inc %r8d                        // bump value at array
    mov %r8b,0x18(%rsi,%r10,1)
    cmp $0x1,%r11d                  // jump always (?)
    jl L1

But also:

    movslq %ecx,%rcx                // (Unrolled loop)
    movsbl 0x19(%rbx,%rcx,1),%r9d
    inc %r9d
    mov %r9b,0x19(%rbx,%rcx,1)
    movsbl 0x1a(%rbx,%rcx,1),%r9d
    inc %r9d
    mov %r9b,0x1a(%rbx,%rcx,1)
    movsbl 0x1b(%rbx,%rcx,1),%r9d
    inc %r9d
    ...

(66)

Quiz

/** Do I look evil to you? */
public class Example10 {
    private static boolean ready;

    public static void startThread() {
        new Thread() {
            public void run() {
                try { sleep(2000); } catch (Exception e) { /* ignore */ }
                ready = true;
                System.out.println("Setting ready to true.");
            }
        }.start();
    }

    public static void main(String[] args) {
        startThread();
        while (!ready) {
            // Do nothing.
        }
        System.out.println("I’m ready.");
    }
}

(67)

> gcj -O3 -S Example10.java

...
    cmpb $0, _ZN3com10dawidweiss9debugging6simple9Example105readyE(%rip)
    jne .L13
.L16:                               // buhu! :)
    jmp .L16
    .p2align 4,,7
.L13:
...
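The listing shows the flag being read once, followed by an unconditional jump to itself: because `ready` is a plain field, the compiler is allowed to hoist the read out of the loop. One possible fix, sketched here under a hypothetical class name `ReadyFlag` with a shorter delay, is to declare the flag `volatile`, so that each loop iteration performs a real read that can see the other thread's write.

```java
final class ReadyFlag {
    // volatile: reads may not be hoisted out of the loop, and writes become visible to other threads.
    private static volatile boolean ready;

    /** Starts a thread that sets the flag after delayMillis, then spins until the write is seen. */
    static boolean waitUntilReady(long delayMillis) {
        ready = false;
        Thread setter = new Thread(() -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            ready = true;
        });
        setter.start();
        while (!ready) {
            // Busy-wait; terminates because the flag is re-read on every iteration.
        }
        return ready;
    }

    public static void main(String[] args) {
        System.out.println(waitUntilReady(200) ? "I'm ready." : "unreachable");
    }
}
```

Busy-waiting is kept only to mirror the quiz; a `CountDownLatch` or `Thread.join()` would avoid spinning altogether.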

(69)

Other things of interest

• Minimizing GC activity.

(70)
(71)

Evaluation

Random input (alphabets 4, 100, 255, variable length).

Gauntlet and Manzini’s corpora.

Multiple runs, warmup rounds removed.

Wall time measured (no parallelism other than the GC).
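A measurement loop along these lines could look like the sketch below. It is not the harness used for the results that follow; the class `WallTimeBenchmark`, the round counts and the use of `Arrays.sort` as a stand-in workload are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Random;

final class WallTimeBenchmark {
    /**
     * Runs task (warmupRounds + measuredRounds) times and returns the wall times,
     * in nanoseconds, of the measured rounds only.
     */
    static long[] measure(Runnable task, int warmupRounds, int measuredRounds) {
        long[] times = new long[measuredRounds];
        for (int round = 0; round < warmupRounds + measuredRounds; round++) {
            long start = System.nanoTime();
            task.run();
            long elapsed = System.nanoTime() - start;
            if (round >= warmupRounds) {
                times[round - warmupRounds] = elapsed; // warmup rounds are discarded
            }
        }
        return times;
    }

    public static void main(String[] args) {
        // Random input over an alphabet of 255 symbols (placeholder workload: a plain sort).
        int[] data = new Random(42).ints(1_000_000, 0, 255).toArray();
        long[] times = measure(() -> Arrays.sort(data.clone()), 5, 10);
        System.out.println("mean [ms]: " + Arrays.stream(times).average().orElse(0) / 1e6);
    }
}
```

Discarding the first rounds gives the JIT time to compile the hot paths, so the remaining rounds measure compiled code rather than interpretation.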

(72)

[Figure: time on constant input, varying alphabet. Axes: alphabet size [symbols], time [s]. Series: BPR, DEEP-SHALLOW, DIVSUFSORT, NS, QSUFSORT, SKEW.]

(73)

[Figure: memory on random input, alphabet size = 255. Axes: input size [millions elements], memory [MB]. Series: BPR, DEEP-SHALLOW, DIVSUFSORT, NS, QSUFSORT, SKEW.]

(74)

[Figure: time on random input, alphabet size = 255. Axes: input size [millions elements], time [s]. Series: BPR, DEEP-SHALLOW, DIVSUFSORT, NS, QSUFSORT, SKEW.]

(75)

[Figure: time on random input, alphabet size = 4. Axes: input size [millions elements], time [s]. Series: BPR, DEEP-SHALLOW, DIVSUFSORT, NS, QSUFSORT, SKEW.]

(76)
(77)

[Figure: time [s] for SKEW, DIVSUFSORT, BPR, QSUFSORT.]

(78)

[Figure: time [s] for SKEW, DIVSUFSORT, BPR, QSUFSORT.]

(79)

[Figure: time [s] per JVM (sun, ibm, jrockit, harmony). Series: BPR, DIVSUFSORT, QSUFSORT, SKEW.]

(80)

Summary

Suffix arrays are worth remembering.

Fundamental, short algorithms are still being developed today.
