Generating de Bruijn Sequences with Many Factor-Rich Prefixes


Uniwersytet Warszawski

Wydział Matematyki, Informatyki i Mechaniki

Damian Repke

Student ID number: 319377

Generating de Bruijn Sequences with Many Factor-Rich Prefixes

Master's thesis

in the field of COMPUTER SCIENCE

Thesis written under the supervision of Prof. Wojciech Rytter

Institute of Informatics

December 2016

Supervisor's statement

I hereby confirm that this thesis was prepared under my supervision and that it qualifies for submission in the procedure for awarding a professional degree.

Date    Supervisor's signature

Author's statement

Aware of my legal responsibility, I hereby declare that I wrote this thesis myself and that it does not contain any content obtained in a manner inconsistent with the applicable regulations.

I also declare that this thesis has not previously been the subject of any procedure related to obtaining a professional degree at a higher education institution.

Furthermore, I declare that this version of the thesis is identical to the attached electronic version.

Date    Author's signature

Abstract

In this thesis we define a binary word to be factor-rich iff it has the largest number of distinct factors among binary words of the same length. Each linear binary de Bruijn word of rank n has length ∆_n = 2ⁿ + n − 1 and is (as a whole word) factor-rich. A binary de Bruijn word of rank n is called here abundant iff each of its prefixes of size m is factor-rich for ∆_{n−1} < m ≤ ∆_n. In this thesis we show a linear-time algorithm constructing binary abundant de Bruijn words of each rank n. This is a completely new and original result.

Keywords

De Bruijn sequences, de Bruijn graphs, Lempel’s homomorphism

Thesis domain (Socrates-Erasmus subject area codes)

11.3 Informatics

Subject classification

G.2.1 Combinatorics – Combinatorial algorithms

G.2.2 Graph Theory – Graph algorithms

Thesis title in Polish

Generowanie ciągów de Bruijna z wieloma prefiksami mającymi dużo podsłów

Chapter 1

Introduction

A de Bruijn word of rank n over the binary alphabet is a word containing, in the cyclic sense, each binary word of length n exactly once. References to sequences of this kind date back to the 19th century (the problem was raised in [10] and solved in [7]) and to the beginning of the 20th century ([12], [13]); however, they became well known only after de Bruijn's paper [1]. De Bruijn (and, independently, I. J. Good in [8]) showed a connection between these words and certain graphs, now called de Bruijn graphs. He also generalized the problem to larger alphabets (together with Tanja van Aardenne-Ehrenfest [9]).

These sequences (or their generalizations, called universal cycles) are used, e.g., in the design of flash memory, in public-key cryptography, and even in DNA sequencing (see [11], [14], [15], [16]).

In this thesis we consider binary strings which are factor-rich: they contain the largest number of distinct factors among binary strings of the same length. Every de Bruijn word is an example of a factor-rich word. A proof of the existence of factor-rich words of any given length was given in [5], using graph-theoretic properties of de Bruijn graphs.

We show here a syntactic algorithm producing a growing family of linear de Bruijn words that have many factor-rich prefixes: each prefix longer than a linear de Bruijn word of the previous rank is factor-rich. Such linear de Bruijn words are called here abundant. For the ternary alphabet it is even possible to construct de Bruijn words in which every prefix is factor-rich; however, as shown in [4], this is not possible for the binary alphabet. Surprisingly, the case of the binary alphabet is much harder than that of larger alphabets.

1.1. Structure of the thesis

Chapter 2 contains basic formal definitions and facts related to words, de Bruijn words and de Bruijn graphs. In that chapter we also introduce the concept of abundant words, which are the main topic of this thesis.

The main results of the thesis are presented in Chapter 3. The basic tools are complementary de Bruijn words and their linear-time constructions using a special algebraic operation.

Chapter 4 presents, for completeness, the previously known construction of factor-rich words given by J. Shallit. We give an explicit algorithm producing these words and we show how it works on a specific example.

In the last chapter we summarize the whole thesis and suggest directions for further work on the subject.

The results of this thesis grew out of computer experiments.

Chapter 2

Preliminaries

2.1. Words and their properties

A word over a finite non-empty set Σ is a finite sequence of elements of Σ. The set Σ is called the alphabet and its elements are called symbols. In this thesis we consider mostly the binary alphabet, that is, we assume Σ = {0, 1} unless otherwise stated. In this case symbols are also called bits.

By the length of a word w = w₁w₂ . . . w_n, where all w_i belong to the alphabet, we mean the number n ≥ 0, and we denote it by |w|. If n = 0, we call the word empty and denote it by ε. The set of all binary words of length n is denoted by Bin(n).

We denote the i-th element of a word w = w₁w₂ . . . w_n by w[i].

The negation of a word w is a word of length |w|, denoted by w̄ or ¬w, whose elements for every 1 ≤ i ≤ |w| are defined as follows:

w̄[i] = (¬w)[i] = \begin{cases} 0, & \text{if } w[i] = 1, \\ 1, & \text{otherwise.} \end{cases}

The concatenation of two words w = w₁w₂ . . . w_n and u = u₁u₂ . . . u_m is the operation of merging these two words into one, i.e.:

w · u = w₁w₂ . . . w_n u₁u₂ . . . u_m.

We say that a word u is a factor of a word w = w₁w₂ . . . w_n if and only if u is the empty word or u = w_i w_{i+1} . . . w_j for some 1 ≤ i ≤ j ≤ n. We denote this factor by w[i . . . j]. The length of this factor is k = j − i + 1. A factor of length k is called a k-factor. If i = 1, we say that u is a prefix of w. Similarly, if j = n, we say that u is a suffix of w. There is exactly one prefix and exactly one suffix of each length k ≤ n, and we denote them by pref_k(w) and suf_k(w), respectively.

We say that a word u is a wrap factor of a word w of length n if and only if u = suf_i(w) · pref_j(w) for some 1 ≤ i, j < n with i + j ≤ n.

Finally, we say that a word u is a cyclic factor of a word w if and only if u is a factor or a wrap factor of w.

We denote the set of all factors of a word w by F(w), and its subset containing all factors of length k by F_k(w). Analogously, we denote the set of all cyclic factors of w by CF(w), and its subset containing all cyclic factors of length k by CF_k(w).

Fact 2.1.1. The maximum size of F_k(w) is |w| − k + 1, and it is reached if and only if all k-factors of w are distinct. The maximum size of CF_k(w) is |w|, and it is reached if and only if all cyclic k-factors of w are distinct.

A word u is a cyclic shift of a word w of length n if and only if u = suf_k(w) · pref_n−k(w) for some 0 ≤ k ≤ n.

Fact 2.1.2. If a word u is a cyclic shift of a word w of length n then for any 1 ≤ k ≤ n:

CF_k(u) = CF_k(w).

For a word w of length n and a number 1 ≤ k ≤ n denote:

lin_k(w) = w · pref_k−1(w).

Fact 2.1.3. Assume |w| > k, then:

CF_k(w) = F_k(lin_k(w)).
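The sets F_k(w) and CF_k(w) and the operation lin_k can be sketched in a few lines of Python. This is only an illustration and not part of the original construction; the function names factors, cyclic_factors and lin are our own, words are represented as strings over "0" and "1", and we assume 1 ≤ k ≤ |w|.

```python
def factors(w: str, k: int) -> set[str]:
    """F_k(w): the set of distinct (linear) factors of w of length k."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}


def cyclic_factors(w: str, k: int) -> set[str]:
    """CF_k(w): factors and wrap factors of length k (assumes 1 <= k <= len(w))."""
    doubled = w + w
    # Every cyclic k-factor starts at one of the len(w) positions of w.
    return {doubled[i:i + k] for i in range(len(w))}


def lin(w: str, k: int) -> str:
    """lin_k(w) = w . pref_{k-1}(w)."""
    return w + w[:k - 1]


w = "00010111"
# Fact 2.1.3 on a small example: CF_k(w) = F_k(lin_k(w)).
print(cyclic_factors(w, 3) == factors(lin(w, 3), 3))
```

Doubling the word is a convenience of this sketch: it turns every wrap factor into an ordinary factor of w · w.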

For |w|, |u| ≥ k define:

SP_k(w, u) = {suf_i(w) · pref_{k−i}(u) : i = 1, 2, . . . , k − 1}.

Observe that SP_k(w, w) is the set of wrap k-factors of w.

Let us consider how the set of cyclic k-factors changes when we concatenate two words.

Fact 2.1.4. Assume |w|, |u| > k, then:

CF_k(w · u) = F_k(w) ∪ SP_k(w, u) ∪ F_k(u) ∪ SP_k(u, w).

2.2. De Bruijn words

Definition 2.1. A binary cyclic de Bruijn word of rank n is a binary word x containing in the cyclic sense each binary word of length n exactly once.

In other words |CF_n(x)| = |x| = 2ⁿ.

Definition 2.2. A binary linear de Bruijn word of rank n is a binary word x containing each binary n-string exactly once as a linear (standard) factor.

The length of a binary linear de Bruijn word of rank n is ∆_n, where ∆_k = 2^k+ k − 1.


Example 2.2.1. The following table presents sample cyclic de Bruijn words of rank n.

n    cyclic de Bruijn word of rank n
1    01
2    1100
3    00010111
3    01110100
4    0000100110101111
4    1111011001010000
5    11010100111011000101111100100000
6    0000001000011000101001111010001110010010110111011001101010111111

By DB_n we denote the set of all binary cyclic de Bruijn words of rank n.

For a cyclic de Bruijn word x of rank n denote by lin(x) the corresponding linear de Bruijn word, i.e.:

lin(x) = lin_n(x).
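Definitions 2.1 and 2.2 translate directly into small membership tests. The following Python sketch is our own illustration (the names is_cyclic_de_bruijn, is_linear_de_bruijn and lin are hypothetical) and simply counts distinct n-factors.

```python
def is_cyclic_de_bruijn(x: str, n: int) -> bool:
    """True iff x has length 2^n and all its cyclic n-factors are distinct."""
    if len(x) != 2 ** n:
        return False
    doubled = x + x
    return len({doubled[i:i + n] for i in range(len(x))}) == 2 ** n


def is_linear_de_bruijn(x: str, n: int) -> bool:
    """True iff x has length 2^n + n - 1 and all its n-factors are distinct."""
    if len(x) != 2 ** n + n - 1:
        return False
    return len({x[i:i + n] for i in range(2 ** n)}) == 2 ** n


def lin(x: str, n: int) -> str:
    """lin(x) = lin_n(x) = x . pref_{n-1}(x) for a cyclic word x of rank n."""
    return x + x[:n - 1]


x = "0000100110101111"  # a rank-4 word from Example 2.2.1
print(is_cyclic_de_bruijn(x, 4), is_linear_de_bruijn(lin(x, 4), 4))
```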

There are many different algorithms producing de Bruijn words (not only binary) of any rank n ([1], [8], [17], [18]). The standard approach is to find Eulerian circuits in some special graphs called de Bruijn graphs.

2.3. Abundant de Bruijn words

Definition 2.3. We say that a binary word of length n is factor-rich if and only if it has the maximum possible number of distinct factors among the binary words of length n.

Definition 2.4. We say that a linear de Bruijn word of rank n is abundant if and only if each of its prefixes of size m is factor-rich for ∆_{n−1} + 1 ≤ m ≤ ∆_n.

The lower bound ∆_{n−1} + 1 is tight in the following sense (this is a simple reformulation of Observation 4 from [4]):

Fact 2.3.1. For n > 2 there is no linear de Bruijn word of rank n all of whose prefixes of size m are factor-rich for ∆_{n−1} ≤ m ≤ ∆_n.

We now recall one of the lemmas from Shallit's paper:

Lemma 2.1. [5] A binary word w of size m, where ∆_{n−1} < m ≤ ∆_n, is factor-rich if and only if it contains, as a factor, each binary word of size n − 1 and all its factors of size n are distinct, in other words:

|F_{n−1}(w)| = 2ⁿ⁻¹ ∧ |F_n(w)| = m − n + 1.

An immediate consequence is the following fact:

Corollary 2.1. A de Bruijn word w of rank n is abundant if and only if its prefix of size ∆_{n−1} + 1 contains, as a factor, each binary word of size n − 1.
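The criterion of Lemma 2.1 gives a simple way to test factor-richness without enumerating all words of a given length. The sketch below is an illustration under that assumption: is_factor_rich relies on Lemma 2.1 rather than on Definition 2.3 directly, is_abundant follows Definition 2.4 literally, and all function names are our own.

```python
def delta(n: int) -> int:
    """Length of a linear de Bruijn word of rank n: 2^n + n - 1."""
    return 2 ** n + n - 1


def factors(w: str, k: int) -> set[str]:
    return {w[i:i + k] for i in range(len(w) - k + 1)}


def is_factor_rich(w: str) -> bool:
    """Lemma 2.1 criterion for a non-empty binary word w."""
    m = len(w)
    n = 1
    while delta(n) < m:  # find n with delta(n-1) < m <= delta(n)
        n += 1
    return (len(factors(w, n - 1)) == 2 ** (n - 1)
            and len(factors(w, n)) == m - n + 1)


def is_abundant(w: str, n: int) -> bool:
    """Definition 2.4: every prefix of size delta(n-1)+1 .. delta(n) is factor-rich."""
    if len(w) != delta(n):
        return False
    return all(is_factor_rich(w[:m]) for m in range(delta(n - 1) + 1, delta(n) + 1))


print(is_abundant("0011110100001011001", 4))
```

For the word used here, the prefix of size ∆₃ = 10 fails the test (the factor 111 occurs twice in it), which is consistent with Fact 2.3.1.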


2.4. De Bruijn graphs

Definition 2.5. A de Bruijn graph of rank n ≥ 1 is a graph G_n = (V_n, E_n), where:

V_n = Bin(n),

E_n = {v₁ → v₂ : v₁, v₂ ∈ V_n ∧ suf_{n−1}(v₁) = pref_{n−1}(v₂)}.

The edges of G_n can be considered as binary words of length n + 1, i.e. v₁ → v₂ = v₁ · v₂[n].

When convenient, the edges v₁ → v₂ are labeled by v₂[n].

Fact 2.4.1. The in-degree of any vertex v is 2. The set of ingoing edges to v is:

{0v, 1v}.

The out-degree of any vertex v is also 2. The set of outgoing edges from v is:

{v0, v1}.

Fact 2.4.2. Any de Bruijn graph of any rank n is strongly connected.

Fact 2.4.3. Any de Bruijn graph of any rank n is Eulerian.

This fact has important consequences. Namely, the edges of G_n represent all binary words of length n + 1, and an Eulerian circuit visits every edge exactly once. Moreover, any two consecutive edges e₁, e₂ on such a circuit satisfy:

suf_n(e₁) = pref_n(e₂).

It means that we can create a de Bruijn word of rank n + 1 by concatenating the labels of consecutive edges on an Eulerian circuit.

Example 2.4.1. The following figure presents the de Bruijn graph of rank 3.

[Figure 2.1: G₃ with labeled edges.]

In this graph we have (e.g.) the following Eulerian circuit:

000 → 000 → 001 → 010 → 100 → 001 → 011 → 110 →

101 → 010 → 101 → 011 → 111 → 111 → 110 → 100 → 000.

When we concatenate labels from all consecutive edges, we obtain the following de Bruijn word of rank 4:

0100110101111000.
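The standard approach described above can be sketched as follows. This is a generic illustration using Hierholzer's algorithm for Eulerian circuits, not the specific circuit shown in the example; the function name de_bruijn_from_euler is our own, and the circuit it finds (hence the resulting word) depends on the order in which edges are tried.

```python
def de_bruijn_from_euler(n: int) -> str:
    """Cyclic de Bruijn word of rank n + 1 read off an Eulerian circuit of G_n (n >= 1)."""
    # Labels of the not yet used outgoing edges of every vertex of G_n.
    unused = {format(v, f"0{n}b"): ["1", "0"] for v in range(2 ** n)}
    stack = ["0" * n]
    circuit = []
    while stack:
        v = stack[-1]
        if unused[v]:
            # Follow an unused edge v -> suf_{n-1}(v) . b.
            stack.append(v[1:] + unused[v].pop())
        else:
            # Dead end: v is finished, emit it (Hierholzer's algorithm).
            circuit.append(stack.pop())
    circuit.reverse()
    # Each edge is labeled by the last symbol of its target vertex.
    return "".join(v[-1] for v in circuit[1:])


word = de_bruijn_from_euler(3)
print(word, len(word))
```

The word has length 2ⁿ⁺¹ because G_n has exactly 2ⁿ⁺¹ edges and each edge contributes one label.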

2.4.1. From G_n to G_{n+1}

Interestingly, Eulerian circuits in G_n correspond to Hamiltonian cycles in G_{n+1}. It means that we can use the circuit from the example to construct a Hamiltonian cycle in the de Bruijn graph of rank 4.

In our example we used the following edges:

0000 → 0001 → 0010 → 0100 → 1001 → 0011 → 0110 → 1101 →

1010 → 0101 → 1011 → 0111 → 1111 → 1110 → 1100 → 1000.

We can treat these edges as vertices in G₄. We call this operation EdgesToVertices.

More formally:

Definition 2.6. For any circuit C = (v₁, v₂, . . . , v_k) ⊆ G_n we define EdgesToVertices(C) as the cycle C′ = ((v₁, v₂), (v₂, v₃), . . . , (v_{k−1}, v_k)) ⊆ G_{n+1}.

The following figure presents the de Bruijn graph of rank 4 with a marked Hamiltonian cycle which corresponds to the earlier Eulerian circuit in G₃.

[Figure 2.2: G₄ with marked Hamiltonian cycle.]

Chapter 3

Syntactic construction of abundant words

3.1. Two abstract operations producing de Bruijn words

We introduce two rather strange operations on words. Both operations work on binary words of size 2ⁿ and produce words in DB_n+1.

3.1.1. The abstract operation ⊕.

Denote by Sync(w, γ) the cyclic shift of the word w whose suffix equals γ (assuming that the word γ occurs exactly once as a cyclic factor of w).

Example 3.1.1. Sync(001011101, 011) = 101001011.

Definition 3.1. For two binary cyclic de Bruijn words u, w of rank n define:

u ⊕ w = u · Sync(w, suf_n(u)).

This operation concatenates two cyclic de Bruijn words of rank n, first shifting the second word so that both words have a common suffix of length n. Under certain conditions this operation produces a de Bruijn word of rank n + 1. To prove this, we first show a useful fact about words.

Fact 3.1.1. If w, u have a common suffix of length m, where 0 ≤ m ≤ n, and a common prefix of length n − m − 1, then:

CF_n(w · u) = CF_n(w) ∪ CF_n(u).

Proof. By Fact 2.1.4 we know that

CF_n(w · u) = F_n(w) ∪ SP_n(w, u) ∪ F_n(u) ∪ SP_n(u, w).

However, w and u have a common prefix and a common suffix, so

SP_n(w, u) = {suf_i(u) · pref_{n−i}(u) : i = 1, 2, . . . , m} ∪ {suf_i(w) · pref_{n−i}(w) : i = m + 1, m + 2, . . . , n − 1}

and

SP_n(u, w) = {suf_i(w) · pref_{n−i}(w) : i = 1, 2, . . . , m} ∪ {suf_i(u) · pref_{n−i}(u) : i = m + 1, m + 2, . . . , n − 1}.

Now, the union of these two sets is

SP_n(w, u) ∪ SP_n(u, w) = SP_n(w, w) ∪ SP_n(u, u).

Finally:

CF_n(w · u) = F_n(w) ∪ SP_n(w, w) ∪ F_n(u) ∪ SP_n(u, u) = CF_n(w) ∪ CF_n(u).

Lemma 3.1. Let u, w be cyclic de Bruijn words of rank n, such that:

CF_{n+1}(u) ∪ CF_{n+1}(w) = Bin(n + 1).

Then u ⊕ w is a cyclic de Bruijn word of rank n + 1.

Proof. It follows directly from the Fact 3.1.1.
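The operations Sync and ⊕ can be sketched in Python as follows. The names sync and oplus are our own, and the sketch assumes, as in the definition, that γ is non-empty and occurs exactly once as a cyclic factor of w.

```python
def sync(w: str, gamma: str) -> str:
    """Cyclic shift of w that ends with gamma (gamma occurs once cyclically in w)."""
    i = (w + w).find(gamma)          # start of the cyclic occurrence of gamma
    end = (i + len(gamma)) % len(w)  # position right after that occurrence
    return w[end:] + w[:end]


def oplus(u: str, w: str, n: int) -> str:
    """u (+) w = u . Sync(w, suf_n(u)) for cyclic de Bruijn words u, w of rank n >= 1."""
    return u + sync(w, u[-n:])


print(sync("001011101", "011"))   # Example 3.1.1
print(oplus("0010", "1101", 2))
```

In the second call the words 0010 and 1101 are not de Bruijn words themselves; they only serve to show how the second argument is shifted before the concatenation.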

3.1.2. The abstract operation ⊗.

Assume that a word w contains a single group p^k (p ∈ {0, 1}), where k is the maximal number of consecutive p's. Define two operations:

• add_p(w) is the word obtained by adding a single bit p to the group p^k,

• rem_p(w) is the word obtained by removing a single bit p from the group p^k.

Example 3.1.2. add₁(0110) = 01110, rem₀(00010111) = 0010111.

Definition 3.2. For two binary cyclic de Bruijn words u, w of rank n, both starting with 0ⁿ, define:

u ⊗ w = rem₀(add₁(u)) · rem₁(add₀(w)).

In other words, the operation ⊗ moves one zero from the group 0ⁿ of the first word to the group 0ⁿ of the second word, and a single one from the group 1ⁿ of the second word to the group 1ⁿ of the first word.

u = 00011101, w = 00010111

u ⊗ w = 00111101 · 00001011

Figure 3.1: Graphical illustration of u ⊗ w.
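A possible Python sketch of add_p, rem_p and ⊗ is given below. The function names are our own, and the sketch assumes that the longest group of p's is unique and does not wrap around the end of the word, which holds for words starting with 0ⁿ.

```python
def longest_run(w: str, p: str) -> int:
    """Length k of the longest group p^k in w (runs are taken linearly)."""
    other = "1" if p == "0" else "0"
    return max(len(run) for run in w.split(other))


def add(w: str, p: str) -> str:
    """add_p(w): insert one extra bit p into the longest group of p's."""
    i = w.find(p * longest_run(w, p))
    return w[:i] + p + w[i:]


def rem(w: str, p: str) -> str:
    """rem_p(w): delete one bit p from the longest group of p's."""
    i = w.find(p * longest_run(w, p))
    return w[:i] + w[i + 1:]


def otimes(u: str, w: str) -> str:
    """u (x) w = rem_0(add_1(u)) . rem_1(add_0(w))."""
    return rem(add(u, "1"), "0") + rem(add(w, "0"), "1")


print(otimes("00011101", "00010111"))
```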

3.2. Complementary de Bruijn words

At the beginning of this section we define an interesting relation between two de Bruijn words of the same rank. The words in this relation will be used to produce abundant de Bruijn words.

Definition 3.3. Two cyclic de Bruijn words of rank n are said to be complementary if the set of their common cyclic factors of length n + 1 has exactly four elements.

Observe that every two words in DB_n with n ≥ 2 have at least four common cyclic (n + 1)-factors, since every de Bruijn word contains the words 01ⁿ0 and 10ⁿ1. It means that every two de Bruijn words of rank n have these (n + 1)-factors in common:

Common(n) = {0ⁿ1, 1ⁿ0, 01ⁿ, 10ⁿ}.

Moreover, two complementary words in DB_n together contain all cyclic factors of length n + 1 except:

Unused(n) = {0ⁿ⁺¹, 1ⁿ⁺¹, 01ⁿ⁻¹0, 10ⁿ⁻¹1}.

Similarly, no word in DB_n contains any of these four unused (n + 1)-factors.

Example 3.2.1. The following table presents the positions of all cyclic 4-factors in the two de Bruijn words 00011101 and 00010111 (the entry - means that the factor does not occur).

factor   00011101   00010111
0000     -          -
0001     1          1
0010     -          2
0011     2          -
0100     7          -
0101     -          3
0110     -          -
0111     3          5
1000     8          8
1001     -          -
1010     6          -
1011     -          4
1100     -          7
1101     5          -
1110     4          6
1111     -          -

These two words are complementary. The factors 0001, 0111, 1000 and 1110 are common in these words. The factors 0000, 0110, 1001 and 1111 occur in none of them.
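Definition 3.3 can be checked mechanically by intersecting the sets of cyclic (n + 1)-factors. The following Python sketch is our own illustration; the names cyclic_factors and are_complementary are hypothetical.

```python
def cyclic_factors(w: str, k: int) -> set[str]:
    """CF_k(w) for 1 <= k <= len(w)."""
    doubled = w + w
    return {doubled[i:i + k] for i in range(len(w))}


def are_complementary(u: str, w: str, n: int) -> bool:
    """True iff u and w (cyclic de Bruijn words of rank n) share exactly four cyclic (n+1)-factors."""
    return len(cyclic_factors(u, n + 1) & cyclic_factors(w, n + 1)) == 4


u, w = "00011101", "00010111"
print(are_complementary(u, w, 3))
print(sorted(cyclic_factors(u, 4) & cyclic_factors(w, 4)))
```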

Lemma 3.2. Assume u, w are complementary de Bruijn words of rank n both starting with 0ⁿ. Then lin(u ⊗ w) is a binary abundant linear de Bruijn word of rank n + 1.

Proof. Let us introduce the auxiliary words:

u′ = rem₀(add₁(u)), w′ = rem₁(add₀(w)).

First we show the following fact:

Claim 3.1. u ⊗ w ∈ DB_{n+1}.

Note that both u′ and w′ start with 0ⁿ⁻¹ and end with 1. It means that they have a common prefix of length n − 1 and a common suffix of length 1, so by Fact 3.1.1 we have:

CF_{n+1}(u ⊗ w) = CF_{n+1}(u′) ∪ CF_{n+1}(w′).

Consider how adding and removing bits (p ∈ {0, 1}) changes the set of cyclic (n + 1)-factors of words u, w ∈ DB_n:

• Operation add_p just adds the extra (n + 1)-factor pⁿ⁺¹.

• Operation rem_p removes the factors p̄pⁿ and pⁿp̄ and adds the new factor p̄pⁿ⁻¹p̄.

Hence:

CF_{n+1}(u′) = (CF_{n+1}(u) ∪ {10ⁿ⁻¹1, 1ⁿ⁺¹}) − {10ⁿ, 0ⁿ1},

CF_{n+1}(w′) = (CF_{n+1}(w) ∪ {01ⁿ⁻¹0, 0ⁿ⁺¹}) − {01ⁿ, 1ⁿ0}.

Notice that every removed factor belongs to Common(n), so each of them still belongs to the union of CF_{n+1}(u′) and CF_{n+1}(w′). Furthermore, every element of Unused(n) is added, so finally we have:

CF_{n+1}(u ⊗ w) = CF_{n+1}(u) ∪ CF_{n+1}(w) ∪ Unused(n) = Bin(n + 1).

Hence u ⊗ w ∈ DB_{n+1}, which proves the claim.

We now know that the operation ⊗ produces a de Bruijn word of higher rank. To prove that this de Bruijn word is abundant, by Corollary 2.1, it is enough to show:

Claim 3.2. The prefix of lin(u ⊗ w) of length ∆_n + 1 contains, as a factor, each binary word of size n.

Notice that this prefix is of the form u′ · 0ⁿ and its set of n-factors is the same as the set of cyclic n-factors of the word u′ · 0 (by Fact 2.1.3) or of any of its cyclic shifts, e.g. 0 · u′ = add₁(u) (by Fact 2.1.2).

The word u is a de Bruijn word of rank n, so it contains all n-factors. Obviously, adding a single one to the group 1ⁿ does not remove any of the n-factors, so:

F_n(u′ · 0ⁿ) = CF_n(add₁(u)) = Bin(n),

which proves the claim and the whole lemma.

Example 3.2.2. The cyclic de Bruijn words u = 00011101, w = 00010111 are complementary. Then the word:

lin(u ⊗ w) = 00111101 00001011 001

is an abundant de Bruijn word of rank 4. It has length ∆₄ = 19. Its nine longest prefixes are factor-rich. The following table presents the positions of all 4-factors in the word lin(u ⊗ w):

factor   position
0000     9
0001     10
0010     11
0011     1
0100     7
0101     12
0110     14
0111     2
1000     8
1001     16
1010     6
1011     13
1100     15
1101     5
1110     4
1111     3

It shows that this word is a linear de Bruijn word. The following table presents the positions of the first occurrences of all 3-factors:

factor     000  001  010  011  100  101  110  111
position   9    1    7    2    8    6    5    3

It shows that the prefix of length ∆₃ + 1 = 11 of this word contains every binary word of length 3, so by Lemma 2.1 it is factor-rich. Hence, by Corollary 2.1, the resulting word is an abundant de Bruijn word.

3.3. Cumulative sum in Z₂

There is a rather strange algorithm producing de Bruijn words using a special algebraic operation Ψ(w) on words, which is the cumulative sum modulo 2.

Definition 3.4. For any binary word w the result of the operation Ψ is a binary word of the same length, where for every 1 ≤ i ≤ |w|:

Ψ(w)[i] = \left( \sum_{j=1}^{i} w[j] \right) \bmod 2.

Equivalently:

Ψ(w)[i] = \begin{cases} w[i], & \text{for } i = 1, \\ (Ψ(w)[i − 1] + w[i]) \bmod 2, & \text{otherwise.} \end{cases}

The operation Ψ was implicitly used in Algorithm R, presented in the fourth volume of Knuth's book, see [2]. It also implicitly appears in Lempel's algorithm for de Bruijn words, which uses a homomorphism of de Bruijn graphs, see [3]. In fact, the recursive algorithm presented later in this chapter can be viewed as a syntactic version of Lempel's graph-theoretic algorithm.

Example 3.3.1. The following table presents the results of the operation Ψ for some sample binary words.

w Ψ(w)

0 0

1 1

01 01

011 010

111 101

0101 0110

0111000 0101111

1000000 1111111

110010111 100011010

0011110100001011 0010100111110010

00011010111110010000010100111011 00010011010100011111100111010010
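The operation Ψ is a running sum modulo 2, so it can be sketched with a single pass over the word. The function name psi below is our own; the sketch uses itertools.accumulate from the Python standard library.

```python
from itertools import accumulate


def psi(w: str) -> str:
    """Psi(w): the i-th symbol is the sum of the first i symbols of w modulo 2."""
    return "".join(str(s % 2) for s in accumulate(int(c) for c in w))


for w in ["01", "011", "111", "0101", "0111000", "110010111"]:
    print(w, psi(w))
```

The loop reproduces some rows of the table in Example 3.3.1.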

This operation has many interesting properties.

Fact 3.3.1. If a word w ≠ ε has an even number of 1's then Ψ(w) ends with 0, otherwise it ends with 1.

From this fact we can derive another one.

Fact 3.3.2. For any binary words w, u:

Ψ(w · u) = \begin{cases} Ψ(w) · Ψ(u), & \text{if } w \text{ has an even number of 1's,} \\ Ψ(w) · ¬Ψ(u), & \text{otherwise.} \end{cases}

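There is also a quite curious formula for the negation of a word. However, it will not be used in further work.

Fact 3.3.3. The result of the operation Ψ(w̄) is a binary word of length |w|, where for every 1 ≤ i ≤ |w|:

Ψ(w̄)[i] = \begin{cases} Ψ(w)[i], & \text{for even } i, \\ ¬Ψ(w)[i], & \text{otherwise.} \end{cases}

Indeed, the first i symbols of w̄ contain i minus the number of 1's among the first i symbols of w, so the parity changes exactly for odd i.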

We also introduce an operation τ which, in a certain sense, is the inverse of Ψ.

Definition 3.5. For an (n + 1)-string w define τ(w) as the n-string v such that Ψ(w[1] · v) = w.

Equivalently, for every 1 ≤ i ≤ n:

τ(w)[i] = \begin{cases} 0, & \text{if } w[i] = w[i + 1], \\ 1, & \text{otherwise.} \end{cases}

Example 3.3.2. The following table presents the results of the operation τ for some sample binary words.

w τ (w)

0 ε

1 ε

01 1

010 11

101 11

0110 101

0101111 111000

1111111 000000

100011010 10010111

0010100111110010 011110100001011

00010011010100011111100111010010 0011010111110010000010100111011
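The operation τ compares neighbouring symbols, which makes the relation to Ψ easy to test. The sketch below is our own illustration; the names tau, psi and neg are hypothetical, and psi repeats Definition 3.4 so that the block is self-contained.

```python
from itertools import accumulate


def psi(w: str) -> str:
    """Cumulative sum modulo 2 (Definition 3.4)."""
    return "".join(str(s % 2) for s in accumulate(int(c) for c in w))


def tau(w: str) -> str:
    """tau(w)[i] is 0 if w[i] = w[i+1] and 1 otherwise; the result is one symbol shorter."""
    return "".join("0" if a == b else "1" for a, b in zip(w, w[1:]))


def neg(w: str) -> str:
    """Bitwise negation of a binary word."""
    return w.translate(str.maketrans("01", "10"))


w = "100011010"
print(tau(w))                    # a row of Example 3.3.2
print(psi(w[0] + tau(w)) == w)   # Definition 3.5: Psi(w[1] . tau(w)) = w
print(tau(w) == tau(neg(w)))     # the negation does not change tau
```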

Fact 3.3.4. For any binary word w:

τ (w) = τ ( ¯w).

Fact 3.3.5. For any word w of length n and for any i, k > 0 such that i + k ≤ n:

w[i + 1 . . . i + k] = τ (Ψ(w)[i . . . i + k]).

Proof. From Fact 3.3.2 we know that:

Ψ(w)[i . . . i + k] = Ψ(w[i . . . i + k]) ∨ Ψ(w)[i . . . i + k] = ¬Ψ(w[i . . . i + k]).

However, by Fact 3.3.4 we do not have to worry about the negation and we can write:

τ (Ψ(w)[i . . . i + k]) = τ (Ψ(w[i . . . i + k])).


By Definitions 3.4 and 3.5:

Ψ(w[i . . . i + k])[1] = w[i], τ (Ψ(w[i . . . i + k])) = v, where:

Ψ(w[i . . . i + k]) = Ψ(w[i] · v).

The function Ψ is injective, so we can omit it:

w[i + 1 . . . i + k] = v = τ (Ψ(w[i . . . i + k])) = τ (Ψ(w)[i . . . i + k]).

This fact has some useful consequences.

Fact 3.3.6. If a word u of length n > k has an even number of 1's then:

CF_k(u) = {τ (w) : w ∈ CF_k+1(Ψ(u))}.

Proof. By Fact 3.3.2 we know that:

lin_k+1(Ψ(u)) = Ψ(lin_k+1(u)).

After putting lin_k+1(u) into Fact 3.3.5 and then summing up for i = 1, 2 . . . , n we obtain the content of the fact.

3.4. Recursive construction of de Bruijn words

In this section we introduce an operation which produces de Bruijn sequences of rank n + 1 from the input de Bruijn sequence of rank n.

Definition 3.6. For a binary cyclic de Bruijn word x of rank n ≥ 2 ending with 1ⁿ, denote:

Next(x) = Ψ(x) ⊕ ¬Ψ(x).

Example 3.4.1. The following table presents the results of the operation Next for some sample binary de Bruijn words.

w                   Next(w)
0011                00101110
00010111            0001101011110010
01000111            0111101011000010
0101100001001111    01101111100010101100100000111010
0010000110101111    00111110110010101110000010011010
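The operation Next combines Ψ, negation and Sync. A minimal Python sketch is given below; the names psi, neg, sync and next_word are our own, and the rank n is passed explicitly because it cannot be read off the string without computing a logarithm.

```python
from itertools import accumulate


def psi(w: str) -> str:
    """Cumulative sum modulo 2 (Definition 3.4)."""
    return "".join(str(s % 2) for s in accumulate(int(c) for c in w))


def neg(w: str) -> str:
    """Bitwise negation of a binary word."""
    return w.translate(str.maketrans("01", "10"))


def sync(w: str, gamma: str) -> str:
    """Cyclic shift of w that ends with gamma (gamma occurs once cyclically in w)."""
    end = ((w + w).find(gamma) + len(gamma)) % len(w)
    return w[end:] + w[:end]


def next_word(x: str, n: int) -> str:
    """Next(x) = Psi(x) (+) not Psi(x) for a cyclic de Bruijn word x of rank n ending with 1^n."""
    p = psi(x)
    return p + sync(neg(p), p[-n:])


print(next_word("0011", 2))
print(next_word("00010111", 3))
```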


Theorem 3.1. Assume that x is a cyclic de Bruijn word of rank n ≥ 2. Then Next(x) is a cyclic de Bruijn word of rank n + 1.

Proof. It is enough to show that words Ψ(x) and its negation satisfy the assumptions of Lemma 3.1.

Claim 3.3. If x is a cyclic de Bruijn word of rank n then:

CF_n+1(Ψ(x)) ∪ CF_n+1(¬Ψ(x)) = Bin(n + 1).

The word x is a de Bruijn word of rank n, so we know that:

|CF_n(x)| = 2ⁿ.

By Fact 3.3.6 the cardinality of the set CF_{n+1}(Ψ(x)) is also 2ⁿ.

Moreover, τ(w) = τ(w̄), so if CF_{n+1}(Ψ(x)) contains a word w then it does not contain the word w̄, because otherwise the size of CF_n(x) would be smaller than 2ⁿ. It means that the sets of cyclic (n + 1)-factors of the word Ψ(x) and of its negation are disjoint.

Therefore, the union of these sets is of size 2ⁿ⁺¹.

3.5. Proof of complementarity

From the previous section we know that the operation Next produces de Bruijn words of higher rank. However, this operation also has a stronger property.

For any word w denote by w^{•◦} the word that differs from w only in the first symbol. Similarly, denote by w^{◦•} the word that differs from w only in the last symbol. Finally, denote by w^{••} the word that differs from w only in the first and the last symbol.

By α_n denote the word of length n that ends with 1 and does not contain 00 or 11 as a factor. Moreover, by β_n denote the word ¬α_n.

Example 3.5.1.

α₅ = 10101     α₅^{•◦} = 00101     α₅^{◦•} = 10100     α₅^{••} = 00100
β₅ = 01010     β₅^{•◦} = 11010     β₅^{◦•} = 01011     β₅^{••} = 11011
α₆ = 010101    α₆^{•◦} = 110101    α₆^{◦•} = 010100    α₆^{••} = 110100
β₆ = 101010    β₆^{•◦} = 001010    β₆^{◦•} = 101011    β₆^{••} = 001011

Firstly, we show the following lemma.

Lemma 3.3. For any de Bruijn word z of rank n that ends with 1ⁿ:

CF_{n+2}(Next(z)) = ((CF_{n+2}(Ψ(z)) ∪ CF_{n+2}(¬Ψ(z))) − {α_{n+2}^{•◦}, α_{n+2}^{◦•}}) ∪ {α_{n+2}, α_{n+2}^{••}}.

Proof. Let:

p = Ψ(z).

Any cyclic de Bruijn sequence z of rank n ≥ 2 has an even number of 1's. It means that p ends with 0. Moreover, if z ends with 01ⁿ then p starts with zero and:

suf_{n+2}(p) = β_{n+2}^{•◦}.

Analogously:

suf_{n+2}(p̄) = α_{n+2}^{•◦}.

The operation ⊕ used in Next shifts the word p̄ to match the suffix of length n of the word p. In our case we obtain the word:

q = Sync(p̄, β_n),

which means we only have to shift the last symbol of p̄ (which is 1) to the front.

Consider the set of cyclic (n + 2)-factors of the word Next(z). By Fact 2.1.2:

CF_{n+2}(q) = CF_{n+2}(p̄).

Now, we can reformulate the lemma as:

CF_{n+2}(p · q) = ((CF_{n+2}(p) ∪ CF_{n+2}(q)) − {α_{n+2}^{•◦}, α_{n+2}^{◦•}}) ∪ {α_{n+2}, α_{n+2}^{••}}.

Define:

P = {suf_i(q) · pref_{n−i+2}(p) : i = 1, 2, . . . , n},

Q = {suf_i(p) · pref_{n−i+2}(q) : i = 1, 2, . . . , n}.

However, we know that suf_n(p) = suf_n(q), so:

P = {suf_i(p) · pref_{n−i+2}(p) : i = 1, 2, . . . , n},

Q = {suf_i(q) · pref_{n−i+2}(q) : i = 1, 2, . . . , n}.

Observe that:

α_{n+2} = suf_{n+1}(p) · pref₁(q),

α_{n+2}^{••} = suf_{n+1}(q) · pref₁(p),

α_{n+2}^{◦•} = suf_{n+1}(p) · pref₁(p),

α_{n+2}^{•◦} = suf_{n+1}(q) · pref₁(q).

By Fact 2.1.4:

CF_{n+2}(p · q) = F_{n+2}(p) ∪ (Q ∪ {α_{n+2}}) ∪ F_{n+2}(q) ∪ (P ∪ {α_{n+2}^{••}}).

We also know that:

CF_{n+2}(p) ∪ CF_{n+2}(q) = (F_{n+2}(p) ∪ (P ∪ {α_{n+2}^{◦•}})) ∪ (F_{n+2}(q) ∪ (Q ∪ {α_{n+2}^{•◦}})).

We know that all cyclic (n + 2)-factors in p and q are distinct, so we can write:

F_{n+2}(p) ∪ F_{n+2}(q) ∪ P ∪ Q = (CF_{n+2}(p) ∪ CF_{n+2}(q)) − {α_{n+2}^{•◦}, α_{n+2}^{◦•}}.

Finally:

CF_{n+2}(Next(z)) = ((CF_{n+2}(p) ∪ CF_{n+2}(q)) − {α_{n+2}^{•◦}, α_{n+2}^{◦•}}) ∪ {α_{n+2}, α_{n+2}^{••}}.

This completes the proof.

Let us examine the following figures to gain a better insight into the sets P and Q.

Example 3.5.2. Let z = 01100101011100000100110100011111. Then:

p = 01000110010111111000100111101010,

q = 11011100110100000011101100001010.

[Figure 3.2: All elements of the sets Q and P presented on the cyclic word pq.]

[Figure 3.3: All elements of the set P presented as wrap factors of p.]

[Figure 3.4: All elements of the set Q presented as wrap factors of q.]

Now we show the key theorem of the thesis.

Theorem 3.2. Assume that the words x, y are complementary de Bruijn words of rank n ≥ 2, and each of them ends with 1ⁿ. Then Next(x) and ¬Next(y) are complementary de Bruijn words of rank n + 1.

Proof. From the previous lemma we know that:

CF_{n+2}(Next(x)) = ((CF_{n+2}(Ψ(x)) ∪ CF_{n+2}(¬Ψ(x))) − {α_{n+2}^{•◦}, α_{n+2}^{◦•}}) ∪ {α_{n+2}, α_{n+2}^{••}},

CF_{n+2}(¬Next(y)) = ((CF_{n+2}(Ψ(y)) ∪ CF_{n+2}(¬Ψ(y))) − {β_{n+2}^{•◦}, β_{n+2}^{◦•}}) ∪ {β_{n+2}, β_{n+2}^{••}}.

Now assume that Next(x) and ¬Next(y) contain the same cyclic (n + 2)-factor w.

Observe that:

τ(α_{n+2}) = τ(β_{n+2}) = 1ⁿ⁺¹, τ(α_{n+2}^{••}) = τ(β_{n+2}^{••}) = 01ⁿ⁻¹0.

If w = α_{n+2} or w = α_{n+2}^{••} then by Fact 3.3.6 the word y has to contain 1ⁿ⁺¹ or 01ⁿ⁻¹0. However, y is a de Bruijn word of rank n, so it does not contain any word from Unused(n). It means that w ≠ α_{n+2} and w ≠ α_{n+2}^{••}. Analogously, w ≠ β_{n+2} and w ≠ β_{n+2}^{••}.

It means that both x and y contain the same cyclic factor τ(w) of length n + 1. But x and y are complementary, so the set of candidates for τ(w) is Common(n). With these candidates for τ(w) we have the following candidates for w:

τ⁻¹({1ⁿ0}) = {α_{n+2}^{◦•}, β_{n+2}^{◦•}},    τ⁻¹({0ⁿ1}) = {0ⁿ⁺¹1, 1ⁿ⁺¹0},

τ⁻¹({01ⁿ}) = {α_{n+2}^{•◦}, β_{n+2}^{•◦}},    τ⁻¹({10ⁿ}) = {01ⁿ⁺¹, 10ⁿ⁺¹}.

In other words:

τ⁻¹(Common(n)) = Common(n + 1) ∪ {α_{n+2}^{◦•}, β_{n+2}^{◦•}, α_{n+2}^{•◦}, β_{n+2}^{•◦}}.

We know that α_{n+2}^{•◦} and α_{n+2}^{◦•} are not cyclic factors of Next(x). Moreover, β_{n+2}^{•◦} and β_{n+2}^{◦•} are not cyclic factors of ¬Next(y).

The set of remaining candidates for w is Common(n + 1), so it does not contradict the complementarity of Next(x) and ¬Next(y).

This exhausts the list of candidates for w, which completes the proof.

Let us look at the operation Next from the graph-theoretic perspective.

Example 3.5.3. Let x = 00010111 and y = 01000111. These two words are complementary de Bruijn words of rank 3 and can be represented as cycles in the de Bruijn graph. For these words we have:

Ψ(x) = 00011010, ¬Ψ(x) = 11100101, Ψ(y) = 01111010, ¬Ψ(y) = 10000101.

[Figure 3.5: Illustration of the words Ψ(x), ¬Ψ(x), Ψ(y) and ¬Ψ(y) as cycles in the de Bruijn graph.]

The bold edges are the only common edges between Ψ(x) ∪ ¬Ψ(x) and Ψ(y) ∪ ¬Ψ(y).

There are always exactly 8 such edges.

Let us follow the operation Next(x).

We know that the word Ψ(x) ends with 1010 and after this word we put the word ¬Ψ(x) in our operation. It means that we have to go to the vertex 0101, visit all vertices of ¬Ψ(x) and go back to the word Ψ(x).

[Figure 3.6: Illustration of the result of the operation Next(x).]

Observe that we are no longer using two out of the eight common edges, 10100 and 00101; we replace them with 10101 and 00100. However, after Next(y) we will also use these two new edges. That is why we have to negate the second output of Next. In fact, this time we first put the word ¬Ψ(y), which ends with 0101. After this word we have to traverse the whole word Ψ(y) and then go back to the word ¬Ψ(y) with the edge 11011.

[Figure 3.7: Illustration of the result of the operation ¬Next(y). Now Next(x) and ¬Next(y) have only four common edges.]

When we traverse both graphs and collect last symbols of every vertex we obtain:

Next(x) = 0001101011110010,

¬Next(y) = 1000010100111101.

As you can see, we have only four common edges in Next(x) and ¬Next(y) and they form Common(n + 1). It means that these two words are complementary.

3.6. The main algorithm

With all these operations and theorems shown earlier in this chapter we can finally present the main algorithm that produces abundant de Bruijn words in linear time with respect to the output word’s length.

function ConstructAbundant(n)
    if n = 1 then return 10
    if n = 2 then return 11001
    x ← 1100
    y ← 1100
    for i = 2 to n − 2 do
        shift x and y cyclically so that both end with 1ⁱ
        x ← Next(x)
        y ← ¬Next(y)
    shift x and y cyclically so that both start with 0ⁿ⁻¹
    return lin(x ⊗ y)
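The whole procedure can be sketched in Python as follows. This is an illustrative transcription of the pseudocode above, not a tuned implementation: all helper names (psi, neg, rotate_to_suffix, rotate_to_prefix, next_word, add, rem, construct_abundant) are our own, words are plain strings, and the helpers for add and rem assume that the longest group of equal bits does not wrap around the end of the word.

```python
from itertools import accumulate


def psi(w: str) -> str:
    """Cumulative sum modulo 2."""
    return "".join(str(s % 2) for s in accumulate(int(c) for c in w))


def neg(w: str) -> str:
    return w.translate(str.maketrans("01", "10"))


def rotate_to_suffix(w: str, gamma: str) -> str:
    """Sync(w, gamma): cyclic shift of w ending with gamma."""
    end = ((w + w).find(gamma) + len(gamma)) % len(w)
    return w[end:] + w[:end]


def rotate_to_prefix(w: str, gamma: str) -> str:
    """Cyclic shift of w starting with gamma."""
    i = (w + w).find(gamma)
    return w[i:] + w[:i]


def next_word(x: str, n: int) -> str:
    """Next(x) = Psi(x) (+) not Psi(x) for x of rank n ending with 1^n."""
    p = psi(x)
    return p + rotate_to_suffix(neg(p), p[-n:])


def add(w: str, p: str) -> str:
    """Insert one bit p into the longest group of p's."""
    k = max(len(run) for run in w.split("1" if p == "0" else "0"))
    i = w.find(p * k)
    return w[:i] + p + w[i:]


def rem(w: str, p: str) -> str:
    """Delete one bit p from the longest group of p's."""
    k = max(len(run) for run in w.split("1" if p == "0" else "0"))
    i = w.find(p * k)
    return w[:i] + w[i + 1:]


def construct_abundant(n: int) -> str:
    if n == 1:
        return "10"
    if n == 2:
        return "11001"
    x = y = "1100"
    for i in range(2, n - 1):  # i = 2, ..., n - 2; x and y have rank i here
        x = rotate_to_suffix(x, "1" * i)
        y = rotate_to_suffix(y, "1" * i)
        x, y = next_word(x, i), neg(next_word(y, i))
    x = rotate_to_prefix(x, "0" * (n - 1))
    y = rotate_to_prefix(y, "0" * (n - 1))
    z = rem(add(x, "1"), "0") + rem(add(y, "0"), "1")  # x (x) y
    return z + z[:n - 1]  # lin of a rank-n word


for n in range(1, 6):
    print(n, construct_abundant(n))
```

Each step of the loop scans or copies the current words a constant number of times, and their lengths double with every iteration, so the total work of this sketch is dominated by the last iteration.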


Example 3.6.1. The following table presents the results of function ConstructAbundant for some first values. The resulting words are linear abundant de Bruijn words of rank n.

n    ConstructAbundant(n)
1    10
2    11001
3    0111000101
4    0010111100001101001
5    000110101111100100000101001110110001

The main result now follows directly from Lemmas 3.1 and 3.2 and Theorems 3.1 and 3.2:

Theorem 3.3. The word ConstructAbundant(n) is a linear abundant de Bruijn word of rank n. It can be constructed in time linear in its length.

Chapter 4

Description of Shallit’s algorithm

It was shown in [6] that for any k ≤ 2ⁿ there exists a binary cyclic word of length k with all n-factors distinct.

In graph terminology it means that every de Bruijn graph of rank n contains a circuit of any possible length 1 ≤ k ≤ 2ⁿ⁺¹.

It is easy to observe that circuits of length 2ⁿ ≤ k ≤ 2ⁿ⁺¹ that visit all 2ⁿ vertices of G_n correspond to factor-rich words: they reach every vertex, so every n-factor occurs at least once, and they never use the same edge twice, so all (n + 1)-factors are distinct.

The existence of such circuits was shown by Shallit in [5] and it was mostly based on Yoeli’s paper [6].

However, this fact was proven without explicit construction of these circuits. Below we present the compact algorithm based on Shallit’s paper.

By P-set we mean, as in Yoeli’s paper, a set of vertex-disjoint cycles which includes all the vertices of the graph. In this algorithm we naturally extend the definition of the operation EdgesToVertices to the sets of circuits.

4.1. The algorithm

Finding a circuit of length k in the graph G_n works as follows:

• For n = 1 we output the required circuit in G₁ (lines 3–7).

• In case k < 2ⁿ we look for a circuit of the same length in G_{n−1} (line 8) and we transform this edge-circuit into a vertex-cycle in G_n (line 10).

• In case k ≥ 2ⁿ we perform the following steps:

    • As in the previous case, we look for a circuit C_{n−1} of the same length in G_{n−1} (line 8).

    • This time, however, we use it to compute its complement (line 12) and we transform this set of circuits in G_{n−1} into a set of simple cycles in G_n (line 13). Actually, it is a P-set with a cycle of length k removed. It means that its complement is the union of some cycle C′ (actually it is EdgesToVertices(C_{n−1})) of length k and a P-set which is vertex-disjoint from C′ (line 14).

    • Finally, we iteratively reduce the number of connected components until C_n is a circuit (lines 15–19).
