On Words with Maximal Number of Distinct Subwords (preliminary draft)


Wojciech Rytter

Warsaw University, Warsaw, Poland

Abstract

We say that a string is factor-maximal iff it contains the largest number of distinct factors among strings of the same length over the same alphabet. We present a family of binary factor-maximal strings that are closely related to de Bruijn words. One de Bruijn word of rank k represents, in a compact way, exponentially many factor-maximal words. As a by-product, we give a simplified linear-time construction of a factor-maximal string of a given length n.

We assume in this paper that the alphabet of the considered words is binary; this simplifies the presentation, and an extension to general finite alphabets is straightforward. Factor-maximal words are closely related to de Bruijn words and de Bruijn graphs. A de Bruijn word of rank k is any word of length $2^k$ containing cyclically each subword of length k exactly once. A linear de Bruijn word of rank k is a cyclic de Bruijn word concatenated with its prefix of size k − 1. Denote $\gamma_k = 2^k + k - 1$. The number $\gamma_k$ is the size of the linear version of a de Bruijn word of rank k. Each linear de Bruijn word is factor-maximal.

Assume we have an integer n such that $\gamma_k < n < \gamma_{k+1}$; our goal is to construct a factor-maximal binary word of size n.

Observation 1. A word w of length n, where $\gamma_k < n < \gamma_{k+1}$, is factor-maximal iff it contains each word of length k and each subword of length k + 1 occurs in w at most once.
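For very small n, Observation 1 can be checked exhaustively. The following is only a brute-force sanity check, not part of the construction; the function names `distinct_factors`, `satisfies_observation` and `check_observation` are hypothetical helpers introduced here for illustration.

```python
from itertools import product


def distinct_factors(w):
    """Return the set of distinct non-empty factors (substrings) of w."""
    return {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}


def satisfies_observation(w, k):
    """Condition of Observation 1: every binary word of length k occurs in w,
    and no factor of length k + 1 occurs twice."""
    n = len(w)
    k_factors = {w[i:i + k] for i in range(n - k + 1)}
    k1_factors = [w[i:i + k + 1] for i in range(n - k)]
    return len(k_factors) == 2 ** k and len(set(k1_factors)) == len(k1_factors)


def check_observation(n, k):
    """Compare, over all binary words of length n, 'has the maximal number of
    distinct factors' with the condition of Observation 1."""
    words = ["".join(p) for p in product("01", repeat=n)]
    counts = {w: len(distinct_factors(w)) for w in words}
    best = max(counts.values())
    return all((counts[w] == best) == satisfies_observation(w, k) for w in words)
```

For k = 2 we have $\gamma_2 = 5$ and $\gamma_3 = 10$, so the check applies to n = 6, ..., 9; the search space is exponential in n, so this is usable only for tiny lengths.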

Example 1.

The following word is a de Bruijn word of rank 4:

0 0 0 0 1 0 0 1 1 0 1 0 1 1 1 1

Its linear version is a linear de Bruijn word of size $\gamma_4 = 19$:

0 0 0 0 1 0 0 1 1 0 1 0 1 1 1 1 0 0 0.

Both these words are factor-maximal, since they contain all words of size 3 as factors and all their factors of size 4 are distinct.
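The two defining properties used in Example 1 are easy to test mechanically. A minimal sketch, assuming words are given as Python strings over '0' and '1'; the names `is_cyclic_de_bruijn` and `linearize` are hypothetical.

```python
def is_cyclic_de_bruijn(w, k):
    """True iff w has length 2^k and, read cyclically, contains every binary
    word of length k exactly once."""
    n = len(w)
    if n != 2 ** k:
        return False
    ext = w + w[:k - 1]  # unroll the cyclic word so that all k-windows are plain substrings
    return len({ext[i:i + k] for i in range(n)}) == n


def linearize(w, k):
    """Linear version of a cyclic de Bruijn word: append its prefix of size k - 1."""
    return w + w[:k - 1]
```

Applied to the word of Example 1 with k = 4, `linearize` appends the prefix 000 and produces a word of length $\gamma_4 = 19$.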

A construction of a factor-maximal word of size n, where $\gamma_k < n < \gamma_{k+1}$, has been given in [?] using a rather complicated construction of simple cycles of size r in the de Bruijn graph of rank k for any $1 \le r \le 2^k$; this construction was earlier given in [?]. In this paper we only need to find a Hamiltonian cycle, which is much easier.

The de Bruijn graph of rank k is $G_k = (V_k, E_k)$, where $V_k = \{0, 1\}^k$. The edges are of the form:

$$c \cdot \alpha \xrightarrow{\;d\;} \alpha \cdot d, \qquad c, d \in \{0, 1\},\ \alpha \in \{0, 1\}^{k-1}.$$

The label of each such edge is the symbol d, appended to α. An example of the de Bruijn graph of rank 4 is shown in Figure 2, where binary words corresponding to nodes are converted to numbers. When interpreting nodes as numbers we have the edges

$$i \xrightarrow{\;0\;} (2i \bmod 2^k), \qquad i \xrightarrow{\;1\;} ((2i + 1) \bmod 2^k).$$
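The numeric form of the edges translates directly into code. A small sketch, assuming nodes are the integers 0, ..., 2^k − 1; the names `successors` and `de_bruijn_graph` are hypothetical.

```python
def successors(i, k):
    """Out-edges of node i in the de Bruijn graph of rank k, as (target, label) pairs."""
    size = 1 << k
    return [((2 * i) % size, 0), ((2 * i + 1) % size, 1)]


def de_bruijn_graph(k):
    """Adjacency lists of G_k: node -> list of (target, label) pairs."""
    return {i: successors(i, k) for i in range(1 << k)}
```

In this representation the label of an edge is simply the lowest bit of its target, which matches the rule that the label d is the symbol appended to α.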

A path (not necessarily simple) is a chain if all its edges are distinct. It is a cyclic-chain if the first and the last vertex are the same.

Denote by val(π) the sequence of labels of edges of the chain π. Then each cyclic de Bruijn word of rank k + 1 equals val(π) for some Eulerian cycle π of the graph $G_k$ (an Eulerian cycle of $G_k$ has $2^{k+1}$ edges).

Lemma 1.

(a) If each vertex of $G_k$ has an occurrence on the chain π at distance at least k from the starting vertex, then val(π) is factor-maximal.

(b) If additionally π is cyclic, then after appending to the end of π its prefix of length at most k we also obtain a factor-maximal word.

Definition of deBruijn(C, v), where C is a Hamiltonian cycle of $G_k$.

Figure 1: The cyclic structure of $G_k$. The big cycle is a Hamiltonian cycle $C = (y_1 \cdot y_2 \cdot y_3 \cdot y_4 \cdot y_5)$.

Other (outer) cycles result by removing C from the graph. The $x_i$'s are the values (chain labels) of the outer cycles. We start in the starting node v of C and traverse the graph; the edges which are not in C have priority. We obtain the word deBruijn(C, v). The word $x_1$ corresponds to the largest outer cycle. We obtain an Euler cycle by starting with the largest cycle, traversing C and consecutive outer cycles. The edge-labels of such an Euler cycle form the word $x_1 y_1 x_2 y_2 x_3 y_3 x_4 y_4 x_5 y_5$.
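The traversal rule described above (edges outside C have priority) can be sketched as follows. This is an illustrative reading of the definition of deBruijn(C, v), not necessarily the author's implementation: it assumes nodes are integers, that C is given as a list of the $2^k$ nodes in cycle order with v = C[0] and the closing edge implied, and the function name `de_bruijn_word` is hypothetical.

```python
def de_bruijn_word(C, k):
    """Edge labels of the Euler cycle of G_k obtained by starting at C[0] and
    always preferring the unused out-edge that does not belong to C."""
    size = 1 << k
    assert sorted(C) == list(range(size)), "C must visit every node exactly once"
    c_next = {C[i]: C[(i + 1) % size] for i in range(size)}

    def outer_target(u):
        # Each node has exactly two out-edges; return the one that is not on C.
        a, b = (2 * u) % size, (2 * u + 1) % size
        assert c_next[u] in (a, b), "C is not a cycle of the de Bruijn graph"
        return b if c_next[u] == a else a

    outer_used = set()  # nodes whose non-C out-edge has already been taken
    labels = []
    u = C[0]
    for _ in range(2 * size):  # an Euler cycle of G_k has 2^(k+1) edges
        if u not in outer_used:
            outer_used.add(u)
            w = outer_target(u)
        else:
            w = c_next[u]
        labels.append(str(w & 1))  # the label is the symbol appended to the node
        u = w
    return "".join(labels)
```

The rule works because removing C leaves every node with exactly one outgoing and one incoming edge, so the remaining edges form vertex-disjoint cycles; once the traversal enters an outer cycle it follows it back to the node where it left C.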


Example. Take the de Bruijn graph $G_4$ (see Figure 2); it has 16 nodes and 32 edges.

Figure 2: The graph $G_4$ can be decomposed into one Hamiltonian cycle and 5 edge-disjoint cycles C1, C2, C3, C4, C5, whose label words have sizes $|x_1| = 8$, $|x_2| = 1$, $|x_3| = 4$, $|x_4| = 1$, $|x_5| = 2$.

There is a Hamiltonian cycle

C = (8, 0, 1, 3, 7, 15, 14, 12, 9, 2, 5, 11, 6, 13, 10, 4, 8).

After removing this cycle we have 5 disjoint cycles:

[8, 1, 2, 4, 9, 3, 6, 12, 8] [0, 0] [7, 14, 13, 11, 7] [15, 15] [5, 10, 5].
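The outer cycles can be recomputed from C alone, since every node keeps exactly one out-edge after C is removed. A minimal sketch under the same conventions as the text (integer nodes, edges i → 2i mod 2^k and i → 2i + 1 mod 2^k); the function name `outer_cycles` is hypothetical, and C is assumed to list each node once, with the closing edge implied.

```python
def outer_cycles(C, k):
    """Cycles that remain in G_k after removing the edges of the Hamiltonian
    cycle C, listed in the order in which C first meets them."""
    size = 1 << k
    c_next = {C[i]: C[(i + 1) % size] for i in range(size)}
    other = {}  # for every node, the target of its out-edge that is not on C
    for u in range(size):
        a, b = (2 * u) % size, (2 * u + 1) % size
        assert c_next[u] in (a, b), "C is not a cycle of the de Bruijn graph"
        other[u] = b if c_next[u] == a else a

    seen, cycles = set(), []
    for start in C:
        if start in seen:
            continue
        cycle, u = [start], other[start]
        seen.add(start)
        while u != start:
            seen.add(u)
            cycle.append(u)
            u = other[u]
        cycles.append(cycle + [start])  # repeat the first node to close the cycle
    return cycles
```

For the Hamiltonian cycle C of the example, starting at node 8, this produces five cycles of sizes 8, 1, 4, 1 and 2.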

The total structure of an Euler cycle induced by C looks as follows:

[8, 1, 2, 4, 9, 3, 6, 12, 8] [0, 0] 1, 3, [7, 14, 13, 11, 7] [15, 15] 14, 12, 9, 2, [5, 10, 5] 11, 6, 13, 10, 4, 8.

Taking the labels of consecutive edges, this yields a de Bruijn sequence

[10011000] 0 [0] 111 [0111] 1 [1] 00101 [01] 101000

It can be written as $DB_k = x_1 y_1 x_2 y_2 x_3 y_3 x_4 y_4 x_5 y_5$, where:

$x_1 = 10011000$, $y_1 = 0$,
$x_2 = 0$, $y_2 = 111$,
$x_3 = 0111$, $y_3 = 1$,
$x_4 = 1$, $y_4 = 00101$,
$x_5 = 01$, $y_5 = 101000$.

The sequence $DB_{k-1} = y_1 y_2 y_3 y_4 y_5$ is a de Bruijn sequence of smaller rank. We append the first k symbols to the end and create the sequence:

$\alpha_k$ = [10011000] 0 [0] 111 [0111] 1 [1] 00101 [01] 101000 1001


Then for each $2^k < n \le 2^{k+1} + k$ we can create a factor-maximal word of size n by concatenating a subword of $DB_k$ with a suffix of $DB_{k-1}$ and a prefix of $DB_k$ of length at most k.

Theorem 1.

For each k there is a pair of twin de Bruijn words $w_k$ of length $2^k$ and $w_{k+1}$ of length $2^{k+1}$ such that for any $\gamma_k \le n < \gamma_{k+1}$ there is a factor-maximal subword of $w_k w_{k+1}$ of length n.

Specifically, $\mathrm{Suf}(n - p, w_k) \cdot \mathrm{Pref}(p, w_{k+1})$ is factor-maximal for a parameter p = p(n).

The words $w_k$, $w_{k+1}$ and the parameter p can be computed in linear time.

Proof. Let us consider the corresponding decomposition of the de Bruijn word $w = w_{k+1}$:

$$w = x_1 \cdot y_1 \cdot x_2 \cdot y_2 \cdots x_r \cdot y_r. \qquad (1)$$

Let us distinguish the positions corresponding to elements of the $x_i$.

We compute the size p of the shortest prefix of $w_{k+1}$ containing $n - \gamma_k$ distinguished positions. Then we obtain a factor-maximal word of size n as

$$\mathrm{MaxFactor}(n) = \mathrm{Suf}(n - p, w_k) \cdot \mathrm{Pref}(p, w_{k+1}).$$

Observation 2. The word $w_k$ and its decomposition are the same for all $\gamma_k < n < \gamma_{k+1}$.
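The formula for MaxFactor(n) can be tried out on the decomposition from the example. The sketch below is one possible reading of the draft, not a definitive implementation: it assumes $w_k = y_1 \cdots y_r$ and $w_{k+1} = x_1 y_1 \cdots x_r y_r$, and it additionally assumes (this is not stated in the text) that when n − p exceeds the length of $w_k$ the suffix is read cyclically, that is, taken from $w_k w_k$. The name `max_factor` and its parameters are hypothetical.

```python
def max_factor(n, k, xs, ys):
    """Candidate factor-maximal word of length n, gamma_k <= n < gamma_{k+1}.

    xs, ys -- the words x_1..x_r and y_1..y_r of the decomposition (1).
    """
    w_k = "".join(ys)                               # labels of the Hamiltonian cycle
    w_k1 = "".join(x + y for x, y in zip(xs, ys))   # labels of the Euler cycle
    gamma_k = 2 ** k + k - 1
    q = n - gamma_k                                 # distinguished positions needed
    assert 0 <= q <= len(w_k), "n is outside the supported range"

    # Mark the positions of w_{k+1} that belong to some x_i.
    marked = []
    for x, y in zip(xs, ys):
        marked += [True] * len(x) + [False] * len(y)

    # p = length of the shortest prefix containing q distinguished positions.
    p, seen = 0, 0
    while seen < q:
        seen += marked[p]
        p += 1

    m = n - p                                       # length of the suffix taken from w_k
    doubled = w_k + w_k                             # assumption: cyclic reading of Suf
    return doubled[len(doubled) - m:] + w_k1[:p]
```

With the words $x_i$, $y_i$ of the example and k = 4, the result can be tested against the condition of Observation 1 for every n from $\gamma_4 = 19$ up to $\gamma_5 - 1 = 35$.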