On Words with Maximal Number of Distinct Subwords (preliminary draft)
Wojciech Rytter
Warsaw University, Warsaw, Poland
Abstract
We say that a string is factor-maximal iff it contains the largest number of distinct factors among strings of the same length and over the same alphabet. We show a family of binary strings which are factor-maximal and which are closely related to de Bruijn words. One de Bruijn word of rank k represents, in a compact way, exponentially many factor-maximal words. As a by-product, we give a simplified linear-time construction of a factor-maximal string of a given length n.
We assume in this paper that the alphabet of the considered words is binary; this simplifies the presentation, and the extension to general finite alphabets is straightforward. Factor-maximal words are closely related to de Bruijn words and de Bruijn graphs. A de Bruijn word of rank k is any word of length $2^k$ containing cyclically each subword of length k exactly once. A linear de Bruijn word of rank k is a cyclic de Bruijn word concatenated with its prefix of size k − 1. Denote $\gamma_k = 2^k + k - 1$. The number $\gamma_k$ is the size of the linear version of a de Bruijn word of rank k. Each linear de Bruijn word is factor-maximal.
Assume we have an integer n such that $\gamma_k < n < \gamma_{k+1}$; our goal is to construct a factor-maximal binary word of size n.
Observation 1. A word w of length n, where $\gamma_k < n < \gamma_{k+1}$, is factor-maximal iff it contains each word of length k and each subword of length k + 1 occurs in w at most once.
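Observation 1 can be checked exhaustively for small lengths. The following is a minimal Python sketch, not part of the original construction; the function names `distinct_factors`, `satisfies_observation` and `check` are hypothetical, and the brute-force search is only feasible for small n.

```python
from itertools import product


def distinct_factors(w: str) -> int:
    """Number of distinct non-empty factors (substrings) of w."""
    return len({w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)})


def satisfies_observation(w: str, k: int) -> bool:
    """Condition of Observation 1: every binary word of length k occurs in w,
    and no factor of length k + 1 occurs twice."""
    k_factors = {w[i:i + k] for i in range(len(w) - k + 1)}
    longer = [w[i:i + k + 1] for i in range(len(w) - k)]
    return len(k_factors) == 2 ** k and len(longer) == len(set(longer))


def check(n: int, k: int) -> bool:
    """Brute force over all binary words of length n: the condition holds
    exactly for the words with the maximal number of distinct factors."""
    words = ["".join(p) for p in product("01", repeat=n)]
    best = max(distinct_factors(w) for w in words)
    return all((distinct_factors(w) == best) == satisfies_observation(w, k)
               for w in words)
```

For k = 2 the admissible lengths are 6 ≤ n ≤ 9, so a call such as `check(7, 2)` compares the two notions on all 128 binary words of length 7.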
Example 1.
The following word is a de Bruijn word of rank 4:
0 0 0 0 1 0 0 1 1 0 1 0 1 1 1 1
Its linear version is a linear de Bruijn word of size $\gamma_4 = 19$:
0 0 0 0 1 0 0 1 1 0 1 0 1 1 1 1 0 0 0
Both these words are factor-maximal, since they contain all words of size 3 as factors and all their factors of size 4 are distinct.
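The claims of Example 1 can be verified mechanically. The sketch below is illustrative only; the helper names `is_cyclic_de_bruijn`, `linear_version` and `factors` are hypothetical.

```python
def is_cyclic_de_bruijn(w: str, k: int) -> bool:
    """True if w has length 2^k and its cyclic windows of length k are pairwise distinct."""
    n = len(w)
    wrapped = w + w[:k - 1]  # unroll the cyclic word so windows can be read linearly
    windows = {wrapped[i:i + k] for i in range(n)}
    return n == 2 ** k and len(windows) == n


def linear_version(w: str, k: int) -> str:
    """Cyclic de Bruijn word followed by its prefix of size k - 1."""
    return w + w[:k - 1]


def factors(w: str, m: int) -> list:
    """All factors of w of length m, with repetitions, in order of occurrence."""
    return [w[i:i + m] for i in range(len(w) - m + 1)]


word = "0000100110101111"
linear = linear_version(word, 4)

for u in (word, linear):
    # all 8 binary words of size 3 occur, and no factor of size 4 repeats
    assert len(set(factors(u, 3))) == 8
    assert len(factors(u, 4)) == len(set(factors(u, 4)))
```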
A construction of a factor-maximal word of size n, for $\gamma_k < n < \gamma_{k+1}$, has been given in [?] using a rather complicated construction of simple cycles of size r in the de Bruijn graph of rank k for any $1 \le r \le 2^k$; this construction was earlier given in [?]. In this paper we only need to find a Hamiltonian cycle, which is much easier.
The de Bruijn graph of rank k is $G_k = (V_k, E_k)$, where $V_k = \{0,1\}^k$. The edges are of the form:
$c \cdot \alpha \xrightarrow{d} \alpha \cdot d$, where $c, d \in \{0,1\}$ and $\alpha \in \{0,1\}^{k-1}$.
The label of each such edge is the symbol d, appended to α. An example of a de Bruijn graph of rank 4 is shown in Figure 2, where binary words corresponding to nodes are converted to numbers. When interpreting nodes as numbers we have the edges
$i \xrightarrow{0} (2i \bmod 2^k)$, $i \xrightarrow{1} ((2i + 1) \bmod 2^k)$.
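The numeric form of the edges is easy to tabulate. A minimal sketch, with hypothetical helper names `successors` and `edges`, which also cross-checks the numeric rule against the word form c · α → α · d:

```python
def successors(i: int, k: int) -> list:
    """The two out-neighbours of node i in the de Bruijn graph of rank k."""
    size = 2 ** k
    return [(2 * i) % size, (2 * i + 1) % size]


def edges(k: int) -> list:
    """All edges of the graph as triples (source, target, label);
    the label is the last binary digit of the target."""
    return [(i, j, j & 1) for i in range(2 ** k) for j in successors(i, k)]


# Cross-check with the word form for rank 4: dropping the first symbol of the
# source and appending the label must give the target.
for i, j, d in edges(4):
    assert format(i, "04b")[1:] + str(d) == format(j, "04b")
```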
A path (not necessarily simple) is a chain if all its edges are distinct. It is a cyclic-chain if the first and the last vertex are the same.
Denote by val(π) the sequence of labels of the edges of the chain π. Since $G_k$ has $2^{k+1}$ edges, each cyclic de Bruijn word of rank k + 1 equals val(π) for some Eulerian cycle π of the graph $G_k$.
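The relation between Eulerian cycles and de Bruijn words can be illustrated as follows. This is a sketch only: it uses Hierholzer's algorithm, which is a standard technique and not the construction of this paper, and the names `eulerian_cycle` and `val` (on vertex sequences) are assumptions of the sketch. With the vertex set $\{0,1\}^k$ the graph has $2^{k+1}$ edges, so the label word has length $2^{k+1}$, and the test checks that its cyclic windows of length k + 1 are pairwise distinct.

```python
def eulerian_cycle(k: int) -> list:
    """Hierholzer's algorithm on the de Bruijn graph of rank k.
    Returns the visited vertices; the first and the last vertex are both 0."""
    size = 2 ** k
    # unused out-edges of every vertex, stored as lists of targets
    unused = {v: [(2 * v + 1) % size, (2 * v) % size] for v in range(size)}
    stack, cycle = [0], []
    while stack:
        v = stack[-1]
        if unused[v]:
            stack.append(unused[v].pop())  # extend the current trail
        else:
            cycle.append(stack.pop())      # vertex is finished, emit it
    cycle.reverse()
    return cycle


def val(path: list) -> str:
    """Edge labels along a vertex path: the label of an edge is the last bit of its target."""
    return "".join(str(v & 1) for v in path[1:])
```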
Lemma 1.
(a) If each vertex of $G_k$ has an occurrence on the chain π at distance at least k from the starting vertex, then val(π) is factor-maximal.
(b) If additionally π is cyclic, then after appending to the end of π its prefix of length at most k we also obtain a factor-maximal word.
Definition of deBruijn(C, v), where C is a Hamiltonian cycle of $G_k$.
Figure 1: The cyclic structure of $G_k$. The big cycle is a Hamiltonian cycle $C = (y_1 \cdot y_2 \cdot y_3 \cdot y_4 \cdot y_5)$. The other (outer) cycles result from removing C from the graph. The $x_i$'s are the values (chain labels) of the outer cycles. We start in the starting node v of C and traverse the graph; the edges which are not in C have priority. We obtain the word deBruijn(C, v). The word $x_1$ corresponds to the largest outer cycle. We obtain an Euler cycle by starting with the largest cycle, then traversing C and the consecutive outer cycles. The edge labels of such an Euler cycle form the word $x_1 y_1 x_2 y_2 x_3 y_3 x_4 y_4 x_5 y_5$.
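The traversal described in the caption can be sketched in Python as follows. Two points are assumptions of this sketch and not taken from the text: the Hamiltonian cycle is produced by the classical greedy prefer-one rule (any Hamiltonian cycle of the graph could be used instead), and the starting node v is simply passed as a parameter rather than chosen on the largest outer cycle. The function names are hypothetical.

```python
def hamiltonian_cycle(k: int) -> list:
    """A Hamiltonian cycle of the de Bruijn graph of rank k, as a vertex list.
    Assumption of this sketch: greedy prefer-one rule starting at node 0."""
    size = 2 ** k
    cycle, seen, v = [0], {0}, 0
    while True:
        for w in ((2 * v + 1) % size, (2 * v) % size):
            if w not in seen:
                break          # first unvisited successor, preferring label 1
        else:
            return cycle       # both successors visited: the cycle is complete
        seen.add(w)
        cycle.append(w)
        v = w


def de_bruijn(cycle: list, k: int, v: int) -> str:
    """Edge labels of the walk that starts at v and, at every node,
    prefers the unused edge outside the cycle over the edge of the cycle."""
    nxt = {cycle[i]: cycle[(i + 1) % len(cycle)] for i in range(len(cycle))}
    used, labels = set(), []
    while True:
        on_c = nxt[v]
        off_c = on_c ^ 1       # the two successors differ only in the last bit
        for w in (off_c, on_c):
            if (v, w) not in used:
                used.add((v, w))
                labels.append(str(w & 1))
                v = w
                break
        else:
            return "".join(labels)  # no unused edge left at the current node
```

For example, `de_bruijn(hamiltonian_cycle(4), 4, 0)` walks over all 32 edges of the rank-4 graph, so the resulting word has length 32.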
Example. Take the de Bruijn graph $G_4$ (see Figure 2); it has 16 nodes and 32 edges.
[Figure 2: drawing of the de Bruijn graph $G_4$, with nodes shown as the numbers 0 to 15 and cycles marked C1 to C5.]