USEFULNESS OF DIRECTED ACYCLIC SUBWORD GRAPHS IN PROBLEMS RELATED TO STANDARD STURMIAN WORDS

International Journal of Foundations of Computer Science

© World Scientific Publishing Company

PAWEŁ BATURO

Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland

and

MARCIN PIĄTKOWSKI

Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland

and

WOJCIECH RYTTER ∗

Institute of Informatics, Warsaw University, and

Faculty of Mathematics and Computer Science, Nicolaus Copernicus University

ABSTRACT

The class of finite Sturmian words consists of words having a particularly simple compressed representation, which is a generalization of the Fibonacci recurrence for Fibonacci words. The subword graphs of these words (especially their compacted versions) have a very special regular structure. In this paper we investigate this structure in more detail than in previous papers and show how several syntactical properties of Sturmian words follow from their graph properties. Consequently, simple alternative graph-based proofs of several known facts are presented. The very special structure of subword graphs also leads to simple algorithms computing some parameters of Sturmian words: the number of subwords, the critical factorization point, lexicographically maximal suffixes, occurrences of subwords of a fixed length, and right special factors. These algorithms work in linear time with respect to n, the size of the compressed representation of the standard word, though the words themselves can be of exponential size with respect to n. Some of the computed parameters can also be of exponential size; however, we provide their linear-size compressed representations. This gives more examples of fast computations for highly compressed words. We also introduce a new concept related to standard words: Ostrowski automata.

Keywords: Sturmian words, subword graphs, numeration systems

∗ Supported by the grant N206 004 32/0806 of the Polish Ministry of Science and Higher Education.

1. Introduction

The standard Sturmian words (standard words, in short) are aperiodic words of minimal combinatorial complexity. They are a generalization of Fibonacci words and have a very simple grammar-based representation, which has some algorithmic consequences.

Let S denote the set of all standard Sturmian words. These words are defined over the binary alphabet Σ = {a, b} and are described by recurrences (or a grammar-based representation) corresponding to so-called directive sequences: integer sequences

$$\gamma = (\gamma_0, \gamma_1, \ldots, \gamma_n),$$

where $\gamma_0 \ge 0$ and $\gamma_i > 0$ for $0 < i \le n$. The word $x_{n+1}$ corresponding to γ, denoted by Sw(γ), is defined by the recurrences:

$$x_{-1} = b, \qquad x_0 = a, \qquad x_{i+1} = x_i^{\gamma_i}\, x_{i-1} \ \text{ for } 0 \le i \le n. \qquad (1)$$

Fibonacci words are standard Sturmian words given by the directive sequences of the form

$$\gamma = (1, 1, \ldots, 1).$$

We consider here standard words starting with the letter a, hence we assume $\gamma_0 > 0$. The case $\gamma_0 = 0$ can be considered similarly.

For even n > 0 a standard word $x_n$ has the suffix ba, and for odd n > 0 it has the suffix ab. The number $N = |x_{n+1}|$ is the (real) size, while n + 1 can be thought of as the compressed size.

Example 1.

Consider the directive sequence γ = (1, 2, 1, 3, 1). We have Sw(1, 2, 1, 3, 1) = ababaabababaabababaabababaababaab, since:

$$\begin{aligned}
x_{-1} &= b \\
x_0 &= a \\
x_1 &= x_0^1\, x_{-1} = a \cdot b \\
x_2 &= x_1^2\, x_0 = ab \cdot ab \cdot a \\
x_3 &= x_2^1\, x_1 = ababa \cdot ab \\
x_4 &= x_3^3\, x_2 = ababaab \cdot ababaab \cdot ababaab \cdot ababa \\
x_5 &= x_4^1\, x_3 = ababaabababaabababaabababa \cdot ababaab
\end{aligned}$$
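The recurrences (1) translate directly into a few lines of code. The following is a minimal sketch, not part of the original paper; the function names `standard_words` and `sw` are illustrative choices. It builds the words explicitly, so it is only practical for short directive sequences.

```python
def standard_words(gamma):
    """Return [x_{-1}, x_0, x_1, ..., x_{n+1}] for the directive sequence gamma."""
    words = ["b", "a"]                            # x_{-1} = b, x_0 = a
    for g in gamma:
        # recurrence (1): x_{i+1} = x_i^{gamma_i} x_{i-1}
        words.append(words[-1] * g + words[-2])
    return words


def sw(gamma):
    """Sw(gamma): the last word produced by the recurrences."""
    return standard_words(gamma)[-1]
```

Calling `standard_words([1, 2, 1, 3, 1])` reproduces the words of Example 1 in order, starting from x_{-1}.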

Grammar-based compression consists in describing a given word by a context-free grammar G generating this (single) word. The size of the grammar G is the total length of all productions of G. Some of the outputs of our algorithms will be given in grammar-compressed form. In particular, each directive sequence of a standard Sturmian word corresponds to such a compression: the sequence of recurrences corresponding to the directive sequence. In this case the size of the grammar is proportional to the length of the directive sequence.
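As an illustration of working on the compressed form, the sketch below reads a single letter of Sw(γ) by descending through the recurrences, using only the lengths of the words involved. It is an assumed example, not an algorithm from the paper; `char_at` and `expand` are hypothetical names, and `expand` is included only to check the result on small inputs.

```python
def expand(gamma):
    """Sw(gamma), built explicitly (its length can be exponential in len(gamma))."""
    prev, cur = "b", "a"                          # x_{-1}, x_0
    for g in gamma:
        prev, cur = cur, cur * g + prev           # x_{i+1} = x_i^{gamma_i} x_{i-1}
    return cur


def char_at(gamma, pos):
    """Letter at 0-based position pos of Sw(gamma), without expanding the word."""
    size = [1, 1]                                 # size[k + 1] = |x_k|, starting at k = -1
    for g in gamma:
        size.append(g * size[-1] + size[-2])
    if not 0 <= pos < size[-1]:
        raise IndexError("position outside Sw(gamma)")
    level = len(gamma) + 1                        # size[level] is the current word length
    while level >= 2:
        block, reps = size[level - 1], gamma[level - 2]
        if pos < reps * block:                    # inside one of the copies of x_i
            pos %= block
            level -= 1
        else:                                     # inside the trailing x_{i-1}
            pos -= reps * block
            level -= 2
    return "a" if level == 1 else "b"             # reached x_0 = a or x_{-1} = b
```

Each loop iteration moves at least one level down, so a lookup takes a number of steps proportional to the length of the directive sequence, even when the word itself is far too long to store.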

For more information, some lexicographic properties and the structure of repetitions of standard Sturmian words see [2, 3, 4].

2. The structure of subwords and subword graphs of standard words

The subword graph is a classical data structure representing all subwords of a given word in a succinct manner. More precisely: the directed acyclic word graph (the dawg, in short) of the word w is the minimal deterministic automaton (not necessarily complete) that accepts all suffixes of w. We refer the reader to [5] for the complete definition and more information on subword graphs.

For words w and u denote by $p_w(u)$ the shortest prefix of w having u as a suffix.

The smallest possible number of states of the dawg of w is |w| + 1. We say that w is simplistic if the dawg of w has exactly |w| + 1 nodes. The simplistic words have the simplest dawgs.

The following crucial fact describes the structure of the dawg of w.

Lemma 1 (See [11]) Let w be a standard Sturmian word. Then:

(a) The word w is simplistic.

(b) The nodes of the dawg of w can be identified with prefixes of w.

(c) Each edge of the dawg of w is of the form $\alpha \xrightarrow{\,s\,} p_w(\alpha s)$, where s ∈ {a, b} and α is a prefix of w.

Figure 1: The structure of the subword graph (dawg) and its compacted version (cdawg) of Sw(1, 2, 1, 3, 1).

The compacted subword graph (the cdawg, in short) results from the subword graph by removing all nodes of out-degree one and replacing each chain by a single edge whose label is the path label of this chain. We compact the chains of the dawg as much as possible, with the following restriction: for each node v, all incoming edges of v have the same label (possibly long). The restriction implies that we cannot fully compress the last chain going into the sink node. This chain is a concatenation of some basic subword $y_k$ and a two-letter word u (ab or ba). We split this chain into two edges: the first labelled by $y_k$, and the second labelled by u and going into the sink node. Internal nodes of the dawg of out-degree two, which are copied to the cdawg, are called fork nodes. For an example of a dawg and its cdawg see Figure 1.

Building blocks

First we consider the relations between subwords which are the “building blocks” of the subword graph of a standard word. Let Subwords(x) be the set of all nonempty subwords of x.

The subwords which are building blocks of dawgs and cdawgs are classified as follows:

• A special prefix of x is a prefix z of x such that za, zb ∈ Subwords(x).

• A basic prefix of x is a proper nonempty prefix of the type $x_k^j\, x_{k-1}$, where $0 \le k \le n$ and $0 \le j \le \gamma_k$.

• A basic subword of x is the reverse of $x_k$, for some k. Denote $y_k = \mathrm{Reverse}(x_k)$.

See Figure 2 for the “building blocks” structure of an example word.

It follows directly from Lemma 1 that:

Fact 1 If w is a standard Sturmian word then the nodes of the cdawg of w of out-degree 2 (all except the last two nodes) correspond to special prefixes of w.

From the point of view of the compacted subword graph G of a standard word, the most important subwords are the special prefixes. On the other hand, special prefixes are composed of basic subwords, and basic subwords are labels of the edges of the compacted subword graph. Hence special prefixes and basic subwords are the main “building blocks” of standard words with respect to compacted subword graphs.

The third type of subwords, basic prefixes, is also important, since special prefixes are “almost” the same as basic prefixes, and basic prefixes correspond more directly to the recurrences. They are a “link” between the directive sequence and special prefixes.

Denote by BP(x) the set of basic prefixes of x and by SP(x) the set of special prefixes of x. Denote by $\hat{x}$ the prefix of x of size 2; exceptionally, define $\hat{y}_0 = ab$.

Lemma 2 [Building blocks] Assume that $x_{-1}, x_0, \ldots, x_{n+1}$ is the sequence of standard Sturmian words given by $\gamma = (\gamma_0, \gamma_1, \ldots, \gamma_n)$.

(a) For $i \ge 1$ we can represent the standard word $x_i$ as

$$x_i = y_0^{\gamma_0}\, y_1^{\gamma_1} \cdots y_{i-2}^{\gamma_{i-2}}\, y_{i-1}^{\gamma_{i-1}-1}\, \hat{y}_{i-1}.$$

(b) Each special prefix z of the word $x_n$ is of the form $z = y_0^{\gamma_0}\, y_1^{\gamma_1} \cdots y_{i-1}^{\gamma_{i-1}}\, y_i^{j}$, where $0 \le j \le \gamma_i$ for $i < n - 1$ and $0 \le j \le \gamma_i - 1$ for $i = n - 1$.

(c) Each special prefix of $x_n$ results by removing the last two letters from the corresponding basic prefix of $x_n$.

Figure 2: The structure of basic prefixes (BP), special prefixes (SP) and basic subwords of Word(1, 2, 1, 3, 1).

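Proof. We show each point separately.

Point (a). Notice that $\hat{y}_i = \hat{y}_{i+2}$ and $y_{i+1} = y_{i-1}\, y_i^{\gamma_i}$ for $i \ge 0$. First we show by induction that

$$y_i = \hat{y}_i\, y_0^{\gamma_0}\, y_1^{\gamma_1} \cdots y_{i-1}^{\gamma_{i-1}-1}. \qquad (2)$$

For i = 1 we have

$$y_1 = b\, a^{\gamma_0} = \hat{y}_1\, y_0^{\gamma_0 - 1}.$$

Assume that equation (2) holds for $i \le n$. We have

$$\begin{aligned}
y_{n+1} &= y_{n-1} \cdot y_n^{\gamma_n} \\
&= \hat{y}_{n-1}\, y_0^{\gamma_0}\, y_1^{\gamma_1} \cdots y_{n-2}^{\gamma_{n-2}-1} \cdot y_{n-2}\, y_{n-1}^{\gamma_{n-1}}\, y_n^{\gamma_n - 1} \\
&= \hat{y}_{n+1}\, y_0^{\gamma_0}\, y_1^{\gamma_1} \cdots y_n^{\gamma_n - 1}.
\end{aligned}$$

Now we can prove the equation from point (a) by induction. For i = 1 we have:

$$x_1 = x_0^{\gamma_0}\, x_{-1} = y_0^{\gamma_0 - 1}\, \hat{y}_0.$$

Assume that the equation from point (a) holds for $i \le n$. We have:

$$\begin{aligned}
x_{n+1} = x_n^{\gamma_n}\, x_{n-1} &= \left( y_0^{\gamma_0} \cdots y_{n-2}^{\gamma_{n-2}}\, y_{n-1}^{\gamma_{n-1}-1}\, \hat{y}_{n-1} \right)^{\gamma_n} \cdot\; y_0^{\gamma_0} \cdots y_{n-2}^{\gamma_{n-2}-1}\, \hat{y}_{n-2} \\
&= y_0^{\gamma_0} \cdots y_{n-1}^{\gamma_{n-1}}\, y_n^{\gamma_n - 1}\, \hat{y}_n.
\end{aligned}$$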

Point (b). Point (a) implies that $z = y_0^{\gamma_0}\, y_1^{\gamma_1} \cdots y_i^{j}$ is a prefix of the standard word $x_n$ generated by the directive sequence $(\gamma_0, \gamma_1, \ldots, \gamma_n)$, where $0 \le j \le \gamma_i$ for $i < n - 1$ and $0 \le j \le \gamma_i - 1$ for $i = n - 1$. We can also deduce that z is a palindrome (see [4] for the proof that every standard word x with its last two letters removed is a palindrome). Hence, if z is a special prefix of a standard word x, then z is also a suffix of x.

First assume that $i < n - 1$ and i is odd; the case of even i is similar.

If $0 \le j < \gamma_i$, then z is a prefix of $x_{i+2}$ and zb is also a prefix of $x_{i+2}$ (the first letter of $y_i$ is b). The suffix of $x_{i+2}$ is ab, hence za, as a suffix of $x_{i+2}$, is also a subword of $x_{i+2}$.

If $j = \gamma_i$, then z is a prefix of $x_{i+3}$ and za is also a prefix of $x_{i+3}$ (the first letter of $y_{i+1}$ is a). The suffix of $x_{i+3}$ is ba, hence zb, as a suffix of $x_{i+3}$, is also a subword of $x_{i+3}$.

Now assume that $i = n - 1$. For $0 \le j < \gamma_{n-1}$ the proof is similar to the case $i < n - 1$. It is obvious now that $j < \gamma_{n-1}$ for $i = n - 1$.

Point (c). Notice that $\hat{y}_i = \hat{y}_{i+2}$ and $y_{i+1} = y_{i-1}\, y_i^{\gamma_i}$ for $i \ge 0$.

Point (a) implies that the basic prefix $x_k^j\, x_{k-1}$ equals:

$$\left( y_0^{\gamma_0} \cdots y_{k-2}^{\gamma_{k-2}}\, y_{k-1}^{\gamma_{k-1}-1}\, \hat{y}_{k-1} \right)^{j} \cdot\; y_0^{\gamma_0} \cdots y_{k-3}^{\gamma_{k-3}}\, y_{k-2}^{\gamma_{k-2}-1}\, \hat{y}_{k-2} \;=\; y_0^{\gamma_0} \cdots y_{k-1}^{\gamma_{k-1}}\, y_k^{j-1}\, \hat{y}_k.$$

From (b) we conclude that the basic prefix $x_k^j\, x_{k-1}$ with its last two letters ($\hat{y}_k$) removed is a special prefix. □

Example 2.

For Sw(1, 2, 1, 3, 1) = ababaabababaabababaabababaababaab we have:

$$BP = \{x_0,\ x_1,\ x_1 x_0,\ x_2,\ x_3,\ x_3 x_2,\ x_3^2 x_2,\ x_4\}$$

$$SP = \{y_0,\ y_0 y_1,\ y_0 y_1^2,\ y_0 y_1^2 y_2,\ y_0 y_1^2 y_2 y_3,\ y_0 y_1^2 y_2 y_3^2\}$$

where $y_0 = a$, $y_1 = ba$, $y_2 = ababa$, $y_3 = baababa$.

$$\mathrm{Sw}(1, 2, 1, 3, 1) = a \cdot ba \cdot ba \cdot ababa \cdot baababa \cdot baababa \cdot baababa \cdot ab = y_0\, y_1^2\, y_2\, y_3^3\, \hat{y}_4$$
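The factorization of Lemma 2(a) and the set SP of Example 2 can be checked mechanically on small inputs. The sketch below is an illustration under stated assumptions, not the paper's algorithm: it applies Lemma 2(a) to the last word x_{n+1}, finds special prefixes by brute-force substring search, and uses the hypothetical helper names `factorization` and `special_prefix_lengths`.

```python
def standard_words(gamma):
    """words[k + 1] is x_k, for k = -1, 0, ..., n+1."""
    words = ["b", "a"]
    for g in gamma:
        words.append(words[-1] * g + words[-2])
    return words


def factorization(gamma):
    """Blocks of Lemma 2(a) for x_{n+1}: y_0^{g_0} ... y_{n-1}^{g_{n-1}} y_n^{g_n - 1} hat{y}_n."""
    xs = standard_words(gamma)
    n = len(gamma) - 1
    y = [xs[k + 1][::-1] for k in range(n + 1)]       # y_k = Reverse(x_k)
    y_hat = "ab" if n == 0 else y[n][:2]              # hat{y}_0 = ab by convention
    blocks = [y[k] for k in range(n) for _ in range(gamma[k])]
    blocks += [y[n]] * (gamma[n] - 1) + [y_hat]
    return blocks


def special_prefix_lengths(w):
    """Lengths m >= 1 such that both w[:m] + 'a' and w[:m] + 'b' occur in w."""
    return [m for m in range(1, len(w))
            if w[:m] + "a" in w and w[:m] + "b" in w]
```

For γ = (1, 2, 1, 3, 1) the blocks are a, ba, ba, ababa, three copies of baababa, and ab, and the special prefix lengths are 1, 3, 5, 10, 17, 24, which are the lengths of the six words listed in SP.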

The structure of the compacted dawg

The regularity of the structure of compacted subword graphs was discovered in [8]. The main point is that the cdawg is exponentially smaller than the dawg for the standard word w. The following fact is an implication of the results of [8], Lemma 1, Lemma 2 and our terminology.

Fact 2 Let $w = \mathrm{Sw}(\gamma_0, \gamma_1, \ldots, \gamma_n)$ be a standard Sturmian word.

(1) The labels of edges in the cdawg of w are basic subwords of w.

(2) The compacted subword graph of w has the following structure:

• Each node corresponding to a special prefix $y_0^{\gamma_0}\, y_1^{\gamma_1} \cdots y_{i-1}^{\gamma_{i-1}}\, y_i^{k}$, for $0 \le k < \gamma_i$, has two outgoing edges:

– $y_0^{\gamma_0}\, y_1^{\gamma_1} \cdots y_{i-1}^{\gamma_{i-1}}\, y_i^{k} \;\xrightarrow{\;y_i\;}\; y_0^{\gamma_0}\, y_1^{\gamma_1} \cdots y_{i-1}^{\gamma_{i-1}}\, y_i^{k+1}$

– $y_0^{\gamma_0}\, y_1^{\gamma_1} \cdots y_{i-1}^{\gamma_{i-1}}\, y_i^{k} \;\xrightarrow{\;y_{i+1}\;}\; y_0^{\gamma_0}\, y_1^{\gamma_1} \cdots y_{i-1}^{\gamma_{i-1}}\, y_i^{k}\, y_{i+1}$

• Each edge leading to the sink node has label $\hat{y}_n$ (ab or ba).

• The last but one node does not correspond to a special prefix and has out-degree 1.

(see Figure 3 for comparison).

Figure 3: The compacted subword graphs of the words (a): $\mathrm{Sw}(\gamma_0, \gamma_1, \gamma_2, \ldots, \gamma_n)$ and (b): $\mathrm{Sw}(\gamma_0, \gamma_1, \gamma_2, \ldots, \gamma_n - 1, 1)$ are isomorphic (in the sense of graph structure).

3. The number of subwords

It is known that the number of distinct subwords in the n-th Fibonacci word is

$$|\mathrm{Subwords}(Fib_{n+1})| = |Fib_n| \cdot |Fib_{n-1}| + 2 \cdot |Fib_n| - 1.$$

Surprisingly, essentially the same formula works generally for Sturmian words.

Fact 3 Let $\gamma_n = 1$ and $x_{n+1} = \mathrm{Sw}(\gamma_0, \gamma_1, \ldots, \gamma_n)$. Then:

$$|\mathrm{Subwords}(x_{n+1})| = |x_n| \cdot |x_{n-1}| + 2 \cdot |x_n| - 1.$$

Proof.

Denote by $v_0$ the source node of the cdawg of $x_{n+1}$. Let $t_k = |x_k|$. In the compacted subword graph of $x_{n+1}$ define the multiplicity mult(v) of a vertex v as the number of paths $v_0 \leadsto v$, and the weight of an edge as the length of the label of this edge. Denote by edges(v) the sum of the weights of all edges outgoing from v.

See Figure 4 for the structure of edge lengths and node multiplicities in the cdawg of the example word.

Claim 1 Let $w = \mathrm{Sw}(\gamma_0, \gamma_1, \ldots, \gamma_n)$. Then:

$$|\mathrm{Subwords}(w)| = \sum_{v \in G} \mathrm{mult}(v) \cdot \mathrm{edges}(v) \qquad (3)$$

Figure 4: The structure of edge-lengths and multiplicities of nodes in the compacted subword graph of Sw(1, 2, 1, 3, 1). According to Fact 3 (and to the graph above) there are $|x_4| \cdot |x_3| + 2 \cdot |x_4| - 1 = 26 \cdot 7 + 2 \cdot 26 - 1 = 233$ subwords in our example word.

We partition the set of edges into chunks: the first chunk consists of the first $\gamma_0$ consecutive vertices starting from $v_0$, the second chunk contains the next $\gamma_1$ vertices, etc. The last chunk differs slightly.

The contribution of the k-th internal chunk to the sum in equation (3) is

$$\bigl(t_{k-1} + (\gamma_k - 1)\, t_k\bigr) \cdot (t_k + t_{k+1}) = t_{k+1}^2 - t_k^2,$$

where $t_{-1} = 1$ (see Figure 5 for details); the equality holds because $t_{k+1} = \gamma_k t_k + t_{k-1}$.

The contribution of the last chunk is (see Figure 6) $(t_{n-1} + 2)(t_n - t_{n-1}) + 2 t_{n-1}$. Altogether, the number of subwords is

$$\sum_{k=0}^{n-2} \bigl(t_{k+1}^2 - t_k^2\bigr) + (t_{n-1} + 2)(t_n - t_{n-1}) + 2 t_{n-1} = t_n \cdot t_{n-1} + 2 \cdot t_n - 1.$$

This completes the proof, since by definition $|x_k| = t_k$. □
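The formula of Fact 3 needs only the lengths $t_k$, so it can be evaluated in time linear in the length of the directive sequence. The sketch below is illustrative rather than taken from the paper; it compares the formula with a brute-force count on small words, assumes $\gamma_n = 1$ and $n \ge 1$, and uses the hypothetical names `count_subwords` and `count_by_formula`.

```python
def sw(gamma):
    """Sw(gamma), built explicitly from the recurrences."""
    prev, cur = "b", "a"
    for g in gamma:
        prev, cur = cur, cur * g + prev
    return cur


def count_subwords(w):
    """Number of distinct nonempty subwords of w, by brute-force enumeration."""
    return len({w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)})


def count_by_formula(gamma):
    """|x_n| * |x_{n-1}| + 2 * |x_n| - 1, computed from lengths only (Fact 3)."""
    if len(gamma) < 2 or gamma[-1] != 1:
        raise ValueError("this sketch assumes n >= 1 and gamma_n = 1")
    t = [1, 1]                                    # t_{-1}, t_0
    for g in gamma:
        t.append(g * t[-1] + t[-2])               # t_{k+1} = gamma_k * t_k + t_{k-1}
    # t[-1] = |x_{n+1}|, t[-2] = |x_n|, t[-3] = |x_{n-1}|
    return t[-2] * t[-3] + 2 * t[-2] - 1
```

For γ = (1, 2, 1, 3, 1) both functions return 233, the value quoted in the caption of Figure 4.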

The case $\gamma_n > 1$ reduces to the previous case.

Fact 4 Let $\gamma_n > 1$. Then:

$$|\mathrm{Subwords}(\mathrm{Sw}(\gamma_0, \gamma_1, \ldots, \gamma_n))| = |\mathrm{Subwords}(\mathrm{Sw}(\gamma_0, \gamma_1, \ldots, \gamma_n - 1, 1))|.$$

Proof.

The compacted subword graphs of $\mathrm{Sw}(\gamma_0, \gamma_1, \ldots, \gamma_n)$ and $\mathrm{Sw}(\gamma_0, \gamma_1, \ldots, \gamma_n - 1, 1)$ are isomorphic in the sense of graph structure (see Figure 3 for details). Hence we can use the result of Fact 3 to compute $|\mathrm{Subwords}(\mathrm{Sw}(\gamma_0, \gamma_1, \ldots, \gamma_n))|$. □

Figure 5: The k-th internal chunk $G_k$ of the subword graph consists of $\gamma_k$ nodes from u to v (excluding u), and their outgoing edges. The multiplicity (number of paths leading from $v_0$) of each node is written within the box corresponding to the node. The weights of the edges are the lengths of the corresponding words in the cdawg.

Figure 6: The final chunk $G_{n-1}$ of the subword graph consists of $\gamma_{n-1}$ nodes from u to v, and their outgoing edges.

4. The structure of occurrences of subwords

In this section we are interested in the structure of first occurrences of the subwords of a given length. One type of these subwords is particularly interesting – right special factors.

Right special factors

A right special factor of the word x is any word w such that both wa and wb are subwords of x. For each k > 0 there is at most one right special factor of length k of a given standard word. Moreover, every right special factor of a standard word is either a special prefix or a suffix of some special prefix.

Fact 5 Let w = Sw(γ) be a standard Sturmian word. Then:

(1) For a given k > 0 the right special factor of w of length k has a grammar-based representation of size $O(|\gamma|)$.

(2) The compressed representation of the right special factor of w of length k can be computed in $O(|\gamma|)$ time.

Proof.

Define the value of a path π in the cdawg of w as the word created by concatenating the labels of the edges of π.

Let v be a fork node in the cdawg of w (any node except the last two), let π be a path leading to v from some other node $v_1$, and let $z_\pi$ be the value of the path π. It is clear that $z_\pi$ is a subword of w.

The node v has two outgoing edges: one with a label starting with the letter a and the other with a label starting with the letter b. Consequently $z_\pi a$ and $z_\pi b$ are also subwords of w, and therefore $z_\pi$ is a right special factor of w.

Observe that the value of every path in the cdawg of w that ends in some fork node v is a suffix of the value of the longest path from the root to v. Moreover, the value of this longest path from the root to v is a prefix of w, hence it is a special prefix of w. This implies that every right special factor of w is a suffix of some special prefix of w.

Every right special factor of w is the concatenation of some basic subwords of w.

It follows easily from Lemma 2 that every right special factor of w has a grammar-based representation of size $O(|\gamma|)$, which can be computed in time linear with respect to the length of the directive sequence γ. □

Example 3.

Let w = Sw(1, 2, 1, 3, 1) = ababaabababaabababaabababaababaab.

Recall that:

$y_0 = a$, $y_1 = ba$, $y_2 = ababa$, $y_3 = baababa$.

The right special factors of w, listed with their lengths (special prefixes are marked with ∗):

 1   $y_0$ ∗
 2   $y_1$
 3   $y_0 y_1$ ∗
 4   $y_1^2$
 5   $y_0 y_1^2$ ∗
 6   $y_0 y_2$
 7   $y_1 y_2$
 8   $y_0 y_1 y_2$
 9   $y_1^2 y_2$
10   $y_0 y_1^2 y_2$ ∗
11   $y_1^2 y_3$
12   $y_2 y_3$
13   $y_0 y_2 y_3$
14   $y_1 y_2 y_3$
15   $y_0 y_1 y_2 y_3$
16   $y_1^2 y_2 y_3$
17   $y_0 y_1^2 y_2 y_3$ ∗
18   $y_1^2 y_3^2$
19   $y_2 y_3^2$
20   $y_0 y_2 y_3^2$
21   $y_1 y_2 y_3^2$
22   $y_0 y_1 y_2 y_3^2$
23   $y_1^2 y_2 y_3^2$
24   $y_0 y_1^2 y_2 y_3^2$ ∗
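The table can be cross-checked by brute force. The following sketch is an assumed illustration with hypothetical helper names; it enumerates the right special factors of each length directly from the definition, which is quadratic in |w| and therefore unrelated to the $O(|\gamma|)$ method of Fact 5.

```python
def sw(gamma):
    """Sw(gamma), built explicitly from the recurrences."""
    prev, cur = "b", "a"
    for g in gamma:
        prev, cur = cur, cur * g + prev
    return cur


def factors(w, k):
    """Set of all subwords of w of length k."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}


def right_special(w, k):
    """All right special factors of w of length k (u with both ua and ub in w)."""
    longer = factors(w, k + 1)
    return sorted(u for u in factors(w, k) if u + "a" in longer and u + "b" in longer)
```

For the example word there is exactly one right special factor for each length from 1 to 24, namely the suffix of that length of the longest special prefix (of length 24), and none for larger lengths.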

The structure of the dawg of $\mathrm{Sw}(\gamma_0, \gamma_1, \ldots, \gamma_n)$ implies the following fact.

Fact 6 If u is a subword of Sw(γ) then u has a unique decomposition into subwords

$$u = y_{i_1}\, y_{i_2} \cdots y_{i_k}\, \tilde{y}_{i_{k+1}},$$

where $i_1 \in \{0, 1\}$, $i_{k+1} \in \{i_k,\ i_k + 1,\ i_k + 2\}$ and $\tilde{y}_{i_{k+1}}$ is a prefix (possibly the whole word) of $y_{i_{k+1}}$.

Observation. Using the fact above it is easy to check in linear time whether u is a subword of Sw(γ), since the next factor of the decomposition is determined by the next scanned letter.

Final positions of the first occurrences of subwords

For words w and u define first-fin(u, w) as the position of the last symbol of u in the first occurrence of u in w. For k ≥ 1 define also the set

$$FIN(k, w) = \{\, \text{first-fin}(u, w) \;:\; u \text{ is a subword of } w \text{ of size } k \,\}.$$

For an example see Figure 7.

Figure 7: The subword graph of w and the structure of the sets FIN(k, w) for w = Sw(1, 2, 1, 3, 1).

Fact 7 Let $w = \mathrm{Sw}(\gamma_0, \gamma_1, \ldots, \gamma_n)$ be a standard Sturmian word. Then:

(1) The set FIN(k, w) consists of a single interval or two disjoint intervals.

(2) For a given k ≥ 1 we can compute the intervals representing FIN(k, w) in time linear with respect to the size of the directive sequence.

Proof.

The structure of the set FIN(k, w) follows easily from the way in which paths of length k − 1 in the dawg of w are extended into paths of length k. Only a fork node i ∈ FIN(k − 1, w) generates two elements of FIN(k, w); every other node i ∈ FIN(k − 1, w) generates a single element i + 1 in FIN(k, w) (see Figure 7).

It is clear that the set FIN(k + 1, w) results from FIN(k, w) by shifting each position by one to the right and adding an extra position for the fork node. Hence the thesis follows from the structure of the subword graphs of standard Sturmian words. □
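Point (1) of Fact 7 can be observed on the running example with a direct computation. The sketch below is illustrative only: it scans the expanded word, so it does not achieve the linear-time bound of point (2), and the names `fin` and `intervals` are hypothetical.

```python
def sw(gamma):
    """Sw(gamma), built explicitly from the recurrences."""
    prev, cur = "b", "a"
    for g in gamma:
        prev, cur = cur, cur * g + prev
    return cur


def fin(k, w):
    """FIN(k, w): 1-based end positions of the first occurrences of all length-k subwords."""
    seen, ends = set(), set()
    for i in range(len(w) - k + 1):
        u = w[i:i + k]
        if u not in seen:                         # first occurrence of u starts at i
            seen.add(u)
            ends.add(i + k)
    return ends


def intervals(positions):
    """Group a set of integers into maximal runs of consecutive values."""
    runs = []
    for p in sorted(positions):
        if runs and runs[-1][1] == p - 1:
            runs[-1][1] = p
        else:
            runs.append([p, p])
    return [tuple(r) for r in runs]
```

For w = Sw(1, 2, 1, 3, 1), `intervals(fin(2, w))` gives the two intervals (2, 3) and (6, 6), and `intervals(fin(10, w))` gives (10, 16) and (32, 33).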

5. Relation of subword graphs to the dual Ostrowski numeration system

The dual Fibonacci numeration system was introduced in [10], where its relation to the subword structure of Fibonacci words was investigated. We extend these results to Sturmian words. In this case we have the Ostrowski numeration system, which is a generalization of the Fibonacci system.

In this section (only) we consider infinite directive sequences.

For an infinite directive sequence $\gamma = (\gamma_0, \gamma_1, \ldots)$ we introduce the $[*]_\gamma$-numeration system: a version of Ostrowski's numeration system from [1], which is a generalization of the Fibonacci numeration system. Let us define the base sequence q as:

$$q = (q_0, q_1, \ldots) = (|x_0|, |x_1|, \ldots),$$

where the $x_i$ are as in equation (1).

The base sequence can be defined without any reference to the words $x_i$ as follows:

$$q_{-1} = q_0 = 1, \qquad q_{i+1} = q_i \cdot \gamma_i + q_{i-1} \ \text{ for } i \ge 0.$$

Example 4.

If γ = (1, 2, 1, 2, . . .), then the base sequence is q = (1, 2, 5, 7, 19, . . .).

If γ = (1, 2, 1, 1, 1, . . .), then the base sequence is q = (1, 2, 5, 7, 12, 19, . . .).

For a sequence of integers $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_n)$ define:

$$\mathrm{val}_\gamma(\alpha_0, \alpha_1, \ldots, \alpha_n) = \alpha_0 \cdot q_0 + \alpha_1 \cdot q_1 + \ldots + \alpha_n \cdot q_n.$$

For $0 \le i < |x_{n+1}|$ the representation of i in the Ostrowski numeration system is defined as:

$$[\,i\,]_\gamma = (\alpha_0, \alpha_1, \ldots, \alpha_n),$$

where we require:

(1) $\mathrm{val}_\gamma(\alpha_0, \alpha_1, \ldots, \alpha_n) = i$

(2) $\forall_{0 \le j < n}\ \ \alpha_j \le \gamma_j$

(3) $\alpha_{j+1} = \gamma_{j+1} \Rightarrow \alpha_j = 0$

In other words, in the representation of a number i, for each k we take at most $\gamma_k$ numbers $|x_k|$, and if we take exactly $\gamma_k$ numbers $|x_k|$ then we take zero numbers $|x_{k-1}|$.

Example 5.

Let γ = (1, 2, 1, 3, 1, . . .). Then

$$q = (|x_0|, |x_1|, \ldots) = (1, 2, 5, 7, 26, 33, \ldots)$$

We have $[16]_\gamma = (0, 1, 0, 2)$, because

16 = 0 · 1 + 1 · 2 + 0 · 5 + 2 · 7

We have $[29]_\gamma = (1, 1, 0, 0, 1)$, because

29 = 1 · 1 + 1 · 2 + 0 · 5 + 0 · 7 + 1 · 26
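The paper does not spell out how to compute $[\,i\,]_\gamma$. A natural candidate, assumed here rather than taken from the source, is the greedy procedure that takes as many copies of the largest base element as possible and continues with the remainder. The sketch below implements it for a finite prefix of the directive sequence; `base_sequence` and `ostrowski_digits` are hypothetical names.

```python
def base_sequence(gamma):
    """[q_0, q_1, ..., q_{n+1}] with q_{-1} = q_0 = 1 and q_{i+1} = q_i * gamma_i + q_{i-1}."""
    q = [1, 1]                                    # q_{-1}, q_0
    for g in gamma:
        q.append(q[-1] * g + q[-2])
    return q[1:]


def ostrowski_digits(i, gamma):
    """Greedy digits (alpha_0, ..., alpha_n) of i, for 0 <= i < q_{n+1}."""
    q = base_sequence(gamma)
    if not 0 <= i < q[-1]:
        raise ValueError("i must satisfy 0 <= i < |x_{n+1}|")
    digits = []
    for qk in reversed(q[:-1]):                   # q_n, q_{n-1}, ..., q_0
        digits.append(i // qk)                    # take as many copies of q_k as fit
        i %= qk
    digits.reverse()
    return digits
```

With γ = (1, 2, 1, 3) the call `ostrowski_digits(16, [1, 2, 1, 3])` returns [0, 1, 0, 2], and with γ = (1, 2, 1, 3, 1) the number 29 yields [1, 1, 0, 0, 1], matching Example 5. The greedy choice respects condition (3) because taking $\gamma_k$ copies of $q_k$ leaves a remainder smaller than $q_{k-1}$.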

For $0 \le i < |x_{n+1}| + |x_n| - 1$ we define the representation of i in the dual Ostrowski numeration system as:

$$\widehat{[\,i\,]}_\gamma = (\alpha_0, \alpha_1, \ldots, \alpha_n),$$

where we require:

(1) $\mathrm{val}_\gamma(\alpha_0, \alpha_1, \ldots, \alpha_n) = i$

(2) $\forall_{0 \le j < n}\ \ \alpha_j \le \gamma_j$

(3) $\bigl(\alpha_j < \gamma_j \ \text{and}\ \exists_{i > j}\ \alpha_i > 0\bigr) \Rightarrow \alpha_{j+1} > 0$

In other words, in the representation of a number i in the numeration system defined above, for each k we take at most $\gamma_k$ numbers $|x_k|$, and if we take $\alpha_k < \gamma_k$ numbers $|x_k|$ and $\alpha_k$ is not the last component of this representation, then we must take at least one number $|x_{k+1}|$.

Example 6.

Let γ = (1, 2, 1, 3, 1, . . .). Then

$$q = (|x_0|, |x_1|, \ldots) = (1, 2, 5, 7, 26, 33, \ldots)$$

We have $\widehat{[16]}_\gamma = (0, 2, 1, 1)$, because

16 = 0 · 1 + 2 · 2 + 1 · 5 + 1 · 7

We have $\widehat{[29]}_\gamma = (1, 1, 1, 3)$, because

29 = 1 · 1 + 1 · 2 + 1 · 5 + 3 · 7
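The three requirements of the dual system are easy to test for a candidate digit sequence. The checker below is a sketch under one stated assumption: following the "in other words" description, the bound $\alpha_k \le \gamma_k$ is applied to every digit, including the last one. The function name `is_dual_ostrowski` is hypothetical.

```python
def is_dual_ostrowski(alpha, gamma, i):
    """Check conditions (1)-(3) of the dual Ostrowski representation of i."""
    q = [1, 1]                                    # q_{-1}, q_0
    for g in gamma:
        q.append(q[-1] * g + q[-2])
    q = q[1:]                                     # q_0, q_1, ...
    if len(alpha) > len(gamma):
        return False
    # (1) the digits must evaluate to i
    if sum(a * qk for a, qk in zip(alpha, q)) != i:
        return False
    # (2) at most gamma_k copies of q_k (assumed to hold for every digit)
    if any(a < 0 or a > g for a, g in zip(alpha, gamma)):
        return False
    # (3) a non-full digit followed by any nonzero digit needs a nonzero successor
    for j in range(len(alpha) - 1):
        later_nonzero = any(a > 0 for a in alpha[j + 1:])
        if alpha[j] < gamma[j] and later_nonzero and alpha[j + 1] == 0:
            return False
    return True
```

On γ = (1, 2, 1, 3, 1) the checker accepts (0, 2, 1, 1) for 16 and (1, 1, 1, 3) for 29, as in Example 6, and rejects the ordinary Ostrowski digits (0, 1, 0, 2) of 16, which violate condition (3) at j = 1.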

Uniqueness of the representation in the Ostrowski numeration system has been proved in [1]. Uniqueness of the representation in the dual Ostrowski numeration system has been proved in [8].

Let $G_\infty$ be the infinite compacted subword graph corresponding to a given directive sequence $\gamma = (\gamma_0, \gamma_1, \ldots)$. Let π be a path from the root to another node of $G_\infty$ and let $\mathrm{rep}(\pi) = (h_0, h_1, \ldots)$, where $h_i$ is the number of edges of weight $q_i$ on the path π.

The following fact is an interpretation of the corresponding result in [8] in terms of the dual Ostrowski numeration system.

Figure 8: The illustration of point (1) of Fact 8. In this case the representation of the length of the path π in the dual Ostrowski numeration system is given by rep(π) = (1, 4, 3, 2) and $|\pi| = 1 \cdot q_0 + 4 \cdot q_1 + 3 \cdot q_2 + 2 \cdot q_3$.

Fact 8

(1) rep(π) is the representation of the length of the path π in the dual Ostrowski numeration system corresponding to the directive sequence of $G_\infty$.

(2) For each k > 1 there is exactly one fork-path of length k in $G_\infty$.

Proof.

Point (1). Let π be a path from the root to some node v in $G_\infty$, the infinite compacted subword graph corresponding to a directive sequence $\gamma = (\gamma_0, \gamma_1, \gamma_2, \ldots)$, and let $\mathrm{rep}(\pi) = (h_0, h_1, \ldots)$ be defined as above. It is sufficient to check that all requirements of the definition of the dual Ostrowski numeration system are satisfied.

The construction of the path π implies that

$$|\pi| = h_0 \cdot q_0 + h_1 \cdot q_1 + h_2 \cdot q_2 + \cdots \quad \text{and} \quad 0 \le h_i \le \gamma_i \ \text{ for all } i.$$

Moreover, from the structure of $G_\infty$ (see Figure 8) it is obvious that if for some i we have $h_i < \gamma_i$ (we have taken $q_i$ fewer than $\gamma_i$ times) and $h_i$ is not the last nonzero element of rep(π), then $h_{i+1} > 0$ (we must take at least one $q_{i+1}$ to continue the construction of the path π). This concludes the proof of point (1).

Point (2). The thesis follows directly from point (1) and the uniqueness of the representation in the dual Ostrowski numeration system. □

6. S-automata

The S-language and the S-automaton are related to the dual Ostrowski numeration system discussed in the previous section, but can also be defined without any reference to it. These objects are defined for the first time in this paper.

Define $|w|_a$ as the number of occurrences of the letter a in the word w.

For $\gamma = (\gamma_0, \gamma_1, \ldots, \gamma_n)$ we define the S-language L = S-lan(γ) as follows:

• if $\gamma_n = 1$ then L is the set of all subsequences u of the word $a_0^{\gamma_0}\, a_1^{\gamma_1} \cdots a_n^{\gamma_n}$ which satisfy:

– $|u|_{a_n} = 1$;

– $\forall_{0 < i < n}\ \ |u|_{a_i} < \gamma_i \Rightarrow |u|_{a_{i+1}} > 0$

• if $\gamma_n > 1$ then L = S-lan$(\gamma_0, \gamma_1, \ldots, \gamma_{n-1}, \gamma_n - 1, 1)$

We define the S-automaton, denoted by S-aut(γ), as the minimal deterministic automaton accepting S-lan(γ), excluding the “dead” state, where the “dead” state is the nonaccepting state which loops to itself (each transition from this state goes to itself). The missing edges in the graph of the automaton are assumed to go to the “dead” state.

Figure 9: The S-automaton (the minimal deterministic automaton, without the dead state) S-aut(1, 2, 1, 3, 1). The only accepting state is the sink node.

The following fact is a direct implication of Fact 1.

Fact 9 The minimal S-automaton for a directive sequence γ, without the dead state, is isomorphic as a graph to the compacted directed acyclic subword graph of Sw(γ).

A prefix u of a word w is called maximal if u is not a proper prefix of another prefix of w.

Define the following morphism $h_\gamma$.

• If $\gamma_n = 1$ then $h_\gamma(a_i) = y_i$ for $0 \le i < n$, and $h_\gamma(a_n) = \hat{y}_n$.

• Otherwise $h_\gamma(a_i) = y_i$ for $0 \le i \le n$, and $h_\gamma(a_{n+1}) = \hat{y}_{n+1}$.

The morphic image of a language is meant in the usual sense; the morphic image of an automaton results from changing the label of each edge using the morphism.

The following results are implied by Facts 1-2.

Fact 10 The set of maximal prefixes of Word(γ) equals $h_\gamma$(S-lan(γ)) (it is the morphic image of the S-language for γ).

Fact 11 The compacted subword graph of the word Sw(γ) is the image of the S-automaton S-aut(γ) under the morphism $h_\gamma$.

7. Critical factorization and maximal suffixes

The minimal local period in a word w at the position k is a positive integer p such that w[i − p] = w[i] for every k ≤ i < k + p, whenever w[i] and w[i − p] are defined.

The critical factorization point in a word w is the position k in w for which the minimal local period at k equals the (global) minimal period of w. We refer the reader to [5] for the more detailed definition of the critical factorization point.

The following nontrivial fact has been shown by Crochemore and Perrin in [6].

Fact 12 The critical factorization point of w is given as the starting position of a lexicographically maximal suffix, maximized over two reversed orders of the alphabet.

Example 7.

Let w = Sw(1, 2, 1, 3, 1) = ababaabababaabababaabababaababaab.

The minimal local periods of w are as follows:

i      1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ···
w[i]   a b a b a a b a b a b a a b a b a ···
p(i)   1 2 2 2 5 1 7 2 2 2 2 7 1 7 2 2 2 2 ···

i      ··· 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
w[i]   ··· b a a b a b a b a a b a b a a b
p(i)   ··· 2 7 1 7 2 2 2 4 33 1 5 2 2 5 1 3 1

where i denotes the position in the word w and p(i) the minimal local period at the position i in w. Hence the critical factorization point is at the position i = 25.

For a word w define $\pi_a(w)$ as the path in the dawg of w which starts in the root, ends in the sink, and in which we use the letter a whenever we have a choice (in every fork node). Similarly define $\pi_b(w)$. The path $\pi_a(w)$ ($\pi_b(w)$ respectively) can also be defined for the cdawg of w: in every fork node we choose the edge with the label starting with the letter a (letter b respectively). The length of the path, denoted |π|, is defined as the length of the word given by π.

It is easily seen that the lexicographically maximal suffix of w with respect to the letter ordering “a < b” is given by $\pi_b(w)$, and the lexicographically maximal suffix of w with respect to the letter ordering “a > b” is given by $\pi_a(w)$.

Lemma 3 Let $w = \mathrm{Sw}(\gamma_0, \gamma_1, \ldots, \gamma_n)$ be a standard Sturmian word and let $\pi_a(w)$, $\pi_b(w)$ be defined as above. Then:

$$\pi_a(w) = y_0^{\gamma_0}\, y_2^{\gamma_2} \cdots y_{2k}^{\gamma_{2k}} \cdot \hat{y}_n$$

$$\pi_b(w) = y_1^{\gamma_1}\, y_3^{\gamma_3} \cdots y_{2l+1}^{\gamma_{2l+1}} \cdot \hat{y}_n$$

where $k = \lfloor \frac{n-1}{2} \rfloor$ and $l = \lfloor \frac{n-2}{2} \rfloor$.

Proof.

It follows from the definition of basic subwords that $y_i$ starts with the letter a for even i and with the letter b for odd i.

We construct the path $\pi_a(w)$ in the cdawg of w by choosing the edge with the label starting with the letter a whenever it is possible. From the structure of the cdawgs of standard Sturmian words (see Figure 3) we have that every fork node has two outgoing edges: one with the label $y_{2i}$ (starting with the letter a) and the second one with the label $y_{2i+1}$ (starting with the letter b).

To construct $\pi_a(w)$ we have to choose $\gamma_0$ times the edge with the label $y_0$, then $\gamma_2$ times the edge with the label $y_2$, and so on up to $y_{2k}$, where $k = \lfloor \frac{n-1}{2} \rfloor$. Finally, by Lemma 2 and Fact 2, it suffices to add $\hat{y}_n$, the last two letters of w.

The same proof works for the path $\pi_b(w)$. □

Construction of the paths π a (w) and π b (w) implies the following fact.

Fact 13 Let $w = \mathrm{Sw}(\gamma_0, \gamma_1, \ldots, \gamma_n)$ be a standard Sturmian word. Then:

(1) The critical factorization point of w is at the position $k = |w| - \min\{|\pi_a(w)|, |\pi_b(w)|\}$.

(2) The critical factorization point of w can be computed in linear time with respect to the size of the directive sequence.

Proof.

The proof is immediate by Fact 12, recalling that $\pi_a(w)$ and $\pi_b(w)$ correspond to the lexicographically maximal suffixes of w with respect to the letter orderings “a > b” and “a < b” respectively. □

Example 8.

Let w = Sw(1, 2, 1, 3, 1) = ababaabababaabababaabababaababaab.

See Figure 1 for its subword graph structure.

We have

$$\pi_a(w) = y_0\, y_2 \cdot ab = a \cdot ababa \cdot ab$$

$$\pi_b(w) = y_1^2\, y_3^3 \cdot ab = ba \cdot ba \cdot baababa \cdot baababa \cdot baababa \cdot ab$$

Hence the position

$$i = |w| - |y_0\, y_2\, ab| = 33 - 8 = 25$$

is the critical factorization point of w.
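The computation of Example 8 can be reproduced in two ways: by brute force from Fact 12, and from lengths only in the spirit of Fact 13(2). The sketch below is an illustration with hypothetical function names. The length-only variant assumes $\gamma_n = 1$, as in the running example, and uses the fact that both paths end with a two-letter label; the "position" is the number of letters before the cut.

```python
def sw(gamma):
    """Sw(gamma), built explicitly from the recurrences."""
    prev, cur = "b", "a"
    for g in gamma:
        prev, cur = cur, cur * g + prev
    return cur


def max_suffix(w, order):
    """Lexicographically maximal suffix of w; `order` lists the letters, smallest first."""
    rank = {c: r for r, c in enumerate(order)}
    return max((w[i:] for i in range(len(w))), key=lambda s: [rank[c] for c in s])


def critical_point_bruteforce(w):
    """|w| minus the length of the shorter of the two maximal suffixes (Fact 12)."""
    shortest = min(max_suffix(w, "ab"), max_suffix(w, "ba"), key=len)
    return len(w) - len(shortest)


def critical_point_linear(gamma):
    """|w| - min(|pi_a|, |pi_b|) from lengths only; this sketch assumes gamma[-1] == 1."""
    q = [1, 1]                                    # q_{-1}, q_0
    for g in gamma:
        q.append(q[-1] * g + q[-2])
    q = q[1:]                                     # q[k] = |x_k| = |y_k|
    n = len(gamma) - 1
    len_a = sum(gamma[i] * q[i] for i in range(0, n, 2)) + 2   # even-indexed y_i, then 2 letters
    len_b = sum(gamma[i] * q[i] for i in range(1, n, 2)) + 2   # odd-indexed y_i, then 2 letters
    return q[n + 1] - min(len_a, len_b)
```

For γ = (1, 2, 1, 3, 1) the two path lengths are 8 and 27, and both functions return 25.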

Similar computations were given in [7, 10] for Fibonacci words. The paths $\pi_a(w)$ and $\pi_b(w)$ have a regular structure; consequently the words represented by them are well compressible. This implies the following fact.

Fact 14 Let w = Sw(γ) be a standard Sturmian word. Then:

(1) The lexicographically maximal suffix of w has a grammar-based representation of size $O(|\gamma|)$.

(2) The compressed representation of the lexicographically maximal suffix of w can be computed in $O(|\gamma|)$ time.

Proof.

The lexicographically maximal suffix of a standard Sturmian word w is given either by the path $\pi_a(w)$ or by the path $\pi_b(w)$ (depending on which letter ordering was chosen). The thesis follows directly from the structure of $\pi_a(w)$, $\pi_b(w)$ and the subword graph of w (see Lemma 3). □

References

1. J. Allouche and J. Shallit, “Automatic Sequences: Theory, Applications, Generalizations”, Cambridge University Press, 2003

2. P. Baturo and W. Rytter, “Occurrence and lexicographic properties of standard Sturmian words”, LATA, 2007

3. P. Baturo, M. Piatkowski and W. Rytter, “The number of runs in Sturmian words”, CIAA, 2008

4. J. Berstel and P. Seebold, “Sturmian words”, in: M. Lothaire, “Algebraic combinatorics on words” (Chapter 2), vol. 90 of Encyclopedia of Mathematics and its Applications, Cambridge University Press, 2002

5. M. Crochemore and W. Rytter, “Jewels of stringology: text algorithms”, World Scientific, 2003

6. M. Crochemore and D. Perrin, “Two-Way String Matching”, J. ACM 38(3): 651-675, 1991

7. T. Harju and D. Nowotka, “On the density of critical factorizations”, ITA 36(3): 315-327, 2002

8. F. Mignosi, J. Shallit and I. Venturini, “Sturmian Graphs and a Conjecture of Moser”, Lecture Notes in Computer Science 3340, 175-187, 2004

9. W. Rytter, “The number of runs in a string”, Information and Computation Volume 205, Issue 9, 1459-1469, 2007

10. W. Rytter, “The structure of subword graphs and suffix trees of Fibonacci words”, Theoretical Computer Science Volume 363, Issue 2, 211-223, 2006
