Polynomial Distinguishability of Timed Automata (extended abstract)

Sicco Verwer, Mathijs de Weerdt, Cees Witteveen

Delft University of Technology, P.O. Box 5031, 2600 GA, Delft, the Netherlands

1 Introduction

We are interested in identifying (learning) a system model for a real-time process. Timed automata (TAs) [1] are insightful models that can be used to model and reason about many real-time systems, such as network protocols, business processes, and reactive systems. TAs are finite state models that model time explicitly, i.e. using numbers. In practice, it can be very difficult to construct a TA by hand. That is why we are interested in automatically identifying TAs from data. Such data can be obtained from sensors observing the process to be modeled. This results in a time series of system states or a time-stamped event sequence: at every time step, the events occurring in the system are measured and recorded. From such timed data, we could have opted to identify an untimed model that models time implicitly, i.e. using states instead of numbers. Examples of such models are the deterministic finite state automaton (DFA) and the hidden Markov model (HMM). The reason for modeling time explicitly is that modeling time implicitly results in an exponential blow-up of the model size: numbers use a binary representation of time, while states use a unary representation of time. Thus, an algorithm that identifies a timed system using an untimed model is inefficient even if it is efficient in the size of the untimed model, since it requires time and space exponential in the size of the timed model. Naturally, we would like our identification method to be efficient.
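The blow-up can be illustrated with a small sketch. It assumes a hypothetical toy requirement that is not taken from the paper: an event b must occur after at most c clock ticks t. A timed model can store the bound c as a number, whereas an untimed DFA has to count ticks with one state per tick. The function names and the encoding of ticks as the symbol "t" are illustrative assumptions.

```python
def build_counting_dfa(c):
    """Build a DFA over {'t', 'b'} accepting t^n b with n <= c.

    Toy deadline requirement (an assumption for illustration):
    'b' must arrive after at most c ticks 't'.
    """
    accept, sink = "accept", "sink"
    delta = {}
    for i in range(c + 1):
        # State i means "i ticks have elapsed so far".
        delta[(i, "t")] = i + 1 if i < c else sink
        delta[(i, "b")] = accept
    for s in (accept, sink):
        # Anything after 'b', or after a missed deadline, is rejected.
        delta[(s, "t")] = sink
        delta[(s, "b")] = sink
    states = {s for (s, _) in delta}
    return delta, states


def run(delta, word):
    """Return True if the DFA accepts the word, starting in state 0."""
    state = 0
    for symbol in word:
        state = delta[(state, symbol)]
    return state == "accept"


def bits_for_bound(c):
    """Number of bits needed to write the bound c in binary."""
    return max(1, c.bit_length())
```

For c = 1000, the counting DFA built here has 1003 states, while the bound itself fits in 10 bits. Doubling c adds one bit to the number but roughly doubles the number of states.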

In this paper, we study the complexity of identifying (learning) TAs from data. We prove several theorems that set a bound on which types of TAs can, and which cannot, be identified efficiently from data. More specifically, we show that TAs can be identified efficiently only if they contain at most one timed component, known as a clock. Clocks are the time measuring objects in TAs. They can be thought of as stopwatches that can be reset by the state transitions of a TA. Boolean constraints on the values of these clocks are used to constrain the possible executions of a TA. To the best of our knowledge, ours are the first results regarding the efficient identifiability (learnability) of TAs.

This paper is split into two parts. In the first part, we explain known results regarding the efficient identifiability of DFAs. We use these results to argue which types of TAs could be identified efficiently, and how to identify them. Then we describe our results regarding the efficient identifiability of TAs.

Efficient identification in the limit

We would like to have an efficient identification process for TAs. This is difficult because the identification problem for DFAs is NP-complete [4]. This property easily generalizes to the problem of identifying a TA (by setting all time values to 0). Thus, unless P = NP, a TA cannot be identified efficiently. Even more troublesome is the fact that the DFA identification problem cannot even be approximated within any polynomial [6]. Hence (since this also generalizes), the TA identification problem is also inapproximable.

These two facts make the prospects of finding an efficient identification process for TAs look very bleak. However, both of these results rely on there being a fixed input for the identification problem (encoding a hard problem). While in normal decision problems this is very natural, in an identification problem the amount of input data is somewhat arbitrary: more data can be sampled if necessary. Therefore, it makes sense to study the behavior of an identification process when it is given more and more data (no longer encoding the hard problem). The framework that studies this behavior is called identification in the limit [3]. This framework can be summarized as follows. Let C be a class of languages (for example the regular languages, modeled by DFAs). When given an increasing number of examples from some language L ∈ C, a limit identification algorithm for C should at some point converge to L. If there exists such an algorithm A, C is said to be identifiable in the limit. If a number of examples polynomial in the size of the smallest model for L is sufficient for convergence of A, C is said to be identifiable in the limit from polynomial data. If A requires time polynomial in the size of the examples, C is said to be identifiable in the limit in polynomial time. If both these statements hold, then C is identifiable in the limit from polynomial time and data, i.e. efficiently identifiable in the limit.
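The notion of convergence can be illustrated on a deliberately tiny language class. The sketch below is not the algorithm of [3] or [5]. It assumes a hypothetical class C = {L_k : k ≥ 1}, where L_k contains the strings a^n with n divisible by k, and a learner that returns the first hypothesis consistent with all labeled examples seen so far (identification by enumeration). The names `accepts`, `learner`, and `max_k` are illustrative.

```python
from itertools import count


def accepts(k, n):
    """Toy language L_k (an assumption): a^n is in L_k iff k divides n."""
    return n % k == 0


def learner(sample, max_k=10_000):
    """Return the smallest k consistent with every labeled example.

    `sample` is a list of (n, label) pairs, where label says whether
    a^n belongs to the target language. `max_k` only guards against
    inconsistent input; it is not part of the framework.
    """
    for k in count(1):
        if k > max_k:
            return None
        if all(accepts(k, n) == label for n, label in sample):
            return k


# Examples drawn from the target language L_6, presented one at a time.
stream = [(n, n % 6 == 0) for n in range(10)]
guesses = [learner(stream[:i]) for i in range(1, len(stream) + 1)]
```

On this stream the guesses are 1, 2, 3, 4, 5, 6 and then remain 6: the learner has converged to the target, and no later example from L_6 can change its hypothesis.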

DFAs have been shown to be efficiently identifiable in the limit [5]. Also, it has been shown that non-deterministic finite automata (NFAs) are not efficiently identifiable in the limit [2]. This again generalizes to the problem of identifying a non-deterministic TA. Therefore, we only consider the identification problem for deterministic timed automata (DTAs).

Polynomial distinguishability of timed automata

Our goal is to determine exactly when and how DTAs can be identified efficiently in the limit. In this paper, we set a bound on which types of DTAs can, and which cannot, be identified efficiently in the limit. Our results are based on a property we call polynomial distinguishability. We call a class of automata C polynomially distinguishable if there exists a polynomial function p such that for any two automata A, A′ ∈ C with L(A) ≠ L(A′), there exists a string τ ∈ L(A) △ L(A′) such that |τ| ≤ p(|A| + |A′|), where L(A) is the language of A. We use this property to show the following:

• Polynomial distinguishability is a necessary requirement for efficient identifiability in the limit.

• DTAs with at least two clocks are not polynomially distinguishable.

• DTAs with one clock are polynomially distinguishable.
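The definition of polynomial distinguishability can be made concrete in the simpler untimed setting. The following sketch is an illustration for ordinary DFAs only, not the construction used for DTAs. It searches the product of two complete DFAs breadth-first and returns a shortest string in the symmetric difference L(A) △ L(A′), i.e. a string accepted by exactly one of the two automata. Because the product has at most |A| · |A′| states, such a string is short whenever one exists. The DFA encoding as a (start, accepting, delta) triple and the two example automata are assumptions made for this sketch.

```python
from collections import deque


def shortest_distinguishing_string(dfa_a, dfa_b, alphabet):
    """Return a shortest string accepted by exactly one DFA, or None.

    Each DFA is a triple (start, accepting_set, delta), where
    delta[(state, symbol)] is defined for every state and symbol.
    """
    (start_a, acc_a, delta_a) = dfa_a
    (start_b, acc_b, delta_b) = dfa_b
    start = (start_a, start_b)
    queue = deque([(start, "")])
    seen = {start}
    while queue:
        (p, q), word = queue.popleft()
        # The word distinguishes the DFAs if exactly one accepts it.
        if (p in acc_a) != (q in acc_b):
            return word
        for symbol in alphabet:
            nxt = (delta_a[(p, symbol)], delta_b[(q, symbol)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, word + symbol))
    return None  # The two languages are equal.


# Hypothetical examples: number of a's divisible by 2, and by 4.
even_as = (0, {0}, {(i, "a"): (i + 1) % 2 for i in range(2)})
mult4_as = (0, {0}, {(i, "a"): (i + 1) % 4 for i in range(4)})
witness = shortest_distinguishing_string(even_as, mult4_as, "a")
```

Here `witness` is "aa", which the first automaton accepts and the second rejects. Its length is well below |A| + |A′| = 6 when size is measured in states.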

These efficiency results have important consequences for anyone interested in identifying timed systems (and TAs in particular). Most importantly, they tell us that DTAs with one clock (1-DTAs) seem to be a good model for identifying a timed system from data. Furthermore, they show that anyone who needs to identify a DTA with two or more clocks should either accept that an exponential amount of data is sometimes required, or find some other method to deal with this problem. This also holds for learning frameworks other than identification in the limit. The polynomial distinguishability result for 1-DTAs is necessary for the next step required to reach our goal, which is to write an algorithm that identifies 1-DTAs efficiently in the limit.

References

[1] Rajeev Alur and David L. Dill. A theory of timed automata. Theoretical Computer Science, 126:183–235, 1994.

[2] Colin de la Higuera. Characteristic sets for polynomial grammatical inference. Machine Learning, 27, 1997.

[3] E. Mark Gold. Language identification in the limit. Information and Control, 10(5):447–474, 1967.

[4] E. Mark Gold. Complexity of automaton identification from given data. Information and Control, 37(3):302–320, 1978.

[5] J. Oncina and P. Garcia. Inferring regular languages in polynomial update time. In Pattern Recognition and Image Analysis, volume 1 of Series in Machine Perception and Artificial Intelligence, pages 49–61. World Scientific, 1992.

[6] Leonard Pitt and Manfred K. Warmuth. The minimum consistent DFA problem cannot be approximated within any polynomial. Journal of the ACM, 40(1):95–142, 1993.