Zamojski W., Caban D. Modelling of a maintenance policy system for computer networks.


MODELLING OF A MAINTENANCE POLICY SYSTEM FOR COMPUTER NETWORKS

Zamojski W., Caban D.

Institute of Computer Engineering, Control and Robotics, Wroclaw University of Technology, Wybrzeze Wyspianskiego 27, PL 50-370 Wroclaw, Poland

Abstract: Computer systems and networks are considered as a union of all their resources (hardware, software and personnel) essential for the realization of predicted tasks. The systems operate in a real, often hostile environment, which may be the source of such threats as attacks, viruses and human faults. A model of a computer network is proposed on the basis of the information balance at nodes and in the entire network. It is used to evaluate the maintenance policy of the network.

1. Introduction

Dependability is discussed with respect to the occurrence of incidents that may damage the system resources (hardware, software or information) and, in consequence, the executed processes.

System incidents may be classified as unintentional damage generated by faults of the hardware, software or people, and as intentional events causing harm to the information resources and system processes. Very often an incident is the result of a broadcast attack that is not addressed to a specific entity (computer, network) but to anonymous entities; this kind of attack is called a virus. An incident may be "insignificant" if its consequences are easily removed from the system. Sometimes an incident has a more serious impact on the system behaviour and may escalate to a security breach, a crisis or a catastrophe. The maintenance policy is based on two main concepts: detection of unfriendly events and system responses to them. Detection mechanisms should detect incidents from a combination of seemingly unrelated events or from abnormal behaviour of the system. Response provides a framework of counter-measures for reacting quickly and appropriately to detected incidents. In general the system responses incorporate the following procedures:

• isolation of damaged resources (hardware, software; generally system processes) in order to limit proliferation of incident consequences,

• renewal of damaged processes and resources.

2. System dependability

Dependability of the system is its ability to execute the required functions (tasks, jobs) correctly during the anticipated time, under assumed working conditions that encompass security attacks (more generally, incidents that provoke damage in the system), faults (unintentional events that generate incorrect system states) and failures (when a component no longer meets its specification). Several fault-management methods have been developed to improve the resilience of the system. They include fault forecasting, fault prevention, fault removal (detecting and removing faults before they cause an error), fault tolerance (providing a service that implements the system function despite the fault) and fault treatment (preventing faults from occurring repeatedly) [1].

Generally, the dependability of the system is the ability to deliver service that can be justifiably trusted [2]. Sometimes system dependability is considered as system survivability, that is, the capability to execute all system functions correctly in the presence of possible threats (outside or inside attacks) and of system internal events (hardware failures, software and human faults). As it is impossible to prevent all incidents and attacks on a system, it is essential to react quickly (and consistently) in order to stop the proliferation of their consequences within the system. Even in the case of serious incidents or attacks, the system must be capable of providing some essential services.

3. System incidents and maintenance

An incident is an unintended system event that might disrupt the system performance: a deliberate attack from the system environment or from within the system, or an internal system event (unfriendly but unintentional) such as a hardware failure or a software or human fault.

The system dependability is described by such attributes as availability (readiness for correct service), reliability (continuity of correct service), safety (absence of catastrophic consequences to the users and environment), security (concurrent availability of the system to authorized users and only to them), confidentiality (absence of unauthorized disclosure of information), integrity (absence of improper system state alterations) and maintainability (ability to undergo repairs and modifications) [1].

Modern computer systems are equipped with suitable measures to minimize the negative effects of such incidents (a check-diagnostic complex, fault recovery, information renewal, time and hardware redundancy, reconfiguration or graceful degradation, etc.).


Hardware failures occur very rarely (one or two per year or less), but soft faults (malfunctions) resulting from software faults, human mistakes, viruses and attacks are very frequent (even a few times per twenty-four hours) [5]. Many of them have a devastating impact on the whole system and on the tasks it executes. If a malfunction occurs, adequate procedures of information renewal and system recovery (restarts) have to be activated [3]. The duration of the system restart (recovery) has a decisive influence on the dependability measures of modern computer systems and networks [4].

4. An approach to network modeling

The functional-reliability model of a computer network has to consider the specific features of the network: nodes and communication channels, the ability to change network traffic dynamically (routing) and reconfiguration of the network.

Any node of the network is considered as a point to which a few terminals are connected (small grey circles in Fig. 1.a), together with several channels connecting it with other nodes of the network. The node is considered as a site (Fig. 1.b) in which some information is generated and transferred to the network, and a part of the information received from the network is consumed, which means that it is processed, stored or sent to terminals.

Fig. 1 Transformation of i-th node (panels a, b and c; labels: channel volumes v, generated information G, processed and stored information PS, balance V)


The information balance of the i-th node is defined as the sum of received and generated information less the transferred, processed and stored information:

$$V_i = \sum_{j} v_{j,i} + G_i - \sum_{j} v_{i,j} - PS_i \tag{1}$$

where:

$V_i$ - balance of information of the i-th node,

$v_{i,j}$ - volume of information transferred between nodes i and j (sent from node i to node j),

$G_i$ - volume of information generated by the i-th node and sent into the network,

$PS_i$ - volume of information processed (and stored) by the i-th node.

On the basis of equation (1) the third model of the node is built (Fig. 1.c).

If $V_i = 0$, the working conditions of the node and the bandwidth of its channels are perfectly balanced, as everything is adjusted to the demand of all terminal users and to the needs of the network.

• $V_i > 0$ - the node is flooded by information, that is, it takes up more information from the network than it can use (consume) and/or send into the network through channels of limited bandwidth. The node may also generate too much information to send into the network. The problem may be solved by equipping the node with adequately large storage, provided that the situation is only temporary.

• $V_i < 0$ - the sum of the information received from the network and generated by the node is less than the potential capabilities for sending, processing and storing information. If the situation is temporary, it may be used to release information stored during node overload.
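The balance of equation (1) and the three cases above can be illustrated with a minimal Python sketch. The function names, the tolerance `tol` and the numerical volumes are assumptions made for illustration only; they do not come from the paper.

```python
def node_balance(inflows, generated, outflows, processed_stored):
    """Equation (1): V_i = sum_j v_{j,i} + G_i - sum_j v_{i,j} - PS_i."""
    return sum(inflows) + generated - sum(outflows) - processed_stored


def classify(balance, tol=1e-9):
    """Map a balance to one of the three cases described in the text.

    The tolerance is an illustrative assumption for floating-point input.
    """
    if abs(balance) <= tol:
        return "balanced"
    return "flooded" if balance > 0 else "underloaded"


# Illustrative volumes in arbitrary units (not taken from the paper).
v_i = node_balance(
    inflows=[40.0, 25.0],      # v_{j,i}: received from two neighbours
    generated=10.0,            # G_i
    outflows=[30.0, 20.0],     # v_{i,j}: sent to two neighbours
    processed_stored=15.0,     # PS_i
)
print(v_i, classify(v_i))
```

With these numbers the node receives and generates more than it can send and consume, so it falls into the flooded case.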


Fig. 2 Model of a network

The network is described by the vector of the information balances of all nodes

$$\mathbf{V} = [V_i;\ i = 1, 2, \ldots, N] \tag{2}$$

which may be used in the analysis of the performance and dependability properties of the system. The influence of a node failure, or of an attack on a node or a small part of the network (Fig. 2), is shown below.

A failure. Let us consider the influence of a failure of the i-th node on the information balances of nodes i and i+2.

For a correctly working network the balances are evaluated as

$$V_i = v_{i-1,i} + G_i - v_{i,i+1} - v_{i,i+2} - PS_i \tag{3}$$

and

$$V_{i+2} = v_{i-1,i+2} + v_{i,i+2} + G_{i+2} - v_{i+2,i+1} - v_{i+2,i+3} - PS_{i+2} \tag{4}$$

The consequences of a failure of the i-th node may be estimated as

$$V_i = 0 \tag{5}$$

because the node has broken down and does not generate or process information. It also interrupts the communication channels to the neighbouring nodes.

A more complicated analysis is needed for node i+2. As the channel between nodes i-1 and i is closed, node i-1 dispatches a part of the information through the channel i-1,i+2. The bandwidth of this channel is limited to

$$v'_{i-1,i+2} = v_{i-1,i+2} + k_{i-1,i;\,i-1,i+2}\, v_{i-1,i} \tag{6}$$

In consequence, the volumes of information sent through the subsequent channels are reduced too (Fig. 2b). The balance of node i+2 is estimated as

$$V_{i+2} = v_{i-1,i+2} + k_{i-1,i;\,i-1,i+2}\, v_{i-1,i} + G_{i+2} - v_{i+2,i+1} - v_{i+2,i+3} - PS_{i+2} \tag{7}$$

where $k_{i-1,i;\,i-1,i+2}$ is a routing coefficient fixing the ratio of communication

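The failure scenario of equations (4), (6) and (7) can be traced numerically. The sketch below is an illustration under stated assumptions: all volumes and the value of the routing coefficient `k` are hypothetical, the rerouted traffic is taken to be the share `k` of the former channel i-1,i added to the existing traffic of channel i-1,i+2, and the outflows of node i+2 are held fixed for simplicity.

```python
def balance(inflows, generated, outflows, processed_stored):
    """Information balance of a node, following equation (1)."""
    return sum(inflows) + generated - sum(outflows) - processed_stored


# Hypothetical volumes (arbitrary units) for the network fragment.
v_prev_i = 30.0         # v_{i-1,i}: traffic that used the channel to node i
v_prev_i2 = 20.0        # v_{i-1,i+2}: direct traffic from i-1 to i+2
v_i_i2 = 10.0           # v_{i,i+2}: traffic from node i to node i+2
g_i2 = 5.0              # G_{i+2}
out_i2 = [15.0, 10.0]   # v_{i+2,i+1}, v_{i+2,i+3}
ps_i2 = 10.0            # PS_{i+2}
k = 0.5                 # assumed routing coefficient (share that is rerouted)

# Equation (4): balance of node i+2 while node i works correctly.
before = balance([v_prev_i2, v_i_i2], g_i2, out_i2, ps_i2)

# Equation (6): channel i-1,i+2 after node i fails and traffic is rerouted.
rerouted = v_prev_i2 + k * v_prev_i

# Equation (7): node i no longer contributes v_{i,i+2}.
after = balance([rerouted], g_i2, out_i2, ps_i2)

print(before, rerouted, after)
```

With these numbers node i+2 moves from a balanced state to a positive balance, so the failure of node i shifts load onto its neighbour.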

An attack. If the node is attacked by a virus that generates spam into the network, it may flood all the connected channels ($v_{i-1,i} \to 0$, $v_{i,i+2} \to \max$) and reduce the processing and storage capability of the node ($PS_i \to 0$). The information balances of both considered nodes are estimated as

$$V_i = G_i^{SPAM} - v_{i,i+1} - v_{i,i+2} \tag{8}$$

$$V_{i+2} = v_{i-1,i+2} + v_{i,i+2} + G_{i+2} - v_{i+2,i+1} - v_{i+2,i+3} - PS_{i+2} \tag{9}$$

It is proposed to use the balances $V_i$ for the purpose of constructing a maintenance policy system for the given network. If the information balances of the nodes are normalized to the interval [-1, +1], the following conclusions are easy to accept:

• a node with a large value of the balance ($V_i \to 1$) may very easily disrupt the whole network, that is, every attack on that node or its failure puts the network in a critical situation,

• a node with a small value of the balance ($V_i \to -1$) may take up some traffic from other nodes, provided that its communication channels have sufficient bandwidth,

• it is useful to evaluate the information balances of all nodes and then rank them in order to find the subset of nodes working in critical conditions.
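The ranking step described above might be sketched as follows. The paper does not state how the balances are normalized; scaling by the largest absolute balance is an assumption made here, as are the node names, the balance values and the criticality threshold.

```python
def normalize(balances):
    """Scale balances into [-1, +1] by the largest absolute value.

    This scaling is an assumed scheme; the source does not specify one.
    """
    peak = max(abs(v) for v in balances.values())
    if peak == 0:
        return {node: 0.0 for node in balances}
    return {node: v / peak for node, v in balances.items()}


def rank_critical(balances, threshold=0.8):
    """Rank nodes by normalized balance and list those near +1."""
    scaled = normalize(balances)
    ranking = sorted(scaled.items(), key=lambda item: item[1], reverse=True)
    critical = [node for node, v in ranking if v >= threshold]
    return ranking, critical


# Hypothetical balance vector V = [V_i] for a five-node network.
balances = {"n1": 12.0, "n2": -4.0, "n3": 30.0, "n4": 0.0, "n5": -15.0}
ranking, critical = rank_critical(balances)
print([node for node, _ in ranking])
print(critical)
```

Nodes at the top of the ranking are candidates for priority maintenance, while nodes at the bottom are candidates for taking over traffic, subject to channel bandwidth.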

5. Maintenance policy

It is hard to predict all incidents in the system; in particular, it is not possible to foresee all possible attacks, so system reactions are very often "improvised" by the system, by its administrators or even by expert panels specially created to find a solution to the existing situation. The time needed for a renewal of the system depends on the incident, on the available system resources and on the renewal policy. The renewal policy is formed on the basis of the required levels of system dependability (and safety) and on economic principles (mainly the cost of system downtime and lost execution time).

A maintenance rule ($mr_j^{(i)};\ j = 1, 2, \ldots$) is a chain of decisions about the allocation of system resources (hardware, software, information and service staff) that are engaged in keeping the system in working condition after an incident. These rules are very often connected with small fragments of the system, for instance the repair of a node processor or a communication channel. These operations, though local in character, may have a significant impact on the parameters of the whole system.

The cost of the j-th maintenance rule, $c(mr_j^{(i)})$, is defined as the cost of all measures used to ensure the required level of operation and to renew part of the system. These costs may include the expenditure of exchanging a broken computer for a new one, the salaries of service staff, the system time lost during the rule realization (renewal), and similar items. The cost of maintenance rule execution may be expressed as the pair

$$c(mr_j^{(i)}) = \langle c_{R,j}^{(i)},\ \tau_{R,j}^{(i)} \rangle \tag{10}$$

where $c_{R,j}^{(i)}$ denotes the renewal expenses and $\tau_{R,j}^{(i)}$ the time needed for removing a fault together with such operations as a system restart.

System maintenance policy (MPS) is formed on the basis of the rules and their impact on the overall system performance (system cost).

Table 1. System maintenance policy

A very simple example of a system maintenance policy is demonstrated in Table 1, where all foreseen incidents are located in the first column. Maintenance policies are fixed for each incident in columns 2 (maintenance rules) and 3 (system impact). The number of policies considered for each incident depends on the real-life situation of the system (its resources, organization of maintenance, etc.).

The last column is the most important, since it gives grounds for adopting maintenance decisions based on the analysis of the whole system. The impact rules must be objective and have a numerical value. For example, in a network of computers a chosen local maintenance policy may have a huge and diverse influence on various parts of the network, so the locally cheapest maintenance policy does not have to yield the best global solution. The utility of a global solution depends on the tasks executed by the network and on the losses arising from realizing the maintenance policy. It also depends on the accuracy of the predictions of the dependability measures of the system and on finding the relations between these measures and the local maintenance policies.
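The difference between a locally cheapest rule and a globally preferable one can be shown with a small sketch. Everything in it is hypothetical: the class `MaintenanceRule`, the numerical impact score, the way the cost pair (renewal expenses, renewal time) is collapsed into one number through a downtime price, and the weight applied to the system impact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaintenanceRule:
    name: str
    renewal_cost: float    # renewal expenses of the rule
    renewal_time: float    # hours needed to remove the fault and restart
    system_impact: float   # assumed numerical impact on the whole system


def local_cost(rule, downtime_cost_per_hour):
    """Collapse the (expenses, time) pair into one number (an assumption)."""
    return rule.renewal_cost + rule.renewal_time * downtime_cost_per_hour


def choose_rule(rules, downtime_cost_per_hour, impact_weight):
    """Pick the rule minimizing local cost plus weighted system impact."""
    return min(
        rules,
        key=lambda r: local_cost(r, downtime_cost_per_hour)
        + impact_weight * r.system_impact,
    )


# Two hypothetical rules for the incident "node processor failure".
rules = [
    MaintenanceRule("repair on site", 200.0, 8.0, 50.0),
    MaintenanceRule("replace computer", 900.0, 1.0, 5.0),
]

locally_cheapest = choose_rule(rules, downtime_cost_per_hour=20.0, impact_weight=0.0)
globally_best = choose_rule(rules, downtime_cost_per_hour=20.0, impact_weight=20.0)
print(locally_cheapest.name, "|", globally_best.name)
```

With an impact weight of zero only the local cost counts and the on-site repair wins; once the impact on the rest of the network is priced in, the replacement becomes the better choice.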

6. Conclusions

Development of the Maintenance Policy System (MPS) is very complicated since:

• new incidents, especially new viruses and attacks, may appear while the system is in use, so the correct maintenance policies must be created ad hoc and the MPS must be modified accordingly,

• when the system is large and the time needed to complete a maintenance policy is long, new incidents may occur while the maintenance is still in progress; thus the table of the MPS becomes multidimensional,

• the table of the MPS increases dramatically in size during exploitation of the system, as new incidents, new maintenance policies and new decision rules are added.

System users (and administrators) may have difficulty finding the optimal decision with a multi-criteria objective function in a multidimensional space. Other users do not even have access to data on what happens and where. System simulation may help in developing an effective MPS by determining the "global" effects of the maintenance decisions.

References

1. Arvidsson J. (ed): Taxonomy of the Computer Security Incident related terminology. Telia CERT (http://www.terena.nl/tech/projects/cert/i-taxonomy/archive/.txt).

2. IFIP WG10.4 on Dependable Computing and Fault Tolerance (http://www.dependability.org/).

3. Oppenheimer D. et al.: ROC-1: Hardware Support for Recovery-Oriented Computing. IEEE Trans. on Computers, Vol. 51, no 2, February 2002.

4. Patterson D. et al.: Recovery Oriented Computing (ROC): Motivation, Definition, Techniques,


5. Zamojski W.: Remarks on Reliability of Future Computer Systems. Proc. of ICITNS, Amman 2003.

6. Zamojski W., Caban D.: Trends in the Theory and Engineering of Reliability Applied to the NBIC Technology. Proc. of The 3-rd Safety and Reliability International Conference To Safer Life and Environment KONBIN 2003, Gdynia, 2003.

7. Zamojski W.: Functional-reliability model of a computer-man system. [in:] W. Zamojski (ed.): Computer Engineering. WKiŁ, Warsaw 2005.