
CLUSTER ANALYSIS

BY OPTIMAL DECOMPOSITION

OF INDUCED FUZZY SETS

THESIS

submitted for the degree of doctor in the technical sciences at the Technische Hogeschool Delft, by authority of the rector magnificus, prof. ir. L. Huisman, to be defended before a committee appointed by the college of deans on Wednesday, 15 February 1978, at 14.00 hours

by

ERIC BACKER

electrical engineer, born in Soestdijk

Delftse Universitaire Pers, 1978


This thesis has been approved by the promotor.


To Janneke, Tsuki and Maarten


CONTENTS

SUMMARY

1. INTRODUCTION
1.1 General background
1.2 Cluster analysis in a nutshell
1.3 The concept of fuzzy sets in cluster analysis
1.4 Related work
1.5 The deficiency of current methods and some guidelines
1.6 Evaluation of clustering results

2. PRELIMINARIES
2.1 Fuzzy set theory
2.1.1 Definitions and notations in fuzzy set theory
2.1.2 On measures of fuzziness, I(f)
2.1.3 Some properties of I({f_i}_{i=1}^m)
2.2 Cluster analysis
2.2.1 Definitions and notations in cluster analysis

3. INDUCED FUZZY SETS
3.1 Affinity decomposition
3.2 Affinity concepts
3.2.1 The distance concept
3.2.2 The neighborhood concept
3.2.3 The probabilistic concept
3.3 Mixture of concepts

4. QUANTITATIVE ANALYSIS OF FUZZY SET DECOMPOSITION
4.1 Partitioning-characterization in terms of fuzziness functions
4.2 Partitioning-characterization in terms of inter fuzzy set distance
4.3 Partitioning-characterization in terms of fuzzy set structure
4.3.1 A fuzzy similarity measure
4.3.2 The amount of fuzzy set structure
4.4 Example and comparison of partitioning-characterization

5. COMPUTING STRATEGIES
5.1 Algorithmic optimization
5.2 The repartition function
5.3 The iterative defuzzification algorithm
5.4 Convergence properties
5.5 Neighborhood optimization
5.6 Stability
5.7 Optimality and choice of parameters

6. EXPERIMENTAL RESULTS
6.1 The data
6.2 Digital computer simulation
6.3 Experiments
6.3.1 Ruspini data
6.3.2 Iris data
6.3.3 Speech data
6.3.4 Pictorial data
6.4 Concluding remarks

SELECTED BIBLIOGRAPHY

MAJOR NOTATIONS

SAMENVATTING


SUMMARY

Information processing is a characteristic aspect of pattern recognition. This is because pattern recognition can be considered as a process of sense perception or technical measurement, information extraction and attaching some meaning to the information selected. In this context a pattern should be considered as a collective noun of divergent measurable signals, objects or states. Based on the idea that

'cognition' should precede 'recognition', there is great interest in one of the areas of pattern recognition commonly referred to as cluster analysis.

In cluster analysis, a group of objects is split up into a number of more or less homogeneous subgroups on the basis of an often subjectively chosen measure of similarity, such that the similarity between objects within a subgroup is larger than the similarity between objects belonging to different subgroups. Both the diversity of cluster techniques and methods, and the number of different scientific disciplines in which these have been developed, are striking. Irrespective of the method, the final result is usually given in the form of a collection of object-subsets. In this context an object-subset is considered as a classical set, which means that an object either belongs to the set or not. With respect to both the structural properties among objects in a certain representation space, and the way of assigning them to subsets, the above description of the group of objects is rather lacking in nuances. This justifies using the fuzzy set theory to achieve a much more nuanced, numerically interpretable description of object clusterings. In contrast to the classical set, an object can be assigned to a fuzzy set with a varying degree of membership from 0 up to and including 1.

In this thesis, a cluster model is described which tries to achieve an optimal decomposition of the group of objects into a collection of fuzzy sets, where optimality has a restrictive meaning in the sense that a certain criterion function is extremized and is able to summarize only part of the information available. Moreover, the model demands a specific choice of a concept in which object-similarity is made explicit.

Basically, the model consists of an algorithmic relationship between an inducing object partition on the one hand and a collection of induced fuzzy sets on the other hand. A collection of fuzzy sets is induced by means of a point-to-subset affinity concept on the basis of the structural properties among objects in the representation space. This has been called affinity decomposition and performs in the same way as the Bayes theorem from probability theory. The appropriateness of the induced collection of fuzzy sets is made explicit by applying some measure of fuzziness or a measure for the amount of fuzzy set structure. The collection of induced fuzzy sets may cause a repartition due to a reclassification function. Then, a new inducing step follows, so that an iterative procedure results. The model as described can be regarded as unifying to the extent that many concepts of object-similarity as well as various criterion functions can be implemented without altering the procedure in essence.

Attaching numerical values to fuzziness can be regarded as determining a form of nonstatistical uncertainty. A relationship with information theory can be imagined, where uncertainty is based upon the notion of probability. Intuitively, a resemblance between the membership function, as defined in fuzzy set theory, and a probability is felt. However, probability is an objective characteristic (conclusions from probability theory can, in general, be tested by experiment), while the membership grade is subjective due to intrinsic ambiguity and is


arbitrarily determined. Irrespective of the nature of uncertainty, information theory enables us to make these uncertainties explicit. The fact that the nonstatistical uncertainty in this study originates from the differences between the inducing partition and the induced fuzzy sets has been reason for applying the so-called decision function type of uncertainty.

As in statistical decision theory, where the error analysis has been related to the overlap of a posteriori probabilities, here fuzziness will be related to the notion of intersection of fuzzy sets. This leads to a number of criterion functions (partitioning-characterizations) analogous to probabilistic distance measures and the probability of error resulting from the a posteriori decision rule.

The thesis is organized as follows. In Chapter 1 the cluster problem and its relation to fuzzy set theory are introduced.

In Chapter 2, some mathematical aspects of fuzzy set theory are briefly discussed; in particular, a measure of fuzziness is suggested. Finally, the optimization-clustering problem is characterized.

In Chapter 3, the fundamental idea behind affinity decomposition is dealt with. Among the many affinity concepts which may be imagined, three are discussed in detail, namely: the distance concept, the neighborhood concept and the probabilistic concept.

In Chapter 4, further analysis takes place with respect to the partitioning-characterization functions. Since generalization of the notion of intersection of fuzzy sets results in two manageable definitions of fuzzy set intersection, two corresponding fuzziness criteria are derived. Using the notion of symmetric difference from set theory, two measures of inter fuzzy set distance are derived. All the above partitioning-characterization functions lead to one basic form of the reclassification function. A particular partitioning-characterization is based upon a fuzzy similarity measure analogous to the mutual information defined in information theory. This criterion function determines the amount of fuzzy set structure.

In Chapter 5 attention is given to the iterative optimization procedure. The reclassification function is investigated and convergence properties are examined. The considerable advantage of having a more nuanced and better interpretable representation of the object partition is offered to the investigator at the cost of computational complexity. However, time-consuming data analysis is preferable over incorrect decision making, starting from inferior design criteria for process automation, and so on. Speeding up the procedure is achieved by introducing the principle of peripheral optimization of fuzzy sets (set neighborhood optimization). From experimental evidence it became clear that a proper choice of the starting configuration is of crucial importance.

In Chapter 6, a number of experiments in support of the method suggested is described, where sufficient a priori knowledge has been obtained by means of two fast methods of different origin, in order to be able to set parameter values as correctly as possible. One of them belongs to the hierarchical family, while the other acts as a mode-seeking procedure.

Four object data sets have served as appropriate test cases in support of the method described. Both the Ruspini data and the Iris data enable us to compare the results obtained with those known from the literature. The Ruspini data illustrates the local optima problem, whereas the Iris data, having touching clusters, exemplifies the appreciation of the additional information supplied by the cluster membership values varying from 0 up to and including 1. The analysis shows the appropriateness with respect to the investigation of cluster validity.

A third experiment covers the data from the analysis of connected speech. Since it is assumed that there exists a strong relation between the description of written language in phonemes and the description of spoken language on the basis of its manifold acoustic variations, the allophones, it is expected that a linguistic partition of the data on word level should result in a clustering of the acoustical waveform properties. Numerical evaluation of this hypothesis may lead to extensive insight.

Finally, an experiment is described where information extracted from an image (brightness, color and texture) has been used for segmentation purposes of pictorial regions.

In conclusion, the model of decomposition of an object data set into a collection of fuzzy sets may be regarded as an extension of data analysis having the property that homogeneous groupings, sparse groups of objects, touching groups, and so on, are classified as such. The ability of the model to include various concepts and a priori knowledge may be regarded as one of the interesting features.


1. INTRODUCTION

This thesis is concerned with nonsupervised pattern recognition, in which we explore the concept of fuzzy sets in order to provide the investigator (data analyst) with additional information supplied by the pattern class membership values, apart from the classical pattern class assignments.

In this first chapter we shall discuss the basic ideas behind the pattern recognition problem, the clustering problem and the concept of fuzzy sets in cluster analysis. We shall also give a brief review of the literature as far as the 'fuzzy' cluster analysis is concerned. Finally we shall comment on some current methods and propose ideas about objectively evaluating and comparing these methods.

1.1 GENERAL BACKGROUND

Classification or labeling of a group of objects on the basis of a set of available descriptions of objects is commonly stated as a pattern recognition problem. Those objects classified into the same pattern class share common properties. This proposition leads us to two subproblems. The first subproblem is what measurements or features

should be extracted from the input patterns. The device which extracts the feature measurements from input patterns is called the feature extractor. Usually, the selection of features is rather subjective and also dependent on the practical situation, e.g. the availability of potential features, the cost of feature measurements, and so on. A general guideline of feature selection is to obtain a set of features such that the interclass variabilities are much larger than the intraclass variabilities.

FIGURE 1: A pattern recognition system (input pattern, feature extractor, feature measurements, classifier, decision).

Extensive research has been conducted on the selection, ordering, discriminating power, and many other aspects of features, [4, 17, 20, 24, 26, 47, 50, 65, 72, 73, 76, 77, 79, 89, 103]. However, a general theory of feature extraction is not yet available.

The second subproblem is the problem of classification, [36, 49]. The device which performs the function of classification is called classifier.

A simplified block diagram of a pattern recognition system is shown in Fig. 1.

As an example*, we assume a number of labeled objects (patterns) of which we consider two measurable features. In the scatter diagram given in Fig. 2, each individual pattern is represented by a 2-dimensional measurement vector. As can be seen, two pattern classes are considered, indicated by the symbols ● and ▲. We notice that in this example these classes are not well separable, which may be caused either by the fact that the problem by its very nature is not well separable, or by the fact that the discriminating power of the two measurements used is too low and other or more measurements should be searched for.

* Remark: this example will be used at different places in this thesis.

FIGURE 2: Labeled patterns in a 2-dimensional measurement space.

Pattern recognition finds a wide variety of applications, Kanal [61]. The applications include character recognition, identification of bank checks, speech recognition and speaker identification (Section 6.3.3, Speech data), general medical diagnosis, weather prediction, system fault identification, market analysis, numerical taxonomy (Section 6.3.2, Iris data), target detection and picture processing (Section 6.3.4, Pictorial data).

A fundamental distinction can be made in the form of (i) supervised pattern recognition, and (ii) nonsupervised pattern recognition.

In (i), the classification system is developed and the performance of the system is evaluated with the aid of training samples. Analysis of supervised pattern recognition is sometimes called discrimination analysis.

In (ii), no amount (or only a vague amount) of a priori knowledge about the data itself and the number of pattern classes of the given data is available. It is then required to set up a method of deciding whether or not the given objects fall into groups. Analysis of nonsupervised pattern recognition is sometimes called cluster analysis.


There are many situations in which one has to carry out cluster analysis rather than discrimination analysis to classify objects into groups, due to a lack of a priori problem knowledge.

For example, in biological taxonomy, over the years, plants and animals have been classified into categories, classes or groups according to their observable characteristics. However, not everything comes packaged with a label of its population of origin, while the number of populations is unknown.

In conclusion, cluster analysis can be adapted to a wider range of problems than discrimination analysis.

In this thesis we shall concentrate on nonsupervised pattern recognition or cluster analysis.

1.2 CLUSTER ANALYSIS IN A NUTSHELL

In many scientific disciplines — like life sciences, behavioral and social sciences, earth sciences, medicine, engineering sciences, information and policy sciences — there exists a common problem: by what means can we get the deepest understanding of a given set of object measurements (data points)? Very often this problem will be solved by some method of data analysis which can be used to partition this set of data points into a number of natural and homogeneous groupings (subsets, clusters), and to indicate the most representative elements.

In general these problems are stated as clustering problems and the sort of data analysis is referred to as cluster analysis.

The primary objective of a clustering technique is to classify a given set of data points by assigning them to a reasonably small number of natural and homogeneous clusters. The term homogeneous is used in the sense that all data points contained in the same cluster are more similar to each other than they are to data points in other clusters.


The search for natural groupings in a set of data points is very much an experimentally oriented 'art' in the sense that the grouping of the data points is very dependent on the investigator's feeling for the degree of association between elements and distinctness between groupings.

Therefore, looking for an optimum method and a unique result is contrary to the very nature of the problem.

Depending on the field of interest in which these techniques can be applied, the nomenclature varies from cluster analysis to Q-analysis, typology, grouping, clumping, classification or numerical taxonomy.

Likewise, the goals of various research workers in clustering techniques are frequently dissimilar. They might be:

. finding a true typology
. model fitting
. prediction based on groupings
. hypothesis testing
. data exploration
. hypothesis generating
. data reduction.

As is known, in supervised pattern classification the category (cluster) structure is known, while in cluster analysis (nonsupervised) little or nothing is known about the category structure. All that is available is a collection of object measurements where category memberships are unknown. The problem is to sort the object measurements (observations) into groups such that the degree of natural association is high among members of the same group and low between members of different groups. Then, the essence of cluster analysis might be viewed as assigning appropriate meaning to the terms natural groupings and natural association. If we compare:

. Solon of Athens
. Thales of Miletus
. Periander of Corinth
. Pittacus of Mitylene
. Cleobulus of Lindus
. Bias of Priene
. Chilon of Sparta,

being the seven wise men, with:

. the Colossus of Rhodes
. the hanging gardens of Babylon
. the Pharos at Alexandria
. the Pyramids of Egypt
. Pheidias' statue of Jupiter at Olympia
. the temple of Diana at Ephesus
. the Mausoleum at Halicarnassus

as the seven wonders, then, without specifying the nature of the observations, the meaning of natural association within each of the groups and between the two groups is obvious.

From this it is understood that the objects of analysis may be persons, animals, commodities, opinions, or other entities.

In our case, the objects of analysis will mainly come from real life: signals (like speech and other sound signals), printed or hand-written alphanumerical characters, pictures, ECG-signals, röntgen photographs, etc. But in fact the question of where the objects of analysis originate from is of minor importance.

Even if we only have an intuitive understanding of the problem, it will be clear that the problem can be characterized by:

. the number of measurements and data points
. the choice of what to cluster
. the sort of measurements
. the separation of the clusters
. the number of meaningful solutions
. the underlying data generation process
. the distribution of data points among clusters

whereas the selection of the technique is subject to:


. the form of the results
. cost and capacity factors
. interactions with association measures
. sensitivity to various cluster structures
. dependency on arbitrary decisions
. the form of evaluation
. interactive multi-method analysis.

The foregoing may clarify why the literature of cluster analysis is both voluminous and diverse. Moreover, the terminology differs widely from one field to another so that the literature seems to be very disjunct. Further, the number of clustering techniques available is large, as is the number of problems in applying them.

No generality can be expected: for each of the following data structures an appropriate method should be selected:

. compact clusters, high separation
. elongated clusters
. touching clusters
. linearly nonseparable clusters
. spherical clusters
. non-compact clusters
. unequal cluster population

It must be clear to the reader that no universal ideal technique exists which may uncover all of the above structures.

Although there have been several attempts to review cluster analysis techniques, [5, 28, 46], there still is a lack of comprehensiveness and, consequently, a strong need for a practical guide to making intelligent choices amidst this bewildering array of alternatives.

Basically, the clustering techniques can be divided into two distinct categories, i.e., (i) hierarchical and (ii) non-hierarchical clustering techniques.


FIGURE 3: A taxonomy of pattern recognition (measurement problem; classification problem): supervised classification versus nonsupervised classification; hierarchical techniques (divisive methods, agglomerative methods) and nonhierarchical techniques (threshold methods, optimization partitioning, mode-seeking, probability mixture decomposition, graph-theoretical methods).


(i) The hierarchical clustering techniques derive the relations between the objects to be classified in a hierarchical fashion or a tree diagram. The tree diagram is also known as the dendrogram in biological taxonomy. In terms of the processes of forming the tree diagrams, the techniques can be further divided into divisive algorithms (starting with a cluster which contains all objects and splitting it into new clusters by a set of rules; this process continues until each cluster contains exactly one object), [106], and

agglomerative algorithms (starting with each object as a cluster and merging them to form new clusters by some rule; the process continues until a single cluster is attained which contains all objects), [100].

(ii) In non-hierarchical clustering the major techniques are the threshold partition method (objects are introduced sequentially and, dependent on prespecified thresholds, create new clusters or combine into existing clusters), [6], the partitioning methods of optimizing a criterion function (objects are transferred between clusters to improve the value of a chosen criterion function), [32, 48, 51, 59, 70], the mode-seeking methods (objects are assigned to the modes of the point density by some rule), [65], the decomposition methods of probability mixtures, [88, 109], and the graph-theoretic methods (these methods use the notions of graph theory in locating the clusters of the data), [18, 25, 71, 117].

The methods of nonsupervised pattern recognition mentioned in the foregoing are summarized in Fig. 3. In this thesis special attention will be paid to optimization-partitioning methods and (probabilistic) mixture decomposition.

A simple example may be helpful. First we illustrate the operation of a hierarchical cluster analysis by means of a simple agglomerative method, known as the centroid cluster method, [100], which will be applied to the data set of Table I, consisting of the values of two measurements for each of the objects. The location of the objects is shown in Fig. 4.


FIGURE 4: Location of the objects a - g in the measurement plane.

TABLE I: Object-measurement table of a simple data set.

objects   measurement 1   measurement 2
a         1.0             1.0
b         1.5             2.0
c         3.0             4.0
d         5.0             7.0
e         3.5             5.0
f         4.5             5.0
g         3.5             4.5

For this set of data we compute a matrix D(1) of the inter-object distances, given in Table IIa, where distance is measured by the Euclidean distance measure.

Examination of D(1) shows that d_eg is the smallest distance. Consequently, objects e and g are fused to form a group and are replaced by the co-ordinates of their centroid. From the reduced set of data, an inter-object distance matrix D(2) is calculated again, given in Table IIb.


TABLE IIa: D(1).

       a      b      c      d      e      f
b     1.25
c    13.     6.25
d    52.    37.25  13.
e    22.25  13.     1.25   6.25
f    28.25  18.     6.25  10.25   1.
g    18.5   10.25   0.5    8.5    0.25   1.25

TABLE IIb: D(2).

       a      b      c      d     (eg)
b     1.25
c    13.     6.25
d    52.    37.25  13.
(eg) 20.31  11.56   0.81   7.31
f    28.25  18.     6.25  10.25   1.06

From the matrix D(2) it follows that d_c(eg) is the smallest; so c, e and g are fused to form a three-membered group. Then, the result is shown in Table IIc.

We have d_ab as the smallest distance appearing in D(3), and objects a and b are fused to form a second group. Then, we continue with Table IId: D(4). In this case d_(ceg)f is the smallest distance, and (ceg) and object f are fused to form a four-membered group.

Finally we get D(5), Table IIe, where d_(cegf)d is the smallest entry of the matrix, being the smallest distance between the objects c, e, g and f on the one hand and object d on the other hand, and so these are fused to form a five-membered group.


TABLE IIc: D(3).

        a      b     (ceg)   d
b      1.25
(ceg) 17.54   9.49
d     52.    37.25   9.14
f     28.25  18.     1.69  10.25

TABLE IId: D(4).

       (ab)  (ceg)   d
(ceg) 13.20
d     44.31   9.14
f     22.81   1.69  10.25

TABLE IIe: D(5).

       (ab)  (cegf)
(cegf) 15.46
d     44.31   7.5


FIGURE 5: Centroid clustering dendrogram (leaf order: e, g, c, f, d, a, b).

The last stage consists of the fusion of the two remaining groups into a single group. The result of this example is shown in a so-called dendrogram, Fig. 5. If the data given has to be partitioned into two groups, then it can be concluded that these groups should be {a,b} and {c,d,e,f,g}. If three groups are desired, then these should be {a,b}, {d} and {c,e,f,g}.
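As an aside, the fusion sequence of this example can be reproduced mechanically. The sketch below is not part of the original text; it is a minimal Python illustration of the centroid cluster method on the Table I data, with squared Euclidean distances between cluster centroids (distances recomputed directly from the coordinates may differ slightly in rounding from the printed tables, but the fusion order is the same).

    # Minimal sketch of the centroid (agglomerative) cluster method on Table I.
    # Illustrative only; variable names are our own, not the thesis's notation.
    data = {'a': (1.0, 1.0), 'b': (1.5, 2.0), 'c': (3.0, 4.0), 'd': (5.0, 7.0),
            'e': (3.5, 5.0), 'f': (4.5, 5.0), 'g': (3.5, 4.5)}

    def sqdist(p, q):
        # squared Euclidean distance, as tabulated in D(1)..D(5)
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    def centroid(members):
        xs = [data[m][0] for m in members]
        ys = [data[m][1] for m in members]
        return (sum(xs) / len(xs), sum(ys) / len(ys))

    clusters = [[k] for k in sorted(data)]          # start: every object alone

    while len(clusters) > 1:
        # find the pair of clusters with the smallest centroid distance
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: sqdist(centroid(clusters[ij[0]]),
                                                centroid(clusters[ij[1]])))
        d = sqdist(centroid(clusters[i]), centroid(clusters[j]))
        print('fuse', ''.join(clusters[i]), '+', ''.join(clusters[j]),
              'at distance', round(d, 2))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)

Run as is, this prints the fusions e+g, c+(eg), a+b, (ceg)+f and (cegf)+d in the same order as the tables above.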

Next, we illustrate the operation of a non-hierarchical clustering technique using the partitioning method by means of a simple procedure which will be applied to the same data set.

Let us assume that the data given must be partitioned into two groups, and that the method to be employed must find a solution such that all the members of a group are nearer the mean vector of that group than the mean vector of any other group.

Furthermore, let us assume that the initial partition is given by

A = {a,b,c,d}, and B = {e,f,g}.

Then the mean vectors of groups A and B are given by

μ_A = (2.625, 3.5) and μ_B = (3.83, 4.83).

FIGURE 6a: Scatter diagram of the data from Table I; initial partition, {A,B}.
FIGURE 6b: Partition after the first step, {A',B'}.
FIGURE 6c: Final partition, {A'',B''}.

This situation is shown in the scatter diagram of Fig. 6a.

Now each object is tested to see whether or not it is nearer the mean vector of its own group than that of the other group, or:

if object i is assigned to A beforehand, and d(i,μ_B) < d(i,μ_A), then i is relocated to group B; otherwise i remains classified in group A.

This appears to be true in our case for object d, so the new partition is given by

A' = {a,b,c}, and B' = {d,e,f,g}, see Fig. 6b.

Consequently, the new mean vectors become

μ_A' = (1.83, 2.33) and μ_B' = (4.13, 5.38).

Again we test each object, and now object c is relocated, giving the final partition,

A'' = {a,b}, and B'' = {c,d,e,f,g},

with μ_A'' = (1.25, 1.5) and μ_B'' = (3.9, 5.1).

Thus, as a result of using a particular clustering method, we have derived algorithmically a set of cluster assignments, represented in Fig. 6c. Moreover, this figure shows a two-point data set representation, which really clarifies the information (data) reduction viewpoint of cluster analysis.
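Again as an illustration only (not the thesis's program), the relocation procedure just described can be sketched in a few lines of Python; the group names and the stopping rule are our own choices, while the relocation test d(i,μ_other) < d(i,μ_own) is the one given above.

    # Minimal sketch of the mean-vector relocation procedure on Table I.
    data = {'a': (1.0, 1.0), 'b': (1.5, 2.0), 'c': (3.0, 4.0), 'd': (5.0, 7.0),
            'e': (3.5, 5.0), 'f': (4.5, 5.0), 'g': (3.5, 4.5)}

    def mean(members):
        xs = [data[m][0] for m in members]
        ys = [data[m][1] for m in members]
        return (sum(xs) / len(xs), sum(ys) / len(ys))

    def sqdist(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    groups = {'A': ['a', 'b', 'c', 'd'], 'B': ['e', 'f', 'g']}  # initial partition

    changed = True
    while changed:
        changed = False
        mu = {g: mean(m) for g, m in groups.items()}   # means fixed per pass
        for obj in sorted(data):
            own = 'A' if obj in groups['A'] else 'B'
            other = 'B' if own == 'A' else 'A'
            # relocate obj if it is nearer the other group's mean vector
            if sqdist(data[obj], mu[other]) < sqdist(data[obj], mu[own]):
                groups[own].remove(obj)
                groups[other].append(obj)
                changed = True
        # means are recomputed after each full pass, as in the text

    print(groups)   # converges to A = {a,b}, B = {c,d,e,f,g} as above

The first pass relocates d, the second pass relocates c, and the third pass changes nothing, reproducing the partitions {A',B'} and {A'',B''} above.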

In many fields the research worker is faced with a large number of observations, which are quite intractable unless classified into manageable groups, which in some sense can be treated as units. So, clustering techniques can be used to perform this data reduction, reducing the information on the whole set of N objects to information about m groups, (m « N). In other words, simplification with minimal loss of information is looked for.

We conclude this section with a number of (warning) statements, [35]:

1. Clustering techniques attempt to group points in a multi-dimensional space in such a way that all points in a single group have some natural relation to each other and points not in the same group are somehow different.

2. Clustering techniques are tools for discovery, not an end in themselves.

3. A cluster analysis really is a preprocessing step that should generate ideas and should help the user to form hypotheses.

4. The utility of a cluster analysis lies more in the questions raised by it than in the questions answered.

5. A cluster analysis should be supplemented by other descriptive techniques.

6. There is no true structure since we have to make ad hoc choices in the selection of a measure of association, a criterion of clustering goodness, various parameters, and in the weighing of several computational constraints.

7. A single criterion, a single result, cannot summarize all information available in the clustering.

1.3 THE CONCEPT OF FUZZY SETS IN CLUSTER ANALYSIS

From the foregoing section it is clear that we have to accept that there is no unique solution for a clustering problem, due both to the large variety of techniques available, and to the subjective choices to be made. The non-uniqueness particularly exists in situations where the clusters to be detected are not very compact and not well separated. In that case it is evident that a fuzzy set representation in cluster analysis will be convenient, since fuzzy sets, as introduced by L.A. Zadeh [112] in 1965, are sets with no sharp boundaries.

When clusters are not compact and not well separated it is more natural to assign each object to a cluster with a degree of cluster membership (cluster belongingness) than is done in classical set theory, where each point may either belong to a cluster or not. Consequently, the advantage is that various types of non-uniqueness or vagueness may be classified as such, while the corresponding classical partition still remains meaningful.

It is the aim of this thesis to develop a clustering method which is capable of detecting clusters that cannot easily be found by classical means, and which provides a common mathematical framework for apparently different clustering concepts.

The concept of fuzzy sets provides an appropriate way to reformulate the clustering problem in a unifying way.

1.4 RELATED WORK

Fuzzy sets as a theoretical basis for clustering were first suggested by Bellman, Kalaba and Zadeh, [9]. Subsequently, various theories of fuzzy clustering were proposed, [3, 7, 10-15, 38-42, 54-55, 85, 92-95, 110].

Negoita, in 1969, used the separation theorem of fuzzy sets in the problem of describing cluster-based information retrieval.

Gitman and Levine, [55], in 1970, presented an algorithm which achieves a decomposition of a multi-modal fuzzy set into a number of uni-modal fuzzy sets by taking into account a usual distance measure and the order of importance of every point given by its membership function.

Ruspini, [92], in 1969, presented classification in fuzzy sets as the breakdown of the probability density function of the original set of data points into a weighted sum of the component fuzzy set densities.


The method which we will present here is in some sense a reverse model of Ruspini's approach.

Tamura, Higuchi and Tanaka, [102], in 1971, described for the first time a hierarchical clustering scheme generated by a one-parameter family of equivalence relations on a data set. The role of fuzzy hierarchical clustering in information retrieval is discussed in detail by Negoita, [83], in 1973.

Dunn, [38], and Bezdek, [10], in 1973, generalized the ISODATA clustering process to a fuzzy-ISODATA clustering technique.

In 1974, Dunn, [41], used graph-theoretical arguments to show that the hierarchical clustering scheme induced by Tamura's n-step fuzzy relation is contained in the maximal single linkage hierarchy.

In an interesting paper by Diday, [32], in 1973, the notion of strong and weak cluster patterns, in the sense of types of fuzzy sets, has been developed. The method presented in this thesis is similar to that of Diday to the extent that clusterings obtained are characterized by means of some quantity representing the degree of structure.

Yeh and Bang, [110], in 1975, extended many standard graph-theoretical concepts to fuzzy graphs and discussed their applications to cluster analysis.

Barnes, [7], in 1976, developed a fuzzy set clustering algorithm containing a measure of the separation between the classes. The degree of separability can also be used to determine the number of classes into which the data should be separated.

From a historical point of view we may characterize the development of iterative optimization-partitioning techniques (also referred to as iterative relocation, reclassification or repartition procedures) by the fact that the number of cluster representative elements on which the cluster assignment is based has increased.

First we have seen the development of various k-means or k-cluster-center methods (k elements involved). Then, Diday introduced the k-samplings method where k groups of elements were involved. Finally in this thesis we shall develop an induced fuzzy sets approach in which all data points are involved.


1.5 THE DEFICIENCY OF CURRENT METHODS AND SOME GUIDELINES

In most current clustering techniques, the notion of a cluster cannot be clearly defined, [25]. The main reason for this may be given by the five factors listed below, [5]:

(i) computational complexity
(ii) memory requirement
(iii) sample size requirement
(iv) nature of the data
(v) availability of categorization information.

re (i) & (ii): Due to computational difficulty and memory requirement

clustering techniques are often limited, both m the number of objects that can be handled, and in the number of properties

(dimensionality) used to represent each object. Therefore, the meaning of clusters is rather restricted.

re (ill) : In hierarchical methods the sample size should be small enough to permit interpretation of the result. However, some non-hierarchical methods, particularly mode-seeking techniques, demand considerably large sample sizes.

re (iv): Clusters are often considered as groups of objects which are the result of a certain algorithm for merging or splitting the given data, or as groups of objects which are induced by a certain

(locally-optimal) partition procedure. Therefore, the clusters have no clear, explicit and intuitive description, [78]. The methods are often restricted to working for certain types of data, particularly compact and well separated data. Hierarchical methods often suffer from the chaining effect of the given data. The chaining effect causes spurious merges of legitimate clusters, [82].

re (v): It is difficult to determine the number of appropriate clusters in hierarchical methods, whereas m most non-hierarchical methods, one must initiate the proper number of clusters or have no control over the number of clusters.


The following additional properties, requirements for a good clustering technique, may serve as a guideline for evaluating the performance of the method, [50].

1. The method should be invariant under any linear transformation of the measurements of objects.

2. The method should yield a unique result; the result should be independent of the scanning sequence of the data.

3. The method should yield a stable result. That is, small changes in the data should not affect the result drastically.

4. The method should preferably be iterative. The number of clusters should be adjustable during the clustering process.

5. The method should be convergent and efficient in terms of computation time; the memory storage should be manageable.

In light of the above discussion, the induced fuzzy set strategy advocated meets many of the requirements, although some problems remain unsolved.

1.6 EVALUATION OF CLUSTERING RESULTS

After choosing a criterion function or clustering technique, the investigator is faced with the problem of interpreting the resultant clusterings. Therefore he should be interested in whether the resultant clusterings reveal natural structure within the data, or if he is dealing with peculiarities in the behavior of the clustering technique, and, if so, to what extent. This part of the research is called evaluation of techniques and results.

The literature on this subject is neither extensive nor encouraging. Most publications are content to suggest a criterion function and to show that the strategy under consideration works well for a specific set of data.

The following questions reflect several aspects of the behavior of clustering techniques, [90].


1. How well does the technique used retrieve natural clusters?

2. How sensitive is the technique to errors in the object measurements?

3. How sensitive is the technique to missing objects?

4. What is the amount of agreement between the resultant clustering and clustering results produced by other techniques?

Consequently, an additional number of experiments and procedures are necessary, motivated by the fact that the apparent data structure should be a good representation of the true data structure.

It is clear that we may take advantage of the fact that various types of non-uniqueness and vagueness can be classified as such by means of their cluster membership function as well as by the flexibility of the affinity concept used.

In this thesis we shall develop a unifying approach which will be evaluated in terms of the foregoing to demonstrate its superiority over non-fuzzy methods, although at the cost of computation time.


2. PRELIMINARIES

In this chapter we present a mathematical introduction to both fuzzy set theory and cluster analysis.

After an introduction to fuzzy set theory, a measure of fuzziness, I(f), is defined and a number of properties are derived. It is shown that I(f) has relevance to the evaluation of partitioning quality in a well-defined sense.

In the second part of the chapter, a brief mathematical characterization of the optimization-partitioning clustering problem is unfolded. It will be shown that the key notion in the analysis of fuzzy clustering will be the partitioning-characterization function.

2.1 FUZZY SET THEORY

Zadeh, [112], 1965, introduced the concept of fuzzy sets by defining them in terms of mappings from a set into the unit interval on the real line. Goguen, [57], 1967, extended the concept by defining fuzzy sets to be functions from a set into a lattice.

Fuzzy sets were introduced to provide a means of mathematically describing situations which give rise to ill-defined classes, i.e., collections of objects for which there are no precise criteria for membership. Collections of this type have vague or fuzzy boundaries. In real situations, especially in problems of pattern classification, fuzziness is the rule rather than the exception. It is therefore believed that fuzzy sets can be applied to these problems at least as well, and probably better, than the methods now being used.

In fact, most of the classes of objects encountered in the real world are of this fuzzy, not sharply defined, type. In other words, imprecise description of real objects is quite natural.

Usually, imprecision and indeterminacy are considered statistical, random characteristics and are taken into account by the methods of probability theory. However, in real situations a frequent source of imprecision is not only the presence of random variables, but also the impossibility of operating with exact data, the latter as a result of the complexity of the system, imprecision of the constraints and objectives, and intrinsic ambiguity.

The fuzzy set theory provides a scheme for handling a number of real situations in which imprecision results in the possibility that an object may belong to a certain category with varying grades of membership.

References [19, 31, 57, 62, 63, 54, 107, 112, 113] introduce the basic concepts and lay the foundations of fuzzy set theory. The basic directions for application of this theory are given in [116].

2.1.1 DEFINITIONS AND NOTATIONS IN FUZZY SET THEORY

A conceptual framework for dealing with classes in which there may be grades of membership intermediate between full membership and non-membership will be given in this section to the extent that it appears to be relevant to the clustering problem under consideration.

In Dijkman & Lowen, [S3], the conceptual framework of fuzzy set theory is discussed in much greater detail; m [31] special emphasis is given to relations of equivalence generalized to fuzzy sets.

In the following we define a fuzzy set and derive some of the immediate properties.


DEFINITION 2.1.1

Let C be a set of objects, each represented by a point in a k-dimensional space. Let an object be denoted by u. Thus C = {u}. Then, a fuzzy set A in C is characterized by a membership function, f_A, which associates with each point u ∈ C a real number in the interval [0,1] such that the value of f_A(u) represents the degree of membership (belongingness) of u in A. *

Thus the membership function is a function

f_A : C → [0,1]. (2.1.1)

It should be noted that, as a consequence of Definition 2.1.1, a classical set with indicator function χ given by

χ : C → {0,1} (2.1.2)

is a special case of a fuzzy set. In that case we have

u ∈ A iff f_A(u) = χ_A(u) = 1 (2.1.3a)

u ∉ A iff f_A(u) = χ_A(u) = 0. (2.1.3b)

DEFINITION 2.1.2

Two fuzzy sets A and B are equal, written

A = B, iff f_A(u) = f_B(u), (∀ u ∈ C). (2.1.4) *

DEFINITION 2.1.3

A' is said to be the complement of A iff its membership function is given by

f_A'(u) = 1 − f_A(u), (∀ u ∈ C). (2.1.5) *


DEFINITION 2.1.4

A is said to be contained in B iff

f_A(u) ≤ f_B(u), (∀ u ∈ C). (2.1.6) *

DEFINITION 2.1.5

The intersection A ∩ᶜ B (c=1,2) of two fuzzy sets A and B is a fuzzy set. Its membership function is equal either to

f_{A ∩¹ B}(u) = min [f_A(u), f_B(u)], (∀ u ∈ C) (2.1.7a)

or to

f_{A ∩² B}(u) = f_A(u) · f_B(u), (∀ u ∈ C). (2.1.7b) *

It is easy to show that both Eqs. 2.1.7a and 2.1.7b are valid definitions for an intersection in terms of classical sets. Obviously, their properties for fuzzy sets are different, as is shown in [10, 31, 39].

As it is the aim of this study to show that an optimal solution of the clustering problem will correspond to minimum fuzzy set overlap, it will be clear that the notion of fuzzy set intersection is of extreme importance. Interestingly enough, both definitions lead to generalizations of some well-known functionals in error analysis in statistical classification models.

Remark: Eq. 2.1.7a may be considered as a sieve function whereas Eq. 2.1.7b acts as a filtering function.

As a consequence of Definition 2.1.5 the next definition follows.

DEFINITION 2.1.6

The union A ∪ᶜ B (c=1,2) of two fuzzy sets A and B is a fuzzy set. Its membership function can be defined either by

f_{A ∪¹ B}(u) = max [f_A(u), f_B(u)], (∀ u ∈ C) (2.1.8a)

or by

f_{A ∪² B}(u) = f_A(u) + f_B(u) − f_A(u) · f_B(u), (∀ u ∈ C). (2.1.8b) *

Which of the two possibilities of Definitions 2.1.5 and 2.1.6 should be used is subject to both the application and the computational complexity. If we have a collection of fuzzy sets A_i, (i=1,2,...,m), Definition 2.1.5 leads to

f_{∩_{i=1}^m A_i}(u) = min_i f_{A_i}(u), (∀ u ∈ C) (2.1.9a)

and f_{∩_{i=1}^m A_i}(u) = ∏_{i=1}^m f_{A_i}(u), (∀ u ∈ C). (2.1.9b)

Similarly, Definition 2.1.6 leads to

f_{∪_{i=1}^m A_i}(u) = max_i f_{A_i}(u), (∀ u ∈ C) (2.1.10a)

and f_{∪_{i=1}^m A_i}(u) = Σ_i f_{A_i}(u) − Σ_{i<j} f_{A_i}(u) · f_{A_j}(u) + ... + (−1)^{m+1} ∏_{i=1}^m f_{A_i}(u), (∀ u ∈ C). (2.1.10b)

From a mathematical point of view Eqs. 2.1.9b and 2.1.10b are more attractive than Eqs. 2.1.9a and 2.1.10a; however, they are more computationally complex.
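For concreteness, the two pairs of operators can be sketched as follows (illustrative Python, not from the thesis; membership functions are represented as dictionaries over the objects of C). The final loop checks De Morgan's laws, stated below as Theorem 2.1.3, for both choices of c.

    # Illustrative sketch of Eqs. 2.1.7a/b and 2.1.8a/b (not from the thesis).
    def intersect(fa, fb, c=1):
        # c=1: min form (Eq. 2.1.7a); c=2: product form (Eq. 2.1.7b)
        if c == 1:
            return {u: min(fa[u], fb[u]) for u in fa}
        return {u: fa[u] * fb[u] for u in fa}

    def union(fa, fb, c=1):
        # c=1: max form (Eq. 2.1.8a); c=2: probabilistic sum (Eq. 2.1.8b)
        if c == 1:
            return {u: max(fa[u], fb[u]) for u in fa}
        return {u: fa[u] + fb[u] - fa[u] * fb[u] for u in fa}

    def complement(f):
        # Definition 2.1.3
        return {u: 1.0 - f[u] for u in f}

    fA = {'u1': 0.9, 'u2': 0.4, 'u3': 0.1}
    fB = {'u1': 0.2, 'u2': 0.5, 'u3': 0.7}

    print(intersect(fA, fB, c=1))   # {'u1': 0.2, 'u2': 0.4, 'u3': 0.1}
    print(union(fA, fB, c=2))       # u1: 0.9 + 0.2 - 0.18 = 0.92

    # De Morgan's laws (Theorem 2.1.3 below) hold for both c=1 and c=2:
    for c in (1, 2):
        lhs = complement(union(fA, fB, c))
        rhs = intersect(complement(fA), complement(fB), c)
        assert all(abs(lhs[u] - rhs[u]) < 1e-12 for u in fA)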

The next three theorems list the immediate properties of ∪, ∩ and ', (c=1,2).


THEOREM 2.1.1

The operations of union and intersection are commutative, associative, and each distributes over the other only for c=1. *

THEOREM 2.1.2

If we denote the fuzzy set f(u) = χ(u) = 1 for all u ∈ C by U, and the fuzzy set f(u) = χ(u) = 0 for all u ∈ C by φ, then

1. A ∪ φ = A, (c=1,2) (2.1.11a)
2. A ∩ φ = φ, (c=1,2) (2.1.11b)
3. A ∪ U = U, (c=1,2) (2.1.11c)
4. A ∩ U = A, (c=1,2) (2.1.11d)
5. A ∪ A' ⊆ U, (c=1,2) (2.1.11e)
6. φ ⊆ A ∩ A', (c=1,2) (2.1.11f)
7. φ' = U (2.1.11g)
8. U' = φ (2.1.11h)

THEOREM 2.1.3

De Morgan's laws also hold for fuzzy sets, for c=1,2. *

In [112], the classical separation theorem for convex sets was generalized to convex fuzzy sets. In terms of Definition 2.1.5 we get the following theorem.

THEOREM 2.1.4

Let A and B be convex fuzzy sets with maximal grades of membership M_A and M_B respectively [M_A = sup_u f_A(u), M_B = sup_u f_B(u)]. Let M be the maximal grade for the intersection A ∩ B, [M = sup_u f_{A∩B}(u)]. Then the degree of separation D is given by

D = 1 − M = 1 − sup_u f_{A∩B}(u). (2.1.12)

This theorem holds for ∩ᶜ, (c=1,2). So, a distinction can be made by writing M_c and D_c for c=1,2. *

Proofs of these theorems, here omitted, are given in [19] and [112]. Definitions 2.1.1 to 2.1.6 and the immediate properties given in Theorems 2.1.1 to 2.1.4 sufficiently indicate the basic operations on fuzzy sets.

Since we shall deal with collections of m fuzzy sets according to the number m of clusters to be detected, given by

{f_i}_{i=1}^m, (2.1.13a)

denoted throughout this thesis as

{(f_i,u), ∀ u ∈ C}_{i=1}^m, (2.1.13b)

where a finite number of pairs (f_i,u) is assumed such that f_i(u) is the value of the membership function f_i assigned to each u, an appropriate condition will be

Σ_{i=1}^m f_i(u) = 1, (∀ u ∈ C). (2.1.14)

Once we have a fuzzy set defined on C, we need a rule for mapping the fuzzy set on a classical set related to f. A useful rule is defined in the following way.

DEFINITION 2.1.7

F(γ) is said to be a γ-level set, given by its indicator function

χ_{F(γ)}(u) = 1 iff f(u) ≥ γ (2.1.15a)
          = 0 iff f(u) < γ, (2.1.15b)

in short

F(γ) = {u : f(u) ≥ γ}. (2.1.15c) *

With {u : f(u) = ½} the so-called turn-over points are determined. In these points the uncertainty about set membership is maximum.

In Fig. 7 the introduced notions are illustrated.
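A γ-level set is directly computable. The following lines are an illustrative sketch (not from the thesis) of Definition 2.1.7 and of the turn-over points, under the same dictionary representation as before.

    # Illustrative sketch of a gamma-level set F(gamma) and turn-over points.
    f = {'u1': 0.9, 'u2': 0.5, 'u3': 0.3, 'u4': 0.5}

    def level_set(f, gamma):
        # F(gamma) = {u : f(u) >= gamma}, a classical subset of C
        return {u for u, v in f.items() if v >= gamma}

    print(level_set(f, 0.5))   # {'u1', 'u2', 'u4'}

    # turn-over points: f(u) = 1/2, where set membership is most uncertain
    print({u for u, v in f.items() if v == 0.5})   # {'u2', 'u4'}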

2.1.2 ON MEASURES OF FUZZINESS, I(f)

In this section we will develop an axiomatic framework for a measure of fuzziness I(f).

Let C = {u} be the set of objects. Then, a fuzzy set is represented by

f = {(f,u), ∀ u ∈ C} (2.1.16a)

and some special cases are denoted as

f = ½, (2.1.16b)
f = 1, (2.1.16c)
and f = 0. (2.1.16d)

Let us consider a real valued function i(·) on the interval [0,1] such that the function satisfies

a) i(0) = i(1) = 0 (2.1.17a)
b) i(γ) = i(1−γ), (γ ∈ [0,½]) (2.1.17b)
c) i(·) is non-decreasing on [0,½]. (2.1.17c)


FIGURE 7a: Illustration of union and intersection, (c=1,2).
FIGURE 7b: Illustration of the separation theorem, (c=1,2).
FIGURE 7c: Illustration of a γ-level set, F(γ).


In other words, i(f(u)) should express the amount of uncertainty, fuzziness or vagueness in set belongingness in point u. It should be zero for f(u) = 0 and for f(u) = 1 (the case of a classical set), have its maximum for f(u) = ½ and be symmetric in f(u) = ½.

Irrespective of a specific choice of i(·), the following definition characterizes the amount of fuzziness in f.

DEFINITION 2.1.8

Given a fuzzy set

{(f,u), ∀ u ∈ C}

and a real valued function i(·) on [0,1] for which Eqs. 2.1.17a, b and c hold, the amount of fuzziness I(f) is given by

I(f) = (1/N) Σ_{u∈C} i(f(u)) (2.1.18)

where N is the number of elements contained in the set C on which the fuzzy set has been defined. *

We may encounter a case in which weights w(u) for each individual object u in C are known a priori in such a way that

Σ_{u∈C} w(u) = 1. (2.1.19)

Then, with

w(u) = 1/N, (2.1.20)

Eq. 2.1.18 is a special case of

I(f) = Σ_{u∈C} w(u) i(f(u)). (2.1.21)

Unless indicated otherwise we assume this special case.


Given i(·) for which Eqs. 2.1.17a, b and c hold, the next theorems are immediate properties of I(f). Proofs are partly borrowed from [31].

THEOREM 2.1.5

I(f) = 0 ⇔ f(u) = χ(u), (∀ u ∈ C). (2.1.22) *

Proof

(i) If we assume that f(u) is an indicator function for all u ∈ C, then f(u) ∈ {0,1}, (∀ u ∈ C). As a consequence of Eq. 2.1.17a, it follows that i(f(u)) = 0 for all u ∈ C, so I(f) = 0.

(ii) Given Eq. 2.1.17a and Eq. 2.1.17c, it follows that i(f(u)) ≥ 0 for all u ∈ C. Then I(f) = 0 implies i(f(u)) = 0 for all u ∈ C. Consequently, f(u) ∈ {0,1} for all u ∈ C and f(u) = χ(u), ∀ u ∈ C. □

THEOREM 2.1.6

1(f) < lih) (2.1.23)

with equality iff f(u) = y/ V u £ C.

Proof

(i) We may write

K f ) - I(^) = h ^ i(f(u)) - ^ S i{h) (2.1.24a)

U£C U£C

= i- S [i(f (u)) - i(!5)]. (2.1.24b)

N U£C

Because of Eqs. 2.1.17b and c, it follows that

i(f(u)) - 1(^5) < 0. (2.1.24c) 33

(46)

Since this is true V u £ C, Eq. 2.1.23 is proved.

(ii) Obviously, if i(f(u)) = i(^) for all u e C, we have f(u) = h,

V u £ C, and Eq. 2.1.23 becomes an equality. C

THEOREM 2.1.7

If f* is a sharpening of f (f*(u) ≥ f(u) for f(u) ≥ ½, f*(u) ≤ f(u) for f(u) ≤ ½, f*(u) = f(u) for f(u) = ½; ∀ u ∈ C), then

I(f*) ≤ I(f). (2.1.25) *

Proof

(i) Assume f(u) ≥ ½; then f*(u) ≥ f(u). Consequently, i(f*(u)) ≤ i(f(u)).

(ii) On the other hand, if we assume f(u) ≤ ½, we have f*(u) ≤ f(u), so i(f*(u)) ≤ i(f(u)). Thus for all u ∈ C we have i(f*(u)) ≤ i(f(u)), which results in Eq. 2.1.25. □

T H E O R E M 2.1.8

If f d e n o t e s the c o m p l e m e n t o f f (Definition 2 . 1 . 3 ) , then

K f ) = K f ) . (2.1.25)

*

Proof

Following Definition 2.1.8, we have

K f ) = ^ 2 i(f (u)) U£C

which can be rewritten as

K f ) = ^ 2; 1(1 - f(u)) (2.1.27a ueC

due to Eq. 2.1.17b. Eq. 2.1.27a can be rewritten as

(47)

K f ) = ^ 2 i ( f ' ( u ) ) , ( 2 . 1 . 2 7 b )

N „

ueC

t h u s 1 ( f ) = K f ' ) . • THEOREM 2 . 1 . 9

Let f and h be two fuzzy s e t s . Then

K f 0 h) + K f i h) = 1 ( f ) + 1 ( h ) . ( 2 . 1 . 2 8 )

*

Proof

Let u s d e n o t e

11 (u) = f (u) - h ( u ) ( 2 . 1 . 2 9 a )

as t h e d i f f e r e n c e between t h e two fuzzy s e t s f and h i n each i n d i v i d u a l element o f C. Then n = {u : TT(U) > 0} ( 2 . 1 . 2 9 b ) d e n o t e s t h e s e t of e l e m e n t s f o r which f ( u ) ^ h ( u ) . C o n s e q u e n t l y , n ' = {u . TT(U) < 0} ( 2 . 1 . 2 9 c ) where FI 0 TT = (j)

and n u n ' = U. (2.1.29e)

From Section 2.1.1 we know

f ii h(u) = max [f(u), h(u)], (V u f C) (2.1.30a)

and f Q h (u) = m m [f(u), h(u)], (V u ^ C) (2.1.30b)

Then, we have

(48)

N . I(f i) h) = 2 i(f(u)) + 2 i(h(u)) (2.1.31a u£n u£n'

and likewise

N . I(f 5 h) = 2 i(f(u)) + 2 i(h(u)). (2.1.31b:

u£n'

u£n

Taking the sum of Eqs. 2.1.31a and 2.1.31b, it follows

N . [ K f il h) + K f 5 h)] = 2 i(f(u)) + 2 i(h(u)). (2.1.31c) ueC u£C

Dividing Eq. 2.1.31c by N, we get Eq. 2.1.28, which completes the

proof. • So far 1(f) has been considered quite generally. If we make a choice

of i(.), then we find a particular measure of fuzziness. For example we may choose the Shannon function type ^

i(x) = - X log X - (1-x) log (1-x), (0 < x < 1) . (2.1.32)

In terms of membership functions we get

i(f(u)) = −f(u) log f(u) − (1 − f(u)) log (1 − f(u)). (2.1.33)

Recalling Definition 2.1.8, the amount of fuzziness can be computed by means of the expression

I(f) = −(1/N) Σ_{u∈C} f(u) log f(u) − (1/N) Σ_{u∈C} (1 − f(u)) log (1 − f(u)) (2.1.34a)

or I(f) = −(1/N) Σ_{u∈C} f(u) log f(u) − (1/N) Σ_{u∈C} f'(u) log f'(u). (2.1.34b)

This type of fuzziness has been investigated by Capocelli, De Luca and Termini, [21, 22, 29, 30], Knopfmacher, [57], and Dijkman & Lowen, [31].


In this thesis we shall go along the lines of a decision function type, stated as

i(x) = x if x ∈ [0,½]
     = 1−x if x ∈ [½,1]. (2.1.35)

In terms of fuzzy sets we find

i(f(u)) = f(u) if 0 ≤ f(u) ≤ ½
        = 1 − f(u) if ½ ≤ f(u) ≤ 1. (2.1.36)

This type of uncertainty measure has been suggested by the author in 1975, [3].

Recalling the notion of γ-level sets, Definition 2.1.7, we formulate the next theorem.

THEOREM 2.1.10

i(f(u)) = |f(u) − χ_{F(½)}(u)|, (∀ u ∈ C). (2.1.37)

This is a real valued continuous function on the interval [0,1] satisfying Eqs. 2.1.17a, b and c. *

Proof

Following Definition 2.1.7, we have

F(½) = {u : f(u) ≥ ½} (2.1.38a)

with χ_{F(½)}(u) = 1 for f(u) ≥ ½
              = 0 otherwise. (2.1.38b)

With the aid of Eq. 2.1.38b we shall prove successively that Eq. 2.1.37 satisfies Eqs. 2.1.17a, b and c.

(i) If f(u) = 0, then χ_{F(½)}(u) = 0, so i(0) = 0. On the other hand, if f(u) = 1, then χ_{F(½)}(u) = 1 and i(1) = 0. Therefore Eq. 2.1.37 holds for Eq. 2.1.17a.

(ii) Assume f(u) = γ < ½; then it follows that

χ_{F(½)}(u) = 0 ⇒ i(γ) = γ. (2.1.39a)

Under the same assumption we have

χ_{F'(½)}(u) = 1 ⇒ i(1−γ) = |f'(u) − χ_{F'(½)}(u)| = |1 − γ − 1| = γ. (2.1.39b)

Combining the results of Eqs. 2.1.39a and b we find

i(γ) = i(1−γ) for the case γ ∈ [0,½].

Likewise it follows that

i(γ) = i(1−γ) for the case γ ∈ [½,1].

Thus Eq. 2.1.37 holds for Eq. 2.1.17b.

(iii) Assume f(u) ≤ h(u) ≤ ½. Then, we have χ_{F(½)}(u) = 0. Consequently, we get

i(f(u)) ≤ i(h(u)) ≤ ½, (∀ f(u), h(u) ∈ [0,½]). (2.1.40)

For the case in which h(u) = ½, then χ_{F(½)}(u) = 1 and i(h(u)) = ½.

In conclusion, we find, together with Eq. 2.1.40,

i(f(u)) ≤ i(h(u)) ≤ ½ ⇔ f(u) ≤ h(u) ≤ ½ (2.1.41)

which proves that Eq. 2.1.37 is a real valued continuous function on the interval [0,½] satisfying also Eq. 2.1.17c.

Parts (i), (ii) and (iii) prove the theorem. □

Both the Shannon type and the decision function type are illustrated in Fig. 8.


FIGURE 8: Illustration of the fuzziness functions: the decision function type i_d(f(u)) and the Shannon type i_s(f(u)) = −f(u) log f(u) − (1 − f(u)) log (1 − f(u)).

FIGURE 9: a) Illustration of an arbitrary mapping of fuzzy set f into a classical subset A ⊂ C. b) Illustration of the minimum uncertainty mapping of fuzzy set f into its ½-level set F(½) ⊂ C.


With the choice in this thesis of the decision function type, the amount of fuzziness becomes

I(f) = (1/N) Σ_{u∈C} |f(u) − χ_{F(½)}(u)| (2.1.42)

which can be interpreted as the amount of uncertainty that arises if, when considering a fuzzy set f defined on C, one has to assign to each of the elements of C a (classical) membership in a subset A*, with A* = F(½) ⊂ C. This point of view is illustrated in Fig. 9.

2.1.3 SOME PROPERTIES OF I({f_i}_{i=1}^m)

In this section we develop a number of properties which are valid for m-collections of fuzzy sets

{(f_i,u), ∀ u ∈ C}_{i=1}^m, (m=2,3,...)

where Σ_{i=1}^m f_i(u) = 1, (∀ u ∈ C),

for the purpose of characterization of the fuzzy set decomposition to be considered in fuzzy clustering.

THEOREM 2.1.11

I(f) = (1/N) Σ_{u∈C} min [f(u), 1 − f(u)]. (2.1.43) *

Proof

If we assume f(u) ≤ ½, then we find

|f(u) − χ_{F(½)}(u)| = f(u) = min [f(u), 1 − f(u)]. (2.1.44a)

On the other hand, if we assume f(u) ≥ ½, then

|f(u) − χ_{F(½)}(u)| = 1 − f(u) = min [f(u), 1 − f(u)]. (2.1.44b)

Combining Eqs. 2.1.44a and b and taking the arithmetic mean proves Theorem 2.1.11. □

THEOREM 2.1.12

\[ I\!\left( \bigcap_{i=1}^m f_i \right) = \frac{1}{N} \sum_{u \in C} \min_i f_i(u). \tag{2.1.45} \]

Note that $\bigcap$ and $\bigcup$ are meant to be $\bigcap_{i=1}^m$ and $\bigcup_{i=1}^m$ unless specified otherwise.

Proof

First we consider two fuzzy sets $f_i$ and $f_j$ from the m-collection. Then,

\[ I(f_i \cap f_j) = \frac{1}{N} \sum_{u \in C} \left| f_{i \cap j}(u) - \chi_{F(\frac{1}{2})}(u) \right|. \tag{2.1.46a} \]

Using Eqs. 2.1.17a and 2.1.14 it follows that

\[ f_{i \cap j}(u) = \min\,[f_i(u),\ f_j(u)] \le \tfrac{1}{2} \tag{2.1.46b} \]

since $f_i(u) + f_j(u) \le 1$. Thus

\[ \chi_{F(\frac{1}{2})}(u) = 0. \tag{2.1.46c} \]

Then, Eq. 2.1.46a becomes

\[ I(f_i \cap f_j) = \frac{1}{N} \sum_{u \in C} \min\,[f_i(u),\ f_j(u)]. \tag{2.1.46d} \]

Because of the fact that, for m fuzzy sets,

\[ \bigcap_{i=1}^m f_i = f_j \cap \bigcap_{i \ne j} f_i, \tag{2.1.46e} \]

we have

\[ I\!\left( \bigcap_{i=1}^m f_i \right) = I\!\left( f_j \cap \bigcap_{i \ne j} f_i \right) \tag{2.1.46f} \]

\[ = \frac{1}{N} \sum_{u \in C} \min\!\left[ f_j(u),\ \min_{i \ne j} f_i(u) \right] \tag{2.1.46g} \]

\[ = \frac{1}{N} \sum_{u \in C} \min_i f_i(u) \tag{2.1.46h} \]

which proves the theorem. □
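As a quick numerical check of Theorem 2.1.12 (a sketch under the same hypothetical conventions as above, with the fuzzy intersection taken pointwise as a minimum):

```python
import numpy as np

# An m-collection (m = 3) of membership vectors over N = 4 objects,
# with each column summing to one as required in Section 2.1.3.
F = np.array([[0.6, 0.2, 0.1, 0.50],
              [0.3, 0.7, 0.1, 0.25],
              [0.1, 0.1, 0.8, 0.25]])

intersection = F.min(axis=0)   # pointwise min over the collection
# Since min_i f_i(u) <= 1/2, the 1/2-level set of the intersection is
# empty and I(intersection) reduces to the mean of the pointwise minima.
print(np.mean(intersection))   # (0.1 + 0.1 + 0.1 + 0.25)/4 = 0.1375
```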

THEOREM 2.1.13

\[ I\!\left( \bigcup_{i=1}^m f_i \right) = \frac{1}{N} \sum_{u \in C} \min\!\left( \min_i\,[1 - f_i(u)],\ \max_i f_i(u) \right). \tag{2.1.47} \]

*

Proof

Due to the fact that

\[ f_{\bigcup_{i=1}^m f_i}(u) = \max_i\,[f_i(u)] = 1 - \min_i\,[1 - f_i(u)] = 1 - f_{\bigcap_{i=1}^m \bar f_i}(u), \tag{2.1.48a} \]

it follows that

\[ I\!\left( \bigcup_{i=1}^m f_i \right) = I\!\left( 1 - f_{\bigcap_{i=1}^m \bar f_i} \right). \tag{2.1.48b} \]

Eq. 2.1.48b leads to

\[ I\!\left( \bigcup_{i=1}^m f_i \right) = \frac{1}{N} \sum_{u \in C} \min\!\left( f_{\bigcap \bar f_i}(u),\ 1 - f_{\bigcap \bar f_i}(u) \right). \tag{2.1.49a} \]

Since $f_{\bigcap \bar f_i}(u) = \min_i\,[1 - f_i(u)]$, Eq. 2.1.49a can be rewritten into

\[ I\!\left( \bigcup_{i=1}^m f_i \right) = \frac{1}{N} \sum_{u \in C} \min\!\left( \min_i\,[1 - f_i(u)],\ 1 - \min_i\,[1 - f_i(u)] \right) \]

\[ = \frac{1}{N} \sum_{u \in C} \min\!\left( \min_i\,[1 - f_i(u)],\ \max_i\,[f_i(u)] \right) \tag{2.1.49b} \]

which had to be proved. □

The next theorem is an immediate consequence of Theorems 2.1.12 and 2.1.13.

THEOREM 2.1.14

\[ I\!\left( \bigcap_{i=1}^m f_i \right) - I\!\left( \bigcup_{i=1}^m f_i \right) \le 0. \tag{2.1.50} \]

*

Proof

Because of the fact that

\[ \min_i f_i(u) \le \min_i\,[1 - f_i(u)] \tag{2.1.51a} \]

(which holds since $\sum_{i=1}^m f_i(u) = 1$ implies $\min_i f_i(u) \le 1 - \max_i f_i(u)$) and

\[ \min_i f_i(u) \le \max_i f_i(u), \tag{2.1.51b} \]

we directly see that

\[ \min_i f_i(u) \le \min\!\left( \min_i\,[1 - f_i(u)],\ \max_i f_i(u) \right). \tag{2.1.51c} \]

Since this is true for all $u \in C$, we find

\[ \frac{1}{N} \sum_{u \in C} \min_i f_i(u) - \frac{1}{N} \sum_{u \in C} \min\!\left( \min_i\,[1 - f_i(u)],\ \max_i f_i(u) \right) \]

\[ = \frac{1}{N} \sum_{u \in C} \left[ \min_i f_i(u) - \min\!\left( \min_i\,[1 - f_i(u)],\ \max_i f_i(u) \right) \right] \le 0 \tag{2.1.51d} \]

which completes the proof. □

A special case forms $f_i(u) = \frac{1}{m}$ for all $u \in C$. Then Eq. 2.1.50 becomes an equality, which is an obvious result since

\[ \bigcap_{i=1}^m f_i = \bigcup_{i=1}^m f_i = \frac{1}{m} \quad \text{for } f_i(u) = \frac{1}{m},\ (\forall u \in C). \tag{2.1.52} \]

THEOREM 2.1.15

\[ I\!\left( \bigcup_{i=1}^m f_i \right) = \max_i I(f_i). \tag{2.1.53} \]

Proof

Since

\[ \min\!\left( \min_i\,[1 - f_i(u)],\ \max_i\,[f_i(u)] \right) = \max_i\!\left( \min\,[1 - f_i(u),\ f_i(u)] \right), \tag{2.1.54a} \]

it follows, by taking the arithmetic mean,

\[ I\!\left( \bigcup_{i=1}^m f_i \right) = \max_i I(f_i) \tag{2.1.54b} \]

which had to be proved. □

THEOREM 2.1.16

\[ I\!\left( \bigcap_{i=1}^m f_i \right) = \min_i I(f_i). \tag{2.1.55} \]

*

Proof

It is easily understood that

\[ \min_i f_i(u) = \min_i\!\left( \min\,[f_i(u),\ 1 - f_i(u)] \right). \tag{2.1.56a} \]

Taking the arithmetic mean, we get

\[ I\!\left( \bigcap_{i=1}^m f_i \right) = \min_i I(f_i) \tag{2.1.56b} \]

which had to be proved. □

In conclusion, we have found the following relation

\[ 0 \le I\!\left( \bigcap_{i=1}^m f_i \right) = \min_i I(f_i) \le I(f_k) \le \max_i I(f_i) = I\!\left( \bigcup_{i=1}^m f_i \right) \le \tfrac{1}{2}, \quad (k = 1, 2, \ldots, m) \tag{2.1.57} \]

for which equality holds when $f_i(u) = \frac{1}{m}$, $(i = 1, 2, \ldots, m)$ for all $u \in C$.
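The ordering in Eq. 2.1.57 lends itself to a numerical spot-check. The sketch below (hypothetical, with the intersection and union taken as pointwise min and max) verifies the chain of inequalities for a random m-collection satisfying the sum constraint; the equalities are attained in the uniform case $f_i(u) = \frac{1}{m}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 4, 10
F = rng.random((m, N))
F /= F.sum(axis=0)                 # enforce sum_i f_i(u) = 1 for every object

def I(f):
    # Amount of fuzziness, Eq. 2.1.43: mean of min[f(u), 1 - f(u)].
    return np.mean(np.minimum(f, 1.0 - f))

I_cap = I(F.min(axis=0))           # fuzziness of the intersection
I_cup = I(F.max(axis=0))           # fuzziness of the union
I_each = np.array([I(f) for f in F])

# The chain 0 <= I(cap) <= I(f_k) <= I(cup) <= 1/2 of Eq. 2.1.57:
assert 0.0 <= I_cap <= I_each.min() <= I_each.max() <= I_cup <= 0.5
print(I_cap, I_each, I_cup)
```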

2.2 CLUSTER ANALYSIS

As pointed out in Chapter 1, cluster analysis is one of the pattern recognition techniques which may be used to find natural clusters corresponding to natural classes in the data, to break up all objects (samples) in a pattern classification problem into smaller groups so that a simpler pattern classification problem can be solved on each subgroup, to reconstruct a probability density from the samples, to get the most out of a small set of labeled samples by extrapolating class membership to unlabeled samples, or simply to understand better the system from which the data arises.

In this thesis attention will be given only to the optimization-partitioning method, which iteratively handles a partition into m clusters. In the subsequent sections a brief characterization of this type of clustering problem will be given.


2.2.1 DEFINITIONS AND NOTATIONS IN CLUSTER ANALYSIS

Let the set $C = \{u_v\}_{v=1}^N$ denote N objects. We assume that each object can be represented by k measurable properties. Thus, the data is available to the investigator in the form of a set of N k-dimensional measurement vectors, so that the objects are represented as N points in a k-dimensional space. The usual representation of the data is contained in the so-called object-property table as given in Table III. The properties may be quantitative (length, weight, amplitude, age, etc.) or qualitative (state, name, illness, etc.). Since the analysis may be characterized by the use of resemblance or disresemblance measures between the objects to be identified, the sort of properties to be measured is of crucial importance to the investigator's choice of the clustering model and the clustering operator. An evaluation of the significance of resemblance and disresemblance measures, or, equivalently, similarity and dissimilarity measures, is made in [34].

A similarity measure $s(u_v, u_w)$ gives a numerical value to the notion of closeness between objects $u_v$ and $u_w$ derived from the observed values of their k properties. Usually, a high value of $s(\cdot,\cdot)$ indicates high similarity or closeness.

The fundamental purpose of a similarity measure is to induce an order on the set of pairs $(u_v, u_w)$ for all $(v, w) \in \{1, 2, \ldots, N\}$. Many similarity measures may induce the same order; in fact, simplicity and calculability are the guide of the specialist in data structure analysis. His choice should also be influenced by what he knows of the peculiarities of the data.

DEFINITION 2.2.1

Given a data set $C = \{u_v\}_{v=1}^N$, a similarity measure $s(u_v, u_w)$ is a real valued function whose domain is $C \times C$ and for which $a \in [0,1]$ is given such that

(i) $0 \le s(u_v, u_w) \le a$, $\quad (\forall v, w \in \{1, 2, \ldots, N\})$ (2.2.1a)

(ii) $s(u_v, u_w) = a \;\Leftrightarrow\; u_v = u_w$ (2.2.1b)

(iii) $s(u_v, u_w) = s(u_w, u_v)$. (2.2.1c)

*

The pairwise similarities are arranged in a symmetric $N \times N$ object-similarity table as given in Table IV.

Commonly, when a dissimilarity measure is used, it is given by

\[ d(u_v, u_w) = a - s(u_v, u_w). \tag{2.2.1d} \]

Usually, the range of variation is $[0,1]$, thus $a = 1$.
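As an illustration of Definition 2.2.1 and Eq. 2.2.1d, one possible choice (a hypothetical sketch; the thesis does not prescribe this particular measure) derives a similarity from the Euclidean distance between measurement vectors:

```python
import numpy as np

def similarity(x_v, x_w, scale=1.0, a=1.0):
    """A candidate similarity s(u_v, u_w) = a * exp(-||x_v - x_w|| / scale).
    It satisfies Definition 2.2.1: bounded by a, maximal iff the measurement
    vectors coincide, and symmetric in its arguments."""
    return a * np.exp(-np.linalg.norm(np.asarray(x_v) - np.asarray(x_w)) / scale)

d = lambda x_v, x_w: 1.0 - similarity(x_v, x_w)   # dissimilarity, Eq. 2.2.1d with a = 1
print(similarity([0, 0], [0, 0]), similarity([0, 0], [3, 4]), d([0, 0], [3, 4]))
```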

Often a distance measure is used as a particular dissimilarity measure.

DEFINITION 2.2.2

Given a data set $C = \{u_v\}_{v=1}^N$, a distance measure $\partial(u_v, u_w)$ is a real valued function whose domain is $C \times C$ such that

(i) $\partial(u_v, u_w) \ge 0$ (2.2.2a)

where equality holds iff $u_v = u_w$,

(ii) $\partial(u_v, u_w) = \partial(u_w, u_v)$. (2.2.2b)

*

If $\partial(u_v, u_w)$ satisfies

\[ \partial(u_v, u_w) \le \partial(u_v, u_x) + \partial(u_x, u_w) \tag{2.2.3} \]

(triangle inequality), it is a metric distance. If

\[ \partial(u_v, u_w) \le \max\,[\partial(u_v, u_x),\ \partial(u_x, u_w)], \tag{2.2.4} \]

$\partial(u_v, u_w)$ is called an ultrametric distance.
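A toy numerical check of both inequalities (hypothetical data; the Euclidean distance is metric but, as the example shows, generally not ultrametric):

```python
import numpy as np
from itertools import permutations

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # three objects, k = 2
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

triangle    = all(D[v, w] <= D[v, x] + D[x, w] for v, w, x in permutations(range(3), 3))
ultrametric = all(D[v, w] <= max(D[v, x], D[x, w]) for v, w, x in permutations(range(3), 3))
print(triangle, ultrametric)   # True, False: sqrt(2) > max(1, 1)
```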

In addition to Definition 2.2.2 we have the following definition.

objects    1          2          3         ...   k
u_1        u_1^(1)    u_1^(2)    u_1^(3)   ...   u_1^(k)
u_2        u_2^(1)    u_2^(2)    u_2^(3)   ...   u_2^(k)
u_3        u_3^(1)    u_3^(2)    u_3^(3)   ...   u_3^(k)
 :          :          :          :              :
u_N        u_N^(1)    u_N^(2)    u_N^(3)   ...   u_N^(k)

TABLE III: The object-property table for C = {u_v}_{v=1}^N

objects    u_1          u_2          u_3          ...   u_N
u_1        s(u_1,u_1)
u_2        s(u_1,u_2)   s(u_2,u_2)
u_3        s(u_1,u_3)   s(u_2,u_3)   s(u_3,u_3)
 :          :            :            :
u_N        s(u_1,u_N)   s(u_2,u_N)   s(u_3,u_N)   ...   s(u_N,u_N)

TABLE IV: The object-similarity table for C = {u_v}_{v=1}^N
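The relation between the two tables can be sketched as follows (a hypothetical construction using the exponential similarity from the earlier example):

```python
import numpy as np

# Object-property table (Table III): N = 4 objects, k = 2 properties.
table_iii = np.array([[0.0, 0.0],
                      [1.0, 0.0],
                      [0.0, 2.0],
                      [1.0, 2.0]])

# Object-similarity table (Table IV): symmetric, with s(u_v, u_v) = a = 1.
pairwise = np.linalg.norm(table_iii[:, None, :] - table_iii[None, :, :], axis=-1)
table_iv = np.exp(-pairwise)       # same s(.,.) as sketched above
print(np.round(table_iv, 3))
```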
