Discrimination of Symbolic Objects


ACTA UNIVERSITATIS LODZIENSIS

FOLIA OECONOMICA 206, 2007

Andrzej Dudek*

DISCRIMINATION OF SYMBOLIC OBJECTS

Abstract. Symbolic Data Analysis is an extension of multivariate analysis that deals with data represented in an extended form. Each cell of a symbolic data table (a symbolic variable) can contain a single quantitative value, a categorical value, an interval, a multivalued variable, or a multivalued variable with weights. Variables can also be taxonomic, hierarchically dependent, or logically dependent. Because of this extended data representation, Symbolic Data Analysis introduces new methods and also adapts traditional methods so that symbolic data can be used as their input. The article shows how the “classical” Bayesian discrimination rule can be adapted to deal with data of different symbolic types, and presents kernel intensity estimators for symbolic data and methods of obtaining the probabilities of belonging to the classes. An example of using symbolic discriminant analysis for electronic mail filtering is given.

Key words: discrimination, symbolic object, kernel density estimators.

1. INTRODUCTION

Bayesian discriminant analysis is a well-known method, often used in multivariate data analysis. Recently, however, it has found an unexpected use in computer science: filtering unsolicited electronic mail (spam). This paper describes a computational example of discriminant analysis of symbolic objects representing e-mails.

The goals of discriminant analysis and the methods of estimating the distribution density function for each class are described in the first part of the article, with special focus on the non-parametric kernel density estimation method.

The second part introduces the notions of symbolic object and symbolic variable and describes the main dissimilarity measures for symbolic objects.

* Ph.D., Department of Econometrics and Computer Science, University of Economics in Wrocław.


The third part shows how methods of discriminant analysis, and of kernel discriminant analysis in particular, may be adapted for symbolic objects.

Finally, the described methods are used for filtering electronic mail. The procedure assigns two symbolic objects, each with seven variables, to two classes, one containing 17 messages pre-classified as spam and one containing 13 legitimate messages.

The paper finishes with conclusions, including suggestions for future areas of research.

2. DISCRIMINANT ANALYSIS AND KERNEL DENSITY ESTIMATION

Discriminant analysis assigns objects from a test set to an existing structure of classes (training set).

Most discriminant methods are based on the maximum likelihood rule, which says that an object from the test set should be assigned to the class of the training set for which the value of the distribution density function achieves its maximum. This rule is equivalent to the Bayesian rule, which defines misclassification cost in terms of a priori and a posteriori probabilities.
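
As an illustration of the rule above, the following Python sketch assigns an object to the class with the largest value of prior times density. The normal class densities, the class labels "A" and "B" and the function names are assumptions made only for this example; with equal priors the rule reduces to the maximum likelihood rule.

```python
import math
from typing import Callable, Dict, Optional


def normal_pdf(mean: float, sd: float) -> Callable[[float], float]:
    """Return the density of N(mean, sd^2); used here only as a toy class density."""
    def pdf(x: float) -> float:
        z = (x - mean) / sd
        return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
    return pdf


def classify(x: float,
             densities: Dict[str, Callable[[float], float]],
             priors: Optional[Dict[str, float]] = None) -> str:
    """Assign x to the class maximising prior * density."""
    if priors is None:
        # Equal priors: the Bayes rule coincides with maximum likelihood.
        priors = {k: 1.0 / len(densities) for k in densities}
    scores = {k: priors[k] * f(x) for k, f in densities.items()}
    return max(scores, key=scores.get)


# Hypothetical two-class problem on the real line.
densities = {"A": normal_pdf(0.0, 1.0), "B": normal_pdf(3.0, 1.0)}
label_ml = classify(1.0, densities)                              # equal priors
label_bayes = classify(1.0, densities, {"A": 0.01, "B": 0.99})   # skewed priors
```

The same observation is assigned to different classes in the two calls, which shows how strongly unequal a priori probabilities can override the density values.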

In earlier discriminant methods (the Altman equation, Fisher analysis) it was assumed that objects in the classes of the training set were normally distributed, but in real discrimination problems such an assumption cannot be made. Therefore one of the main problems of modern discriminant analysis is to estimate the distribution density function for each class of the training set.

There are three approaches to achieve this (see: Hand 1981; Goldstein 1975; Bock, Diday 2000, p. 235-293):

• linear estimation (Fisher),

• quadratic estimation,

• non-parametric methods.

One of the most commonly used non-parametric methods of estimating the distribution density function is kernel density estimation. Equation (1) represents the general form of the kernel density estimator (Hand 1981)

where:

f_k - kernel density estimator,

d - dimension,

k - class number,


h_k - window bandwidth for the k-th class (a parameter),

K(...) - kernel.
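
A textbook form of the kernel density estimator that agrees with the symbols listed above is given here. It is a reconstruction, not a quotation from Hand (1981): $n_k$ (the number of training objects in class $k$) and $x_{ki}$ (the $i$-th training object of class $k$) are symbols assumed for this purpose.

```latex
f_k(x) = \frac{1}{n_k h_k^{d}} \sum_{i=1}^{n_k} K\!\left(\frac{x - x_{ki}}{h_k}\right) \qquad (1)
```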

The kernel can take various forms. In the simplest case its value equals 1 if all coordinates of its argument are smaller than 1; in other cases it is equal to 0.
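
The simplest kernel described above can be tried out with a few lines of Python. The sketch assumes that "smaller than 1" is meant in absolute value and that the estimator has the textbook form, i.e. the sum of kernel values divided by n_k * h_k^d; the function names and the sample data are hypothetical.

```python
from typing import Sequence


def box_kernel(u: Sequence[float]) -> float:
    """1 if every coordinate of u is below 1 in absolute value (assumption), else 0."""
    return 1.0 if all(abs(c) < 1.0 for c in u) else 0.0


def kernel_density(x: Sequence[float],
                   sample: Sequence[Sequence[float]],
                   h: float) -> float:
    """Kernel density estimate at x from the objects of one class, bandwidth h."""
    d = len(x)
    total = sum(
        box_kernel([(xj - sj) / h for xj, sj in zip(x, s)])
        for s in sample
    )
    return total / (len(sample) * h ** d)


# Hypothetical class of three two-dimensional training objects.
sample = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
density_h1 = kernel_density((0.5, 0.5), sample, h=1.0)
density_h2 = kernel_density((0.5, 0.5), sample, h=2.0)
```

A kernel that takes the value 1 on the whole window does not integrate to one, so the values are only meaningful for comparing classes estimated with the same bandwidth.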

3. SYMBOLIC OBJECTS AND SYMBOLIC VARIABLES

3.1. Symbolic data table

Symbolic data, unlike classical data, are more complex than tables of numeric values. While Tab. 1 presents the usual data representation, with objects in rows, variables (attributes) in columns and a number in each cell, Tab. 2 presents symbolic objects with interval, set and text data.

Table 1. Classical data situation

X    Variable 1    Variable 2    Variable 3
1    1             108           11.98
2    1.3           123           -23.37
3    0.9           99            14.35

Source: own research.

Table 2. Symbolic data table

X    Variable 1    Variable 2            Variable 3    Variable 4
1    (0.9; 0.9)    {106; 108; 110}       (11; 98)      {blue; green}
2    (1; 2)        {123; 124; 125}       (-23; 37)     {light-grey}
3    (0.9; 1.3)    {100; 102; 99; 97}    (14; 35)      {pale}

Source: own research.

H.-H. Bock and E. Diday (2000) define five types of symbolic variables:

• single quantitative value,

• categorical value,

• interval,


• multivalued variable,

• multivalued variable with weights.

Variables in a symbolic object can also be, regardless of their type (Diday 2002):

• taxonomic - representing a hierarchical structure,

• hierarchically dependent,

• logically dependent.
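
To make the five variable types concrete, the sketch below shows one possible in-memory representation of a symbolic data table in Python. The type aliases, the function `cell_type` and the weighted cell are assumptions introduced for illustration; they do not reproduce the SODAS file format.

```python
from typing import Dict, FrozenSet, Tuple, Union

Interval = Tuple[float, float]                 # (lower bound, upper bound)
MultiValued = FrozenSet[Union[float, str]]     # set of admissible values
Weighted = Dict[str, float]                    # value -> weight
Cell = Union[float, str, Interval, MultiValued, Weighted]


def cell_type(cell: Cell) -> str:
    """Name the symbolic variable type of a single table cell."""
    if isinstance(cell, dict):
        return "multivalued variable with weights"
    if isinstance(cell, frozenset):
        return "multivalued variable"
    if isinstance(cell, tuple) and len(cell) == 2:
        return "interval"
    if isinstance(cell, str):
        return "categorical value"
    if isinstance(cell, (int, float)):
        return "single quantitative value"
    raise TypeError(f"unsupported cell: {cell!r}")


# Object 1 of Tab. 2, extended with a hypothetical weighted cell.
symbolic_object = {
    "Variable 1": (0.9, 0.9),
    "Variable 2": frozenset({106, 108, 110}),
    "Variable 3": (11, 98),
    "Variable 4": frozenset({"blue", "green"}),
    "Variable 5": {"blue": 0.7, "green": 0.3},   # invented for illustration
}
types = {name: cell_type(cell) for name, cell in symbolic_object.items()}
```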

3.2. Dissimilarity measures for symbolic objects

Because of the structure of symbolic objects, usual measures such as the Manhattan distance, Euclidean distance, Canberra distance or Minkowski metrics cannot be used. With symbolic data, other measures must be used.

D. Malerba et al. (2001) define three main types of dissimilarity measures for symbolic objects:

• Gowda, Krishna and Diday - mutual neighbourhood value, with no taxonomic variables implemented,

• Ichino and Yaguchi - dissimilarity measure based on the operators of Cartesian join and Cartesian meet, which extend the operators ∪ (sum of sets) and ∩ (product of sets) onto all data types represented in a symbolic object,

• De Carvalho measures - extension of the Ichino and Yaguchi measure based on a comparison function (CF), an aggregation function (AF) and the description potential of an object.
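
The Ichino and Yaguchi idea can be sketched in Python for interval and set-valued variables. The sketch assumes the commonly quoted form of the measure, phi(A, B) = |A ⊕ B| - |A ⊗ B| + gamma * (2|A ⊗ B| - |A| - |B|) with 0 <= gamma <= 0.5, aggregated over variables by a Minkowski-type sum of order q; the function names and the two sample objects are hypothetical.

```python
from typing import FrozenSet, Optional, Sequence, Tuple, Union

Interval = Tuple[float, float]
Value = Union[Interval, FrozenSet]


def join(a: Value, b: Value) -> Value:
    """Cartesian join: set union, or the smallest interval covering both."""
    if isinstance(a, frozenset):
        return a | b
    return (min(a[0], b[0]), max(a[1], b[1]))


def meet(a: Value, b: Value) -> Optional[Value]:
    """Cartesian meet: set intersection, or interval overlap (None if empty)."""
    if isinstance(a, frozenset):
        return a & b
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


def size(v: Optional[Value]) -> float:
    """|.|: cardinality of a set, length of an interval, 0 for an empty meet."""
    if v is None:
        return 0.0
    if isinstance(v, frozenset):
        return float(len(v))
    return float(v[1] - v[0])


def phi(a: Value, b: Value, gamma: float = 0.5) -> float:
    """Dissimilarity of two values of one variable."""
    m = size(meet(a, b))
    return size(join(a, b)) - m + gamma * (2.0 * m - size(a) - size(b))


def ichino_yaguchi(o1: Sequence[Value], o2: Sequence[Value],
                   q: float = 2.0, gamma: float = 0.5) -> float:
    """Minkowski-type aggregation of phi over all variables."""
    return sum(phi(a, b, gamma) ** q for a, b in zip(o1, o2)) ** (1.0 / q)


# Two hypothetical objects: one interval variable and one set-valued variable.
o1 = [(0.0, 2.0), frozenset({1, 2, 3})]
o2 = [(1.0, 3.0), frozenset({2, 3, 4})]
distance = ichino_yaguchi(o1, o2)
```

The sketch does not normalise the variables, so a variable measured on a wide scale dominates the result; normalised and weighted variants address this.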

Table 3 compares the formulas of these measures.

Table 3. Dissimilarity measures for symbolic data

No.  Dissimilarity measure for variables / Dissimilarity measure for objects

1    variables: $D(A, B) = D_\pi(A, B) + D_s(A, B) + D_c(A, B)$
     objects: $d(O_1, O_2) = \sum_{j} D(A_j, B_j)$

2    variables: $\phi(A, B) = |A \oplus B| - |A \otimes B| + \gamma (2 |A \otimes B| - |A| - |B|)$
     objects: $d_q(O_1, O_2) = \left( \sum_{j} \phi(A_j, B_j)^q \right)^{1/q}$

3    objects: $d_q(O_1, O_2) = \left( \sum_{j} \psi(A_j, B_j)^q \right)^{1/q}$

4    objects: $d_q(O_1, O_2) = \left( \sum_{j} [w_j \psi(A_j, B_j)]^q \right)^{1/q}$

5    variables: $d_i(A, B)$, $i = 1, 2, \ldots, 5$
     objects: $d_q^i(O_1, O_2) = \left( \sum_{j} [w_j d_i(A_j, B_j)]^q \right)^{1/q}$

6    variables: …
     objects: …

7    objects: $d'_1(O_1, O_2) = \pi(O_1 \oplus O_2) - \pi(O_1 \otimes O_2) + \gamma (2 \pi(O_1 \otimes O_2) - \pi(O_1) - \pi(O_2))$

8    objects: $d'_2(O_1, O_2) = [\pi(O_1 \oplus O_2) - \pi(O_1 \otimes O_2) + \gamma (2 \pi(O_1 \otimes O_2) - \pi(O_1) - \pi(O_2))] / \pi(O^E)$

9    objects: $d'_3(O_1, O_2) = [\pi(O_1 \oplus O_2) - \pi(O_1 \otimes O_2) + \gamma (2 \pi(O_1 \otimes O_2) - \pi(O_1) - \pi(O_2))] / \pi(O_1 \oplus O_2)$

10   variables: $d_i(A, B)$, $i = 1, 2, \ldots, 5$
     objects: …

$O_1$, $O_2$ - represent symbolic objects with variables $(A_j, B_j)$.

Source: own research based on: Bock, Diday 2000; Diday 2002; Gatnar 1998; Malerba et al. 2001.

4. DISCRIMINANT ANALYSIS OF SYMBOLIC OBJECTS

4.1. Kernel density estimation for symbolic objects

In the space of symbolic objects the notion of a distribution density is problematic: the integral operator is not defined in this kind of space, and it is not a subspace of a Euclidean space either.

H.-H. Bock and E. Diday (2000) introduce a replacement of the kernel density estimator for symbolic objects, the kernel intensity estimator:

$$I_k(x) = \frac{1}{n_k} \sum_{i=1}^{n_k} K_{h_k}(x, x_{ki}) \qquad (2)$$

where:

p - number of classes in the training set,

k - class number,

I_k - kernel intensity estimator,

n_k - number of objects in the k-th class,

h_j - window bandwidth for the j-th class (a parameter),

K_{h_j}(x, y) - unified kernel for symbolic objects:

$$K_{h_j}(x, y) = \begin{cases} 1 & \text{for } d_j(x, y) < h_j \\ 0 & \text{for } d_j(x, y) > h_j \end{cases} \qquad (3)$$

d_j(x, y) - dissimilarity measure for symbolic objects, one of the dissimilarity measures from Tab. 3.

4.2. Finding a posteriori probabilities for kernel intensity estimators

The algorithm for finding the a posteriori probabilities of belonging to the classes of the training set for each object in the test set is iterative. Starting from equal probabilities for each class, it determines the probability in the t-th step according to the following formula (Bock, Diday 2000):

where:

g - number of classes in the training set,

m - number of all objects in the training set,

k - class number,

I_k - kernel intensity estimator,

t - step of iteration.

5. SPAM FILTERING WITH DISCRIMINANT ANALYSIS FOR SYMBOLIC OBJECTS

In the research, the training set contained 30 objects describing electronic messages. It was divided into two classes, one containing 17 objects classified as spam and the other containing 13 legitimate messages. Each object has seven variables:

• length of message;

• number of attachments;

• number of receivers;

• key-words;

• title;

• sender’s address;

• 1 if the sender's server is listed in the Open Relay DataBase¹, 0 in other cases.

The first three variables are numerical, the fourth and fifth are multivalued, the sixth is categorical and the seventh is Boolean. Microsoft Access 2000 was used for storing information about the messages from the training set, and the following modules of the Symbolic Official Data Analysis Software (SODAS) were used for assigning objects from the test set to classes:

• DB2SO,

• DI,

• DKS.
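
The seven variables listed above could be extracted from a raw message along the following lines. This is only a sketch based on Python's standard `email` package, not the procedure used in the research: the function names, the dictionary keys, the naive tokenizer and the keyword list are assumptions, and the Open Relay DataBase flag is passed in by the caller rather than looked up.

```python
import email
import re
from email import policy
from email.utils import getaddresses, parseaddr
from typing import Dict, Iterable


def words(text: str) -> frozenset:
    """Lower-cased word set; a deliberately naive ASCII tokenizer."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def describe_message(raw: str, spam_keywords: Iterable[str],
                     in_open_relay_db: bool) -> Dict[str, object]:
    """Build a seven-variable symbolic description of one message."""
    msg = email.message_from_string(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    receivers = getaddresses([str(h) for h in headers])
    return {
        "length": len(text),                                   # numerical
        "attachments": sum(1 for _ in msg.iter_attachments()), # numerical
        "receivers": len(receivers),                           # numerical
        "key_words": words(text) & frozenset(k.lower() for k in spam_keywords),
        "title": words(str(msg.get("Subject", ""))),           # multivalued
        "sender": parseaddr(str(msg.get("From", "")))[1],      # categorical
        "open_relay": 1 if in_open_relay_db else 0,            # Boolean
    }
```

Calling `describe_message(raw, ["rolex", "replica"], True)` on the text of a message returns a dictionary whose values are numbers, sets and strings, i.e. the cell contents of one row of a symbolic data table.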

The test set had two objects. Their contents are listed in Fig. 1.

Test set - object 1

Received: from unilodge.com.au (61.110.152.158) by oscar.ae.jgora.pl with MERCUR Mailserver (v4.02.30 Mjk4NjItNjQwNS0xOTIxMg==) for <andrzej@ae.jgora.pl>; Fri, 15 Oct 2004 05:44:28 +0200

Received: from 152.109.219.62 by smtp.sebank.se; Fri, 22 Oct 2004 03:43:03 +0000

Message-ID: <03b801c4b7e9$8102fe47$93e8521T@unilodge.com.au>

From: “Irma Tillman” <irmatillmandn@sebank.se>

To: andrzej@ae.jgora.pl

Subject: Order Rolex or other Swiss watches online

Date: Fri, 15 Oct 2004 07:43:02 +0400

X-Envelope-To: <andrzej@ae.jgora.pl>

X-Envelope-From: <irmatillmandn@sebank.se>

Heya,

Do you want a rolex watch?

In our online store you can buy replicas of Rolex watches. They look and feel exactly like the real thing.

¹ The commonly known black- and grey-lists of spammers’ IP addresses, available at http://www.ordb.org. Many popular e-mail servers use these lists to deny access to spammers.


Test set - object 2

Received: from pop.uni.lodz.pl (212.191.64.2) by oscar.ae.jgora.pl with MERCUR Mailserver (v4.02.30 Mjk4NjItNjQwNS0xOTIxMg==) for <andrzej@oscar.ae.jgora.pl>; Thu, 21 Oct 2004 12:44:23 +0200

Received: from mail.uni.lodz.pl (212.191.64.8) by pop.uni.lodz.pl (MX V5.3 An4q) with SMTP; Thu, 21 Oct 2004 12:38:50 +0200

Received: ...

From: “Konferencja MSA” <msa@uni.lodz.pl>

To: ... <marekw@oscar.ae.jgora.pl>, <andrzej@oscar.ae.jgora.pl>, <abak@oscar.ae.jgora.pl>,

Subject: konferencja MSA’04

Date: Thu, 21 Oct 2004 12:37:42 +0200

X-Envelope-To: <andrzej@oscar.ae.jgora.pl>

X-Envelope-From: <msa@uni.lodz.pl>

Szanowni Uczestnicy Konferencji Wielowymiarowa Analiza Statystyczna = MSA’2004!

W załączniku przesyłamy program konferencji. Referenci będą mieli do dyspozycji rzutnik pisma (folie) oraz rzutnik multimedialny.

[English: Dear Participants of the Multivariate Statistical Analysis Conference = MSA’2004! Please find the conference programme attached. Speakers will have an overhead projector (for transparencies) and a multimedia projector at their disposal.]

Fig. 1. Test set

Source: own research.

The output of the kernel discriminant analysis of symbolic objects is presented in Fig. 2.

SODAS FILE c:\sodas\spam15.sds

8 VARIABLES

32 INDIVIDUALS

3 CLASSES

* C1: 17 TRAINING OBJECTS

* C2: 13 TRAINING OBJECTS

* C3: 0 TRAINING OBJECTS

30 TRAINING OBJECTS

2 OBJECTS TO CLASSIFY

SMOOTHING PARAMETER: 1.0643

LOO ESTIMATED ERROR RATE: 0%

PRIOR PROBABILITIES: C1: 0.333 C2: 0.333 C3: 0.333

POSTERIOR PROBABILITIES:

OBJECT    CLASSE 1    CLASSE 2    CLASSE 3    MAX
31        1           0           0           1
32        0.143       0.857       0           2

Fig. 2. Results of discriminant analysis of objects from the test set

Source: own research. Report file from SODAS software.

Object 1 has been classified as spam with 100% probability, and object 2 has been classified as non-spam with 85.7% probability. These results correspond well with the intuitive assessment of the e-mails described by the objects.
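
The kind of computation behind such a report can be imitated on toy data. The Python sketch below combines a kernel intensity estimator (the share of objects of a class lying within the window h of x) with an iterative estimate of the class probabilities. The update used here, p_k(t+1) = (1/m) * sum over all m training objects x_i of p_k(t) I_k(x_i) / sum over the g classes l of p_l(t) I_l(x_i), is an assumed EM-style form built from the symbols of Section 4.2 and should be checked against Bock and Diday (2000) before being relied upon. All function names, the one-dimensional objects and the absolute-difference dissimilarity are hypothetical, and every class is assumed to be non-empty.

```python
from typing import Callable, Dict, List


def kernel_intensity(x, members: List, h: float,
                     dissim: Callable[[object, object], float]) -> float:
    """Share of class members within the window; d == h counts as outside (assumption)."""
    return sum(1.0 for y in members if dissim(x, y) < h) / len(members)


def estimate_priors(training: Dict[str, List], h: Dict[str, float],
                    dissim, steps: int = 20) -> Dict[str, float]:
    """Iterate the assumed update, starting from equal class probabilities."""
    labels = list(training)
    objects = [y for k in labels for y in training[k]]
    m = len(objects)
    # Intensities of every training object with respect to every class.
    intens = [{k: kernel_intensity(y, training[k], h[k], dissim) for k in labels}
              for y in objects]
    p = {k: 1.0 / len(labels) for k in labels}
    for _ in range(steps):
        new_p = {k: 0.0 for k in labels}
        for row in intens:
            # Positive as long as dissim(y, y) = 0 < h for the object's own class.
            denom = sum(p[l] * row[l] for l in labels)
            for k in labels:
                new_p[k] += p[k] * row[k] / denom
        p = {k: v / m for k, v in new_p.items()}
    return p


def posterior(x, training, priors, h, dissim) -> Dict[str, float]:
    """A posteriori class probabilities of a test object x."""
    scores = {k: priors[k] * kernel_intensity(x, training[k], h[k], dissim)
              for k in training}
    total = sum(scores.values())
    if total == 0.0:
        return dict(priors)   # assumption: fall back to priors far from every class
    return {k: s / total for k, s in scores.items()}


# Toy data: numbers stand in for symbolic objects, |x - y| for a Tab. 3 measure.
training = {"spam": [0.0, 0.1, 0.2], "legitimate": [5.0, 5.1]}
h = {"spam": 1.0, "legitimate": 1.0}
dissim = lambda a, b: abs(a - b)
priors = estimate_priors(training, h, dissim)
post = posterior(0.05, training, priors, h, dissim)
```

Replacing `dissim` with one of the dissimilarity measures for symbolic objects turns the same code into a classifier for symbolic descriptions of messages.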

6. CONCLUSIONS

• Methods of discriminant analysis based on non-parametric distribution density estimation can be adapted to symbolic data.

• Discriminant analysis of symbolic objects can be used for filtering incoming e-mail messages and marking spam.

• The results are promising but also quite preliminary. The relatively small size of the training and test sets results from the fact that the process of creating symbolic objects describing messages has not been automated.

• More accurate measurement of the quality of filtering requires full automation of the process, which can be obtained by creating a simple POP3/IMAP client combined with a text parser, a symbolic object generator and the algorithms described in the paper. The author is currently working on such a heuristic, symbolic, Bayesian anti-spam filter and hopes to share the results in the not too distant future, but for now the problem remains an open issue.

REFERENCES

Bock H.-H., Diday E. (2000), Analysis of symbolic data. Exploratory methods for extracting statistical information from complex data, Springer-Verlag, Berlin.

Diday E. (2002), An introduction to symbolic data analysis and the SODAS software, J.S.D.A., International E-Journal, http://www.jsda.unina2.it/newjsda/volumes/VOL0/Edwin.PDF.

Dudek A. (2004), Miary podobieństwa obiektów symbolicznych. Odległość Ichino-Yaguchiego, “Prace Naukowe Akademii Ekonomicznej we Wrocławiu”, nr 1021, 100-106.

Gatnar E. (1998), Symboliczne metody klasyfikacji danych, Wydawnictwo Naukowe PWN.


Goldstein M. (1975), Comparison of Some Density Estimate Classification Procedures, “Journal of the American Statistical Association”, Part I, 70, Issue 351, 666-669.

Hand D. J. (1981), Kernel Discriminant Analysis, Wiley, New York.

Holden S. (2004), Porównanie serwerowych filtrów bayesowskich, “Hakin9”, 2, 62-71.

Malerba D., Esposito F., Gioviale V., Tamma V. (2001), Comparing Dissimilarity Measures for Symbolic Data Analysis, “New Techniques and Technologies for Statistics” and “Exchange of Technology and Know-how” conference materials (ETK-NTTS’01), 473-481.

SODAS. Documentation, SODAS package documentation v. 1.20, available at http://www.ceremade.dauphine.fr/~touati/aidedoc/.

Andrzej Dudek

DISCRIMINATION OF SYMBOLIC OBJECTS

Symbolic data analysis is an extension of the methods of multivariate statistical analysis with respect to the way data are represented. Each cell of a symbolic data table (a symbolic variable) can represent data in the form of numbers, qualitative (text) data, numeric intervals, a set of values, or a set of values with weights. Variables can moreover represent a tree structure and be hierarchically or logically dependent. Because of this way of representing data, symbolic data analysis introduces new methods of processing them and implements traditional methods in such a way that symbolic data can be their input. The article shows how “classical” Bayesian analysis can be adapted to various types of symbolic data by means of the kernel intensity estimator for symbolic objects. It concludes with an example of applying discriminant analysis of symbolic objects to filtering incoming electronic mail.