
Iwona Kasprzyk*

VISUALISATION OF A TWO-WAY CONTINGENCY TABLE IN R

ABSTRACT. The contingency table is one of the most popular ways of presenting categorical data. We can visualise the data contained in a two-way contingency table using the vcd and graphics packages in the R software. The main aim of this paper is to show the use of various types of plots: the fourfold display, the mosaic display, the sieve diagram and the association plot. In addition, we can describe the relations among different categories of variables by applying correspondence analysis.

Key words: contingency table, correspondence analysis, fourfold display, association plot, mosaic display, sieve diagram

I. INTRODUCTION

This paper presents various types of plots for visualizing a contingency table, especially the two-way table.

As an example, we present the analysis of unemployment in the city of Bytom, a place strongly affected by the issue. The unemployment rate there has been one of the highest in the Silesia area: in the first half of 2006 it was over 23%. The analysis is based on the following variables: time without work, age, level of education and job seniority.

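II. THE ASSOCIATION PLOT

The association plot was proposed by Cohen (1980). The height of each rectangle is proportional to the Pearson residual $r_{ij}$:

$$r_{ij} = \frac{n_{ij} - e_{ij}}{\sqrt{e_{ij}}}, \qquad \text{where } e_{ij} = \frac{n_{i+}\, n_{+j}}{n}.$$

Here $n_{ij}$ is the observed frequency, $e_{ij}$ is the frequency expected under independence, $n_{i+}$ and $n_{+j}$ are the row and column totals, and $n$ is the sample size.

The width of each rectangle is proportional to $\sqrt{e_{ij}}$, so the area of the rectangle is proportional to $n_{ij} - e_{ij}$. If the difference is positive, the rectangle is filled with black; if it is negative, the rectangle is grey.

* PhD Student, Department of Statistics, The Karol Adamiecki University of Economics, Katowice.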

Figure 1 presents the unemployment analysis for Bytom. In the R software, the commands can be entered as follows:

> library(graphics)
> # read the table of counts (time without work by age)
> dat <- read.table("dane-Bytom.R", header = FALSE)
> rownames(dat) <- c("to 1", "1-3", "3-6", "6-12", "12-24", "over 24")
> colnames(dat) <- c("18-24", "25-34", "35-44", "45-54", "55-59", "60-64")
> # assocplot() expects a matrix, not a data frame
> dat1 <- as.matrix(dat)
> assocplot(dat1)

Figure 1: The association plot for age and time without work
Source: own research.
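The quantities behind the association plot can be checked numerically. The following is a minimal sketch in base R. Since the Bytom age table is not reproduced in the text, it assumes the 2 x 2 table of sex and time without work from the fourfold listing in Section IV; the object names (`tab`, `expected`, `pearson`, `area`) are illustrative only.

```r
# 2 x 2 table taken from the fourfold listing in Section IV (used here only as an example)
tab <- matrix(c(3242, 2767, 3540, 4580), nrow = 2,
              dimnames = list("time without work" = c("to 12 month", "over 12 month"),
                              sex = c("men", "women")))

n        <- sum(tab)
expected <- outer(rowSums(tab), colSums(tab)) / n   # e_ij = n_i+ * n_+j / n
pearson  <- (tab - expected) / sqrt(expected)       # r_ij = (n_ij - e_ij) / sqrt(e_ij)

# Rectangle geometry: height ~ r_ij and width ~ sqrt(e_ij), so area ~ n_ij - e_ij
area <- pearson * sqrt(expected)

# Cross-check against the residuals reported by chisq.test()
# (correct = FALSE switches off the continuity correction for 2 x 2 tables)
reference <- chisq.test(tab, correct = FALSE)$residuals

print(round(pearson, 2))
```

Calling `assocplot(tab)` on the same matrix draws the corresponding plot.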

III. THE SIEVE DIAGRAM

The sieve diagram was proposed by Riedwyl and Schüpbach (1983), and in 1994 it was called a parquet diagram. This kind of plot divides a unit square into rectangles. The height of each rectangle is proportional to the row marginal frequency ($n_{i+}$), and the width is proportional to the column marginal frequency ($n_{+j}$). Hence, the area of each rectangle is proportional to the expected frequency ($e_{ij}$).

If the difference between the observed and expected frequency is positive, the rectangle is filled with dark grey; if it is negative, the rectangle is light grey. These colours indicate whether the deviation from independence is positive or negative. The squares drawn inside each rectangle reflect the observed frequency contained in the contingency table.
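The geometry of the sieve diagram can be sketched in a few lines of base R. This is only an illustration under stated assumptions: it reuses the 2 x 2 table of sex and time without work from the fourfold listing in Section IV rather than the age table, and the names `heights`, `widths`, `areas` and `shade` are hypothetical.

```r
# 2 x 2 table from the fourfold listing in Section IV (illustrative stand-in)
tab <- matrix(c(3242, 2767, 3540, 4580), nrow = 2,
              dimnames = list(time = c("to 12 month", "over 12 month"),
                              sex  = c("men", "women")))
n <- sum(tab)

heights <- rowSums(tab) / n          # proportional to the row marginals n_i+
widths  <- colSums(tab) / n          # proportional to the column marginals n_+j
areas   <- outer(heights, widths)    # areas of the rectangles inside the unit square

expected <- areas * n                # hence each area is proportional to e_ij

# Shading rule described in the text: dark grey if observed > expected, light grey otherwise
shade <- ifelse(tab > expected, "dark grey", "light grey")
print(shade)
```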

The following commands in the R software produce the sieve diagram for the two variables, age and time without work, shown in Figure 2:

> library(vcd)
> dat <- read.table("dane-Bytom.R", header = FALSE)
> rownames(dat) <- c("to 1", "1-3", "3-6", "6-12", "12-24", "over 24")
> colnames(dat) <- c("18-24", "25-34", "35-44", "45-54", "55-59", "60-64")
> # convert the data frame to a matrix before plotting
> dat1 <- as.matrix(dat)
> sieve(dat1, shade = TRUE)

Figure 2: The sieve diagram for age and time without work
Source: own research.

IV. THE FOURFOLD PLOT

The fourfold display was proposed by Friendly (1994). It can be used for 2 x 2 and 2 x 2 x k tables. In this kind of plot, the radius of each quarter-circle is proportional to $\sqrt{n_{ij}}$, so that its area is proportional to the cell frequency. The odds ratio is used as the measure of the strength of association between the two variables contained in the contingency table:

$$\theta = \frac{n_{11} / n_{12}}{n_{21} / n_{22}}.$$

In Bytom, women who had been without work for over 12 months constituted about 32% of all the registered unemployed in the first half of 2006. The odds ratio is 1.5, indicating that the odds of having been without work for up to 12 months rather than longer were 1.5 times higher for men than for women. Since the odds ratio differs from 1, sex and time without work are dependent.
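The figures quoted above can be reproduced from the 2 x 2 table used in the fourfold listing. The sketch below uses base R only. The approximate 95% interval for the odds ratio (computed on the log scale) is an addition that does not appear in the original analysis; it is included only to illustrate how one might judge whether the odds ratio differs from 1. All object names are illustrative.

```r
# Counts from the fourfold listing: rows = time without work, columns = sex
tab <- matrix(c(3242, 2767, 3540, 4580), nrow = 2,
              dimnames = list("time without work" = c("to 12 month", "over 12 month"),
                              sex = c("men", "women")))

# Odds ratio: theta = (n11 / n12) / (n21 / n22)
theta <- (tab[1, 1] / tab[1, 2]) / (tab[2, 1] / tab[2, 2])

# Assumption (not in the original paper): large-sample interval on the log scale
se_log <- sqrt(sum(1 / tab))
ci     <- exp(log(theta) + c(-1, 1) * qnorm(0.975) * se_log)

# Share of women without work for over 12 months among all the registered unemployed
share_women_long <- tab["over 12 month", "women"] / sum(tab)

cat("odds ratio:", round(theta, 2), "\n")
cat("approx. 95% interval:", round(ci, 2), "\n")
cat("share of women, over 12 months:", round(share_women_long, 3), "\n")
```

The same matrix can be passed to `fourfoldplot()` from the graphics package or to `fourfold()` from vcd to draw the display.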

The fourfold display in Figure 3 describes the unemployment analysis for Bytom. It can be produced with the following listing:

> library(vcd)
> # counts entered column by column: men first, then women
> dat <- c(3242, 2767, 3540, 4580)
> dim(dat) <- c(2, 2)
> rownames(dat) <- c("to 12 month", "over 12 month")
> colnames(dat) <- c("men", "women")
> names(dimnames(dat)) <- c("time without work", "sex")
> fourfold(dat, fontsize = 10)

Figure 3: The fourfold display for sex and time without work
Source: own research.

V. THE MOSAIC DISPLAY

The mosaic display was proposed by Hartigan and Kleiner (1981) and later developed by Friendly (1994). This plot is a graphical method for visualizing an n-way contingency table.

For the two-way table, the width of each rectangle is proportional to the marginal probabilities ($p_i = n_{i+}/n$) and the height of the rectangle is proportional to the conditional probabilities for the columns given the rows ($p_{j|i} = n_{ij}/n_{i+}$).

The area of the rectangle is proportional to the observed frequency and the given probabilities:

$$p_{ij} = p_i\, p_{j|i} = \frac{n_{i+}}{n} \cdot \frac{n_{ij}}{n_{i+}} = \frac{n_{ij}}{n} \qquad (2)$$

In the mosaic display, colour is of great significance, which is characteristic of this kind of plot. Cells with $|r_{ij}| < 2$ are filled with white, cells with $2 < |r_{ij}| < 4$ are filled with light grey, and cells with $|r_{ij}| > 4$ are filled with dark grey, where $r_{ij}$ is the Pearson residual.
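The factorisation of the cell probabilities and the shading rule can be sketched as follows. This is a hedged illustration in base R, again assuming the 2 x 2 table from the fourfold listing in Section IV instead of the age table. The text uses strict inequalities, so the treatment of residuals exactly equal to 2 or 4 in `cut()` is an assumption of this sketch.

```r
# 2 x 2 table from the fourfold listing in Section IV (illustrative stand-in)
tab <- matrix(c(3242, 2767, 3540, 4580), nrow = 2,
              dimnames = list(time = c("to 12 month", "over 12 month"),
                              sex  = c("men", "women")))
n <- sum(tab)

p_row   <- rowSums(tab) / n       # p_i   = n_i+ / n     (widths)
p_cond  <- tab / rowSums(tab)     # p_j|i = n_ij / n_i+  (heights)
p_joint <- p_row * p_cond         # p_ij  = p_i * p_j|i  (areas)

# Pearson residuals drive the shading
expected <- outer(rowSums(tab), colSums(tab)) / n
pearson  <- (tab - expected) / sqrt(expected)

# Shading classes: |r| < 2 white, 2 < |r| < 4 light grey, |r| > 4 dark grey
shading <- cut(abs(pearson), breaks = c(0, 2, 4, Inf),
               labels = c("white", "light grey", "dark grey"),
               include.lowest = TRUE)
shading <- matrix(as.character(shading), nrow = nrow(tab), dimnames = dimnames(tab))
print(shading)
```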

Figure 4: The mosaic display for age and time without work.
Source: own research.

The mosaic display in Figure 4 can be obtained using the following commands in the R software:

> library(vcd)
> dat <- read.table("dane-Bytom.R", header = FALSE)
> rownames(dat) <- c("to 1", "1-3", "3-6", "6-12", "12-24", "over 24")
> colnames(dat) <- c("18-24", "25-34", "35-44", "45-54", "55-59", "60-64")
> # convert the data frame to a matrix before plotting
> dat1 <- as.matrix(dat)
> mosaic(dat1, shade = TRUE)

Analyzing the example of unemployment in Bytom, one can observe that the unemployed aged 60 to 64 tended to have been without work for 12 to 24 months. Moreover, it can be concluded that the unemployed aged 25 to 34 tended to have been without work for only up to 1 month.

VI. CORRESPONDENCE ANALYSIS

Correspondence analysis is a multivariate method for categorical data. This technique analyzes the association between two or more categorical variables.

The contingency table is the starting point for the method. The next step is to create the correspondence matrix, whose elements are the elements of the contingency table divided by the sample size: $P = [n_{ij}/n]$. Using the generalized singular value decomposition, written here as the ordinary decomposition (4) of the standardized matrix (3), one can calculate the principal coordinates of the row profiles and of the column profiles (5):

$$A = D_r^{-1/2} (P - r c^T) D_c^{-1/2}, \qquad (3)$$

$$A = U \Gamma V^T, \qquad (4)$$

$$F = D_r^{-1/2} U \Gamma, \qquad G = D_c^{-1/2} V \Gamma, \qquad (5)$$

where $U$ is the matrix of left singular vectors, $V$ is the matrix of right singular vectors, $\Gamma$ is the diagonal matrix of singular values, $r$ and $c$ are the vectors of row and column masses, and $D_r$ and $D_c$ are the diagonal matrices of the row and column masses, respectively.
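Equations (3) to (5) translate almost line by line into base R. The sketch below is an illustration only: because the Bytom table is not reproduced in the text, it assumes the built-in `HairEyeColor` data (hair by eye colour, summed over sex) as a stand-in, and all object names are hypothetical.

```r
# Built-in hair by eye colour counts, summed over sex (illustrative stand-in data)
N <- apply(HairEyeColor, c(1, 2), sum)
n <- sum(N)

P      <- N / n                          # correspondence matrix
r_mass <- rowSums(P)                     # row masses r
c_mass <- colSums(P)                     # column masses c
Dr_inv_sqrt <- diag(1 / sqrt(r_mass))    # D_r^{-1/2}
Dc_inv_sqrt <- diag(1 / sqrt(c_mass))    # D_c^{-1/2}

# Equation (3): matrix of standardized residuals
A <- Dr_inv_sqrt %*% (P - outer(r_mass, c_mass)) %*% Dc_inv_sqrt

# Equation (4): singular value decomposition A = U Gamma V^T
dec   <- svd(A)
k     <- min(dim(N)) - 1                 # number of non-trivial dimensions
Gamma <- diag(dec$d[1:k], nrow = k)

# Equation (5): principal coordinates of the rows (F) and columns (G)
F_row <- Dr_inv_sqrt %*% dec$u[, 1:k, drop = FALSE] %*% Gamma
G_col <- Dc_inv_sqrt %*% dec$v[, 1:k, drop = FALSE] %*% Gamma
rownames(F_row) <- rownames(N)
rownames(G_col) <- colnames(N)

# Total inertia: equals the Pearson chi-squared statistic divided by n
total_inertia <- sum(dec$d^2)

print(round(F_row[, 1:2], 3))
```

Plotting the first two columns of `F_row` and `G_col` in one scatter plot gives a symmetric map of the kind shown in the figures of this section.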

The perception map below can be made by means of the MASS package in the R software using the following listing:

> library(MASS)
> dat <- read.table("dane-Bytom.R", header = FALSE)
> biplot(corresp(dat, nf = 2), xlab = "dim 1", ylab = "dim 2", cex = 0.8)

Figure 5: Perception map for age and time without work in the city of Bytom
Source: own research.
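Because the file with the Bytom data is not distributed with the paper, the listing can be tried on a table that ships with MASS. The sketch below assumes the `caith` data set (hair by eye colour in Caithness) purely as a stand-in; the rescaling step and the name `row_principal` are illustrative and not part of the original listing.

```r
library(MASS)

# 'caith' is a data frame of counts shipped with MASS (stand-in for the Bytom table)
ca <- corresp(caith, nf = 2)

# Singular values (canonical correlations) of the first two dimensions
print(ca$cor)

# corresp() returns standard coordinates; multiplying each dimension by its
# singular value gives the principal coordinates of the rows
row_principal <- sweep(ca$rscore, 2, ca$cor, `*`)

# Same call as in the listing for Figure 5
biplot(ca, xlab = "dim 1", ylab = "dim 2", cex = 0.8)
```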

Furthermore, two more perception maps were created (Figure 6). The first presents the association between job seniority and time without work. The second shows the relationship between time without work and the level of education among the unemployed.

On the basis of the overall analysis of unemployment in the city of Bytom, we can conclude that the major group of the unemployed are people between 18 and 34 years of age. Those unemployed for 3 to 6 months are between 55 and 59 years of age. Those unemployed for 6 to 12 months are people with a job seniority of 1-5 years and people 30 years of age or more. Those unemployed for 12 to 24 months are people between 60 and 64 years of age with basic vocational education, while people aged 60-64 with lower secondary, primary and incomplete primary education have been unemployed for over 24 months.

Education levels: 1 - tertiary, 2 - technical and vocational secondary, 3 - general secondary, 4 - basic vocational, 5 - lower secondary, primary and incomplete primary.

Figure 6. On the left side - the perception map for time without work and job seniority, on the right side - the perception map for time without work and level of education

Source: own research.

VII. CONCLUSION

All the types of plots shown in this article present the degree to which the variables in the two-way contingency table depart from independence. Correspondence analysis allows for a more profound data analysis. The main aim of this method is not only to examine the association between the two variables, but also to disclose the relationships between the categories of the variables.

REFERENCES

Clausen S. E. (1998), Applied Correspondence Analysis. An Introduction. Sage: University Paper 121.

Friendly M. (1994), Mosaic displays for multi-way contingency tables, Journal of the American Statistical Association, 89, p. 190-200.

Friendly M. (1998), Conceptual Models for Visualizing Contingency Table Data, in: Blasius J., Greenacre M. (eds.), Visualization of Categorical Data, Academic Press.

Friendly M. (1999), Extending Mosaic Displays: Marginal, Partial, and Conditional Views of Categorical Data, Journal of Computational and Graphical Statistics, 8, p. 373-395.

Greenacre M. J. (1984), Theory and Applications of Correspondence Analysis, Academic Press, London.

Hartigan J. A., Kleiner B. (1984), A mosaic of television ratings, The American


Iwona Kasprzyk

VISUALISATION OF TWO-WAY CONTINGENCY TABLES IN THE R STATISTICAL PACKAGE

The contingency table is a common way of presenting data measured on both nominal and ordinal scales. The article analyses unemployment in Silesia, with particular attention to Bytom, a city especially affected by this problem.

Using the vcd and graphics packages in R, the data contained in a two-way contingency table are visualised by means of several kinds of graphical presentation, including the mosaic display, the sieve diagram and the association plot. For a more detailed analysis of the data, the results are also presented by means of correspondence analysis, which makes it possible to describe the relationships between the categories of the variables.
