
AGH UNIVERSITY OF SCIENCE AND TECHNOLOGY
Faculty of Computer Science, Electronics and Telecommunication
Department of Computer Science

Doctoral dissertation

Exploring Group Dynamics in Social Media

Author: mgr inż. Bogdan Gliwa
Supervisor: dr hab. inż. Aleksander Byrski
Supporting supervisor: dr inż. Anna Zygmunt

Kraków 2017




Abstract

This dissertation concerns the analysis of groups and their dynamics in social media, which are a popular way of expressing opinions, commenting on current events, and building relationships with others. As a result of interactions between people, groups form; because these interactions are dynamic, the groups constantly change over time. Analysis of such groups is widely used, primarily in marketing (finding potential customers for a particular product and promoting it), monitoring social or election campaigns, and optimizing transport networks in cities or structures in companies.

A number of problems are examined in this dissertation. The first is an effective analysis of group changes over time; for this problem, the SGCI method is proposed. The second problem (related to the first one) concerns the evaluation of methods for the analysis of group dynamics. In real-world datasets, we do not have knowledge about the real processes occurring within groups (called “ground truth”). Therefore, a benchmark generating synthetic datasets and a set of metrics for comparing the results of such methods have been introduced (moreover, as the considered class of methods for the analysis of group dynamics uses existing methods of group discovery, a benchmark and metrics for the task of group discovery have also been created). Another relevant task is the visualization of group dynamics; for this task, a new tool (GEVi) has been implemented. Prediction of group behaviors is another issue under consideration. In this context, new metrics have been proposed for the prediction of group events and group merging (this dissertation is the first to formulate and discuss the latter issue). The last task is the analysis of group topics (i.e., determining the topics being discussed within groups). In order to accomplish this task, a new method integrating selected aspects of Natural Language Processing and Social Network Analysis has been introduced.

For each considered problem, a series of experiments was carried out that confirmed the dissertation thesis: observation of the characteristics of groups forming in social media allows for a systematic analysis of the content spread through them, modeling group dynamics, and predicting their future behavior.


Acknowledgements

I would like to express my gratitude to my supervisor, Aleksander Byrski Ph.D. D.Sc., for his helpful remarks during the writing of this thesis.

I am also very grateful to my supporting supervisor, Anna Zygmunt Ph.D., for her irreplaceable help and continuing advice before, during and beyond the writing of this thesis. I also want to thank her for our collaboration in the fields of research, teaching and organization. I cannot imagine completing this thesis without her support.

I would like to thank Jarosław Koźlak Ph.D. D.Sc. for his substantial contribution to the development of the SGCI method, which became one of the cornerstones of my Ph.D. thesis. I am also thankful for our fruitful discussions and collaboration, which were seminal for my further research.

My gratitude also goes to my colleagues, friends, and family, who have supported me during this important period of my life.

I acknowledge that during the work on my thesis I was a scholarship fellow of the “Doctus – Małopolski fundusz stypendialny dla doktorantów” project (Lesser Poland Scholarship Fund for Ph.D. candidates), co-funded by EU funds within the European Social Fund.


Contents

1 Introduction
  1.1 Research domain and problem description
  1.2 Thesis statement
  1.3 Research contribution
  1.4 Dissertation organization
2 Introduction to Social network analysis
  2.1 Social network analysis
  2.2 Centrality metrics
  2.3 Degree distribution
  2.4 Groups
  2.5 Dynamics of networks
3 Selected contemporary methods related to Group Analysis in Social Networks
  3.1 Group dynamics in social networks
  3.2 Evaluation of methods for group dynamics
  3.3 Visualization of group dynamics in social networks
  3.4 Prediction of group dynamics in social networks
  3.5 Text Mining in Social Media
4 Proposed Group Analysis Methods in Social Networks
  4.1 Group dynamics in social networks
    4.1.1 SGCI method
    4.1.2 Comparison of group dynamics methods
  4.2 Evaluation of methods for group dynamics
    4.2.1 Proposed metrics for evaluation of group discovery
    4.2.2 Proposed metrics for evaluation of group dynamics methods
    4.2.3 eLFR: benchmark for group discovery methods
    4.2.4 GevBen: a benchmark for group dynamics methods
    4.2.5 Comparison of benchmarks for group dynamics methods
  4.3 Visualization of group dynamics in social networks
  4.4 Prediction of group dynamics in social networks
    4.4.1 Prediction of group events
    4.4.2 The merge prediction problem
  4.5 Topics analysis in social networks
    4.5.1 Method of assigning topics for a social group
    4.5.2 Inheritance of topics
5 Experimental verification of the Proposed Methods
  5.1 Evaluation of methods for group dynamics
    5.1.1 Experiments for evaluation metrics of groups detection methods
    5.1.2 Experiments for evaluation metrics of group dynamics methods
    5.1.3 Experiments comparing benchmarks for group discovery methods
    5.1.4 Performance of benchmarks for group dynamics methods
  5.2 Group dynamics in social networks
    5.2.1 Experiments on benchmark data
    5.2.2 Experiments on real-world data
  5.3 Visualization of group dynamics in social networks
  5.4 Prediction of group dynamics in social networks
    5.4.1 Datasets and experiments setup
    5.4.2 Experiments for prediction of group events
    5.4.3 Experiments on merge prediction problem
  5.5 Topics analysis in social networks
6 Summary
  6.1 Thesis verification and confirmation
  6.2 Work beyond the scope of the thesis
  6.3 Future work
List of Figures
List of Tables
List of Pseudocodes
Bibliography
Scientific curriculum

Chapter 1. Introduction

This chapter provides an introduction to the domain of the dissertation and describes the importance of the problems undertaken. Next, the thesis is formulated and the research contribution is described. The last section presents the organization of this dissertation.

1.1. Research domain and problem description

The Internet has become an important part of our lives. Many services and businesses have relocated to this virtual world. Nowadays, it is hard to do business without being present on the Internet: some companies have combined it with the traditional way of conducting business, and others have moved there entirely. Furthermore, we all participate in social media: we write blogs, comment on the posts of others, and exchange opinions on forums, fanpages, and other kinds of social media. Often, these opinions concern products or companies. We leave traces of our activity everywhere; therefore, the analysis of activities conducted on the Internet, especially in social media, is becoming increasingly important.

Many companies are interested in analyzing the behavior of their customers or potential customers; this may help them tailor their offers better to their targets or find potential brand advocates who could promote their products in new circles. Other activities on the Internet that are at the center of attention include the monitoring of social campaigns (e.g., the influence of conducted campaigns on the number of supporters or the results of elections) and monitoring the actions of suspicious groups of people (e.g., terrorists).

Social media data consists of heterogeneous data types, mostly links and text information. Links are representations of interactions and relationships that can be modeled as a network using the Social Network Analysis approach. However, such a network is not homogeneous, and one can distinguish groups of people who discuss or interact more frequently inside a group than with other parts of the network. Groups may undergo different events: they may continue to exist, split, join with other groups, or disappear altogether. An interesting problem is not only detecting such events, but also predicting them. A prediction can answer, for example, the following question: is it worth investing in such a group or not? For instance, if the considered group is going to disappear, it may not be a good choice to introduce a new product to it.

Moreover, an inherent element of social media is text, and it can be combined with the discovered groups to find out what is discussed by their members. Conducting such an analysis of topics over time allows us to find out which topics gain importance and which become less interesting for members of a group. This can be useful in monitoring the effectiveness of conducted actions, e.g., campaigns or advertisements within a group.

1.2. Thesis statement

The dissertation thesis is formulated as follows:

Observation of the characteristics of groups forming in social media allows for the systematic analysis of content spread through them, modeling group dynamics and predicting their future behavior.

The thesis is situated in the field of Social Network Analysis. It concerns analyzing social groups while taking into account their dynamics, forecasting their future behavior and analyzing the text content disseminated by their members. Most methods developed in this thesis can be used for any kind of social network (methods focused on group dynamics and prediction), but methods designed to integrate text analytics with social network analysis are dedicated to the blogosphere, which is a type of social media. Blogosphere datasets, among others, are used in all three of the mentioned parts of this thesis.

An overview of the problems addressed in this dissertation is depicted in Figure 1.1. The main problem is the analysis of group dynamics. This problem is connected with another one: the validation of methods for group dynamics. This is an important issue because, in real-world networks, we do not know the real processes occurring within groups (ground truth). Two subtasks can be distinguished: the generation of synthetic datasets (benchmarks), and a comparison of the results among different methods of group dynamics (metrics). As the considered class of group dynamics methods uses existing methods of group discovery, analogous subtasks for the group discovery task are also considered. The next issue is the visualization of group dynamics, which helps in the task of analyzing group dynamics. Another considered problem is the prediction of group behaviors. In this context, two subproblems are addressed: the prediction of group events and the merge prediction problem (which is described for the first time in this dissertation). The last task is analyzing the topics of discussions conducted within groups.

[Figure 1.1: Overview of tackled problems in this thesis. Diagram blocks: social groups; groups dynamics; Validation (benchmark for group dynamics, metrics for evaluation of group dynamics, benchmark for group discovery, metrics for evaluation of group discovery, linked by “use” relations); Visualization (visualization of group dynamics); Prediction (group events prediction, merge prediction problem); Text Mining (topics of groups).]

1.3. Research contribution

The main contributions of this dissertation are as follows:
• introduction of new metrics for comparison of group discovery methods (Section 4.2.1), along with a detailed analysis of their properties (Section 5.1.1);
• introduction of new metrics for comparison of group dynamics methods (Section 4.2.2), along with a careful study of their properties (Section 5.1.2);
• extension of the LFR benchmark for group discovery methods (Section 4.2.3), with verification of the proposed extensions (Section 5.1.3);
• introduction of GevBen, a new benchmark for group dynamics methods (Section 4.2.4);
• introduction of SGCI, a method for analyzing group dynamics (Section 4.1.1);
• detailed study of the properties of different group dynamics methods on synthetic networks generated by the GevBen benchmark (Section 5.2.1) as well as on real networks from the blogosphere domain (Section 5.2.2);
• implementation of GEVi, a tool for visualization of group dynamics (Section 4.3);
• formulation of the merge prediction problem (Section 4.4.2);
• introduction of new metrics useful in predicting group events (Section 4.4.1);
• experimental verification of methods for predicting group events (Section 5.4.2) and the merge prediction problem (Section 5.4.3);
• introduction of a method integrating topic analysis with social groups, with a variant for topic inheritance (Section 4.5);
• experiments verifying the integration approach of topic analysis within social groups (Section 5.5);
• application of topic features of a group to the merge prediction problem (Section 5.5).

1.4. Dissertation organization

The dissertation is organized in the following manner. First, Chapter 2 introduces the basic concepts of Social Network Analysis (which is the background of this thesis). Then, Chapter 3 presents an overview of selected methods related to Group Analysis in Social Networks, with a focus on group dynamics in social networks (Section 3.1), evaluation of methods for group dynamics (Section 3.2), visualization (Section 3.3) and prediction of group dynamics (Section 3.4), as well as existing methods of text mining and their application to social network analysis (Section 3.5).

Chapter 4 describes the proposed methods of group analysis in social networks. It contains five parts:
• group dynamics in social networks (Section 4.1);
• evaluation of methods of group dynamics (Section 4.2), consisting of four problems: a comparison of group discovery methods (Section 4.2.1), a comparison of group dynamics methods (Section 4.2.2), benchmarks for group discovery methods (Section 4.2.3) and benchmarks for group dynamics methods (Section 4.2.4);
• visualization of group dynamics (Section 4.3);
• prediction of group dynamics (Section 4.4), where two problems are addressed: the prediction of group events (Section 4.4.1) and the merge prediction problem (Section 4.4.2);
• topic analysis of groups in social networks (Section 4.5).

The results of the experiments are described and discussed in Chapter 5. It starts with the results for validating methods of group dynamics (Section 5.1). Then, these validation methods are used, along with real-world datasets, in experiments on group dynamics methods (Section 5.2). The next section (Section 5.3, dedicated to the visualization of group dynamics) presents a real use case of the introduced tool. The next part (Section 5.4) is devoted to the methods of predicting group events and their merging behavior (which were verified on five real-world datasets). The last part is dedicated to experiments regarding the topic analysis of groups (Section 5.5). Chapter 6 contains the conclusion, future directions, and a short overview of work that is beyond the scope of this thesis, yet related to the main theme.

The structure of this dissertation is also presented in Figure 1.2, which portrays specific problems, the solutions proposed in this dissertation to tackle them, and the ways used for their verification. As can be seen, existing problems (and some approaches to them) are depicted in Chapter 3 (except for newly identified ones, which are described in Chapter 4), proposed methods and approaches are presented in Chapter 4, and verification is conducted in Chapter 5.

Figure 1.2: Problems faced in the thesis, their solutions and verification. Diagram content, listed per area as problem / solution / verification:
• Group dynamics: analysis of group dynamics (Section 3.1) / SGCI method (Section 4.1.1) / blog datasets, synthetic networks; comparison with SOTA methods (Section 5.2)
• Validation: benchmark for group dynamics (Section 3.2) / GevBen benchmark (Section 4.2.4) / comparison with SOTA methods (Section 5.1.4)
• Validation: metrics for evaluation of group dynamics (Section 4.2.2) / GEM with variants (Section 4.2.2) / synthetic datasets (Section 5.1.2)
• Validation: benchmark for group discovery (Section 3.2) / eLFR benchmark (Section 4.2.3) / comparison with SOTA methods (Section 5.1.3)
• Validation: metrics for evaluation of group discovery (Section 3.2) / Cmatrix measure with variants (Section 4.2.1) / synthetic datasets; comparison with SOTA methods (Section 5.1.1)
• Visualization: visualization of group dynamics (Section 3.3) / GEVi tool (Section 4.3) / blog datasets; comparison with SOTA methods (Section 5.3)
• Prediction: group events prediction (Section 3.4) / new metrics, analysis of features, classification (Section 4.4.1) / blog datasets, Facebook, DBLP, Enron (Section 5.4.2)
• Prediction: merge prediction problem (Section 4.4.2) / new metrics, analysis of features (Section 4.4.2) / blog datasets, Facebook, DBLP, Enron (Section 5.4.3)
• Text Mining: topics of groups (Section 4.5) / GTA method (Section 4.5) / blog datasets (Section 5.5)


Chapter 2. Introduction to Social network analysis

This chapter presents the background of the thesis, which is Social network analysis. It starts with the basic definitions and concepts, such as centrality metrics and degree distribution. Then, various definitions and criteria for groups are presented. Finally, the most popular methods of group discovery are described.

2.1. Social network analysis

McCarthy and Molina [142] define social network analysis (SNA) as “the study of the patterns of interaction between actors” (people). Social networks are the focus of Social network analysis. Newman [152] defines social networks as networks (graphs) “in which the vertices are people, or sometimes groups of people, and the edges represent some form of social interaction between them, such as friendship.” Sociologists often refer to vertices (also called nodes) as actors and to edges (also called links) as ties. This is the terminology used by Wasserman and Faust [213], who define a social network as “a finite set or sets of actors and the relation or relations defined on them”. A relation is understood as a collection of ties of one kind. Both definitions are similar: they place people, represented as nodes in a graph, at the center, with their social relations or interactions represented by edges. Many examples of social networks can be mentioned: students [226], collaborations of scientists [92, 154], online communities such as Facebook [127], etc.

Representation of social networks

Typically, social networks are represented as graphs or matrices [152]. In the graph representation, actors are nodes and relations between them are edges. Edges can be directed (non-symmetric relation) or undirected (symmetric relation), weighted (with the strength of the relation) or unweighted. In this dissertation, the focus is on simple graphs [152], i.e., graphs that have neither self-edges (edges connecting nodes to themselves) nor multiedges (more than one edge between the same pair of nodes). The most popular matrix representation is the adjacency matrix $A$ (with dimensions $n \times n$, where $n$ is the number of nodes in a graph), where each entry $A_{ij}$ for an unweighted graph is defined as:

$$A_{ij} = \begin{cases} 1, & \text{if there is an edge from node } i \text{ to node } j, \\ 0, & \text{otherwise.} \end{cases} \tag{2.1}$$

For an undirected graph, such a matrix is symmetric; for a directed one, it is in general non-symmetric. In the case of a weighted graph, instead of 1 for the existence of a relation between nodes $i$ and $j$, there is a value representing the strength of that relation.

Paths

A path [152] in a network is a sequence of nodes such that each consecutive pair in the sequence is connected by an edge in the network. The length of a path is the number of edges traversed along the path (edges can be traversed more than once). A geodesic path (also called the shortest path) is a path between two nodes such that no shorter path exists. Using this definition, the diameter of a graph can be defined as the length of the longest geodesic path between any pair of nodes in the network.

Density

The density [213] of a graph is the proportion of possible edges that are actually present, i.e., the ratio of the number of existing edges to the number of all possible ones. For an undirected graph $G$, it has the following form:

$$\mathrm{density}(G) = \frac{m}{\frac{n(n-1)}{2}}, \tag{2.2}$$

where $m$ is the number of edges in $G$ and $n$ is the number of nodes in this graph. For a directed graph $G$ ($m$ and $n$ have the same meaning as in Equation 2.2):

$$\mathrm{density}(G) = \frac{m}{n(n-1)}. \tag{2.3}$$

When all possible edges are present, the graph is called complete.
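The definitions of the adjacency matrix, geodesic paths, diameter and density can be checked on a toy graph. The following is only a minimal sketch in plain Python: the function names (`adjacency_matrix`, `density`, `geodesic_lengths`, `diameter`) and the four-node path graph are illustrative assumptions, not part of the dissertation, and `diameter` assumes a connected graph.

```python
from collections import deque


def adjacency_matrix(n, edges, directed=False):
    """Build the n x n adjacency matrix of Equation 2.1 for a simple graph."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        if i == j:
            raise ValueError("simple graphs have no self-edges")
        A[i][j] = 1
        if not directed:
            A[j][i] = 1  # undirected edges make the matrix symmetric
    return A


def density(A, directed=False):
    """Equation 2.2 (undirected) or Equation 2.3 (directed)."""
    n = len(A)
    ones = sum(sum(row) for row in A)
    m = ones if directed else ones // 2  # undirected edges appear twice in A
    possible = n * (n - 1) if directed else n * (n - 1) // 2
    return m / possible


def geodesic_lengths(A, source):
    """Breadth-first search: shortest-path length from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v, a in enumerate(A[u]):
            if a and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(A):
    """Length of the longest geodesic path (assumes a connected graph)."""
    return max(max(geodesic_lengths(A, s).values()) for s in range(len(A)))


# Illustrative data: the path graph 0 - 1 - 2 - 3
A = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)])
print(density(A), diameter(A))
```

For this path graph, three of the six possible edges are present, and the longest geodesic runs between the two end nodes.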
Clustering coefficient

An important property of social networks is transitivity [213], which concerns triples of nodes in a graph. It can be quantified using the clustering coefficient [152]. If $u$ is connected with $v$ and $v$ is connected with $w$, then we have a path $uvw$ consisting of two edges in the network. If $u$ is also connected with $w$, then we say that the path is closed (or that $u$, $v$ and $w$ form a closed triad). Now, we can define the clustering coefficient in the following way:

$$cc = \frac{\text{number of closed paths of length two}}{\text{number of paths of length two}}. \tag{2.4}$$

An equivalent definition of the clustering coefficient uses a different equation:

$$cc = \frac{(\text{number of triangles}) \times 6}{\text{number of paths of length two}}. \tag{2.5}$$

The factor of six comes from the fact that each triangle in the network is counted six times when we count the closed paths of length two. For a triangle $uvw$, there are six paths of length two: $uvw$, $vwu$, $wuv$, $wvu$, $uwv$ and $vuw$. Another popular definition of the clustering coefficient uses the term “connected triple”, which means three vertices $uvw$ with edges $(u, v)$ and $(v, w)$ (the edge $(u, w)$ may be present or not). It has the following form:

$$cc = \frac{(\text{number of triangles}) \times 3}{\text{number of connected triples}}. \tag{2.6}$$

The factor of three arises because each triangle is counted three times when counting the connected triples in the network: a triangle $uvw$ contains three triples, $uvw$, $vwu$ and $wuv$. A clustering coefficient can also be defined for a single vertex $i$ (in which case it is called the local clustering coefficient) as:

$$cc_i = \frac{\text{number of pairs of neighbours of } i \text{ that are connected}}{\text{number of pairs of neighbours of } i}. \tag{2.7}$$

In [214], Watts and Strogatz proposed calculating the global clustering coefficient as the mean of the local clustering coefficients of all vertices:

$$cc_{WS} = \frac{1}{n} \sum_{i=1}^{n} cc_i. \tag{2.8}$$

This definition differs from the one given in Equations 2.4, 2.5 and 2.6, and the two generally yield different values: the Watts-Strogatz version of the clustering coefficient puts more emphasis on low-degree nodes. However, both are commonly used.

2.2. Centrality metrics

Centrality metrics [227, 32, 234] describe the importance of nodes in the network. Each of them answers the same question, namely which nodes are the most central in the network, but according to a different criterion.
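Before turning to the individual centrality metrics, the difference between the clustering coefficient of Equation 2.6 and the Watts-Strogatz average of Equation 2.8 (Section 2.1) can be illustrated on a small graph. This is a sketch under stated assumptions: the helper names and the example graph are hypothetical, and the local coefficient of a node with fewer than two neighbours is set to 0 here, which is one common convention rather than something the text prescribes.

```python
from itertools import combinations


def neighbours(edges):
    """Adjacency sets of an undirected simple graph given as an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def local_cc(adj, i):
    """Equation 2.7; taken as 0.0 for nodes with fewer than two neighbours (assumption)."""
    pairs = list(combinations(sorted(adj[i]), 2))
    if not pairs:
        return 0.0
    connected = sum(1 for a, b in pairs if b in adj[a])
    return connected / len(pairs)


def global_cc(adj):
    """Equation 2.6: (number of triangles x 3) / (number of connected triples)."""
    triples = 0
    closed = 0
    for centre in adj:
        for a, b in combinations(sorted(adj[centre]), 2):
            triples += 1          # a - centre - b is a connected triple
            if b in adj[a]:
                closed += 1       # every triangle is seen once per centre, i.e. 3 times
    return closed / triples if triples else 0.0


def watts_strogatz_cc(adj):
    """Equation 2.8: mean of the local clustering coefficients."""
    return sum(local_cc(adj, i) for i in adj) / len(adj)


# Illustrative data: triangle 0-1-2 with a pendant node 3 attached to node 2
adj = neighbours([(0, 1), (1, 2), (0, 2), (2, 3)])
print(global_cc(adj), watts_strogatz_cc(adj))
```

On this graph the two definitions give 3/5 and 7/12 respectively, so they disagree even on a four-node example.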

Degree centrality

Degree centrality [157] is the simplest of all centrality metrics. In an undirected graph, it is the number of edges of a node $x$:

$$D(x) = \sum_{y \in V,\ y \neq x} A_{xy}, \tag{2.9}$$

where $A$ is the adjacency matrix of the graph and $V$ is the set of all its nodes. If $n$ is the total number of nodes in the network ($n = |V|$), then the maximum degree a node can have is $n - 1$, so the normalized degree of a node $x$ has the following form:

$$D_N(x) = \frac{D(x)}{n - 1}. \tag{2.10}$$

In a directed graph, we can distinguish two versions: indegree, which is the number of edges incoming to a given node, and outdegree, which is the number of edges outgoing from a given node. The idea behind this metric is that the most central nodes are the ones with the highest number of connections with other nodes in the network.

Closeness centrality

The closeness metric [183] is based on the concept that the most central node in the network is the one with the shortest paths to all other nodes. For a node $x$, it takes the form:

$$C(x) = \frac{1}{\sum_{y \in V,\ y \neq x} d(x, y)}, \tag{2.11}$$

where $d(x, y)$ is a distance function giving the length of the shortest path between nodes $x$ and $y$ (geodesic distance), and $V$ is the set of all nodes in the graph. The closeness metric (for a node $x$) can also be normalized:

$$C_N(x) = \frac{n - 1}{\sum_{y \in V,\ y \neq x} d(x, y)}, \tag{2.12}$$

where $n$ is the number of nodes in the network. The normalized closeness can be viewed as the inverse of the average distance between node $x$ and the rest of the network.

Betweenness centrality

The betweenness metric [9, 67] is based on the idea that a node is central if it lies on many shortest paths between other nodes. It can be formalized (for a node $x$) as [213]:

$$B(x) = \sum_{i, j \in V,\ i \neq x \neq j} \frac{\sigma_{ij}(x)}{\sigma_{ij}}, \tag{2.13}$$

where $\sigma_{ij}(x)$ is the number of shortest paths between nodes $i$ and $j$ that pass through node $x$, and $\sigma_{ij}$ is the total number of shortest paths between nodes $i$ and $j$.
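The three metrics defined so far can be computed by brute force on a small graph. The sketch below is illustrative only: the function names and the star graph are assumptions, the graph is assumed to be undirected and connected, and the sum in Equation 2.13 is taken over unordered pairs $\{i, j\}$, so that each pair of nodes is counted once.

```python
from collections import deque
from itertools import combinations


def bfs_counts(adj, s):
    """Return (dist, sigma): geodesic distances from s and numbers of shortest paths."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:              # first time v is reached
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:     # u precedes v on some shortest path
                sigma[v] += sigma[u]
    return dist, sigma


def centralities(adj):
    """Normalized degree (2.10), normalized closeness (2.12), raw betweenness (2.13)."""
    n = len(adj)
    info = {s: bfs_counts(adj, s) for s in adj}
    degree = {x: len(adj[x]) / (n - 1) for x in adj}
    closeness = {x: (n - 1) / sum(info[x][0].values()) for x in adj}
    betweenness = {x: 0.0 for x in adj}
    for i, j in combinations(adj, 2):      # unordered pairs of distinct nodes
        d_ij, s_ij = info[i][0][j], info[i][1][j]
        for x in adj:
            if x in (i, j):
                continue
            # x lies on a shortest i-j path iff the distances add up exactly
            if info[i][0][x] + info[x][0][j] == d_ij:
                betweenness[x] += info[i][1][x] * info[x][1][j] / s_ij
    return degree, closeness, betweenness


# Illustrative data: a star with centre 0 and leaves 1, 2, 3
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
degree, closeness, betweenness = centralities(star)
print(degree[0], closeness[1], betweenness[0])
```

In the star, the centre is adjacent to every other node and lies on the single shortest path between every pair of leaves, so it scores highest on all three metrics while the leaves have zero betweenness.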

The minimum value of the betweenness metric is zero and the maximum is $\frac{(n-1)(n-2)}{2}$, which is the number of pairs of nodes excluding $x$ ($n$ is the number of nodes in the graph). Therefore, the normalized version of betweenness (for a node $x$) has the following form:

$$B_N(x) = \frac{2B(x)}{(n - 1)(n - 2)}. \tag{2.14}$$

PageRank

PageRank was introduced by Brin and Page in [36] and was first applied to ranking webpages. PageRank is based on the intuition that an important node has many incoming edges from other important nodes. It is calculated for a node $x$ using the following formula [152]:

$$PR(x) = \alpha \sum_{y \in V,\ A_{yx} > 0} \frac{PR(y)}{D_{out}(y)} + \beta, \tag{2.15}$$

where $\alpha \in (0, 1)$ is a damping factor (frequently set to 0.85), $\beta \in (0, 1)$, $A$ is the adjacency matrix, $V$ is the set of all nodes in the graph, $n$ is the total number of nodes in the network and $D_{out}(y) = \sum_{z \in V,\ z \neq y} A_{yz}$ is the out-degree of node $y$. In the form proposed by Brin and Page, $\beta$ was equal to $\frac{1 - \alpha}{n}$. The algorithm is iterative: usually, all nodes are initialized with the value $\frac{1}{n}$, and then iterations are performed until convergence.

2.3. Degree distribution

One of the most fundamental network properties is the degree distribution [152, 211], which is the frequency distribution of node degrees. The degree distribution function $P(k)$ is the probability that a randomly selected node has degree $k$:

$$P(k) = \frac{n_k}{n}, \tag{2.16}$$

where $n$ is the number of nodes in the network and $n_k$ is the number of nodes with degree equal to $k$. For a regular lattice, all nodes have the same number of edges, so the degree distribution contains a single sharp spike. At the other extreme, for a completely random network the degree sequence obeys the Poisson distribution.
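Two of the definitions above lend themselves to a short sketch: the PageRank iteration of Equation 2.15 and the empirical degree distribution of Equation 2.16. The code below is an illustration under stated assumptions, not the implementation used in the dissertation: the names `pagerank` and `degree_distribution`, the tolerance, the iteration cap and the three-node example are hypothetical, $\beta$ is set to $(1 - \alpha)/n$, and every node is assumed to have at least one outgoing edge.

```python
from collections import Counter


def pagerank(out_links, alpha=0.85, tol=1e-10, max_iter=1000):
    """Iterate Equation 2.15 with beta = (1 - alpha) / n, starting from PR = 1/n.

    out_links maps each node to the set of nodes it points to. Every node is
    assumed to have at least one outgoing edge (no dangling nodes).
    """
    nodes = list(out_links)
    n = len(nodes)
    beta = (1 - alpha) / n
    pr = {x: 1 / n for x in nodes}
    for _ in range(max_iter):
        new = {x: beta for x in nodes}
        for y in nodes:
            share = alpha * pr[y] / len(out_links[y])  # alpha * PR(y) / D_out(y)
            for x in out_links[y]:
                new[x] += share
        if max(abs(new[x] - pr[x]) for x in nodes) < tol:
            return new
        pr = new
    return pr


def degree_distribution(degrees):
    """Equation 2.16: P(k) = n_k / n for a list of node degrees."""
    n = len(degrees)
    return {k: n_k / n for k, n_k in sorted(Counter(degrees).items())}


# Illustrative data: directed cycle 0 -> 1 -> 2 -> 0 with an extra edge 0 -> 2
links = {0: {1, 2}, 1: {2}, 2: {0}}
ranks = pagerank(links)
print(ranks)

# A ring lattice in which every node has degree 2
print(degree_distribution([2, 2, 2, 2]))
```

In the example, node 2 receives edges from both other nodes and ends up with the highest rank. For the ring lattice, `degree_distribution` returns a single entry, which is the sharp spike mentioned above.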
However, many empirical results showed that most large-scale real networks have different degree distribution — it deviates significantly from the Poisson distribution and can be described better by a power law: P(k) ∼ k−γ , (2.17) where k is the degree, γ is the exponent of the power law and, typically, has values between 2 and 3. These power-laws are free of any characteristic scale, therefore, such a network with a power-law degree distribution is called a scale-free network. Examples of the power-law phenomena include the structure of the Internet [62], networks of 11.

movie actors [214, 17], paper co-authorship [154], frequencies of family names, and the intensity of solar flares [51]. Barabási and Albert [17, 18, 6] proposed a model to generate a scale-free network using the preferential attachment mechanism. In this mechanism, the probability of creating an edge from a newly created node to an existing node in a network is proportional to the degree of the existing node, i.e. the probability $p_i$ that the new node connects to an existing node i is calculated using the following formula:

\[ p_i = \frac{k_i}{\sum_{j \in V} k_j}, \tag{2.18} \]

where $k_i$ is the degree of node i, V is the set of all nodes in a network, and $\sum_{j \in V} k_j$ is the sum of the degrees of all the nodes in the network. This mechanism prefers attaching new nodes to nodes with a high degree. The preferential attachment mechanism appears in the literature under various names, such as "the rich get richer" or the Matthew effect.

2.4. Groups

Many social networks naturally divide into groups. However, in the area of social network analysis there is no single definition of a group (in the literature also called a community, cluster, module or cohesive subgroup). Many authors define it depending on different properties or on the algorithms used for detection. One of the most popular definitions of a group (which appears in the works of Fortunato [65], Porter [174] and Yang [220]) says that a group is a set of nodes that are relatively densely connected to each other but sparsely connected to nodes in other groups in the network. This definition is also used in this dissertation. An important property of a community is connectedness [65], i.e. for each pair of members of a community there should be a path running only through members of this community. Fortunato [65] distinguished three levels at which a community can be defined: local definitions, global definitions and definitions based on vertex similarity.
In the local definition of a group, the perspective focuses on the inner structure of a group without consideration of the remaining part of the network. This perspective was described in detail by Wasserman and Faust [213], who identified four criteria of groups (from the strictest to the least strict):
• complete mutuality of links between members,
• reachability of members: all members should be reachable from each other but not necessarily adjacent,
• node degree: the members should have links to many other members within a group,
• relative density of links among members compared to nodes outside a group.

It can be noticed that successive criteria weaken the notion of adjacency between members of a group. Under the first criterion, complete mutuality of links, communities are defined in a very strict sense: each member is connected to all other members [137], which corresponds to a clique (a clique is a maximal complete subgraph of at least three nodes [213]). Apart from its strictness, another drawback of this definition is that all nodes in such a group are symmetric, with no differentiation between them [65].

The notion of a clique was extended using the concept of reachability (the second criterion of Wasserman and Faust [213]), i.e. the existence and length of paths between nodes in a group [65]. An n-clique is a maximal subgraph such that the shortest distance between each pair of its nodes is not larger than n [136, 4]. There are also some problems with the concept of n-cliques. The first is that an n-clique may have a diameter greater than n (when the shortest paths between some of its nodes run outside it). The second is that even if two nodes are connected by a path no longer than n, there may be no path between them that uses only links inside the n-clique. These problems were later eliminated by extending those definitions [145].

The third criterion refers to the degrees of nodes in a group and requires that each member of a group is adjacent to some number of other members of the same group (a weaker condition than in the case of a clique, where every member must be adjacent to all the others). The best example of a group definition for this criterion is the k-core, a subgraph in which each node is adjacent to at least k other nodes of this subgraph [191]. The last criterion of Wasserman and Faust [213] concerns the density of links inside and outside a group.
An example for this criterion is an LS-set [135], a subgraph in which the internal degree of each node is greater than its external degree.

In the global definition of a group distinguished by Fortunato [65], groups are defined with respect to the whole network, and a global criterion is used to discover them. The criterion depends on the algorithm, but one of the most popular ones states that a graph with community structure differs from a random graph that preserves some of the structural features of the original graph (such a random graph is called a null model). The concept of modularity (described in more detail in Section 2.4) is an example of this approach.

In the vertex similarity definition of a group, a group is treated as a set of vertices similar to each other. In this approach, the similarity between each pair of vertices is computed with respect to some property, regardless of the presence of an edge between them. One of the properties used is the commute time between a pair of nodes in the graph, i.e. the average number of steps needed for a random walker starting from the first node to reach the second one and return to the starting node [185].

Basic properties

Groups can be classified into two categories:

• disjoint: each node in the network belongs to exactly one group (in a special case a single node can form a group),
• overlapping: each node in the network can belong to zero or more groups (in particular, to more than one group).

A partition [65] is a division of a network into disjoint groups. A division of a network into overlapping groups is called a cover. Intracluster density (also called the density of group G) [65] is the ratio of the number of edges inside group G (between members of group G) to the number of all possible internal edges in group G:

\[ density(G) = \frac{|E(G)|}{\frac{n(n-1)}{2}}, \tag{2.19} \]

where E(G) is the set of edges inside group G and n is the number of nodes in group G. Interestingly, many networks exhibit power-law scaling in their community size distributions [10, 122, 134, 173].

Methods of group discovery

Many group detection algorithms have been created [65, 8, 159, 158, 218, 120, 225], and there have been many attempts to classify them [172, 167, 65, 174, 220]. Here, as an example of such a classification, the taxonomy by Papadopoulos [167] is presented. Papadopoulos classified methods of group discovery into the following categories:
• subgraph discovery: methods focused on the inner structure of groups. Many methods in this category refer to the local definition of a group [65], e.g. the Clique Percolation Method [162];
• model-based: methods considering a dynamic process taking place on a network which reveals its communities, e.g. label propagation [177, 90] or the spin model by Reichardt and Bornholdt [179];
• vertex clustering: techniques having their roots in traditional data clustering. Typically, those methods turn the problem of group discovery into a data clustering problem by embedding graph nodes in a vector space, where similarities or distances between each pair of nodes are calculated. This category corresponds to the vertex similarity definition of a group [65].
An example is the method proposed by Saerens et al. [185];
• quality optimization: methods that optimize a graph-based measure of community quality, such as modularity. The Louvain method [28] is an example from this category;

• divisive: methods that identify the nodes or links positioned between communities and progressively remove them, e.g. the method of Girvan and Newman [72].

In this section the most popular methods of community detection are described.

Clique Percolation Method

The Clique Percolation Method (CPM) was proposed by Palla et al. [57, 162]. The authors have provided an implementation of this method in the CFinder tool [1] (see footnote 1). CPM is one of the most popular methods for finding overlapping groups. The method is based on the assumption that internal edges of a group are inclined to form cliques, whereas edges connecting different groups are unlikely to form cliques. The following terminology is used in the CPM algorithm [163, 57]:
• k-clique: a complete graph with k nodes (unlike cliques, k-cliques need not be maximal and can be subsets of larger cliques),
• adjacent k-cliques: two k-cliques are adjacent if they share k − 1 nodes, i.e. they differ only in a single node,
• k-clique community (or k-clique percolation cluster): a union of all k-cliques that can be reached from each other through a series of adjacent k-cliques (such a series is called a k-clique chain).

CPM finds k-clique communities using the following algorithm [162]:
1. Find all cliques in a graph.
2. Create a clique-clique overlap matrix (each row and each column represents a clique, and each matrix element contains the number of nodes shared by the two cliques; a diagonal element is therefore the size of the clique).
3. In the matrix, erase all off-diagonal elements with values smaller than k − 1 and all diagonal elements smaller than k; the cliques that remain connected form separate k-clique communities.
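The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch and not the CFinder implementation: maximal cliques are enumerated with a plain Bron-Kerbosch recursion (an assumption made here for brevity; it is exponential in the worst case), and the thresholded overlap matrix is replaced by an equivalent union-find over pairs of cliques. The function names are hypothetical.

```python
from itertools import combinations


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def k_clique_communities(edges, k):
    """Return k-clique communities as a sorted list of sorted node lists."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Step 1 plus the diagonal filter of step 3: keep cliques with >= k nodes.
    cliques = [c for c in maximal_cliques(adj) if len(c) >= k]
    # Steps 2-3: two cliques stay linked when they share >= k - 1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)
    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return sorted(sorted(c) for c in communities.values())


# Two triangles sharing an edge, plus a third triangle attached through node 4.
edges = [(1, 2), (2, 3), (1, 3), (2, 4), (3, 4), (4, 5), (5, 6), (4, 6)]
print(k_clique_communities(edges, 3))
```

In this toy graph node 4 ends up in both communities found for k = 3, which illustrates why CPM is suited to overlapping groups.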
Apart from the version for undirected graphs, two other variants are in use:
• Directed Clique Percolation Method (CPMd) [164]: the algorithm is the same, but instead of k-cliques it uses the concept of a directed k-clique, a complete subgraph with k nodes that can be ordered in such a way that between any pair of them there is a directed edge pointing from the node with the higher rank towards the node with the lower rank. Ranks of nodes are defined by the number of out-links within the clique, i.e. the higher the rank of a node, the more out-links it has. Palla et al. [164] found that for many real networks the results of CPM and CPMd are very similar.

Footnote 1: http://www.cfinder.org/

• Weighted Clique Percolation Method (CPMw) [63]: a variant for weighted undirected graphs which uses the concept of a weighted k-clique, defined as a complete subgraph with k nodes such that the geometric mean of the weights of all edges within the k-clique is greater than a selected threshold value.

OSLOM

The Order Statistics Local Optimization Method (OSLOM) was introduced by Lancichinetti et al. [124]. OSLOM locally optimizes the statistical significance of clusters [123], which is defined as the probability of finding the cluster in a random null model. A random null model is a class of graphs without community structure. OSLOM uses the configuration model [146] as a null model. The configuration model builds random graphs with a predefined degree distribution of nodes. Therefore, the result of the OSLOM algorithm is a set of groups that are unlikely to be found in a random graph with the same degree sequence as the analyzed graph. OSLOM may start from an initial partition or cover, if one is provided as input, or may build clusters starting from individual nodes chosen at random. The OSLOM community detection method consists of the following phases:
1. Finding significant clusters, until convergence;
2. Analyzing the resulting set of clusters, trying to detect their internal structure or possible unions;
3. Detecting the hierarchical structure of the clusters.

The statistical significance of clusters with respect to the configuration model is calculated in the first phase. During the growth of a cluster, the r value is computed for each neighbor of the cluster: it is the cumulative probability, in the null model, that the neighbor has a number of edges into the community equal to or larger than the number actually observed. If the smallest of the r values is smaller than a predefined threshold, it is considered significant and the node with this r value is added to the community. Otherwise, the second smallest r value is checked, and so on.
The authors of OSLOM have provided an implementation of their method (www.oslom.org).

Girvan-Newman Method

Girvan and Newman [72, 155] proposed a method that detects groups by systematic removal of edges in a network. Instead of finding the edges most central to a community, they focused on the edges that lie between communities. To this end, they generalized the node betweenness measure to edge betweenness, defined for an edge as the number of shortest paths between pairs of nodes that run along it. When there is more than one shortest path between a pair of nodes, each path is assigned an equal weight. If a network contains groups that are connected with each other by only a few intergroup edges, then all the shortest paths between those groups must go through one of these few intergroup edges. Therefore, such edges have high scores of edge

betweenness, and removing them separates groups from one another, uncovering the community structure of the network. The Girvan-Newman algorithm consists of the following steps:
1. Calculate the betweenness scores for all edges in the network.
2. Remove the edge with the highest edge betweenness value.
3. Recalculate the betweenness scores for all the edges affected by the edge removal in the previous step.
4. Repeat from Step 2 until no edges remain.

The output of this method is a dendrogram. The root of the dendrogram is the whole network and its leaves represent individual nodes.

Louvain Method

The Louvain method (also called the Blondel method) of group detection was created by Blondel et al. [28]. It is very fast and dedicated to very large graphs: its computational complexity is O(m), where m is the number of edges in the network. This method finds only disjoint groups. The Louvain method is based on modularity optimization. Modularity [153] is a metric assessing the quality of a partition (a set of non-overlapping groups). It measures the density of links inside communities in comparison with links between communities. Modularity is defined in the following way:

\[ Q = \frac{1}{2m} \sum_{i,j \in V} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j), \tag{2.20} \]

where V is the set of all nodes in a network, $A_{ij}$ is the weight of the edge between nodes i and j (A is the adjacency matrix of a weighted network), $k_i = \sum_{j \in V} A_{ij}$ is the sum of the weights of edges incident to node i (its weighted degree), $c_i$ is the community to which node i is assigned, $m = \frac{1}{2} \sum_{i,j \in V} A_{ij}$, and the δ-function δ(u, v) returns 1 if u = v and 0 otherwise.

The Blondel algorithm consists of the following steps:
1. Assign a different community to each node in the network.
2. For each node i, consider its neighbors j and evaluate the gain of modularity ∆Q (Equation 2.21) that would be obtained by removing node i from its current community and placing it in the community of node j.
Then, node i is moved to the community for which the modularity gain is maximal, but only if this gain is positive (otherwise, node i stays in its original community).
3. Repeat step 2 until no further improvement of modularity can be reached.

4. Build a new network whose nodes are the communities found in the previous step. The weight of the edge between two such nodes is the sum of the weights of the edges linking the two corresponding communities [11]. Edges between nodes of the same community lead to self-loops on the node representing that community.
5. Apply Steps 1–4 to the new network and iterate until there are no more changes.

The gain in modularity ∆Q resulting from moving an isolated node i into a community C can be calculated using the following formula:

\[ \Delta Q(C, i) = \left[ \frac{D_C^{in} + k_{i,C}}{2m} - \left( \frac{D_C + k_i}{2m} \right)^2 \right] - \left[ \frac{D_C^{in}}{2m} - \left( \frac{D_C}{2m} \right)^2 - \left( \frac{k_i}{2m} \right)^2 \right], \tag{2.21} \]

where $D_C$ is the sum of the weights of edges incident to nodes belonging to group C (both internal and external edges with respect to group C), $D_C^{in}$ is the sum of the weights of edges inside group C, $k_i$ is the sum of the weights of edges incident to node i, $k_{i,C}$ is the sum of the weights of edges from node i to nodes in group C, and m is the sum of the weights of all edges in the network.

The algorithm of Blondel has built-in hierarchies: communities of communities are constructed during the process, and the height of the hierarchy is determined by the number of such passes. The results obtained by this algorithm depend on the order in which the nodes are considered: the influence on the final value of modularity is not strong, but the order affects the computation time.

2.5. Dynamics of networks

Dynamic networks are ubiquitous: in the literature [102] we can find many examples of networks with a time dimension, such as email communication [110], physical proximity of cell phones [60], protein interactions [94], air transportation networks [166], and mobility networks of animals [204].
Holme and Saramäki [102, 103] distinguished two classes of networks describing temporal relations (called temporal networks):
• contact sequence: a set of contacts represented by triples (i, j, t), where i, j ∈ V and t denotes time. An equivalent representation uses a set of edges E and, for each edge e ∈ E, a non-empty set of contact times $T_e = \{t_1, \ldots, t_n\}$. This representation is suited for modeling communication data, e.g. emails, phone calls or text messages.
• interval graphs: in this case the edges are active over a set of intervals $T_e = \{(t_1, t'_1), \ldots, (t_n, t'_n)\}$, where each pair of parentheses marks a period of activity (the unprimed times mark the beginning of an interval and the primed ones its end). Examples of such networks are proximity networks (modeling that individuals have been close to each other for some extent of time).
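The two representations above map naturally onto simple data structures. The sketch below converts a contact sequence into the equivalent per-edge sets of contact times and back, and queries an interval graph for the edges active at a given moment. The function names are hypothetical, edges are kept as ordered pairs exactly as given, and intervals are treated as closed at both ends; these are assumptions of the example rather than part of the definitions in [102, 103].

```python
from collections import defaultdict


def contacts_to_edge_times(contacts):
    """Contact sequence [(i, j, t), ...] -> {(i, j): sorted contact times T_e}."""
    times = defaultdict(set)
    for i, j, t in contacts:
        times[(i, j)].add(t)
    return {edge: sorted(ts) for edge, ts in times.items()}


def edge_times_to_contacts(edge_times):
    """Inverse mapping: {(i, j): [t1, ...]} -> contact triples sorted by time."""
    triples = ((i, j, t) for (i, j), ts in edge_times.items() for t in ts)
    return sorted(triples, key=lambda contact: contact[2])


def active_edges(edge_intervals, t):
    """Interval graph: edges having an interval (begin, end) with begin <= t <= end."""
    return [edge for edge, intervals in edge_intervals.items()
            if any(begin <= t <= end for begin, end in intervals)]


contacts = [("a", "b", 1), ("b", "c", 2), ("a", "b", 5)]
edge_times = contacts_to_edge_times(contacts)
print(edge_times)

proximity = {("a", "b"): [(0, 3), (7, 9)], ("b", "c"): [(2, 4)]}
print(active_edges(proximity, 2))
```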

Temporal networks are also called time-ordered networks by Blonder et al. [30]. These networks can be reduced to a series of static graphs containing the edges that appeared or lasted in a specific period. Such graphs are called time aggregated graphs [102], time-aggregated networks [30], snapshots or timesteps. When a temporal network is reduced to time-aggregated networks, a window is specified [50] (its length and, sometimes, its overlap with neighboring windows). The effect of the choice of time window size on the network was studied by Krings et al. [117] and Ribeiro et al. [180]. Blonder et al. [30] distinguished the following common problems regarding the time dimension of a network:
• varying the window size used to aggregate interactions may produce different topologies: with too short a window no individuals are connected, whereas with too long a window all individuals may appear connected [29];
• if networks change more rapidly than they are sampled, dramatic changes of topology may occur [198];
• simulated removal or addition of edges should not neglect changes in the topological dynamics of a network: many networks change their structure in response to perturbation [7];
• the ordering of events is important and affects the flow processes in the network: some paths are causally impossible, and some paths that appear short in terms of the number of edges may be long when the time delay is taken into account [101, 46].

Some patterns in evolving networks have been described in the literature [125, 126, 44, 116]. The most famous are:
• Densification Power Law: Leskovec et al. [125, 126] found that several real networks grow over time according to a power law relating the number of nodes N(t) at time t and the number of edges E(t):

\[ E(t) \propto N(t)^{\alpha}, \quad 1 \le \alpha \le 2, \tag{2.22} \]

where α is called the Densification Power Law Exponent and remains stable over time.
This confirms earlier empirical observations that the average degree of a graph grows in time [19].
• Shrinking Diameters: despite the growth of graphs, their diameter shrinks over time [125, 126].
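To make the Densification Power Law of Equation 2.22 concrete, the exponent α can be estimated from a series of snapshots as the slope of log E(t) against log N(t). The sketch below uses an ordinary least-squares fit on the log-log values; this estimator and the function name `densification_exponent` are assumptions chosen for illustration, not necessarily the procedure used in [125, 126], and the input series is synthetic.

```python
import math


def densification_exponent(num_nodes, num_edges):
    """Least-squares slope of log E(t) versus log N(t), i.e. an estimate of alpha."""
    xs = [math.log(n) for n in num_nodes]
    ys = [math.log(e) for e in num_edges]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two snapshots with different node counts")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic snapshots that follow E(t) = N(t)^1.5 exactly.
nodes_per_snapshot = [10, 100, 1000, 10000]
edges_per_snapshot = [n ** 1.5 for n in nodes_per_snapshot]
alpha = densification_exponent(nodes_per_snapshot, edges_per_snapshot)
print(round(alpha, 6))
```

An estimate close to 1 would mean that the average degree stays roughly constant, while values above 1 indicate densification, i.e. a growing average degree.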


Chapter 3. Selected contemporary methods related to Group Analysis in Social Networks

This chapter describes selected methods of Social Network Analysis concerning group dynamics, their evaluation, prediction and visualization. Moreover, the last section presents the basic methods of text mining and their application to Social Network Analysis.

3.1. Group dynamics in social networks

In recent years, many methods to assess group dynamics in social networks have been proposed. Palla et al. [161] defined basic operations on groups:
• Growth: a group grows if new nodes join it,
• Contraction: a contraction takes place when some nodes leave a group,
• Merging: a merging occurs when two or more groups merge into a single one,
• Splitting: a splitting happens when one group splits into two or more groups,
• Birth: when a new group appears,
• Death: when a group disintegrates.

These fundamental events are present in most methods assessing group dynamics, with some modifications of naming and semantics. Communities in evolving networks can be modeled in one of two ways [41]:
• as a sequence of static communities: this approach is based on the discovery of communities in static snapshots of a network,

• as an initial static community and a sequence of its modifications: this approach relies on the model of a temporal network.

A detailed overview of different approaches to group dynamics is provided by Cazabet et al. [41] and by Aynaud et al. [13]. Here, the categorization of methods of community dynamics by Cazabet et al. [41] is presented. They divided these methods into the following categories:
1. independent community detection and matching: the most popular approach, which treats communities discovered in different timesteps independently. Its main advantages are the reuse of community detection techniques designed for static networks and the possibility of parallelizing the process. This widely used approach was applied in many works, e.g. Hopcroft et al. [104], Palla et al. [161], Wang et al. [212], Greene et al. [89] or Rosvall and Bergstrom [182]. Its drawback is its susceptibility to the instability of community detection methods (especially the stochastic ones): small modifications of a network can lead to very different results.
2. informed iterative community detection: in this case, the result from the previous timestep is also used to find communities in the current timestep. To find such communities, algorithms use different techniques, such as initialization with the communities found in the previous timestep [124], a metric that balances two factors, the quality of communities in the current timestep and their similarity to communities from the previous one [45, 132], or the creation of a weighted network for the current timestep in which the weights are calculated from the current timestep and the preceding timesteps [47, 219]. In this approach, traditional community discovery methods are no longer directly applicable. Moreover, community discovery cannot be parallelized across timesteps, which may make calculations slow on large networks with many timesteps.
3. global community detection on all timesteps: in this approach, all timesteps are studied simultaneously. This is done by creating a metric that can be optimized over many timesteps [204, 148, 14, 223], or by creating a special network containing all the instances of nodes from all timesteps and edges linking nodes in the same timestep or in different timesteps [23]. The main drawback of this approach is its high computational cost. Moreover, it is not an iterative algorithm, so when data for a new timestep arrive, the whole computation needs to be performed again.
4. dynamic community detection on temporal networks: this approach does not rely on processing timesteps but on temporal networks (however, the network is still divided into timesteps). First, communities are discovered; then only the modifications of the network are provided to the algorithm, and based on those

modifications, the algorithm attempts to deduce the changes in communities [61, 42, 40, 128, 193].

The first approach is widely used, and the methods proposed in this thesis fall into this class, so it is described in more detail. An analysis of group dynamics consists of the following steps (see Figure 3.1):
• dividing the dataset into timesteps (which can be disjoint or overlapping),
• detection of static communities in each timestep independently,
• matching of communities detected in different timesteps.

Most methods use some metric to match groups from one timestep to another. However, Palla et al. [160, 165] proposed a method in which a specific community detection algorithm (CPM) is used to perform the matching between groups from neighboring timesteps. In their work, a union of the networks from times t and t + 1 is built, and CPM community discovery is performed on this union network. Then, groups at time t + 1 that are joined with groups at time t on the union graph are considered to be their continuations.

In all approaches to group dynamics, an important problem is how to choose timesteps. Albano et al. [5] consider using the intrinsic time scale (measured by changes of the network state, e.g. by the number of nodes added or removed) as opposed to the external time scale (measured in seconds or other time units, which is commonly used without further consideration). Aynaud et al. [15] proposed a method of merging some timesteps into bigger ones if the changes are small enough not to change the communities discovered in the previous timesteps. Berlingerio et al. [24] detected clusters of timesteps interpreted as eras of evolution. They used a hierarchical clustering methodology to detect the turning points at the beginning of the eras.

Before the most popular methods of group dynamics are described, the formal model of group dynamics is presented.
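As an illustration of the first step listed above (dividing the dataset into disjoint or overlapping timesteps on the external time scale), the following sketch slices a list of timestamped interactions into fixed-length windows. The function name, the half-open window convention [start, end) and the `step` parameter controlling the overlap are assumptions made for this example only.

```python
def split_into_timesteps(contacts, window, step=None):
    """Slice (u, v, t) contacts into windows [start, start + window).

    step == window (the default) gives disjoint timesteps;
    0 < step < window gives overlapping timesteps.
    Returns a list of (start, end, set_of_edges) tuples.
    """
    step = window if step is None else step
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    contacts = sorted(contacts, key=lambda c: c[2])
    if not contacts:
        return []
    t_min, t_max = contacts[0][2], contacts[-1][2]
    timesteps = []
    start = t_min
    while start <= t_max:
        end = start + window
        edges = {(u, v) for u, v, t in contacts if start <= t < end}
        timesteps.append((start, end, edges))
        start += step
    return timesteps


contacts = [(1, 2, 0), (2, 3, 5), (1, 3, 12)]
disjoint = split_into_timesteps(contacts, window=10)
overlapping = split_into_timesteps(contacts, window=10, step=5)
print(disjoint)
print(overlapping)
```

Each resulting edge set can then be passed independently to any static community detection method, which is what makes this class of approaches easy to parallelize.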
Formal model of group dynamics

A complex network may be described using the standard definition of a graph:

\[ N = (V, E), \tag{3.1} \]

where $V \subset \mathbb{N}$ stands for a finite set of vertices (nodes), that is:

\[ V = \{ n : n \in \mathbb{N} \wedge n \le n_{max} \}, \tag{3.2} \]

where $n_{max}$ is the number of nodes, and $E \subset V \times V$ is a finite set of edges. In the case of a dynamic complex network, there is an additional time aspect, i.e. the network evolves with time. A dynamic complex network can be described using the following definition:

\[ N_D = (V_D, E_D) \tag{3.3} \]

[Figure 3.1: Illustration of group dynamics. Panels: 1. Time segmentation; 2. Social group discovery (Timesteps 1 to 3); 3. Group evolution (Timesteps 1 to 3, with transitions labeled split, continuation and merge).]

with a set of vertices $V_D$ that appear at any time in the course of the evolution of the dynamic complex network and a set of dynamic edges $E_D$:

\[ E_D = \{ (u, v, t_{start}, t_{end}) \}, \tag{3.4} \]

where $u, v \in V_D$ are interacting vertices and $t_{start}$ and $t_{end}$ are two timestamps determining the start and the end of the interaction (in most cases, such as commenting on other users' messages in social media, the start and the end are equal, but in general they

may be different, e.g. in the case of a phone call). We can also define a projection of a dynamic network $N_D$ onto a time range $\langle t_a, t_b \rangle$ as a static network $N_{a,b}$:

\[ N_{a,b} = (V', E'), \tag{3.5} \]

where

\[ E' = \{ (u_i, u_j, t_{start}, t_{end}) : \langle t_{start}, t_{end} \rangle \cap \langle t_a, t_b \rangle \neq \emptyset \} \tag{3.6} \]

and $V' \subset V_D$ such that $\forall\, u, v \in V' \; \exists\, e \in E' : e = (u, v, t_i, t_j) \vee e = (v, u, t_i, t_j)$.

A dynamic complex network $N_D$ can be represented as a series of timesteps $\bar{T}$:

\[ \bar{T} = \langle T_1, T_2, \ldots, T_s \rangle, \quad s \in \mathbb{N}^+. \tag{3.7} \]

Each timestep $T_i$ is a tuple:

\[ T_i = (t_{ia}, t_{ib}, N_{a,b}), \tag{3.8} \]

where $t_{ia}$ and $t_{ib}$ are the start and end times of the i-th timestep, respectively.

To provide means for observing the groups that are formed in a certain timestep, let us consider the following space of system states: $G = 2^V$. The elements of G are subsets of V. Now, observing the system in a timestep $T_i$, it may be seen that the set of vertices is decomposed into the following subsets called groups:

\[ G \ni G_i = \{ G_{i,k} \}, \quad i, k \in \mathbb{N}, \tag{3.9} \]

where each group may be described as:

\[ G_{i,k} = \{ u_1, \ldots, u_{max_{i,k}} \}, \tag{3.10} \]

where $u_1, \ldots, u_{max_{i,k}} \in V$ and $max_{i,k}$ stands for the maximum number of individuals in group k in timestep i. Note that the subsets (called groups) observed at a certain timestep i may contain the same elements (they may overlap).

The notation $T_j^i$ is used to refer to the j-th group in the i-th timestep. The number of members of group $T_j^i$ is denoted $|T_j^i|$ and the number of groups in timestep $T_i$ is denoted $|T_i|$. In the notation $T_j^i$: $1 \le i \le |\bar{T}|$, $1 \le j \le |T_i|$. The notation $V_j^i$ denotes the set of members of group $T_j^i$ and $E_j^i$ the set of edges of group $T_j^i$.

A group dynamics task can be described as the task of finding and naming transitions between groups from different timesteps. Transitions represent group continuations.
Frequently, transitions are found between groups from neighboring timesteps, but some methods also consider distant transitions (between groups from non-adjoining timesteps). A transition between groups A and B is denoted $t_{A,B}$. Formally, transition $t_{A,B}$ can be defined as a pair:

\[ t_{A,B} = (A, B), \quad A, B \subset 2^V, \quad A \cup B \neq \emptyset, \tag{3.11} \]

containing the groups referred to in this transition (groups A and B). As can be seen in Equation 3.11, one of the groups referred to in a transition may be an empty

group, which means that a non-empty group has no matching group in the future or in the past, depending on whether the empty group is in the second or the first position of the pair (denoting the dissolution or the formation of the group, respectively). An event name can also be associated with each transition $t_{A,B}$. Different methods may define in their own way the condition for the existence of a transition, the events (names and their semantics) and the method for assigning events to transitions. To sum up, the group dynamics task can be defined in the following way: given a series of timesteps $\bar{T}$ with groups discovered in each of them, find a set of transitions (and assign events to them) such that the transitions found model the underlying process of group changes (for real networks the underlying process is in most cases unknown).

Methods of group dynamics

The most popular methods of group dynamics, which were used in this thesis, are described below.

Asur et al. method

Asur et al. [12] defined a framework for analyzing the dynamics of groups. It is based on finding events between groups from consecutive timesteps. They defined the following events:
• Continue. A group $T_j^{i+1}$ is marked as a continuation of $T_k^i$ if they have the same members, i.e. $V_k^i = V_j^{i+1}$;
• κ-Merge. Two groups $T_k^i$ and $T_l^i$ are merged if there exists a group $T_m^{i+1}$ in the next timestep that contains at least κ% of the nodes from the merging groups, i.e. if

\[ \frac{|(V_k^i \cup V_l^i) \cap V_m^{i+1}|}{\max(|V_k^i \cup V_l^i|, |V_m^{i+1}|)} > \kappa\%, \quad \kappa \in [0, 100], \tag{3.12} \]

and $|V_k^i \cap V_m^{i+1}| > \frac{|V_k^i|}{2}$, $|V_l^i \cap V_m^{i+1}| > \frac{|V_l^i|}{2}$;
• κ-Split. A group $T_m^i$ is marked as split if κ% of the nodes of this group exist in two different groups $T_k^{i+1}$ and $T_l^{i+1}$ in the next timestep, i.e.

\[ \frac{|(V_k^{i+1} \cup V_l^{i+1}) \cap V_m^i|}{\max(|V_k^{i+1} \cup V_l^{i+1}|, |V_m^i|)} > \kappa\%, \quad \kappa \in [0, 100], \tag{3.13} \]

and $|V_k^{i+1} \cap V_m^i| > \frac{|V_k^{i+1}|}{2}$, $|V_l^{i+1} \cap V_m^i| > \frac{|V_l^{i+1}|}{2}$;
• Form.
A group $T_k^{i+1}$ is assigned a form event if no two nodes of group $T_k^{i+1}$ were together in a group in the previous timestep, i.e. $\nexists\, T_j^i : |V_k^{i+1} \cap V_j^i| > 1$;
• Dissolve. A group $T_k^i$ is marked as dissolved if no two nodes of group $T_k^i$ are together in the same group in the next timestep, i.e. $\nexists\, T_j^{i+1} : |V_k^i \cap V_j^{i+1}| > 1$.
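The five event definitions above translate almost directly into set operations. The sketch below is an illustration written for this text, not code from Asur et al.: groups of two consecutive timesteps are given as lists of node sets, and the function name `asur_events`, the tuple format of the returned events and the default κ = 50 are assumptions of the example.

```python
from itertools import combinations


def asur_events(prev, curr, kappa=50.0):
    """Detect events between groups of timestep i (prev) and i + 1 (curr).

    prev, curr: lists of sets of nodes. Events refer to groups by list index.
    """
    events = []
    # Continue: identical member sets in both timesteps.
    for k, a in enumerate(prev):
        for j, b in enumerate(curr):
            if a == b:
                events.append(("continue", k, j))
    # kappa-Merge (Equation 3.12): two earlier groups, one later group.
    for (k, a), (l, b) in combinations(enumerate(prev), 2):
        union = a | b
        for m, c in enumerate(curr):
            share = 100.0 * len(union & c) / max(len(union), len(c))
            if share > kappa and len(a & c) > len(a) / 2 and len(b & c) > len(b) / 2:
                events.append(("merge", (k, l), m))
    # kappa-Split (Equation 3.13): one earlier group, two later groups.
    for (k, a), (l, b) in combinations(enumerate(curr), 2):
        union = a | b
        for m, c in enumerate(prev):
            share = 100.0 * len(union & c) / max(len(union), len(c))
            if share > kappa and len(a & c) > len(a) / 2 and len(b & c) > len(b) / 2:
                events.append(("split", m, (k, l)))
    # Form: no two members were together in any earlier group.
    for k, b in enumerate(curr):
        if not any(len(b & a) > 1 for a in prev):
            events.append(("form", None, k))
    # Dissolve: no two members stay together in any later group.
    for k, a in enumerate(prev):
        if not any(len(a & b) > 1 for b in curr):
            events.append(("dissolve", k, None))
    return events


step_i = [{1, 2, 3}, {4, 5, 6}, {7, 8}]
step_next = [{1, 2, 3, 4, 5, 6}, {7, 8}, {9, 10}]
events = asur_events(step_i, step_next)
print(events)
```

In this toy example the first two groups merge, the third continues unchanged and a new group {9, 10} is formed. Note that the pairwise loops make the sketch quadratic in the number of groups per timestep.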

(37) The authors have not made an implementation of this method available.

GED method

Bródka et al. [38, 39] defined the GED (Group Evolution Discovery) method for finding events in evolving groups. The similarity of groups $G_1$, $G_2$ is calculated using the inclusion measure:
$$I(G_1, G_2) = \frac{|G_1 \cap G_2|}{|G_1|} \cdot \frac{\sum_{x \in (G_1 \cap G_2)} SP_{G_1}(x)}{\sum_{x \in G_1} SP_{G_1}(x)}, \tag{3.14}$$
where $|G_1|$ is the number of nodes in group $G_1$ and $SP$ is the Social Position metric [150] (for a node $x$), which is defined as follows:
$$SP_{n+1}(x) = (1 - \varepsilon) + \varepsilon \cdot \sum_{y \in V} SP_n(y) \cdot C(y \to x), \tag{3.15}$$
where $SP_n(x)$ and $SP_{n+1}(x)$ are the social positions of member $x$ after $n$ and $n + 1$ iterations, respectively, and $SP_0(x) = 1$ for each node $x \in V$; $V$ is the set of all nodes; $\varepsilon$ is a fixed coefficient from the range (0, 1); $C(y \to x)$ is the commitment function expressing the weight (strength of the relation) from node $y$ to $x$. Instead of social position, any other measure indicating the position of a member within a community can be used, e.g. degree centrality, betweenness centrality or PageRank (the authors in [186] conduct experiments with the social position and degree centrality measures).

In GED, the following events are defined (in the notation below, $I(A, B)$ is the inclusion measure for groups $A$ and $B$; $\alpha, \beta \in [0, 1]$ are thresholds):

• Continuing. A continuing takes place for groups $T_k^i$ and $T_m^{i+1}$ when these two groups have equal size and the inclusion metric calculated in both directions is greater than or equal to the predefined thresholds $\alpha, \beta \in [0, 1]$, i.e.:
$$I(T_k^i, T_m^{i+1}) \geq \alpha \;\wedge\; I(T_m^{i+1}, T_k^i) \geq \beta \;\wedge\; |T_k^i| = |T_m^{i+1}|; \tag{3.16}$$

• Shrinking. A shrinking occurs when a group $T_k^i$ is matched with only a single group $T_m^{i+1}$, the matching group is smaller and the inclusion metric calculated in both directions is greater than or equal to the predefined thresholds $\alpha, \beta$ (the matching group can also have the same size if the inclusion metric between $T_k^i$ and $T_m^{i+1}$ is below $\alpha$), i.e.
$$\Big[ \big( I(T_k^i, T_m^{i+1}) \geq \alpha \wedge I(T_m^{i+1}, T_k^i) \geq \beta \wedge |T_k^i| > |T_m^{i+1}| \big) \vee \big( I(T_k^i, T_m^{i+1}) < \alpha \wedge I(T_m^{i+1}, T_k^i) \geq \beta \wedge |T_k^i| \geq |T_m^{i+1}| \big) \Big] \wedge \nexists\, T_p^{i+1} \neq T_m^{i+1} : I(T_k^i, T_p^{i+1}) \geq \alpha \vee I(T_p^{i+1}, T_k^i) \geq \beta; \tag{3.17}$$

(38) • Growing. A growing occurs when a group $T_k^i$ is matched with only a single group $T_m^{i+1}$, the matching group is bigger and the inclusion metric calculated in both directions is greater than or equal to the predefined thresholds $\alpha, \beta$ (the matching group can also have the same size if the inclusion metric between $T_m^{i+1}$ and $T_k^i$ is below $\beta$), i.e.
$$\Big[ \big( I(T_k^i, T_m^{i+1}) \geq \alpha \wedge I(T_m^{i+1}, T_k^i) \geq \beta \wedge |T_k^i| < |T_m^{i+1}| \big) \vee \big( I(T_k^i, T_m^{i+1}) \geq \alpha \wedge I(T_m^{i+1}, T_k^i) < \beta \wedge |T_k^i| \leq |T_m^{i+1}| \big) \Big] \wedge \nexists\, T_p^{i+1} \neq T_m^{i+1} : I(T_k^i, T_p^{i+1}) \geq \alpha \vee I(T_p^{i+1}, T_k^i) \geq \beta; \tag{3.18}$$

• Splitting. A group $T_k^i$ splits into a group $T_m^{i+1}$ (among others) when $I(T_k^i, T_m^{i+1}) < \alpha$ but, in the reverse direction, $I(T_m^{i+1}, T_k^i) \geq \beta$, the size of group $T_k^i$ is greater than or equal to the size of its match, $|T_k^i| \geq |T_m^{i+1}|$, and that group has more than one match in timestep $T^{i+1}$:
$$I(T_k^i, T_m^{i+1}) < \alpha \wedge I(T_m^{i+1}, T_k^i) \geq \beta \wedge |T_k^i| \geq |T_m^{i+1}| \wedge \exists\, T_p^{i+1} \neq T_m^{i+1} : I(T_k^i, T_p^{i+1}) \geq \alpha \vee I(T_p^{i+1}, T_k^i) \geq \beta; \tag{3.19}$$

• Merging. A group $T_k^i$ (among others) merges into a group $T_m^{i+1}$ when the inclusion metric between $T_k^i$ and $T_m^{i+1}$ is greater than or equal to the threshold $\alpha$ but, in the reverse direction, it is below the threshold $\beta$, the size of group $T_k^i$ is smaller than or equal to the size of its match and that group has more than one match in timestep $T^{i+1}$:
$$I(T_k^i, T_m^{i+1}) \geq \alpha \wedge I(T_m^{i+1}, T_k^i) < \beta \wedge |T_k^i| \leq |T_m^{i+1}| \wedge \exists\, T_p^{i+1} \neq T_m^{i+1} : I(T_k^i, T_p^{i+1}) \geq \alpha \vee I(T_p^{i+1}, T_k^i) \geq \beta; \tag{3.20}$$

• Dissolving. A group $T_k^i$ dissolves if the inclusion metric between this group and every group in the next timestep is below 10%, i.e.
$$\forall\, T_m^{i+1} : I(T_k^i, T_m^{i+1}) < 10\% \wedge I(T_m^{i+1}, T_k^i) < 10\%; \tag{3.21}$$

• Forming. A group $T_m^{i+1}$ forms if the inclusion metric between this group and every group in the previous timestep is below 10%, i.e.
$$\forall\, T_k^i : I(T_k^i, T_m^{i+1}) < 10\% \wedge I(T_m^{i+1}, T_k^i) < 10\%. \tag{3.22}$$
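The GED rules above can be sketched in Python as follows. This is only an illustration under stated assumptions and not the original GED implementation: the function names `inclusion` and `ged_events`, the dictionary `position` (a single table of positive node scores standing in for the social position computed per group), the default thresholds and the handling of one group of timestep $i$ at a time are all hypothetical choices. A later group counts as a match when either inclusion value reaches its threshold, following the quantified part of the conditions above.

```python
def inclusion(g1, g2, position):
    """Inclusion I(g1, g2) in the spirit of (3.14).

    g1, g2: non-empty sets of nodes.
    position: dict node -> positive score (assumption: one shared table
    replaces the per-group social position used by the method).
    """
    common = g1 & g2
    quantity = len(common) / len(g1)
    quality = sum(position[x] for x in common) / sum(position[x] for x in g1)
    return quantity * quality


def ged_events(group, next_groups, position, alpha=0.5, beta=0.5, fd=0.1):
    """Return a list of (event, label) pairs for one group of timestep i.

    next_groups: dict label -> set of nodes for timestep i + 1.
    fd: threshold of the dissolving event (10% in the original method).
    """
    scores = {m: (inclusion(group, g, position), inclusion(g, group, position))
              for m, g in next_groups.items()}

    # Dissolving: both inclusions are below fd for every later group.
    if all(a < fd and b < fd for a, b in scores.values()):
        return [("dissolving", None)]

    # Later groups whose inclusion reaches a threshold in either direction.
    matched = sorted(m for m, (a, b) in scores.items()
                     if a >= alpha or b >= beta)
    single = len(matched) == 1  # True when no other matching group exists

    events = []
    for m in matched:
        a, b = scores[m]
        n1, n2 = len(group), len(next_groups[m])
        if a >= alpha and b >= beta and n1 == n2:
            events.append(("continuing", m))
        elif single and ((a >= alpha and b >= beta and n1 > n2)
                         or (a < alpha and b >= beta and n1 >= n2)):
            events.append(("shrinking", m))
        elif single and ((a >= alpha and b >= beta and n1 < n2)
                         or (a >= alpha and b < beta and n1 <= n2)):
            events.append(("growing", m))
        elif not single and a < alpha and b >= beta and n1 >= n2:
            events.append(("splitting", m))
        elif not single and a >= alpha and b < beta and n1 <= n2:
            events.append(("merging", m))
    return events
```

With all scores equal to 1, the group `{1, 2, 3, 4, 5, 6}` followed by the groups `{1, 2, 3}` and `{4, 5, 6}` yields two splitting events. The forming event is symmetric to dissolving and could be obtained by running the same check from a group of timestep $i + 1$ against the groups of timestep $i$.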
Although GED in its original form assumes a 10% threshold for the forming and dissolving events, various values of this threshold (named $fd$) are tested in the experimental section (Section 5.2.1).

Takaffoli et al. method

Takaffoli et al. [201, 202, 199] described the MODEC framework for modeling the evolution of communities. Their approach finds not only transitions between consecutive timesteps, but also transitions between distant

(39) timesteps. The communities are matched using the Community Similarity metric (for groups $A$ and $B$):
$$sim(A, B) = \begin{cases} \dfrac{|V(A) \cap V(B)|}{\max(|V(A)|, |V(B)|)}, & \text{if } \dfrac{|V(A) \cap V(B)|}{\max(|V(A)|, |V(B)|)} \geq k \\ 0, & \text{otherwise}, \end{cases} \tag{3.23}$$
where $V(A)$ is the set of nodes of group $A$ (similarly for group $B$) and $k \in [0, 1]$. According to this measure, two groups $A$ and $B$ are similar if their common members constitute at least a proportion $k$ of the larger group. It is used later for matching groups.

The MODEC framework consists of two parts:

• the community matching algorithm,

• the event assignment algorithm.

The community matching algorithm maximizes the total similarity of groups from one timestep to the next, taking absent groups into consideration. The algorithm is depicted in Pseudocode 1. In each iteration, an attempt is made to match the groups from timestep $i$ with those from timestep $i - 1$ (but only with those that have not yet been assigned a continuation); then, the groups from timestep $i$ that are still unmatched are matched against the unmatched groups from timestep $i - 2$, and so on. Community matching is performed by constructing a weighted bipartite graph between the groups from the two analyzed timesteps (the weight between groups is calculated using the Community Similarity metric) and then applying maximum weight bipartite matching [118] to connect the groups from these two timesteps.

A function match can be defined (it is used later in the definitions of events) that, for a given group $T_k^i$ and a timestep number $j$, returns the matched group $T_m^j$ if such a matching exists, or the empty set if it does not (operation $2^V \times \mathbb{N} \to 2^V$):
$$match(T_k^i, j) = \begin{cases} T_m^j, & \text{if } next(T_k^i) = T_m^j \\ \emptyset, & \text{otherwise}. \end{cases} \tag{3.24}$$

The authors call a series of groups from different timesteps that are matched by transitions a meta community: $M = \{T_p^i, \ldots, T_r^{i+m}\}$ such that for each two consecutive groups $T_p^i$ and $T_q^j$ in this series, $next(T_p^i) = T_q^j$.

In MODEC, the following events are defined:

• Form. A group $T_p^i$ is assigned a form event if there is no match for it in any of the previous timesteps:
$$\forall j < i : match(T_p^i, j) = \emptyset; \tag{3.25}$$

• Dissolve. A group $T_p^i$ is marked as dissolved if there is no match for it in any of the next timesteps:
$$\forall j > i : match(T_p^i, j) = \emptyset; \tag{3.26}$$

(40) • Survive. A group $T_p^i$ survives if there exists a timestep $j > i$ that contains a community match for $T_p^i$:
$$\exists j > i \; \exists\, T_q^j : match(T_p^i, j) = T_q^j; \tag{3.27}$$

• Split. A group $T_p^i$ splits into a set of groups $T^{j*}$ from timestep $T^j$ if at least a proportion $k$ of the members of each group in $T^{j*}$ comes from group $T_p^i$, and the members shared by the union of the target groups in $T^{j*}$ and the source group $T_p^i$ make up at least a proportion $k$ of the source group (to prevent the case in which most of the members of $T_p^i$ leave the network):

$$\exists j > i \; \exists\, T^{j*} \subset T^j : \forall\, T_r^j \in T^{j*} \quad \frac{|V_p^i \cap V_r^j|}{|V_r^j|} \geq k \;\wedge\; \frac{\left| \left( \bigcup_{T_r^j \in T^{j*}} V_r^j \right) \cap V_p^i \right|}{|V_p^i|} \geq k, \quad k \in [0, 1]; \tag{3.28}$$

• Merge. A set of groups $T^{i*}$ is merged into $T_q^j$ if $T_q^j$ contains at least a proportion $k$ of the members of each group in $T^{i*}$, and the members shared by the union of the source groups in $T^{i*}$ and $T_q^j$ make up at least a proportion $k$ of the target group (to prevent the case in which most of the members of the target group were not present before):

$$\exists j > i \; \exists\, T^{i*} \subset T^i : \forall\, T_r^i \in T^{i*} \quad \frac{|V_q^j \cap V_r^i|}{|V_r^i|} \geq k \;\wedge\; \frac{\left| \left( \bigcup_{T_r^i \in T^{i*}} V_r^i \right) \cap V_q^j \right|}{|V_q^j|} \geq k$$
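The Community Similarity metric and the split and merge conditions of MODEC described above can be sketched in Python as follows. This is a minimal illustration and not the MODEC implementation: the function names, the representation of groups as node sets, the default value of `k` and the requirement that a split or merge involves at least two groups are assumptions of this example. The matching step itself (maximum weight bipartite matching) is omitted.

```python
def community_similarity(a, b, k=0.5):
    """Community Similarity (3.23) for two non-empty node sets a and b."""
    ratio = len(a & b) / max(len(a), len(b))
    return ratio if ratio >= k else 0.0


def is_split(source, targets, k=0.5):
    """Check the split condition for a source group and later target groups.

    source: set of nodes of the group at timestep i.
    targets: list of node sets taken from one later timestep j > i.
    Assumption of this sketch: a split needs at least two target groups.
    """
    if len(targets) < 2:
        return False
    # Each target group must draw at least a proportion k from the source.
    each_from_source = all(len(t & source) / len(t) >= k for t in targets)
    # The targets together must retain at least a proportion k of the source.
    union = set().union(*targets)
    source_retained = len(union & source) / len(source) >= k
    return each_from_source and source_retained


def is_merge(sources, target, k=0.5):
    """Check the merge condition for earlier source groups and a target group.

    sources: list of node sets taken from timestep i.
    target: set of nodes of the group at a later timestep j > i.
    Assumption of this sketch: a merge needs at least two source groups.
    """
    if len(sources) < 2:
        return False
    # The target must contain at least a proportion k of each source group.
    each_in_target = all(len(s & target) / len(s) >= k for s in sources)
    # The sources together must cover at least a proportion k of the target.
    union = set().union(*sources)
    target_covered = len(union & target) / len(target) >= k
    return each_in_target and target_covered
```

For example, `is_merge([{1, 2}, {3, 4}], set(range(1, 11)))` is false for `k = 0.5`: both source groups are fully contained in the target, but together they account for only 4 of its 10 members, which is the case the second condition is meant to exclude.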