
DELFT UNIVERSITY OF TECHNOLOGY

REPORT 13-15

The eigenvectors corresponding to the second eigenvalue of the Google matrix and their relation to link spamming

Alex Sangers and Martin B. van Gijzen

ISSN 1389-6520

Reports of the Department of Applied Mathematical Analysis Delft 2013


Copyright © 2013 by Department of Applied Mathematical Analysis, Delft, The Netherlands. No part of the Journal may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission from Department of Applied Mathematical Analysis, Delft University of Technology, The Netherlands.


The eigenvectors corresponding to the second eigenvalue of the Google matrix and their relation to link spamming

A. Sangers∗ and M.B. van Gijzen†

December 10, 2013

Abstract

Google uses the PageRank algorithm to determine the relative importance of a website. Link spamming is the name for putting links between websites with no other purpose than to increase the PageRank value of a website. To give a fair result to a search query it is important to detect whether a website is link spammed so that it can be filtered out of the search result.

While the dominant eigenvector of the Google matrix determines the PageRank value, the second eigenvector can be used to detect a certain type of link spamming. We will describe an efficient algorithm for computing a complete set of independent eigenvectors for the second eigenvalue, and explain how this algorithm can be used to detect link spamming. We will illustrate the performance of the algorithm on web crawls of millions of pages.

Keywords: Google PageRank, link spamming, second eigenvector, Markov chains, irreducible closed subsets.

1 Introduction

Google’s PageRank algorithm aims to return the best ranking of websites when searching on the web. The PageRank model assumes that a web surfer randomly follows one of the outgoing hyperlinks of the current website with chance p, or jumps to a random website with chance 1 − p. Mathematically this can be modeled by a Markov chain. The PageRank of a website is the probability of being at that website in the stationary distribution of the Markov chain. This stationary distribution is given by the first eigenvector of the transition matrix of the Markov chain.

According to Haveliwala and Kamvar [6] the eigenvectors for the second eigenvalue are also of importance: they can be used to detect link spam. Link spam is the name for putting links between web pages with no other purpose than to increase the PageRank of a website. Specifically, in the conclusions of [6] Haveliwala and Kamvar state that “The eigenvectors corresponding to the second eigenvalue λ_2 = p are an artifact of certain structures in the web graph. In particular, each pair of leaf nodes in the SCC graph for the chain P corresponds to an eigenvector of A with eigenvalue p. These leaf nodes in the SCC are those subgraphs in the web link graph

∗Delft University of Technology, Delft Institute of Applied Mathematics, Mekelweg 4, 2628 CD, The Netherlands. E-mail: A.Sangers@student.tudelft.nl

†Delft University of Technology, Delft Institute of Applied Mathematics, Mekelweg 4, 2628 CD, The Netherlands. E-mail: M.B.vanGijzen@tudelft.nl


which have incoming edges, but have no edges to other components. Link spammers often generate such structures in attempts to hoard rank. Analysis of the nonprincipal eigenvectors of A may lead to strategies for combating link spam.”

In this paper we will explain this remark. We will review the theory about the second eigenvalue of the Google matrix that is described in [5] and in [6] and extend it with results for the corresponding eigenvectors. We will use our findings to propose an efficient algorithm to detect these structures in the web that may indicate link spamming. We will illustrate the performance of the algorithm on web crawls containing several millions of pages.

The structure of this paper is as follows. Section 2 explains the structure of the Google matrix and gives different methods for computing the PageRank. Section 3 discusses the relation between irreducible closed subsets in a graph and link spamming. Section 4 gives the relevant theory for the second eigenvalue and the corresponding eigenvectors of the Google matrix. It also explains how the second eigenvectors are related to the irreducible closed subsets. Section 5 describes two algorithms for computing the second eigenvectors. Section 6 compares the performance of the algorithms on web crawls of several millions of pages. Section 7 summarizes our findings and makes some concluding remarks.

Remarks on notation and terminology: The terms ‘web sites’, ‘web pages’ and ‘nodes’ as well as the terms ‘hyperlinks’ and ‘web links’ are used interchangeably.

The i-th eigenvector is written as x^{(i)} and the j-th element of vector x is written as x_j. A submatrix of matrix A will be denoted by A_{ij} and an element of A by a_{ij}.

2 The Google Matrix

We introduce W, the set of web pages, which are connected to each other by hyperlinks, i.e., incoming and outgoing links between web pages. The mathematical representation of W is a directed graph, in which a directed edge between two nodes of the graph represents a hyperlink between the corresponding web pages.

Let n be the number of websites. Further, let G be the n-by-n connectivity matrix with g_{ij} = 1 if there is an outgoing hyperlink from page j to page i and g_{ij} = 0 otherwise. G is the matrix representation of W. The number of websites n is extremely large, hundreds of millions, while every website only contains a few outgoing links. The matrix G is therefore large and sparse.

We denote by c_j the column sums of G, that is, c_j = \sum_i g_{ij}. Note that c_j is the number of outgoing hyperlinks of website j. We will also call this the out-degree of page j.
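To fix this notation in code, the following is a minimal Matlab sketch of how G and the out-degrees c_j could be set up; the edge list `from`, `to` and the function name are hypothetical and not part of the original report.

```matlab
% Minimal sketch (hypothetical inputs): from(k) -> to(k) is the k-th hyperlink
% and n is the number of pages. With the convention above, g_ij = 1 if page j
% links to page i, so column j of G holds the outlinks of page j.
function [G, c] = connectivity_matrix(from, to, n)
    G = sparse(to, from, 1, n, n);   % entry (i,j) accumulates the links j -> i
    G = spones(G);                   % collapse duplicate links to a single 1
    G(1:n+1:end) = 0;                % self-referencing nodes are not allowed
    c = full(sum(G, 1))';            % out-degree c_j is the j-th column sum
end
```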

Surfing the web can be modeled as a Markov process, where one state transitions into another state by following hyperlinks. In order to model this process we introduce the row-stochastic matrix P. The entries p_{ji} of P are given by

p_{ji} = \begin{cases} g_{ij}/c_j & \text{if } c_j \neq 0, \\ 1/n & \text{if } c_j = 0. \end{cases}   (2.1)

Note that P^T is the column-stochastic transition probability matrix of the Markov process. Nodes without outgoing hyperlinks are called dangling nodes. From (2.1) it follows that from a dangling node all pages can be reached with equal probability. Following [9], we assume that self-referencing nodes, i.e., g_{ii} = 1 for node i, are not allowed.


The above Markov process does not capture the possibility that a web surfer jumps to another page without following an outlink. To include this behavior, called teleportation, Google’s PageRank model assumes that an outlink is followed with chance p and a jump to a random page is made with chance 1 − p. Typically, p is chosen between 0.85 and 0.99.

Let A be the n-by-n column-stochastic transition matrix of this Markov process that includes teleportation. The elements a_{ij} of this matrix are given by

a_{ij} = \begin{cases} p\, g_{ij}/c_j + (1-p)/n & \text{if } c_j \neq 0, \\ 1/n & \text{if } c_j = 0. \end{cases}

In matrix notation this can be written as

A = pP^T + \frac{1-p}{n}\, ee^T,

where e is the n-vector of all ones. Also, recognize that if page j is a dangling node, then each page has a chance 1/n to be chosen. Thus, if column a_j = e/n, then page j is a dangling node.

By introducing the diagonal matrix D, of which the main diagonal elements d_{jj} are defined by

d_{jj} = \begin{cases} 1/c_j & \text{if } c_j \neq 0, \\ 0 & \text{if } c_j = 0, \end{cases}

and by defining the vector z with coefficients z_j given by

z_j = \begin{cases} (1-p)/n & \text{if } c_j \neq 0, \\ 1/n & \text{if } c_j = 0, \end{cases}

the matrix A can also be written as

A = pGD + ez^T.

The matrix ez^T accounts for teleportation. Note that as a consequence of this teleportation matrix, A is positive, meaning that every entry is positive, and is irreducible.
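As an illustration of how A can be applied without ever forming it explicitly, here is a small Matlab sketch that builds D and z and returns a function handle for the product Ax; the function name is hypothetical, and G, c and p are assumed to be the connectivity matrix, the out-degree vector, and the teleportation parameter from above.

```matlab
% Sketch: form D and z and apply A = pGD + e z^T through a function handle,
% so that only the sparse matrix G needs to be stored.
function Afun = google_operator(G, c, p)
    n = size(G, 1);
    d = zeros(n, 1);
    d(c ~= 0) = 1 ./ c(c ~= 0);            % d_jj = 1/c_j, and 0 for dangling nodes
    D = spdiags(d, 0, n, n);
    z = (1 - p) / n * ones(n, 1);
    z(c == 0) = 1 / n;                     % dangling columns get weight 1/n
    Afun = @(x) p * (G * (D * x)) + ones(n, 1) * (z' * x);   % A*x = pGDx + e(z'x)
end
```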

The PageRank is determined as the eigenvector corresponding to the dominant eigenvalue λ_1 in the eigenvalue problem

A x^{(1)} = \lambda_1 x^{(1)}.

Intuitively, when recalling the random web surfer from Section 1, the eigenvector x(1) is the distribution of the visiting frequency for each node. The more often the surfer passes node j, the higher its PageRank will be.

The matrix A has a simple dominant eigenvalue, with corresponding positive eigenvector x^{(1)}. This follows from the well-known Perron-Frobenius theorem (see e.g. [8]) for irreducible, square, nonnegative matrices. A nonnegative matrix is a matrix of which all entries are nonnegative.

Theorem 2.1. (Perron-Frobenius) Let A be a square irreducible nonnegative matrix. Then A has a unique positive real eigenvalue λ_1 equal to its spectral radius. The eigenvector corresponding to λ_1 is positive. If A is positive, then λ_1 is dominant.

It can be shown [8] that the dominant eigenvalue λ_1 satisfies the following inequalities:

\min_j \sum_i a_{ij} \;\leq\; \lambda_1 \;\leq\; \max_j \sum_i a_{ij}.


All column sums of A are equal to one, so it immediately follows that λ_1 = 1. Since λ_1 is a simple eigenvalue,

x^{(1)} = A x^{(1)}   (2.2)

has a unique solution up to a scaling factor. If this scaling factor is chosen such that \sum_i x^{(1)}_i = 1 (or, by positivity, \|x^{(1)}\|_1 = 1), then x^{(1)} is the stationary stochastic vector of the Markov chain and also the Google PageRank vector.

2.1 Computing the PageRank vector

The most common way to solve the large eigenvalue problem in Equation (2.2) is the power method. The power method starts with a guess u_0 and then iteratively computes u_{k+1} = A u_k. After each iteration we scale u_{k+1} such that \|u_{k+1}\|_1 = 1, to make sure it sums up to 1 and thus is stochastic.

To perform a power iteration, only a matrix-vector multiplication with A needs to be performed. This operation can be performed cheaply as follows: u_{k+1} = pGD u_k + e(z^T u_k). We refer to [9] for more information.
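A minimal sketch of this power iteration, using the operator from the previous sketch; `tol` and `maxit` are hypothetical stopping parameters.

```matlab
% Sketch of the power method u_{k+1} = A u_k with the cheap matrix-vector
% product; the iterate is rescaled in the 1-norm so that it stays stochastic.
function u = pagerank_power(Afun, n, tol, maxit)
    u = ones(n, 1) / n;                 % stochastic starting guess
    for k = 1:maxit
        unew = Afun(u);
        unew = unew / norm(unew, 1);
        if norm(unew - u, 1) < tol
            u = unew;
            return
        end
        u = unew;
    end
end
```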

An alternative way to compute the PageRank is by rewriting Equation (2.2) as a linear system

(I − pGD)\, x^{(1)} = \beta e   (2.3)

with \beta = z^T x^{(1)}. Note that we do not know the value of the scalar \beta, but we can take \beta = 1 so that the equation can be solved explicitly. Then x^{(1)} can be rescaled so that \sum_i x^{(1)}_i = 1.
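A sketch of this linear-system formulation. The report solves (2.3) with IDR(1); as a simple stand-in, this sketch calls Matlab's built-in bicgstab on the operator x -> (I − pGD)x and rescales the solution afterwards. The function name and the choice of solver are assumptions of this sketch.

```matlab
% Sketch: solve (I - pGD) x = e (i.e. beta = 1) iteratively and rescale x so
% that its entries sum to one.
function x = pagerank_linsys(G, c, p, tol, maxit)
    n = size(G, 1);
    d = zeros(n, 1);
    d(c ~= 0) = 1 ./ c(c ~= 0);
    D = spdiags(d, 0, n, n);
    Mfun = @(x) x - p * (G * (D * x));        % (I - pGD) x
    x = bicgstab(Mfun, ones(n, 1), tol, maxit);
    x = x / norm(x, 1);                       % rescale: sum_i x_i = 1
end
```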

3 Irreducible closed subsets and link spamming

A typical technique to increase the PageRank of a group of websites is to create many inlinks to the group, and to remove all outlinks. In this way, it is easy for the random surfer to enter the group, but difficult to leave, since he can only escape from the group through teleportation. To illustrate this we consider the example given in Figure 1.

Figure 1: Simple directed graph.

The PageRank vector for this example is given by

x^{(1)T} = (0.318, 0.332, 0.087, 0.078, 0.061, 0.054, 0.070).

The nodes with the highest PageRank are numbers 1 and 2. Note that nodes 6 and 7 are dangling nodes. By definition, dangling nodes are connected to all other nodes.


Now we illustrate how to increase the PageRank of node 4. First we remove dangling node 7 by making a link back to node 4. Next we remove the outlink from node 4 to node 3. We refer to Figure 2 for the resulting graph.

Figure 2: Changes to improve the PageRank of node 4.

The PageRank vector after these modifications becomes

x^{(1)T} = (0.203, 0.209, 0.036, 0.246, 0.036, 0.036, 0.235).

Clearly, node 4 now has the highest PageRank.

To analyse this we will recall some well-known definitions.

Definition 3.1. A set of states S is a closed subset of the Markov chain corresponding to P^T if and only if i ∈ S and j ∉ S implies that p_{ji} = 0.

Definition 3.1 tells us that a subset S is closed if it is not possible to get out of S once you are in it. This means that any subset containing a dangling node cannot be closed, and in particular, a dangling node by itself cannot be a closed subset.

Definition 3.2. A set of states S is an irreducible closed subset of the Markov chain corresponding to P^T if and only if S is a closed subset, and no proper subset of S is a closed subset.

Let l be the number of irreducible closed subsets of P. Then we can rewrite P in canonical form ([8]) by renumbering the nodes:

P \sim \begin{pmatrix} T_{11} & T_{12} \\ 0 & T_{22} \end{pmatrix} =
\begin{pmatrix}
P_{11} & P_{12} & \cdots & P_{1r} & P_{1,r+1} & P_{1,r+2} & \cdots & P_{1m} \\
0      & P_{22} & \cdots & P_{2r} & P_{2,r+1} & P_{2,r+2} & \cdots & P_{2m} \\
\vdots &        & \ddots & \vdots & \vdots    & \vdots    &        & \vdots \\
0      & 0      & \cdots & P_{rr} & P_{r,r+1} & P_{r,r+2} & \cdots & P_{rm} \\
0      & 0      & \cdots & 0      & P_{r+1,r+1} & 0         & \cdots & 0 \\
0      & 0      & \cdots & 0      & 0           & P_{r+2,r+2} & \cdots & 0 \\
\vdots & \vdots &        & \vdots & \vdots      & \vdots      & \ddots & \vdots \\
0      & 0      & \cdots & 0      & 0           & 0           & \cdots & P_{mm}
\end{pmatrix},   (3.1)

where l = m − r and each P_{11}, . . . , P_{rr} is either irreducible or [0]_{1×1}, and P_{r+1,r+1}, . . . , P_{mm} are irreducible and closed. First, note that each P_{ij} is a submatrix of the n-by-n matrix P. Let us call the dimension of the block T_{11} r̃-by-r̃; thus, the dimension of the block T_{22} is (n − r̃)-by-(n − r̃).


3.1 Example

We illustrate the theory by the graph displayed in Figure 2. Firstly, we will renumber the nodes to get the canonical form as in (3.1). For a graphical representation of the renumbering, we refer to Figure 3.

Figure 3: Renumbering the nodes of Figure 2 to canonical form.

Thus, rewriting P to P_canon:

P = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & \frac{1}{3} & 0 & \frac{1}{3} & \frac{1}{3} & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & \frac{1}{3} & \frac{1}{3} & 0 & \frac{1}{3} & 0 \\
\frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} \\
0 & 0 & 0 & 1 & 0 & 0 & 0
\end{pmatrix}
\sim
\begin{pmatrix}
\frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} \\
\frac{1}{3} & 0 & \frac{1}{3} & 0 & 0 & 0 & \frac{1}{3} \\
0 & \frac{1}{3} & 0 & \frac{1}{3} & 0 & 0 & \frac{1}{3} \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}
= P_{\mathrm{canon}}.   (3.2)

Let us take a closer look at P_canon in (3.2). Firstly, we recognize the block of all zeros on the lower left side. Also, it is clear that we have two irreducible closed subsets (P_22 and P_33), which can be reached via T_12. However, T_12 includes all other nodes that are not in T_22 and thus, P_11 is the only block in the upper left part of P_canon (i.e., there are no nodes that do not lead to one of the irreducible closed subsets). Note that P_11 is irreducible, but not closed. P_22 and P_33 are irreducible and closed.
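A small Matlab check of this example, assuming the Bioinformatics Toolbox routine graphconncomp [1] that is also used in Section 5; the matrix is the P of Equation (3.2) in the original node numbering, so the two closed subsets found are {1,2} and {4,7}, which correspond to P_22 and P_33 after renumbering.

```matlab
% Verify the irreducible closed subsets of the example. P(i,j) is the
% transition probability from node i to node j.
n = 7;
P = sparse([ 0   1   0   0   0   0   0 ;
             1   0   0   0   0   0   0 ;
             0  1/3  0  1/3 1/3  0   0 ;
             0   0   0   0   0   0   1 ;
             0   0  1/3 1/3  0  1/3  0 ;
             ones(1,7)/7 ;               % dangling node 6 reaches every node
             0   0   0   1   0   0   0 ]);
[numScc, comp] = graphconncomp(P);       % Tarjan's algorithm
for k = 1:numScc
    S = find(comp == k);
    if nnz(P(S, setdiff(1:n, S))) == 0   % closed: no transitions leaving S
        fprintf('irreducible closed subset: %s\n', mat2str(S));
    end
end
```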

4 The second eigenvector and its relation to link spamming

To explain the relation of the second eigenvector to link spamming we review some results from [5] and [6]. The following lemma can be found in [6]:

Lemma 4.1. Every eigenvector x^{(2)} corresponding to the second eigenvalue of A is orthogonal to e: e^T x^{(2)} = 0.


Below we give a sketch of the proof. For the complete proof we refer to [6].

Proof. Since A is column stochastic, e is a left eigenvector of A corresponding to the dominant eigenvalue λ_1 = 1. The lemma follows from the fact that left and right eigenvectors corresponding to different eigenvalues are bi-orthogonal.

Lemma 4.1 gives rise to the following theorem.

Theorem 4.2. Every eigenvector x^{(2)} corresponding to the second eigenvalue of A is an eigenvector of P^T.

Proof. The second eigenvector x^{(2)} of A satisfies

\left( pP^T + \frac{1-p}{n}\, ee^T \right) x^{(2)} = \lambda_2 x^{(2)}.

Using Lemma 4.1 yields

P^T x^{(2)} = \frac{\lambda_2}{p}\, x^{(2)},

which proves the theorem.

The first left eigenvector(s) of P have a special structure, which becomes clear from the canonical form of P. We assume that T_{22} is non-empty. The eigenvector(s) corresponding to eigenvalue γ_i = 1 for P in canonical form satisfy

y^T P = \gamma_i y^T, \qquad
(y_1^T,\, y_2^T) \begin{pmatrix} T_{11} & T_{12} \\ 0 & T_{22} \end{pmatrix} = (y_1^T,\, y_2^T)
\;\Rightarrow\;
\begin{cases} y_1^T T_{11} = y_1^T, \\ y_1^T T_{12} + y_2^T T_{22} = y_2^T. \end{cases}   (4.1)

We know that (T_{11} − I) is non-singular, since |γ_i| < 1 for all eigenvalues γ_i of T_{11} (see [8], page 698). Therefore, Equation (4.1) implies y_1^T = 0. It follows that y_2^T T_{22} = y_2^T.

We get y_2^T (T_{22} − I) = 0, where (T_{22} − I) is singular, and thus y_2 is a left eigenvector of T_{22} corresponding to γ_i = 1. Each submatrix P_{r+j,r+j} (1 ≤ j ≤ l) in T_{22} is row-stochastic and therefore has eigenvalue 1. This leads to the following lemma ([7], page 126):

Lemma 4.3. The multiplicity of the eigenvalue 1 for P is equal to the number of irreducible closed subsets of P.

Let y_{r+j} be the dominant left eigenvector of P_{r+j,r+j}. Since P_{r+j,r+j} is irreducible and has only nonnegative entries, this eigenvector can be scaled to be positive and stochastic by Theorem 2.1. Let ȳ_{r+j} be the vector that results from padding the stochastic vector y_{r+j} with zeros to get the appropriate size n. Every dominant left eigenvector of P can be written as a linear combination of the vectors ȳ_{r+j}, j = 1, . . . , m − r:

y = \sum_{j=1}^{m-r} \alpha_j\, \bar{y}_{r+j}, \qquad \alpha_j \in \mathbb{R}.   (4.2)

Using Lemma 4.1 and Theorem 4.2 we can now construct m − r − 1 independent second eigenvectors x^{(2)} ⊥ e for A:

x^{(2)} = \bar{y}_{r+j} − \bar{y}_{r+j+1}, \qquad j = 1, \dots, m − r − 1.   (4.3)

Here we have assumed that there are at least two irreducible closed subsets, and we used that the eigenvectors y_{r+j} are stochastic.

The following theorem, which can be found in [5] and [6], is a direct consequence of the discussion above.

Theorem 4.4. If P^T has at least two irreducible closed subsets, then the second eigenvalue of A is λ_2 = p, with 1 − p the teleportation chance as introduced in Section 2.

These second eigenvectors of A have a special nonzero structure, which is characterized by Theorem 4.5.

Theorem 4.5. Let x^{(2)} = (x_1, . . . , x_n)^T be an eigenvector of A corresponding to the eigenvalue p. Then x_j = 0 if node j does not belong to an irreducible closed subset.

Proof. The proof follows from Equations (4.2) and (4.3).

5 Computation of all the eigenvectors that correspond to the second eigenvalue of A

In this section we assume that we have a set W of websites with at least two irreducible closed subsets, so we know that p is the second eigenvalue of A. We will present two algorithms for computing all the eigenvectors that correspond to this eigenvalue.

5.1 Computation of the eigenvectors for eigenvalue p of A by computing all the irreducible closed subsets of W

The first algorithm computes the eigenvectors for eigenvalue p of A by computing all the irreducible closed subsets of W. As we mentioned before, a directed graph is irreducible if, given any two nodes, there exists a directed path from the first node to the second. This is equivalent to the directed graph being strongly connected. Determining all the strongly connected components in the graph for W therefore allows us to determine the submatrices P_{ii} in Equation (3.1). Whether P_{ii} corresponds to a closed subset can be determined by inspecting whether there are outlinks from the subset corresponding to P_{ii}. There are no outlinks from this set if P_{ij} = O, j = 1, . . . , m, j ≠ i. Several efficient algorithms exist for determining these strongly connected components. One of the most efficient ones is Tarjan's algorithm [11]. An efficient Matlab routine that implements Tarjan's algorithm is graphconncomp [1].

Once the m − r submatrices P_{r+j,r+j} have been determined, we can compute their dominant left eigenvectors y_{r+j}. This can be done by solving the homogeneous equation

(P_{r+j,r+j}^T − I)\, y_{r+j} = 0.   (5.1)

A technique to compute a solution is to apply an unpreconditioned Krylov subspace method to this system with a nonzero initial guess x_0. In our experiments we use IDR(s) [12] to solve Equation (5.1). The vector y_{r+j} must be normalized to make it stochastic and padded with zeros to give ȳ_{r+j}. The m − r − 1 eigenvectors x^{(2)} of A then follow from Equation (4.3).

We will refer to the resulting algorithm as the Tarjan-based algorithm. It is summarized as follows:


1. Apply Tarjan's algorithm to the graph W. The strongly connected components without outlinks are the irreducible closed subsets;

2. Form the matrices P_{r+j,r+j} that correspond to the irreducible closed subsets;

3. Compute the dominant left eigenvectors y_{r+j} of the matrices P_{r+j,r+j} by solving Equation (5.1), scale them to make them stochastic, and pad them with zeros to the appropriate size. This results in the vectors ȳ_{r+j};

4. Combine the vectors ȳ_{r+j} pairwise using Equation (4.3) to compute the second eigenvectors of A.

Remarks: To detect link spamming, only the irreducible closed subsets need to be computed in step 1. The eigenvectors for the second eigenvalue as computed in step 4 are sparse; the total number of nonzeros in these vectors cannot exceed 2n.
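A Matlab sketch of the Tarjan-based algorithm. It uses graphconncomp [1] for step 1; components that contain a dangling node are skipped explicitly, since in the Markov chain a dangling node links to every page and can therefore not be part of a closed subset (Section 3). For simplicity the dominant left eigenvectors are computed here with a dense eig instead of the IDR(s) solve of (5.1); this, and the function name, are assumptions of the sketch.

```matlab
% Sketch of the Tarjan-based algorithm: G is the sparse connectivity matrix
% and c the vector of out-degrees; X2 collects the second eigenvectors of A.
function X2 = second_eigenvectors_tarjan(G, c)
    n = size(G, 1);
    Adj = spones(G');                        % Adj(i,j) ~= 0 iff page i links to page j
    [numScc, comp] = graphconncomp(Adj);     % step 1: Tarjan's algorithm
    Ybar = zeros(n, 0);
    for k = 1:numScc
        S = find(comp == k);
        if any(c(S) == 0) || nnz(Adj(S, setdiff(1:n, S))) > 0
            continue                         % not closed: dangling node, or outlinks leave S
        end
        Psub = full(Adj(S, S));              % step 2: the block P_{r+j,r+j}
        Psub = bsxfun(@rdivide, Psub, sum(Psub, 2));
        [V, E] = eig(Psub');                 % step 3: dominant left eigenvector
        [~, idx] = max(real(diag(E)));       % the eigenvalue 1
        y = abs(V(:, idx));
        Ybar(:, end+1) = 0;                  %#ok<AGROW>
        Ybar(S, end) = y / norm(y, 1);       % stochastic and padded with zeros
    end
    X2 = Ybar(:, 1:end-1) - Ybar(:, 2:end);  % step 4: Equation (4.3)
end
```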

5.2 Computation of all the eigenvectors for eigenvalue p of A by computing one second eigenvector of A

The second algorithm that we present uses the nonzero structure of the second eigenvectors of A that is given in Theorem 4.5. Nonzero components of the second eigenvector correspond to nodes in an irreducible closed subset. The idea is to compute one second eigenvector and determine all the nonzero elements. An arbitrary second eigenvector of A has, with high probability, nonzero values in all the entries that correspond to nodes in irreducible closed subsets. The second eigenvectors of A are eigenvectors of P^T corresponding to the eigenvalue 1. One second eigenvector of A can therefore be computed by solving the homogeneous system

(P^T − I)\, y = 0.   (5.2)

To detect which nodes are in the same irreducible closed subset, we form a directed graph that consists only of the nodes that correspond to nonzero values in y. We apply Tarjan's algorithm to this graph, which is of much smaller size than the original graph W. The strongly connected components in this graph correspond to irreducible closed subsets. Once we have found all the nodes that constitute an irreducible closed subset, we can form the corresponding matrix P_{r+j,r+j}. Of each of these matrices we compute the dominant left eigenvector y_{r+j}, and these vectors are then combined into second eigenvectors of A using Equation (4.3).

We will refer to the resulting algorithm as the eigenvector-based algorithm. It is summarized as follows:

1. Compute one dominant eigenvector of P^T by solving Equation (5.2). This can be done by applying a Krylov method to the homogeneous system with a nonzero initial guess;

2. Determine the nonzero coefficients;

3. Apply Tarjan's algorithm to the graph formed by the nonzero nodes. The strongly connected components in this graph are irreducible closed subsets;

4. Form the matrices P_{r+j,r+j} that correspond to the irreducible closed subsets;

5. Compute the dominant left eigenvectors y_{r+j} of the matrices P_{r+j,r+j} by solving Equation (5.1), scale them to make them stochastic, and pad them with zeros to the appropriate size. This results in the vectors ȳ_{r+j};

6. Combine the vectors ȳ_{r+j} pairwise using Equation (4.3) to compute the second eigenvectors of A.


Test problem         Size      IDR(1) iterations   CPU time
wb-cs-stanford       9914      57                  0.2
flickr               820878    45                  18.6
wikipedia-20051105   1634989   63                  40.7
wikipedia-20060925   2983494   68                  73.6
wikipedia-20061104   3148440   63                  66.5
wikipedia-20070206   3566907   66                  78.0
wb-edu               9845725   103                 152.4

Table 1: Iterations and CPU-time to compute the first eigenvector.


Remarks: To detect link spamming, only the irreducible closed subsets need to be computed in steps 1-3. The eigenvectors for the second eigenvalue as computed in step 6 are sparse; the total number of nonzeros in these vectors cannot exceed 2n.
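A Matlab sketch of the eigenvector-based algorithm. As a stand-in for the IDR(s) solve of the singular system (5.2), it runs power iterations with the averaged operator (P^T + I)/2, which has the same eigenvalue-1 eigenvectors as P^T but damps all other eigenvalues; entries of the result below a threshold are treated as zero. The number of iterations, the threshold and the function name are assumptions of this sketch.

```matlab
% Sketch of the eigenvector-based algorithm: G is the sparse connectivity
% matrix, c the out-degree vector; X2 collects the second eigenvectors of A.
function X2 = second_eigenvectors_eigvec(G, c, iters, threshold)
    n = size(G, 1);
    d = zeros(n, 1); d(c ~= 0) = 1 ./ c(c ~= 0);
    D = spdiags(d, 0, n, n);
    dangling = (c == 0);
    Ptfun = @(x) G * (D * x) + sum(x(dangling)) / n;     % x -> P' * x
    y = rand(n, 1); y = y / norm(y, 1);                  % nonzero starting vector
    for k = 1:iters                                      % step 1: a first eigenvector of P'
        y = 0.5 * (y + Ptfun(y));
        y = y / norm(y, 1);
    end
    nodes = find(y > threshold);                         % step 2: the nonzero coefficients
    Adj = spones(G(nodes, nodes)');                      % step 3: SCCs of the small subgraph
    [numScc, comp] = graphconncomp(Adj);
    Ybar = zeros(n, numScc);
    for k = 1:numScc                                     % steps 4-5: dominant left eigenvectors
        S = nodes(comp == k);
        if any(c(S) == 0) || nnz(G(setdiff(1:n, S), S)) > 0
            continue                                     % safeguard against spurious components
        end
        Psub = full(G(S, S)');
        Psub = bsxfun(@rdivide, Psub, sum(Psub, 2));
        [V, E] = eig(Psub');
        [~, idx] = max(real(diag(E)));
        yk = abs(V(:, idx));
        Ybar(S, k) = yk / norm(yk, 1);
    end
    X2 = Ybar(:, 1:end-1) - Ybar(:, 2:end);              % step 6: Equation (4.3)
end
```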

6 Numerical experiments

As test problems we consider 7 matrices from the University of Florida Sparse Matrix Collection [4]. These matrices correspond to web crawls and have been contributed by David Gleich. The problem sizes range from approximately 10^3 pages for the smallest test problem to 10^7 pages for the largest problem. The connectivity matrices G as included in the Florida Sparse Matrix Collection are defined as g_{ij} = 1 if page i links to page j, which corresponds to the reversed direction with respect to the definition we use for the matrix G. Moreover, the main diagonal elements of the matrices G are not all zero. Since we do not allow self-referencing, we set the main diagonal elements to zero. The matrices are therefore pre-processed as follows:

G = G^T − diag(G).
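A short sketch of this pre-processing step. It assumes the raw crawl matrix is available as Problem.A, for instance through the UFget/ssget interface to the collection [4]; the problem identifier below is only an assumed example.

```matlab
% Pre-processing of a test matrix: transpose to the convention of Section 2
% and remove self-referencing links.
Problem = UFget('Gleich/wb-cs-stanford');   % hypothetical example identifier
Graw = spones(Problem.A);      % raw convention: entry (i,j) = 1 if page i links to page j
G = Graw';                     % reverse the direction: g_ij = 1 if page j links to page i
n = size(G, 1);
G(1:n+1:end) = 0;              % set the main diagonal to zero (no self-referencing)
c = full(sum(G, 1))';          % out-degrees
```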

All computations have been performed using Matlab 7.13 on a workstation with 32 GB of memory and equipped with an 8 core Xeon processor.

We first determine the PageRank by solving system (2.3) using IDR(1). As termination criterion we use

\frac{\|r_i\|}{\|r_0\|} < 10^{-8},

in which r_i is the residual after i iterations. These systems are very well conditioned, which means that the convergence of IDR(s) is not influenced much by the choice of s. For this reason we have selected s = 1, the choice with the lowest vector overhead. Table 1 gives in the first column the name of the test problem, in the second column the size of the matrix (the number of pages), in the third column the number of IDR(1) iterations, and in the fourth column the CPU-times. Note that the number of IDR(1) iterations only depends very mildly on the problem size.

We have applied the two algorithms of the previous section to detect the irreducible closed subsets and the second eigenvectors of A. Table 2 gives the results for Tarjan's algorithm and Table 3 for the eigenvector-based algorithm.

System (5.2) can be quite ill-conditioned. For this reason, we use IDR(4) to solve Equation (5.2), which gives a considerable reduction of the number of iterations compared to IDR(1).


Test problem         Size      Number of irreducible closed subsets   CPU-time
wb-cs-stanford       9914      113                                    0.3
flickr               820878    5394                                   399.3
wikipedia-20051105   1634989   68                                     1515.3
wikipedia-20060925   2983494   63                                     5077.1
wikipedia-20061104   3148440   59                                     5696.9
wikipedia-20070206   3566907   58                                     7462.7
wb-edu               9845725   49573                                  75703.2

Table 2: Results for Tarjan’s algorithm to compute irreducible closed subsets.

Test problem         Size      Number of irreducible closed subsets   IDR(4) iterations   CPU-time
wb-cs-stanford       9914      350                                    113                 1.4
flickr               820878    5394                                   312                 160.8
wikipedia-20051105   1634989   68                                     169                 140.2
wikipedia-20060925   2983494   63                                     126                 166.6
wikipedia-20061104   3148440   59                                     112                 155.1
wikipedia-20070206   3566907   58                                     176                 313.6
wb-edu               9845725   85470                                  1000                2825.6

Table 3: Results for the eigenvector-based algorithm to compute irreducible closed subsets.

The termination criterion we use is

\frac{\|r_i\|}{\|r_0\|} < 10^{-12},

which is more strict than for the computation of the PageRank, but needed in practice to determine whether a coefficient of the solution vector equals zero. The iterative method was stopped if the number of iterations exceeded 1000, which was the case for test problem wb-edu. As a result, an incorrect number of 85470 irreducible closed subsets was found, yielding 85469 computed eigenvectors for eigenvalue p = 0.85. After checking the Rayleigh quotients for these computed eigenvectors, it turned out that of these 85469 vectors, 41605 corresponded to actual eigenvectors for p. After this correction, the number of detected irreducible closed subsets becomes 41606.
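A minimal sketch of such a Rayleigh-quotient check; Afun is assumed to apply A (as in the sketch in Section 2), and tol is a hypothetical acceptance tolerance.

```matlab
% Accept a computed vector x as an eigenvector for the eigenvalue p if its
% Rayleigh quotient with respect to A is numerically equal to p.
function ok = is_second_eigenvector(Afun, x, p, tol)
    rq = (x' * Afun(x)) / (x' * x);   % Rayleigh quotient x'Ax / x'x
    ok = abs(rq - p) < tol;
end
```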

As is clear from the results in the above tables, the eigenvector-based algorithm gives a big computational advantage: the computing time is 10 to 20 times less for the larger test problems. For the eigenvector-based algorithm, the solution of the linear system (5.2) takes almost all of the computing time. This is similar to the computation of the PageRank, where the solution of Equation (2.3) takes all the computing time. However, since system (2.3) is much better conditioned than (5.2), the solution of Equation (5.2) is considerably more time consuming than that of Equation (2.3), and hence the computation of the PageRank is much faster than the detection of possible link spamming.

7 Conclusion

In this paper we have examined the second eigenvector of the Google matrix and its relation to link spamming. Creating an irreducible closed subset is an effective way of link spamming.


Irreducible closed subsets can be found with the second eigenvector of the Google matrix. The second eigenvectors of A are first eigenvectors of P^T. The elements of such an eigenvector have, with high probability, a nonzero value at the nodes that correspond to an irreducible closed subset, and a zero value at all other nodes.

The second eigenvectors of A can all be found by an algorithm aiming to find the strongly connected components of the matrix P^T, such as Tarjan's algorithm. Another method is to first find a second eigenvector of A. The entries with nonzero values in that eigenvector must correspond to nodes in an irreducible closed subset of the graph. To detect the different irreducible closed subsets one can apply Tarjan's algorithm, but only to the nodes that correspond to nonzero values in the second eigenvector.

There are several ways to reduce the effectiveness of the type of link spamming that we considered in this paper. One way is to reduce the chance of teleporting to a node in an irreducible closed subset. This can be done by using a non-homogeneous teleportation vector v, called the personalization vector. Using a personalization vector, the transition matrix becomes A = pP^T + (1 − p)ve^T. Although the original idea of the personalization vector [10] was to more accurately describe the surfing behavior of certain types of web surfers, this vector can also be used to combat link spamming, by giving small values to the entries of v that correspond to nodes that are suspected of being link spammed. Note that Lemma 4.1, which tells us that e^T x^{(2)} = 0, still holds after introducing a personalization vector. Therefore, our findings carry over to this case.
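A small sketch of this idea: the personalization vector v gives a reduced weight to pages suspected of link spamming. The index set `suspect`, the weight and the function name are hypothetical.

```matlab
% Sketch: build a function handle for A = p P' + (1-p) v e' with a
% personalization vector v that down-weights suspected pages.
function Afun = personalized_operator(G, c, p, suspect, smallweight)
    n = size(G, 1);
    d = zeros(n, 1); d(c ~= 0) = 1 ./ c(c ~= 0);
    D = spdiags(d, 0, n, n);
    dangling = (c == 0);
    v = ones(n, 1);
    v(suspect) = smallweight;               % small teleportation chance for suspect pages
    v = v / sum(v);                         % v must remain a stochastic vector
    Ptfun = @(x) G * (D * x) + sum(x(dangling)) / n;     % P' * x
    Afun  = @(x) p * Ptfun(x) + (1 - p) * v * sum(x);    % (p P' + (1-p) v e') * x
end
```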

We used the second eigenvector to detect the type of link spamming that is based on irreducible closed subsets. However, this is not the only link spamming technique, and other techniques will require different approaches to combat them. See for example [2, 3] for a discussion.

References

[1] http://www.mathworks.nl/help/bioinfo/ref/graphconncomp.html. The Mathworks, R2013b Documentation, 2013.

[2] Monica Bianchini, Marco Gori, and Franco Scarselli. Inside PageRank. ACM Trans. Internet Technol., 5(1):92–128, February 2005.

[3] Monica Bianchini, Marco Gori, and Franco Scarselli. PageRank: A Circuital Analysis. In Proceedings of the Eleventh International World Wide Web (WWW) Conference, 2002.

[4] Timothy A. Davis and Yifan Hu. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw., 38(1):1:1–1:25, December 2011.

[5] L. Eldén. A Note on the Eigenvalues of the Google Matrix. ArXiv Mathematics e-prints, January 2004.

[6] Taher Haveliwala and Sepandar Kamvar. The Second Eigenvalue of the Google Matrix. Technical Report 2003-20, Stanford InfoLab, 2003.

[7] Dean L. Isaacson and Richard W. Madsen. Markov Chains, Theory and Applications. Wiley, New York, 1976.


[8] Carl D. Meyer, editor. Matrix Analysis and Applied Linear Algebra. Chapters 7 and 8. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.

[9] Cleve Moler. Experiments with MATLAB. Chapter 7: Google PageRank, The MathWorks, 2011.

[10] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66, Stanford InfoLab, November 1999.

[11] Robert Endre Tarjan. Depth-First Search and Linear Graph Algorithms. SIAM J. Comput., 1(2):146–160, 1972.

[12] Martin B. van Gijzen and Peter Sonneveld. Algorithm 913: An elegant IDR(s) variant that efficiently exploits bi-orthogonality properties. ACM Transactions on Mathematical Software, 38(1):5:1–5:19, November 2011.
