Wednesday, December 15, 2010

Introducing Page Rank Maths

In my previous post, I started introducing Google PageRank. Now I'm starting to dig into the PageRank maths. The PageRank definition is not very "computation oriented":

r(Pi) = Σ_{Pj ∈ Li} r(Pj) / |Pj|    (2)

where Li is the set of pages linking to Pi and |Pj| is the number of outgoing links of Pj.
There is no clear temporal ordering of pages or any other sensible starting point. Hopefully, an iterative method converges to the result. So we subscript the rank with a number indicating the step and express the k-th step in terms of the (k−1)-th:

rk(Pi) = Σ_{Pj ∈ Li} rk−1(Pj) / |Pj|    (3)
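As a sketch of this iteration on a toy graph (the graph, variable names, and loop below are illustrative, not the web-scale algorithm):

```python
import numpy as np

# Toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
adjacency = np.array([[0., 1., 1.],
                      [0., 0., 1.],
                      [1., 0., 0.]])
out_degrees = adjacency.sum(axis=1)

n = adjacency.shape[0]
ranks = np.full(n, 1.0 / n)  # r_0: start from the uniform distribution

for _ in range(50):
    new_ranks = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if adjacency[j, i]:  # Pj links to Pi
                new_ranks[i] += ranks[j] / out_degrees[j]  # formula (3)
    ranks = new_ranks

print(ranks)
```

On this small graph the iteration settles quickly; on a graph with sinks or cycles it need not converge at all, which is exactly the problem discussed below.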
Of course, we are not happy with the whole "may-converge" thing and would rather have some precise results. The first step is adopting some convenient notation: formula (2) is easy to grasp but awkward to manipulate.
The idea here is to express the equations in matrix form. Let A = (eij) be the adjacency matrix of our graph. In other words:

eij = 1 if there is a link from i to j, 0 otherwise    (4)
Then we consider the matrix H = D̄⁻¹·A, where D̄ = max(I, D) (elementwise) and D is the diagonal matrix of the out-degrees of the nodes; taking the maximum with I avoids dividing by zero for nodes with no outgoing links. In Python we can compute H with:
import numpy as np

def h_matrix(adjacency_matrix):
    # Out-degree of each node: row sums of the adjacency matrix.
    degrees = np.sum(adjacency_matrix, axis=1)
    # max(I, D): replace zero out-degrees with 1 to avoid division by zero
    # (the corresponding rows of H remain all-zero anyway).
    degrees[degrees == 0] = 1
    # Divide each row by the node's out-degree.
    return np.dot(np.diag(1. / degrees), adjacency_matrix)


# ...
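For example, calling it on a toy adjacency matrix (the definition is repeated so the snippet stands alone; the graph is illustrative):

```python
import numpy as np

# h_matrix as defined above.
def h_matrix(adjacency_matrix):
    degrees = np.sum(adjacency_matrix, axis=1)
    degrees[degrees == 0] = 1
    return np.dot(np.diag(1. / degrees), adjacency_matrix)

# Toy graph: 0 -> 1, 0 -> 2, 1 -> 2; node 2 has no outgoing links.
A = np.array([[0., 1., 1.],
              [0., 0., 1.],
              [0., 0., 0.]])
H = h_matrix(A)
print(H)
```

Note that the row of the dangling node 2 stays all-zero, a detail that will matter shortly.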

The j-th row of H can be interpreted as the probability distribution of jumping from node j to every other node. If our graph is the web, H has billions of rows (and columns), but on average each page has about 10 outgoing links: H is, indeed, a very sparse matrix.
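Since memory is the real constraint at web scale, one would store H in a sparse format. A minimal sketch with SciPy (assuming `scipy` is available; the tiny matrix is illustrative):

```python
import numpy as np
from scipy import sparse

A = np.array([[0., 1., 1.],
              [0., 0., 1.],
              [1., 0., 0.]])
degrees = A.sum(axis=1)
degrees[degrees == 0] = 1

# CSR keeps only the non-zero entries, so memory grows with the
# number of links rather than with n^2.
H = sparse.csr_matrix(A / degrees[:, None])
print(H.nnz)  # stored entries = number of links
```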
Let π be the rank-vector. Then we can reformulate the problem as:
πT = πT·H    (5)
Basically, we are looking for the eigenvector associated with the eigenvalue λ1 = 1. The idea here is to use the power method, an iterative method that computes the dominant eigenpair (λ1, π) of a matrix A. A must satisfy some basic hypotheses for the power method to work. Let {λ1, …, λn} be the set of eigenvalues of A; then it must be true that |λ1| > |λ2| ≥ … ≥ |λn|. As a consequence, λ1 ∈ ℝ.
Then it is possible to compute (λ1, π) with:

yn = A·xn
xn+1 = yn / cn    (6)

Then xn → π and cn → λ1.
Here cn is a scaling factor: e.g., λ1, ||A^n·x0||, or the component of yn of maximal magnitude. The fact that we suggested λ1 as a possible scaling factor is not a mistake: it is proved that the actual Google matrix has λ1 = 1.
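A sketch of the power method, using the maximal-magnitude component of yn as the scaling factor cn (the matrix below is an illustrative example, not a Google matrix):

```python
import numpy as np

def power_method(A, x0, steps=100):
    # Iterate y_n = A x_n, x_{n+1} = y_n / c_n, as in formula (6).
    x = x0.astype(float)
    c = 1.0
    for _ in range(steps):
        y = A @ x
        c = y[np.argmax(np.abs(y))]  # scaling factor c_n
        x = y / c
    return c, x

A = np.array([[2., 1.],
              [1., 2.]])  # eigenvalues 3 and 1
eigenvalue, eigenvector = power_method(A, np.array([1., 0.]))
print(eigenvalue, eigenvector)
```

For formula (5) one would run the same iteration on the transpose of H, since π there is a left eigenvector.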
Moreover, x0 must not lie in R(A − λ1I), where R(M) = {v | v = M·x}. However, in practice "most" starting vectors x0 are good.
If H were a usable Google Matrix, things would be very nice. H is essentially the matrix of a random walk over the graph. We could just see the whole thing as a Markov process. More on this can be found in [3].
The analogy with Markov processes can give us some insight into the problems with the H matrix. If we interpret nodes as states, we can see that a closed class of states (possibly a singleton) would be a serious issue. And any node with no outgoing links constitutes such a class.
In the Markov chain interpretation, if such a node is reachable, then we would just remain there forever. Such nodes are rank sinks: every web page with no outgoing links would acquire a rank far greater than its actual importance. In Markov chain terminology, the H matrix is substochastic.
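The substochastic behavior is easy to see on a toy H with a dangling node (the matrix is illustrative): the total rank leaks away at every iteration instead of staying a probability distribution.

```python
import numpy as np

# Node 2 has no outgoing links, so the third row of H is all zeros
# and the row sums are no longer all equal to 1.
H = np.array([[0., 0.5, 0.5],
              [0., 0.,  1. ],
              [0., 0.,  0. ]])

ranks = np.full(3, 1.0 / 3)
for _ in range(3):
    ranks = ranks @ H
    print(ranks.sum())  # total rank shrinks: H is substochastic
```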
Another problem is related to cycles. E.g., consider the matrix
C = | 0  1 |
    | 1  0 |
Then: C^(2n) = I and C^(2n+1) = C. As a consequence, we cannot expect convergence (notice that here we have two eigenvalues, 1 and −1, with the same magnitude, so we already know the power method is not going to work).
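The oscillation can be checked directly (a quick sketch):

```python
import numpy as np

C = np.array([[0., 1.],
              [1., 0.]])

assert np.array_equal(C @ C, np.eye(2))  # C^2 = I

ranks = np.array([1., 0.])
for step in range(4):
    ranks = ranks @ C
    print(ranks)  # flips between [0, 1] and [1, 0]: no convergence
```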
Luckily enough, these problems are rather easy to fix by modifying the matrix. More on this later!

References

[1]
M. Franceschet. PageRank: Standing on the shoulders of giants. arxiv.org.
[2]
M.E.J. Newman. The structure and function of complex networks. Arxiv preprint cond-mat/0303516, 2003.
[3]
M.E.J. Newman. Networks: An Introduction. Oxford University Press, 2010.
[4]
L. Page and S. Brin. The PageRank citation ranking: Bringing order to the web. 1998.


