In my
previous post, I started introducing Google PageRank. Now I'm starting to dig into PageRank maths.
The PageRank definition is not very "computation oriented":
$$r(P_i) = \sum_{P_j \in L_i} \frac{r(P_j)}{|P_j|} \qquad (2)$$
There is no clear temporal ordering of pages, nor any other sensible
starting point. Still, an iterative method may converge to the result.
So we subscript the rank with a number indicating the step and express
the k-th step in terms of the (k−1)-th step:
$$r_k(P_i) = \sum_{P_j \in L_i} \frac{r_{k-1}(P_j)}{|P_j|} \qquad (3)$$
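As a sketch of how this iteration can be run in practice, here is a small made-up graph (the link structure is purely illustrative) where each page spreads its current rank evenly over its outlinks at every step:

```python
import numpy as np

# Hypothetical 4-page graph: links[j] lists the pages that page j links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n = 4

ranks = np.full(n, 1.0 / n)  # r_0: start from a uniform rank vector
for _ in range(50):
    new_ranks = np.zeros(n)
    for j, outs in links.items():
        for i in outs:
            # Page j distributes its current rank evenly over its |P_j| outlinks.
            new_ranks[i] += ranks[j] / len(outs)
    ranks = new_ranks
```

Note that nothing links to page 3 here, so its rank drops to zero immediately, while the total rank mass stays 1 because every page has at least one outlink.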
Of course, we are not happy with the whole "may-converge" thing and
would rather have some precise results. The first step is adopting some
convenient notation: formula (2) is easy to grasp but awkward to
manipulate.
The idea here is to express the equations in matrix form. Let A = (e_ij)
be the adjacency matrix of our graph. In other words:

$$e_{ij} = \begin{cases} 1 & \text{if there is a link from } i \text{ to } j \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$
Then we consider the matrix $H = \bar{D}^{-1} A$, where $\bar{D} = \max(I, D)$
and D is the diagonal matrix of the out-degrees of the nodes.
In Python we can compute H with:
import numpy as np

def h_matrix(adjacency_matrix):
    # Out-degree of each node: the sum of the corresponding row.
    degrees = np.sum(adjacency_matrix, axis=1)
    # Dangling nodes have degree 0; clamp them to 1 to avoid dividing
    # by zero (this is the max(I, D) above).
    degrees[degrees == 0] = 1
    # H = D^-1 A: divide each row by its out-degree.
    return np.dot(np.diag(1.0 / degrees), adjacency_matrix)
# ...
The j-th row of H can be interpreted as the probability distribution
for jumping from node j to each other node. If our graph is the web,
H has billions of rows (and columns), but on average each page has about
10 outlinks: H is a very sparse matrix indeed.
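As a quick sanity check on that interpretation, here is `h_matrix` applied to a small made-up adjacency matrix: each row of H is a probability distribution (it sums to 1) whenever the node has at least one outlink.

```python
import numpy as np

def h_matrix(adjacency_matrix):
    degrees = np.sum(adjacency_matrix, axis=1)
    degrees[degrees == 0] = 1  # clamp dangling degrees to avoid division by zero
    return np.dot(np.diag(1.0 / degrees), adjacency_matrix)

# Hypothetical toy graph: node 0 links to 1 and 2, node 1 to 2, node 2 to 0.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
H = h_matrix(A)
print(H[0])  # [0.  0.5 0.5]: from node 0 we jump to 1 or 2 with probability 1/2
```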
Let π be the rank vector. Then we can reformulate the problem as:

$$\pi^T = \pi^T H$$

Basically, we are looking for the (left) eigenvector associated with the
eigenvalue λ1 = 1. The idea here is to use the power method.
The power method is an iterative method that computes the dominant
eigenpair (λ1, π) of a matrix A. A must satisfy some basic hypotheses for
the power method to work. Let {λ1, …, λn} be the set of eigenvalues of A;
then it must hold that |λ1| > |λ2| ≥ … ≥ |λn|. As a consequence,
λ1 ∈ ℝ.
Then it is possible to compute (λ1, π) with:

$$y_n = A x_{n-1}, \qquad x_n = \frac{y_n}{c_n}$$

Then xn converges to the eigenvector and cn → λ1.
Here cn is a scaling factor; e.g. λ1, ||A^n x0||, or
the component of yn of maximal magnitude.
Now, the fact that we suggested λ1 as a possible scaling factor is not a
mistake: it is proved that the actual Google matrix has λ1 = 1.
Moreover, the starting vector must satisfy x0 ∉ R(A − λ1·I), where
R(M) = {v | v = Mx}. In practice, however, "most" starting vectors x0
are good.
If H were directly usable as the Google matrix, things would be very
nice.
H is essentially the matrix of a random walk over the graph, so we can
view the whole process as a Markov chain. More on this can be found
in [3].
The analogy with Markov processes gives us some insight into the
problems with the H matrix. If we interpret nodes as states, we can see
that a closed class of states (possibly a singleton) would be a serious
issue, and any node with no outgoing links constitutes such a class.
In the Markov chain interpretation, once such a node is reached, we
would simply remain there: these nodes are rank sinks. In such a setting,
every website with no outgoing links would acquire a rank far superior to
its actual importance. In Markov chain terminology, the H matrix is
substochastic.
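To see the substochasticity concretely, here is a made-up 3-node graph where node 2 is a dangling node: its row of H sums to 0, and under iteration the total rank mass drains into the sink and leaks away.

```python
import numpy as np

# Hypothetical graph: 0 -> 1, 2; 1 -> 2; node 2 has no outgoing links.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
degrees = np.sum(A, axis=1)
degrees[degrees == 0] = 1          # clamp dangling degrees as in h_matrix
H = np.diag(1.0 / degrees) @ A
row_sums = H.sum(axis=1)           # [1. 1. 0.]: H is substochastic

pi = np.full(3, 1.0 / 3.0)         # start from uniform rank
for _ in range(10):
    pi = pi @ H                    # rank mass flows toward the sink and vanishes
print(pi.sum())                    # total rank mass is gone
```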
Another problem is related to cycles. E.g., consider the matrix

$$C = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$

Then C^{2n} = I and C^{2n+1} = C. As a consequence, we cannot
expect the iteration to converge (notice that here we have two
eigenvalues with the same magnitude, so we know that the power method
is not going to work).
Luckily, these problems are rather easy to fix by modifying the matrix.
More on this later!
References
- [1] M. Franceschet. PageRank: Standing on the shoulders of giants. arXiv.
- [2] M. E. J. Newman. The structure and function of complex networks. arXiv preprint cond-mat/0303516, 2003.
- [3] M. E. J. Newman. Networks: An Introduction. Oxford University Press, 2010.
- [4] L. Page and S. Brin. The PageRank citation ranking: Bringing order to the web. 1998.