大规模图处理

各种网络及其度

无/有向网络中节点的度

无向网络

Degree (or degree centrality) of a vertex: d(vi)d(v_i)

d(vi)=vj s.t. eijEeij=ejid(v_i) = |v_j| \ s.t. \ e_{ij} \in E \wedge e_{ij} = e_{ji}#\# of edges connected to it

eg. 下图中 d(A)=4, d(H)=2d(A) = 4,\ d(H) = 2

有向网络

In-degree of a vertex din(vi)d_{in}(v_i)din(vi)=vj s.t. eijEd_{in}(v_i) = |v_j| \ s.t. \ e_{ij} \in E#\# of edges pointing to viv_i

Out-degree of a vertex dout(vi)d_{out}(v_i)dout(vi)=vj s.t. ejiEd_{out}(v_i) = |v_j| \ s.t. \ e_{ji} \in E#\# of edges from viv_i

eg. 下图中 din(A)=3, dout(A)=1; din(B)=2, dout(B)=2d_{in}(A) = 3, \ d_{out}(A) = 1;\ d_{in}(B) = 2, \ d_{out}(B) = 2

各网络及其度的定义

Simple network: If a network has neither self-edges nor multi-edges

Multi-edge (multigraph): If more than one edge between the same pair of vertices

Self-loop: If an edge connects vertex to itself

Direct graph (digraph): If each edge has a direction

Weighted graph: If a weight wijw_{ij} (a real number ) is associated with each edge vijv_{ij}

网络图结构及性质

Subgraph: A subset of the nodes and edges in a graph/network

Clique (complete graph): Every node is connected to every other

Singleton vs. dyad (two nodes and their relationship) vs. triad:

Ego-centric network: A net work pull out by selecting a node and all of its connections

度的分布

Degree sequence

The list of degrees of the nodes sorted in non-increasing order. Eg: in G1G_1 ,degree sequence:(4, 3, 2, 2, 1)

Degree frequency distribution

Let NKN_K denote the #\# of vertices with degree k, (N0,N1,,Nt), t(N_0,N_1,\dots,N_t), \ t is max degree for a node

Eg: in G1G_1 ,degree frequency distribution: (0, 1, 2, 1, 1)

Degree distribution

Probability mass function ff for random variable XX:

(f(0),f(1),,f(t)), where f(k)=P(X=k)=Nk/n(f(0), f(1),\dots,f(t)),\ where\ f(k) = P(X=k) = N_k/n

Eg: in ​ G1G_1 ,degree distribution: (0, 0.2, 0.4, 0.2, 0.2)

路径

Walk

In a graph GG between nodes XX and YY : ordered sequence of vertices, starting at XX and ending at YY, s.t. there is an edge between every pair of consecutive vertices

Hop

The length of the walk

Path

A walk with distinct vertices

Shortest Path (geodesic path, dd )

Geodesic paths are not necessarily unique: It is quite possible to have more than one path of equal length between a given pair of vertices

Diameter of a graph: the length of the longest geodesic path between any pair of vertices in the network for which a path actually exists

Average Path Length( <d><d> )

Average of the shortest paths between all pairs of nodes: <d>=1N(N1)i,j=1,N(ij)di,j<d> = \frac{1}{N(N-1)} \sum \limits_{i,j=1,N(i\neq j)}d_{i,j}

Eulerian Path

A path that traverses each edge in a network exactly once

Hamilton Path

A path that visits each vertex in a network exactly once

Distance

The length of the shortest path

Total #\# of path length 2 from jj to ii ,via any vertex in Nij(2)N_{ij}^{(2)} is:Nij(2)=k=1nAikAkj=[A2]ijN_{ij}^{(2)} = \sum_{k=1}^n A_{ik}A_{kj} = [A^2]_{ij}

So, generalizing to path of arbitrary length, we have: Nij(r)=[Ar]ijN_{ij}^(r) = [A^r]_{ij}

When starting and ending at the same vertex ii , we have: Lr=i=1n[Ar]ii=TrArL_r = \sum_{i=1}^n[A^r]_{ii} = Tr A^r

#\# of loops can be expressed in terms of adjacency matrix:

Matrix AA written in the form of A=UKUTA = UKU^T , UU is the orthogonal matrix of eigenvectors and KK is the diagonal matrix of eigenvalues (SVD):

Lr=Tr(UKUT)r=Tr(UKrUT)=Tr(UUTKr)=Tr(Kr)=ikirL_r = Tr(UKU^T)^r = Tr(UK^rU^T)=Tr(UU^TK^r)=Tr(K^r)=\sum_i k_i^r, where kik_i is the ii-th eigenvalue of the adjacency matrix

Independent Paths

Edge-independent:

Two path connecting a pair of vertices are edge-independent if they share no edges

Vertex-independent:

Two path are vertex-independent if they share no vertices other than the starting and ending vertices

半径与直径

Eccentricity

The eccentricity of a node viv_i is the maximum distance from viv_i to any other nodes in graph

e(vi)=maxj{d(vi,vj)}e(v_i) = max_j\{d(v_i,v_j)\} Eg. e(A)=1,e(F)=e(B)=e(D)=e(H)=2e(A) = 1, e(F) = e(B) = e(D) = e(H) =2

Radius

Radius of a connected graph GG : the min eccentricity of any node in GG

r(G)=mini{e(vi)}=mini{d(vi,vj)}r(G) = min_i\{e(v_i)\} = min_i\{d(v_i,v_j)\} Eg. r(G1)=1r(G_1) = 1

Diameter

Diameter of a connected graph GG : the max eccentricity of any node in GG

d(G)=maxi{e(vi)}=maxi,j{d(vi,vj)}d(G) = max_i\{e(v_i)\} = max_{i,j}\{d(v_i,v_j)\} Eg. d(G1)=2d(G1) = 2

弱/强连接&出/入元素(有向图)

Weakly/Strongly connected

Weakly: If the vertices are connected by 1 or more paths when one can go either way along any edge

In/Out-component

Out-component: Those reachable from vertex A

簇类系数

因为真实的网络可能十分稀疏,所以需要这么一个衡量系数去衡量一下密集度

Clustering coefficient of a node viv_i: A measure of the density of edges in the neighborhood of viv_i

Let Gi=(Vi,Ei)G_i = (V_i,E_i) be the subgraph induced by the neighbors of vertex viv_i , Vi=ni|V_i| = n_i( #\# of neighbors of viv_i ), and Ei=mi|E_i| = m_i ( #\# of edges among the neighbors of viv_i )

For undirected network: C(vi)=#edges in Gimax #edges in Gi=mi(ni2)=2×mini(ni1)C(v_i) = \frac{\# edges\ in\ G_i}{max\ \#edges\ in\ G_i} = \frac{m_i}{\binom{n_i}{2}} = \frac{2\times m_i}{n_i(n_i-1)}

For directed network: C(vi)=#edges in Gimax #edges in Gi=mini(ni1)C(v_i) = \frac{\# edges\ in\ G_i}{max\ \#edges\ in\ G_i} = \frac{m_i}{n_i(n_i-1)}

For a graph: C(G)=1niC(vi)C(G) = \frac{1}{n}\sum \limits_i C(v_i) averaging the local clustering coefficient of all the vertices

双边网络

Bipartite Network: two kinds of vertices, and edges linking only vertices of unlike types

Incidence matrix: Bij=1B_{ij} = 1 if vertex jj links to group ii, 00 otherwise

The projection to one-mode can be written in terms of the incidence matrix BB as follow:

Pij=k=1gBkiBkj=k=1gBikTBkjP_{ij}=\sum_{k=1}^g B_{ki}B_{kj} = \sum_{k=1}^g B_{ik}^TB_{kj}

The product of BkiBkjB_{ki}B_{kj} will be 11 if ii and jj both belong to the sample group kk in the bi-partite network

协同引证与文献耦合

Co-citation

Co-citation of vertices ii and jj : AikAjk=1A_{ik}A_{jk}=1 , if ii and jj are both cited by kk

#\# of vertices having outgoing edges pointing to both ii and jj :

Cij=k=1nAikAjk=k=1nAikAkjTC_{ij} = \sum_{k=1}^n A_{ik}A_{jk} = \sum_{k=1}^n A_{ik}A_{kj}^T

Co-citation matrix: its a symmetric matrix C=AATC = AA^T

Diagonal matrix (Cii)(C_{ii}): total #\# papers citing ii

Bibliographic coupling

Bibliographic coupling of vertices ii and jj : AikAjk=1A_{ik}A_{jk}=1 , if ii and jj both cite kk

Bibliographic coupling of ii and jj :

Bij=k=1nAkiAkj=k=1nAikTAkjB_{ij} = \sum_{k=1}^n A_{ki}A_{kj} = \sum_{k=1}^nA_{ik}^T A_{kj}

Bibliographic coupling matrix: B=ATAB=A^TA

Diagonal matrix (Bii)(B_{ii}): total #\# papers cited by ii

Co-citation&Bibliographic Coupling Comparison

Strong co-citation must have a lot of incoming edges

Strong bib-coupling if two papers have similar citations

图计算任务和算法

图计算的任务

1、Graph pattern matching: subgraph matching...

2、 Graph analytics: PageRank ...

图计算的算法

PageRank, Simrank, HITS, Label Propagation, Maximal-Independent-Set, Maximal-Node-Matching, Connected-Component, Shortest Distance ...

程序模型框架

不支持:N 支持:Y

Model

Data/Graph Integration

Both Graph Matching/Analytic

Explicit Loop Control

Systems

Vertex-Centric

N

N

N

Giraph, Pregel, WindCatch, GPS, Grace, Pregel+, TUX

Edge-Centric

N

N

N

PowerGraph, GraphChi, X-Stream, TurboGraph, FlashGraph, Powerlyra

Block-Centric

N

N

N

Giraph++, Blogel

Linear Algebra

N

N

N

PEGASUS, GraphTwist, LA3

Grape

N

Y

N

Grape

DSL

N

N

Y

Green-Marl

Datalog

N

Y

Y

Socialite, Distribute-Socialite, DeALS

Dataflow

Y

Y

Y

GraphX, GraphFrame, Pregelix

SQL-G

Y

Y

Y

SQL-G, EmptyHeaded

Last updated