|
Performance and scalability are crucial for the Parallel BGL. We
aim to be competitive with existing parallel graph libraries in both
areas.
Experimental Setup
All performance evaluations were performed on the Indiana
University Computer Science
Department's ``Odin'' research cluster. Odin consists of 128
compute nodes connected via Infiniband. Each node contains 4GB of
dual-channel PC3200 (400 MHz) DDR-DRAM with two 2.0GHz AMD Opteron 270
processors (1MB Cache) running Red Hat Enterprise Linux WS Release 4.
Each node contains a Mellanox InfiniHost HCA on a 133MHz PCI-X bus
running OFED 1.1, connected to a 144 port Voltaire SDR switch. For
our tests we left one processor idle on each node.
We typically do performance testing on random graphs generated from
three different models. When evaluating how the Parallel BGL might
perform for a particular problem, consider how the structure of your
graph relates to the structure of these graph models:
- Small-world graphs start as ring graphs, where each
vertex is connected to its k closest vertices in the
ring. Each edge is then randomly rewired with probability p,
which we fix at 0.04 for most experiments. Small-world graphs
distribute very well and therefore typically produce the best
performance and scalability numbers.
- Erdos-Renyi graphs have a uniform probability p
of having an edge between any two vertices in the graph. The graphs
exhibit very little structure and typically do not distribute
well. These graphs are rarely useful for modeling real-world data,
but the model is simple enough that it permits theoretical analyses
of algorithm complexity that are not feasible for other, more
realistic, models.
- Scale-free graphs have as a defining characteristic a
"power law" distribution of vertex degrees. Thus, there are a large
number of vertices with a vertex small degree but only a tiny
fraction of the vertices will have a large degree. These graphs do
not distribute well, because most algorithms operate by considering
the edges of each node processed: when one processor has one of
high-degree nodes, it will do much more work than other
processors. However, this model is useful because many kinds of
naturally-occuring networks (such as the Web graph) are scale-free
graphs.
Note: The performance numbers reported here are for the
older versions of the Parallel BGL, up to date performance data
reflecting the significant advances in communications infrastructure
will be added soon. In the future, we hope to provide more complete
information about particular versions of the Parallel BGL, varying
different parameters of the experimental setup, and additional
comparative results.
Performance results For each kind of graph, we present the
wall clock times for several of the Parallel BGL algorithms. We make
no attempt to explain performance results in this section; refer to
the algorithm documentation for performance caveats.
The following chart illustrates the scalability and performance of
the two variants of Dijkstra's Single Source Shortest Path algorithm
available in the Parallel BGL. This example uses an Erdos-Renyi graph
with 2.5M vertices and 12.5M (directed) edges per processor for a
maximum size of 240M vertices and 1.2B edges on 96 processors. This
example demonstrates that the ability of the Parallel BGL to solve
very large problems is limited only by available memory.
The following chart illustrates the largest Single Source Shortest
Path problem solved with the Parallel BGL to date. This example uses
the Delta-Stepping algorithm on an Erdos-Renyi graph with average
degree 4. The largest problem solved is 1B vertices and 4B edges
using 96 processors.
The following chart illustrates the scalability and performance of
an older version of the Parallel BGL on the Odin cluster using GCC
3.3.1. For this example, we use graphs that can easily be processed
on a single node, with 1M vertices and 15M (undirected) edges. This
chart shows the scalability of several Parallel BGL algorithms for
fixed-size small-world graphs under somewhat "ideal" conditions,
because the small-world graphs have been partitioned well across the
compute nodes.
The following chart illustrates the scalability and performance of
several Parallel BGL algorithms on fixed-sized Erdos-Renyi graphs
with 1M vertices and 15M edges. Since there is no good partitioning
for Erdos-Renyi graphs, this chart illustrates how the Parallel BGL
performs under very poor conditions. Compare this chart with the
previous chart (small-world graphs of the same size) to see the
effects of poor partitioning on graph algorithm performance.
The following chart illustrates the scalablity of the Parallel BGL
when the size of the problem is scaled with the number of
processors. In this example, we run several of the Parallel BGL
algorithms on small-world graphs with ~547k vertices per compute
node and 30 edges incident on each vertex (up to 70M vertices/1B
edges on 128 compute nodes). An algorithm that scales linearly
would produce a horizontal line in this chart.
CGMgraph (part of CGMlib) is another
freely-available parallel graph library. It is based on the Course
Grained Multicomputer (CGM) model of parallel computation, which is
similar to the Bulk Sychronous Parallel (BSP) model used primarily
in the older version of the Parallel BGL benchmarked below. The
following chart illustrates the results of comparing the
implementation of Connected Components in the Parallel BGL against
the CGMgraph. The tests were run using version 0.5.0 of the
Parallel BGL and version 0.9.5 beta of CGMlib, both
compiled with GCC 3.3.1 and optimization flags -O3. The
tests were run on Odin using LAM/MPI 7.1.1. We used relatively
small Erdos-Renyi graphs (96k vertices, 10M edges) for this test to
match the results provided in the following paper.
F. Dehne, A. Ferreira, E. Caceres, S. W. Song, A. Roncato, "Efficient
parallel graph algorithms for coarse grained multicomputers and BSP",
in Algorithmica Vol. 33, No. 2, 2002, pp. 183-200.
|