
Parallel BGL Performance


Performance and scalability are crucial for the Parallel BGL. We aim to be competitive with existing parallel graph libraries in both areas.

Experimental Setup

All performance evaluations were performed on the Indiana University Computer Science Department's "Odin" research cluster. Odin consists of 128 compute nodes connected via InfiniBand. Each node contains 4GB of dual-channel PC3200 (400 MHz) DDR-DRAM with two 2.0GHz AMD Opteron 270 processors (1MB cache) running Red Hat Enterprise Linux WS Release 4. Each node contains a Mellanox InfiniHost HCA on a 133MHz PCI-X bus running OFED 1.1, connected to a 144-port Voltaire SDR switch. For our tests we left one processor idle on each node.

We typically do performance testing on random graphs generated from three different models. When evaluating how the Parallel BGL might perform for a particular problem, consider how the structure of your graph relates to the structure of these graph models:

  • Small-world graphs start as ring graphs, where each vertex is connected to its k closest vertices in the ring. Each edge is then randomly rewired with probability p, which we fix at 0.04 for most experiments. Small-world graphs distribute very well and therefore typically produce the best performance and scalability numbers.
  • Erdos-Renyi graphs have a uniform probability p of having an edge between any two vertices in the graph. The graphs exhibit very little structure and typically do not distribute well. These graphs are rarely useful for modeling real-world data, but the model is simple enough that it permits theoretical analyses of algorithm complexity that are not feasible for other, more realistic, models.
  • Scale-free graphs have as their defining characteristic a "power law" distribution of vertex degrees: a large number of vertices have small degree, while only a tiny fraction of the vertices have large degree. These graphs do not distribute well, because most algorithms operate by considering the edges of each vertex processed: the processor that owns one of the high-degree vertices will do much more work than the other processors. However, this model is useful because many kinds of naturally-occurring networks (such as the Web graph) are scale-free graphs.

Note: The performance numbers reported here are for older versions of the Parallel BGL; up-to-date performance data reflecting the significant advances in the communications infrastructure will be added soon. In the future, we hope to provide more complete information about particular versions of the Parallel BGL, results for varying parameters of the experimental setup, and additional comparative results.

Performance Results

For each kind of graph, we present the wall clock times for several of the Parallel BGL algorithms. We make no attempt to explain performance results in this section; refer to the algorithm documentation for performance caveats.

The following chart illustrates the scalability and performance of the two variants of Dijkstra's Single Source Shortest Path algorithm available in the Parallel BGL. This example uses an Erdos-Renyi graph with 2.5M vertices and 12.5M (directed) edges per processor for a maximum size of 240M vertices and 1.2B edges on 96 processors. This example demonstrates that the ability of the Parallel BGL to solve very large problems is limited only by available memory.

The following chart illustrates the largest Single Source Shortest Path problem solved with the Parallel BGL to date. This example uses the Delta-Stepping algorithm on an Erdos-Renyi graph with average degree 4. The largest problem solved is 1B vertices and 4B edges using 96 processors.

The following chart illustrates the scalability and performance of an older version of the Parallel BGL on the Odin cluster using GCC 3.3.1. For this example, we use graphs that can easily be processed on a single node, with 1M vertices and 15M (undirected) edges. This chart shows the scalability of several Parallel BGL algorithms for fixed-size small-world graphs under somewhat "ideal" conditions, because the small-world graphs have been partitioned well across the compute nodes.

The following chart illustrates the scalability and performance of several Parallel BGL algorithms on fixed-sized Erdos-Renyi graphs with 1M vertices and 15M edges. Since there is no good partitioning for Erdos-Renyi graphs, this chart illustrates how the Parallel BGL performs under very poor conditions. Compare this chart with the previous chart (small-world graphs of the same size) to see the effects of poor partitioning on graph algorithm performance.

The following chart illustrates the scalability of the Parallel BGL when the size of the problem is scaled with the number of processors. In this example, we run several of the Parallel BGL algorithms on small-world graphs with ~547k vertices per compute node and 30 edges incident on each vertex (up to 70M vertices/1B edges on 128 compute nodes). An algorithm that scales linearly would produce a horizontal line in this chart.

CGMgraph (part of CGMlib) is another freely-available parallel graph library. It is based on the Coarse Grained Multicomputer (CGM) model of parallel computation, which is similar to the Bulk Synchronous Parallel (BSP) model used primarily in the older version of the Parallel BGL benchmarked below. The following chart compares the Connected Components implementation in the Parallel BGL against its counterpart in CGMgraph. The tests used version 0.5.0 of the Parallel BGL and version 0.9.5 beta of CGMlib, both compiled with GCC 3.3.1 and the -O3 optimization flag, and were run on Odin using LAM/MPI 7.1.1. We used relatively small Erdos-Renyi graphs (96k vertices, 10M edges) for this test to match the results reported in the following paper.

F. Dehne, A. Ferreira, E. Caceres, S. W. Song, A. Roncato, "Efficient parallel graph algorithms for coarse grained multicomputers and BSP," Algorithmica, Vol. 33, No. 2, 2002, pp. 183-200.

Comparison between the Parallel BGL and CGMgraph Connected Components algorithms.