Chris Mueller


Open Systems Lab / Computer Science Department
Indiana University
215 Lindley Hall
Bloomington, IN 47405
chemuell@cs.indiana.edu

Distributed Heterogeneous Graph Visualization


Problem

Heterogeneous graphs (graphs constructed from multiple data sets and integration strategies) are common in data rich environments such as the life sciences. Due to their size and complexity, it is difficult to generate useful visualizations of these graphs. The traditional node-link graph visualization scales to only a few hunderd nodes on current displays. The layout algorithms are also expensive and generally non-interactive. Visual similarity matrices (VSMs) are capable of displaying large graphs quickly, but are difficult to interpret and have not been studied in the context of large, heterogeneous graphs. Our previous work (e.g., [1], [2]) suggests that it is possible to adapt and augment VSMs to provide multiple perspectives on complex graphs to make them more interpretable. However, even with useful visualizations, the size of the graphs makes it difficult to present them to users in an interactive manner.

Study

In this thesis, we will address the problem of visualizing large, heterogeneous graphs. We will explore visualization techniques to present multiple perspectives of the graph built around the visual similarity matrix technique for large graph visualization along with the more traditional node-link graph diagrams. Second, we will address the scalability issues that arise when implementing the visual interfaces for large graphs. For the purpose of this work, large graphs are considered to be graphs with tens to hundreds of thousands of vertices and millions of edges.
For visualization, we will focus on extending and augmenting visual similarity matrices and scaling both VSMs and node-link plots to large data sets. This builds on our previous work with VSMs and our study of distributed force directed layout algorithms for node-link diagrams [3]. Heterogeneous graphs add another dimension to the data represented in the graph. Initial attempts to display these graphs using VSMs have failed to produce interpretable images, even for 'expert' users. However, the physical characteristics of the data, specifically the fact that it is composed of multiple, independent graphs, suggest possible methods of presenting the graphs to users.
For instance, multiple VSMs can be chained together using their axes as natural data partitions. Using this technique, multiple graphs can be displayed and physically linked together. Additionally, other 'marginal' plots can display other attributes of the data that can aid in interpretation. These could include simple statisical plots that are linked to the data or more complex node-link diagrams showing various subgraphs. The key will be to have a well defined and systematic method for extracting nodes and edges for marginal plots along with a set of rules for interpretation. This will allow the user to explore the graphs in a predictable and reproducible manner.
To support user directed navigation through collections of graph representations, our graph vocabulary will support interaction and basic graph tasks. These include node select, path identification, algorithm execution (e.g., find the connected components), and 'details-on-demand' for specific data items.

Linked visual similarity matrices, node-link plots, property plots, and node details.
The real world graphs that motivate this research are large graphs built around drug discovery and genomic data collections. While the backbone of the graphs (just the nodes and edges) can currently fit into main memory, they are already past the scale were a graphics pipeline can render them in real time. Even with graphs in memory, the amount of screen real estate available for display places limits on visualizations. Tiled display systems afford some flexibilty in layout and may provide some novel opportunities for managing the visual graph mining process. Additionally, the peripheral data (which is most important to an end user) will not always be immediately available in memory and will most likely reside on remote servers. Thus, it will be important to develop a general system architecture that can scale with the data. As with other projects where visualization is the end goal (our work on large dot-plot visualization and chemical comparisons are examples), it is entirely possible that this project will end up focusing primarily on the algorithms and data structures necessary to scale with the data and manage distributed system resources.

Research Plan

  • Acquire large, heterogeneous graphs (CGB and Lilly)
  • Adapt VSM and node-link diagrams to heterogeneous graphs
    • Partitioning
    • Coloring
    • Linked plots
    • 'Stacked' plots
    • Overlays
  • Augment main visualizations with 'marginal' visualizations
    • Property/Statistical plots
    • Node-link plots
  • User Tasks (in context of adapted and augment plots)
    • Node selection
    • Path identification
    • 'Cluster' selection
    • 'Details-on-demand'
      • Node statistics
      • Cluster statistics
  • System issues
    • Data structures for interactive visualization
    • Data structures for interactive graph queries
    • Data structures/vis tricks for compositing graph images (e.g.,overlays, details)
    • Distributed data flow
    • Large display surfaces (e.g. tiled display walls, multi-head systems)

Chem/Bio Applications

  • Interaction and similarity graph mining
  • Pathway EDA
  • Drug interaction EDA
  • Our examples of these graphs all come from the life sciences.

Related Work

  • Visual similarity matrix research
  • Component-based visualization (e.g., Chris North's 'snap together' vis)
  • Scatterplot matrices/Trellis visualization
  • Pajek, Perfuse
  • Cytoscape
  • GAP - Generalized Association Plots (Chen 2002)

Risks

Without user studies, it is difficult to measure usefulness of adapted and augmented plots. Are interpretation guides and usage 'algebras' enough?
Potentially too broad: any one topic in the research plan list could turn into a dissertation - esp. when combined with system integration and performance issues.
May overlap with current work in Jean-Daniel Fekete's lab.