Distributed Heterogeneous Graph Visualization
Problem
Heterogeneous graphs (graphs constructed from multiple data sets and
integration strategies) are common in data rich environments such as
the life sciences. Due to their size and complexity, it is difficult
to generate useful visualizations of these graphs.
The traditional node-link graph visualization scales to only a few
hunderd nodes on current displays. The layout algorithms are
also expensive and generally non-interactive.
Visual similarity matrices (VSMs) are capable of displaying large
graphs quickly, but are difficult to interpret and have not been
studied in the context of large, heterogeneous graphs. Our previous
work
(e.g.,
[1],
[2])
suggests that it is possible
to adapt and augment VSMs to provide multiple perspectives on complex
graphs to make them more interpretable. However, even with useful
visualizations, the size of the graphs makes it difficult to present
them to users in an interactive manner.
Study
In this thesis, we will address the problem of visualizing large,
heterogeneous graphs. We will explore visualization techniques to
present multiple perspectives of the
graph built around the visual similarity matrix technique for large
graph visualization along with the more traditional node-link graph
diagrams. Second, we will address the scalability issues
that arise when implementing the visual interfaces for large graphs.
For the purpose of this work, large graphs are considered to be graphs
with tens to hundreds of thousands of vertices and millions of
edges.
For visualization, we will focus on extending and augmenting visual
similarity matrices and scaling both VSMs and node-link plots to large
data sets. This builds on our previous work with VSMs and our study
of distributed force directed layout algorithms for node-link diagrams
[3].
Heterogeneous graphs add another dimension to the data represented in
the graph. Initial attempts to display these graphs using VSMs have
failed to produce interpretable images, even for 'expert' users.
However, the physical characteristics of the data, specifically the
fact that it is composed of multiple, independent graphs, suggest
possible methods of presenting the graphs to users.
For instance, multiple VSMs can be chained together using their axes
as natural data partitions. Using this technique, multiple graphs can
be displayed and physically linked together. Additionally, other
'marginal' plots can display other attributes of the data that can aid
in interpretation. These could include simple statisical plots that
are linked to the data or more complex node-link diagrams showing
various subgraphs. The key will be to have a well defined and
systematic method for extracting nodes and edges for marginal plots
along with a set of rules for interpretation. This will allow the
user to explore the graphs in a predictable and reproducible manner.
To support user directed navigation through collections of graph
representations, our graph vocabulary will support interaction and
basic graph tasks. These include node select, path identification,
algorithm execution (e.g., find the connected components), and
'details-on-demand' for specific data items.
Linked visual similarity matrices, node-link plots, property plots, and node details. |
The real world graphs that motivate this research are large graphs
built around drug discovery and genomic data collections.
While the backbone of the graphs (just the nodes and edges) can
currently fit into main memory, they are already past the scale were a
graphics pipeline can render them in real time. Even with graphs in
memory, the amount of screen real estate available for display places
limits on visualizations. Tiled display systems afford some
flexibilty in layout and may provide some novel opportunities for
managing the visual graph mining process. Additionally, the
peripheral data (which is most important to an end user) will not
always be immediately available in memory and will most likely reside
on remote servers. Thus, it will be important
to develop a general system architecture that can scale with the
data.
As with other projects where visualization is the end goal (our
work on large dot-plot visualization and chemical comparisons are
examples), it is entirely possible that this project will end up
focusing primarily on the algorithms and data structures necessary to
scale with the data and manage distributed system resources.
Research Plan
- Acquire large, heterogeneous graphs (CGB and Lilly)
- Adapt VSM and node-link diagrams to heterogeneous graphs
- Partitioning
- Coloring
- Linked plots
- 'Stacked' plots
- Overlays
- Augment main visualizations with 'marginal' visualizations
- Property/Statistical plots
- Node-link plots
- User Tasks (in context of adapted and augment plots)
- Node selection
- Path identification
- 'Cluster' selection
- 'Details-on-demand'
- Node statistics
- Cluster statistics
- System issues
- Data structures for interactive visualization
- Data structures for interactive graph queries
- Data structures/vis tricks for compositing graph images (e.g.,overlays, details)
- Distributed data flow
- Large display surfaces (e.g. tiled display walls, multi-head systems)
Chem/Bio Applications
- Interaction and similarity graph mining
- Pathway EDA
- Drug interaction EDA
- Our examples of these graphs all come from the life sciences.
Related Work
- Visual similarity matrix research
- Component-based visualization (e.g., Chris North's 'snap together' vis)
- Scatterplot matrices/Trellis visualization
- Pajek, Perfuse
- Cytoscape
- GAP - Generalized Association Plots (Chen 2002)
Risks
Without user studies, it is difficult to measure usefulness of
adapted and augmented plots. Are interpretation guides and usage
'algebras' enough?
Potentially too broad: any one topic in the research plan list could
turn into a dissertation - esp. when combined with system integration
and performance issues.
May overlap with current work in
Jean-Daniel Fekete's
lab.

