|
The Matrix Template Library (MTL) has had two fundamental goals from the
time of its conception:
To be a comprehensive, elegant, well-engineered, high-quality
library of re-usable components for scientific computing
To offer vendor-tuned levels of performance
At first, this might seem an impossible task. After all,
traditionally, abstraction has been considered to be the enemy of
performance. The common wisdom in scientific computing has been that
high levels of computational performance could only be achieved by
using Fortran.
The MTL was able to definitively achieve its two fundamental goals
through the use of an important new programming paradigm:
generic programming. Generic
programming allows for a compact and concise implementation of MTL
as a component library for scientific computing (the first goal).
At the same time, performance optimizations are expressed
within MTL in a generic fashion, resulting in a
portable high-performance library completely written in a high-level
language (in this case, C++). No escapes to Fortran, assembly, or
external libraries are required.
Matrix Types
The use of generic programming enables a large variety of matrix types
to be supported (i.e., for each algorithm) in the MTL with many fewer
lines of code. The matrix types supported include:
Arbitrary types can be used for matrix elements (the requirements that must be satisfied by the type will depend on the algorithms in which they will be used). Common types that can be used include
- float
- double
- complex<float>
- complex<double>
Algorithms / Linear Algebra Operations
The algorithmic abstractions that were developed for MTL were
motivated to a large extent by the mathematical structures underlying
numerical linear algebra. That is, there are well-defined
properties that give rise to the mathematical structures known as
linear algebras (as well as Banach spaces, Hilbert spaces, etc.).
Remarkably, there are a very small number of algorithms required to
represent the mathematical structures typically associated with
numerical linear algebra:
| Mathematical Operation |
MTL Algorithm |
| alpha * x |
scale() |
| x + y |
add() |
| A * x |
mult() |
| | x | |
norm() |
| < x , y > |
dot() |
| A^T |
transpose() |
Of course each of these mathematical operations is represented by a
family of actual functions and algorithms. However, through the use of
generic programming the number of actual functions written can be
drastically reduced by taking advantage of high-level similarities in
structure. The complete list of algorithms can be found in the
programmer's guide.
High Performance
To achieve highly optimized levels of performance, two issues must be
addressed. The first is the more general problem of optimizing for the
performance enhancing features of the computer architecture. This must
be done no matter what kind of language/compiler is being use (or even
if it is being written in assembly). The second set of issues regards
the problem of writing abstract and generic code in a high level
language without incurring severe overheads. We briefly discuss our
solutions to both the architecture and language/compiler issues here.
- Static Polymorphism
- The template facilities in C++ allow functions to be selected at
compile-time based on data type, as compared to virtual functions
which allow for function selection at run-time (dynamic polymorphism).
Static polymorphism provides a mechanism for abstraction which
preserves high performance, since the template functions can be
inlined just as regular functions. This ensures that the numerous
small function calls in the MTL (such as iterator increment operators)
introduce no extra overhead. This is especially critical in the MTL
since there are ofter hundreds of function calls in the inner-loop
of an algorithm.
- Lightweight Object Optimization
- The generic programming style introduces a large number of small
objects into the code. This can incur a performance penalty because
the presence of a structure can interfere with compiler optimizations,
including the mapping of the individual data items to registers. This
problem is solved with lightweight object optimization, also know as
scalar replacement of
aggregates[24].
- Automatic Unrolling and Instruction Scheduling
- Modern compilers can do a great job of unrolling loops and scheduling
instructions, but typically only for specific (recognizable) cases.
There are many ways, especially in C and C++ to interfere with the
optimization process. The MTL containers and algorithms are designed
to result in code that is easy for the compiler to optimize.
Furthermore, the iterator abstraction makes inter-compiler
portability possible, since it encapsulates how looping is performed.
- Algorithmic Blocking
- To obtain high performance on a modern
microprocessor, an algorithm must properly exploit the associated
memory hierarchy and pipeline architecture (typically through careful
tiling and loop blocking). Ideally, one would like to express high
performance algorithms in a portable fashion, but there is not enough
expressiveness in languages such as C or Fortran to do so. Recent
efforts (PHiPAC[4], ATLAS[38]) have resorted to going outside the
language, i.e., to code generation systems, in order to gain this kind
of flexibility. We have shown that with the use of meta-programming
techniques in C++, it is possible to create flexible and elegant
high-performance kernels that automate the generation of blocked and
unrolled code. We have implemented such a collection of kernels, the
Basic Linear
Algebra Instruction Set (BLAIS).
|