Introduction to The Matrix Template Library

  |   Home   |   Documentation   |  

The Matrix Template Library (MTL) has had two fundamental goals from the time of its conception:

  • To be a comprehensive, elegant, well-engineered, high-quality library of re-usable components for scientific computing
  • To offer vendor-tuned levels of performance

    At first, this might seem an impossible task. After all, traditionally, abstraction has been considered to be the enemy of performance. The common wisdom in scientific computing has been that high levels of computational performance could only be achieved by using Fortran. The MTL was able to definitively achieve its two fundamental goals through the use of an important new programming paradigm: generic programming. Generic programming allows for a compact and concise implementation of MTL as a component library for scientific computing (the first goal). At the same time, performance optimizations are expressed within MTL in a generic fashion, resulting in a portable high-performance library completely written in a high-level language (in this case, C++). No escapes to Fortran, assembly, or external libraries are required.

    Matrix Types

    The use of generic programming enables a large variety of matrix types to be supported (i.e., for each algorithm) in the MTL with many fewer lines of code. The matrix types supported include:
    Arbitrary types can be used for matrix elements (the requirements that must be satisfied by the type will depend on the algorithms in which they will be used). Common types that can be used include
    • float
    • double
    • complex<float>
    • complex<double>

    Algorithms / Linear Algebra Operations

    The algorithmic abstractions that were developed for MTL were motivated to a large extent by the mathematical structures underlying numerical linear algebra. That is, there are well-defined properties that give rise to the mathematical structures known as linear algebras (as well as Banach spaces, Hilbert spaces, etc.).

    Remarkably, there are a very small number of algorithms required to represent the mathematical structures typically associated with numerical linear algebra:

    Mathematical Operation MTL Algorithm
    alpha * x scale()
    x + y add()
    A * x mult()
    | x | norm()
    < x , y > dot()
    A^T transpose()
    Of course each of these mathematical operations is represented by a family of actual functions and algorithms. However, through the use of generic programming the number of actual functions written can be drastically reduced by taking advantage of high-level similarities in structure. The complete list of algorithms can be found in the programmer's guide.

    High Performance

    To achieve highly optimized levels of performance, two issues must be addressed. The first is the more general problem of optimizing for the performance enhancing features of the computer architecture. This must be done no matter what kind of language/compiler is being use (or even if it is being written in assembly). The second set of issues regards the problem of writing abstract and generic code in a high level language without incurring severe overheads. We briefly discuss our solutions to both the architecture and language/compiler issues here.
    Static Polymorphism
    The template facilities in C++ allow functions to be selected at compile-time based on data type, as compared to virtual functions which allow for function selection at run-time (dynamic polymorphism). Static polymorphism provides a mechanism for abstraction which preserves high performance, since the template functions can be inlined just as regular functions. This ensures that the numerous small function calls in the MTL (such as iterator increment operators) introduce no extra overhead. This is especially critical in the MTL since there are ofter hundreds of function calls in the inner-loop of an algorithm.
    Lightweight Object Optimization
    The generic programming style introduces a large number of small objects into the code. This can incur a performance penalty because the presence of a structure can interfere with compiler optimizations, including the mapping of the individual data items to registers. This problem is solved with lightweight object optimization, also know as scalar replacement of aggregates[24].
    Automatic Unrolling and Instruction Scheduling
    Modern compilers can do a great job of unrolling loops and scheduling instructions, but typically only for specific (recognizable) cases. There are many ways, especially in C and C++ to interfere with the optimization process. The MTL containers and algorithms are designed to result in code that is easy for the compiler to optimize. Furthermore, the iterator abstraction makes inter-compiler portability possible, since it encapsulates how looping is performed.
    Algorithmic Blocking
    To obtain high performance on a modern microprocessor, an algorithm must properly exploit the associated memory hierarchy and pipeline architecture (typically through careful tiling and loop blocking). Ideally, one would like to express high performance algorithms in a portable fashion, but there is not enough expressiveness in languages such as C or Fortran to do so. Recent efforts (PHiPAC[4], ATLAS[38]) have resorted to going outside the language, i.e., to code generation systems, in order to gain this kind of flexibility. We have shown that with the use of meta-programming techniques in C++, it is possible to create flexible and elegant high-performance kernels that automate the generation of blocked and unrolled code. We have implemented such a collection of kernels, the Basic Linear Algebra Instruction Set (BLAIS).