The Matrix Template Library Performance Results


One of the most exciting aspects of MTL is the high level of performance that it delivers.

Here we present performance results for several MTL algorithms. Presently, we include results for matrix-matrix product and matrix-vector product on the Sun UltraSPARC and the IBM RS6000. Once the performance tests become part of our nightly tests, we will add results for more architectures, including SGI and Pentium, and for more algorithms.

About the Tests

The tests here illustrate several points.

  • First, they demonstrate how a single generic algorithm can provide high levels of performance across different architectures. The same generic matrix-matrix product algorithm provides vendor-tuned levels of performance on both the UltraSPARC and the RS6000. (The optimizations within the algorithm are parameterized in a generic fashion, and these parameters are set appropriately for the target architecture. The code within the routine itself, however, is the same for both targets.)
  • Second, they demonstrate how a single generic algorithm can provide high levels of performance across vastly different data formats. The same generic matrix-vector product algorithm gives high levels of performance for both dense and sparse matrix types.
  • Finally, they demonstrate that vendor-tuned levels of performance (or, if you like, Fortran levels of performance) can be achieved with C++ and that this level of performance is achievable in the presence of domain-specific abstractions.
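The second point can be illustrated with a minimal sketch in which one generic matrix-vector product serves both a dense and a sparse storage format. This is not MTL's actual interface: the names `DenseMatrix`, `SparseMatrix`, `for_each_in_row`, and `mat_vec` are hypothetical, and the sketch assumes row-oriented storage for both formats.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense matrix with row-major storage (not an MTL type).
struct DenseMatrix {
    std::size_t rows, cols;
    std::vector<double> data;  // rows * cols entries, row by row

    // Visit every (column, value) pair stored in row i.
    template <class F>
    void for_each_in_row(std::size_t i, F f) const {
        for (std::size_t j = 0; j < cols; ++j) f(j, data[i * cols + j]);
    }
};

// Hypothetical compressed-row sparse matrix (not an MTL type).
struct SparseMatrix {
    std::size_t rows, cols;
    std::vector<std::size_t> row_start;  // rows + 1 offsets into the arrays below
    std::vector<std::size_t> col_index;  // column of each stored entry
    std::vector<double> values;          // value of each stored entry

    // Visit only the stored (column, value) pairs of row i.
    template <class F>
    void for_each_in_row(std::size_t i, F f) const {
        for (std::size_t k = row_start[i]; k < row_start[i + 1]; ++k)
            f(col_index[k], values[k]);
    }
};

// One generic algorithm: y = A * x for any matrix type that exposes
// rows, cols, and for_each_in_row.
template <class Matrix>
std::vector<double> mat_vec(const Matrix& A, const std::vector<double>& x) {
    assert(x.size() == A.cols);
    std::vector<double> y(A.rows, 0.0);
    for (std::size_t i = 0; i < A.rows; ++i) {
        double sum = 0.0;
        A.for_each_in_row(i, [&](std::size_t j, double a) { sum += a * x[j]; });
        y[i] = sum;
    }
    return y;
}
```

Because the traversal is resolved at compile time, the dense instantiation reduces to a plain loop over contiguous storage, while the sparse instantiation touches only the stored entries.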

The MTL matrix-matrix multiply timed in the experiments below is the recursive algorithm described in the POOSC BLAIS paper.
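For readers unfamiliar with recursive matrix multiplication, the following is a minimal sketch of the general idea, not the algorithm from the paper. It assumes row-major storage, halves the largest dimension until every dimension fits within a block size, and takes that block size as a compile-time parameter that could be set per target architecture. The names `View` and `multiply_add` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical non-owning view of a row-major sub-matrix.
struct View {
    double* p;                       // element (0, 0) of the view
    std::size_t rows, cols, stride;  // stride = row length of the parent matrix

    double& operator()(std::size_t i, std::size_t j) const {
        return p[i * stride + j];
    }
    // Sub-view of nr x nc elements starting at (r0, c0).
    View sub(std::size_t r0, std::size_t c0,
             std::size_t nr, std::size_t nc) const {
        return View{p + r0 * stride + c0, nr, nc, stride};
    }
};

// C += A * B. Block is the tuning parameter: the recursion stops once
// all three dimensions are at most Block.
template <std::size_t Block>
void multiply_add(View A, View B, View C) {
    static_assert(Block > 0, "Block must be positive");
    assert(A.cols == B.rows && C.rows == A.rows && C.cols == B.cols);
    const std::size_t m = A.rows, k = A.cols, n = B.cols;

    if (m <= Block && k <= Block && n <= Block) {
        // Base case: plain triple loop on a small block.
        for (std::size_t i = 0; i < m; ++i)
            for (std::size_t q = 0; q < k; ++q)
                for (std::size_t j = 0; j < n; ++j)
                    C(i, j) += A(i, q) * B(q, j);
        return;
    }
    if (m >= k && m >= n) {          // split the rows of A and C
        const std::size_t h = m / 2;
        multiply_add<Block>(A.sub(0, 0, h, k), B, C.sub(0, 0, h, n));
        multiply_add<Block>(A.sub(h, 0, m - h, k), B, C.sub(h, 0, m - h, n));
    } else if (n >= k) {             // split the columns of B and C
        const std::size_t h = n / 2;
        multiply_add<Block>(A, B.sub(0, 0, k, h), C.sub(0, 0, m, h));
        multiply_add<Block>(A, B.sub(0, h, k, n - h), C.sub(0, h, m, n - h));
    } else {                         // split the shared inner dimension
        const std::size_t h = k / 2;
        multiply_add<Block>(A.sub(0, 0, m, h), B.sub(0, 0, h, n), C);
        multiply_add<Block>(A.sub(0, h, m, k - h), B.sub(h, 0, k - h, n), C);
    }
}
```

The routine body is identical for every target; only the `Block` argument would change, for example `multiply_add<32>(A, B, C)` on one machine and `multiply_add<64>(A, B, C)` on another. Those values are placeholders, not measured settings.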

UltraSPARC Results

The particular machine used was an UltraSPARC 170E. The compiler used was KAI C++ v3.3c in conjunction with Sun's C compiler.

Sun UltraSPARC dense matrix-matrix multiply performance.

Matrix-matrix multiply kernel.

The above graph shows dense matrix-matrix multiply performance for small matrices. This test isolates the inner loop, or kernel, of the algorithm, which was implemented using the BLAIS library.
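To make the notion of a kernel concrete, here is a minimal sketch of a small multiply whose dimensions are template parameters, so the compiler sees constant loop bounds and is free to unroll them. It is not the BLAIS interface: the name `fixed_size_multiply_add` and its argument layout (row-major storage with explicit leading dimensions) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// C (M x N) += A (M x K) * B (K x N), all row-major.
// lda, ldb, ldc are the row lengths of the enclosing matrices.
template <std::size_t M, std::size_t N, std::size_t K>
inline void fixed_size_multiply_add(const double* A, std::size_t lda,
                                    const double* B, std::size_t ldb,
                                    double* C, std::size_t ldc) {
    static_assert(M > 0 && N > 0 && K > 0, "dimensions must be positive");
    assert(lda >= K && ldb >= N && ldc >= N);
    for (std::size_t i = 0; i < M; ++i) {
        // Accumulate one row of C in a local array of compile-time size.
        double acc[N];
        for (std::size_t j = 0; j < N; ++j) acc[j] = C[i * ldc + j];
        for (std::size_t p = 0; p < K; ++p) {
            const double a = A[i * lda + p];
            for (std::size_t j = 0; j < N; ++j) acc[j] += a * B[p * ldb + j];
        }
        for (std::size_t j = 0; j < N; ++j) C[i * ldc + j] = acc[j];
    }
}
```

A call such as `fixed_size_multiply_add<4, 4, 4>(a, lda, b, ldb, c, ldc)` would then serve as the innermost step of a blocked algorithm; the 4 x 4 x 4 shape is only an example.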

Sparse matrix-vector multiply performance (row oriented).

Dense matrix-vector multiply performance (column oriented).

RS6000 Results

The particular machine used was an RS6000 590. The compiler used was KAI C++ v3.2 with IBM's XLC compiler.

IBM RS6000 dense matrix-matrix multiply performance.