One of the most exciting aspects of MTL is the high level of
performance that it delivers.
Here we present performance results for
several MTL algorithms on various platforms.
Presently, we include results for matrix-matrix and matrix-vector
products on the Sun UltraSPARC and the IBM RS6000.
Once the performance tests become part of our nightly tests, we will
be adding results for more architectures, including SGI and Pentium.
We will also be adding more algorithms.
About the Tests
The tests here illustrate several points.
- First, they demonstrate how a single generic algorithm can provide
high levels of performance across different architectures. The same
generic matrix-matrix product algorithm provides vendor-tuned levels
of performance on both the UltraSPARC and the RS6000. (We note that
the optimizations within the algorithms are parameterized in a generic
fashion, with the parameters set appropriately for each target
architecture; the code within the routine itself is identical on both
architectures.)
- Second, they demonstrate how a single generic algorithm can
provide high levels of performance across vastly different data
formats. The same generic matrix-vector product algorithm gives high
levels of performance for both dense and sparse matrix types.
- Finally, they demonstrate that vendor-tuned levels of performance
(or, if you like, Fortran levels of performance) can be achieved with
C++ and that this level of performance is achievable in the presence
of domain-specific abstractions.
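The data-format point can be made concrete in code. The following is a
minimal sketch of the idea, assuming a hypothetical row-based matrix
interface (nrows()/row()); it is not MTL's actual interface:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// One generic algorithm over two data formats: y = A * x, where each
// matrix row is any sequence of (column, value) pairs.
// (Hypothetical interface for illustration; NOT the real MTL interface.)
template <typename Matrix>
std::vector<double> mat_vec(const Matrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.nrows(), 0.0);
    for (std::size_t i = 0; i < A.nrows(); ++i)
        for (const auto& [j, v] : A.row(i))  // same loop, dense or sparse
            y[i] += v * x[j];
    return y;
}

// Dense format: every (column, value) entry of a row is stored.
struct DenseMatrix {
    std::vector<std::vector<std::pair<std::size_t, double>>> rows_;
    std::size_t nrows() const { return rows_.size(); }
    const auto& row(std::size_t i) const { return rows_[i]; }
};

// Sparse format: only the nonzero entries of a row are stored.
struct SparseMatrix {
    std::vector<std::map<std::size_t, double>> rows_;
    std::size_t nrows() const { return rows_.size(); }
    const auto& row(std::size_t i) const { return rows_[i]; }
};
```

Both matrix types compile against the same mat_vec; the compiler
instantiates a specialized traversal for each storage scheme, which is
how one generic algorithm can stay fast across data formats.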
The MTL matrix-matrix multiply timed in the experiments below is
the recursive algorithm described in the POOSC BLAIS paper.
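As a rough sketch of what a recursive blocked product looks like
(illustrative only, with an arbitrary cutoff; this is not the MTL/BLAIS
code itself), the product can be split along its largest dimension
until a block is small enough for a simple base-case kernel:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// C += A * B restricted to the block with origin (ri, rj, rk) and
// extents (m, n, k), recursing on the largest dimension.
// (Illustrative sketch of recursive blocking, not the MTL code.)
void matmul_rec(const Mat& A, const Mat& B, Mat& C,
                std::size_t ri, std::size_t rj, std::size_t rk,
                std::size_t m, std::size_t n, std::size_t k) {
    const std::size_t cutoff = 2;  // kept tiny for illustration; in
                                   // practice tuned per architecture
    if (m <= cutoff && n <= cutoff && k <= cutoff) {
        for (std::size_t i = 0; i < m; ++i)      // base-case kernel
            for (std::size_t p = 0; p < k; ++p)
                for (std::size_t j = 0; j < n; ++j)
                    C[ri + i][rj + j] += A[ri + i][rk + p] * B[rk + p][rj + j];
        return;
    }
    if (m >= n && m >= k) {                      // split rows of A and C
        std::size_t h = m / 2;
        matmul_rec(A, B, C, ri, rj, rk, h, n, k);
        matmul_rec(A, B, C, ri + h, rj, rk, m - h, n, k);
    } else if (n >= k) {                         // split columns of B and C
        std::size_t h = n / 2;
        matmul_rec(A, B, C, ri, rj, rk, m, h, k);
        matmul_rec(A, B, C, ri, rj + h, rk, m, n - h, k);
    } else {                                     // split the shared dimension
        std::size_t h = k / 2;
        matmul_rec(A, B, C, ri, rj, rk, m, n, h);
        matmul_rec(A, B, C, ri, rj, rk + h, m, n, k - h);
    }
}
```

The point of the recursion is that each base-case block is small enough
to stay resident in cache or registers while it is being computed.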
Sun UltraSPARC Results
The particular machine used was an UltraSPARC 170E with KAI C++ v3.3c
in conjunction with Sun's C compiler.
Sun UltraSPARC dense matrix-matrix multiply performance.
Matrix-matrix multiply kernel.
The above graph shows dense matrix-matrix multiply performance for
small matrices. This focuses on the inner loop, or kernel, of the
algorithm, which was implemented using the BLAIS library.
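The key idea behind such a fixed-size kernel can be sketched as follows
(an illustration of the technique, not the actual BLAIS interface): the
block sizes are compile-time template parameters, so every loop bound
is a static constant and the compiler can fully unroll the loop nest:

```cpp
#include <cassert>
#include <cstddef>

// C += A * B for small fixed-size blocks. M, N, K are template
// parameters, so all loop bounds are compile-time constants and the
// compiler is free to unroll the whole nest.
// (Sketch of the idea only; not the real BLAIS interface.)
template <std::size_t M, std::size_t N, std::size_t K>
void mult_kernel(const double (&A)[M][K], const double (&B)[K][N],
                 double (&C)[M][N]) {
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t p = 0; p < K; ++p)
            for (std::size_t j = 0; j < N; ++j)
                C[i][j] += A[i][p] * B[p][j];
}
```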
Sparse matrix-vector multiply performance (row oriented).
Dense matrix-vector multiply performance (column oriented).
IBM RS6000 Results
The particular machine used was an RS6000 590, with KAI C++ v3.2 in
conjunction with IBM's XLC compiler.
IBM RS6000 dense matrix-matrix multiply performance.