Because of the extra complexity of sparse matrix storage, the flop rate of a large sparse matrix-vector product falls far short of that of small dense linear systems. In this year's dense matrix-multiply contest the winner in our class achieved 160 Mflop/s on a single Xolas processor, and we have seen a peak of 230 Mflop/s from the SUN Performance Library built specifically for the v8plus architecture. In contrast, a straightforward implementation of the sparse matrix-vector product (sketched after the table below) yields only about 30 Mflop/s on the same machine:


  # of processors    Peak Mflop/s    Mflop/s per processor
         1             30.1790             30.1790
         2             61.2940             30.6470
         3             85.5789             28.5263
         4            112.588              28.1470
         5            137.041              27.4082
         6            157.297              26.2161
         7            181.132              25.5903
         8            201.322              25.1653

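To make the flop accounting concrete, here is a minimal sketch of such a straightforward kernel, assuming compressed-sparse-row (CSR) storage; the array names (rowptr, colind, val) are ours, and the actual code may well store A in a different format. One product performs 2*nnz floating-point operations (one multiply and one add per stored element).

    /* y = A*x with A stored in compressed-sparse-row (CSR) form.
       rowptr[0..n]    : start of each row in colind[] and val[]
       colind[0..nnz-1]: column index of each stored element
       val[0..nnz-1]   : value of each stored element
       One call performs 2*nnz flops (a multiply and an add per element). */
    void spmv_csr(int n, const int *rowptr, const int *colind,
                  const double *val, const double *x, double *y)
    {
        int i, k;
        for (i = 0; i < n; i++) {
            double sum = 0.0;
            for (k = rowptr[i]; k < rowptr[i+1]; k++)
                sum += val[k] * x[colind[k]];   /* indirect access to x */
            y[i] = sum;
        }
    }

The indirect access is what holds the rate near 30 Mflop/s: every multiply-add also has to fetch an integer index and a matrix value from memory. If A is stored by columns instead, the irregular access falls on b rather than x, but the memory-traffic argument is the same.
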
Unless we did something very stupid (such as mis-counting the flops), we see no simple and ready way to improve the single-processor speed. Theoretically, for b = Ax, the more compact (better still, the more nearly diagonal) the non-zero pattern of the sparse matrix A is, the better the chance of a cache hit when updating b. Usually A is generated from a 3D lattice in which nearest and second-nearest neighbours interact, so how best to order A to meet this criterion is important. Perhaps in the end a divide-and-conquer method will work best. We see this as our next project.
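
As a purely illustrative example of why compactness is hard to achieve, consider a 7-point nearest-neighbour stencil on an nx x ny x nz lattice with the usual lexicographic numbering p = i + nx*(j + ny*k); second-nearest neighbours only widen the pattern further. The little routine below, which is not part of our code, lists the columns touched by an interior row:

    /* Columns of A touched by the row of an interior lattice point (i,j,k)
       under lexicographic numbering.  The indices span about 2*nx*ny entries
       of the vector, so for large lattices they cannot all stay in cache.
       Boundary points (which have fewer neighbours) are ignored here.      */
    #include <stdio.h>

    void row_columns(int i, int j, int k, int nx, int ny)
    {
        int p = i + nx * (j + ny * k);   /* row index of point (i,j,k)      */
        int cols[7], c;
        cols[0] = p;                                  /* diagonal           */
        cols[1] = p - 1;      cols[2] = p + 1;        /* +/- x neighbours   */
        cols[3] = p - nx;     cols[4] = p + nx;       /* +/- y neighbours   */
        cols[5] = p - nx*ny;  cols[6] = p + nx*ny;    /* +/- z neighbours   */
        for (c = 0; c < 7; c++)
            printf("%d ", cols[c]);
        printf("\n");
    }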

On the other hand, the scaling with the number of processors is quite good for an SMP machine. As a point of comparison for how Xolas is doing, the same code gets 14.329 Mflop/s with 1 process and 26.541 Mflop/s with 4 processes on a DEC Alpha 200 workstation.
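
The good scaling is easy to understand: the rows of A split into independent chunks, so each processor works on its own block of the result while sharing the read-only input vector. Our timed code runs one process per processor; the OpenMP fragment below is only a sketch of the same row-wise partitioning, not the parallelization we actually used.

    /* Row-partitioned parallel sparse product (illustrative OpenMP version).
       Each thread handles a contiguous block of rows and writes a disjoint
       part of y; x is shared read-only, so no synchronization is needed
       inside the loop.                                                     */
    #include <omp.h>

    void spmv_csr_parallel(int n, const int *rowptr, const int *colind,
                           const double *val, const double *x, double *y)
    {
        int i;
        #pragma omp parallel for schedule(static)
        for (i = 0; i < n; i++) {
            double sum = 0.0;
            int k;
            for (k = rowptr[i]; k < rowptr[i+1]; k++)
                sum += val[k] * x[colind[k]];
            y[i] = sum;
        }
    }

The gradual drop from about 30 to 25 Mflop/s per processor in the table presumably reflects the processors competing for shared memory bandwidth rather than any load imbalance.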