************************************************
Speed benchmark of SGI Origin2000 using Mulcp: 
************************************************

Mulcp is a program that calculates the local density of states. It
runs in either a single-processor (workstation) or a multi-processor
shared-memory environment
(http://mumu.mit.edu/~liju99/Project/project.html). Its core is a very
large sparse matrix-vector product subroutine, which is benchmarked
here using a high-precision real-time timer. Because a sparse
matrix-vector product cannot use the cache efficiently, it falls far
short of the peak flop rate of a machine (390 Mflop/s per R10000
processor on mmm, 200 Mflop/s per R4400 processor on mumu).
Most programs behave the same way, however, so the measured rate is
still a fair basis for comparing machine speeds.
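
As an illustration of the kind of kernel being timed, here is a minimal
sketch of a sparse matrix-vector product in C. It assumes compressed
sparse row (CSR) storage; the storage format Mulcp actually uses is not
stated in this note, and the names csr_matrix, csr_matvec,
csr_matvec_flops and mflop_rate are hypothetical. The sketch also shows
the usual flop count (one multiply and one add per stored entry) from
which a Mflop/s figure can be derived.

```c
#include <stddef.h>

/* Hypothetical CSR (compressed sparse row) container; not taken from Mulcp. */
typedef struct {
    size_t nrows;
    const size_t *row_ptr;  /* nrows + 1 entries; row i spans [row_ptr[i], row_ptr[i+1]) */
    const size_t *col_idx;  /* column index of each stored entry */
    const double *val;      /* value of each stored entry */
} csr_matrix;

/* y = A * x. The read x[col_idx[k]] is indirect, so consecutive
   iterations may touch widely separated memory locations. */
void csr_matvec(const csr_matrix *a, const double *x, double *y)
{
    for (size_t i = 0; i < a->nrows; i++) {
        double sum = 0.0;
        for (size_t k = a->row_ptr[i]; k < a->row_ptr[i + 1]; k++)
            sum += a->val[k] * x[a->col_idx[k]];
        y[i] = sum;
    }
}

/* Flops per product: one multiply and one add per stored entry. */
double csr_matvec_flops(const csr_matrix *a)
{
    return 2.0 * (double)a->row_ptr[a->nrows];
}

/* Rate in Mflop/s for `reps` products that took `seconds` of wall-clock time. */
double mflop_rate(const csr_matrix *a, size_t reps, double seconds)
{
    return csr_matvec_flops(a) * (double)reps / seconds / 1.0e6;
}
```

The indirect read of x is the reason such a kernel tends to reuse cached
data poorly: which entries of x are touched depends on the sparsity
pattern rather than on the loop index.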

Current machine configurations:

MMM:

FPU: MIPS R10010 Floating Point Chip Revision: 0.0
CPU: MIPS R10000 Processor Chip Revision: 2.6
4 195 MHZ IP27 Processors
Main memory size: 1024 Mbytes
Instruction cache size: 32 Kbytes
Data cache size: 32 Kbytes
Secondary unified instruction/data cache size: 4 Mbytes

Mumu:

FPU: MIPS R4000 Floating Point Coprocessor Revision: 0.0
CPU: MIPS R4400 Processor Chip Revision: 6.0
Data cache size: 16 Kbytes
Instruction cache size: 16 Kbytes
Secondary unified instruction/data cache size: 1 Mbyte on Processor 0
Main memory size: 96 Mbytes

************************************************************

Using the cc compiler on mumu with compiler options "-mips2 -O2", the
generated binary "mulcp1" attains

A. On mumu:  13.89101 Mflop/s

B. On mmm (Origin2000):
 
   1 processor:   52.3 Mflop/s
   2 processors:  100.6 Mflop/s
   3 processors:  137.0 Mflop/s
   4 processors:  151.8 Mflop/s

************************************************************

Using the cc compiler on mmm with compiler options "-64 -mips4 -O3
-Ofast=IP27 -IPA -TARG:platform=IP27 -r10000", the generated 
binary "mulcp2" attains

  1 processor:  56.4   Mflop/s
  2 processors: 119.4  Mflop/s
  3 processors: 163.9  Mflop/s
  4 processors: 179.4  Mflop/s
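
To read the scaling out of a table like the one above, the speedup and
parallel efficiency can be computed from the measured rates. The sketch
below is only an illustration; the function names and the array
mulcp2_rate are hypothetical, and the array simply holds the four
"mulcp2" figures listed above.

```c
#include <stdio.h>

/* Speedup of a p-processor run relative to the 1-processor run. */
double speedup(double rate_p, double rate_1)
{
    return rate_p / rate_1;
}

/* Parallel efficiency: speedup divided by the processor count. */
double efficiency(double rate_p, double rate_1, int nproc)
{
    return speedup(rate_p, rate_1) / (double)nproc;
}

/* Measured mulcp2 rates on mmm, in Mflop/s, for 1..4 processors. */
const double mulcp2_rate[4] = {56.4, 119.4, 163.9, 179.4};

/* Print one line per processor count: rate, speedup, efficiency. */
void print_scaling(const double *rate, int n)
{
    for (int p = 1; p <= n; p++)
        printf("%d  %7.1f Mflop/s  speedup %.2f  efficiency %.2f\n",
               p, rate[p - 1],
               speedup(rate[p - 1], rate[0]),
               efficiency(rate[p - 1], rate[0], p));
}
```

Calling print_scaling(mulcp2_rate, 4) gives, for example, a speedup of
about 2.12 on 2 processors and about 3.18 on 4 processors (an
efficiency of roughly 0.80).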

************************************************************

While these benchmarks were run, "top" was used to verify that no
other major processes were running on mumu or mmm.

For reference, the same program on the Xolas network at MIT (Ultra HPC
5000, 8 processors, each with a peak flop rate of 334 Mflop/s),
compiled with the best compiler options "-xO5 -native -xdepend
-xchip=ultra -xarch=v8plus", attains

1 processor:    32.6     Mflop/s
2 processors:   55.3     Mflop/s
3 processors:   76.4     Mflop/s
4 processors:   96.4     Mflop/s
8 processors:   181.1    Mflop/s
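
Since the three machines have different peak rates, it can help to
express each measurement as a fraction of the theoretical peak. The
helper below is a hypothetical sketch (the name fraction_of_peak is not
from Mulcp); it takes a measured rate, the per-processor peak quoted in
this note, and the number of processors used.

```c
/* Fraction of the theoretical peak reached by a run on nproc processors.
   Both rates are in Mflop/s; peak_per_proc is the per-processor peak. */
double fraction_of_peak(double measured_mflops, double peak_per_proc, int nproc)
{
    return measured_mflops / (peak_per_proc * (double)nproc);
}
```

With the single-processor figures in this note, fraction_of_peak(56.4,
390.0, 1) is about 0.14 for mulcp2 on mmm, fraction_of_peak(32.6,
334.0, 1) is about 0.10 on Xolas, and fraction_of_peak(13.89101, 200.0,
1) is about 0.07 on mumu.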

(Li Ju,8.1.97)