************************************************
Speed benchmark of SGI Origin2000 using Mulcp:
************************************************

Mulcp is a program to calculate the local density of states. It can
run in either a single-processor (workstation) or a multi-processor
shared-memory environment
(http://mumu.mit.edu/~liju99/Project/project.html).

The core is a very large sparse matrix-vector product subroutine,
which is benchmarked here using a high-precision real-time timer.
Because a sparse matrix-vector product cannot use the cache
efficiently, it falls far short of the peak flop rate of a machine
(390 Mflop/s per R10000 processor (mmm), 200 Mflop/s per R4400
processor (mumu)). Most programs share this limitation, however, so
the figures are adequate for comparing machine speeds.

Machine configurations now:

MMM:
  FPU: MIPS R10010 Floating Point Chip Revision: 0.0
  CPU: MIPS R10000 Processor Chip Revision: 2.6
  4 195 MHZ IP27 Processors
  Main memory size: 1024 Mbytes
  Instruction cache size: 32 Kbytes
  Data cache size: 32 Kbytes
  Secondary unified instruction/data cache size: 4 Mbytes

Mumu:
  FPU: MIPS R4000 Floating Point Coprocessor Revision: 0.0
  CPU: MIPS R4400 Processor Chip Revision: 6.0
  Data cache size: 16 Kbytes
  Instruction cache size: 16 Kbytes
  Secondary unified instruction/data cache size: 1 Mbyte on Processor 0
  Main memory size: 96 Mbytes

************************************************************

Using the cc compiler on mumu with compiler options "-mips2 -O2",
the generated binary "mulcp1" achieves

A. On mumu: 13.89101 Mflop/s

B. On Origin2000 (mmm):
   1 processor:  52.3 Mflop/s
   2 processors: 100.6 Mflop/s
   3 processors: 137.0 Mflop/s
   4 processors: 151.8 Mflop/s

************************************************************

Using the cc compiler on mmm with compiler options
"-64 -mips4 -O3 -Ofast=IP27 -IPA -TARG:platform=IP27 -r10000",
the generated binary "mulcp2" achieves

   1 processor:  56.4 Mflop/s
   2 processors: 119.4 Mflop/s
   3 processors: 163.9 Mflop/s
   4 processors: 179.4 Mflop/s

************************************************************

While making these benchmarks, it was checked (using "top") that no
other major processes were running on mumu or mmm.

For reference, the same program on the Xolas network at MIT (Ultra
HPC 5000, 8 processors, each with a peak rate of 334 Mflop/s),
compiled with the best compiler options found,
"-xO5 -native -xdepend -xchip=ultra -xarch=v8plus", attains

   1 processor:  32.6 Mflop/s
   2 processors: 55.3 Mflop/s
   3 processors: 76.4 Mflop/s
   4 processors: 96.4 Mflop/s
   8 processors: 181.1 Mflop/s

(Li Ju, 8.1.97)
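
The multi-processor figures above are easier to compare as speedup and parallel efficiency relative to each binary's own 1-processor run. The following is a minimal sketch of that arithmetic in C, not part of Mulcp itself: the function names speedup, efficiency and print_scaling and the macro SCALING_DEMO are hypothetical names chosen for this illustration, and the only data used are the rates quoted in this note.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup of a p-processor run relative to the 1-processor run.
 * Both arguments are measured rates in Mflop/s. */
double speedup(double rate_p, double rate_1)
{
    assert(rate_1 > 0.0);          /* guard against division by zero */
    return rate_p / rate_1;
}

/* Parallel efficiency: speedup divided by the processor count p. */
double efficiency(double rate_p, double rate_1, int p)
{
    assert(p > 0);
    return speedup(rate_p, rate_1) / (double)p;
}

/* Print a scaling table for one binary on one machine.
 * rates[0] must be the 1-processor measurement. */
void print_scaling(const char *label, const int *procs,
                   const double *rates, int n)
{
    int i;

    printf("%s\n", label);
    printf("  procs   Mflop/s   speedup   efficiency\n");
    for (i = 0; i < n; i++) {
        printf("  %5d   %7.1f   %7.2f   %9.1f%%\n",
               procs[i], rates[i],
               speedup(rates[i], rates[0]),
               100.0 * efficiency(rates[i], rates[0], procs[i]));
    }
}

#ifdef SCALING_DEMO   /* hypothetical switch: build with -DSCALING_DEMO */
int main(void)
{
    /* Rates copied from the benchmark results in this note. */
    const int    p4[]     = { 1, 2, 3, 4 };
    const double mulcp1[] = { 52.3, 100.6, 137.0, 151.8 };
    const double mulcp2[] = { 56.4, 119.4, 163.9, 179.4 };
    const int    p8[]     = { 1, 2, 3, 4, 8 };
    const double xolas[]  = { 32.6, 55.3, 76.4, 96.4, 181.1 };

    print_scaling("mulcp1 on Origin2000", p4, mulcp1, 4);
    print_scaling("mulcp2 on Origin2000", p4, mulcp2, 4);
    print_scaling("Xolas (Ultra HPC 5000)", p8, xolas, 5);
    return 0;
}
#endif
```

Compiled with -DSCALING_DEMO, the program prints one table per configuration. The speedup is taken against the same binary's 1-processor rate, so it measures scaling only and says nothing about which machine or compiler setting is faster in absolute terms. For that, the Mflop/s column remains the figure to compare.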