AMLI GMRES n sec

Mflops sec

Mflops [R.sub.y] = 1 274625 2.57 662 7.32 1012 571787 5.11 737 19.3 1036 1092727 10.8 677 45.9 1040 1685159 16.2 801 80.6 1049 [R.sub.y] = 10 274625 3.48 639 17.6 908 571787 6.95 768 49.8 944 1092727 14.7 790 134.1 935 1685159 24.5 841 248.8 961 [R.sub.y] = 100 274625 16.9 669 83.9 907 571787 32.3 801 180.4 944 1092727 60.1 862 >900 sec 1685159 90.6 874 >900 sec 7.3 Example 3

These observations imply that a 1024-processor CM-5 cannot execute Loop C at a speed greater than (160MB/sec/16 processors) x (1FLOP/16 bytes) x (1024 processors) = 625

MFLOPS. This is about 1% of its speed on Loop A.

In this case, for order 0 and 1 we have the same number of

Mflops and iterations for some [[sigma].sub.1] and, in three runs, the same (rounded) number of

Mflops but a different number of iterations for order 2 (necessarily lower because more expensive than order 0 and 1).

The MHL-algorithm runs at up to 100

MFlops in the reduction and 130

MFlops in the update, while DSBTRD only reaches 65 and 70

Mflops, respectively.

static void matmul(double[][] A, double[][] B, double[][] C, int m, int n, int p) { int i, j, k; for (i-0; i<m; i++) { for (j-0; j<p; j++) { for (k=0; k<zl; k++) { C[i][3] += A[i][k]*B[k][j]; } } } } The Java code is compiled into a native executable by the IBM High Performance Compiler for Java (HPCJ) [Seshadri 1997], and achieves a performance of 5

Mflops on the RS/6000 F50.

In terms of raw number crunching, the 200MHz Power3 could in theory crank out about 800

MFLOPS, with about 630

MFLOPS realized on benchmark tests.

Each processing node of the CM-5 is equipped with a 32MHz SPARC v7 processor rated at a peak performance of 32 Mips or 5

MFlops.(6) The experiments were run on a 32-processor system running version 7.3 Final I Rev 3 of the CMOST operating system, with version 3.2 of the CMMD message passing library.

(1) Mtops (millions of theoretical operations per second) are roughly comparable to millions of floating-point operations per second (

Mflops), but take into account integer computation, variations in word length between systems, and can serve to rate the performance of low- as well as high-end computers.

2) power consumption, e.g., millions of floating point operations per second (

MFLOPS) per watt, an important military consideration, is less in the DSP chip

For example, a typical computing time for analyzing a metal insert filter with 15 modes is 10 seconds for each frequency point on a typical, common, low-cost workstation with 10

MFLOPS. A typical complete diplexer may be optimized within a one-day run.

The chip set will exceed 100 MIPS and 100

MFLOPS and will generate more than 100,000 polygons per second.

At 9.2 double-precision Linpack

MFLOPS and and SPECmark of 25.8, it offers unparalleled price performance in the UNIX server market.