
   ==================================================================
   ===                                                            ===
   ===           GENESIS Distributed Memory Benchmarks            ===
   ===                                                            ===
   ===                            LPM1                            ===
   ===                                                            ===
   ===             Local Particle-Mesh Device Simulation          ===
   ===                                                            ===
   ===               Author:   Roger Hockney                      ===
   ===     Department of Electronics and Computer Science         ===
   ===               University of Southampton                    ===
   ===               Southampton SO9 5NH, U.K.                    ===
   ===     fax.:+44-703-593045   e-mail:rwh@uk.ac.soton.ecs       ===
   ===                                                            ===
   ===     Copyright: SNARC, University of Southampton            ===
   ===                                                            ===
   ===          Last update: March 1992; Release: 2.0             ===
   ===                                                            ===
   ==================================================================


1. Description
--------------

This benchmark is the simulation of an electronic device using a 
particle-mesh (PM) method, often also called a particle-in-cell (PIC)
simulation. In each timestep the electric and magnetic fields on an 
(LMAX x MMAX) mesh are advanced explicitly in time using Maxwell's
equations, and the particles (electrons) are advanced in the fields 
using Newton's equations.
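
In outline, each timestep performs the following cycle.  This is a
schematic sketch only: the routine names are hypothetical, and are not
those used in lpm1bk.f.

c     one PM/PIC timestep, in outline
      do 10 it = 1, nstep
c        explicit finite-difference update of the E and B fields on
c        the (LMAX x MMAX) mesh from Maxwell's equations, using only
c        values local to each mesh point
         call maxwel(e, b, dt)
c        interpolate the fields to the particle positions and advance
c        the particle coordinates and velocities by Newton's equations
         call newton(x, v, e, b, dt)
c        deposit the particle charge and current back onto the mesh as
c        the source terms for the next field update
         call deposi(x, v, rho, cur)
   10 continue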

The benchmark is described as local because the time scale is such that
the fields may be computed explicitly, using only field values local to
each mesh point. The number of particles at the end of the 1-picosecond
run is given empirically by

                       628*alpha**1.172

where alpha is the problem size factor (alpha=1,2,4,8; see section 2).
As the number of mesh-points increases for the same physical dimension,
the time-step must be reduced to satisfy the Courant-Friedrichs-Lewy
(CFL) stability criterion; roughly speaking, halving the space step
requires halving the timestep.  This effect has an important influence
on the meaning of the performance metrics. The performance is expressed
in several different metrics (and units) for comparison purposes.  As
well as the traditional Speedup and Efficiency, we give the Temporal
(tstep/s), Simulation (sim-ps/s), and Benchmark (Mflop/s(LPM1))
performance, which are much more meaningful and useful measures.

Parallelisation is by one-dimensional domain decomposition in the first
coordinate. Each processor is responsible for a slab of space, and stores
the mesh-points and the coordinates of the particles in its region of
space. During each timestep, particle coordinates are transferred between
processors as the particles move from region to region.
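
The following is a minimal sketch of the decomposition, assuming LMAX
mesh-points in the first coordinate divided evenly over NPROC processors
(the program and variable names are illustrative only, not those of the
benchmark source):

c     which slab owns a particle at grid coordinate X?  processors
c     are numbered 0..NPROC-1, each holding LMAX/NPROC consecutive
c     mesh columns; a particle that crosses a slab boundary during
c     the push is sent to the neighbour that now owns it
      program owner
      integer lmax, nproc, ip
      real x
      read (5,*) lmax, nproc, x
      ip = int(x)/(lmax/nproc)
      ip = min(max(ip, 0), nproc - 1)
      write (6,*) 'particle at x =', x, ' belongs to processor', ip
      end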

Temporal Performance

Temporal performance is the inverse of the execution time, here expressed
in units of timesteps per second (tstep/s); for a run of NSTEP timesteps
completed in T wall-clock seconds it is NSTEP/T.  This is the fundamental
metric of performance, because it is in absolute units and one can
guarantee that the code with the highest temporal performance executes in
the least time.

Speedup and Efficiency

Speedup, Sp, has the traditional definition of the ratio of 1-processor
to p-processor execution time, Sp = T1/Tp, and Efficiency, Ep = Sp/p, is
the Speedup per processor.  Because Speedup is a relative measure, the
program with the highest Speedup may not execute in the least time: an
inefficient 1-processor version inflates T1 and hence the Speedup. Be
warned.

Simulation Performance

This metric measures the amount of simulated time computed in one real
wall-clock second. It is the most meaningful metric for a simulation,
because it is what the user actually wishes to maximise. For this
benchmark, the units are simulated picoseconds per second (sim-ps/s).
In this metric, larger problems with more mesh points are correctly
reported as running slower (which in fact they do), even though they
generate more Speedup and Mflop/s! The metric also reflects the fact
that problems with a smaller space step must often use a smaller
timestep, and therefore take more timesteps to cover the same amount
of simulated time.

Benchmark Performance

This metric is calculated from the nominal number of floating-point
operations needed to perform the benchmark on a single processor.  For
the one-picosecond benchmark set up here, the average number of floating-
point operations per timestep is defined to be:

             F_b(alpha) = 46*75*33*alpha + 58*628*alpha**1.172

where the size factor alpha=1,2,4,8 for cases NBEN3=1,2,3,4. The first
term above is the work to update the fields on the mesh, and the second
term is the work to move the particles.  Then the benchmark performance is

              R_b(alpha,p) = F_b(alpha)/Tp(alpha,p)

where Tp(alpha,p) is the measured execution time per timestep on p
processors.

Performance calculated in this way has the units Mflop/s(LPM1). Different
parallel implementations may, in fact, perform more or fewer operations
than the above, but they are only credited with the number given by the
formula.  Because F_b is fixed for all codes, we can guarantee that the
code with the highest benchmark performance executes in the least time.
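
As an illustration only, the sketch below computes all of the metrics
above from measured times.  It is not part of the benchmark: the names
are hypothetical, T1 and TP are assumed to be measured times per
timestep in seconds, DTPS is the simulated timestep in picoseconds,
and F_b is scaled to millions of operations so that R_b comes out in
Mflop/s(LPM1).

c     compute the LPM1 performance metrics from measured times
c     (illustrative sketch; names are not those of benctl.f)
      program metric
      real alpha, dtps, t1, tp, fb, rt, rs, sp, ep, rb
      integer np
c     size factor, processor count, timestep (ps), and the measured
c     1-processor and p-processor times per timestep (s)
      read (5,*) alpha, np, dtps, t1, tp
c     nominal work per timestep, scaled to units of Mflop
      fb = (46.0*75.0*33.0*alpha + 58.0*628.0*alpha**1.172)/1.0e6
      rt = 1.0/tp
      rs = dtps/tp
      sp = t1/tp
      ep = sp/real(np)
      rb = fb/tp
      write (6,*) 'temporal   performance (tstep/s)       =', rt
      write (6,*) 'simulation performance (sim-ps/s)      =', rs
      write (6,*) 'speedup                                =', sp
      write (6,*) 'efficiency                             =', ep
      write (6,*) 'benchmark  performance (Mflop/s(LPM1)) =', rb
      end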


2. Operating Instructions
-------------------------

Changing problem size and numbers of processes:

Most of the parameters are internally fixed. The user has to specify
only the number of processors and the number of particles. These are
input from the standard input on channel 5.

Suggested Problem Sizes :

Four benchmark cases are provided (NBEN3=1,2,3,4), giving four problem
sizes described by the size factor alpha=1,2,4,8 and mesh dimensions
(75*alpha x 33).
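
Using the empirical particle formula of section 1, the approximate
number of particles at the end of each run is:

     case (NBEN3)   alpha     mesh      final particles (approx.)
          1           1     ( 75x33)            630
          2           2     (150x33)           1420
          3           4     (300x33)           3190
          4           8     (600x33)           7180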


Compiling and running the benchmark:
 
1) To compile and link the benchmark type:   make
  
2) To run the benchmark type:     host
   
3) Input parameters from the standard input:

    -The number of nodes for the MIMD run is at most one less than
     the number of nodes allocated by getcube, because one node is
     always used to perform a 1-processor check calculation.

    -For every problem size, the 1-processor calculation must be
     performed once. The results are stored in the four check
     result files: res1p.size1, ... , res1p.size4. After that,
     benchmark runs can be performed without waiting for the
     1-processor run, which is slow for the largest problem size.
     To omit the 1-processor run, answer 0 to the second question.

The results for the four problem sizes, cases 1,2,3 and 4, and
different number of processors are put automatically in different 
output files, with notation (for example):

  lpm1c3p25 - output for lpm1 benchmark, case 3 for 25 processors

If you wish to keep the files elsewhere, a prompt tells you when to move
them with a Unix cp command.

Files

      host.u      - host program, contains PARMACS for host.
      node.u      - node main program and all communication interface
                    routines, therefore all node PARMACS calls are here.
      benctl.f    - benchmark control, may be changed to modify
                    output, but usually left alone. No PARMACS here.
      lpm1bk.f    - body of benchmark code. Not to be touched.
      res1p.size1 - correct results on one processor for standard
                    size problem, case1, (75x33) mesh.
      res1p.size2 - results for case2 problem (150x33) mesh.
      res1p.size3 - results for case3 problem (300x33) mesh.
      res1p.size4 - results for case4 problem (600x33) mesh.
      secowa.f    - LPM1 program second timer, which calls
                    the standard benchmark system timer.
      header.f    - standard header information.
      lpm1c4p100  - etc., output files generated by the program.

3. Accuracy Check
-----------------
Because the simulation uses random numbers, the multi-processor
calculation cannot be expected to give identical results to the
uni-processor calculation. However, the percentage difference in
particle number, NP, and average B-field, BAV, in the last timestep
should not exceed a few percent.  Calculations are accepted if the
differences are less than 10%.
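
As an illustration only (the check is built into the benchmark; the
sketch below is not taken from it, and all names are hypothetical),
the acceptance test amounts to:

c     compare the last-timestep particle number and average B-field
c     of the p-processor run against the stored 1-processor results
c     (illustrative sketch only; not the benchmark's own code)
      program check
      real anp1, anpp, bav1, bavp, dnp, dbav
c     read NP and BAV for the 1-processor and p-processor runs
      read (5,*) anp1, bav1
      read (5,*) anpp, bavp
      dnp  = 100.0*abs(anpp - anp1)/anp1
      dbav = 100.0*abs(bavp - bav1)/abs(bav1)
      if (dnp .lt. 10.0 .and. dbav .lt. 10.0) then
         write (6,*) 'accepted: differences below 10%'
      else
         write (6,*) 'rejected: dNP =', dnp, ' dBAV =', dbav
      endif
      end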
