Summary of workshop on GRAPE:TNG


************************************************************************
*                                                                      *
*                        A Brief Summary of                            *
*                                                                      *
*       Workshop on GRAPE:TNG Design, Applications, and Funding        * 
*                                                                      *
*           National Center for Supercomputing Applications            * 
*                    Champaign/Urbana, Illinois                        * 
*                                                                      *
*                         December 11, 1995                            * 
*                                                                      *
************************************************************************

List of Participants
--------------------

Greg Bryan            gbryan@ncsa.uiuc.edu
Gus Evrard            evrard@pablo.physics.lsa.umich.edu
Carlos Frenk          C.S.Frenk@durham.ac.uk
Piet Hut              piet@sns.ias.edu
Jun Makino            makino@kyohou.c.u-tokyo.ac.jp
Steve McMillan        steve@zonker.drexel.edu
Michael Norman        norman@ncsa.uiuc.edu
Larry Smarr           smarr@ncsa.uiuc.edu
Thomas Sterling       tron@cesdis1.gsfc.nasa.gov
Frank Summers         summers@astro.princeton.edu
Makoto Taiji          taiji@chianti.c.u-tokyo.ac.jp
Peter Teuben          teuben@astro.umd.edu
Joel Welling          welling@psc.edu


 
The Future of Supercomputing
----------------------------

Larry Smarr gave his view of the future of supercomputing and his
plans for NCSA. 

-- The commodity microprocessor has taken over the market. Without the
cold war to spur defense purchases of the latest and greatest
supercomputers, companies need to address the full market, from
workstations up to supercomputers, in order to be economically viable.
The supercomputer market alone is just not large enough to support a
company. Vendors are converging on a model in which the same chips and
operating system run workstations, departmental servers, and supercomputers.

-- The dominant paradigm of the future will be shared memory
multi-processor systems (SMP). Nearly every major vendor (except Intel?)
at Supercomputing 95 showed a strategy leading to an SMP machine. In
practice, only 8-16 CPUs can be connected to a common memory before bus
conflicts create a bottleneck. The SMP supercomputers will be arrays of
these 8-16 CPU units with high-speed interconnects. For the programmer,
the machine will appear as a global shared memory machine, but data
locality and cache coherence (if not built in) may be extremely
important factors to consider. SMP code within the 8-16 CPU units
with message passing between the units might prove to be fast.

-- An example of this is the Convex Exemplar at NCSA, which is becoming
the major compute engine at NCSA. The Convex crossbar interconnect
provides very good remote memory access, though a remote access still
carries a penalty of roughly 100x relative to local memory. A simple port of a gas
dynamics code (CMHOG) that ran well on the CM-5 shows very good scaling
on the Exemplar.

-- What to expect in the year 2000? Intel will have a one-off machine
with thousands of P7 processors at about 1.5 GFlops/CPU. The centers
will probably have smaller machines with hundreds of CPUs. Note that HP
and Intel are jointly developing the P7, so that the Exemplar successor
will also use a P7 (aka PA-9XXX) processor. Expect about 300-500 GB of
RAM and about 100 TB of storage. Net connections to certain major
research sites should be up to 150 Mbits/s.

-- If one is going to use special purpose hardware, it must provide a
significant advancement of the field. "Significant" may be defined for
practical purposes as allowing calculations to be done that will not be
possible otherwise for at least 3 years. Price/performance ratio is also
a consideration, but GRAPE has shown itself very strong in that area. An
important concern is that the project must not fall behind schedule, else
the lead time over general purpose machines is diminished.



GRAPE History
-------------

Makoto Taiji presented an overview of GRAPE history. See the papers
referenced at the end for the details.

-- Interesting table. Shows the chip space needed for doing high
precision calculations as well as the extra needed for the table look-up
storage. Performance in Flops is a bit hard to measure because the
operations would be implemented differently on a general purpose
machine.

                                     Grape-3   Grape-4  MD-Grape

   Number of transistors per chip     1e5        4e5      8e5
   Performance per chip (GFlops)      0.64       0.75     1.0

-- About 40 GRAPE systems in use worldwide. This point brought up a
discussion about what the market might be for small versions of
GRAPE:TNG, which we referred to as a Junior version. Would these same
places want to buy an upgrade to a TNG:Jr that would provide about 1-5
TFlops as a backend to a workstation? Seems reasonable and should be
incorporated into plans.

-- GRAPE-4 won the Gordon Bell prize at Supercomputing 95.



GRAPE:TNG Current Ideas
-----------------------

Jun Makino presented simple ideas and estimates of performance for
building the next GRAPE machine.

-- No change to the basic structure of a host computer frontend
to the GRAPE board.

-- GRAPE-4 uses 1 micron fabrication line width, 400 k transistors,
32 MHz clock, 600 MFlops/chip, 5-8 Watts of power per chip

-- If one were to begin building GRAPE:TNG now, one could move to
0.35 micron, 4-6 M transistors, 50-100 MHz, 10-30 GFlops, 20 W

-- If one waits until 1998 to begin work, the numbers are 0.2-0.25
micron, 10-20 M transistors, 50-200 GFlops, 10-40 W

-- Jun's current idea is to use the technology available in 1998 to create
a machine with about 10,000 chips to reach petaflops performance.
Several reasons for waiting. First, Jun and the group at Tokyo are
primarily astronomers and want some time away from machine design to get
some science done. Second, a delay will allow for time to study how to
incorporate as many types of problems into the new machine as possible. 
Third, and maybe most important, funding may take a while to coordinate,
with expected funding for a big project in Japan not readily available
until the 1998 timeframe.  If we don't apply for funding in the US until
next year (1996) and it doesn't start until 97, then the study period is
only a year. Also, of course, 1998 is the first year in which reaching a
petaflops will be economically viable (less than $10 million, say).
Crossing such a performance benchmark is a target to aim for and is
helpful in generating interest.

-- Some discussion over the waiting period. There are problems
associated with trying to get funding for a project that does not have
strong deliverables for each year of the project. Wouldn't an
immediate start on a machine that reaches about 100 Tflops also be a good
idea? People generally agreed that, for some applications, there are
studies that need to be done to best assess how GRAPE:TNG can be
utilized -- some design and study phase will be needed.

-- Most applications will not be able to get Petaflops speed. Those that
will include simulations of globular clusters and binary black holes in
galactic nuclei. These will only require a 10 GFlops host machine.
Architecture of the host is not that critical, except in that the I/O must
be scalable. Need about 2 GB/s I/O for the above calculations.

-- In considering GRAPE usage, the current bottleneck is the host
speed. For the globular cluster problem, the GRAPE-4 is limited by the
speed of the DEC Alpha workstation used as a host.  With this host, the
point at which the GRAPE hardware achieves half its peak speed corresponds
to a few times 10^5 particles, but, even at teraflops speeds, core
collapse in such a system would still require several years' worth of
calculation.  The "half-peak number" decreases as the speed of the front
end increases.  At a speed of 0.5 Tflops, a 50k particle simulation would
reach core collapse in about a month.  Increasing the speed of the present
host by a factor of 4-5 would achieve half-peak speed for this value of N.
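
A rough way to see this bottleneck is a two-part cost model: the host does
O(N) work per step (predictor, bookkeeping, I/O) while the GRAPE pipelines
do the O(N^2) force sum, and the hardware reaches half its peak speed when
the two times are equal. The sketch below illustrates the idea; all the
constants (host operations per particle, flops counted per pairwise
interaction, host speeds) are illustrative assumptions, not measured
GRAPE-4 numbers.

    # Illustrative host-vs-GRAPE cost model; constants are assumptions.

    def step_times(n, host_flops, grape_flops,
                   host_ops_per_particle=2e3, flops_per_pair=57.0):
        """Wall-clock time of one full force evaluation, split into the
        O(N) host part and the O(N^2) part offloaded to GRAPE."""
        t_host = n * host_ops_per_particle / host_flops
        t_grape = n * n * flops_per_pair / grape_flops
        return t_host, t_grape

    def half_peak_n(host_flops, grape_flops,
                    host_ops_per_particle=2e3, flops_per_pair=57.0):
        """N at which host and GRAPE times are equal (half of peak speed)."""
        return (host_ops_per_particle / flops_per_pair) * (grape_flops / host_flops)

    print(step_times(1e5, 0.15e9, 1e12))
    print("%.1e" % half_peak_n(0.15e9, 1e12))   # slow host: half-peak N ~ 2e5
    print("%.1e" % half_peak_n(0.70e9, 1e12))   # ~5x faster host: N drops to ~5e4

Note that a faster front end lowers the half-peak particle number, in line
with the discussion above.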



Smoothed Particle Hydrodynamics and GRAPE
-----------------------------------------

During one of the discussions on scientific applications, the idea of
implementing SPH in hardware was brought up. This topic had also been
broached at the GRAPE workshop in Germany. One can almost come up with a
design for a board using a look-up table that could be used for both
gravity and SPH calculations, but certain difficulties of doing SPH
force symmetrization and the viscosity calculation make this not an
attractive option. However, if one could define an "optimal" SPH
implementation, then creating a hardware board would be a similar
development process to the GRAPE. One would then have to find a group
willing to design and fabricate an SPH hardware board, as the Tokyo
group would not have the extra manpower. Extra money would be required
as well.  Such an approach would also require a thorough testing of SPH
parameterizations and a study of how to coordinate the two boards,
meaning extra development time. The idea remains open for consideration,
discussion, and offers of time, manpower, and money.
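
As a concrete illustration of the shared look-up table idea, the sketch
below tabulates two radial kernels -- a Plummer-softened gravity factor
and the standard cubic-spline SPH kernel -- and evaluates both through the
same interpolation path, the way a GRAPE-style pipeline evaluates an
arbitrary central force from a table. The table size, softening, and
smoothing length are illustrative assumptions, not a board design.

    import numpy as np

    N_TABLE = 1024
    R2_MAX = 4.0                     # table covers 0 <= r^2 < R2_MAX (code units)

    def plummer_force_factor(r2, eps2=0.01):
        # softened gravity: F_vec = m_i m_j (r_j - r_i) * (r^2 + eps^2)^(-3/2), G = 1
        return (r2 + eps2) ** -1.5

    def spline_kernel(r2, h=1.0):
        # standard 3-D cubic-spline SPH kernel W(r, h)
        q = np.sqrt(r2) / h
        sigma = 1.0 / (np.pi * h ** 3)
        w = np.where(q < 1.0, 1.0 - 1.5 * q**2 + 0.75 * q**3,
            np.where(q < 2.0, 0.25 * (2.0 - q) ** 3, 0.0))
        return sigma * w

    def build_table(func):
        r2 = np.linspace(0.0, R2_MAX, N_TABLE)
        return r2, func(r2)

    def lookup(table, r2):
        # piecewise-linear interpolation stands in for the hardware table
        grid, values = table
        return np.interp(r2, grid, values)

    gravity_table = build_table(plummer_force_factor)
    sph_table = build_table(spline_kernel)
    r2 = np.array([0.05, 0.5, 2.0])
    print(lookup(gravity_table, r2))
    print(lookup(sph_table, r2))

The difficulties mentioned above (force symmetrization and the viscosity,
which depend on particle pairs rather than on a single radial function)
are exactly what this simple picture leaves out.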



US Funding Ideas
----------------

Various comments from various people.

-- NSF Centers re-competition. The re-competition had not been announced by
the time of this meeting. It has since been announced and details are
available at the following URL:

      http://www.cise.nsf.gov/cise/ASC/

The ideas before announcement were that, if we can make a strong enough
scientific case, we could become aligned as a partnership center with
one of the supercomputing centers making bids under this program. The
amount of funding available would of course be tight, but may be enough
to cover the project.  The project has not been evaluated carefully in
light of the announced program solicitation. Much more discussion,
outlining, and budgeting needs to be done on this point if this is to be
an option. Preproposals are due by April 1.

-- Argument to address: what happens if a graduate student comes up with
a brilliant algorithm idea that bypasses our speed-up? Can we develop
new algorithms to take advantage of the GRAPE hardware or are we really
locked into specific paradigms? One answer is that Tree and P3M
algorithms can and have been adapted to GRAPE. There is also some memory
of someone adapting the Fast Multipole Method to GRAPE, with results
similar to P3M. One advantage of FMM over P3M is that you can use
GRAPE-4 without multiple calls since in FMM you calculate Plummer
law forces instead of the force with a cutoff. Algorithmic development
is possible.
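
For reference, the pairwise operation that all of these algorithms offload
to GRAPE is the Plummer-law (softened) force sum. The direct-summation
sketch below shows that operation in software; the particle setup and
softening value are illustrative, and a tree, P3M, or FMM host code would
call the equivalent hardware routine on subsets of particles.

    import numpy as np

    def plummer_accelerations(pos, mass, eps=0.01):
        # a_i = sum_j m_j (r_j - r_i) / (|r_j - r_i|^2 + eps^2)^(3/2), with G = 1
        acc = np.zeros_like(pos)
        for i in range(len(mass)):
            dr = pos - pos[i]                         # r_j - r_i for all j
            r2 = np.einsum('ij,ij->i', dr, dr) + eps ** 2
            inv_r3 = r2 ** -1.5
            inv_r3[i] = 0.0                           # no self-interaction
            acc[i] = np.sum((mass * inv_r3)[:, None] * dr, axis=0)
        return acc

    rng = np.random.default_rng(0)
    pos = rng.standard_normal((256, 3))
    mass = np.full(256, 1.0 / 256)
    print(plummer_accelerations(pos, mass)[:2])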

-- Argument in favor: compare the GRAPE successful history to the
history of QCD special purpose calculations. The point here is that some
people on review panels might have doubts about special purpose hardware
because some previous projects did not deliver the kind of advantages we
will claim. Showing several projects done on budget and on time and with
the claimed performance should remove those doubts.

-- Argument in favor: This collaboration brings together experts in
hardware, software, and science under one project. Though that seems the
natural thing to do, it is only rarely achieved in practice. Such a
balanced approach covering all the bases will be a strong point.



Scientific Applications
-----------------------

Much of the discussion here was devoted to creating the table presented
at the end of this section. The point of the table is to numerically
justify which applications will be well suited to GRAPE:TNG because it
will enable larger N simulations to be done years earlier. An important
point is how many years advantage over general purpose supercomputers we
will gain and that is listed in the final column of the table. The other
necessary points to mention are the target science problems that can be
solved with GRAPE:TNG, but not with general purpose machines. Let us begin with
some brief discussions of the problem areas.


-- Globular Cluster Evolution

     This is the quintessential problem for GRAPE: the CPU time is
heavily dominated by pairwise force calculations among the stars, and
the memory storage is small enough to be handled by a
workstation. In addition, smart algorithms do not increase speed
significantly because the accuracy criterion must be set so high. Thus,
adding a hardware turbocharger for force calculations makes excellent
sense.

     GRAPE-4 was designed to address the theoretical prediction of core
oscillations and did indeed confirm their existence. GRAPE:TNG will
allow another factor of ten more particles. Some target science projects
mentioned: determine characteristics of gravo-thermal oscillations over
a wide range of N (including the highly chaotic oscillations expected
for N > 10^5), calibrate Fokker-Planck and Monte Carlo methods, allow
cluster simulations on a star-by-star basis, including a treatment of
stellar evolution and physical collisions, in order to trace the
formation mechanisms of X-ray binaries, millisecond pulsars, and blue
stragglers. Only N-body simulations let one study binary and multiple
systems properly, and GRAPE:TNG will allow one to do this with a
realistic number of particles.


-- Black Hole Binary in Galactic Nucleus

     Essentially, the same numerical arguments work for this problem as
for the globular cluster problem: force calculation dominated, small RAM
requirements, high accuracy necessary. Note that the computational cost
here scales as N^2, since the dynamical time scale is independent of N,
the number of stars; whereas in the globular cluster case the total cost
scales as N^3, since the relaxation time (and thus the overall evolution
time scale) scales linearly with N.
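
     The scaling argument can be made concrete by scaling from a reference
run, as in the sketch below. The reference numbers are purely illustrative,
and the log N correction to the relaxation time is included even though the
text above quotes the simpler N^3 scaling.

    import math

    def scaled_time(n, n0, t0, problem):
        # t0 = wall-clock time of a reference run with n0 particles
        if problem == "bh_binary":     # fixed number of crossing times: cost ~ N^2
            return t0 * (n / n0) ** 2
        if problem == "globular":      # run lasts ~ t_relax ~ (N / ln N) crossings
            return t0 * (n / n0) ** 3 * math.log(n0) / math.log(n)
        raise ValueError(problem)

    # e.g. if a 10^5-particle black-hole-binary run takes 10 days, the same
    # hardware needs ~1000 days at 10^6 particles; the globular cluster
    # scaling makes the same step in N roughly another factor of 10 worse.
    print(scaled_time(1e6, 1e5, 10.0, "bh_binary"))   # ~1.0e3 days
    print(scaled_time(1e6, 1e5, 10.0, "globular"))    # ~8.3e3 days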

     When two galaxies merge, their central black holes will spiral in
to close proximity, but it is unclear under what circumstances this
process will hang up or run till completion leading to a complete merger
into a single larger black hole.  In subsequent merging with other
galaxies or proto-galaxies, interesting three- or four-body interactions
can take place between the various single holes and binary holes present
in the centers of the merging galaxies. 

     A first scientific goal would be to determine the fate of the black
hole binary in the core. With GRAPE-4 we can handle 10^6 particles, which
translates to M_BH/M_field of about 10^3-10^4. A petaflops machine makes
it possible to handle 10^7-10^8 particles, or M_BH/M_field ~ 10^5. This is
close to the actual ratio of black hole mass to stellar mass in spiral
galaxies. For ellipticals, we hope to extrapolate these results to larger
mass ratios.


-- Planet Formation

     This topic was suggested as a good possible application, but the
room did not feel qualified to make detailed comments at the time.


-- Galaxy Simulations - Collisionless

     What we include here are isolated simulations of galaxies (e.g.,
following the development of spiral structure, interactions, and mergers).
Such simulations can be sped up by smart algorithms (Tree is the
predominant choice), but are still CPU limited because of the accuracy
and large number of timesteps required.

     Jun Makino has done studies of implementing a Tree code on the
GRAPE hardware. The bottleneck is that the tree construction, which must
be done on the host, takes around 5% of the CPU time (theta ~ 1). Hence,
any speed-up is limited to a factor of 20. Supposing this maximum
speed-up, one gets a factor of about 16 in particle number, without
allowing for an increase in the number of timesteps. It could be
feasible to design a chip to handle the tree construction and achieve
further speed-up. On the other hand, the speed-up is dependent on the
theta parameter and if one needs smaller theta, then the gain with GRAPE
increases a lot.
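
     The factor of 20 is an Amdahl-type bound, and the factor of 16 in
particle number follows from the N log N cost of a tree code. A minimal
sketch of both estimates, assuming the 5% host fraction quoted above:

    import math

    def amdahl_speedup(f_host, accel=float("inf")):
        # overall speed-up when the (1 - f_host) offloaded part runs
        # 'accel' times faster and the host part is unchanged
        return 1.0 / (f_host + (1.0 - f_host) / accel)

    def particle_gain(speedup, n0):
        # solve N log N = speedup * N0 log N0 by bisection (fixed wall-clock)
        target = speedup * n0 * math.log(n0)
        lo, hi = n0, speedup * n0
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            if mid * math.log(mid) < target:
                lo = mid
            else:
                hi = mid
        return lo / n0

    print(amdahl_speedup(0.05))              # -> 20.0
    print(particle_gain(20.0, n0=1e6))       # -> ~16.6x more particles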

     We note that almost all of the current simulations on these
problems include SPH to follow the gas component.  Could it be that
there are no more important problems to solve using gravity only? One
class of problem that does fit here is the structure of dark matter
halos of galaxies (i.e., a simulation where the galaxy is modelled at
high resolution while the cosmological tidal field is modelled at low
resolution). One can certainly accommodate as many particles as possible
in the pure gravity reference calculation, though some will argue that
such a calculation does not correspond to the real universe where we
know that baryons have a significant effect on small scales.  One can
also extend this idea to doing focused collapses of individual clusters
of galaxies to follow substructure dissolution within clusters.


-- Galaxy Simulations - Collisionless plus SPH

     Adding hydrodynamics to the mix enables significantly more problems to
be addressed. The time spent doing the SPH calculations on the front end
might decrease the expected speed-up. However, if one could get the SPH
calculations done in parallel to the GRAPE calculations, no decrease
would be seen. Here again, Tree is the algorithm of choice. Best case,
factor of 20 speed-up. Worst case, factor of 2 speed-up. A further worry
is the addition of extra physics (radiation, non-equilibrium species
fractions, metallicity tracking), which could add a large amount of work
to the front end and erase any significant speed-up from GRAPE.

     Science problems with higher resolution include providing mass
resolution that resolves globular cluster scales to examine their
possible formation in mergers, increasing the detail of modelling of
star formation and associated feedback, and allowing wider dynamic range
(for the high resolution region) in cosmological collapses of isolated
galaxies.


-- Cosmological Simulations - Collisionless

     Currently, the largest simulations here are RAM limited more than
CPU limited: the largest run possible is the largest run that can fit
into memory. We should specify here that we only consider high
resolution algorithms like Tree, P3M, and Adaptive PM. GRAPE offers
nothing to low resolution PM codes.

     Whether or not this RAM limitation will continue depends on several
factors. First, if one uses extra particles to simulate a larger volume
of space, then the number of timesteps remains constant and the CPU time
per particle grows only as log N. If one uses extra particles for higher
resolution within the same volume of space, then the number of timesteps
grows and the CPU time per particle increases more like N^1/3 log N
or N^1/2 log N.  Second, will the ratio of RAM in bytes to CPU
performance in flops remain near its present value of around 1? Buying
big computers is largely about purchasing RAM, which in recent years has
not dropped in cost per byte the way CPU performance has dropped in cost
per flop. If this trend is not temporary (in the past the two declined at
about the same rate), one might see the bytes/flops ratio drop and RAM
limitations increase. Third, although theoretical performance may
limitations increase. Third, although theoretical performance may
increase markedly, actual performance never keeps pace. Many flops are
wasted on speculative branch calculations and the inability to adapt
code to a specific pipeline architecture. Memory accesses (due to cache
misses) are probably the dominant CPU killer in all these big
simulations, and memory speed will not increase by orders of magnitude
any time soon. In sum, whichever factors one weights most highly will
determine whether RAM limitation will continue.

     The dominant problem in this category is being able to resolve
galactic size halos within a representative region of the universe. 
Calling this 5e9 h^-1 M_sun resolution within (200 h^-1 Mpc)^3, you need
5e8 particles. General purpose supercomputers should be able to handle
this by the year 2000. One marquee project would be simulating a volume
the size of the Sloan survey, about (600 h^-1 Mpc)^3. Similar resolution
requires 1.2e10 particles.
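
     For reference, the particle counts above follow from dividing the
mass in the box by the particle mass. A quick check, assuming an Omega = 1
universe so the mean density equals the critical density (about
2.78e11 h^2 M_sun/Mpc^3; the h factors cancel):

    RHO_CRIT = 2.78e11                      # h^2 M_sun / Mpc^3

    def n_particles(box_mpc_over_h, particle_mass_msun_over_h, omega_m=1.0):
        volume = box_mpc_over_h ** 3        # (h^-1 Mpc)^3
        total_mass = omega_m * RHO_CRIT * volume
        return total_mass / particle_mass_msun_over_h

    print("%.1e" % n_particles(200.0, 5e9))   # ~4.4e8, quoted above as 5e8
    print("%.1e" % n_particles(600.0, 5e9))   # ~1.2e10 for the Sloan volume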


-- Cosmological Simulations - Collisionless plus SPH

     Adding SPH increases both the RAM requirements (about a factor of
2) and the CPU requirements (about a factor of 4) compared to pure
collisionless for the same size simulation. Also, the problems one
attacks with hydro codes tend to reach toward kpc scales (versus 10's of
kpc for collisionless) and require many more timesteps. These
simulations are more CPU limited, but the SPH portion must be done
on the host.

     Carlos stated that for the Hydra code they have running on the T3D,
the SPH part takes about 30% of the calculation. Hence, he would only
expect a maximum factor of 3 speed-up. In standard P3M codes, the balance
is similar and only a factor of a few speed-up could be obtained.
However, GRAPE provides an efficient means of neighbor finding that can
be exploited. Instead of doing SPH calculations over a fixed volume of
neighbors (as is natural for basic P3M), one can perform calculations
over a fixed number of neighbors (as is natural for Tree codes). This
change will dramatically reduce the SPH load on the host. We do not have
good estimates of the expected speed-up, but an order of magnitude is
not unreasonable for highly clustered regions. The offsetting requirement
will be the number of timesteps required by the Courant condition. SPH
implementations with GRAPE will require study to see how much advantage
they offer.
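
     The difference between the two neighbor strategies is easy to see in
software. The sketch below uses a k-d tree as a stand-in for the neighbor
lists GRAPE can return; the particle distribution and the parameters
(search radius, neighbor count) are illustrative.

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(1)
    pos = rng.random((10000, 3))            # particles in a unit box
    tree = cKDTree(pos)

    # (a) fixed search volume, as is natural for basic P3M: the number of
    #     neighbors (and hence the host-side SPH work) tracks local density
    fixed_volume = tree.query_ball_point(pos[:100], r=0.05)
    print([len(nb) for nb in fixed_volume[:5]])

    # (b) fixed number of neighbors, as is natural for tree codes: each
    #     particle gets about the same SPH work regardless of local density
    dist, idx = tree.query(pos[:100], k=32)
    print(idx.shape)                        # (100, 32): constant work per particle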

     Science targets here are mainly related to galaxy formation and
convincing galaxy identification within cosmological simulations. These
simulations will provide the statistics needed to really test a model,
which isolated or focused simulations cannot supply. X-ray clusters (modulo
metallicity effects) can be adequately handled by grid based codes. Same
necessity of including star formation and associated feedback as
mentioned above, along with attendant worries about too much calculation
being done on front end.


---------------------------------------------------------------------------
                                  TABLE
---------------------------------------------------------------------------

The table lists the largest N in a simulation as a figure of merit, the
implicit assumption being that more resolution enables more science. An
important point is how many years advantage over general purpose
supercomputers we will gain and that is listed in the final column of
the table.  Numbers were derived from quick estimates and will need
considerable refinement before being used to justify a proposal. The
idea is to see what kind of advantage GRAPE:TNG can be expected to
deliver and which problems are best suited.



              PARTICLE NUMBER OF LARGEST SIMULATION FEASIBLE
                   AND GRAPE-TNG TIME ADVANTAGE


                   1995             1995       2000             2000        GRAPE-TNG
  Problem          General Purpose  GRAPE-4    General Purpose  GRAPE-TNG   Time
    Area           Computer                    Computer                     Advantage


  Globular
  Cluster          10^4             5x10^4     5x10^4           5x10^5      10 years
  Evolution

  Black Hole
  Binary in        10^5             10^6       10^6             3x10^7      10 years
  Galactic Nucleus

  Galaxy Sims
  & Interactions   10^6             2x10^6     3x10^7           10^8         3 years

  Galaxy Sims
  & Interactions   2x10^5           10^6       5x10^6           10^7         2 years
  with SPH


  Cosmology        3x10^7           3x10^7     5x10^8           5x10^8 +)    0 years


  Cosmology        4x10^6           3x10^6     10^8             3x10^8       3 years
  with SPH


+): note that the GRAPE-TNG can provide a speed-up of a factor 10 or 20
    with respect to the top-of-the-line general-purpose computer, even
    if the maximum particle number will be limited by the maximum size
    of RAM on the front end (as pessimistically assumed here).

Notes:

  First of all, these numbers are all rough estimates; one could argue
  about each particular number on the level of a factor 2 or 3.

  Second, for general-purpose computers, these numbers are based on what
  is available for the top-end astrophysics user, say someone involved in
  a grand-challenge-type program.

  Third, for the GRAPE-4, these numbers are based on what can be achieved
  by running for a month with the current configuration; a faster
  front-end would increase the maximum particle numbers significantly in
  the last four areas.

  Fourth, for the GRAPE-TNG, these numbers reflect a month-long run with
  only a modest fraction of a top-of-the-line supercomputer as a
  front-end; if a full such computer were used as a front-end, in the last
  four areas the simulations indicated here could be performed in just a
  few days.

---------------------------------------------------------------------------


REFERENCES
----------

- GRAPE-4 system web page:

   http://butterfly.c.u-tokyo.ac.jp:8080/pub/people/makino/grape4.html

- GRAPE:TNG - Makino & Taiji

   http://butterfly.c.u-tokyo.ac.jp:8080/pub/people/makino/papers/tngpro.ps

- GRAPE system design - Makino

   http://butterfly.c.u-tokyo.ac.jp:8080/pub/people/makino/papers/tradeoff.ps

- GRAPE 4 - Makino & Taiji (SC 95)

   http://butterfly.c.u-tokyo.ac.jp:8080/pub/people/makino/papers/sc95.ps

- GRAPE 4 - Makino et al. (SIAM PPSC 95)

   http://butterfly.c.u-tokyo.ac.jp:8080/pub/people/makino/papers/siampp95.ps

- GRAPESPH code - Steinmetz

     http://xxx.lanl.gov/abs/astro-ph/9504050/

- P3MG3A cosmology code - Brieu, Summers, & Ostriker

     ApJ, 453, 566 (1995)
     http://xxx.lanl.gov/abs/astro-ph/9411001/

- GRAPE 2A and Molecular Dynamics

     J Comp Chem 15, 1372 (1994) - Higo, Endo, & Makino

     Proteins 20, 139 (1994) - Ito, Fukushige, & Kitamura

- PASJ SPECIAL FEATURE ON GRAPE - Vol 45, No 3 (1993)

  - GRAPE Overview - Ebisuzaki et al

     PASJ 45, 269 (1993)

  - SPH and GRAPE 1 - Umemura et al.

     PASJ 45, 311 (1993)

  - GRAPE 3 - Okumura et al.

     PASJ 45, 329 (1993)

  - HARP - Makino, Kokubo, & Taiji

     PASJ 45, 349 (1993)

- Treecode and GRAPE - Makino

     PASJ 43, 621 (1991)

- GRAPE 2 - Ito et al.

     PASJ 43, 547 (1991)

- GRAPE 1 - Sugimoto et al.

     Nature 345, 33 (1990)


Last updated on 30-jan-96 by PJT.
teuben@astro.umd.edu