************************************************************************
*                                                                      *
*                    Request for Comments                              *
*                                                                      *
*         Petaflops Astrophysical Particle Simulations                 *
*                on GRAPE: The Next Generation                         *
*                                                                      *
*      Announcement of Workshop on Design and Funding Ideas            *
*                                                                      *
************************************************************************

Highlights:
-----------

   --- GRAPE special purpose hardware breaks 1 teraflops barrier

   --- Next GRAPE project aims for 1 petaflops by year 2000

   --- 10^6 particle star cluster simulations achievable

   --- 10^10 particle cosmology simulations achievable

       --> 1000 particles/galaxy in a 300 h^-1 Mpc simulation

       --> 50 particles/galaxy in a Sloan survey volume

   --- Other N-body possibilities available for exploration

       --> planetesimal dynamics

       --> galactic dynamics

       --> particle hydrodynamics

   --- Ideas, opinions, and participation invited


Details:
--------

Comments are being solicited on design ideas and astrophysical
applications of the next generation GRAPE special purpose hardware for
calculating the gravitational N-body problem. We also announce a one day
workshop at the NCSA supercomputer center in Champaign-Urbana, Illinois
on December 11.  At this meeting we will explore design ideas and
funding avenues for U.S. participation in building this new machine and
utilizing its computational power.

The current GRAPE-4 machine is the world's fastest computer, reaching a
peak speed of over 1 teraflops (10^12 flops) at a price tag of only 2
million US dollars. The machine is based on special purpose hardware
with a primary application to star cluster dynamics.  General purpose
supercomputers should reach the teraflops benchmark in a year or so at a
price 10-20 times higher. By the year 2000, the GRAPE hardware should
achieve petaflops speeds (10^15 flops) at a cost of 10 million US dollars
using relatively straightforward improvements in design and fabrication.
Such a price/performance ratio is simply unprecedented. The designers
seek ideas, specifications, and advice on incorporating other N-body
applications into the new project, dubbed GRAPE: The Next Generation
(TNG). The description below discusses possible cosmological simulations
on GRAPE:TNG as an example, but researchers in any area of N-body
simulation that could be pursued with GRAPE:TNG are invited to join the
discussion.

Several GRAPE projects have worked to speed up N-body simulations by
shifting the direct summation force calculation from software to
hardware.  The GRAPE-1 board, completed in 1989, employs a limited
numerical precision calculation at a peak speed of 120 Mflops. Its
successor, GRAPE-3, uses a custom LSI to package 48 pipelines of 300
Mflops each (14.4 Gflops total). About twenty GRAPE-3 boards with 4 or 8
pipelines are in use at various institutions around the world.  The
GRAPE-2 machine uses commercial 32 and 64 bit LSI chips for a higher
precision (and higher cost) calculation. An enhanced version, GRAPE-2A,
reached a peak speed of 180 Mflops in 1992. The recently completed
successor, GRAPE-4, bundles together 1692 Hermite accelerator pipelines
(HARP chips) to achieve a peak speed of 1.1 Tflops. 

The GRAPE:TNG project is in the planning stages and aims for a
thousand-fold increase in speed in less than five years. By using a
finer fabrication line width than GRAPE-4, GRAPE:TNG can pack 20 times
more transistors per chip and utilize a faster effective clock speed. 
These improvements not only increase the number of pipelines, but also
reduce the time per calculation.  Current projections indicate that
petaflops speeds are within reach. This speed would enable globular
cluster calculations with 10^6 particles and cosmology calculations
with 10^10 particles. Researchers in other N-body applications (for
which we do not have the expertise) are encouraged to explore the
possibilities GRAPE:TNG will offer and participate in the project.

Traditionally, the GRAPE-2 and GRAPE-4 machines have been targeted for
stellar dynamics calculations, while cosmology, which can tolerate
reduced precision, has been done on the GRAPE-3. To broaden the support
and funding base for GRAPE:TNG, cosmological and other astrophysical
applications are being included in the design considerations. The cost
of the project is estimated to be about 10 million US dollars, with
somewhat over half of the funding being sought from sources in Japan and
the UK. The proposal being considered would seek several million dollars
to fund astrophysical applications of the GRAPE:TNG project covering
both hardware and software.  An affiliation with one of the national
supercomputing centers is also being sought as a symbiosis: the center
would be able to provide a supercomputer front end and high speed I/O
expertise to the project, while the GRAPE:TNG would provide a high
profile supercomputing project that pushes computational boundaries by
non-standard methods.  After achieving the demonstration petaflops
benchmark, a share of the machine proportional to investment would be
dedicated to the research in the proposal.


Feedback:
---------

A detailed consideration of the possibilities and prospects for doing
cosmology on GRAPE:TNG is given below. Researchers in the N-body field
and interested parties are asked to respond to the short questions
following this message. Please distribute this message to those we may
have missed or whose email addresses we did not readily find.

There will also be a one day workshop discussing the design and
applications of GRAPE:TNG as well as assembling the basics of a funding
proposal on December 11 at NCSA. We are also considering a short meeting
to be held at Supercomputing 95 in San Diego. Further information will
be sent when available and all are welcome to attend.

Further information will be sent only to those who request it. Send
responses to:


     Email:                summers@astro.princeton.edu

     Postal mail:          Frank Summers
                           Princeton University Observatory
                           Peyton Hall
                           Princeton, NJ 08544

     FAX:                  609-258-1020


Questions may be directed to the email address above or by phone at
609-258-3810 (Summers), 609-734-8075 (Hut), or 215-895-2723 (McMillan).


We will greatly appreciate honest opinions, both positive and negative,
so please be candid.  Thank you very much for your input.


       Frank Summers       summers@astro.princeton.edu
       Jun Makino          makino@kyohou.c.u-tokyo.ac.jp
       Piet Hut            piet@sns.ias.edu
       Steve McMillan      steve@zonker.drexel.edu
       Mike Norman         norman@ncsa.uiuc.edu


------------------------------------------------------------------------
                        GRAPE:TNG Questionnaire
------------------------------------------------------------------------

Name and email address:



1) General comments on the feasibility and usefulness of the project.



2) Specific comments on issues in our request for comments.



3) Other N-body applications that could benefit from GRAPE:TNG.



4) Other issues you think are relevant.



5) Are you interested in:

    --- receiving more information as the project progresses?


    --- attending the workshop on December 11 at NCSA?


    --- attending a short meeting at Supercomputing 95?


    --- giving a brief presentation at the December workshop?


    --- (USA researchers) becoming involved in a proposal collaboration?



------------------------------------------------------------------------
                    Cosmology Project Considerations
------------------------------------------------------------------------

The following questions are considered below:

   1) What are the projected design elements of GRAPE:TNG?

   2) What design elements of GRAPE:TNG are being considered to
        accommodate cosmology calculations?

   3) What cosmology codes can be run on GRAPE:TNG?

   4) Can hydrodynamics be incorporated into GRAPE:TNG codes?

   5) How large a simulation will GRAPE:TNG handle?

   6) Will the GRAPE:TNG simulations be faster and larger than those
        achievable by general purpose supercomputers?

   7) How will one handle the analysis of huge data sets?

   8) What are the primary scientific motivations?

   9) Where can I find more information?

***************

   1) What are the projected design elements of GRAPE:TNG?


The following is a very rough sketch of the petaflops machine:

  Clock speed: 100-200 MHz

  Floating point operations per pipeline: 60 (for Hermite scheme)

  Number of pipelines: 0.8-1.6 x 10^5 (total system)

  Number of processor chips: 1-2 x 10^4 (8 pipelines/chip)

  [Technology assumed: design rule: 0.25 um, die size: 20-25mm squared]

  Physical size: about 25 cabinets, 2 x 2 x 5 ft (depends on packaging)

For direct summation calculations, these figures enable:

  Speed: about 1 minute / step for 3 x 10^7 particles direct summation

  On board memory: 10^7-10^8 particles (could be 10^9 if one uses SDRAM,
     which, in 5 years, could be cheaper)
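
As a quick sanity check, the sketch below (Python) combines the rough
figures above into the quoted peak speed and time per step. Every
number is a projection taken from the list, not a measurement.

  # Combine the projected design figures (all values are the rough
  # estimates from the list above).
  clock_hz       = 150e6    # middle of the 100-200 MHz range
  flops_per_pipe = 60       # operations per pipeline (Hermite scheme)
  n_pipelines    = 1.2e5    # middle of the 0.8-1.6 x 10^5 range

  peak = clock_hz * flops_per_pipe * n_pipelines
  print("peak: %.1e flops" % peak)      # ~1.1e15, i.e. about 1 petaflops

  # Direct summation: N^2 interactions per step, ~60 flops each.
  n = 3e7
  print("step: %.0f s" % (n * n * flops_per_pipe / peak))  # ~50 s, ~1 min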

***************

   2) What design elements of GRAPE:TNG are being considered to
        accommodate cosmology calculations?

The basic requirement for cosmology calculations is that the board be
connected to a supercomputer class front end machine. Stellar dynamics
calculations will use 10^5-10^6 particles and can be accommodated on a
workstation front end. The current GRAPE-4 board uses a DEC TurboChannel
interface which limits the possible host computers. The interface choice
and design will incorporate the connectivity and I/O speed requirements
of cosmology calculations.

A second consideration is that cosmology calculations do not need full
32/64 bit precision during the whole calculation. Because cost scales
directly with the number of transistors, it is desirable to use limited
precision where possible. The trade-offs between cosmology and stellar
dynamics requirements, custom versus commercial LSI chips, and cost will
have to be balanced.

One further idea is to utilize a force look-up table. The P3MG3A code
(an adaptation of the P3M algorithm to the GRAPE-3A) requires a custom
force shape to
join the PM and PP parts of the force. Currently, the shape is
approximated by calling the GRAPE board 3 times, and scaling and summing
the results. If the GRAPE board had a user loadable look-up table for
the force, the PP calculation time would drop by a factor of 3. The
MD-GRAPE board (prototype recently completed) utilizes this look-up
table procedure, and we plan to adapt the P3MG3A code to MD-GRAPE as a
next step.
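
As a rough illustration of the look-up table idea (a Python sketch
only: the force shape below is a generic placeholder, not the actual
P3MG3A short-range correction), the table is built once and each
pairwise evaluation then becomes a single interpolation rather than
three separate board calls scaled and summed on the host:

  import numpy as np

  def short_range_shape(r, r_cut):
      # Placeholder shape: Newtonian 1/r^2 force tapered smoothly to
      # zero at the PP cutoff radius, where the PM force takes over.
      taper = np.clip(1.0 - (r / r_cut)**2, 0.0, 1.0)**2
      return taper / np.maximum(r, 1e-12)**2

  # Build the table once ("load it onto the board")...
  r_cut  = 1.0
  r_grid = np.linspace(1e-3, r_cut, 2048)
  f_grid = short_range_shape(r_grid, r_cut)

  # ...then each PP force evaluation is one table interpolation.
  def table_force(r):
      return np.interp(r, r_grid, f_grid)   # 0 beyond r_cut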

***************

   3) What cosmology codes can be run on GRAPE:TNG?

The current code under consideration is the P3MG3A code, which
implements the P3M algorithm on GRAPE. On a SPARC 20 workstation with
the GRAPE-3AF board, the code achieves 51% of the speed of a vectorized
code on a single Cray C90 processor. Parallelism has not yet been
exploited.  Because the PM and PP calculations are independent and
require similar amounts of CPU time, performing them in parallel is a
straightforward method for improving speed. Ferrell and Bertschinger
have described parallelization of the PM calculation which would allow
the PM part to keep pace with the GRAPE improvements to the PP part.

Tree codes and GRAPE have been explored, but further work is necessary.
Tree construction and traversal are significant calculations that must
be performed on the host. The advantage of GRAPE comes from shifting
calculation from the host to the hardware, and the load balance of a
treecode actually works against this. The speed increase from using a
GRAPE board is
estimated to be only a factor of 3-10 over a treecode using a general
purpose computer. Of course, one may be able to modify the treecode
implementation to better utilize the GRAPE hardware. For instance,
increasing the cell size of the lowest level cells so that they contain
more than one particle would reduce the cost of the tree construction as
well as shift more calculation to the hardware board.
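
A minimal sketch of the leaf-grouping idea follows (Python; the
function name and group size are illustrative, not a GRAPE interface).
The host traverses the tree once per group of particles to build a
shared interaction list, and the board then performs the direct
summation of every list member on every group member:

  import numpy as np

  def accel_direct(group_pos, src_pos, src_mass, eps=1e-2):
      # Direct summation of sources on a group: the part the GRAPE
      # hardware would perform.  Softening eps avoids singularities.
      dx = src_pos[None, :, :] - group_pos[:, None, :]   # (Ng, Ns, 3)
      r2 = (dx**2).sum(axis=2) + eps**2
      return (src_mass[None, :, None] * dx / r2[:, :, None]**1.5).sum(axis=1)

  # Per leaf group (say ~100 particles rather than 1): traverse the
  # tree once to collect the interaction list of particles and accepted
  # cells (traversal not shown), then ship the whole list to the board
  # in one call.  Larger groups mean fewer host-side traversals and
  # longer, more efficient board calls.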


***************

   4) Can hydrodynamics be incorporated into GRAPE:TNG codes?

Yes, but it will take some algorithm changes.

The P3MG3A code does not currently implement hydrodynamics. The SPH
algorithm can be easily added to the code and the basic implementation
has been coded and tested.  Other researchers have also implemented SPH
with the GRAPE board.

The reason that SPH has not been added to the P3MG3A code is that it
would place undue requirements on the host computer. In theory, SPH
calculations require only the 50 nearest neighbors to calculate
hydrodynamic forces. As such, if GRAPE finds the neighbors, the host
computer can handle the SPH calculations. In practice, one implements a
minimum smoothing length for the SPH calculations to keep the
hydrodynamic resolution on the same scale as the gravitational
resolution. With strong cooling to form galaxy-like objects, one can end
up with thousands of particles within a smoothing length (from a 64^3
particle calculation). The host computer cannot handle these large SPH
calculations. The unphysical effect of not implementing a minimum
smoothing length is that the high density gas continues to collapse to
meaningless scales and the pressure which should provide support on the
gravitational resolution scale is not present.

To reduce or remove the CPU load of gas clumps, there are two avenues to
explore. First, for galaxy formation simulations, one wants to convert
the gas into a collisionless stellar fluid. This procedure will remove
gas particles from the high density regions and convert them to
gravity-only calculations that can be handled by GRAPE. For simulations
where a
star formation scheme is not appropriate, one may try replacing gas
clumps by more massive single particles. This procedure will inevitably
introduce some errors and must be refined and tested for feasibility.
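
A minimal sketch of the first avenue (Python; the single density
threshold is a placeholder, as real star formation criteria involve
temperature, convergence, and other conditions):

  import numpy as np

  def convert_to_stars(density, is_gas, rho_threshold):
      # Flag gas particles in dense clumps as collisionless "stars" so
      # they drop out of the SPH loop and become gravity-only work.
      forming = is_gas & (density > rho_threshold)
      return is_gas & ~forming, forming    # remaining gas, new stars

  rho    = np.array([1.0, 50.0, 2000.0])
  is_gas = np.array([True, True, True])
  is_gas, new_stars = convert_to_stars(rho, is_gas, 1000.0)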

***************

   5) How large a simulation will GRAPE:TNG handle?

We estimate 10^9 - 10^10 particles for gravity calculations and
10^8-10^9 for gravity and hydro runs.

In terms of calculation speed, these targets are realizable. The gravity
calculations are dominated by the clusters of galaxies. Supposing that
we resolve 10^15 M_sun clusters with 10^9 M_sun particles, we get about
10^6 particles per cluster. The PP force will need to be calculated over
a scale less than a Mpc. To be conservative, consider 10^6 particle PP
calculations for 1000 clusters at each step. According to the design
estimates above, such calculations will take about a minute and a half
in total. Even adding a factor of 10 for I/O overhead and general
pessimism, the gravity calculation can handle a 300 h^-1 Mpc simulation
with 8.6e9 particles in about 15 minutes per step. For 6500 equal
timesteps (which
would be appropriate for this resolution), the run takes about 2 months.
Judicious choice of integration variable and/or individual particle
timesteps can reduce this figure to a manageable scale. Current
experience with hydro calculations indicates that their size will lag
the size of a gravity calculation by an order of magnitude. With an
efficient implementation, the same will hold for a GRAPE:TNG based code.
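
The arithmetic in this estimate can be reproduced directly from the
section 1) design figures (a Python sketch; these are projections, not
measurements):

  peak          = 1e15    # petaflops target
  flops_per_int = 60      # Hermite scheme, per pairwise interaction
  n_cluster     = 1e6     # particles in a rich cluster
  n_clusters    = 1000

  pp_time   = n_clusters * n_cluster**2 * flops_per_int / peak  # ~60 s
  step_time = 10 * pp_time      # factor of 10 for I/O and pessimism
  print("%.0f days" % (6500 * step_time / 86400))
  # ~45 days; using the more conservative 1.5 minute PP figure quoted
  # above gives about 2 months.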

In another sense, the largest simulation that can be done is limited by
the largest data set one can store on the computer.  The program size
for a 2048^3 particle, 2048^3 mesh P3MG3A simulation is 350 GB. Noting
that the latest big supercomputer order has 288 GB of memory (the 9000
node Intel P6 machine for Sandia National Labs), it does not seem
unreasonable to expect RAM of that size on other supercomputers in five
years. Larger RAM would permit a larger simulation, with adjustment
of load between host and GRAPE as needed.
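
One plausible accounting for the 350 GB figure (a sketch only; the
byte counts per particle and per mesh cell are our assumptions, and
the actual P3MG3A memory layout may differ):

  n_part = 2048**3    # ~8.6e9 particles
  n_mesh = 2048**3    # PM mesh cells

  total  = n_part * 6 * 4    # positions + velocities, single precision
  total += 2 * n_mesh * 4    # density and potential grids
  total += n_part * 4        # PP chaining (linked list) pointers
  print("%.0f GB" % (total / 1e9))   # ~310 GB; workspace brings the
                                     # total near the 350 GB quoted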

***************

   6) Will the GRAPE:TNG simulations be faster and larger than those
        achievable by general purpose supercomputers?

GRAPE:TNG simulations should be more than 10 times faster and larger.

The theoretical peak speed of GRAPE:TNG will be roughly 1000 times that
of general purpose computers of the same cost, which effectively
translates into a factor of 10-30 improvement in throughput. Even
without further GRAPE development, upgrading the front end will keep
GRAPE:TNG faster than general purpose computers for a fairly long time
after completion. Five years on, general purpose computer speeds will
have increased about 10 times, but GRAPE:TNG will still be 100 times
faster at peak, giving an overall speedup of 5-15 or so.

Section 5) estimates that a 10^10 particle simulation will be a real
possibility on GRAPE:TNG in the year 2000. General purpose supercomputer
simulations are now at the 10^7.5 particle level, and are increasing by a
factor of 8 every 3 years. Such a pace leads to 10^9 particles in the year
2000, and 10^10 particles by 2003.
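
For reference, extrapolating that growth rate (a factor of 8 every 3
years from 10^7.5 particles in 1995) in Python:

  def n_particles(year):
      return 10**7.5 * 8 ** ((year - 1995) / 3.0)

  for year in (2000, 2003):
      print(year, "%.0e" % n_particles(year))  # ~1e9 in 2000, ~8e9 in 2003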

Another consideration is the amount of CPU time available for a
calculation, whether it is a day, a week, or a month. When connected to
a workstation, GRAPE simulations have the very strong advantage of being
able to run 24 hours a day, with no sharing of resources, no proposals
to write, and no file transfers to perform. When connected to a
supercomputer, this advantage will diminish, although, as a targeted
project connected with a supercomputing center, a special allowance of
computer time may be expected.

The argument that simulation size will be RAM limited instead of CPU
limited would imply that GRAPE:TNG could offer no advantage in size.
There are a couple of reasons why we do not believe this argument will
hold in five years. First, an increase in particle number will be
accompanied by finer resolution scales that will require an increase in
the number of timesteps. Simulation algorithms that are well balanced in
CPU and RAM
limits today will shift to being CPU limited even if their CPU scaling
holds perfectly. Second, though the theoretical peak speed of computers
is increasing, a significant amount of the increase is based on such
elements as super-scalar processing and predictive branch calculations
which translate into real performance only when your code can be
structured to fit. Also, all are familiar with the performance hits of
cache misses and the 70 ns memory speed limitations. The point is that
recent trends have made it harder to convert theory into practice
for very large simulations (though one would not want to be pessimistic
about the ingenuity of silicon architects). Hence, we feel that CPU
constraints will increase in importance over time.

***************

   7) How will one handle the analysis of huge data sets?

By preparing in advance.

As the current Grand Challenge groups have demonstrated, efficient
analysis code for very large simulations must also be a priority. A
single output of positions and velocities for 10^10 particles is 240 GB.
Clearly, supercomputer power will be needed to sift through such data.
Building upon present work in the field, the GRAPE:TNG proposal will
include analysis software development. One specific way that the GRAPE
hardware can help is in finding neighbor lists for particles. Efficient
neighbor searches will greatly speed up group finding algorithms,
density estimates, and other local pairwise calculations.
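
As one illustration, a friends-of-friends group finder reduces to a
union-find pass over neighbor lists, so hardware-assisted neighbor
searches attack its bottleneck directly. The sketch below (Python) does
the neighbor search in software with a k-d tree; on GRAPE:TNG that
search would be the board's job. The library calls are illustrative,
not a GRAPE interface.

  import numpy as np
  from scipy.spatial import cKDTree

  def friends_of_friends(pos, linking_length):
      # Return a group id per particle.
      n = len(pos)
      parent = np.arange(n)

      def find(i):                    # union-find with path compression
          while parent[i] != i:
              parent[i] = parent[parent[i]]
              i = parent[i]
          return i

      tree = cKDTree(pos)
      for i, j in tree.query_pairs(linking_length):  # the neighbor
          ri, rj = find(i), find(j)                  # search: GRAPE's job
          if ri != rj:
              parent[ri] = rj
      return np.array([find(i) for i in range(n)])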

***************

   8) What are the primary scientific motivations?

Gravitational lensing, dissolution of substructure, better statistical
measures for large surveys, wider dynamic range for galaxy formation.

For gravity simulations, one of the best practical applications is to
predict observations of gravitational lensing in regions where the mass
is dominated by dark matter. These include the centers of clusters of
galaxies producing long arcs and multiple images of background galaxies,
and the weak shear observed at larger radii in clusters. In addition,
the superposition of density fluctuations from large-scale structures
and collapsed objects of a wide range of mass along a typical
line-of-sight in the field is expected to produce a pattern of
distortion which could be observationally measured.

The large N-body simulations would allow one to simulate gravitational
lensing along random lines of sight. In principle, the observational
determination of the number of gravitational lenses with very large
splitting angles provides an excellent probe of the number of the most
massive halos.  The main difficulties in using these observations to
constrain cosmological models of structure formation are the effects of
matter superposed with the clusters along the line of sight, the
deviations from spherical symmetry, uncertainties in the mass
function at very high masses, and the effects of substructure within
clusters.  Thus, in order to incorporate all these effects in a
calculation, one needs both the high resolution to correctly reproduce
the substructure in rich clusters, as well as a very large box to have a
large number of rich clusters and to be able to simulate long lines of
sight along which lensing is observed.  The much higher dynamic range
that would be achieved with the proposed GRAPE machine is needed to
reproduce the dark matter distribution correctly.

Another pure gravity problem considers the ability to trace galaxy halos
through clustering. Gravity calculations have traditionally found that
galaxy sized halos lose their identities when merged into a cluster
halo. However, recent work has questioned this tenet and pursues the
idea that increased resolution will allow the halos to remain distinct
during merging. The argument is that the halos will become more tightly
bound before the merger, thus increasing their ability to resist
numerical heating. If such tracing could be achieved, one could obtain
cosmological measures, such as the correlation function and velocity
dispersion of galaxies, using relatively straightforward gravitational
physics and avoiding the detailed modelling of hydrodynamics, radiative
processes, and star formation. Large simulations are required to allow the
galaxy scales to become tightly bound, yet have enough dynamic range to
sufficiently cover the formation of the largest clusters.

Along the same lines, redshift surveys of galaxies are reaching to very
large volumes and the simulations must provide adequate predictions of
the different cosmological models. To simulate a volume comparable to
the Sloan survey (an 800 h^-1 Mpc cube) with mass resolution of 10^10
h^-1 M_sun requires just over 10^10 particles. Data from surveys such as
the Sloan and the Two Degree Field will start flowing in a few years and
will require the very wide dynamic range simulations the GRAPE:TNG
machine will be capable of performing.
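
The particle count follows from the mean density (a Python check,
assuming Omega = 1 and the standard critical density):

  rho_crit = 2.775e11    # h^2 M_sun / Mpc^3
  box      = 800.0       # h^-1 Mpc
  m_part   = 1e10        # h^-1 M_sun

  # The h factors cancel: (h^2)(h^-3)/(h^-1) = 1.
  print("%.1e particles" % (rho_crit * box**3 / m_part))  # ~1.4e10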

Galaxy formation simulations with hydrodynamics have begun to create
objects with many characteristics of galaxies out of generalized
cosmological initial conditions. The resolution scales of such
simulations must be on the few kiloparsecs scale just to get the gross
properties and even smaller to probe internal characteristics. For the
statistics to cover a fair sample of the universe and the dynamics to
include the large scale tidal fields, one needs a box size of order 100
Mpc. The wide dynamic range offered by the GRAPE:TNG machine will be
required to provide a suitably sampled population to study aspects of
galaxy formation such as the morphology-density relationship.

***************

   9) Where can I find more information?

--- WWW page on the GRAPE-4 system:

    http://butterfly.c.u-tokyo.ac.jp:8080/pub/people/makino/grape4.html

--- Paper on the P3MG3A code:

    http://xxx.lanl.gov/abs/astro-ph/9411001/

--- Grand Challenge Cosmology Consortium:

    http://zeus.ncsa.uiuc.edu:8080/GC3_Home_Page.html

--- HPCC group at University of Washington:

    http://www-hpcc.astro.washington.edu/

    Of special note, see:

    http://www-hpcc.astro.washington.edu/siamhtml/siamhtml.html

--- The NASA HPCC Project at Los Alamos National Lab

    http://qso.lanl.gov/hpcc.html
