.. _bench:

Benchmarking
============

How fast does the N-body treecode run?
To what degree does optimization or vectorization help? When do
programs become I/O dominated? Some of the numbers quoted below should
be treated with great care, since many other factors can affect
a timing result.

A number of programs in NEMO have a command line parameter such as
``nmodel=N``, ``nbench=N`` or ``iter=N`` (with ``N`` normally set to 1).
Combined with ``help=c``, or with the command prefixed by ``/usr/bin/time``,
this gives an accurate measurement of how long
the code takes to execute ``N`` loops of a particular algorithm. For
some programs the man page discusses a particular benchmark.
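
As a rough illustration of the ``/usr/bin/time`` approach, the following
Python sketch runs a command and reports the CPU time (user plus system)
consumed by the child process. It is not part of NEMO: the helper name
``child_cpu_seconds`` is made up for this example, the ``resource`` module
is only available on Unix-like systems, and the command shown is just a
placeholder for whichever program is being benchmarked.

.. code-block:: python

   import resource
   import subprocess
   import sys


   def child_cpu_seconds(cmd):
       """Run cmd (a list of strings) and return its user+system CPU seconds."""
       before = resource.getrusage(resource.RUSAGE_CHILDREN)
       # Discard standard output, as in "... > /dev/null"
       subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
       after = resource.getrusage(resource.RUSAGE_CHILDREN)
       user = after.ru_utime - before.ru_utime
       system = after.ru_stime - before.ru_stime
       return user + system


   if __name__ == "__main__":
       # Placeholder command (assumption): substitute the program to benchmark
       placeholder = [sys.executable, "-c", "sum(range(10**6))"]
       print(f"{child_cpu_seconds(placeholder):.2f} CPU seconds")

Dividing the result by the number of loops requested then gives the cost
of a single loop.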

At the top level we have ``make bench5`` and ``make bench``, the latter
dynamically controlled by the scattered ``Benchfile`` files.


N-body integration
------------------

The standard NEMO benchmark of the treecode integration is to run
``hackcode1`` without any parameters.  It will generate a spherical
stellar system in virial equilibrium with 128 particles, and integrate
it for 64 timesteps (``tol=1 eps=0.05``).  In the table below the
amount of CPU time (in seconds) needed for **one** timestep is listed in
column 2. Unless otherwise mentioned, the code used is the standard
NEMO ``hackcode1`` with default compilation on the machine
quoted. Note that one can often obtain a significant performance
increase by studying the native compiler and in particular its
optimization options.

Modern machines are too fast to measure the 1986 example (where
a single step took around 5 sec), so we
integrate for longer and normalize to a single step. For example:

.. code-block:: bash

   /usr/bin/time hackcode1 out=. freq=100 tstop=1000  > /dev/null
   5.88user 0.04system 0:05.93elapsed 99%CPU

would compute to an entry in the table below of 0.000059 sec per step
(5.88 sec for the 100,000 steps implied by ``freq=100 tstop=1000``),
or around 100,000 times faster than the 1986 computers.
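
The normalization used in the example above can be sketched in Python as
follows. This is only a sketch: it assumes that ``freq`` is the number of
steps per time unit (so that the run takes ``freq * tstop`` steps), it uses
the user time reported by ``/usr/bin/time``, and the function name
``seconds_per_step`` is invented for this illustration.

.. code-block:: python

   def seconds_per_step(cpu_seconds, freq, tstop):
       """CPU seconds per integration step, assuming freq * tstop steps."""
       nsteps = freq * tstop
       return cpu_seconds / nsteps


   # Numbers from the hackcode1 example above
   per_step = seconds_per_step(5.88, freq=100, tstop=1000)

   # Compare with the 5.4 sec/step of the Sun-3/60 in the table below
   speedup = 5.4 / per_step

   print(f"{per_step:.6f} sec/step, about {speedup:,.0f}x the Sun-3/60 speed")

The unrounded ratio comes out a little under 100,000, consistent with the
rounded factor quoted in the text.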

Since the development machine (a Sun 3/60) ran at 20 MHz, and current (2022)
clock speeds are around 5 GHz, clock speed alone accounts for a factor of 250. But the code
runs another 400x faster on top of that. Part of that is the improved instruction cycle, and
part (probably the smaller part) is no doubt due to improved compiler technology.
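
The split between clock speed and everything else is simple arithmetic; a
small Python sketch makes it explicit. The function name ``split_speedup``
is hypothetical, and the total speedup is the rounded factor of 100,000
quoted above rather than a newly measured value.

.. code-block:: python

   def split_speedup(total_speedup, old_hz, new_hz):
       """Return (clock_ratio, residual) with total = clock_ratio * residual."""
       clock_ratio = new_hz / old_hz
       residual = total_speedup / clock_ratio
       return clock_ratio, residual


   # 20 MHz Sun 3/60 versus a roughly 5 GHz machine from 2022
   clock_ratio, residual = split_speedup(100_000, old_hz=20e6, new_hz=5e9)
   print(f"clock: {clock_ratio:.0f}x, everything else: {residual:.0f}x")

The residual factor lumps together architectural and compiler improvements;
the arithmetic alone cannot separate the two.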


.. list-table::    Treecode Benchmarks
   :header-rows: 1

   * - Machine
     - cpu-sec/step
     - code      
     - comments
   * - i9-12900K @ 5.2 GHz
     - 0.000059
     - hackcode1
     - 2022 desktop
   * - i5-1135G6 @ 4.2 GHz
     - 0.000089
     - hackcode1
     - 2020 laptop
   * - i7-8550U @ 4 GHz
     - 0.000178
     - hackcode1
     - 2018 laptop
   * - core 2 duo @ 2.0 GHz
     - 0.0012
     - hackcode1
     - 2007 laptop

   * - Sun Ultra-140
     - 0.012
     - hackcode1
     - ``-xO4 -xcg92 -dalign -xlibmil``
   * - G3 PowerPC 250 MHz
     - 0.026
     - hackcode1
     - ``-O``
   * - 486DX4-100
     - 0.068
     - hackcode1
     - (~1995 linux)
   * - Sun-4/60 Sparcstation 1
     - 0.420
     -
     -
   * - Sun-3/60
     - 5.400
     -
     - ``-fswitch`` (orig development)
   * - 3b1 (10 MHz 68010)
     - 49.000
     -
     -
   * - 386SX (16 MHz)
     - 87.000
     -
     - (linux) software floating point


The entries below come from the old LaTeX table; which of them make it
into the new table is still to be decided.

.. list-table::    Treecode Benchmarks (from the old LaTeX table)
   :header-rows: 1

   * - Machine
     - cpu-sec/step
     - code
     - comments
   * - i7-3630QM @ 3.4 GHz
     - 0.000177
     - hackcode1
     - 2014 laptop
   * - i7-870 @ 2.93 GHz
     - 0.00030
     - hackcode1
     - 2010 desktop
   * - Dec-alpha
     - 0.0042
     - hackcode1
     - ``-O4 -fast``
   * - Dec-alpha
     - 0.0048
     - hackcode1
     - default
   * - CRAY X/YMP48
     - 0.0060
     - TREECODE V3
     - estimate (1989)
   * - Onyx-2
     - 0.0088
     - hackcode1
     - default (1996)
   * - ETA-10
     - 0.010
     - TREECODE V2
     - estimate (1987)
   * - Sun 20/62
     - 0.013
     - hackcode1
     - default (1994)
   * - Cyber 205
     - 0.018
     - TREECODE V2
     - estimate (1986)
   * - Sun 20/61
     - 0.020
     - hackcode1
     -
   * - HP/UX 700
     - 0.020
     - hackcode1
     -
   * - Sun Ultra-140
     - 0.024
     - hackcode1
     - default
   * - Sun 20/??
     - 0.024
     - hackcode1
     - ``-xO4 -xcg92 -dalign -xlibmil``
   * - Sun 10/51
     - 0.029
     - hackcode1
     - ``-O -fast -fsingle``
   * - Cray-2
     - 0.029
     - TREECODE2
     - REAL - Pitt, Oct 91
   * - SGI ???
     - 0.030
     - hackcode1
     - John Wang's machine (commented out in the old table)
   * - DEC DS3000/400 alpha
     - 0.036
     - hackcode1
     - default compilation
   * - Pentium-100
     - 0.038
     - hackcode1
     - default
   * - SGI Indigo
     - 0.045
     - hackcode1
     - default compilation
   * - CRAY YMP
     - 0.059
     - hackcode1
     - default compilation
   * - Sparc-10
     - 0.063
     - hackcode1
     - using ``acc -cg92`` (bootes)
   * - 486DX2-66 (linux)
     - 0.093
     - hackcode1
     - ``-DSINGLEPREC``
   * - Sparc-2
     - 0.099
     - gravsim V1
     -
   * - IBM R/6000
     - 0.109
     - hackcode1
     - default cc compiler
   * - Dec 5000/200
     - 0.116
     - hackcode1
     -
   * - Sparc-2
     - 0.130
     - hackcode1
     - ``-DSINGLEPREC -fsingle``
   * - Sparc-2
     - 0.180
     - hackcode1
     -
   * - Multiflow 14/300
     - 0.190
     - hackcode1
     -
   * - Convex C220
     - 0.290
     -
     -
   * - NeXT
     - 0.240
     -
     - [ganymede 68040, Nov 91]
   * - Sparcstation1+
     - 0.340
     -
     -
   * - Alliant FX??
     - 0.430
     - gravsim V1
     -
   * - Alliant FX4/w 3 proc's
     - 0.590
     -
     -
   * - VAX workstation 3500
     - 0.970
     -
     -
   * - Sun-4/60 Sparcstation 1
     - 1.040
     - treecode2
     - cf. C-code @ 0.420
   * - Sun-3/110
     - 1.660
     - hackcode1
     - fpa.il


Nbody0
~~~~~~

The program is Aarseth's simplest
N-body code (contained in Binney and Tremaine, 1987; no regularization or nearest neighbors).
The input is
a Hubble-expanding Cartesian lattice with 925 points, GMtot=1 and expansion
factor = 6 (omega = 1.2).  The long version was followed for 60 time units,
the short version for 5. Results are summarized in the table below. The first
table was compiled by D. Richstone.

It seems the input data have been lost.


.. list-table::    N-Body0 Benchmarks
   :header-rows: 1

   * - Machine
     - time1 (sec)
     - time2 (sec)
     - speed
   * - Sun-4/110 (Pele - 8Mb)
     - 21,753
     -
     - 0.41
   * - Vaxstation 3100 (Miffy - M48, 24Meg)
     -
     - 1,302
     - 0.65
   * - Sparc 1
     -
     - 1,023
     - 0.83
   * - Sparc IPC (Courage - 16 Mb)
     - 9,015
     - 850
     - 1.000
   * - Sparc 2
     - 4,483
     -
     - 2.01
   * - Sparc 2'
     -
     - 417
     - 2.04
   * - Dec 5000/200
     -
     - 318
     - 2.67
   * - Stardent (ism)
     -
     - 211
     - 4.03
   * - IBM Risc (Juno)
     - 2,117
     - 198
     - 4.27
   * - IBM Risc (wibm01)
     - 2,115
     -
     - 4.26
   * - Convex
     -
     - 172
     - 4.94
   * - HP/UX 700
     -
     - 26.2
     -
   * - Cray YMP
     -
     - 19.1
     - 44.5
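
The speed column appears to be the Sparc IPC time divided by the time of
the machine in question, using whichever timing column is filled in. That
reading is inferred from the numbers and is not stated in the original
table. A minimal Python check under that assumption (the names
``SPARC_IPC`` and ``relative_speed`` are made up for this sketch):

.. code-block:: python

   # Reference timings (seconds) of the Sparc IPC, which has speed 1.000
   SPARC_IPC = {"time1": 9015.0, "time2": 850.0}


   def relative_speed(seconds, column):
       """Speed relative to the Sparc IPC for a timing in the given column."""
       return SPARC_IPC[column] / seconds


   cray_ymp = relative_speed(19.1, "time2")    # table lists 44.5
   sparc_2 = relative_speed(4483.0, "time1")   # table lists 2.01

   print(f"Cray YMP: {cray_ymp:.1f}  Sparc 2: {sparc_2:.2f}")

Both values agree with the tabulated speeds to the precision given there.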


Orbit integration
-----------------

The benchmark consists of taking 100,000 leapfrog steps. For 2D optimized
potentials the timing on
a Sparcstation 1 is about 12 seconds for ``log`` or ``plummer``, and
23 seconds for ``teusan85`` in the core region (orbit remaining within
the body of the bar).  See also ``make bench5``, where one of the
benchmarks computes an orbit. There we take about 80M steps in
5 seconds, i.e. about 200M steps in the time a Sparcstation 1 needs for 100,000,
which is about 2000x faster, or about 20,000x faster than a Sun 3/60.
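
For readers unfamiliar with the leapfrog scheme, here is a minimal Python
sketch of a 2D orbit in a Plummer potential. It is not NEMO's orbit
integrator: the units (G = M = 1), the scale length ``a = 1``, the initial
conditions and all function names are assumptions chosen for illustration.

.. code-block:: python

   import math


   def plummer_accel(x, y, a=1.0):
       """Acceleration for phi = -1/sqrt(r^2 + a^2), with G = M = 1."""
       r2 = x * x + y * y + a * a
       f = -1.0 / (r2 * math.sqrt(r2))
       return f * x, f * y


   def energy(x, y, vx, vy, a=1.0):
       """Specific energy: kinetic plus Plummer potential."""
       return 0.5 * (vx * vx + vy * vy) - 1.0 / math.sqrt(x * x + y * y + a * a)


   def leapfrog(x, y, vx, vy, dt, nsteps):
       """Advance the orbit by nsteps kick-drift-kick leapfrog steps."""
       ax, ay = plummer_accel(x, y)
       for _ in range(nsteps):
           vx += 0.5 * dt * ax          # half kick
           vy += 0.5 * dt * ay
           x += dt * vx                 # full drift
           y += dt * vy
           ax, ay = plummer_accel(x, y)
           vx += 0.5 * dt * ax          # second half kick
           vy += 0.5 * dt * ay
       return x, y, vx, vy


   start = (1.0, 0.0, 0.0, 0.5)
   end = leapfrog(*start, dt=0.01, nsteps=1000)
   print("energy drift:", abs(energy(*end) - energy(*start)))

Timing such a loop for a fixed number of steps, and dividing by the step
count, gives the kind of steps-per-second figure quoted above, although a
pure Python loop will be far slower than the compiled code.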