phy-4660/feedback/papers/0504097/si.tex

%auto-ignore


\paragraphfont{\small}
\subsectionfont{\normalsize}


\newcommand\be{\begin{equation}}
\newcommand\ee{\end{equation}}
\renewcommand{\vec}[1]{ {{\bf #1}} }

\renewcommand{\textfraction}{0}
\renewcommand{\floatpagefraction}{1.0}
\renewcommand{\topfraction}{1.0}
\renewcommand{\bottomfraction}{1.0}

\renewcommand\caption[1]{%
  \myCaption{#1}
}

\title{\vspace*{-1cm}{\Large Simulating the joint evolution of quasars, galaxies\\
    and their large-scale distribution} \vspace*{0.2cm} \\ {\em \large Supplementary Information}\vspace*{0.3cm}}

\author{\parbox{13.5cm}{\small\sffamily%
V.~Springel$^{1}$, %
S.~D.~M.~White$^{1}$, %
A.~Jenkins$^{2}$, %
C.~S.~Frenk$^{2}$, %
N.~Yoshida$^{3}$, %
L.~Gao$^{1}$, %
J.~Navarro$^{4}$, %
R.~Thacker$^{5}$, %
D.~Croton$^{1}$, %
J.~Helly$^{2}$, %
J.~A.~Peacock$^{6}$, %
S.~Cole$^{2}$, %
P.~Thomas$^{7}$, %
H.~Couchman$^{5}$, %
A.~Evrard$^{8}$, %
J.~Colberg$^{9}$ \& %
F.~Pearce$^{10}$}\vspace*{-0.5cm}}

\renewcommand\refname{\large References}

\date{}

\bibliographystyle{naturemag}


\baselineskip14pt % On-and-half-space the manuscript.
\setlength{\parskip}{4pt}
\setlength{\parindent}{18pt}%


\twocolumn

\setlength{\footskip}{25pt}
\setlength{\textheight}{670pt}
\setlength{\oddsidemargin}{-8pt}
\setlength{\topmargin}{-41pt}
\setlength{\headsep}{18pt}
\setlength{\textwidth}{469pt}
\setlength{\marginparwidth}{42pt}
\setlength{\marginparpush}{5pt}

\addtolength{\topmargin}{-0.6cm}


\maketitle


\renewcommand{\thefootnote}{\arabic{footnote}}
\footnotetext[1]{\footnotesize Max-Planck-Institute for Astrophysics, Karl-Schwarzschild-Str.~1, 85740 Garching, Germany}
\footnotetext[2]{\footnotesize Institute for Computational Cosmology, Dep. of
    Physics, Univ. of Durham, South Road, Durham  DH1 3LE, UK}
\footnotetext[3]{\footnotesize Department of Physics, Nagoya University, Chikusa-ku, Nagoya 464-8602, Japan}
\footnotetext[4]{\footnotesize Dep. of Physics \& Astron., University of Victoria, Victoria, BC, V8P 5C2, Canada}
\footnotetext[5]{\footnotesize Dep. of Physics \& Astron., McMaster Univ., 1280
  Main St. West, Hamilton, Ontario, L8S 4M1, Canada}
\footnotetext[6]{\footnotesize Institute of Astronomy, University of Edinburgh, Blackford Hill, Edinburgh EH9 3HJ, UK}
\footnotetext[7]{\footnotesize Dep. of Physics \& Astron., University of Sussex, Falmer, Brighton BN1 9QH, UK}
\footnotetext[8]{\footnotesize Dep. of Physics \& Astron., Univ. of Michigan, Ann Arbor, MI 48109-1120, USA}
\footnotetext[9]{\footnotesize Dep. of Physics \& Astron., Univ. of Pittsburgh, 3941 O'Hara Street, Pittsburgh PA 15260, USA}
\footnotetext[10]{\footnotesize Physics and Astronomy Department, Univ. of Nottingham, Nottingham NG7 2RD, UK}


{\bf\small This document provides supplementary information for the
  above article in Nature.  We detail the physical model used to
  compute the galaxy population, and give a short summary of our
  simulation method. Where appropriate, we give further references to
  relevant literature for our methodology. \vspace*{-0.3cm}}

\renewcommand{\thefootnote}{\fnsymbol{footnote}}

\small


\subsection*{Characteristics of the simulation\vspace*{-0.3cm}}


Numerical simulations are a primary theoretical tool to study the
nonlinear gravitational growth of structure in the Universe, and to
link the initial conditions of cold dark matter (CDM) cosmogonies to
observations of galaxies at the present day.  Without direct numerical
simulation, the hierarchical build-up of structure with its
three-dimensional dynamics would be largely inaccessible.

Since the dominant mass component, the dark matter, is assumed to
consist of weakly interacting elementary particles that interact only
gravitationally, such simulations use a set of discrete point
particles to represent the collisionless dark matter fluid. This
representation as an N-body system is obviously only a coarse
approximation, and improving its fidelity requires the use of as many
particles as possible while remaining computationally
tractable. Cosmological simulations have therefore always striven to
increase the size (and hence resolution) of N-body computations,
taking advantage of every advance in numerical algorithms and computer
hardware.  As a result, the size of simulations has grown continually
over the last four decades.  Fig.~\ref{FigNvsTime} shows the progress
since 1970.  The number of particles has increased exponentially,
doubling roughly every 16.5 months.  Interestingly, this growth
parallels the empirical `Moore's Law' used to describe the growth of
computer performance in general. Our new simulation discussed in this
paper uses an unprecedentedly large number of $2160^3$ particles, more
than $10^{10}$. We were able to finish this computation in 2004,
significantly ahead of a simple extrapolation of the past growth rate
of simulation sizes. The simulation represented a substantial
computational challenge that required novel approaches both for the
simulation itself, as well as for its analysis. We describe the most
important of these aspects in the following.  As an aside, we note
that extrapolating the remarkable progress since the 1970s for another
three decades, we may expect cosmological simulations with $\sim
10^{20}$ particles some time around 2035.  This would be sufficient to
represent all stars in a region as large as the Millennium volume with
individual particles.

\begin{figure*}
\begin{center}
\hspace*{-0.6cm}\resizebox{16cm}{!}{\includegraphics{si_fig1.eps}}
\end{center}
\caption{Particle number in high resolution N-body simulations of
  cosmic structure formation as a function of publication
  date\cite{Peebles1970,Miyoshi1975,White1976,Aarseth1979,Efstathiou1981,Davis1985,White1987,Carlberg1989,Suto1991,Warren1992,Gelb1994,Zurek1994,Jenkins1998,Governato1999,Bode2001,Colberg2000,Wambsganss2004}.
  Over the last three decades, the growth in simulation size has been
  exponential, doubling approximately every $\sim 16.5$ months (blue
  line).  Different symbols are used for different classes of
  computational algorithms. The particle-mesh (PM) method combined
  with direct particle-particle (PP) summation on sub-grid scales has
  long provided the primary path towards higher resolution. However,
  due to their large dynamic range and flexibility, tree algorithms
  have recently become competitive with these traditional ${\rm P^3M}$
  schemes, particularly if combined with PM methods to calculate the
  long-range forces. Plain PM
  simulations\cite{Klypin1983,White1983,Centrella1983,Park1990,Bertschinger1991}
  have not been included in this overview because of their much lower
  spatial resolution for a given particle number. Note also that we
  focus on the largest simulations at a given time, so our selection
  of simulations does not represent a complete account of past work on
  cosmological simulations.  \label{FigNvsTime}}
\end{figure*}


\begin{figure}
\begin{center}
\vspace*{-0.2cm}\resizebox{7.5cm}{!}{\includegraphics{si_fig2a.eps}}
\vspace*{-0.2cm}\resizebox{7.5cm}{!}{\includegraphics{si_fig2b.eps}}
\end{center}
\caption{Different realizations of the initial power spectrum. The top
  and bottom panels show measured power-spectra for 20 realizations of
  initial conditions with different random number seeds, together with
  the mean spectrum (red symbols). The latter lies close to the input
  linear power spectrum (black solid line). In the bottom panel, the
  measurements have been divided by a smooth CDM-only power
  spectrum\cite{Bardeen1986} to highlight the acoustic
  oscillations. One of the realizations has been drawn in blue; it
  shows a fluctuation pattern that superficially resembles the pattern
  around the second acoustic peak. However, this is a chance effect;
  the fluctuations of each bin are independent.
\label{FigRealizations}}
\end{figure}


\paragraph*{Initial conditions.}
We used the Boltzmann code {\small CMBFAST}\cite{Seljak1996} to
compute a linear theory power spectrum of a $\Lambda$CDM model with
cosmological parameters consistent with recent constraints from WMAP
and large-scale structure data\cite{Spergel2003,Seljak2004}. We then
constructed a random realization of the model in Fourier space,
sampling modes in a sphere up to the Nyquist frequency of our $2160^3$
particle load. Mode amplitudes $|\delta_{\vec k}|$ were determined by
random sampling from a Rayleigh distribution with second moment equal
to $P(k)=\left<|\delta_{\vec k}|^2\right>$, while phases were chosen
randomly. A high quality random number generator with period $\sim
10^{171}$ was used for this purpose.  We employed a massively parallel
complex-to-real Fourier transform (which requires some care to satisfy
all reality constraints) to directly obtain the resulting displacement
field in each dimension.  The initial displacement at a given particle
coordinate of the unperturbed density field was obtained by tri-linear
interpolation of the resulting displacement field, with the initial
velocity obtained from the Zel'dovich approximation. The latter is
very accurate for our starting redshift of $z=127$. For the initial
unperturbed density field of $2160^3$ particles we used a {\em
glass-like} particle distribution. Such a glass is formed when a
Poisson particle distribution in a periodic box is evolved with the
sign of gravity reversed until residual forces have dropped to
negligible levels\cite{White1996}.  For reasons of efficiency, we
replicated a $270^3$ glass file 8 times in each dimension to generate
the initial particle load.  The Fast Fourier Transforms (FFT) required
to compute the displacement fields were carried out on a $2560^3$ mesh
using 512 processors and a distributed-memory code. We deconvolved the
input power spectrum for smoothing effects due to the interpolation
off this grid.


We note that the initial random number seed was picked in an
unconstrained fashion. Due to the finite number of modes on large
scales and the Rayleigh-distribution of mode amplitudes, the mean
power of the actual realization in each bin is expected to scatter
around the linear input power spectrum. Also, while the expectation
value $\left<|\delta_{\vec k}|^2\right>$ is equal to the input power
spectrum, the median power per mode is biased low due to the
skew-negative distribution of the mode amplitudes. Hence, in a given
realization there are typically more points lying below the input
power spectrum than above it, an effect that quickly becomes
negligible as the number of independent modes in each bin becomes
large.  We illustrate this in the top panel of
Figure~\ref{FigRealizations}, where 20 realizations for different
random number seeds of the power spectrum on large scales are shown,
together with the average power in each bin. Our particular
realization for the Millennium Simulation corresponds to a slightly
unlucky choice of random number seed in the sense that the
fluctuations around the mean input power in the region of the second
peak seem to resemble the pattern of the acoustic oscillations (see
the top left panel of Figure~6 in our Nature article).  However, we
stress that the fluctuations in these bins are random and
uncorrelated, and that this impression is only a chance effect. In the
bottom panel of Figure~\ref{FigRealizations}, we redraw the measured
power spectra for the 20 random realizations, this time normalised to
a smooth CDM power spectrum without acoustic oscillations in order to
highlight the baryonic `wiggles'. We have drawn one of the 20
realizations in blue. It is one that resembles the pattern of
fluctuations seen in the Millennium realization quite closely while
others scatter quite differently, showing that such deviations are
consistent with the expected statistical distribution.


\paragraph*{Dynamical evolution.}

The evolution of the simulation particles under gravity in an
expanding background is governed by the Hamiltonian \be H= \sum_i
\frac{\vec{p}_i^2}{2\,m_i\, a(t)^2} + \frac{1}{2}\sum_{ij}\frac{m_i
m_j \,\varphi(\vec{x}_i-\vec{x}_j)}{a(t)}, \label{eqHamil} \ee where
$H=H(\vec{p}_1,\ldots,\vec{p}_N,\vec{x}_1,\ldots,\vec{x}_N, t)$.  The
$\vec{x}_i$ are comoving coordinate vectors, and the corresponding
canonical momenta are given by $\vec{p}_i=a^2 m_i \dot\vec{x}_i$.  The
explicit time dependence of the Hamiltonian arises from the evolution
$a(t)$ of the scale factor, which is given by the Friedman-Lemaitre
model that describes the background cosmology. Due to our assumption
of periodic boundary conditions for a cube of size $L^3$, the
interaction potential $\varphi(\vec{x})$ is the solution of \be
\nabla^2 \varphi(\vec{x}) = 4\pi G \left[ - \frac{1}{L^3} +
\sum_{\vec{n}} \delta_\epsilon(\vec{x}-\vec{n}L)\right], \label{eqpot}
\ee where the sum over $\vec{n}=(n_1, n_2, n_3)$ extends over all
integer triplets.  The density distribution function
$\delta_\epsilon(\vec{x})$ of a single particle is spread over a
finite scale $\epsilon$, the gravitational softening length.  The
softening is necessary to make it impossible for hard binaries to form
and to allow the integration of close particle encounters with
low-order integrators.  We use a spline kernel to soften the point
mass, given by $\delta_\epsilon(\vec{x}) = W(|\vec{x}|/2.8\epsilon)$,
where $W(r) = 8 (1- 6 r^2 + 6 r^3)/\pi$ for $0\le r<1/2$, $W(r) = {16}
(1- r)^3/\pi$ for $1/2 \le r<1$, and $W(r)=0$ otherwise.  For this
choice, the Newtonian potential of a point mass at zero lag in
non-periodic space is $-G\,m/\epsilon$, the same as for a
`Plummer-sphere' of size $\epsilon$, and the force becomes fully
Newtonian for separations larger than $2.8\epsilon$. We took
$\epsilon=5\,h^{-1}{\rm kpc}$, about $46.3$ times smaller than the
mean particle separation. Note that the mean density is subtracted in
equation (\ref{eqpot}), so the solution of the Poisson equation
corresponds to the {\em peculiar potential}, where the dynamics of the
system is governed by $\nabla^2 \phi(\vec{x}) = 4\pi G
[\rho(\vec{x})-\overline\rho]$.


The equations of motion corresponding to equation (\ref{eqHamil}) are
$\sim 10^{10}$ simple differential equations, which are however
coupled tightly by the mutual gravitational forces between the
particles. An accurate evaluation of these forces (the `right hand
side' of the equations) is computationally very expensive, even when
force errors up to $\sim 1\%$ can be tolerated, which is usually the
case in collisionless dynamics\cite{Hernquist1993}.  We have written a
completely new version of the cosmological simulation code {\small
GADGET}\cite{Springel2001} for this purpose.  Our principal
computational technique for the gravitational force calculation is a
variant of the `TreePM' method\cite{Xu1995,Bode2000,Bagla2002}, which
uses a hierarchical multipole expansion\cite{Barnes1986} (a `tree'
algorithm) to compute short-range gravitational forces and combines
this with a more traditional particle-mesh (PM)
method\cite{Hockney1981} to determine long-range gravitational forces.
This combination allows for a very large dynamic range and high
computational speed even in situations where the clustering becomes
strong. We use an explicit force-split\cite{Bagla2002} in
Fourier-space, which produces a highly isotropic force law and
negligible force errors at the force matching scale. The algorithms in
our code are specially designed for massively parallel operation and
contain explicit communication instructions such that the code can
work on computers with distributed physical memory, a prerequisite for
a simulation of the size and computational cost of the Millennium Run.


\begin{figure*}
\begin{center}
\resizebox{15cm}{!}{\includegraphics{si_fig3.eps}}
\end{center}
\caption{The power spectrum of the dark matter distribution in the
Millennium Simulation at various epochs (blue lines). The gray lines
show the power spectrum predicted for linear growth, while the dashed
line denotes the shot-noise limit expected if the simulation particles
are a Poisson sampling from a smooth underlying density field. In
practice, the sampling is significantly sub-Poisson at early times and
in low density regions, but approaches the Poisson limit in nonlinear
structures. Shot-noise subtraction allows us to probe the spectrum
slightly beyond the Poisson limit. Fluctuations around the linear
input spectrum on the largest scales are due to the small number of
modes sampled at these wavelengths and the Rayleigh distribution of
individual mode amplitudes assumed in setting up the initial
conditions.  To indicate the bin sizes and expected sample variance on
these large scales, we have included symbols and error bars in the
$z=0$ estimates. On smaller scales, the statistical error bars are
negligibly small.
\label{FigPowerSpec}}
\end{figure*}


For the tree-algorithm, we first decompose the simulation volume
spatially into compact {\em domains}, each served by one
processor. This domain decomposition is done by dividing a space
filling Peano-Hilbert curve into segments. This fractal curve visits
each cell of a fiducial grid of $1024^3$ cells overlayed over the
simulation exactly once. The decomposition tries to achieve a
work-load balance for each processor, and evolves over time as
clustering progresses. Using the Peano-Hilbert curve guarantees that
domain boundaries are always parallel to natural tree-node boundaries,
and thanks to its fractal nature provides for a small
surface-to-volume ratio for all domains, such that communication with
neighbouring processors during the short-range tree force computation
can be minimised. Our tree is fully threaded (i.e.~its leaves are
single particles), and implements an oct-tree structure with monopole
moments only. The cell-opening criterion was
relative\cite{Salmon1994}; a multipole approximation was accepted if
its conservatively estimated error was below $0.5\%$ of the total
force from the last timestep. In addition, nodes were always opened
when the particle under consideration lay inside a 10\% enlarged outer
node boundary. This procedure gives forces with typical errors well
below $0.1\%$.


For the PM algorithm, we use a parallel Fast Fourier Transform
(FFT)\footnote[1]{Based on the www.fftw.org libraries of MIT.} to
solve Poisson's equation.  We used a FFT mesh with $2560^3$ cells,
distributed into 512 slabs of dimension $5\times 2560\times 2560$ for
the parallel transforms.  After clouds-in-cells (CIC) mass assignment
to construct a density field, we invoke a real-to-complex transform to
convert to Fourier space. We then multiplied by the Greens function of
the Poisson equation, deconvolved for the effects of the CIC and the
trilinear interpolation that is needed later, and applied the
short-range filtering factor used in our TreePM formulation (the short
range forces suppressed here are exactly those supplied by the
tree-algorithm). Upon transforming back we obtained the gravitational
potential. We then applied a four-point finite differencing formula to
compute the gravitational force field for each of the three coordinate
directions.  Finally, the forces at each particle's coordinate were
obtained by trilinear interpolation from these fields.

A particular challenge arises due to the different data layouts needed
for the PM and tree algorithms. In order to keep the required
communication and memory overhead low, we do not swap the particle
data between the domain and slab decompositions.  Instead, the
particles stay in the domain decomposition needed by the tree, and
each processor constructs patches of the density field for all the
slabs on other processors which overlap its local domain.  In this
way, each processor communicates only with a small number of other
processors to establish the binned density field on the slabs.
Likewise, the slab-decomposed potential field is transfered back to
processors so that a local region is formed covering the local domain,
in addition to a few ghost cells around it, such that the finite
differencing of the potential can be carried out for all interior
points.


Timestepping was achieved with a symplectic leap-frog scheme based on
a split of the potential energy into a short-range and long-range
component. The short-range dynamics was then integrated by subcycling
the long-range step\cite{Duncan1998}. Hence, while the short-range
force had to be computed frequently, the long-range FFT force was
needed only comparatively infrequently. More than 11000 timesteps in
total were carried out for the simulation, using individual and
adaptive timesteps\footnote[2]{Allowing adaptive changes of timesteps
formally breaks the symplectic nature of our integration scheme, which
is however not a problem for the dynamics we follow here.} for the
particles. A timestep of a particle was restricted to be smaller than
$\Delta t = \sqrt{2 \eta \epsilon/|\vec{a}|}$, where $\vec{a}$ is a
particle's acceleration and $\eta=0.02$ controls the integration
accuracy. We used a binary hierarchy of timesteps to generate a
grouping of particles onto timebins.

The memory requirement of the code had to be aggressively optimised in
order to make the simulation possible on the IBM p690 supercomputer
available to us. The total aggregated memory on the 512 processors was
1 TB, of which about 950~GB could be used freely by an application
program. In our code {\small {\em Lean}-GADGET-2} produced for the
Millennium Simulation, we needed about 400~GB for particle storage and
300~GB for the fully threaded tree in the final clustered particle
state, while the PM algorithm consumed in total about 450~GB in the
final state (due to growing variations in the volume of domains as a
result of our work-load balancing strategy, the PM memory requirements
increase somewhat with time). Note that the memory for tree and PM
computations is not needed concurrently, and this made the simulation
feasible.  The peak memory consumption per processor reached 1850~MB
at the end of our simulation, rather close to the maximum possible of
1900~MB.


\begin{figure}
\begin{center}
\resizebox{8.5cm}{!}{\includegraphics{si_fig4.eps}}\vspace*{-0.3cm}
\end{center}
\caption{Measured distribution of mode amplitudes in the Millennium
  Simulation at redshift $z=4.9$. Only modes in the $k$-range
    $0.03\,h/{\rm Mpc} < k < 0.07\,h/{\rm Mpc}$ are included (in total
    341 modes), with their amplitude normalised to the square root of
    the expected linear power spectrum at that redshift. The
    distribution of modes follows the expected Rayleigh distribution
    very well. The bottom panel shows the relative deviations of the
    measurements from this distribution, which are in line with the
    expected statistical scatter.
\label{FigRayleigh}}
\end{figure}


\paragraph*{On the fly analysis.}

With a simulation of the size of the Millennium Run, any non-trivial
analysis step is demanding. For example, measuring the dark matter
mass power spectrum over the full dynamic range of the simulation
volume would require a 3D FFT with $\sim 10^5$ cells per dimension,
which is unfeasible at present.  In order to circumvent this problem,
we employed a two stage procedure for measuring the power spectrum
where a ``large-scale'' and a ``small-scale'' measurement were
combined. The former was computed with a Fourier transform of the
whole simulation box, while the latter was constructed by folding the
density field back onto itself\cite{Jenkins1998}, assuming periodicity
for a fraction of the box.  The self-folding procedure leads to a
sparser sampling of Fourier space on small scales, but since the
number of modes there is large, an accurate small-scale measurement is
still achieved.  Since the PM-step of the simulation code already
computes an FFT of the whole density field, we took advantage of this
and embedded a measurement of the power spectrum directly into the
code. The self-folded spectrum was computed for a 32 times smaller
periodic box-size, also using a $2560^3$ mesh, so that the power
spectrum measurement effectively corresponded to a $81920^3$ mesh. We
have carried out a measurement each time a simulation snapshot was
generated and saved on disk. In Figure~\ref{FigPowerSpec}, we show the
resulting time evolution of the {\it dark matter} power spectrum in
the Millennium Simulation.  On large scales and at early times, the
mode amplitudes grow linearly, roughly in proportion to the
cosmological expansion factor. Nonlinear evolution accelerates the
growth on small scales when the dimensionless power $\Delta^2(k)= k^3
P(k)/(2\pi^2)$ approaches unity; this regime can only be studied
accurately using numerical simulations. In the Millennium Simulation,
we are able to determine the nonlinear power spectrum over a larger
range of scales than was possible in earlier work\cite{Jenkins1998},
almost five orders of magnitude in wavenumber $k$.

On the largest scales, the periodic simulation volume encompasses only
a relatively small number of modes and, as a result of the Rayleigh
amplitude sampling that we used, these (linear) scales show
substantial random fluctuations around the mean expected power.  This
also explains why the mean power in the $k$-range $0.03\,h/{\rm Mpc} <
k < 0.07\,h/{\rm Mpc}$ lies below the linear input power.  In
Figure~\ref{FigRayleigh}, we show the actual distribution of
normalised mode amplitudes, $\sqrt{|\delta_\vec{k}|^2 / P(k)}$,
measured directly for this range of wavevectors in the Millennium
Simulation at redshift $z=4.9$. We see that the distribution of mode
amplitudes is perfectly consistent with the expected underlying
Rayleigh distribution.


Useful complementary information about the clustering of matter in
real space is provided by the two-point correlation function of dark
matter particles.  Measuring it involves, in principle, simply
counting the number of particle pairs found in spherical shells around
a random subset of all particles.  Naive approaches to determine these
counts involve an $N^2$-scaling of the operation count and are
prohibitive for our large simulation. We have therefore implemented
novel parallel methods to measure the two-point function accurately,
which we again embedded directly into the simulation code, generating
a measurement automatically at every output. Our primary approach to
speeding up the pair-count lies in using the hierarchical grouping
provided by the tree to search for particles around a randomly
selected particle. Since we use logarithmic radial bins for the pair
counts, the volume corresponding to bins at large radii is
substantial. We use the tree for finding neighbours with a
range-searching technique.  In carrying out the tree-walk, we check
whether a node falls fully within the volume corresponding to a
bin. In this case, we terminate the walk along this branch of the tree
and simply count all the particles represented by the node at once,
leading to a significant speed-up of the measurement.


Finally, the exceptionally large size of the simulation prompted us to
develop new methods for computing friends-of-friends (FOF) group
catalogues in parallel and on the fly.  The FOF groups are defined as
equivalence classes in which any pair of particles belongs to the same
group if their separation is less than 0.2 of the mean particle
separation. This criterion combines particles into groups with a mean
overdensity that corresponds approximately to the expected density of
virialised groups. Operationally, one can construct the groups by
starting from a situation in which each particle is first in its own
single group, and then testing all possible particle pairs; if a close
enough pair is found whose particles lie in different groups already
present, the groups are linked into a common group.  Our algorithm
represents groups as link-lists, with auxiliary pointers to a list's
head, tail, and length. In this way we can make sure that, when groups
are joined, the smaller of two groups is always attached to the tail
of the larger one. Since each element of the attached group must be
visited only once, this procedure avoids a quadratic contribution to
the operation count proportional to the group size when large groups
are built up.  Our parallel algorithm works by first determining the
FOF groups on local domains, again exploiting the tree for range
searching techniques, allowing us to find neighbouring particles
quickly.  Once this first step of group finding for each domain is
finished, we merge groups that are split by the domain decomposition
across two or several processors. As groups may in principle percolate
across several processors, special care is required in this step as
well. Finally, we save a group catalogue to disk at each output,
keeping only groups with at least 20 particles.


In summary, the simulation code evolved the particle set for more than
11000 timesteps, producing 64 output time slices each of about 300 GB.
Using parallel I/O techniques, each snapshot could be written to disk
in about 300 seconds.  Along with each particle snapshot, the
simulation code produced a FOF group catalogue, a power spectrum
measurement, and a two-point correlation function measurement.
Together, over $\sim20$~TB of data were generated by the
simulation. The raw particle data of each output was stored in a
special way (making use of a space-filling curve), which allows rapid
direct access to subvolumes of the particle data. The granularity of
these subvolumes corresponds to a fiducial $256^3$ mesh overlayed over
the simulation volume, such that the data can be accessed randomly in
pieces of $\sim 600$ particles on average.  This storage scheme is
important to allow efficient post-processing, which cannot make use of
an equally powerful supercomputer as the simulation itself.


\begin{figure*}
\begin{center}\resizebox{15cm}{!}{\includegraphics{si_fig5.eps}}
\end{center}
\caption{Schematic organisation of the merger tree in
  the Millennium Run. At each output time, FOF groups are identified
  which contain one or several (sub)halos. The merger tree connects
  these halos. The FOF groups play no direct role, except that the
  largest halo in a given FOF group is the one which may develop a
  cooling flow according to the physical model for galaxy formation
  implemented for the trees. To facilitate the latter, a number of
  pointers for each halo are defined. Each halo knows its descendant,
  and its most massive progenitor. Possible further progenitors can be
  retrieved by following the chain of `next progenitors'. In a similar
  fashion, all halos in a given FOF group are linked together.
\label{FigMergTree}}
\end{figure*}


\subsection*{Postprocessing of the simulation data\vspace*{-0.3cm}}


\paragraph*{Substructure analysis.}

High-resolution simulations like the present one exhibit a rich
substructure of gravitationally bound dark matter subhalos orbiting
within larger virialised structures\cite{Ghigna1998}. The FOF group
finder built into the simulation code is able to identify the latter,
but not the `subhalos'. In order to follow the fate of infalling halos
and galaxies more reliably, we therefore determine dark matter
substructures for all identified FOF halos.  We accomplish this with
an improved and extended version of the {\small SUBFIND}
algorithm\cite{Springel2001b}. This computes an adaptively smoothed
dark matter density field using a kernel-interpolation technique, and
then exploits the topological connectivity of excursion sets above a
density threshold to identify substructure candidates. Each
substructure candidate is subjected to a gravitational unbinding
procedure. If the remaining bound part has more than 20 particles, the
subhalo is kept for further analysis and some basic physical
properties (angular momentum, maximum of its rotation curve, velocity
dispersion, etc.) are determined. An identified subhalo was extracted
from the FOF halo, so that the remainder formed a featureless
`background' halo which was also subjected to an unbinding
procedure. The required computation of the gravitational potential for
the unbinding was carried out with a tree algorithm similar to the one
used in the simulation code itself.

Finally, we also compute a virial mass estimate for each FOF halo in
this analysis step, using the spherical-overdensity approach and the
minimum of the gravitational potential within the group as the central
point.  We identified $17.7\times 10^6$ FOF groups at $z=0$, down from
a maximum of $19.8\times 10^6$ at $z=1.4$, where the groups are more
abundant yet smaller on average.  At $z=0$, we found a total of
$18.2\times 10^6$ subhalos, and the largest FOF group contained 2328
of them.


\paragraph*{Merger tree definition and construction.}

Having determined all halos and subhalos at all output times, we
tracked these structures over time, i.e.~we determined the
hierarchical merging trees that describe in detail how structures
build up over cosmic time. These trees are the key information needed
to compute physical models for the properties of the associated galaxy
population.

Because structures merge hierarchically in CDM universes, a given halo
can have several progenitors but, in general, it has only one
descendant because the cores of virialised dark matter structures do
not split up into two or more objects. We therefore based our merger
tree construction on the determination of a unique descendant for any
given halo. This is, in fact, already sufficient to define the merger
tree construction, since the progenitor information then follows
implicitly.

To determine the appropriate descendant, we use the unique IDs that
label each particle and track them between outputs. For a given halo,
we find all halos in the subsequent output that contain some of its
particles.  We count these particles in a weighted fashion, giving
higher weight to particles that are more tightly bound in the halo
under consideration. In this way, we give preference to tracking the
fate of the inner parts of a structure, which may survive for a long
time upon infall into a bigger halo, even though much of the mass in
the outer parts can be quickly stripped. The weighting is facilitated
by the fact that the results of the {\small SUBFIND} analysis are
stored in order of increasing total binding energy, i.e.~the most
bound particle of a halo is always stored first.  Once these weighted
counts are determined for each potential descendant, we select the one
with the highest count as the descendant. As an additional refinement
(which is not important for any of our results), we have allowed some
small halos to skip one snapshot in finding a descendant. This deals
with cases where we would otherwise lose track of a structure that
temporarily fluctuates below our detection threshold.


In Figure~\ref{FigMergTree}, we show a schematic representation of the
merger tree constructed in this way. The FOF groups are represented at
different times with boxes each of which contains one or more
(sub)halos. For each halo, a unique descendant is known, and there are
link-list structures that allow the retrieval of all progenitors of a
halo, or of all other halos in the same FOF group. Not all the trees
in the simulation volume are connected with each other. Instead, there
are $14.4\times 10^6$ separate trees, each essentially describing the
formation history of the galaxies contained in a FOF halo at the
present time. The correspondence between trees and FOF halos is not
exactly one-to-one because some small FOF halos did not contain a
bound subhalo and were dropped, or because some FOF halos can be
occasionally linked by feeble temporary particle bridges which then
also combines their corresponding trees. We have stored the resulting
tree data on a per-tree basis, so that the physical model for galaxy
formation can be computed sequentially for all the trees individually,
instead of having to apply the model in one single step. The latter
would have been impossible, given that the trees contain a total of
around 800 million halos.


\subsection*{Physical model for galaxy formation\vspace*{-0.3cm}}

`Semi-analytic' models of galaxy formation were first proposed more
than a decade ago\cite{White1991}. They have proven to be a very
powerful tool for advancing the theory of galaxy
formation\cite{Kauffmann1993,Cole1994,Kauffmann1996,Kauffmann1997,Baugh1998,Sommerville1999,Cole2000,Benson2002},
even though much of the detailed physics of star formation and its
regulation by feedback processes has remained poorly understood. The
term `semi-analytic' conveys the notion that while in this approach
the physics is parameterised in terms of simple analytic models,
following the dark matter merger trees over time can only be carried
out numerically. Semi-analytic models are hence best viewed as
simplified simulations of the galaxy formation process.  While the
early work employed Monte-Carlo realizations of dark matter
trees\cite{Kauffmann1993tree,Somerville1999tree}, more recent work is
able to measure the merging trees directly from numerical dark matter
simulations\cite{Kauffmann1999}. In the most sophisticated version of
this technique, the approach is extended to include dark matter
substructure information as well\cite{Springel2001b}. This offers
substantially improved tracking of the orbits of infalling
substructure and of their lifetimes. In the Millennium Simulation, we
have advanced this method further still, using improved substructure
finding and tracking methods, allowing us fully to exploit the
superior statistics and information content offered by the underlying
high-resolution simulation.

Our semi-analytic model integrates a number of differential equations
for the time evolution of the galaxies that populate each hierarchical
merging tree.  In brief, these equations describe radiative cooling of
gas, star formation, the growth of supermassive black holes, feedback
processes by supernovae and AGN, and effects due to a reionising UV
background. Morphological transformation of galaxies and processes of
metal enrichment are modelled as well.  Full details of the scheme
used to produce specific models shown in our Nature article will be
provided in a forthcoming publication\cite{Croton2005}, but we here
include a very brief summary of the most important aspects of the
model. Note, however, that this is just one model among many that can
be implemented in post-processing on our stored Millennium Run
data-structures. A prime goal of our project is to evaluate such
schemes against each other and against the observational data in order
to understand which processes determine the various observational
properties of the galaxy population.


\paragraph*{Radiative cooling and star formation.}

We assume that each virialised dark matter halo contains (initially) a
baryonic fraction equal to the universal fraction of baryons,
$f_b=0.17$, which is consistent with WMAP and Big-Bang nucleosynthesis
constraints. A halo may lose some gas temporarily due to heating by a
UV background or other feedback processes, but this gas is assumed to
be reaccreted once the halo has grown in mass sufficiently. The
influence of the UV background is directly taken into account as a
reduction of the baryon fraction for small halos, following fitting
functions obtained from detailed hydrodynamical
models\cite{Gnedin2000,Kravtsov2004}.

We distinguish between cold condensed gas in the centre of halos
(forming the interstellar medium), and hot gas in the diffuse
atmospheres of halos. The latter has a temperature equal to the virial
temperature of the halo, and emits bremsstrahlung and line radiation.
The corresponding cooling rate is estimated following standard
parameterisations\cite{White1991,Springel2001b}, which have been shown
to provide accurate matches to direct hydrodynamical simulations of
halo formation including radiative
cooling\cite{Yoshida2002,Helly2003}. We note that, following the
procedures established already in reference\cite{White1991}, the
cooling model accounts for a distinction between a cold infall regime
and cooling out of a hot atmosphere. The transition is at a mass scale
close to that found in detailed analytic calculations of the cooling
process\cite{Forcado1997,Birnboim2003} and in recent hydrodynamical
simulations\cite{Keres2004}.

The cooling gas is assumed to settle into a disk supported by
rotation. We directly estimate the disk size based on the spin
parameter of the hosting dark matter halo\cite{Mo1998}. Once the gas
surface density exceeds a critical threshold motivated by
observations\cite{Kauffmann1996b}, we assume that star formation
proceeds in the disk, with an efficiency of order $\simeq 10\%$ on a disk
dynamical time. This parameterisation reproduces the phenomenological
laws of star formation in observed disk
galaxies\cite{Kennicutt1989,Kennicutt1998} and the observed gas
fractions at low redshift.

Supernova explosions associated with short-lived massive stars are
believed to regulate star formation in galaxies, particularly in small
systems with shallow potential wells\cite{Dekel1986}. Observations
suggest that supernovae blow gas out of star-forming disks, with a
rate that is roughly proportional to the total amount of stars
formed\cite{Martin1999}. We adopt this observational scaling, and
estimate how much of this gas can join the hot halo of the galaxy
given the total amount of energy released by the supernovae, and how
much may be blown out of the halo entirely. The efficiency of such
mass-loss is a strong function of the potential well depth of the
galaxy. In our model, small galaxies may blow away their remaining gas
entirely in an intense burst of star formation, while large galaxies
do not exhibit any outflows.


\paragraph*{Morphological evolution.}

We characterise galaxy morphology by a simple bulge-to-disk ratio
which can be transformed into an approximate Hubble type according
observational trends\cite{Simien1986}. While the generic mode of gas
cooling leads to disk formation, we consider two possible channels for
the formation of bulges: secular evolution due to disk instabilities,
or as a consequence of galaxy merger events.

Secular evolution can trigger bar and bulge formation in disk
galaxies. We invoke simple stability arguments for self-gravitating
stellar disks\cite{Mo1998} to determine the mass of stars that needs
to be put into a nuclear bulge component to render the stellar disk
stable.

Galaxy mergers are described by the halo merger tree constructed from
the simulation, augmented with a timescale for the final stages of a
merger whenever we lose track of a substructure due to finite spatial
and time resolution. We then estimate the remaining survival time in a
standard fashion based on the dynamical friction timescale. We use the
mass ratio of two merging galaxies to distinguish between two classes
of mergers. {\em Minor mergers} involve galaxies with mass ratio less
than $0.3$. In this case, we assume that the disk of the larger galaxy
survives, while the merging satellite becomes part of the bulge
component. For larger mass ratios, we assume a {\em major merger}
takes place, leading to destruction of both disks, and reassembly of
all stars in a common spheroid. Such an event is the channel through
which pure elliptical galaxies can form. The cold gas in the satellite
of a minor merger, or the cold gas in both galaxies of a major merger,
is assumed to be partially or fully consumed in a nuclear starburst in
which additional bulge stars are formed.  The detailed
parameterisation of such induced starbursts follows results obtained
from systematic parameter studies of hydrodynamical galaxy collision
simulations\cite{Mihos1994,Mihos1996,Cox2004}.


\paragraph*{Spectrophotometric modelling.}

To make direct contact with observational data, it is essential to
compute spectra and magnitudes for the model galaxies, in the
passbands commonly used in observations. Modern population synthesis
models allow an accurate prediction of spectrophotometric properties
of stellar populations as a function of age and
metallicity\cite{Bruzual1993,Bruzual2003}. We apply such a
model\cite{Bruzual2003} and compute magnitudes in a number of
passbands separately for both bulge and disk components and in both
rest- and observer-frames. Dust obscuration effects are difficult to
model in general and present a major source of uncertainty, especially
for simulated galaxies at high redshift. We apply a rather simple
plane-parallel slab dust model\cite{Kauffmann1999}, as a first-order
approximation to dust extinction.

Metal enrichment of the diffuse gas component can also be important,
because it affects both cooling rates in moderately sized halos and
galaxy colours through the population synthesis models.  Our treatment
of metal enrichment and transport is close to an earlier semi-analytic
model\cite{DeLucia2004}. In it, metals produced and released by
massive stars are placed first into the cold star forming gas, from
which they can be transported into the diffuse hot halo or into the
intergalactic medium by supernova feedback. We assume a homogenous
metallicity (i.e.~perfect mixing) within each of the gas components,
although the hot and cold gas components can have different
metallicities.


\paragraph*{Active galactic nuclei.}

Supermassive black holes are believed to reside at the centre of most,
if not all, spheroidal galaxies, and during their active phases they
power luminous quasars and active galactic nuclei.  There is
substantial observational evidence that suggests a connection between
the formation of galaxies and the build-up of supermassive black holes
(BH). In fact, the energy input provided by BHs may play an important
role in shaping the properties of
galaxies\cite{Silk1998,DiMatteo2005,Springel2005a}, and in reionising
the universe\cite{Haardt1996,Madau2004}.

Our theoretical model for galaxy and AGN formation extends an earlier
semi-analytic model for the joint build-up of the stellar and
supermassive black hole components\cite{Kauffmann2000}.  This adopts
the hypothesis that quasar phases are triggered by galaxy mergers.  In
these events, cold gas is tidally forced into the centre of a galaxy
where it can both fuel a nuclear starburst and be available for
central AGN accretion. We parameterise the efficiency of the feeding
process of the BHs as in the earlier work\cite{Kauffmann2000}, and
normalise it to reproduce the observed scaling relation between the
bulge mass and the BH mass at the present
epoch\cite{Magorrian1998,Ferrarese2000}.  This `quasar mode' of BH
evolution provides the dominant mass growth of the BH population, with
a total cumulative accretion rate that peaks at $z\simeq 3$, similar
to the observed population of quasars.

A new aspect of our model is the addition of a `radio mode' of BH
activity, motivated by the observational phenomenology of nuclear
activity in groups and clusters of galaxies. Here, accretion onto
nuclear supermassive BHs is accompanied by powerful relativistic jets
which can inflate large radio bubbles in clusters, and trigger sound
waves in the intracluster medium (ICM). The buoyant rise of the
bubbles\cite{Churazov2001,Brueggen2002} together with viscous
dissipation of the sound waves\cite{Fabian2003} is capable of
providing a large-scale heating of the ICM, thereby offsetting cooling
losses\cite{Vecchia2004}. These physical processes are arguably the
most likely explanation of the `cooling-flow puzzle': the observed
absence of the high mass dropout rate expected due to the observed
radiative cooling in clusters of galaxies. We parameterise the radio
mode as a mean heating rate into the hot gas proportional to the mass
of the black hole and to the $3/2$ power of the temperature of the hot
gas.  The prefactor is set by requiring a good match to the bright end
of the observed present-day luminosity function of galaxies. The
latter is affected strongly by the radio mode, which reduces the
supply of cold gas to massive central galaxies and thus shuts off
their star formation. Without the radio mode, central cluster galaxies
invariably become too bright and too blue due to excessive cooling
flows. The total BH accretion rate in this radio mode becomes
significant only at very low redshift, but it does not contribute
significantly to the cumulative BH mass density at the present epoch.


\bibliography{si}