Wang, L (reprint author), Peking Univ, Kavli Inst Astron & Astrophys, Yiheyuan Lu 5, Beijing 100871, Peoples R China.
Accurate direct N-body simulations help to obtain detailed information about the dynamical evolution of star clusters. They also enable comparisons with analytical models and with Fokker-Planck or Monte Carlo methods. NBODY6 is a well-known direct N-body code for star clusters, and NBODY6++ is the extended version designed for simulations with large particle numbers on supercomputers. We present NBODY6++GPU, an optimized version of NBODY6++ with hybrid parallelization methods (MPI, GPU, OpenMP, and AVX/SSE) to accelerate large direct N-body simulations, and in particular to solve the million-body problem. We discuss the new features of the NBODY6++GPU code, its benchmarks, and the first results from a simulation of a realistic globular cluster initially containing a million particles. For million-body simulations, NBODY6++GPU running on 320 CPU cores and 32 NVIDIA K20X GPUs is 400-2000 times faster than NBODY6. With this hardware configuration, simulations of million-body globular clusters including 5 per cent primordial binaries require about an hour per half-mass crossing time.
National Natural Science Foundation of China [11073025, 11010237, 11050110414, 11173004]
Silk Road Project at National Astronomical Observatories of China (NAOC)
Chinese Academy of Sciences, at NAOC [2009S1-5]
'Qianren' special foreign experts programme of China, at NAOC
Volkswagen Foundation
Julich Supercomputing Center (JSC)
NASU under the Main Astronomical Observatory GRID/GPU computing cluster project
Peter and Patricia Gruber Foundation through the PPGF fellowship
Peking University One Hundred Talent Fund
John Templeton Foundation
DFG cluster of excellence 'Origin and Structure of the Universe'