@MASTERSTHESIS{IMM2010-06503,
    author       = "M. G. Madsen",
    title        = "Acceleration of a non-linear water wave model using a {GPU}",
    year         = "2010",
    school       = "Technical University of Denmark, {DTU} Informatics, {E-}mail: reception@imm.dtu.dk",
    address      = "Asmussens Alle, Building 305, {DK-}2800 Kgs. Lyngby, Denmark",
    type         = "",
    note         = "{DTU} supervisors: Allan Peter Engsig-Karup, apek@imm.dtu.dk, Bernd Dammann and Jeppe Revall Frisvad, {DTU} Informatics",
    url          = "http://www.imm.dtu.dk/English.aspx",
    abstract     = "The primary objective of this work is to use a {GPU} (massively parallel hardware) to accelerate an existing optimized sequential algorithm that solves a potential flow problem. The potential flow problem poses an initial value problem on a {2D} surface, coupled with a {3D} Laplace problem. A low-storage Defect Correction method with a multigrid preconditioner is used to solve a flexible-order approximation of the Laplace problem. The widely used explicit {RK4} method is applied for time integration.
The primary reason for porting this particular solver is that both Defect Correction and the preconditioner are expected to be well suited for {GPU}s, provided that the right discretization is used. The work focuses on both the analysis and the implementation of the multigrid method, and on understanding how it should be configured in order to be an efficient preconditioner for the Defect Correction algorithm. Little attention is given to the standard four-stage Runge-Kutta method.
The most significant result of the work is that rethinking the memory layout both allows a significant increase in problem size and reduces the solution time, even for a naive {CUDA} implementation. In particular, the program developed can hold a Laplace problem of up to 100,000,000 degrees of freedom in {4GB} of {RAM}. For problems of this size, the iterative solution to the Laplace problem is improved by one decimal digit within a matter of seconds. This is up to 10 times faster than the existing {CPU} implementation. Although the target platform is the Compute Capability 1.3 Tesla architecture, it is also shown that moving the program to a Fermi architecture {GPU} accelerates the code even further, with a resulting speedup of up to 42 times over the existing {CPU} code. Remarkably, the speedup on the Fermi architecture is achieved with the naive implementation of the program."
}