@MISC{IMM2012-06367,
    author       = "P. E. Aackermann and P. J. D. Pedersen",
    title        = "Development of a {GPU-}accelerated {MIKE} 21 Solver for Water Wave Dynamics",
    year         = "2012",
    publisher    = "Technical University of Denmark, {DTU} Informatics, {E-}mail: reception@imm.dtu.dk",
    address      = "Asmussens Alle, Building 305, {DK-}2800 Kgs. Lyngby, Denmark",
    note         = "Supervised by Associate Professor Allan P. Engsig-Karup, apek@imm.dtu.dk, {DTU} Informatics",
    url          = "http://www.imm.dtu.dk/English.aspx",
    abstract     = "At the encouragement of the company {DHI}, the aim of this B.Sc. thesis is to investigate whether the simulation speed of {DHI}'s commercial product {MIKE} 21 {HD} can be accelerated by formulating a parallel solution scheme and implementing it for execution on a {CUDA-}enabled {GPU} (massively parallel hardware).
{MIKE} 21 {HD} is a simulation tool that models water wave dynamics in lakes, bays, coastal areas and seas by solving a set of hyperbolic partial differential equations called the shallow water equations. The solution scheme is the Alternating Direction Implicit (ADI) method, which produces a large number of tri-diagonal matrix systems that must be solved efficiently.
Two different parallel solution schemes are implemented. The first (S1) solves the tri-diagonal systems in parallel, using a single {CUDA} thread for each system. This approach uses the same solution algorithm as {MIKE} 21 {HD,} the Thomas algorithm. The other solution scheme (S2) adds more parallelism by using several threads to solve each system. To do this efficiently, several parallel solution algorithms are investigated. The focus has been on the Parallel Cyclic Reduction (PCR) algorithm and a hybrid algorithm of Cyclic Reduction (CR) and {PCR}.
We find that S2 is beneficial for small problems, while S1 yields better results for larger systems. We have obtained 42x and 80x speedups in double precision for S1 and S2, respectively, compared to a representative sequential C implementation of {MIKE} 21 {HD}. Furthermore, the impact of switching to single-precision calculations has been investigated. This resulted in 145x and 203x speedups for S1 and S2, respectively, at the cost of some loss of precision. All tests throughout the project were performed on an {NVIDIA} GeForce {GTX} 590 graphics card."
}