@ARTICLE\{IMM2001-03280, author = "B. S. Andersen and F. Gustavson and J. Wasniewski", title = "A Recursive Formulation of Cholesky Factorization of a Matrix in Packed Storage Format", year = "2001", month = "jun", keywords = "real symmetric matrices, complex Hermitian matrices, positive definite matrices, Choleski factorization and solution, recursive algorithms, novel packed data strucrures, linear systems of equations", pages = "214-244", journal = "{ACM} Transactions on Mathematical Software", volume = "27", editor = "John Reid", number = "2", publisher = "ACM", url = "http://www2.compute.dtu.dk/pubdb/pubs/3280-full.html", abstract = "A new compact way to store a symmetric or triangular matrix called {RPF} for Recursive Packed Format is fully described. Novel ways to transform {RPF} to and from standard packed format are included. A new algorithm, called {RPC} for Recursive Packed Cholesky that operates on the {RPF} format is presented. Algorithm {RPC} is based on level-3 {BLAS} and requires variants of algorithms \{\$\backslash\$bf {TRSM}\} and \{\$\backslash\$bf {SYRK}\} that work on {RPF}. We call these \{\$\backslash\$bf {RP}\$\backslash\$\_TRSM\} and \{\$\backslash\$bf {RP}\$\backslash\$\_SYRK\} and find that they do most of their work by calling \{\$\backslash\$bf {GEMM}\}. It follows that most of the execution time of {RPC} lies in \{\$\backslash\$bf {GEMM}\}. The advantage of this storage scheme compared to traditional packed and full storage is demonstrated. First, the {RPC} storage format uses the minimal amount of storage for the symmetric or triangular matrix. Second, {RPC} gives a level-3 implementation of Cholesky factorization whereas standard packed implementations are only level 2. Hence, the performance of our {RPC} implementation is decidedly superior. Third, unlike fixed block size algorithms, {RPC} requires no block size tuning parameter. We present performance measurements on several current architectures that demonstrate improvements over the traditional packed routines. Also {SMP} parallel computations on the {IBM} {SMP} computer are made. The graphs that are attached in the appendix of the paper, show that the {RPC} algorithms are superior by a factor between 1.6 and 7.4 for order around 1000, and between 1.9 and 10.3 for order around 3000 over the traditional packed algorithms. For some architectures, the {RPC} performance results are almost the same or even better than the traditional full-storage algorithm results." }