@ARTICLE\{IMM2001-03280,
    author       = "B. S. Andersen and F. Gustavson and J. Wasniewski",
    title        = "A Recursive Formulation of Cholesky Factorization of a Matrix in Packed Storage Format",
    year         = "2001",
    month        = "jun",
    keywords     = "real symmetric matrices, complex Hermitian matrices, positive definite matrices, Choleski factorization and solution, recursive algorithms, novel packed data strucrures, linear systems of equations",
    pages        = "214-244",
    journal      = "{ACM} Transactions on Mathematical Software",
    volume       = "27",
    editor       = "John Reid",
    number       = "2",
    publisher    = "ACM",
    url          = "http://www2.compute.dtu.dk/pubdb/pubs/3280-full.html",
    abstract     = "A new compact way to store a symmetric or triangular matrix
called {RPF} for Recursive Packed Format is fully described.
Novel ways to transform {RPF} to and from standard packed format
are included. A new algorithm, called {RPC} for Recursive Packed Cholesky
that operates on the {RPF} format is presented. Algorithm {RPC} is based
on level-3 {BLAS} and requires variants of algorithms \{\$\backslash\$bf {TRSM}\} and
\{\$\backslash\$bf {SYRK}\} that work on {RPF}. We call these \{\$\backslash\$bf {RP}\$\backslash\$\_TRSM\} and
\{\$\backslash\$bf {RP}\$\backslash\$\_SYRK\} and find that they do
most of their work by calling \{\$\backslash\$bf {GEMM}\}. It follows that most of the execution
time of {RPC} lies in \{\$\backslash\$bf {GEMM}\}.
The advantage of this storage scheme compared to traditional packed
and full storage is demonstrated. First, the {RPC} storage format uses the
minimal amount of storage for the symmetric or triangular matrix.
Second, {RPC} gives a level-3 implementation of Cholesky factorization
whereas standard packed implementations are only level 2. Hence, the
performance of
our {RPC} implementation is decidedly superior. Third, unlike fixed block size
algorithms, {RPC} requires no block size tuning parameter.
We present performance measurements on several current architectures
that demonstrate improvements over the traditional packed
routines. Also {SMP} parallel computations on the {IBM} {SMP} computer are
made. The graphs that are attached in the appendix of the paper, show
that the {RPC} algorithms are superior by a factor between 1.6 and 7.4
for order around 1000, and between 1.9 and 10.3 for order around 3000 over the
traditional packed algorithms. For some architectures, the {RPC} 
performance results
are almost the same or even better than the traditional full-storage
algorithm results."
}