@TECHREPORT{IMM2011-06035,
    author       = "F. G. Gustavson and J. Wasniewski and J. J. Dongarra and J. R. Herrero and J. Langou",
    title        = "Level-3 Cholesky Factorization Routines as Part of Many Cholesky Algorithms",
    year         = "2011",
    number       = "",
    series       = "IMM-Technical Report-2011-11",
    institution  = "Technical University of Denmark, {DTU} Informatics, {E-}mail: reception@imm.dtu.dk",
    address      = "Asmussens Alle, Building 305, {DK-}2800 Kgs. Lyngby, Denmark",
    type         = "",
    url          = "http://www.imm.dtu.dk/English.aspx",
    abstract     = "Some linear algebra libraries use Level-2 routines during the factorization part of any Level-3 block factorization algorithm. We discuss four Level-3 routines called DPOTF3i, i = a, b, c, d, a new type of {BLAS}, for the factorization part of a block Cholesky factorization algorithm for use by {LAPACK} routine {DPOTRF} or for {BPF} (Blocked Packed Format) Cholesky factorization. The four routines DPOTF3i are Fortran routines. Our main result is that performance of routines DPOTF3i is still increasing when the performance of Level-2 routine DPOTF2 of {LAPACK} starts to decrease. This means that the performance of {DGEMM}, {DSYRK}, and {DTRSM} will increase due to their use of larger block sizes and also by making fewer passes over the matrix elements. We present corroborating performance results for DPOTF3i versus DPOTF2 on a variety of common platforms. The four DPOTF3i routines use simple register blocking; different platforms have different numbers of registers and so our four routines have different register blocking sizes. Blocked Packed Format (BPF) is introduced and discussed. {LAPACK} routines for {POTRF} and {PPTRF} using {BPF} instead of full and packed format are shown to be trivial modifications of {LAPACK} {POTRF} source codes. We call these codes {BPTRF}. There are two forms of {BPF}: we call them lower and upper {BPF}. Upper {BPF} is shown to be identical to Square Block Packed Format (SBPF). {SBPF} is used in “LAPACK” implementations on multi-core processors. Performance results for {DBPTRF} using upper {BPF} and {DPOTRF} for large n show that routines DPOTF3i do increase performance for large n. Lower {BPF} is almost always less efficient than upper {BPF}. A form of in-place transposition called vector in-place transposition can very efficiently convert lower {BPF} to upper {BPF}."
}