IMM publications 2011
- Arngren, M., Larsen, J., Hansen, P. W., Eriksen, B., Larsen, R., Supplementary material for Analysis of Pre-Germinated Barley using Hyperspectral Image Analysis, Department of Informatics and Mathematical Modelling, 2011 [full] [bibtex] [pdf]
This technical report is a supplement to the paper: Analysis of Pre-Germinated Barley using Hyperspectral Image Analysis. It includes a more detailed analysis of the dataset and general supplementary material.
- Arngren, M., Hyperspectral NIR Camera, Department of Informatics and Mathematical Modelling, 2011 [full] [bibtex] [pdf]
This technical report describes a hyperspectral camera from Headwall Photonics, Inc., in terms of performance and standard operating procedure.
- Gustavson, F. G., Wasniewski, J., Dongarra, J. J., Herrero, J. R., Langou, J., Level-3 Cholesky Factorization Routines as Part of Many Cholesky Algorithms, Technical University of Denmark, DTU Informatics, E-mail: firstname.lastname@example.org, 2011 [full] [bibtex] [pdf]
Some Linear Algebra Libraries use Level-2 routines during the factorization part of any Level-3 block factorization algorithm. We discuss four Level-3 routines called DPOTF3i, i = a,b,c,d, a new type of BLAS, for the factorization part of a block Cholesky factorization algorithm for use by LAPACK routine DPOTRF or for BPF (Blocked Packed Format) Cholesky factorization. The four routines DPOTF3i are Fortran routines. Our main result is that performance of routines DPOTF3i is still increasing when the performance of Level-2 routine DPOTF2 of LAPACK starts to decrease. This means that the performance of DGEMM, DSYRK, and DTRSM will increase due to their use of larger block sizes and also by making fewer passes over the matrix elements. We present corroborating performance results for DPOTF3i versus DPOTF2 on a variety of common platforms. The four DPOTF3i routines use simple register blocking; different platforms have different numbers of registers and so our four routines have different register blocking sizes. Blocked Packed Format (BPF) is introduced and discussed. LAPACK routines for POTRF and PPTRF using BPF instead of full and packed format are shown to be trivial modifications of LAPACK POTRF source codes. We call these codes BPTRF. There are two forms of BPF: we call them lower and upper BPF. Upper BPF is shown to be identical to Square Block Packed Format (SBPF). SBPF is used in “LAPACK” implementations on multi-core processors. Performance results for DBPTRF using upper BPF and DPOTRF for large n show that routines DPOTF3i do increase performance for large n. Lower BPF is almost always less efficient than upper BPF. A form of inplace transposition called vector inplace transposition can very efficiently convert lower BPF to upper BPF.
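The abstract's division of labor (a small unblocked factorization of each diagonal block, with DTRSM and DSYRK handling the panel and trailing update) can be sketched in plain Python. This is an illustrative right-looking blocked Cholesky, not the register-blocked DPOTF3i routines themselves, and the matrix is stored as an ordinary row-major list of lists rather than in BPF:

```python
import math

def blocked_cholesky(A, nb=2):
    """In-place lower Cholesky factorization A = L * L^T of a symmetric
    positive definite matrix stored as a list of lists; only the lower
    triangle of A is read, and it is overwritten with L."""
    n = len(A)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # 1. Unblocked factorization of the kb-by-kb diagonal block
        #    (the role played by DPOTF2, or by DPOTF3i in the report).
        for j in range(k, k + kb):
            d = A[j][j] - sum(A[j][p] * A[j][p] for p in range(k, j))
            A[j][j] = math.sqrt(d)
            for i in range(j + 1, k + kb):
                A[i][j] = (A[i][j]
                           - sum(A[i][p] * A[j][p] for p in range(k, j))) / A[j][j]
        # 2. Triangular solve for the panel below the block (DTRSM's role).
        for i in range(k + kb, n):
            for j in range(k, k + kb):
                A[i][j] = (A[i][j]
                           - sum(A[i][p] * A[j][p] for p in range(k, j))) / A[j][j]
        # 3. Symmetric rank-kb update of the trailing matrix (DSYRK's role).
        for i in range(k + kb, n):
            for j in range(k + kb, i + 1):
                A[i][j] -= sum(A[i][p] * A[j][p] for p in range(k, k + kb))
    return A
```

A larger block size nb shifts more of the arithmetic into steps 2 and 3, which is why a faster diagonal-block routine, as the report argues for DPOTF3i, lets DGEMM, DSYRK, and DTRSM run at larger and more efficient block sizes.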
- Jensen, B. S., Gallego, J. S., Larsen, J., A Predictive Model of Music Preference using Pairwise Comparisons - Supporting material and Dataset, Department of Informatics and Mathematical Modelling, 2011 [full] [bibtex] [zip]
- Larsen, P., Ladelsky, R., Lidman, J., McKee, S. A., Karlsson, S., Zaks, A., Automatic Loop Parallelization via Compiler Guided Refactoring, Technical University of Denmark, DTU Informatics, E-mail: email@example.com, 2011 [full] [bibtex] [pdf]
For many parallel applications, performance relies not on instruction-level parallelism, but on loop-level parallelism. Unfortunately, many modern applications are written in ways that obstruct automatic loop parallelization. Since we cannot identify sufficient parallelization opportunities for these codes in a static, off-line compiler, we developed an interactive compilation feedback system that guides the programmer in iteratively modifying application source, thereby improving the compiler’s ability to generate loop-parallel code. We use this compilation system to modify two sequential benchmarks, finding that the code parallelized in this way runs up to 8.3 times faster on an octo-core Intel Xeon 5570 system and up to 12.5 times faster on a quad-core IBM POWER6 system. Benchmark performance varies significantly between the systems. This suggests that semi-automatic parallelization should be combined with target-specific optimizations. Furthermore, comparing the first benchmark to hand-parallelized, hand-optimized pthreads and OpenMP versions, we find that code generated using our approach typically outperforms the pthreads code (within 93-339%). It also performs competitively against the OpenMP code (within 75-111%). The second benchmark outperforms hand-parallelized and optimized OpenMP code (within 109-242%).
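The kind of source-level refactoring the abstract describes can be illustrated with a toy example (the function names here are hypothetical, and the paper's tool targets compiled code via compiler feedback, not Python): a loop whose running reduction is fused with otherwise independent per-element work obstructs parallelization, while separating the map from the reduction makes the loop body trivially parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def transform(x):
    # Independent per-element work (hypothetical stand-in for a loop body).
    return x * x + 1

def process_obstructed(xs):
    # The running total is interleaved with the element-wise work, so the
    # loop carries a dependence and cannot be parallelized as written.
    out, total = [], 0
    for x in xs:
        y = transform(x)
        total += y
        out.append(y)
    return out, total

def process_refactored(xs):
    # Refactored: an independent map (parallelizable) followed by a
    # separate reduction over its results.
    with ThreadPoolExecutor() as pool:
        out = list(pool.map(transform, xs))
    return out, sum(out)
```

Both versions compute the same result; in Python the thread pool mainly illustrates the structure (the GIL limits speedup for pure-Python work), whereas a compiler given the refactored form can vectorize or parallelize the map directly.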