Conference Papers

  1. [DSN'21] Amir Taherin, Tirthak Patel, Devesh Tiwari, Giorgis Georgakoudis, Ignacio Laguna. Analyzing System Failures on Two Generations of Supercomputers: Lessons and Opportunities. The IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), virtual event, 2021.

  2. [CCGrid'21] Konstantinos Parasyris, Giorgis Georgakoudis, Leonardo Bautista-Gomez, Ignacio Laguna. Co-Designing Multi-Level Checkpoint Restart for MPI Applications. The 21st IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, May 10-13, 2021, Melbourne, Victoria, Australia.

  3. [SC'20] Bradley Swain, Yanze Li, Peiming Liu, Ignacio Laguna, Giorgis Georgakoudis, Jeff Huang. OMPRacer: Fast, Precise, and Scalable Static Race Detection for OpenMP Programs. ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Virtual, Nov 16-19, 2020.

  4. [SC'20] Hui Guo, Ignacio Laguna, Cindy Rubio-González. pLiner: Isolating Lines of Floating-Point Code for Compiler Induced Variability. ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Virtual, Nov 16-19, 2020.

  5. [IISWC'20] Konstantinos Parasyris, Ignacio Laguna, Harshitha Menon, Markus Schordan, Daniel Osei-Kuffuor, Giorgis Georgakoudis, Mike Lam, Tristan Vanderbruggen. HPC-MixPBench: An HPC Benchmark Suite for Mixed Precision Analysis. 2020 IEEE International Symposium on Workload Characterization.

  6. [IISWC'20] Luanzheng Guo, Giorgis Georgakoudis, Konstantinos Parasyris, Ignacio Laguna, Dong Li. MATCH: An MPI Fault Tolerance Benchmark Suite. 2020 IEEE International Symposium on Workload Characterization.

  7. [ISC'20] Giorgis Georgakoudis, Luanzheng Guo, Ignacio Laguna. Reinit++: Evaluating the Performance of Global-Restart Recovery Methods For MPI Fault Tolerance. ISC High Performance, Frankfurt, Germany, Jun 22-24, 2020.

  8. [IPDPS'20] Ignacio Laguna. Varity: Quantifying Floating-Point Variations in HPC Systems Through Randomized Testing. 34th IEEE International Parallel & Distributed Processing Symposium (IPDPS), New Orleans, May 18-22, 2020.

  9. [PPoPP'20] Daniel DeFreez, Antara Bhowmick, Ignacio Laguna, Cindy Rubio-González. Detecting and Reproducing Error-Code Propagation Bugs in MPI Implementations. ACM Principles and Practice of Parallel Programming (PPoPP), San Diego, Feb 22-26, 2020.

  10. [SC'19] Ignacio Laguna, Ryan Marshall, Kathryn Mohror, Martin Rüfenacht, Anthony Skjellum, Nawrin Sultana. A Large-Scale Study of MPI Usage in Open-source HPC Applications. ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Denver, Colorado, Nov 17-19, 2019.

  11. [ISC'19] Ignacio Laguna, Paul C. Wood, Ranvijay Singh, Saurabh Bagchi. GPUMixer: Performance-Driven Floating-Point Tuning for GPU Scientific Applications. ISC High Performance, Frankfurt, Germany, Jun 16-20, 2019 (Best Paper Award).

  12. [ICS'19] Pradeep Kotipalli, Ranvijay Singh, Paul Wood, Ignacio Laguna, and Saurabh Bagchi. AMPT-GA: Automatic Mixed Precision Floating Point Tuning for GPU Applications. 33rd ACM International Conference on Supercomputing (ICS), pp. 1-11, Jun 26-28, Phoenix, AZ, 2019.

  13. [HPDC'19] Michael Bentley, Ian Briggs, Ganesh Gopalakrishnan, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Holger E. Jones. Multi-level Analysis of Compiler-Induced Variability and Performance Tradeoffs. The 28th International Symposium on High-Performance Parallel and Distributed Computing, Phoenix, Arizona, USA, June 24-28, 2019.

  14. [IPDPS'19] Giorgis Georgakoudis, Ignacio Laguna, Hans Vandierendonck, Dimitrios S. Nikolopoulos, Martin Schulz. SAFIRE: Scalable and Accurate Fault Injection For Parallel Multithreaded Applications. 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS), Rio de Janeiro, Brazil, May 20-24, 2019.

  15. [ASE'19] Ignacio Laguna. FPChecker: Detecting Floating-Point Exceptions in GPU Applications. 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), San Diego, CA, 2019.

  16. [CARLA'19] Anthony Skjellum, Martin Rüfenacht, Nawrin Sultana, Derek Schafer, Ignacio Laguna, Kathryn Mohror. ExaMPI: A Modern Design and Implementation to Accelerate Message Passing Interface Innovation. 6th Latin American Conference on High Performance Computing (CARLA), Costa Rica, Sep 25–27, 2019.

  17. [SC'18] Luanzheng Guo, Dong Li, Ignacio Laguna, Martin Schulz. FlipTracker: Understanding Natural Error Resilience in HPC Applications. ACM/IEEE Conference for High Performance Computing, Networking, Storage and Analysis (SC), Dallas, TX, 2018.

  18. [EuroMPI'18] Nawrin Sultana, Anthony Skjellum, Ignacio Laguna, Matthew Shane Farmer, Kathryn Mohror and Murali Emani. MPI Stages: Checkpointing MPI State for Bulk Synchronous Applications. In Proceedings of the 25th European MPI Users Group Meeting (EuroMPI), Barcelona, Spain, Sep. 23-26, 2018.

  19. [IPDPS'18] Simone Atzeni, Ganesh Gopalakrishnan, Zvonimir Rakamaric, Ignacio Laguna, Gregory L Lee, Dong H Ahn. SWORD: A Bounded Memory-Overhead Detector of OpenMP Data Races in Production Runs. The 32nd IEEE International Parallel and Distributed Processing Symposium (IPDPS), May, Vancouver, Canada, 2018.

  20. [SC'17] Giorgis Georgakoudis, Ignacio Laguna, Dimitrios S. Nikolopoulos, Martin Schulz. REFINE: Realistic Fault Injection via Compiler-Based Instrumentation for Accuracy, Portability and Speed. ACM/IEEE Conference for High Performance Computing, Networking, Storage and Analysis (SC), Denver, CO, 2017.

  21. [PPoPP'17] Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz, and Christopher M Chambreau. Noise Injection Techniques to Expose Subtle and Unintended Message Races. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Austin, Texas, USA, Feb, 2017.

  22. [IPDPS'17] David Beckingsale, Olga Pearce, Ignacio Laguna, and Todd Gamblin. Apollo: Reusable Models for Fast, Dynamic Tuning of Input-Dependent Code. In the 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS), May, Orlando, Florida, USA, 2017.

  23. [SC'16] Ignacio Laguna, Martin Schulz. Pinpointing Scale-Dependent Integer Overflow Bugs in Large-Scale Parallel Applications. ACM/IEEE Conference for High Performance Computing, Networking, Storage and Analysis (SC), Salt Lake City, 2016.

  24. [IPDPS'16] Simone Atzeni, Ganesh Gopalakrishnan, Zvonimir Rakamaric, Dong H. Ahn, Ignacio Laguna, Martin Schulz, Gregory L. Lee, Joachim Protze, Matthias S. Muller. ARCHER: Effectively Spotting Data Races in Large OpenMP Applications. In 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Chicago, May 23-27, 2016.

  25. [CGO'16] Ignacio Laguna, Martin Schulz, David F. Richards, Jon Calhoun, Luke Olson. IPAS: Intelligent Protection Against Silent Output Corruption in Scientific Applications. In the 14th IEEE/ACM International Symposium on Code Generation and Optimization (CGO), Barcelona, March 12-18, 2016.

  26. [SC'15] Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz. Clock Delta Compression for Scalable Order-Replay of Non-Deterministic Parallel Applications. In the ACM/IEEE Conference for High Performance Computing, Networking, Storage and Analysis (SC), Austin, Texas, Nov, 2015.

  27. [ICCS'15] A. Chien, P. Balaji, P. Beckman, N. Dun, A. Fang, H. Fujita, K. Iskra, Z. Rubenstein, Z. Zheng, R. Schreiber, J. Hammond, J. Dinan, I. Laguna, D. Richards, A. Dubey, B. van Straalen, M. Hoemmen, M. Heroux, K. Teranishi, A. Siegel. Versioned Distributed Arrays for Resilience in Scientific Applications: Global View Resilience. In the International Conference On Computational Science (ICCS), Reykjavik, Iceland, June 1-3, 2015.

  28. [EuroMPI'14] Ignacio Laguna, David F. Richards, Todd Gamblin, Martin Schulz, Bronis R. de Supinski. Evaluating User-Level Fault Tolerance for MPI Applications. In EuroMPI/ASIA, Kyoto, Japan, Sep 9-12, 2014.

  29. [PLDI'14] Subrata Mitra, Ignacio Laguna, Dong H. Ahn, Saurabh Bagchi, Martin Schulz, and Todd Gamblin. Accurate Application Progress Analysis for Large-Scale Parallel Debugging. In ACM International Symposium on Programming Language Design and Implementation (PLDI), Edinburgh, UK, June 9-11, 2014.

  30. [SRDS'13] Ignacio Laguna, Subrata Mitra, Fahad A Arshad, Nawanol Theera-Ampornpunt, Zongyang Zhu, Saurabh Bagchi, Samuel P Midkiff, Mike Kistler, Ahmed Gheith. Automatic Problem Localization via Multidimensional Metric Profiling. In IEEE 32nd International Symposium on Reliable Distributed Systems (SRDS), Braga, Portugal, Sep-Oct, 2013.

  31. [PACT'12] Ignacio Laguna, Dong H. Ahn, Bronis R. de Supinski, Saurabh Bagchi, Todd Gamblin. Probabilistic Diagnosis of Performance Faults in Large-Scale Parallel Applications. In International Conference on Parallel Architectures and Compilation Techniques (PACT), Minneapolis, MN, Sep, 2012.

  32. [DSN'12] Greg Bronevetsky, Ignacio Laguna, Saurabh Bagchi and Bronis R. de Supinski. Automatic Fault Characterization via Abnormality-Enhanced Classification. In IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Boston, Massachusetts, Jun, 2012.

  33. [SC'11] Ignacio Laguna, Todd Gamblin, Bronis R. de Supinski, Saurabh Bagchi, Greg Bronevetsky, Dong H. Ahn, Martin Schulz, Barry Rountree. Large Scale Debugging of Parallel Tasks with AutomaDeD. In ACM/IEEE Supercomputing (SC), Seattle, WA, Nov 2011.

  34. [DSN'10] Greg Bronevetsky, Ignacio Laguna, Saurabh Bagchi, Bronis R. de Supinski, Dong H. Ahn, Martin Schulz. AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks. In IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Chicago, Illinois, Jun-Jul, 2010.

  35. [Middleware'09] Ignacio Laguna, Fahad A. Arshad, David M. Grothe, Saurabh Bagchi. How To Keep Your Head Above Water While Detecting Errors. In ACM/IFIP/USENIX 10th International Middleware Conference (Middleware), UIUC Illinois, Nov-Dec 2009.

  36. [SC'09] Dong H. Ahn, Bronis R. de Supinski, Ignacio Laguna, Greg L. Lee, Ben Liblit, Barton P. Miller, and Martin Schulz. Scalable Temporal Order Analysis for Large Scale Debugging. In ACM/IEEE Supercomputing (SC), Portland, OR, Nov 2009.

  37. [SRDS'07] Gunjan Khanna, Ignacio Laguna, Fahad A. Arshad and Saurabh Bagchi. Distributed Diagnosis of Failures in a Three Tier E-Commerce System. In IEEE Symposium on Reliable Distributed Systems (SRDS), Beijing, China, Oct 2007.

  38. [SRDS'07] Gunjan Khanna, Ignacio Laguna, Fahad A. Arshad and Saurabh Bagchi. Stateful Detection in High Throughput Distributed Systems. In IEEE Symposium on Reliable Distributed Systems (SRDS), Beijing, China, Oct 2007.

Journal Papers

  1. [CACM] Dong H. Ahn, Allison H. Baker, Michael Bentley, Ian Briggs, Ganesh Gopalakrishnan, Dorit M. Hammerling, Ignacio Laguna, Gregory L. Lee, Daniel J. Milroy, Mariana Vertenstein. Keeping Science on Keel When Software Moves. Communications of the ACM, no. 2, 2021, 66-74.

  2. [JPDC] Luanzheng Guo, Ignacio Laguna, Dong Li. PARIS: Predicting Application Resilience Using Machine Learning. Accepted in the Journal of Parallel and Distributed Computing.

  3. [CCPE] Nawrin Sultana, Martin Rüfenacht, Anthony Skjellum, Purushotham Bangalore, Ignacio Laguna, and Kathryn Mohror. Understanding the Use of MPI in Exascale Proxy Applications. Concurrency and Computation: Practice and Experience, Wiley.

  4. [TPDS] Shinobu Miwa, Ignacio Laguna, Martin Schulz. PredCom: A Predictive Approach to Collecting Approximated Communication Traces. IEEE Transactions on Parallel & Distributed Systems.

  5. [ParCo] Nawrin Sultana, Martin Rüfenacht, Anthony Skjellum, Ignacio Laguna, and Kathryn Mohror. Failure recovery for bulk synchronous applications with MPI stages. Parallel Computing, Volume 84, May 2019, Pages 1-14.

  6. [IJHPCA] Kento Sato, Ignacio Laguna, Gregory L Lee, Martin Schulz, Christopher M Chambreau, Simone Atzeni, Michael Bentley, et al. Pruners: Providing reproducibility for uncovering non-deterministic errors in runs on supercomputers. The International Journal of High Performance Computing Applications, Vol 33, Issue 5, 2019.

  7. [CCPE] Sourav Chakraborty, Ignacio Laguna, Murali Emani, Kathryn Mohror, Dhabaleswar K. Panda, Martin Schulz, Hari Subramoni. EReinit: Scalable and Efficient Fault Tolerance for Bulk-Synchronous MPI Applications. Concurrency and Computation: Practice and Experience, Wiley, Volume 32, Issue 3, 2020.

  8. [IJHPCA] A. Chien, P. Balaji, N. Dun, A. Fang, H. Fujita, K. Iskra, Z. Rubenstein, Z. Zheng, J. Hammond, I. Laguna, D. Richards, A. Dubey, B. van Straalen, M. Hoemmen, M. Heroux, K. Teranishi, A. Siegel. Exploring versioned distributed arrays for resilience in scientific applications: global view resilience. The International Journal of High Performance Computing Applications (IJHPCA), 31, no. 6 (2017): 564-590.

  9. [IJHPCA] Ignacio Laguna, David F. Richards, Todd Gamblin, Martin Schulz, Bronis R. de Supinski, Kathryn Mohror, and Howard Pritchard. Evaluating and Extending User-Level Fault Tolerance in MPI. The International Journal of High Performance Computing Applications (IJHPCA), vol. 30, no. 3, pp. 305-319, Sep, 2016.

  10. [CACM] Ignacio Laguna, Dong H. Ahn, Bronis R. de Supinski, Todd Gamblin, Gregory L. Lee, Martin Schulz, Saurabh Bagchi, Milind Kulkarni, Bowen Zhou, Zhezhe Chen, and Feng Qin. Debugging high-performance computing applications at massive scales. In Communications of the ACM, September, 2015.

  11. [TPDS] Ignacio Laguna, Dong Ahn, Bronis de Supinski, Saurabh Bagchi, and Todd Gamblin. Diagnosis of Performance Faults in Large Scale MPI Applications via Probabilistic Progress-Dependence Inference. IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 26, no. 5, pp. 1280-1289, May 2015.

  12. [CSE] Martin Schulz, Jim Belak, Abhinav Bhatele, Peer-Timo Bremer, Greg Bronevetsky, Marc Casas, Todd Gamblin, Katherine E. Isaacs, Ignacio Laguna, Joshua Levine, Valerio Pascucci, David Richards, Barry Rountree. Performance analysis techniques for the exascale co-design process. Parallel Computing: Accelerating Computational Science and Engineering (CSE), vol. 25, p. 19, 2014, IOS Press.

Workshop Papers

  1. [IWOMP] Giorgis Georgakoudis, Johannes Doerfert, Ignacio Laguna and Tom Scogland. FAROS: A Framework To Analyze OpenMP Compilation Through Benchmarking and Compiler Optimization Analysis. In the International Workshop on OpenMP (IWOMP), Sep 21-24, 2020 (Best Paper Award).

  2. [ROSS] Stephen Herbein, David Domyancic, Paul Minner, Ignacio Laguna, Rafael Ferreira da Silva, and Dong H. Ahn. MCEM: Multi-Level Cooperative Exception Model for HPC Workflows. In Proceedings of the 9th International Workshop on Runtime and Operating Systems for Supercomputers (ROSS '19). ACM, New York, NY, USA, 27-32.

  3. [ScalA] Ranvijay Singh, Paul Wood, Ravi Gupta, Saurabh Bagchi, and Ignacio Laguna. Snowpack: efficient parameter choice for GPU kernels via static analysis and statistical prediction. In Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA ’17), @SC17, Denver, CO, 2017.

  4. [FTXS] Ayush Patwari, Ignacio Laguna, Martin Schulz, Saurabh Bagchi. Understanding the Spatial Characteristics of DRAM Errors in HPC Clusters. The 7th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) @HPDC, Washington, D.C., USA, Jun, 2017.

  5. [IWOMP] Joachim Protze, Dong H. Ahn, Ignacio Laguna, Martin Schulz, and Matthias S. Muller. Testing Infrastructure for OpenMP Debugging Interface Implementations. In the International Workshop on OpenMP (IWOMP), Oct 5, 2016.

  6. [IWOMP] Joachim Protze, Ignacio Laguna, Dong H. Ahn, John DelSignore, Ariel Burton, Martin Schulz, and Matthias S. Muller. Lessons Learned from Implementing OMPD: a Debugging Interface for OpenMP. In the 11th International Workshop on OpenMP (IWOMP), Aachen, Germany, October 1-2, 2015.

  7. [LLVM-HPC] Joachim Protze, Simone Atzeni, Dong H Ahn, Martin Schulz, Ganesh Gopalakrishnan, Matthias S Muller, Ignacio Laguna, Zvonimir Rakamaric, Greg L Lee. Towards providing low-overhead data race detection for large OpenMP applications. In Workshop on LLVM Compiler Infrastructure in HPC, held in conjunction with SC’14, New Orleans, Louisiana, Nov, 2014.

  8. [ScalA] Ignacio Laguna, Edgar A Leon, Martin Schulz, Mark Stephenson. A study of application-level recovery methods for transient network faults. In Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA’13), held in conjunction with SC’13, Denver, Colorado, Nov, 2013.

  9. [SEHPCCSE] Dong H Ahn, Gregory L Lee, Ganesh Gopalakrishnan, Zvonimir Rakamaric, Martin Schulz, Ignacio Laguna. Overcoming extreme-scale reproducibility challenges through a unified, targeted, and multilevel toolset. In 1st International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering (SEHPCCSE’13), held in conjunction with SC’13, Denver, Colorado, Nov, 2013.

  10. [SELSE] Greg Bronevetsky, Ignacio Laguna, Saurabh Bagchi, Bronis R. de Supinski, Dong H. Ahn, and Martin Schulz. Statistical Fault Detection for Parallel Applications with AutomaDeD. In 6th IEEE Workshop on Silicon Errors in Logic - System Effects (SELSE’10), Stanford, CA, Mar 23-24, 2010.