I/O sleuthing to track down errors in performance or correctness - at scale!
We often see computational science codes focus on I/O only when something has gone wrong: a job took longer than expected, data was incorrect, or obscure failure messages came back from the storage system. Diagnosing and debugging I/O problems has much in common with debugging any parallel application, but it also has its own peculiarities and tools. In this one-day course, Rob will cover the kinds of I/O problems one is likely to encounter in HPC, how to investigate those problems, and how to fix them.
Rob Latham strives to make scientific applications use I/O more efficiently. After earning his BS (1999) and MS (2000) in Computer Engineering at Lehigh University (Bethlehem, PA), he eventually ended up at Argonne National Laboratory (ANL). His research focuses on high-performance I/O for scientific applications and on I/O metrics. He has worked on the ROMIO MPI-IO implementation, the parallel file systems PVFS (v1 and v2), Parallel NetCDF, and Mochi I/O services.
Selected Resources
I/O Sleuthing workshop materials
I/O Sleuthing workshop video recordings
I/O Sleuthing: Digging into Storage Performance blog post