BoF: Analyzing Parallel I/O

Parallel application I/O performance often does not meet user expectations. Additionally, slight access pattern modifications may lead to significant changes in performance due to complex interactions between hardware and software. These challenges call for sophisticated tools to capture, analyze, understand, and tune application I/O.

In this BoF, we will highlight recent advances in monitoring tools to help address this problem. We will also encourage community discussion to compare best practices, identify gaps in measurement and analysis, and find ways to translate parallel I/O analysis into actionable outcomes for users, facility operators, and researchers.

The BoF is held in conjunction with the Supercomputing conference. The schedule is listed here.

Date: Thursday, Nov. 16th, 2017
Time: 12:15-13:00
Venue: Room 402-403-404, Denver, USA

Organization

The BoF is organized by

Agenda

We have a series of talks (8+2 minutes each) followed by a longer discussion:

  • Introduction – Phil Carns – Slides
  • Towards Total Knowledge of I/O at NERSC through Holistic Monitoring – Glenn Lockwood – Slides
  • Analyzing Lustre File System Performance With Splunk – Ross Miller – Slides
  • Real-Time I/O Monitoring of HPC Applications – Eugen Betke – Slides
  • Job-based I/O-monitoring with LLview – Wolfgang Frings – Slides
  • Panel (15 minutes) – Moderated by Julian Kunkel

Speakers

  • Glenn Lockwood is a member of NERSC's Advanced Technologies Group at Lawrence Berkeley National Laboratory who specializes in I/O performance analysis, extreme-scale storage architectures, and emerging I/O technologies and APIs. His work is centered around understanding I/O performance by correlating performance analysis across all levels of the I/O subsystem, from node-local page cache to back-end storage devices.
  • Ross Miller is a software developer in the Technology Integration group at Oak Ridge National Lab, where he works on a variety of projects, including file system monitoring tools.
  • Eugen Betke completed his studies in computer science in 2015, specializing in machine learning and I/O performance. In his master's thesis, he applied machine learning methods to predict I/O performance. At the beginning of 2016, he started as a researcher at the German Climate Computing Center. His key areas are the analysis and optimization of HPC I/O; for example, he developed a cluster-wide monitoring system for Lustre on Mistral. During his scientific career, he has acquired about three years of experience in these areas.
  • Wolfgang Frings is a member of the Application Support Division at the Jülich Supercomputing Centre. His research interests focus on parallel I/O, I/O middleware, and HPC system monitoring. He is the author of several software tools used at many HPC centers. Among these are SIONlib, a library to support task-local parallel file I/O on large-scale systems, and LLview, a batch system monitoring tool.