The Performance Conscious HPC (PeCoH) project has two objectives. The first is to raise users' awareness and knowledge of performance engineering, i.e., to assist in identifying and quantifying potential efficiency improvements in scientific codes and code usage. The second is to increase the coordination of performance engineering activities in Hamburg and to establish novel services that foster performance engineering; these services are then evaluated and implemented at the participating data centers. The coordination is implemented by establishing a Hamburg regional HPC competence center (HHCC) as a virtual organization that will contribute to the federalized HPC infrastructure of HLRN and the Gauß-Alliance. The HHCC will combine the strengths of the university data center (Regionales Rechenzentrum, RRZ), the German Climate Computing Center (DKRZ) and the data center of the Technical University Hamburg-Harburg (TUHH).
The Scientific Computing group of Prof. Thomas Ludwig has a long history in parallel file system research but also investigates energy efficiency and cost-efficiency aspects and has developed tools for performance analysis. Prof. Ludwig is the director of the German Climate Computing Center (DKRZ). The group has been embedded in DKRZ since 2009 and thus also addresses aspects that are important for earth system scientists. With regard to teaching, the group has offered interdisciplinary seminars and other courses about software engineering in science since 2012.
The Software Construction Methods group (in German: Softwareentwicklungs- und -konstruktionsmethoden, SWK) of Prof. Matthias Riebisch and his prior group at the Ilmenau University of Technology have experience in the adaptation and optimization of software architectures and software development processes. This includes, for example, measuring software quality properties, forecasting properties after changes using impact analysis techniques, and methods for partitioning software applications for optimized execution on parallel computing platforms.
The Scientific Visualization group of Prof. Stephan Olbrich focuses on the development of methods for parallel data extraction and efficient rendering for volume and flow visualization of high-resolution, unsteady phenomena. Prof. Olbrich is the director of the Regional Computing Center (RRZ). The group has integrated its software into HPC applications, e.g., for climate research. Its activities are part of the cluster of excellence Integrated Climate System Analysis and Prediction (CliSAP) and focus on post-processing data sets as well as on parallel data extraction at runtime as part of the simulation.
DKRZ has long devoted one of its departments to user consultancy. The department supports DKRZ's users in the effective use of its systems by providing general personal advice on the use of the systems, by helping users to port applications to the HPC systems, and by offering conceptual guidance on parallelization and optimization strategies for user code, specifically with respect to the provided HPC system.
The HPC team at RRZ operates a 396-node Linux cluster and more than 2 PByte of disk storage. The team is part of the consulting network of the North German Supercomputing Alliance (Höchstleistungsrechenzentrum Nord (HLRN) in German). HPC activities (locally and for HLRN) include user support, user education (in parallel programming and single-processor optimization) and benchmarking. RRZ maintains and further develops BQCD (Berlin quantum chromodynamics program) as one of the HLRN application benchmark codes (see also M. Allalen, M. Brehm, and H. Stüben. Performance of Quantum Chromodynamics (QCD) Simulations on the SGI Altix. Computational Methods in Science and Technology, 14(2):69–75, 2008).
The TUHH RZ has HPC consultants who support local users as well as users of the North German Supercomputing Alliance (HLRN). TUHH RZ operates a 244-node Linux cluster. TUHH RZ and RRZ cooperate in sharing specialized parts of their HPC hardware.
This project is funded by the DFG in the call Performance Engineering.
Please also see other funded projects in the call.
The foundation of our work plan is a set of services that the HHCC implements for scientists of different fields. These services offer basic support for performance engineering, co-development, and education. Furthermore, they provide feedback to users and collect and disseminate knowledge for user support. Orthogonal to these services, we research methods to raise user awareness of advanced performance engineering. The proposed methods cover cost-awareness, competence management, the demonstration of benefit via success stories, and the quantification of benefit for alternative processing concepts.
An overview of the methods and services is given in the figure. Services and methods benefit from each other: while services transfer computer science knowledge to scientists, they also identify and channel user needs and experience, which are fed into method development. The research strategies described by the methods are evaluated, and the resulting analyses are pushed into the service ecosystem. All the methods and services mentioned in the following will be implemented in close collaboration with the three data centers.
There are conceptual competences relevant for most data centers, such as understanding resource management, and hardware/software-specific competences, such as the performance expectation of collective MPI operations. We will increase transparency on the benefit of HPC competences, especially those relevant for performance engineering. To this end, relevant competences are identified and classified according to their necessity and usefulness for different scientific domains. We will also systematically analyze the need for performance optimization from the perspectives of a scientist and a data center. To make the competences more visible in the community, we will establish HPC certification levels. We will bundle the teaching material required to master the various certification levels and organize standardized online examinations for acquiring the certificates.
Executing a scientific application and keeping data is costly for the data center, and, unfortunately, users are often not aware of these costs. Understanding the cost of running applications and keeping data is important for identifying the benefit of optimization, which is itself costly in terms of brainware. We will develop models to approximate costs for scientists and foster the discussion with the open-source community on embedding cost-efficiency computation into the job summaries of resource managers. Additionally, we will track the resource utilization and efficiency of applications and data centers and provide this information to the users.
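As a rough illustration of what such a cost model could look like, the following sketch approximates the cost of a job and of stored data with simple linear rates, and estimates after how many runs an optimization effort pays off. The rates, the function names and the linear model itself are assumptions made for this example; they are not figures or methods taken from the project.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class CostRates:
    """Hypothetical rates; real values would have to come from a data center."""
    eur_per_node_hour: float = 0.50   # assumed compute cost per node hour
    eur_per_tb_month: float = 5.00    # assumed storage cost per TB and month


def job_cost(nodes: int, wall_hours: float, rates: CostRates) -> float:
    """Linear compute cost: nodes * wall-clock hours * rate."""
    return nodes * wall_hours * rates.eur_per_node_hour


def storage_cost(terabytes: float, months: float, rates: CostRates) -> float:
    """Linear storage cost: volume * retention time * rate."""
    return terabytes * months * rates.eur_per_tb_month


def break_even_runs(optimization_hours: float, eur_per_person_hour: float,
                    cost_before: float, cost_after: float) -> Optional[int]:
    """Number of runs after which the optimization effort is paid back.

    Returns None if the optimized run is not cheaper than the original.
    """
    saving_per_run = cost_before - cost_after
    if saving_per_run <= 0:
        return None
    effort_cost = optimization_hours * eur_per_person_hour
    return math.ceil(effort_cost / saving_per_run)


if __name__ == "__main__":
    rates = CostRates()
    before = job_cost(nodes=16, wall_hours=10, rates=rates)
    after = job_cost(nodes=16, wall_hours=7.5, rates=rates)
    print(f"cost per run before: {before:.2f} EUR, after: {after:.2f} EUR")
    print("break-even after runs:",
          break_even_runs(40, 60.0, before, after))
```

A model of this kind makes the trade-off explicit: optimization only pays off if the savings per run, multiplied by the expected number of runs, exceed the personnel cost of the optimization.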
In our experience, demonstrating the benefit of performance engineering, in terms of better productivity for scientists, is needed to achieve better acceptance. Therefore, we will create and publish success stories from different scientific fields. To achieve this goal, we will cooperate with scientists and accompany them throughout the project period. We will also gather existing and upcoming optimization studies and compute their respective cost-benefit using our cost models. While many studies are already available, finding the appropriate information is a daunting task. Therefore, we will investigate methods to improve the searchability of, and navigation between, relevant performance data.
Together with domain scientists, we will estimate and evaluate the benefit of novel concepts for running their experiments based on the individual scientific use cases. During the project runtime, we will approach users, discuss the benefit of alternative architectures, programming concepts and workflow systems, and model this benefit to estimate its value. For certain use cases, tools and concepts from Big Data analytics, in-situ visualization and software engineering have the potential to significantly improve the scientific outcome and, at the same time, increase data center efficiency. Compiler-assisted debugging tools that speed up the development process will also be investigated. Our results will be monitored using the developed survey and published as a (success) story.
This service will disseminate knowledge, education, concepts and opportunities of performance engineering. To this end, we will establish the virtual organization for the HHCC and publish all results and methodological concepts there, together with educational and supporting material (such as a knowledge base). In our experience, it is very difficult to find relevant information; therefore, we explore approaches to increase the searchability of this information.
This service gives users the opportunity to develop their HPC (and especially performance engineering) skills. We will establish HPC competence certificates ("HPC-Führerschein") on various levels. To this end, we will package open educational resources (OER) from existing material and extend them into useful online courses. This fits very well into the Hamburg Open Online University, and we will collaborate with this initiative to improve the pedagogical quality of the courses. Available courses in the region will be summarized and shown on the web page of the HHCC. Together with the certification process, this will help users to understand and improve their performance optimization skills.
This service proactively provides feedback to users and gives them metrics to reflect on their performance, i.e., how well their codes run. Thus, it identifies opportunities to improve the current situation. Firstly, we support the deployment of an infrastructure that measures and assesses resource utilization at the application level and also quantifies costs. We then elaborate on tools that semi-automatically generate feedback to users about the efficiency of their runs. The collected information is periodically reviewed by our user support to identify performance issues and to initiate a conversation with the users about resolving them.
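A minimal sketch of such semi-automatic feedback is shown below. It uses CPU efficiency (CPU time actually used divided by CPU time reserved) as one possible metric; the choice of metric, the 50% threshold, the function names and the message wording are assumptions for illustration and do not describe the tools used in the project.

```python
def cpu_efficiency(cpu_seconds: float, cores: int, wall_seconds: float) -> float:
    """CPU time actually used divided by the CPU time reserved for the job."""
    if cores <= 0 or wall_seconds <= 0:
        raise ValueError("cores and wall_seconds must be positive")
    return cpu_seconds / (cores * wall_seconds)


def feedback(job_id: str, cpu_seconds: float, cores: int,
             wall_seconds: float, threshold: float = 0.5) -> str:
    """Create a short feedback message for one finished job.

    The threshold is an assumed value; a data center would tune it.
    """
    eff = cpu_efficiency(cpu_seconds, cores, wall_seconds)
    if eff < threshold:
        return (f"Job {job_id}: CPU efficiency {eff:.0%} is below "
                f"{threshold:.0%}; consider requesting fewer cores "
                f"or contacting user support.")
    return f"Job {job_id}: CPU efficiency {eff:.0%}."


if __name__ == "__main__":
    # Hypothetical accounting data: 4 cores reserved for 30 minutes,
    # but only 30 minutes of CPU time consumed in total.
    print(feedback("12345", cpu_seconds=1800, cores=4, wall_seconds=1800))
```

In practice the input values would be taken from the accounting data of the resource manager, and flagged jobs would be reviewed by user support before contacting the user.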
This service aids users who explicitly seek help in identifying and mitigating performance issues. Since performance optimization of individual applications is time-consuming, the support focuses on identifying issues in the preparation and configuration of individual applications and workflows. Picking this low-hanging fruit also promises significant performance improvements, and the results can be transferred more easily between scientists.
In this service, a few joint efforts with scientists are carried out to evaluate and utilize new software concepts, such as programming languages and tools, and to understand the potential of novel architectures and processing systems. This is achieved by inferring knowledge from existing studies and by providing simplified performance and cost models for those alternatives. Together with scientists, we establish pilot studies to support the rewriting of existing codes and document the results as success stories. While we conduct the co-development, we will periodically capture the fraction of work time spent on different tasks, e.g., design, programming and runtime. This will allow us to understand the required programming time; combined with our cost analysis, it makes the cost-efficiency of novel approaches more visible. To allow other scientists to conduct similar studies, we will develop and publish a quality control method for conducting surveys that assess the benefit of systematic performance engineering.
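The captured work time can be turned into fractions per task with a few lines of code. The sketch below assumes that hours are logged per task category; the category names and the numbers in the example are hypothetical.

```python
from typing import Dict


def time_shares(hours_by_task: Dict[str, float]) -> Dict[str, float]:
    """Convert logged hours per task into fractions of the total work time."""
    total = sum(hours_by_task.values())
    if total <= 0:
        raise ValueError("total logged time must be positive")
    return {task: hours / total for task, hours in hours_by_task.items()}


if __name__ == "__main__":
    # Hypothetical log of one co-development period.
    logged = {"design": 10.0, "programming": 30.0, "runtime": 60.0}
    for task, share in time_shares(logged).items():
        print(f"{task}: {share:.0%}")
```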
| 2018-06-01 | Concept Paper for the HPC Certification Program (Draft Version 0.91 – June 1, 2018) | PeCoH | Hamburg (Germany) |
| 2018-02-14 | Concept Paper for the HPC Certification Program (Draft Version 0.9 – February 1, 2018) | PeCoH | Hamburg (Germany) |
| 2017-11-12 | Handout about the work in progress of our HPC Certification Program | SC-17 | Denver/Colorado (USA) |
| 2017-06-18 | Handout about the work in progress of our HPC Certification Program | ISC-17 | Frankfurt (Germany) |