The Performance Conscious HPC (PeCoH) project has two objectives. The first is to raise users' awareness and knowledge of performance engineering, i.e., to assist in identifying and quantifying potential efficiency improvements in scientific codes and code usage. The second is to increase the coordination of performance engineering activities in Hamburg and to establish novel services that foster performance engineering, which are then evaluated and implemented at the participating data centers. The coordination is implemented by setting up a Hamburg regional HPC competence center (HHCC) as a virtual organization that will contribute to the federated HPC infrastructure of HLRN and the Gauß-Alliance. The HHCC will combine the strengths of the university data center (Regionales Rechenzentrum, RRZ), the German Climate Computing Center (DKRZ) and the data center of Technical University Hamburg-Harburg (TUHH).
People involved from WR
- Dr. Julian Kunkel (contact person)
This project is funded by the DFG in the call Performance Engineering.
Please also see other funded projects in the call.
The foundation of our work plan is a set of services that the HHCC offers to scientists from different fields. These services provide basic support for performance engineering, co-development, and education. Furthermore, they provide feedback to users and collect and disseminate knowledge for user support. Orthogonal to these services, we research methods to raise user awareness of advanced performance engineering. The proposed methods cover cost-awareness, competence management, the demonstration of benefit via success stories, and the quantification of benefit for alternative processing concepts.
An overview of the methods and services is given in the figure. Services and methods benefit from each other: while services transfer computer science knowledge to scientists, they also identify and channel user needs and experience, which are fed into the method development. The research strategies described by the methods are evaluated, and the resulting analyses are pushed into the service ecosystem. All the methods and services mentioned in the following will be implemented in close collaboration with the three data centers.
Some competences are conceptual and relevant for most data centers, such as understanding resource management; others are hardware- or software-specific, such as the performance expectation of collective MPI operations. We will increase transparency on the benefit of HPC competences, especially those relevant for performance engineering. To this end, relevant competences are identified and classified according to their necessity and usefulness for different scientific domains. We will also systematically analyze the need for performance optimization from the perspectives of a scientist and a data center. To make the competences more visible in the community, we will establish HPC certification levels. We will bundle the teaching material required to master the various certification levels and organize standardized online examinations for acquiring the certificates.
Executing a scientific application and keeping data is costly for the data center and, unfortunately, users are often not aware of these costs. Understanding the cost of running applications and keeping data is important for identifying the benefit of optimization, which is itself costly in terms of brainware, i.e., human effort. We will develop models to approximate costs for scientists and foster the discussion with the open-source community on embedding cost-efficiency computation into the job summaries of resource managers. Additionally, we will track the resource utilization and efficiency of applications and data centers and provide this information to the users.
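To illustrate what such a cost approximation could look like in a job summary, here is a minimal sketch. It is not the project's model: the linear cost formula, the names `CostRates` and `job_cost_summary`, and all rates are hypothetical placeholders chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostRates:
    """Hypothetical price list; the values are placeholders, not real data-center figures."""
    eur_per_node_hour: float = 0.50
    eur_per_tb_month: float = 5.00


DEFAULT_RATES = CostRates()


def job_cost_summary(nodes: int,
                     wallclock_hours: float,
                     stored_tb: float,
                     retention_months: float,
                     rates: CostRates = DEFAULT_RATES) -> dict:
    """Approximate the cost of one job with a simple linear model (an assumption).

    Compute cost scales with node-hours, storage cost with the volume of
    data kept and the time it is retained.
    """
    node_hours = nodes * wallclock_hours
    compute_eur = node_hours * rates.eur_per_node_hour
    storage_eur = stored_tb * retention_months * rates.eur_per_tb_month
    return {
        "node_hours": node_hours,
        "compute_eur": compute_eur,
        "storage_eur": storage_eur,
        "total_eur": compute_eur + storage_eur,
    }


if __name__ == "__main__":
    # Example: 16 nodes for 10 hours, keeping 2 TB of output for 6 months.
    for key, value in job_cost_summary(16, 10.0, 2.0, 6.0).items():
        print(f"{key}: {value}")
```

A real model would likely also have to account for factors such as energy, staff and hardware depreciation; the sketch only shows how a per-job figure could be derived and reported to the user.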
In our experience, demonstrating the benefit of performance engineering, in terms of better productivity for scientists, is needed to achieve better acceptance. Therefore, we will create and publish success stories from different scientific fields. To achieve this goal, we will cooperate with scientists and accompany them throughout the project period. We will also gather existing and upcoming optimization studies and compute their respective cost-benefit using our cost models. While many studies are already available, finding the appropriate information is a daunting task. Therefore, we will investigate methods to improve the searchability of relevant performance data and the navigation between them.
Benefit of (new) concepts
Together with domain scientists, we will estimate and evaluate the benefit of novel concepts for running their experiments based on the individual scientific use cases. During the project runtime, we will approach users, discuss the benefit of alternative architectures, programming concepts and workflow systems, and model this benefit to estimate its value. For certain use cases, tools and concepts from Big Data analytics, in-situ visualization, and software engineering have the potential to significantly improve the scientific outcome and, at the same time, increase data center efficiency. Compiler-assisted debugging tools that speed up the development process will also be investigated. Our results will be monitored using the developed survey and published as a (success) story.
This service will disseminate knowledge, education, concepts and opportunities of performance engineering. To this end, we will establish the virtual organization for the HHCC and publish all results and methodological concepts there, together with educational and supporting material (such as a knowledge base). In our experience, it is very difficult to find relevant information; therefore, we will explore approaches to increase the searchability of this information.
This service gives users the opportunity to improve their HPC skills, especially in performance engineering. We will establish HPC competence certificates ("HPC-Führerschein", i.e., an HPC driver's license) on various levels. To this end, we will package open educational resources (OER) from existing material and extend them into useful online courses. This fits very well with the Hamburg Open Online University, and we will collaborate with this initiative to improve the pedagogical quality of the courses. Available courses in the region will be summarized and shown on the web page of the HHCC. Together with the certification process, this will help users to understand and improve their performance optimization skills.
This service proactively provides feedback to the users and gives them metrics to reflect on their performance, that is, how well their codes run. Thus, it identifies opportunities to improve the current situation. Firstly, we support the deployment of an infrastructure that measures and assesses resource utilization on the application level and also quantifies costs. We then work on tools that semi-automatically create feedback for the users about the efficiency of their runs. The collected information is periodically reviewed by our user support to identify performance issues and to initiate a conversation with the users about resolving them.
This service aids users who explicitly seek help in identifying and mitigating performance issues. Since performance optimization of individual applications is time-consuming, the support focuses on identifying issues in the preparation and configuration of individual applications and workflows. Addressing this low-hanging fruit also promises significant performance improvements, and the results can be transferred more easily between scientists.
In this service, a few joint efforts with scientists are implemented to evaluate and utilize new software concepts, such as programming languages and tools, and to understand the potential of novel architectures and processing systems. This is achieved by inferring knowledge from existing studies and by providing simplified performance and cost models for those alternatives. Together with scientists, we establish pilot studies to support the rewriting of existing codes and document the results as success stories. While we conduct the co-development, we will periodically capture the fraction of work time spent on different tasks, e.g., design, programming and runtime. This will allow us to understand the required programming time; combined with our cost analysis, it makes the cost-efficiency of novel approaches more visible. To allow other scientists to conduct similar studies, we will develop and publish a quality control method for conducting surveys that assess the benefit of systematic performance engineering.
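As a rough illustration of how captured work time and a cost model could be combined, the following sketch computes the share of time per task and the number of runs after which an optimization pays off. The function names, the linear break-even formula and all numbers are assumptions made for this example, not the project's method.

```python
import math
from typing import Dict, Optional


def time_fractions(hours_by_task: Dict[str, float]) -> Dict[str, float]:
    """Return the fraction of total work time spent on each task."""
    total = sum(hours_by_task.values())
    if total <= 0:
        raise ValueError("total recorded time must be positive")
    return {task: hours / total for task, hours in hours_by_task.items()}


def break_even_runs(person_hours: float,
                    eur_per_person_hour: float,
                    node_hours_before: float,
                    node_hours_after: float,
                    eur_per_node_hour: float) -> Optional[int]:
    """Number of runs after which the engineering effort is amortized.

    Returns None if the rewritten code is not cheaper per run.
    All rates are hypothetical inputs supplied by the caller.
    """
    effort_eur = person_hours * eur_per_person_hour
    saving_per_run = (node_hours_before - node_hours_after) * eur_per_node_hour
    if saving_per_run <= 0:
        return None
    return math.ceil(effort_eur / saving_per_run)


if __name__ == "__main__":
    # Hypothetical time log of a pilot study, in hours.
    log = {"design": 10, "programming": 25, "runtime": 5}
    print(time_fractions(log))
    # 40 person-hours of work reduce a run from 100 to 80 node-hours.
    print(break_even_runs(sum(log.values()), 50.0, 100.0, 80.0, 0.5))
```

Under these made-up figures the effort would pay off after 200 runs; with real measurements, the same calculation could help decide whether a rewrite is worthwhile for a given usage pattern.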