Theses

2017

Dynamic decision-making for efficient compression in parallel distributed file systems

Author: Janosch Hirsch
Type: Master's Thesis
Advisors: Dr. Michael Kuhn, Anna Fuchs
Date: 2017-08-12
Abstract: The technology gap between computational speed, storage capacity and storage speed poses big problems, especially for the HPC field. A promising technique to bridge this gap is data reduction through compression. Compression algorithms like LZ4 can reach compression speeds high enough to be applicable in the HPC field. Consequently, efforts to integrate compression into the Lustre file system are in progress. Client-side compression also brings the potential to increase network throughput. But to fully exploit the compression potential, the compression configuration has to be adapted to its environment: the more the configuration is adapted to the data's structure and the machine's condition, the more effective compression will be. The objective of this thesis is to design a decision logic that dynamically adapts the compression configuration to maximize a desired trade-off between application speed and compression. Different compression algorithms and the conditions for compression on the client side of a distributed file system are examined to identify possibilities to apply compression. Finally, an implemented prototype of the decision and adaptation logic is evaluated at different network speeds, and starting points for further improving the concept are given.
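
As a minimal sketch of such a decision logic (an illustration, not the thesis prototype; the measured rates and the speed-up heuristic are assumptions), LZ4's real acceleration parameter can be raised until compression keeps pace with the network:

```c
/* Minimal sketch: pick an LZ4 acceleration level so that compression
 * keeps up with the measured network rate. choose_acceleration() and
 * the assumed speed-up factor are illustrative, not from the thesis. */
#include <lz4.h>

/* Higher acceleration means faster but weaker compression; raise it
 * when compression would otherwise be the bottleneck. */
static int choose_acceleration(double net_mbps, double comp_mbps)
{
    int accel = 1;
    while (comp_mbps < net_mbps && accel < 64) {
        accel *= 2;          /* LZ4 gets faster with higher acceleration */
        comp_mbps *= 1.5;    /* crude assumed speed-up per step */
    }
    return accel;
}

int compress_chunk(const char *src, int src_size, char *dst,
                   int dst_capacity, double net_mbps, double comp_mbps)
{
    int accel = choose_acceleration(net_mbps, comp_mbps);
    /* LZ4_compress_fast returns the compressed size, or 0 on failure */
    return LZ4_compress_fast(src, dst, src_size, dst_capacity, accel);
}
```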

Thesis BibTeX

Static Code Analysis for HPC Use Cases

Author: Frank Röder
Type: Bachelor's Thesis
Advisors: Alexander Droste, Dr. Michael Kuhn
Date: 2017-07-26
Abstract: The major objective of this thesis is to approach the procedure of getting into compiler-based checks, with a focus on high-performance computing use cases. In particular, the Message Passing Interface (MPI), which is used to execute parallel tasks via inter-process communication, including parallel reading and writing of files, is taken into account. A motivation explains why static analysis is worthwhile. Following this, techniques and tools to improve software development with static analysis are introduced. Nowadays parallel software has large code bases, and with rising complexity the likelihood of introducing bugs grows; tools that reduce error-proneness are therefore important for efficiency. The infrastructure of LLVM as well as the Clang Static Analyzer (CSA) are introduced to explain static analysis and how to capture information from the relevant compile phases. Based on this, the utility of an existing check is explained. Problems that only show up at runtime are observed through code simulation in the frontend, known as symbolic execution. This understanding is then transferred to the use cases at hand. Commonly overlooked mistakes, such as readability issues and bad code style, are checked through analysis of the abstract syntax tree; to this end, the LLVM tool Clang-Tidy has been extended with new checks. The checks regarding symbolic execution involve MPI-IO-related double closes and operations concerning file access; the routines to find these bugs have been added to the CSA. This thesis makes use of the already existing infrastructure named MPI-Checker, which provides the MPI handling. As a summary, the benefits of working on checks that detect serious bugs are discussed.
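
The following is an illustrative instance of the MPI-IO double-close defect such a check targets (a made-up example, not code from the thesis or its test suite):

```c
/* Bug pattern: a double close of an MPI-IO file handle, the kind of
 * defect a symbolic-execution check in the CSA can flag. */
#include <mpi.h>

void write_results(const char *path, const char *buf, int len)
{
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write(fh, buf, len, MPI_CHAR, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_File_close(&fh); /* BUG: fh is already MPI_FILE_NULL here */
}
```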

Thesis BibTeX

Database VOL-plugin for HDF5

Author: Olga Perevalova
Type: Bachelor's Thesis
Advisors: Dr. Michael Kuhn, Eugen Betke
Date: 2017-07-05
Abstract: HDF5 is an open source, hierarchical, and self-describing format for flexible and efficient I/O on high-volume and complex data that combines data and metadata. The advantages of this format make it widely used by many scientific applications. In a parallel HDF5 application, when a large number of processes access a shared file simultaneously, the synchronization mechanisms used by many file systems may significantly degrade I/O performance. Separating metadata and data is the first step towards solving this problem. The main contribution of this thesis is a prototype of an HDF5 VOL plugin that separates metadata and data: metadata are stored in an SQLite3 database and data in a shared file. It uses MPI for synchronization of metadata when several processes access the SQLite3 database. In the context of this work a benchmark has been developed that measures access times for each metadata operation and the overall I/O performance. The execution time of the database VOL plugin is compared to the native solution. The test results show that the database plugin consistently demonstrates good performance. The thesis concludes with a critical discussion of the approach by looking at the metadata from different perspectives: scientific applications vs. HDF5.
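
A minimal sketch of the metadata-separation idea (the schema and function names are assumptions for illustration, not the plugin's actual design): dataset metadata goes into SQLite3 while the raw data lives in a shared file.

```c
/* Store dataset metadata in SQLite3; raw data is written elsewhere.
 * Illustrative only; real code should use prepared statements. */
#include <sqlite3.h>
#include <stdio.h>

int store_dataset_meta(sqlite3 *db, const char *name,
                       long long offset, long long size)
{
    char sql[256];
    snprintf(sql, sizeof(sql),
             "INSERT INTO datasets(name, offset, size) "
             "VALUES('%s', %lld, %lld);", name, offset, size);
    return sqlite3_exec(db, sql, NULL, NULL, NULL); /* SQLITE_OK on success */
}

int main(void)
{
    sqlite3 *db;
    if (sqlite3_open("metadata.db", &db) != SQLITE_OK)
        return 1;
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS datasets("
                     "name TEXT, offset INTEGER, size INTEGER);",
                 NULL, NULL, NULL);
    store_dataset_meta(db, "/temperature", 0, 4096);
    sqlite3_close(db);
    return 0;
}
```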

Thesis BibTeX

A Numerical Approach to Nonlinear Regression Analysis by Evolving Parameters

Author: Christopher Gerlach
Type: Master's Thesis
Advisors: Dr. Michael Kuhn
Date: 2017-06-29
Abstract: Nonlinear regression analysis is an important statistical method and poses many challenges to the user. While linear models are analytically solvable, nonlinear models can in most cases only be solved numerically. What many numeric methods have in common is that they require a proper starting point to reach satisfactory results. A poor choice of starting values can greatly reduce the convergence speed or, in many cases, even result in the algorithm not converging at all. This thesis proposes a genetic-numerical hybrid method to approach the problem from a nontraditional angle. The approach combines genetic algorithms with traditional numeric methods and proposes a design suitable for massive parallelization with GPGPU computing. It is shown that the approach can solve a large set of practical test problems without having to specify any starting values and that it is fast enough for practical use, utilizing only consumer-grade hardware.
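
The underlying optimization problem can be stated compactly (standard textbook formulation, not quoted from the thesis): the hybrid evolves candidate parameter vectors with a genetic algorithm and refines promising ones numerically, minimizing the sum of squared residuals

```latex
S(\beta) = \sum_{i=1}^{n} \bigl(y_i - f(x_i, \beta)\bigr)^2 \;\longrightarrow\; \min_{\beta}
```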

Thesis BibTeX

Quality of service improvement in ZFS through compression

Author: Niklas Bunge
Type: Master's Thesis
Advisors: Dr. Michael Kuhn, Anna Fuchs
Date: 2017-05-31
Abstract: This thesis evaluates the improved use of data compression to reduce storage space, increase throughput and reduce bandwidth requirements. The latter is an interesting field of application, not only for telecommunication but also for local data transfer between the CPU and the storage device. The choice of the compression algorithm is crucial for the overall performance; for this reason, part of this work considers which algorithm fits best in a particular situation. The goal of this thesis comprises the implementation of three different features. First, updating the existing lz4 algorithm enables support for the “acceleration” feature called lz4fast; trading compression ratio for compression speed increases write speed on fast storage devices such as SSDs. Second, an automatic decision procedure adapts the compression algorithms gzip (levels 1-9) and the newly updated lz4 to the current environment in order to maximize utilization of the CPU and the storage device. Performance is improved compared to no compression but depends highly on the hardware setup; on powerful hardware the algorithm successfully adapts to the optimum. The third and last feature enables the user to select a desired file-write throughput. Scheduling is implemented by delaying and prioritizing incoming requests; compression is thereby adjusted so as not to impair the selected requirements while still reducing storage space and bandwidth demand. By preferring “fast” files over “slow” files - high throughput over low throughput - the average turnaround time is reduced while maintaining the average compression ratio.
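
A simplified way to reason about when which algorithm wins (an illustration, not a formula from the thesis): with compression throughput T_comp, raw device throughput T_dev, and compression ratio r (uncompressed over compressed size), the effective write throughput is roughly

```latex
T_\text{write} \approx \min\left(T_\text{comp},\; r \cdot T_\text{dev}\right)
```

so a fast algorithm like lz4fast wins on fast devices, where T_comp is the bottleneck, while gzip's higher r wins on slow devices.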

Thesis BibTeX

Support for external data transformation in ZFS

Author: Niklas Behrmann
Type: Master's Thesis
Advisors: Dr. Michael Kuhn, Anna Fuchs
Date: 2017-04-06
Abstract: While the computational power of high-performance computing systems doubled every two years over the last 50 years, as predicted by Moore's law, the same was not true for storage speed and capacity. Compression has become a useful technique to bridge the increasing performance and scalability gap between computation and input/output (I/O). For that reason some local file systems like ZFS support transparent compression of data. For parallel distributed file systems like Lustre, which is frequently used in supercomputers, this approach does not exist. The Intel Parallel Computing Centers (IPCC) for Lustre file system project is aiming for compression support in Lustre at multiple levels. The IPCCs are universities, institutions, and labs whose primary focus is to modernize applications to increase parallelism and scalability. A prior thesis started the implementation of online compression with the compression algorithm LZ4 in Lustre, focusing on increasing throughput performance. The data is compressed on the client side and sent compressed to the server. However, this compression potentially leads to bad read performance. This problem might be solved by modifying the ZFS file system, which is utilized by Lustre servers as a backend file system. ZFS already has compression functionality integrated, which provides good read performance for compressed data. The idea is to make use of this and store Lustre's data in ZFS as if it had been compressed by ZFS. Therefore a new interface that takes the necessary information has to be created; implementing it is the purpose of this thesis. The goal is to enable the Lustre compression to save space on disk and, most importantly, fix the bad read performance. Throughout this thesis the necessary modifications to ZFS are described. The main task is to provide ZFS with information about the compressed size and the uncompressed size of the data. Afterwards a possible implementation of the specified feature is presented. First tests indicate that data which is compressed by Lustre can be read efficiently by ZFS if provided with the necessary metadata.

Thesis BibTeX

Suitability analysis of Object Storage for HPC workloads

Author: Lars Thoms
Type: Bachelor's Thesis
Advisors: Dr. Michael Kuhn
Date: 2017-03-23
Abstract: This bachelor's thesis reviews the possibility of using an object storage system like Ceph Object Storage (RADOS), especially regarding its performance and its support for partial rewrites. Scientific high-performance computing produces large file objects whose metadata has to be quickly searchable. Object storage is a good solution here because it stores data efficiently with simple API calls, without the requirement to comply with the POSIX specification, whose interfaces are overloaded and not performant. Moreover, storing objects while separating their metadata into a search-efficient database increases search performance. Furthermore, objects are by definition supposed to be immutable, but when the RADOS API calls are used they are mutable and can be rewritten like on other file systems. In this thesis, I investigate whether objects can be rewritten segment-wise. Accordingly, I program a FUSE driver as a proof of concept and prepare a series of measurements to show performance and issues. Because objects are mutable, it is thereby possible to use Ceph as a normal file system. Unfortunately, the write performance of this driver was low (around 3 MiB/s). At the end, a design concept is given for an HPC application using a Ceph cluster in combination with a document-oriented database to store metadata.
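
The partial-rewrite capability rests on librados allowing writes at an offset inside an existing object. A minimal sketch (assuming a configured Ceph cluster and a pool named "hpc"; not the thesis driver):

```c
/* Overwrite a segment of an existing RADOS object in place. */
#include <rados/librados.h>

int rewrite_segment(const char *oid, const char *buf,
                    size_t len, uint64_t off)
{
    rados_t cluster;
    rados_ioctx_t io;
    int ret;

    rados_create(&cluster, NULL);          /* connect as default client */
    rados_conf_read_file(cluster, NULL);   /* default ceph.conf search */
    if ((ret = rados_connect(cluster)) < 0)
        goto out;
    if ((ret = rados_ioctx_create(cluster, "hpc", &io)) < 0)
        goto out;

    /* write len bytes at offset off inside the existing object */
    ret = rados_write(io, oid, buf, len, off);

    rados_ioctx_destroy(io);
out:
    rados_shutdown(cluster);
    return ret;
}
```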

Thesis BibTeX

Extracting Semantic Relations from Wikipedia using Spark

Author: Hans Ole Hatzel
Type: Bachelor's Thesis
Advisors: Dr. Julian Kunkel
Date: 2017-02-02
Abstract: In this work, the full text of both the German and the English Wikipedia was used for two subtasks: finding compound words and finding semantic associations of words. The approach to the first task was to find all nouns in the Wikipedia and evaluate which of them form compounds with any other nouns that were found. PySpark was used to work through the whole Wikipedia dataset, and the performance of the part-of-speech tagging operation on the whole dataset was good. In this way, a huge list of nouns was created which could then be checked for compound words. As this involved checking each noun against every other noun, the performance was not acceptable, with the analysis of the whole English Wikipedia taking over 200 hours. The data generated from the first subtask was then used for both generating and solving CRA tasks. CRA tasks could be generated at a large scale and were solved with an accuracy of up to 33%. The second subtask was able to cluster words based on their semantics. It was established that this clustering works to some extent and that the vectors representing the words therefore have some legitimacy. The second subtask's results could be used to perform further analysis on how the difficulty of CRA tasks depends on how words are related to each other.

Thesis BibTeX

2016

Energy usage analysis of HPC applications

Author: Tim Jammer
Type: Bachelor's Thesis
Advisors: Dr. Hermann Lenhart, Dr. Michael Kuhn
Date: 2016-12-06
Abstract: The importance of the energy consumption of large-scale computer systems will grow in the future, as it is not only a huge cost factor but also makes cooling these systems harder. In order to gain experience with the energy consumed by model simulations, I analyze the energy consumed by the ECOHAM North Sea ecosystem model to deduce which parts of the application use the most energy. First, the influence of the energy measurement on the application will be discussed. It is important to keep this influence in mind, as one wants to know the energy usage of the unchanged application, so that the gathered insights are transferable to the application when it is running without the energy measurement. Furthermore, my thesis will provide an overview of the energy needed by the different phases of the application. A focus is placed on the serial section where the output is written. The busy waiting implemented by the MPI implementation leads to an increased energy consumption; without this busy waiting the application needs about 4 percent less energy. Therefore, I propose that the programmer of an MPI application should be able to choose which MPI calls should perform non-busy waiting.
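
The non-busy-waiting idea can be emulated in application code today. A small sketch (illustrative, not from the thesis): replace a spinning MPI_Wait with a polling loop that yields the CPU, trading a little latency for lower energy use during long waits.

```c
#include <mpi.h>
#include <unistd.h>

/* Complete a pending request without busy waiting: poll with MPI_Test
 * and sleep between polls instead of spinning at full CPU load. */
void wait_lazily(MPI_Request *req)
{
    int done = 0;
    while (!done) {
        MPI_Test(req, &done, MPI_STATUS_IGNORE);
        if (!done)
            usleep(1000); /* sleep 1 ms instead of spinning */
    }
}
```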

BibTeX

Adaptive Selection of Lossy Compression Algorithms Using Machine Learning

Author: Armin Schaare
Type: Bachelor's Thesis
Advisors: Dr. Julian Kunkel, Anastasiia Novikova
Date: 2016-11-29
Abstract: The goal of this thesis was to evaluate machine learning models' suitability as an automatic decision mechanism for compression algorithms. Their task would be to predict which compression algorithms perform best on what kind of data. For this, artificially generated data and its compression behavior were analyzed, producing a benchmark of different features upon which machine learning models could be trained. The models' goal was to predict the compression and decompression throughput of the algorithms. Additionally, models had to correctly attribute data to the algorithm producing the best compression ratios. The machine learning approaches under consideration were Linear Models, Decision Trees and the trivial Mean Value Model as a comparison baseline. It was found that Decision Trees performed significantly better than Linear Models, which in turn were slightly better than the Mean Value approach. Nevertheless, even Decision Trees did not produce a satisfying result which could be reliably used for practical applications.

Thesis BibTeX

Evaluation von alternativen Speicherszenarien für hierarchische Speichersysteme

Author: Marc Perzborn
Type: Bachelor's Thesis
Advisors: Dr. Julian Kunkel
Date: 2016-10-31
Abstract: The goal of this bachelor's thesis was to verify the correctness of the simulation program FeO and to improve it. To this end, various scenarios were simulated. The results largely confirm the assumptions. Information stored in the cache can be delivered faster than information that is not cached. With few installed drives, read requests for non-cached information have to wait when every drive is busy. The storage management of a full cache works flawlessly; a cache with free space, however, does not behave as a real system would. The processing times for requests for non-cached information vary when different components of the tape archive are changed, for example the generation of the drives, the number of drives in the tape archive, or the bandwidth of components.

Thesis BibTeX

Quality Control of Meteorological Time-Series with the Aid of Data Mining

Author: Jennifer Truong
Type: Master's Thesis
Advisors: Dr. Julian Kunkel
Date: 2016-10-30
Abstract: This thesis discusses quality control in the meteorological field and, in particular, optimizes it through the adjustment and construction of an automated pipeline for the quality checks. Three different kinds of pipelines are developed in this thesis: the most general one focuses on high error detection with a low false-positive rate; a categorizing pipeline is also designed, which classifies the data as “good”, “bad” and “doubtful”; furthermore, a fast fault-detection pipeline is derived from the general pipeline to make it possible to react nearline to hardware failures. The thesis describes general fundamentals of meteorological relationships, statistical analysis and quality control for meteorology. After that, the approach is guided by the development of the automated pipeline. Meteorological measurements and their corresponding quality controls were explored in order to optimize them. Besides optimizing existing quality controls, new automated tests are developed within this thesis. The evaluation of the designed pipeline shows that its quality depends on the input parameters: the more information the input provides, the better the pipeline works. The specialty of the pipeline, however, is that it works with any kind of input, so it is not limited to strict input parameters.
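
As a generic illustration of the kind of automated check such a pipeline chains together (not a test from the thesis; the thresholds are invented), consider a simple range and spike test:

```c
#include <math.h>

typedef enum { QC_GOOD, QC_DOUBTFUL, QC_BAD } qc_flag_t;

/* Flag a temperature sample using its neighbors in the time series. */
qc_flag_t check_temperature(double prev, double cur, double next)
{
    if (cur < -90.0 || cur > 60.0)        /* physically implausible (°C) */
        return QC_BAD;
    double spike = fabs(cur - (prev + next) / 2.0);
    if (spike > 10.0)                     /* sudden jump: flag for review */
        return QC_DOUBTFUL;
    return QC_GOOD;
}
```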

Thesis BibTeX

MPI-3 algorithms for 3D radiative transfer on Intel Xeon Phi coprocessors

Author: Jannek Squar
Type: Master's Thesis
Advisors: Peter Hauschildt, Dr. Michael Kuhn
Date: 2016-10-20
Abstract: One-sided communication was added to the MPI standard with MPI-2 in 1997 and has been greatly extended with the introduction of MPI-3 in 2012. Even though one-sided communication offers many use cases from which an application could benefit, it has so far only been used sporadically in HPC. The objective of this thesis is to examine its potential for replacing an OpenMP section with equivalent code that only makes use of MPI. This is done based on an existing application named PHOENIX, which is currently developed at Hamburg Observatory and has been designed to be executed on HPC systems. Its purpose is, among other things, to numerically solve the equations of 3D radiative transfer for stellar objects. To utilize HPC hardware at its full capacity, PHOENIX makes use of MPI and OpenMP. In the course of this thesis a test application has been constructed which mimics the OpenMP sections and allows benchmarking diverse combinations of MPI one-sided communication operations. The benchmarks are performed on an Intel Xeon Phi Knights Corner and an Intel Xeon Phi Knights Landing to estimate whether a certain approach is suitable for HPC hardware in general. In the end, each approach is discussed, and it is assessed which kind of communication pattern might benefit most from MPI one-sided communication.
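
A minimal example of the communication style being benchmarked (generic MPI-3 usage, not PHOENIX code; run with at least two ranks): one rank updates another rank's memory through a window, without the target posting a receive.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf[4] = {0};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Win win;
    MPI_Win_create(buf, sizeof(buf), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);          /* open access epoch */
    if (rank == 0) {
        int data[4] = {1, 2, 3, 4};
        /* write into rank 1's window; rank 1 makes no matching call */
        MPI_Put(data, 4, MPI_INT, 1, 0, 4, MPI_INT, win);
    }
    MPI_Win_fence(0, win);          /* close epoch: put is now visible */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```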

Thesis BibTeX URL

Suitability Analysis of GPUs and CPUs for Graph Algorithms

Author: Kristina Tesch
Type: Bachelor's Thesis
Advisors: Dr. Michael Kuhn
Date: 2016-09-27
Abstract: Throughout the last years, the trend in HPC has been towards heterogeneous cluster architectures that make use of accelerators to speed up computations. For this purpose, many current HPC systems are equipped with Graphics Processing Units (GPUs), which deliver a high floating-point performance that is important to accelerate compute-intensive applications. This thesis aims to analyze the suitability of CPUs and GPUs for graph algorithms, which can be classified as data-intensive applications. These types of applications perform fewer computations per data element and strongly rely on fast memory access. The analysis is based on two multi-node implementations of the Graph500 benchmark, which execute a number of breadth-first searches (BFS) on a large-scale graph. To enable a fair comparison, the same parallel BFS algorithm has been implemented for both the CPU and the GPU version. The final evaluation includes not only the performance results but also the programming effort that was necessary to achieve them, as well as cost and energy efficiency. Comparable performance results have been found for both versions of Graph500, but a significant difference in the programming effort has been detected. The main reason for the high programming effort of the GPU implementation is that complex optimizations are necessary to achieve acceptable performance in the first place; these require detailed knowledge of the GPU hardware architecture. All in all, the results of this thesis lead to the conclusion that the higher energy efficiency and, depending on the point of view, cost efficiency of the GPUs do not outweigh the lower programming effort for the implementation of graph algorithms on CPUs.
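
For reference, one level of a level-synchronous BFS over a CSR graph, the traversal pattern Graph500 exercises (a compact illustration, not the thesis implementation):

```c
/* CSR graph: neighbors of v are col_idx[row_ptr[v] .. row_ptr[v+1]-1].
 * depth[] must be initialized to -1 for unvisited vertices. */
int bfs_level(const int *row_ptr, const int *col_idx,
              const int *frontier, int frontier_len,
              int *next, int *depth, int level)
{
    int next_len = 0;
    for (int i = 0; i < frontier_len; i++) {
        int v = frontier[i];
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; e++) {
            int w = col_idx[e];
            if (depth[w] < 0) {        /* unvisited */
                depth[w] = level + 1;
                next[next_len++] = w;  /* becomes next frontier */
            }
        }
    }
    return next_len; /* caller swaps frontier and next, repeats */
}
```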

Thesis BibTeX

Leistungs- und Genauigkeitsanalyse numerischer Löser für Differentialgleichungen

Author: Joel Graef
Type: Bachelor's Thesis
Advisors: Fabian Große, Dr. Michael Kuhn
Date: 2016-09-12
Abstract: This bachelor's thesis addresses the question of whether higher-order methods for solving differential equations are always better suited for use in numerical models than lower-order ones. The question is examined using four solvers, applied to two different differential equations and to an NPD (nutrient-phytoplankton-detritus) model, which describes a simplified marine ecosystem. First, some background on solvers for differential equations is presented, covering single-step and multi-step methods; in particular, the methods used here are treated: Euler, Heun, second-order Adams-Bashforth (AB2) and fourth-order Runge-Kutta (RK4). In the performance analysis, the methods are compared with regard to their accuracy and runtime. Additionally, a step-size control is presented that reduces the step size when the approximation deviates from the analytical solution and increases it again after a certain interval. Both with and without step-size control, the method of highest order (RK4) achieved the best runtime. Using the NPD model, the methods, with the exception of AB2, are analyzed as well. It turns out that using the Heun, AB2 and RK4 methods instead of the Euler method does not pay off for the model. The decisive factor is the choice of the step size, which depends on the accuracy of the methods: accuracy is increased through additional computation steps, which in turn allows choosing a coarser time step. With the NPD model, however, the computation time for these additional steps exceeds the computation time saved by the coarser time step. Since, for example, no step-size control was implemented in the model, there remain further starting points for improving the runtime.
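
For reference, the classical fourth-order Runge-Kutta step (standard textbook form) for y' = f(t, y) with step size h, whose four evaluations of f per step are the "additional computation steps" traded against a coarser time step:

```latex
\begin{aligned}
k_1 &= f(t_n,\, y_n) \\
k_2 &= f\bigl(t_n + \tfrac{h}{2},\; y_n + \tfrac{h}{2} k_1\bigr) \\
k_3 &= f\bigl(t_n + \tfrac{h}{2},\; y_n + \tfrac{h}{2} k_2\bigr) \\
k_4 &= f\bigl(t_n + h,\; y_n + h\, k_3\bigr) \\
y_{n+1} &= y_n + \tfrac{h}{6}\,(k_1 + 2k_2 + 2k_3 + k_4)
\end{aligned}
```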

Thesis BibTeX

Performanceanalyse der Ein-/Ausgabe des Ökologiemodells ECOHAM5

Author: Simon Kostede
Type: Master's Thesis
Advisors: Dr. Michael Kuhn, Fabian Große, Dr. Hermann Lenhart
Date: 2016-08-22
Abstract: The goal of this thesis is the analysis of the input/output (I/O) of the ecosystem model ECOHAM5. ECOHAM5 is a parallel HPC program parallelized with MPI; it writes NetCDF files as simulation results. As with many earth system and climate models, ECOHAM5 only performs serial I/O, which severely limits scaling. For the analysis, parallel I/O was implemented in ECOHAM5, and its performance was measured and analyzed. ECOHAM5 is an earth system model that simulates the ecology of the North Sea. The model is used to investigate questions of carbon flux in the North Sea in the context of climate change, as well as the effects of different loads on the North Sea ecosystem through nutrient inputs of nitrogen and phosphorus. For this purpose, the North Sea is divided into a three-dimensional grid, and numerical differential equations are solved for a set of state variables in every grid cell. The model domain of the ECOHAM grid covers the northwest European continental shelf (NECS) and parts of the adjacent northeast Atlantic. ECOHAM5 is implemented in Fortran and uses MPI for parallel execution with multiple processes, each of which participates in computing the simulation. In the original version of ECOHAM5, the simulation results are stored with NetCDF by a single process/compute node, the master node. This serial I/O was examined in several ways in this thesis: the implementation was analyzed statically based on the source code, and the execution was measured and evaluated with the tracing tool Vampir/Score-P. For its I/O, ECOHAM5 uses the libraries MPI, MPI-IO, HDF5 and NetCDF. On the test system with 10 compute nodes, the new version of ECOHAM5 with parallel I/O could not outperform the version with serial I/O; instead, it was roughly 15% to 25% slower.
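
The parallel-output pattern replacing gather-to-master can be sketched as follows (in C for brevity; ECOHAM5 itself is Fortran, and this is generic MPI-IO, not the thesis code): every rank writes its own slice collectively.

```c
#include <mpi.h>

#define LOCAL_N 1024

int main(int argc, char **argv)
{
    int rank;
    double slice[LOCAL_N];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < LOCAL_N; i++)
        slice[i] = rank;             /* dummy simulation result */

    MPI_File_open(MPI_COMM_WORLD, "result.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* each rank writes at its own offset; all ranks participate */
    MPI_Offset off = (MPI_Offset)rank * LOCAL_N * sizeof(double);
    MPI_File_write_at_all(fh, off, slice, LOCAL_N, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```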

Thesis BibTeX

Untersuchung von Interaktiven Analyse- und Visualisierungsumgebungen im Browser für NetCDF-Daten

Author: Sebastian Rothe
Type: Master's Thesis
Advisors: Dr. Julian Kunkel
Date: 2016-07-21
Abstract: Simulation and measurement results of climate models nowadays often comprise large amounts of data, which can be stored, for example, in NetCDF files as special data structures. Analyzing these results usually requires complex and powerful systems that allow the user to present the mass of simulation results clearly, for example in tabular form or as graphical representations. Modern cloud systems offer the user the possibility to store results and make them available worldwide, for example over the internet. However, this approach has the drawback that the entire result file must first be retrieved from the cloud system before it can be analyzed. This thesis examines an alternative approach in which the user can run initial analyses through a web application on server-side tools, whose results can then be visualized in the web browser. This application, called ReDaVis (Remote Data Visualizer), is based on the software systems OpenCPU and h5serv. The preliminary analyses operate on small subsets of the data; they are meant to indicate whether more detailed analyses of the complete dataset are worthwhile. It is examined to what extent existing tools can already implement this approach. Some of these components are then used and complemented with custom components to develop a software prototype of the presented approach. To this end, theoretical foundations are first explained in more detail and then used to combine the employed components into a web application. Besides visualization techniques for the graphical representation of the datasets, the application also supports applying several consecutive functions to a dataset in the form of a pipeline. It is shown to what extent the different combinations of components can work together, or are unsuitable due to limitations on the software and hardware level, or do not perform well enough compared to widely used alternatives.

Thesis BibTeX

Automation of manual code optimization via DSL-directed AST-manipulation

Author: Jonas Gresens
Type: Bachelor's Thesis
Advisors: Dr. Julian Kunkel
Date: 2016-06-27
Abstract: Program optimization is a crucial step in the development of performance-critical applications, but due to its complexity it can in most cases only be carried out manually. The substantial structural changes to the source code reduce readability and maintainability and complicate the ongoing development of the applications. The objective of this thesis is to examine the advantages and disadvantages of an AST-based solution to the conflicting relationship between performance and structural code quality of a program. For this purpose a prototype is developed that automates usually manual optimizations based on instructions by the user. The thesis covers the design and implementation as well as the evaluation of the prototype for use as a tool in software development. As a result, this thesis shows the general usability of the AST-based approach and the need for further investigation.
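
An example of the kind of manual optimization such a tool is meant to automate (illustrative only; not the prototype's DSL or output):

```c
#include <stddef.h>

/* before: straightforward reduction */
double sum_plain(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* after: 4x unrolled variant (n assumed divisible by 4), the kind of
 * structural change that hurts readability when done by hand */
double sum_unrolled(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i += 4)
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    return sum;
}
```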

Thesis BibTeX

Client-Side Data Transformation in Lustre

Author: Anna Fuchs
Type: Master's Thesis
Advisors: Dr. Michael Kuhn
Date: 2016-05-25
Abstract: Due to the increasing gap between computation power and storage speed and capacity, compression techniques for compensating the I/O bottleneck become more urgent than ever. Although some file systems already support compression, none of the distributed ones do. Lustre is a widely used parallel distributed file system in the HPC area, which so far can only profit from ZFS backend compression. Along with archival needs to reduce storage space, network throughput can also benefit from compression on the client side. Userspace benchmarks showed that compression can increase throughput by a factor of up to 1.2 while halving the required storage space. This thesis primarily aims to analyze the suitability of compression for the Lustre client and to introduce online compression based on stripes. This purpose places certain demands on the compression algorithm to be used: slow algorithms can have adverse effects and decrease the system's overall performance. A higher compression ratio at the expense of lower speed can nevertheless be worthwhile due to the sharply reduced amount of data to be transferred. LZ4 is one of the fastest compression algorithms and a good candidate to be used on the fly. A prototype of LZ4 fast compression within a Lustre client is presented for a limited number of use cases. In the course of the design, different approaches are discussed with regard to transparency and avoidance of code duplication. Finally, some ideas for adaptive compression, client hints and server-side support are presented.

BibTeX

Modeling and Simulation of Tape Libraries for Hierarchical Storage Management Systems

Author: Jakob Lüttgau
Type: Master's Thesis
Advisors: Dr. Julian Kunkel
Date: 2016-04-09
Abstract: The wide variety of storage technologies (SRAM, NVRAM, NAND, disk, tape, etc.) results in deep storage hierarchies being the only feasible choice to meet performance and cost requirements when dealing with vast amounts of data. In particular, long-term storage systems employed by scientific users mainly rely on tape storage, as it is still the most cost-efficient option even 40 years after its invention in the mid-seventies. Current archival systems are often only loosely integrated into the remaining HPC storage infrastructure. However, data analysis tasks require integration into the scratch storage systems, and with the rise of exascale systems and in-situ analysis, burst buffers are also likely to require integration with the archive. Unfortunately, exploring new strategies and developing open software for tape archive systems is a hurdle due to the lack of affordable storage silos, the resulting lack of availability outside of large organizations, and the increased caution required when dealing with ultra-durable data. Eliminating some of these problems by providing virtual storage silos should enable community-driven innovation and enable site operators to add features where they see fit, while being able to verify strategies before deploying them on test or production systems. The thesis assesses modern tape systems and puts their development over time into perspective. Subsequently, different models for the individual components in tape systems are developed. The models are then implemented in a prototype using discrete event simulation. It is shown that the simulation can be used to approximate the behavior of tape systems deployed in the real world and to conduct experiments without requiring a physical tape system.
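
The core of any discrete event simulation is a time-ordered event queue. A minimal skeleton (far simpler than the thesis prototype; the event names and latencies are invented):

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct event {
    double time;
    const char *what;
    struct event *next;
} event_t;

static event_t *queue = NULL;

/* insert an event, keeping the queue sorted by time */
static void schedule(double time, const char *what)
{
    event_t *e = malloc(sizeof(*e)), **p = &queue;
    e->time = time;
    e->what = what;
    while (*p && (*p)->time <= time)
        p = &(*p)->next;
    e->next = *p;
    *p = e;
}

int main(void)
{
    schedule(0.0, "mount request");
    schedule(12.5, "robot moves tape");          /* assumed latency */
    schedule(47.0, "drive loaded, I/O starts");  /* assumed latency */

    while (queue) {              /* main simulation loop */
        event_t *e = queue;
        queue = e->next;
        printf("t=%6.1fs %s\n", e->time, e->what);
        free(e);                 /* handlers could schedule follow-ups */
    }
    return 0;
}
```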

Thesis Presentation BibTeX

2015

Vorhersage von E/A-Leistung im Hochleistungsrechnen unter der Verwendung von neuronalen Netzen

Author: Jan Fabian Schmid
Type: Bachelor's Thesis
Advisors: Dr. Julian Kunkel
Date: 2015-12-17
Abstract: Predicting the runtime of file accesses on a high-performance computer is important for the development of analysis tools that support scientists in using the given resources efficiently. In this bachelor's thesis, the parallel file system of a high-performance computer is analyzed, and several approaches to modeling input/output performance are developed and tested using artificial neural networks. The developed neural networks achieve smaller deviations from the actual access times than linear models when predicting access times. It turns out that the decisive factor for a good model of the I/O system is the ability to distinguish between similar file accesses that nevertheless lead to different access times. The runtime differences between file accesses with identical call parameters can be explained by different processing inside the system. Since these processing paths are neither known nor derivable from directly measurable attributes, predicting access times proves to be a non-trivial task. One approach is to exploit periodic behavior patterns of the system in order to predict the processing path of an access; however, using this periodic behavior for more accurate predictions turns out to be difficult. To approximate the processing paths, this bachelor's thesis introduces a method in which the residuals of a model are used to create classes which, in turn, should correlate with the processing paths. The analysis of these classes yields evidence of their connection to the processing paths, and models using these class assignments are able to make considerably more accurate predictions than other models.

Thesis Presentation BibTeX

Advanced Data Transformation and Reduction Techniques in ADIOS

Author: Tim Alexander Dobert
Type: Bachelor's Thesis
Advisors: Dr. Michael Kuhn
Date: 2015-10-07
Abstract: Because of the slow improvement of storage hardware, compression has become very important for high-performance computing. Efficient strategies that provide a good compromise between computational overhead and compression ratio have been developed in recent years. However, when data reduction is used, usually a single strategy is applied to the whole system. These solutions generally do not take advantage of the structure within files, which is often known beforehand. This thesis explores several data transformation techniques that can take advantage of patterns within certain types of data to improve compression results. Specific examples are developed, and their applications, strengths and weaknesses are discussed. With an array of transformations to choose from, users can make the best choice for each file type, leading to an overall reduction of space. To make this usable in an HPC environment, the transforms are implemented in an I/O library. ADIOS is chosen for this as it provides an easy way to configure I/O parameters and metadata, as well as an extensible framework for transparent on-the-fly data transformations. The prototyping and implementation process of the transformations is detailed, and their effectiveness is tested and evaluated on scientific climate data. The results show that the transforms are quite powerful in theory but do not have a great effect on real data. While not improving compression results, the discrete cosine transformation is worthwhile on its own, providing an option to sacrifice accuracy for size reduction.
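
For reference, the discrete cosine transform mentioned above, in its standard DCT-II form applied per data block; dropping small high-frequency coefficients is what trades accuracy for size:

```latex
X_k = \sum_{n=0}^{N-1} x_n \cos\!\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right)k\right],
\qquad k = 0, \dots, N-1
```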

BibTeX

Static Code Analysis of MPI Schemas in C with LLVM

Author: Alexander Droste
Type: Bachelor's Thesis
Advisors: Dr. Michael Kuhn
Date: 2015-09-25
Abstract: This thesis presents MPI-Checker, a static analysis checker for MPI code written in C, based on Clang's Static Analyzer. The checker works with path-sensitive as well as non-path-sensitive analysis, the latter being purely based on information provided by the abstract syntax tree representation of the source code. MPI-Checker's AST-based checks verify correct type usage in MPI functions and the utilization of collective communication operations, and provide experimental support for verifying whether point-to-point function calls have a matching partner. Its path-sensitive checks verify aspects of nonblocking communication based on the usage of MPI requests, which are tracked by a symbolic representation of their memory region in the course of symbolic execution. The thesis elucidates the parts of the LLVM/Clang API relevant for MPI-Checker and shows how the implementation is integrated into the architecture. Furthermore, the basics of MPI are explained. MPI-Checker introduces only negligible overhead on top of the Clang Static Analyzer core and is able to detect critical bugs in real-world codebases, which is shown by evaluating analysis results for the open source projects AMG2013 and OpenFFT.
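
An illustrative instance of the request-tracking defect class (a made-up example, not from the thesis or its evaluation): a request is reused before the first nonblocking operation completed.

```c
#include <mpi.h>

void exchange(double *buf, int n, int peer)
{
    MPI_Request req;
    MPI_Isend(buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);
    /* BUG: req is overwritten while the send may still be in flight */
    MPI_Irecv(buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE); /* only the recv is ever waited on */
}
```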

BibTeX

Automatisches Lernen der Leistungscharakteristika von Paralleler Ein-/Ausgabe

Author: Eugen Betke
Type: Master's Thesis
Advisors: Dr. Julian Kunkel
Date: 2015-06-27
Abstract: Performance analysis and optimization have been necessary steps in quality assurance and optimization cycles since the beginning of electronic data processing; they help to create high-quality, performant software. Especially in the HPC field this topic is highly relevant because of rising software complexity. Performance analysis tools simplify and accelerate this process considerably: they present the internal processes understandably and provide hints for possible improvements. Their further development, and the development of new methods, is therefore essential for this field. The goal of this thesis is to investigate whether I/O operations can be assigned automatically to the correct cache type with the help of machine learning. For this purpose, methods based on CART decision trees and the k-means algorithm are developed and examined. The hoped-for results were not achieved this way, so finally the causes are identified and discussed.

Thesis Presentation BibTeX

Evaluation of performance and productivity metrics of potential programming languages in the HPC environment

Author: Florian Wilkens
Type: Bachelor's Thesis
Advisors: Dr. Michael Kuhn, Sandra Schröder
Date: 2015-04-28
Abstract: This thesis aims to analyze new programming languages in the context of high-performance computing (HPC). In contrast to many other evaluations, the focus is not only on performance but also on developer productivity metrics. The two new languages Go and Rust are compared with C, as it is one of the two languages commonly used in HPC next to Fortran. The basis for the evaluation is a shortest-path calculation on real-world geographical data, parallelized for shared-memory concurrency. An implementation of this concept was written in all three languages to compare multiple productivity and performance metrics like execution time, tooling support, memory consumption and development time across different phases. Although the results are not comprehensive enough to invalidate C as a leading language in HPC, they clearly show that both Rust and Go offer tremendous productivity gains compared to C with similar performance. Additional work is required to further validate these results; future research topics are listed at the end of the thesis.

BibTeX

Dynamically Adaptable I/O Semantics for High Performance Computing

Author: Michael Kuhn
Type: PhD Thesis
Advisors: Prof. Dr. Thomas Ludwig
Date: 2015-04-27
Abstract: File systems as well as libraries for input/output (I/O) offer interfaces that are used to interact with them, albeit on different levels of abstraction. While an interface's syntax simply describes the available operations, its semantics determines how these operations behave and which assumptions developers can make about them. There are several different interface standards in existence, some of them dating back decades and having been designed for local file systems; one such representative is POSIX. Many parallel distributed file systems implement a POSIX-compliant interface to improve portability. Its strict semantics is often relaxed to reach maximum performance which can lead to subtly different behavior on different file systems. This, in turn, can cause application misbehavior that is hard to track down. All currently available interfaces follow a fixed approach regarding semantics, making them only suitable for a subset of use cases and workloads. While the interfaces do not allow application developers to influence the I/O semantics, applications could benefit greatly from the possibility of being able to adapt them to their requirements. The work presented in this thesis includes the design of a novel I/O interface called JULEA. It offers support for dynamically adaptable semantics and is suited specifically for HPC applications. The introduced concept allows applications to adapt the file system behavior to their exact I/O requirements instead of the other way around. The general goal is an interface that allows developers to specify what operations should do and how they should behave - leaving the actual realization and possible optimizations to the underlying file system. Due to the unique requirements of the proposed interface, a prototypical file system is designed and developed from scratch. The new I/O interface and file system prototype are evaluated using both synthetic benchmarks and real-world applications. This ensures covering both specific optimizations made possible by the file system's additional knowledge as well as the applicability for existing software. Overall, JULEA provides data and metadata performance comparable to that of other established parallel distributed file systems. However, in contrast to the existing solutions, its flexible semantics allows it to cover a wider range of use cases in an efficient way. The results demonstrate that there is need for I/O interfaces that can adapt to the requirements of applications. Even though POSIX facilitates portability, it does not seem to be suited for contemporary HPC demands. JULEA presents a first approach of how application-provided semantical information can be used to dynamically adapt the file system's behavior to the applications' I/O requirements.

Thesis BibTeX URL

Adaptive Compression for the Zettabyte File System

Author: Florian Ehmke
Type: Master's Thesis
Advisors: Dr. Michael Kuhn
Date: 2015-02-24
Abstract: Although many file systems nowadays support compression, lots of data is still written to disks uncompressed. The reason for this is the overhead created when compressing the data, a CPU-intensive task. Storing uncompressed data is expensive, as it requires more disks which have to be purchased and subsequently consume more energy. Recent advances in compression algorithms have yielded algorithms that meet all requirements for a compression-by-default scenario (LZ4, LZJB). The new algorithms are so fast that it is indeed faster to compress-and-write than to just write data uncompressed. However, algorithms such as gzip still yield much higher compression ratios at the cost of a higher overhead. In many use cases the compression speed is not as important as saving disk space: on an archive used for backups, the (de-)compression speed does not matter as much as in a folder where some calculation stores intermediate results which will be used again in the next iteration of the calculation. The perfect solution would know what the user wants and choose the best algorithm for every file individually. The Zettabyte File System (ZFS) is a modern file system with built-in compression support. It supports four different compression algorithms by default (LZ4, LZJB, gzip and ZLE). ZFS already offers some flexibility regarding compression, as different algorithms can be selected for different datasets (mountable, nested file systems). The major purpose of this thesis is to demonstrate how adaptive compression in the file system can be used to benefit from strong compression algorithms like gzip while avoiding, if possible, the performance penalties they bring along. Therefore, in the course of this thesis ZFS's compression capabilities will be extended to allow more flexibility when selecting a compression algorithm. The user will be able to choose a use case for a dataset, such as archive, performance or energy. In addition to that, two features will be implemented. The first feature will allow the user to select a compression algorithm for a specific file type and use case; file types will be identified by the extension of the file name. The second feature will regularly test blocks for compressibility with different algorithms; the winning algorithm of that test will be used until the next test is scheduled. Depending on the selected use case, parameters during the tests are weighted differently.

Thesis BibTeX

Optimization of non-contiguous MPI-I/O Operations

Author: Enno David Zickler
Type: Bachelor's Thesis
Advisors: Dr. Julian Kunkel
Date: 2015-01-29
Abstract: High-performance computing is essential for most science departments, and the possibilities expand with increasing computing resources. Lately, data storage has become more and more important, but the development of storage devices cannot keep up with that of processing units. Data rates and latencies in particular improve only slowly, so efficiency has become an important topic of research. Programs using MPI can become more efficient by using more information about the file system. In this thesis, advanced algorithms for the optimization of non-contiguous MPI-I/O operations are developed by considering well-known system specifications like data rate, latency, block and stripe alignment, maximum buffer size, and the impact of read-ahead mechanisms. Access patterns combined with these parameters lead to an adaptive data sieving for non-contiguous I/O operations. The parametrization can be done with machine learning concepts, which provide the best parameters even for unknown access patterns. The result is a new library called NCT, which provides view-based access to non-contiguous data at POSIX level. The access can be optimized by data sieving algorithms whose behavior can easily be modified thanks to the modular design of NCT. Existing data sieving algorithms were implemented and evaluated with this modular design. Hence, the user is able to create new advanced data sieving algorithms using any parameters considered useful. The evaluation shows many cases in which such an algorithm improves performance.
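
For context, the standard MPI-IO way of expressing a non-contiguous access, which is where optimizations like data sieving apply (generic MPI-IO, not the NCT library):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Datatype pattern;
    double buf[10];

    MPI_Init(&argc, &argv);
    /* 10 blocks of one double, each 4 doubles apart: a strided view */
    MPI_Type_vector(10, 1, 4, MPI_DOUBLE, &pattern);
    MPI_Type_commit(&pattern);

    MPI_File_open(MPI_COMM_SELF, "data.bin", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, pattern, "native", MPI_INFO_NULL);
    /* reads the 10 strided doubles as if they were contiguous; the MPI
     * library may read one large block and sieve out the holes */
    MPI_File_read(fh, buf, 10, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Type_free(&pattern);
    MPI_Finalize();
    return 0;
}
```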

Thesis Presentation BibTeX

2014

Performance Evaluation of Data Encryption in File Systems -- Benchmarking ext4, ZFS, LUKS and eCryptfs

Author: Hajo Möller
Type: Bachelor's Thesis
Advisors: Dr. Michael Kuhn, Konstantinos Chasapis
Date: 2014-12-16
Abstract: It has become important to reliably protect stored digital data, both against becoming inaccessible and against becoming available to third parties. Using a file system which guarantees data integrity protects against data loss; disk encryption protects against data breaches. Encryption is still thought to incur a large performance penalty when accessing the data. This thesis evaluates different approaches to data encryption using low-power hardware and open-source software, with a focus on the advanced file system OpenZFS, which features excellent protection against data loss but does not include encryption. It is shown that encryption using LUKS beneath ZFS is a viable method of gaining data protection, especially when using hardware-accelerated encryption algorithms. Using a low-power server CPU with native AES instructions, ZFS as the file system and LUKS for encryption of the block device permits ensuring data integrity and protection at a low cost.

BibTeX

Implementierung und Leistungsanalyse numerischer Algorithmen zur Methode der kleinsten Quadrate

Author: Niklas Behrmann
Type: Bachelor's Thesis
Advisors: Petra Nerge, Dr. Michael Kuhn
Date: 2014-12-16
Abstract: The method of least squares is the standard approach for solving fitting problems: an overdetermined system of equations is solved in order to determine the unknown parameters of a function as accurately as possible. In this bachelor's thesis, the method is considered within the harmonic analysis of tides. A program is available in which the least squares problem has so far been solved with the help of a library. This thesis aims to provide a custom implementation, since the existing one relies on IBM's ESSL library, which is not available on all systems. Specifically, two approaches are considered: solving the Gaussian normal equations via the Cholesky decomposition, and the QR decomposition via Householder transformations. Both are implemented and, together with the LAPACK software library, integrated into the program. The performance analysis shows that the Cholesky-based implementation achieves the better runtimes while preserving the results.
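
For reference, the two standard solution routes compared here (textbook formulation, not thesis-specific): for an overdetermined system Ax ≈ b, the least squares solution minimizes the residual norm and satisfies the normal equations,

```latex
\min_x \|Ax - b\|_2
\quad\Longleftrightarrow\quad
A^{\mathsf{T}} A\, x = A^{\mathsf{T}} b ,
```

which the Cholesky approach solves by factoring the symmetric matrix as A^T A = L L^T, while the QR approach factors A = QR and solves R x = Q^T b.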

BibTeX

Einsatz von Beschleunigerkarten für das Postprocessing großer Datensätze

Author: Jannek Squar
Type: Bachelor's Thesis
Advisors: Petra Nerge, Dr. Michael Kuhn
Date: 2014-12-03
Abstract: This bachelor's thesis addresses the question of whether the use of accelerator cards is advantageous for the post-processing of large datasets. This question is examined using a Xeon Phi card and the program Harmonic Analysis, which performs a harmonic analysis on the output data of an ocean simulation. First, the distinguishing features and the different operating modes of the Xeon Phi card (native mode, offload mode and symmetric mode) are presented; the structure of Harmonic Analysis is also described in order to clarify where the Xeon Phi card can be employed. Initial problems already emerge here, since the various operating modes require adaptations of the libraries used. Harmonic Analysis is then reworked so that parts of the program are loaded onto the card in offload mode and executed there; the possibility of vectorization is also examined, since each core of the Xeon Phi card has a large vector unit. In the performance analysis, the program runtime is compared for different start parameters. In the end, however, it has to be concluded that using the Xeon Phi card did not pay off for Harmonic Analysis, since the achieved performance regarding efficiency, cost and absolute runtime improvement is worse with the Xeon Phi card than without it. Since this bachelor's thesis could not yet exhaust all possibilities, possible starting points for future work are listed.
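
The offload mode works roughly as follows (a minimal sketch using Intel's offload pragmas; it requires the Intel compiler and a Knights Corner card, and is not the actual Harmonic Analysis code):

```c
#include <stdio.h>

#define N 1000000

int main(void)
{
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

    /* transfer a and b to the coprocessor, run the loop there,
     * and copy c back to the host afterwards */
    #pragma offload target(mic) in(a, b) out(c)
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }

    printf("c[42] = %f\n", c[42]);
    return 0;
}
```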

Thesis BibTeX URL

Comparison of kernel and user space file systems

Author: Kira Isabel Duwe
Type: Bachelor's Thesis
Advisors: Dr. Michael Kuhn
Date: 2014-08-28
Abstract: A file system is part of the operating system and defines an interface between the OS and the computer's storage devices. It is used to control how the computer names, stores and organizes files and directories. Due to many different requirements, such as efficient usage of the storage, a great variety of approaches arose. The most important ones run in the kernel, as this was the only way for a long time. In 1994, developers came up with an idea that would allow mounting a file system in user space. The FUSE (Filesystem in Userspace) project was started in 2004 and implemented in the Linux kernel by 2005. This gives users the opportunity to write their own file system without editing kernel code, thereby avoiding licence problems; additionally, FUSE offers a stable library interface. It is implemented as a loadable kernel module, and due to its design all operations have to pass through the kernel multiple times. The additional data transfer and the context switches cause some overhead, which is analysed in this thesis. A basic overview is given of how exactly a file system operation takes place and which mount options for a FUSE-based system result in better performance. To this end, the relevant operating system internals are explained, along with a detailed presentation of kernel file system mechanisms such as the system call. This enables a comparison of kernel file systems, such as tmpfs and ZFS, with user space file systems, such as memfs and ZFS-FUSE. This thesis shows that kernel version 3.16 offers great improvements for every file system analysed: the metadata operations even of a file system like tmpfs improved by up to 25%. The write-back cache has an enormous impact, increasing the write performance of memfs from about 220 MB/s to 2,600 MB/s, a factor of 12. All in all, the performance of the FUSE-based file systems improved dramatically, making user space file systems an alternative to native kernel file systems, although they still cannot keep up in every aspect.
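
A minimal FUSE file system, to illustrate the path every operation takes through the kernel to a userspace process (a standard hello-world style sketch using FUSE's high-level C API, not code from the thesis):

```c
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <sys/stat.h>
#include <string.h>
#include <errno.h>

static const char *msg = "hello from userspace\n";

static int hello_getattr(const char *path, struct stat *st)
{
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
    } else if (strcmp(path, "/hello") == 0) {
        st->st_mode = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size = strlen(msg);
    } else {
        return -ENOENT;
    }
    return 0;
}

static int hello_read(const char *path, char *buf, size_t size,
                      off_t off, struct fuse_file_info *fi)
{
    (void)fi;
    if (strcmp(path, "/hello") != 0)
        return -ENOENT;
    size_t len = strlen(msg);
    if ((size_t)off >= len)
        return 0;
    if (off + size > len)
        size = len - off;
    memcpy(buf, msg + off, size);  /* serve the read from userspace */
    return (int)size;
}

static struct fuse_operations ops = {
    .getattr = hello_getattr,
    .read    = hello_read,
};

int main(int argc, char **argv)
{
    return fuse_main(argc, argv, &ops, NULL); /* mounts and runs the loop */
}
```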

Thesis BibTeX URL

Optimization and parallelization of the post-processing of a tidal simulation

AuthorDominik Rupp
TypeBachelor's Thesis
AdvisorsPetra Nerge, Dr. Michael Kuhn
Date2014-04-25
AbstractThe fields of oceanography and climate simulation in general strongly rely on information technology, and high performance computing provides the hardware and software needed to process complex climate computations. An employee of the work group Scientific Computing of the University of Hamburg implemented an application that performs post-processing on input data from a simulation of global ocean tides. The post-processing gains further insight into that data by performing complex calculations: it uses large NetCDF input files to execute a demanding harmonic analysis and finally produces visualizable output. The program is analyzed and evaluated for its suitability for use on a cluster computer. This is achieved by examining it with tracing tools and finding routines that exhibit great potential for parallel execution. An initial estimate of the program's maximum speed-up is also determined by applying Amdahl's law. Further, a parallelization approach is chosen and implemented, and the results are analyzed and compared to prior expectations and evaluations. It turns out that a hybrid parallelization, which speeds up calculations and input/output using OpenMP and MPI, achieves a speed-up of 13.0 compared to the original serial program on the cluster computer of the work group Scientific Computing. Finally, open issues resulting from this thesis are highlighted as future work. The mathematical background in the appendix discusses and differentiates the closely related terms harmonic analysis and Fourier analysis.
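
For reference, the initial speed-up estimate mentioned above follows Amdahl's law (the numbers below are illustrative, not the thesis's measured values):

    S(n) = \frac{1}{(1 - p) + p/n}

where p is the parallelizable fraction of the runtime and n the number of processing units. For example, p = 0.95 caps the achievable speed-up at 1/(1 - p) = 20 no matter how many cores are used.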

BibTeX

Halbautomatische Überprüfung von kollektiven MPI-Operationen zur Identifikation von Leistungsinkonsistenzen

AuthorSebastian Rothe
TypeBachelor's Thesis
AdvisorsDr. Julian Kunkel
Date2014-04-09
AbstractComputer simulations are increasingly used today to carry out scientific experiments in virtual environments. To reduce execution times, parallel programs are developed and run on compute clusters. Programs distributed across several computer systems typically use the MPI standard (Message Passing Interface) to exchange messages between the machines. Due to the complex structure of compute clusters, however, the available hardware is often not used optimally, leaving optimization potential that can be exploited to further reduce application runtimes. Performance analyses form the basis for uncovering weak spots in the system or in the MPI implementations used, so that they can later be optimized. This thesis covers the development of the analysis tool pervm (performance validator for MPI), which concentrates on examining MPI's collective operations in order to uncover performance inconsistencies. The theoretical foundations are explained and then used to describe the interplay of the tool's components. The execution of pervm is divided into a measurement phase and an evaluation phase. The tool can measure the execution times of the actual MPI operation as well as of various algorithms that describe differently efficient ways of realising a collective operation. Besides analysing these measurements, the evaluation phase can also simulate the theoretical execution time of an algorithm on a given system based on its performance characteristics. These capabilities provide numerous starting points for identifying performance bottlenecks. It is shown to what extent conclusions about the algorithm used can be drawn from the behavior of a collective MPI operation. Reference algorithms with shorter execution times than the MPI operation point to further inconsistencies in the implementation of the MPI library used.
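
To make the idea of a reference algorithm concrete, the sketch below times the library's MPI_Bcast against a hand-rolled binomial-tree broadcast, the kind of comparison the measurement phase performs. This is illustrative code, not the pervm sources.

    /* Illustrative sketch (not the pervm sources): time the library's
     * MPI_Bcast against a hand-rolled binomial-tree broadcast, the kind
     * of reference algorithm the evaluation phase compares against. */
    #include <mpi.h>
    #include <stdio.h>

    static void binomial_bcast(void *buf, int count, MPI_Datatype type,
                               int root, MPI_Comm comm)
    {
        int rank, size, mask;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int rel = (rank - root + size) % size;   /* rank relative to root */

        for (mask = 1; mask < size; mask <<= 1)  /* receive once ... */
            if (rel & mask) {
                MPI_Recv(buf, count, type, (rel - mask + root) % size, 0,
                         comm, MPI_STATUS_IGNORE);
                break;
            }
        for (mask >>= 1; mask > 0; mask >>= 1)   /* ... then forward */
            if (rel + mask < size)
                MPI_Send(buf, count, type, (rel + mask + root) % size, 0,
                         comm);
    }

    int main(int argc, char **argv)
    {
        double buf[1024] = { 0 };
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        MPI_Bcast(buf, 1024, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        MPI_Barrier(MPI_COMM_WORLD);
        double t2 = MPI_Wtime();
        binomial_bcast(buf, 1024, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        double t3 = MPI_Wtime();

        if (rank == 0)
            printf("library bcast: %g s, binomial reference: %g s\n",
                   t1 - t0, t3 - t2);
        MPI_Finalize();
        return 0;
    }

If the hand-rolled reference consistently beats the library call, that hints at an inconsistency in the MPI implementation's algorithm selection.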

Thesis BibTeX

An in-depth analysis of parallel high level I/O interfaces using HDF5 and NetCDF-4

AuthorChristopher Bartz
TypeMaster's Thesis
AdvisorsKonstantinos Chasapis, Dr. Michael Kuhn, Petra Nerge
Date2014-04-07
AbstractScientific applications store data in various formats. HDF5 and NetCDF-4 are data formats which are widely used in the scientific community. They are surrounded by high-level I/O interfaces which provide retrieval and manipulation of data. The parallel execution of applications is a key factor regarding performance. Previous evaluations have shown that high-level I/O interfaces such as NetCDF-4 and HDF5 can exhibit suboptimal I/O performance depending on the application's access patterns. In this thesis we investigate how the parallel versions of the HDF5 and NetCDF-4 interfaces behave when using Lustre as the underlying parallel file system. The I/O is performed in a layered manner: NetCDF-4 uses HDF5, and HDF5 uses MPI-IO, which itself uses POSIX to perform the I/O. To discover inefficiencies and bottlenecks, we analyse the complete I/O path while using different access patterns and I/O configurations. We use IOR for our analysis, a configurable benchmark that generates I/O patterns and is well known in the parallel I/O community; we modify IOR in order to fulfil our analysis needs. We distinguish between two general access patterns in our evaluation: disjoint and interleaved. Disjoint means that each process accesses a contiguous region in the file, whereas interleaved means access to a non-contiguous region. The results show that neither the disjoint nor the interleaved access outperforms the other in every case, but when using the interleaved access in a certain configuration, results near the theoretical maximum are realised. We provide best practices for choosing the right I/O configuration depending on the needs of the application in the last chapter. The NetCDF-4 interface does not provide a feature to align the data section to particular address boundaries, which is a significant disadvantage regarding performance. We provide an implementation and re-evaluation of this feature and observe a clear performance improvement. When using NetCDF-4 or HDF5, the data can be broken into pieces called chunks which are stored at independent locations in the file. We present and evaluate an optimised implementation for determining the default chunk size in the NetCDF-4 interface. Beyond that, we reveal an error in the NetCDF-4 implementation and provide the correct solution.
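
The two tuning knobs discussed above, alignment and chunking, look as follows in the plain HDF5 C API. This is a minimal sketch; the file name, alignment values and chunk sizes are made up for illustration.

    /* Sketch of the two tuning knobs discussed above, via the HDF5 C API:
     * aligning data sections in the file and choosing an explicit chunk
     * size. Values are illustrative, not the thesis's configuration. */
    #include <hdf5.h>

    int main(void)
    {
        /* Align every object >= 1 MiB at 1 MiB boundaries (e.g. a Lustre
         * stripe size); NetCDF-4 did not expose this knob at the time. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_alignment(fapl, 1 << 20, 1 << 20);
        hid_t file = H5Fcreate("out.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* Store the dataset in independently located 256x256 chunks. */
        hsize_t dims[2] = { 4096, 4096 }, chunk[2] = { 256, 256 };
        hid_t space = H5Screate_simple(2, dims, NULL);
        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 2, chunk);
        hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_FLOAT, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);

        H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space);
        H5Fclose(file); H5Pclose(fapl);
        return 0;
    }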

Thesis BibTeX

Analyse und Optimierung von nicht-zusammenhängende Ein-/Ausgabe in MPI

AuthorDaniel Schmidtke
TypeBachelor's Thesis
AdvisorsDr. Julian Kunkel, Michaela Zimmer
Date2014-04-07
AbstractThe goal of this thesis is to evaluate the potential of data sieving and to make it usable in optimizations. To this end, the following objectives are defined: 1. systematic analysis of the achievable performance, 2. transparent optimization, 3. context-sensitive optimization.
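
Data sieving in ROMIO, the MPI-IO implementation used by most MPI libraries, is controlled through hints. The sketch below shows how such an optimization can be toggled for an evaluation; the file name and buffer size are illustrative.

    /* Illustrative sketch: toggling ROMIO's data sieving for
     * non-contiguous I/O via MPI-IO hints. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Info info;
        MPI_Info_create(&info);
        /* ROMIO-specific hints: enable data sieving for reads and bound
         * the size of the intermediate sieving buffer. */
        MPI_Info_set(info, "romio_ds_read", "enable");
        MPI_Info_set(info, "ind_rd_buffer_size", "4194304");

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "data.bin",
                      MPI_MODE_RDONLY, info, &fh);
        /* ... non-contiguous accesses via a derived datatype go here ... */
        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }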

Thesis BibTeX

Automatic Analysis of a Supercomputer's Topology and Performance Characteristics

AuthorAlexander Bufe
TypeBachelor's Thesis
AdvisorsDr. Julian Kunkel
Date2014-03-18
AbstractAlthough knowing the topology and performance characteristics of a supercomputer is very important, as it allows for optimisations and helps to detect bottlenecks, no universal tool to determine topology and performance characteristics is available yet. Existing tools are often specialised to analyse either the behaviour of a single node or the network topology. Furthermore, existing tools are unable to detect switches despite their importance. This thesis introduces a universal method to determine the topology (including switches) and an efficient way to measure the performance characteristics of the connections. The approach of the developed tool is to measure the latencies first and then to compute the topology by analysing the data. In the next step, the gained knowledge of the topology is used to parallelise the measurement of the throughput in order to decrease the required time or to allow for more accurate measurements. A general approach, based on linear regression, to calculate latencies of connections that cannot be measured directly is introduced as well. Finally, the developed algorithm and measurement techniques are validated on several test cases and a perspective on future work is given.
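
A minimal sketch of the first step, measuring the pairwise latency matrix with a ping-pong between every pair of processes, from which the topology is then inferred. This is illustrative code, not the thesis's tool.

    /* Illustrative sketch: measure one-way latency for every process pair
     * with a simple ping-pong; the resulting matrix is the input for the
     * topology inference described above. */
    #include <mpi.h>
    #include <stdio.h>

    #define REPEATS 1000

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        char byte = 0;

        for (int a = 0; a < size; a++) {
            for (int b = a + 1; b < size; b++) {
                MPI_Barrier(MPI_COMM_WORLD); /* measure one pair at a time */
                if (rank == a) {
                    double t0 = MPI_Wtime();
                    for (int i = 0; i < REPEATS; i++) {
                        MPI_Send(&byte, 1, MPI_CHAR, b, 0, MPI_COMM_WORLD);
                        MPI_Recv(&byte, 1, MPI_CHAR, b, 0, MPI_COMM_WORLD,
                                 MPI_STATUS_IGNORE);
                    }
                    /* one-way latency = half the round-trip time */
                    printf("%d <-> %d: %g us\n", a, b,
                           (MPI_Wtime() - t0) / REPEATS / 2 * 1e6);
                } else if (rank == b) {
                    for (int i = 0; i < REPEATS; i++) {
                        MPI_Recv(&byte, 1, MPI_CHAR, a, 0, MPI_COMM_WORLD,
                                 MPI_STATUS_IGNORE);
                        MPI_Send(&byte, 1, MPI_CHAR, a, 0, MPI_COMM_WORLD);
                    }
                }
            }
        }
        MPI_Finalize();
        return 0;
    }

Measuring pairs one at a time keeps the network quiet during each sample; the parallelisation described in the abstract then exploits the inferred topology to measure non-interfering pairs concurrently.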

Thesis BibTeX

Flexible Event Imitation Engine for Parallel Workloads

AuthorJakob Lüttgau
TypeBachelor's Thesis
AdvisorsDr. Julian Kunkel
Date2014-03-18
AbstractEvaluating systems and optimizing applications in high-performance computing (HPC) is a tedious task. Trace files, which are already commonly used to analyse and tune applications, also serve as a good approximation to reproduce the workloads of scientific applications. The thesis presents design considerations and discusses a prototype implementation of a flexible tool to mimic the behavior of parallel applications by replaying trace files. In the end it is shown that a plugin-based replay engine is able to replay parallel workloads that use MPI and POSIX I/O. It is further demonstrated how automatic trace manipulation in combination with the replay engine can be used as a virtual laboratory.
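
The plugin-based replay idea can be sketched as a dispatch table mapping traced event types to handler functions. The toy trace, event names and handlers below are invented for illustration and are not the thesis's engine.

    /* Illustrative sketch of a plugin-based replay loop: each traced event
     * type is dispatched to a registered handler that reproduces it. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* One simplified trace record: an event type plus one parameter. */
    struct record { const char *op; long arg; };

    typedef void (*handler_t)(const struct record *);

    static int out_fd = -1;

    static void replay_write(const struct record *r)
    {
        static char buf[1 << 16];
        long left = r->arg;             /* reproduce the traced I/O volume */
        while (left > 0) {
            ssize_t w = write(out_fd, buf, left < (long)sizeof(buf)
                                           ? (size_t)left : sizeof(buf));
            if (w <= 0) break;
            left -= w;
        }
    }

    static void replay_compute(const struct record *r)
    {
        usleep((useconds_t)r->arg);     /* preserve the traced compute time */
    }

    /* The "plugins": handlers registered per event type. */
    static const struct { const char *op; handler_t fn; } plugins[] = {
        { "write",   replay_write },
        { "compute", replay_compute },
    };

    int main(void)
    {
        out_fd = open("/dev/null", O_WRONLY);
        /* A two-event toy trace; a real engine would parse trace files. */
        struct record trace[] = { { "compute", 1000 }, { "write", 4096 } };
        for (size_t i = 0; i < sizeof(trace) / sizeof(*trace); i++)
            for (size_t p = 0; p < sizeof(plugins) / sizeof(*plugins); p++)
                if (strcmp(trace[i].op, plugins[p].op) == 0)
                    plugins[p].fn(&trace[i]);
        close(out_fd);
        return 0;
    }

Manipulating the trace records before replay (scaling sizes, reordering events) is what turns such an engine into the "virtual lab" mentioned above.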

Thesis BibTeX

2013

Design, Implementation, and Evaluation of a Low-Level Extent-Based Object Store

AuthorSandra Schröder
TypeMaster's Thesis
AdvisorsDr. Michael Kuhn
Date2013-12-18
AbstractAn object store is a low-level abstraction of storage. Instead of providing a block-level view of a storage device, an object store allows access in a more abstract way, namely via objects. Sitting on top of a storage device, it is responsible for storage management. It can be used as a stand-alone light-weight file system when only basic storage management is necessary; moreover, it can act as a supporting layer for full-featured file systems in order to lighten their management overhead. Only a few object store solutions exist, and they are not well suited for these use cases: for example, no user interface is provided, or they are too difficult to use. The development of some object stores has ceased, so that the code of the implementation is no longer available. That is why a new object store addressing these problems is needed. In this thesis a low-level, extent-based object store is designed and implemented. It is able to perform fully functional storage management. For this, appropriate data structures are designed, for example so-called inodes and extents; these are file system concepts adapted to the object store design and implementation. The object store uses the memory-mapping technique for memory management. This technique maps a device into the virtual address space of a process, promising efficient access to the data. An application programming interface is designed to allow easy use and integration of the object store. This interface provides two features, namely synchronization and transactions. Transactions allow batching several input/output requests into one operation; synchronization ensures that data is written immediately after the write request. The object store implementation is object-oriented: each data structure constitutes a programming unit consisting of a set of data types and methods. The performance of the object store is evaluated and compared with well-known file systems. It shows excellent performance results, although it is only a prototype. The transaction feature is found to be efficient: it increases the write performance by a factor of 50 when synchronization of data is activated. It especially outperforms the other file systems concerning metadata performance. A high metadata performance is a crucial criterion when the object store is used as a supporting storage layer in the context of parallel file systems.
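
A minimal sketch of the memory-mapping technique described above, with an extent as an (offset, length) pair inside the mapping and msync standing in for the synchronization feature. This is illustrative code, not the thesis's object store; note that msync requires a page-aligned address.

    /* Illustrative sketch: map a backing file, treat an extent as an
     * (offset, length) pair inside the mapping, and flush it explicitly. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 1 << 20;              /* 1 MiB demo store */
        int fd = open("store.img", O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, (off_t)len) != 0) return 1;

        /* Map the whole store; the page cache now backs all accesses. */
        char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) return 1;

        /* Write an object's payload into an extent at offset 4096
         * (page-aligned, as msync requires an aligned address). */
        size_t ext_off = 4096, ext_len = 512;
        memcpy(base + ext_off, "object payload", 15);

        /* Synchronization feature: flush this extent's dirty pages now. */
        msync(base + ext_off, ext_len, MS_SYNC);

        munmap(base, len);
        close(fd);
        return 0;
    }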

Thesis BibTeX

Automated File System Correctness and Performance Regression Tests

AuthorAnna Fuchs
TypeBachelor's Thesis
AdvisorsDr. Michael Kuhn
Date2013-09-23
AbstractTo successfully manage big software projects with many developers involved, every development step should be continuously verified. Automated and integrated test procedures save much effort, reduce the risk of errors and enable a much more efficient development process, since the effects of every development step are continuously available. In this thesis the testing and analysis of the parallel file system JULEA are automated. All of these processes are integrated and linked to the version control system Git, so that every committed change triggers a test run. For this, the concept of Git hooks is used, by means of which testing can be included in the common development workflow. Scientific projects in particular suffer from a lack of careful, high-quality test mechanisms. Not only correctness is relevant, which forms the base for any further changes, but also the performance trend: a significant criterion for the quality of a parallel file system is its efficiency. Performance regressions caused by changes can crucially affect the further course of development, so it is important to draw conclusions about the temporal behavior immediately after every considerable development step. The trends and results have to be evaluated and analyzed carefully, and the best way to convey this kind of information is graphically. The goal is to generate simple but meaningful graphics from test results, which help to improve the quality of the development process and the final product; ideally the visualization is available on a web site for more comfortable use. Moreover, to abstract from the specific project, the test system is portable and universal enough to be integrated into any project versioned with Git. Finally, some tests in need of improvement were located using this framework.

Thesis BibTeX URL

Evaluating Distributed Database Systems for Use in File Systems

AuthorRoman Michel
TypeBachelor's Thesis
AdvisorsDr. Michael Kuhn
Date2013-09-18
AbstractAfter a short introduction to NoSQL and file systems, this thesis looks at several popular NoSQL database management systems as well as some younger approaches, contrasting their similarities and differences. Based on those analogies and recent developments, a selection of those databases is analysed further, with a focus on scalability and performance. The main part of this analysis is the setup of multiple instances of these databases, reviewing and comparing the setup process, as well as developing and evaluating benchmarks. The benchmarks focus on access patterns commonly found in file systems, which are identified in the course of this thesis.

BibTeX

Simulation of Parallel Programs on Application and System Level

AuthorJulian Kunkel
TypePhD Thesis
AdvisorsProf. Dr. Thomas Ludwig
Date2013-07-30
AbstractComputer simulation revolutionizes traditional experimentation by providing a virtual laboratory. The goal of high-performance computing is the fast execution of applications, since this enables rapid experimentation. The performance of parallel applications can be improved by increasing either the capability of the hardware or the execution efficiency. In order to increase the utilization of hardware resources, a rich variety of optimization strategies is implemented in both hardware and software layers. The interactions of these strategies, however, result in very complex systems. This complexity makes assessing and understanding the measured performance of parallel applications in real systems exceedingly difficult.
To help in this task, in this thesis an innovative event-driven simulator for MPI-IO applications and underlying heterogeneous cluster computers is developed which can help us to assess measured performance. The simulator allows conducting MPI-IO application runs in silico, including the detailed simulations of collective communication patterns, parallel I/O and cluster hardware configurations. The simulation estimates the upper bounds for expected performance and therewith facilitates the evaluation of observed performance.
In addition to the simulator, the comprehensive tracing environment HDTrace is presented. HDTrace offers novel capabilities in analyzing parallel I/O. For example, it allows the internal behavior of MPI and the parallel file system PVFS to be traced. While PIOsimHD replays traced behavior of applications on arbitrary virtual cluster environments, in conjunction with HDTrace it is a powerful tool for localizing inefficiencies, conducting research on optimizations for communication algorithms, and evaluating arbitrary and future systems.
This thesis is organized according to a systematic methodology which aims at increasing insight into complex systems: the background and related-work sections offer valuable analyses of parallel file systems, performance factors of parallel applications, the Message Passing Interface, and the state of the art in optimization and discrete-event simulation. The behavior of the memory, network and I/O subsystems is assessed for our working group's cluster system, demonstrating the problems of characterizing hardware. One important insight of this analysis is that, due to interactions between hardware characteristics and existing optimizations, performance does not follow common probability distributions, leading to unpredictable behavior of individual operations.
The hardware models developed for the simulator rely on just a handful of characteristics and implement only a few optimizations. However, a careful qualification and validation demonstrates that the developed models explain real-world phenomena with high accuracy. Comprehensive experiments illustrate how simulation aids in localizing bottlenecks in the parallel file system, MPI and hardware, and how it fosters understanding of system behavior. Additional experiments demonstrate the suitability of the novel tools for developing and evaluating alternative MPI and I/O algorithms. With its power to assess the performance of clusters running up to 1,000 processes, PIOsimHD serves as a virtual laboratory for studying system internals.
In summary, the combination of the enhanced tracing environment and a novel simulator offers unprecedented insights into interactions between application, communication library, file system and hardware.

Thesis BibTeX URL

Design and Evaluation of Tool Extensions for Power Consumption Measurement in Parallel Systems

AuthorTimo Minartz
TypePhD Thesis
AdvisorsProf. Dr. Thomas Ludwig
Date2013-07-03
AbstractIn an effort to reduce the energy consumption of high performance computing centers, a number of new approaches have been developed in the last few years. One of these approaches is to switch hardware to lower power states in promising parallel application phases. A test cluster is designed with high performance computing nodes supporting multiple power saving mechanisms comparable to mobile devices. Each of the nodes is connected to power measurement equipment to investigate the power saving potential of the specific hardware under different load scenarios. However, statically switching the power saving mechanisms usually increases the application runtime; as a consequence, no energy is saved. Contrary to static switching strategies, dynamic switching strategies consider the hardware usage in the application phases to switch between the different modes without increasing the application runtime. Even though the concepts are already quite clear, tools to identify application phases and to determine the impact on performance, power and energy are still rare. This thesis designs and evaluates tool extensions for power consumption measurement in parallel systems, with the final goal of characterizing and identifying energy-efficiency hot spots in scientific applications. Using offline tracing, the metrics are collected in trace files and can be visualized or post-processed after the application run. The timeline-based visualization tools Sunshot and Vampir are used to correlate parallel applications with the energy-related metrics. With these tracing and visualization capabilities, it is possible to evaluate the quality of energy-saving mechanisms, since waiting times in the application can be related to hardware power states. Using the energy-efficiency benchmark eeMark, typical hardware usage patterns are identified to characterize the workload, the impact on the node power consumption and finally the potential for energy saving. To exploit the developed extensions, four scientific applications are analyzed to evaluate the whole approach. Appropriate phases of the parallel applications are manually instrumented to reduce the power consumption, with the final goal of saving energy for the whole application run on the test cluster. This thesis provides a software interface for the efficient management of the power saving modes per compute node, to be exploited by application programmers. All analyzed applications consist of several different calculation-intensive compute phases and have a considerable power and energy-saving potential which cannot be exhausted by traditional, utilization-based mechanisms implemented in the operating system. Reducing the processor frequency in communication and I/O phases can also yield remarkable savings for the presented applications.

Thesis BibTeX URL

Evaluation of Different Storage Backends and Technologies for MongoDB

AuthorJohann Weging
TypeBachelor's Thesis
AdvisorsDr. Michael Kuhn
Date2013-02-28
AbstractToday's database management systems store their data in conventional file systems. Some of them allocate files of multiple gigabytes in size and handle data alignment themselves; in theory, these database management systems can work with just a contiguous region of storage for their database files. This thesis attempts to reduce the overhead produced by file operations by implementing an object store backend for a database management system: MongoDB as the database management system and JZFS as the object store, which works on top of the ZFS file system. The main question is whether an object store is really capable of reducing the I/O overhead of MongoDB. While developing the new storage backend, however, it was discovered that the implementation is too extensive for a bachelor's thesis; the development is documented up to this point, and finishing and evaluating the backend remains future work. After the implementation was deemed too extensive, the focus moved to file system benchmarking using the metadata benchmark mdtest. It covers the file systems ext4, XFS, btrfs and ZFS on different hardware setups: every file system was benchmarked on an HDD and an SSD, and ZFS was additionally benchmarked on an HDD using an SSD as read and write cache. It turns out that ZFS still suffers from serious metadata performance bottlenecks. Surprisingly, the HDD with the SSD cache performs nearly as well as ZFS on a pure SSD setup. Btrfs performs quite well; oddly, it sometimes performs better on the HDD than on the SSD, and when creating files or directories it outperformed the other file systems by far. Ext4 does not seem to scale with multiple threads accessing shared data: the performance mostly stays the same or sometimes even drops, and only with two threads does the performance increase for some operations. XFS performed quite well in most test cases; the only odd case was reading directory stats, where one thread on the HDD was faster than one thread on the SSD, and performance dropped rapidly when increasing the thread count on the HDD. Future work would be to identify the bottlenecks that slow ZFS down in every case except file and directory removal.

BibTeX

2012

Effiziente Verarbeitung von Klimadaten mit ParStream

AuthorMoritz Lahn
TypeBachelor's Thesis
AdvisorsDr. Julian Kunkel
Date2012-06-28
AbstractIn cooperation with ParStream GmbH, this thesis investigates to what extent the database developed by ParStream can be used for more efficient processing of climate data. For the evaluation of climate data, scientists often use the Climate Data Operators (CDO) program, a collection of many operators for analysing data produced by climate simulations and earth system models. Evaluation with this program is very time-consuming. This motivates the use of the ParStream database, which can process queries against a large data set in parallel and very efficiently, using its own column-oriented bitmap index and a compressed index structure. Faster data retrieval opens up new possibilities in the area of real-time analysis, which are helpful for the interactive visualization of climate data. This thesis examines which CDO operators can be implemented with the ParStream database, and some operators are implemented for demonstration purposes. The performance benefits are verified through tests and show a more efficient processing of climate data with ParStream: for some operators, ParStream delivers results between 2x and 20x faster than the CDO program. A further result of classifying the CDO operators is that most operations can be mapped directly to SQL.
The industry partner does not consent to the publication of the PDF.

BibTeX

Energy-Aware Instrumentation of Parallel MPI Applications

AuthorFlorian Ehmke
TypeBachelor's Thesis
AdvisorsProf. Dr. Thomas Ludwig, Timo Minartz
Date2012-06-25
AbstractEnergy consumption in High Performance Computing has become a major topic, and various approaches to improve performance per watt have been developed. One way is to instrument an application with instructions that change the idle and performance states of the hardware. The major purpose of this thesis is to demonstrate the potential savings by instrumenting parallel message passing applications. For successful instrumentation, critical regions in terms of performance and power consumption have to be identified. Most scientific applications can be divided into phases that utilize different parts of the hardware. The goal is to conserve energy by switching the hardware to different states depending on the workload in a specific phase. To identify those phases, two tracing tools are used. Two examples are instrumented: a parallel earth simulation model written in Fortran and a parallel partial differential equation solver written in C. Instrumented applications should consume less energy but may also show an increase in runtime; it is discussed whether such a compromise is worthwhile. The applications are analyzed and instrumented on two x64 architectures, and differences concerning runtime and power consumption are investigated.
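
A minimal sketch of such an instrumentation, assuming the Linux cpufreq sysfs interface (requires root). The paths and governor names are standard Linux; the thesis's actual instrumentation mechanism may differ.

    /* Illustrative sketch: lower the processor frequency around a
     * communication/I/O phase via the Linux cpufreq sysfs files. */
    #include <stdio.h>

    static void set_governor(int cpu, const char *gov)
    {
        char path[128];
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor",
                 cpu);
        FILE *f = fopen(path, "w");
        if (f) { fprintf(f, "%s\n", gov); fclose(f); }
    }

    int main(void)
    {
        set_governor(0, "powersave");    /* entering an I/O-bound phase */
        /* ... communication or I/O phase of the application ... */
        set_governor(0, "performance");  /* back to full speed to compute */
        return 0;
    }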

BibTeX

Replay Engine for Application Specific Workloads

AuthorJörn Ahlers
TypeBachelor's Thesis
AdvisorsDr. Julian Kunkel
Date2012-04-12
AbstractToday many tools exist which are related to the processing of workloads. Each has its specific area of use, yet despite their differences they have functions for creating and executing workloads in common. To create a new tool, all of these functions have to be implemented again, even when they have been implemented before in another tool. In this thesis a framework is designed and implemented that allows replaying application-specific workloads. This is realized through a modular system that allows existing modules to be reused when creating new tools, reducing development work. Additionally, a function is designed to generate parts of the modules from their function headers to further reduce this work; to improve the generation, semantic information can be added through comments to enable advanced behavior. To show that this approach works, examples are given that demonstrate the functionality and evaluate the overhead introduced by the software. Finally, additional work that could further improve this tool is outlined.

Thesis BibTeX

2011

Evaluation of File Systems and I/O Optimization Techniques in High Performance Computing

AuthorChristina Janssen
TypeBachelor's Thesis
AdvisorsDr. Michael Kuhn
Date2011-12-05
AbstractHigh performance computers are able to process huge datasets in a short period of time by allowing work to be done on many compute nodes concurrently. This workload often poses several challenges to the underlying storage devices. When possibly hundreds of clients from multiple nodes try to access the same files, those storage devices become bottlenecks and therefore a threat to performance. In order to make I/O as efficient as possible, it is important to make the best use of the given resources in a system. The I/O performance that can be achieved results from the cooperation of several factors: the underlying file system, the interface that connects application and file system, and the implementation; the best I/O performance is achieved when all of these factors work well together. In this thesis, an overview is given of how different file systems work, which access semantics and I/O interfaces exist, and how their cooperation, together with suitable I/O optimization techniques, can result in the best possible performance.

Thesis BibTeX URL

Energieeffizienz und Nachhaltigkeit im Hochleistungsrechnen am Beispiel des Deutschen Klimarechenzentrums

AuthorYavuz Selim Cetinkaya
TypeBachelor's Thesis
AdvisorsProf. Dr. Thomas Ludwig, Timo Minartz
Date2011-11-21

BibTeX

Estimation of Power Consumption of DVFS-Enabled Processors

AuthorChristian Seyda
TypeBachelor's Thesis
AdvisorsProf. Dr. Thomas Ludwig, Timo Minartz
Date2011-03-28
AbstractSaving energy is nowadays a critical factor, especially for data centers and high performance clusters, which have a power consumption of several megawatts. Simply using the components' energy saving mechanisms is not always possible, because this can lead to performance degradation; for this reason, high performance clusters mostly do not use them, even in low-utilization phases. Modelling the power consumption of a component based on specific recordable values can help to find ways of saving energy or to predict the power consumption after replacing the component with a more efficient one. One of the main power consumers in a recent system is the processor. This thesis presents a model of the power consumption of a processor based on its frequency and voltage. Comparisons with real-world power consumption were made to evaluate the model. Furthermore, a tracing library was extended to log the processor frequency and idle states where available. Using the presented model and the trace files, a power estimator has been implemented that is able to estimate the power consumption of the processor in a given trace file (or of a more energy-efficient processor), helping to motivate the use of power saving mechanisms and energy-efficient processors and showing the long-term potential for energy saving.
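
For context, such frequency/voltage models are commonly built around the CMOS dynamic-power relation, shown here in its generic textbook form (not necessarily the exact model of the thesis):

    P \approx P_{\text{static}} + C_{\text{eff}} \cdot V^2 \cdot f

where C_eff is the effective switched capacitance, V the supply voltage and f the clock frequency. Because the minimum stable voltage itself grows with frequency, the dynamic term scales roughly cubically in f, which is what makes frequency scaling attractive for saving energy.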

BibTeX

2010

Crossmedia File System MetaFS -- Exploiting Performance Characteristics from Flash Storage and HDD

AuthorLeszek Kattinger
TypeBachelor's Thesis
AdvisorsDr. Julian Kunkel, Olga Mordvinova
Date2010-03-23
AbstractUntil recently, the decision which storage device is most suitable in terms of cost, capacity, performance and reliability was an easy choice: only hard disk drives offered the requested properties. Nowadays, the rapid development of flash storage technology makes these devices competitive or even more attractive. The great advantage of flash storage is, apart from lower energy consumption and insensitivity to mechanical shocks, the much lower access time: compared with hard disks, flash devices can access data about a hundred times faster, which enables a significant performance benefit for random I/O operations. Unfortunately, HDDs at present provide much larger capacity at considerably lower prices than flash storage devices, and this does not seem likely to change in the near future. Considering also the widespread use of HDDs, the continuing increase in storage density and the associated increase in sequential I/O performance, the incentive to use HDDs will continue. For this reason, a way to combine both storage technologies seems beneficial. From the view of a file system, metadata is often small and accessed randomly, whereas a logical file might be large and is often accessed sequentially. Therefore, in this thesis a file system is designed and implemented which places metadata on a USB flash device and data on an HDD. The design also considers how metadata operations can be optimized for a cheap low-end USB flash device, which provides the fast access times typical of flash media but also characteristically low write rates, caused by the block-wise erase-before-write operating principle. All measured file systems show a performance drop for metadata updates on this kind of flash device compared with their behavior on an HDD. Therefore, the design focuses on mapping coherent logical name space structures (directories) close to physical media characteristics (blocks). To also check the impact of writes of data sizes equal to or smaller than the selected block size, the option to write only full blocks or current block fill rates was added. The file system was implemented in user space and operates through the FUSE interface. Despite the overhead this causes, the performance of write-associated metadata operations (like create/remove) was better than or equal to that of the file systems used for benchmark comparison.

Thesis BibTeX Sources

2009

Analyzing Metadata Performance in Distributed File Systems

AuthorChristoph Biardzki
TypePhD Thesis
AdvisorsProf. Dr. Thomas Ludwig
Date2009-01-19

Thesis BibTeX URL

Tracing Internal Behavior in PVFS

AuthorTien Duc Tien
TypeBachelor's Thesis
AdvisorsProf. Dr. Thomas Ludwig, Dr. Julian Kunkel
Date2009-10-05
AbstractNowadays scientific computations are often performed on large cluster systems because of the high performance they deliver. In such systems there are many possible causes of bottlenecks, related to both hardware and software. This thesis defines and implements metrics and information for tracing events in MPI applications in conjunction with the parallel file system PVFS, in order to localize bottlenecks and determine system behavior. They are useful for optimizing the system or applications. After tracing, data is stored in trace files and can be analyzed via the visualization tool Sunshot.
Two experiments are made in this thesis. The first is made on a balanced system: here Sunshot shows a balanced visualization across nodes, i.e. the load between nodes looks similar. In connection with this experiment, the new metrics and tracing information are discussed in detail using Sunshot. In contrast, the second experiment is made on an unbalanced system: here Sunshot shows where bottlenecks occurred and which components are involved.

Thesis BibTeX

Simulation-Aided Performance Evaluation of Input/Output Optimizations for Distributed Systems

AuthorMichael Kuhn
TypeMaster's Thesis
AdvisorsProf. Dr. Thomas Ludwig, Dr. Julian Kunkel
Date2009-09-30

Thesis BibTeX URL

Design and Implementation of a Profiling Environment for Trace Based Analysis of Energy Efficiency Benchmarks in High Performance Computing

AuthorStephan Krempel
TypeMaster's Thesis
AdvisorsProf. Dr. Thomas Ludwig, Dr. Julian Kunkel
Date2009-08-31

Thesis BibTeX

Model and simulation of power consumption and power saving potential of energy efficient cluster hardware

AuthorTimo Minartz
TypeMaster's Thesis
AdvisorsProf. Dr. Thomas Ludwig, Dr. Julian Kunkel
Date2009-08-27

Thesis BibTeX URL

2008

Ergebnisvisualisierung paralleler Ein/Ausgabe Simulation im Hochleistungsrechnen

AuthorAnton Ruff
TypeBachelor's Thesis
AdvisorsProf. Dr. Thomas Ludwig, Dr. Julian Kunkel
Date2008-05-31

BibTeX

2007

Container-Archiv-Format für wahlfreien effizienten Zugriff auf Dateien

AuthorHendrik Heinrich
TypeBachelor's Thesis
AdvisorsProf. Dr. Thomas Ludwig, Dr. Julian Kunkel
Date2007-09-30

Thesis BibTeX

Directory-Based Metadata Optimizations for Small Files in PVFS

AuthorMichael Kuhn
TypeBachelor's Thesis
AdvisorsProf. Dr. Thomas Ludwig, Dr. Julian Kunkel
Date2007-09-03

Thesis BibTeX URL

Towards Automatic Load Balancing of a Parallel File System with Subfile Based Migration

AuthorJulian Kunkel
TypeMaster's Thesis
AdvisorsProf. Dr. Thomas Ludwig
Date2007-08-02

Thesis BibTeX URL

Benchmarking of Non-Blocking Input/Output on Compute Clusters

AuthorDavid Büttner
TypeBachelor's Thesis
AdvisorsProf. Dr. Thomas Ludwig, Dr. Julian Kunkel
Date2007-04-24

Thesis BibTeX URL

2006

Tracing the Connections Between MPI-IO Calls and their Corresponding PVFS2 Disk Operations

AuthorStephan Krempel
TypeBachelor's Thesis
AdvisorsProf. Dr. Thomas Ludwig
Date2006-03-29

Thesis BibTeX URL

Performance Analysis of the PVFS2 Persistency Layer

AuthorJulian Kunkel
TypeBachelor's Thesis
AdvisorsProf. Dr. Thomas Ludwig
Date2006-02-15

Thesis BibTeX URL

2012

Parameterising primary production and convection in a 3D model

AuthorFabian Große
TypeDiploma Thesis
AdvisorsJan O. Backhaus, Johannes Pätsch
Date2012-05-16

BibTeX