Bright Spots: Team Experiences Implementing Continuous Integration

This is the first in an occasional series of Bright Spots blog articles, highlighting success stories of teams in the DOE Exascale Computing Project (ECP) who are implementing and benefitting from software development best practices as they build applications and software technologies for next-generation scientific discovery. If you'd like to contribute to a future Bright Spot article, please contact the BSSw Editorial Board.

Continuous Integration (CI) is an enabling technology that allows code development teams to insert testing directly into their development workflow. Although CI has many advantages, including more stable working code bases, faster test feedback with less distraction, and greater support for community contributions, there are also some pitfalls, like initial adoption and support burden, disruptive changes to team processes, and a potentially disorienting array of competing and overlapping new technologies.

How do these trade-offs influence the difficulty of adopting and maintaining CI automation? What’s the simplest way to get started? What kinds of team organizational and communication challenges can be expected?

To understand what has worked best, this article draws on interviews with three ECP development teams that have adopted continuous integration: ExaSMR, VisIt, and ZFP. The ExaSMR project develops simulations of small modular reactors that provide the engineering community with highly detailed benchmark datasets. VisIt is a highly scalable, turnkey application for data analysis, visualization, and movie making. ZFP is a compact library for lossy compression of multidimensional arrays of integer and floating-point data with controllable rate, precision, and accuracy.

Each of these teams has a unique story about how CI fits into their overall project goals, the evolution of their CI pipelines, and the challenges they faced along the way. The adopters of CI have all described its value as “obvious” in retrospect because of the numerous occasions where either testing or pull-request reviews have caught errors that would otherwise have gone undetected. ZFP put significant effort into developing tests, resulting in over 5,000 unit tests with nearly full code coverage. VisIt’s extensive nightly regression testing, combined with a permissive policy regarding breaking the mainline of development, meant that the project’s initial CI testing could be rather modest and fit within the time limits of freely available services. The project has seen a greater impact from formal reviews of pull requests. ExaSMR noted that having CI in place sets the expectation that new contributions come with corresponding tests. This approach results in better-quality code contributions from the community that are less stressful for the maintainers to review.

Case 1: ExaSMR

ExaSMR describes its CI experience with the OpenMC package, which simulates Monte Carlo neutron and photon transport for use within multiphysics calculations coupled to computational fluid dynamics. OpenMC is a C++ library built on top of three geometry and math libraries (DAGMC, GSL, xtensor) and three I/O libraries (HDF5, pugixml, and {fmt}, a.k.a. fmtlib); it builds both an executable and a modular Python ctypes interface. The project was first hosted on GitHub in 2012, and Travis CI was added in February 2015. The team used CI to install system packages and run its Python test harness. In December 2020, the project switched to GitHub Actions, which supported the automation features the project had adopted by that time: a matrix of compilation and test options (turning MPI, OpenMP, and compiler optimizations on or off), coveralls.io code-coverage integration, and automated creation of Docker images for releases.
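
A matrix of compilation and test options like the one described above can be pictured as the Cartesian product of a few on/off build switches. The following is a minimal Python sketch of that idea, not OpenMC's actual workflow: the option names USE_MPI, USE_OPENMP, and OPTIMIZE and the helper functions are hypothetical, chosen only for illustration.

```python
from itertools import product

# Hypothetical on/off build switches; the names are illustrative only and
# are not taken from OpenMC's build system.
OPTIONS = ("USE_MPI", "USE_OPENMP", "OPTIMIZE")


def build_matrix(options=OPTIONS):
    """Return one dict per CI job, covering every on/off combination."""
    return [dict(zip(options, flags))
            for flags in product((False, True), repeat=len(options))]


def cmake_args(job):
    """Translate one job description into CMake-style -D arguments."""
    return [f"-D{name}={'ON' if enabled else 'OFF'}"
            for name, enabled in job.items()]


if __name__ == "__main__":
    for job in build_matrix():
        print(" ".join(cmake_args(job)))
```

Three switches already produce eight jobs, which is why hosted CI services that run matrix entries in parallel are attractive for this kind of testing.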

Paul Romano noted it was surprisingly easy for the team to get a first useful CI pipeline working with Travis CI. Switching to GitHub Actions was also relatively straightforward.

One of the unique aspects of the project is that the Monte Carlo method is stochastic, so otherwise identical simulations can diverge from one another on two systems with ever-so-slightly different characteristics. For example, a different implementation of standard library math functions might result in completely different answers (within uncertainty) if the simulations are run long enough.

This behavior has sometimes made it difficult to reproduce CI failures locally. Romano has seen quite a few cases where developers report “it works for me but always fails on Travis or GitHub Actions.” The problem is especially likely when someone makes a change that is expected to alter the reference answers against which the tests check. Updating the reference results then requires some sensitivity to the platform differences between development and CI environments to get just right.
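
To illustrate why two valid stochastic runs can disagree bit-for-bit yet still agree "within uncertainty," here is a small Python sketch. It is a generic toy (a Monte Carlo estimate of pi), not OpenMC's test harness; the function names are hypothetical, and the two seeds merely stand in for two platforms whose particle histories have diverged.

```python
import math
import random


def estimate_pi(seed, n_samples=20000):
    """Monte Carlo estimate of pi and its one-sigma statistical uncertainty."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    p = hits / n_samples
    sigma = 4.0 * math.sqrt(p * (1.0 - p) / n_samples)
    return 4.0 * p, sigma


def agrees(value, sigma, ref, ref_sigma, n_sigma=3.0):
    """True if two estimates differ by at most n_sigma combined uncertainties."""
    return abs(value - ref) <= n_sigma * math.hypot(sigma, ref_sigma)


if __name__ == "__main__":
    # Different seeds stand in for two platforms with diverged histories.
    local = estimate_pi(seed=1)
    ci = estimate_pi(seed=2)
    print("bitwise equal:", local[0] == ci[0])
    print("statistically consistent:", agrees(*local, *ci))
```

A reference file that stores exact values behaves like the first comparison, so it has to be regenerated on (or for) the platform that CI runs on, whereas a tolerance-based comparison behaves like the second.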

Another hidden challenge the team noticed is keeping up with changes in dependencies. Every so often, a library will change its API, deprecate something, or otherwise change some behavior that forces the development team to deal with the change right away because not doing so means that tests no longer pass. ExaSMR tries to stay on top of this circumstance by not pinning dependencies to specific versions. The consequence, however, is that issues can pop up periodically and divert attention from other areas.

Case 2: VisIt

VisIt’s biggest testing focus is on the (C++) source code for its executable metadata-server, engine, viewer, and controller. VisIt has dozens of external dependencies for geometry, graphics, file formats, and I/O. VisIt’s code base was hosted on NERSC’s Subversion servers until early 2019, when it was migrated to Git and GitHub. Cyrus Harrison first added CI to the codebase in March 2019. Prior to that, only nightly testing was in use. Adding CI testing was complicated because of the graphical nature of the code and the large number of dependencies. Harrison opted to use Docker images to provide pre-built dependency binaries, combined with a set of scripts that allowed running on both Travis CI and CircleCI. Developing the scripts took weeks of trial and error. The team went on to add more features and switched to Azure Pipelines in August 2020. The CI tests run using a vanilla Red Hat Linux base image, but the release process also builds images for CentOS, Fedora, Debian, and Ubuntu.

Mark C. Miller observed that “costs and time limits were some of the key factors affecting our CI provider choice.” One of the largest continuing challenges faced by the team is the compute time required to build and run the tests. “Dependencies aside, it takes a 2017 4-core MacBookPro about 15 minutes to build VisIt using make -j 8 (oversubscription by 2x). It takes another 90+ mins to build dependencies. The entire nightly test suite itself can take 2-4 hours.” Even though Azure Pipelines provides unlimited runtimes for open-source projects, there is still a practical limit on what can be effectively tested in CI.

This constraint has led VisIt to develop most tests by running commands through VisIt’s command-line controller. Because the controller is a Python interface to a single executable, tests can exercise different code paths for things like plots, database readers, operators, and rendering modalities without requiring a complete rebuild of VisIt. The team also uses stand-alone tests of some key classes. Substantial unit testing, however, would require refactoring the core of VisIt to support that modality of build and execution.

Miller notes there’s always a trade-off between better testing and improving the product itself. There’s a laundry list of functionality for which the project would like to have more testing, like GUI events, multiple client/server modalities, numerical corner cases like NaN/Inf floating-point exception handling, and more comprehensive doc-tests. Rather than trying to tackle all of these directly, the developers have found that regular discussions comparing potential improvements have made them extremely productive. Discussing ideas at length allows the team to document ideas via GitHub issues, understand trade-offs, and prioritize their work.

Miller also had advice for teams developing an initial process for CI. As a guiding principle, he advocates starting small and aiming for the ultimate goal of moving a subset of tests into CI: ones that are responsive, informative, and cover areas that the team agrees should be continuously checked. Those tests should use the code in the way users are intended to run it, minimizing surprises down the road.

VisIt’s development path was heavily influenced by the prior experience of its developers with CI, along with a healthy dose of combing through documentation. Perhaps uniquely, the team uses CI primarily as a “smoke test” for compilation, dependencies, and their release process, placing a majority of their software quality focus on reviewing pull requests and running nightly tests on local resources. Those nightly tests generate, then check 3500+ images and 2000+ numbers/textual results. Although the tests require bit-for-bit accuracy, the team has a rolling error-resolution process.

Case 3: ZFP

ZFP’s main repository contains a C++ library and its C, Fortran, and Python interfaces. A separate repository builds Python wheel binary distributions. ZFP was first hosted on GitHub in March 2016. In early 2017, ZFP migrated from a simple Makefile to CMake and added CI using Travis and AppVeyor. That initial investment was somewhat difficult because documentation on Google Test, CMake, and CI services was in its early stages, and relevant guides were hard to find. Since then, the team has steadily expanded its test coverage to 95% with over 5,000 unit tests. Internally, this effort was made possible by hiring a full-time developer to focus on ZFP’s testing and CI pipeline.

As a library for lossy array compression, most of ZFP’s tests make use of either checksums or fuzzy matching of the compressed and the compressed-then-uncompressed data. The former demand bit-for-bit identical results, while the latter can account for cases where floating-point computations may vary across compilers, CPUs, etc. By adopting this paradigm, it becomes simple to create a massive array of tests over compression options (1D, 2D, 3D, and 4D arrays of 32- and 64-bit integer and floating-point data) on numerous compilers and operating systems.
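
The difference between the two kinds of checks can be sketched in a few lines of Python. This is an illustration only: the compressor below is a toy uniform quantizer standing in for ZFP, and every function name and parameter is hypothetical rather than part of ZFP's API or test suite.

```python
import hashlib
import struct


def compress(values, step=0.01):
    """Toy lossy compressor (not ZFP): quantize each value to a 32-bit integer."""
    quantized = [round(v / step) for v in values]
    return struct.pack(f"<{len(quantized)}i", *quantized)


def decompress(blob, step=0.01):
    """Invert compress() up to the quantization error."""
    count = len(blob) // 4
    return [q * step for q in struct.unpack(f"<{count}i", blob)]


def checksum(blob):
    """Bit-for-bit fingerprint of the compressed stream."""
    return hashlib.sha256(blob).hexdigest()


def max_error(original, restored):
    """Largest absolute round-trip error, used for the tolerance-based check."""
    return max(abs(a - b) for a, b in zip(original, restored))


if __name__ == "__main__":
    data = [0.1 * i for i in range(16)]
    blob = compress(data)
    # In a real suite the golden checksum would be stored alongside the tests;
    # here it is simply recomputed to keep the sketch self-contained.
    golden = checksum(compress(data))
    print("checksum matches:", checksum(blob) == golden)
    print("within tolerance:", max_error(data, decompress(blob)) <= 0.005 + 1e-9)
```

The checksum test fails on any single-bit change in the compressed stream, while the tolerance test only fails when the round-trip error exceeds the stated bound, which is what makes it robust to small floating-point differences across platforms.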

Testing is not without its challenges, however. As the codebase evolves, it sometimes happens that the “golden” standard checksums need to be updated. These updates require careful manual inspection of results across systems. Also, the CI tests are valuable for debugging issues, but reproducing failures and interacting with CI tests can be difficult. When things fail without leaving good diagnostic and error messages, one needs to make a small change and re-run the entire pipeline! Of course, this circumstance makes it hard to quickly track down the root cause. The ZFP team encourages adopters of CI to expect to use substantial trial and error to get oriented.

With CI, the need for good diagnostics and logs extends beyond a codebase. Like other teams, ZFP noted that intermittent issues sometimes originate inside the CI system’s own network, software, and hardware stack and then disappear after the test is re-run a few times. Such issues can show up as logs that don’t come back to CDash, or as log names and formats that don’t directly map to actual jobs on a given CI service. This situation demands extra work on the part of the development team to gain experience diagnosing errors while avoiding wild goose chases that soak up time and effort. Many of these issues are opportunities for making systems better and building collaborations with GitLab CI experts, facility teams, and upstream developers.
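
One common way to cope with intermittent infrastructure failures while still keeping a diagnostic trail is to wrap a fragile step in a retry helper that records every failed attempt. The sketch below is a generic Python illustration, not something the ZFP team described using; the function name and its parameters are hypothetical.

```python
import logging
import time

log = logging.getLogger("ci_retry")


def run_with_retries(step, attempts=3, delay=0.0):
    """Run step(); on failure, record the error and retry up to `attempts` times.

    Returns (result, failures) so callers can report flaky steps instead of
    silently hiding them.
    """
    failures = []
    for attempt in range(1, attempts + 1):
        try:
            return step(), failures
        except Exception as exc:  # broad on purpose: this is a diagnostic wrapper
            failures.append(f"attempt {attempt}: {type(exc).__name__}: {exc}")
            log.warning(failures[-1])
            if attempt < attempts:
                time.sleep(delay)
    raise RuntimeError("step failed after retries:\n" + "\n".join(failures))
```

Returning the list of failed attempts alongside the result keeps the flakiness visible, so a step that regularly needs two retries can be investigated rather than forgotten.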

The ZFP project would also like to increase performance testing and coverage of shipped executable programs, stay up to date with compiler releases and platform libraries, and have a CI system that skips over tests that are provably “still working.”

Like VisIt, ZFP doesn’t require 100% of its tests to pass before merging unless the merge is going into the main development branch. This approach allows the main branch to serve as a base for releases while letting the feature branches be a place to work collaboratively on issues.

Summary

As a technology, CI has evolved and developed considerably over the past several years. Mature systems, connected to ubiquitous version control platforms like GitHub, GitLab, and Bitbucket, offer many examples, tutorials, and documentation that make the entry barrier low. The projects interviewed in this article discussed their specific adaptations for scientific and numerical software. All three projects mentioned comparing with “golden results” from full-program runs, as well as the unique mindset needed to develop and maintain unit tests. Testing is a mode and scope of work motivated by adding armor to a codebase. The benefit most cited here was identifying potential bugs early, thereby increasing the project’s ability to collaborate within the open source community.

All three projects also noted some drawbacks to CI. Maintaining tests requires time and effort, running these tools requires trial and error, and long-running tests can become a source of frustration. Teams noted that random failures (due to network and other sources) were initially annoying but infrequent enough to safely ignore and usually resolved by a restart. Long-running tests can stress CI resources, with the mixed benefit and drawback of requiring more careful attention to how tests are designed. Other than adding that development work, the teams didn’t notice excessively disruptive changes to their process.

Aside from CI, much can be learned from studying code construction choices across open-source projects. For example, the three projects here all use a different strategy to create Python interfaces to their codes. ExaSMR uses Python’s ctypes to wrap the C API, then installs those wrappers in site-packages so they can be found by setting PYTHONPATH. VisIt has an integrated Python environment. ZFP’s Python interface is built using Cython’s support for wrapping C++ classes, with a build and install process managed using the scikit-build plugin to CMake.
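
As a small illustration of the ctypes approach to wrapping a C API, the Python sketch below wraps a single function from the C standard library on a POSIX system. The C library is only a stand-in chosen so the example is self-contained; it is not OpenMC's C API, and the wrapper name c_strlen is hypothetical.

```python
import ctypes
import ctypes.util

# Stand-in for a project's C API: the C standard library on a POSIX system.
# If find_library returns None, CDLL(None) exposes the symbols already
# loaded into the running process, which include the C library on POSIX.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature so ctypes converts arguments and results correctly.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text):
    """Pythonic wrapper around the C function `size_t strlen(const char *)`."""
    return libc.strlen(text.encode("utf-8"))


if __name__ == "__main__":
    print(c_strlen("OpenMC"))
```

The same pattern (load the shared library, declare argument and return types, then expose a thin Python function) scales to a larger C API, at the cost of keeping the declared signatures in sync with the C headers by hand.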

Overall, CI adds value by providing more stable working code bases, test feedback without developer intervention, and greater support for community contributions. As we watch the evolution of CI technologies, our teams are seeking several features. The ability to freeze the state of tests would both reduce redundant tests after small changes and also allow interaction with these states. Greater availability of testing hardware and runners would further boost the usefulness of this technology.

Acknowledgments

The idea and form of the Bright Spot series were shaped by the IDEAS-ECP team and the BSSw.io Editorial Board. We owe the insights into developer experiences presented here to interviews with members of the ExaSMR, VisIt, and ZFP teams, including Paul Romano, Cyrus Harrison, and Peter Lindstrom. Comments, suggestions, collaboration, and reviews from Mark C. Miller were especially appreciated during the writing of this article.

Author bio

David M. Rogers is a Computational Scientist in the National Center for Computational Sciences Division at Oak Ridge National Laboratory, where he works collaboratively to develop and apply new methods and theories for multiscale modeling using HPC.
