BSSw.io's David Bernholdt interviews Kate Keahey about the recently released workshop report "Report on Challenges of Practical Reproducibility for Systems and HPC Computer Science."
David Bernholdt: You organized a workshop on practical reproducibility. Can you explain what practical reproducibility is and why it is important?
Kate Keahey: Practical reproducibility is the idea that scientists should be able to interact with a digital representation of scientific results rather than just reading about them. This could, for example, mean that I could reproduce somebody’s experiment and “play” with it to gain a more intuitive insight into the explored phenomena. In doing so, I could also modify it, potentially to obtain a new result. Or I could do a side-by-side comparison with a competing solution. Or I could use it to teach a class and even assign extensions as homework. All of those things have the potential to accelerate the cycle of scientific exploration and sharing – or accelerate how quickly new scientific results penetrate the classroom.
Practical reproducibility could also influence the scientific process in more subtle ways. For example, right now, when I want to learn about a new area, I go to something like the ACM digital library and search for papers on relevant topics. What if, instead, I had a digital library of experiments that I could play with? What if I could search for relevant topics, and ideally also for the experimental methodology used to construct certain experiments, or for particular data analysis methods? We tend to think of science as “standing on the shoulders of giants.” But science also spreads “horizontally,” through innovation in methodologies and approaches to problem solving rather than through results alone.
Another way of looking at practical reproducibility is as a way of creating a “market” for sound scientific practice – or capitalizing on that practice. A lot of people tend to think of reproducibility as something that simply allows us to validate results. But packaging computational results for reproducibility is often very hard. And if a result then gets reproduced only once, or maybe not at all, that is often not enough motivation to invest the necessary effort, because that effort takes away from new research that could produce new results. But if a packaged result were to be reproduced many times, that changes the equation substantially.
David: Do you think that practical reproducibility is possible in practice? A lot of experiments require significant resources, or might take a long time to run, or the hardware might not be available any longer. Can they be made practically reproducible?
Kate: It certainly is not easy, and in the end it may not be possible – or “practically possible”, i.e., cost-effective – for absolutely every experiment. I tend to think of it as making some experiments practically reproducible – and we can certainly make it work for some of them – and then seeing if we can enlarge that set and what the specific challenges are. For example, when experiments are packaged as part of an artifact evaluation (AE) initiative at a conference and then get reproduced, at least some of them are practically reproducible at the time – maybe not easily enough, but that is a specific challenge. Hardware availability – or lack thereof – used to be by far the biggest obstacle to practical reproducibility, but now with NSF’s sponsorship of open platforms like Chameleon, FABRIC, or ARA – which between them make complex and expensive edge-to-cloud, networking, and wireless hardware available to everybody in highly reconfigurable ways – much of that challenge has been solved. Any scientist from any US institution can have access to high-end resources, whether for original experimentation or for learning from the experiments of others.
But there are other challenges – in ease of packaging and reproduction, community practices, and research incentives. Some experiments may indeed take a long time, which makes reproducing them prohibitively expensive. A suggestion at the workshop was to reproduce a “scaled-down” version of the result or to make only some aspects of the result reproducible; that’s certainly better than no reproducibility at all. Some experiments are indeed tied to hardware that may become unavailable because it is obsolete, and some results will be ephemeral in that they will be relevant only in the context of that hardware, but others may be reformulated in a more general way. I guess what I am saying is that if we make many – or most – experiments practically reproducible, we’d be doing better than we are now.
David: The workshop focused on HPC and systems computer science, but many of the insights and recommendations could have a broader computational science reach. Why focus on computer science research as opposed to addressing the broader questions?
Kate: Computer science has particularly challenging reproducibility requirements when compared with mainstream computational science. But computing is such an essential scientific tool these days that computer science itself probably accounts for something close to 1% of the total volume of results obtained computationally. This means that when discussed in conjunction with reproducibility challenges in computational science, the computer science requirements are often treated somewhat as an “outlier” – they are not necessarily what everybody in the room wants to solve. At the same time, they are significantly more challenging; they often require unique hardware configurations, significantly broader configurability and control of computing resources, complex topologies, etc. So the idea was to create a group to focus on this more challenging problem, with the goal that the solution to this more challenging problem – or aspects of it – can then also serve the broader computational science community.
David: Reproducibility is a hot topic right now, and there are many workshops that address the subject in one way or another. There is a whole conference – ACM REP – devoted to the topic of reproducibility. What made you think the community needed another workshop?
Kate: Part of it is what we already discussed: the focus on reproducibility challenges for computer science experimentation specifically, as they tend to be much more complex. But another aspect is that most of the existing community forums focus primarily on solutions and bring together people who build systems embodying those solutions. The main difference in this workshop is that we wanted to focus on characterizing the problem instead. So we invited reproducibility practitioners: authors, reviewers, and AE organizers, as well as some colleagues from the research infrastructure space who are all interested in serving this community, and we just talked about challenges. This is a difficult community to find and motivate to come together, but they are also our users. We want to understand better what they need from infrastructure, and I think they know that if we ask for their time, it’s because we want to serve them better.
David: What led to organizing this workshop, and what was its objective?
Kate: As part of Chameleon, we have supported many AE initiatives at various conferences in HPC – including Supercomputing, ICPP, and SIGMOD – but also systems conferences like ATC, FAST, OSDI, and SOSP. The output these initiatives produce – in terms of results packaged for reproducibility – is the closest thing we have to practically reproducible artifacts. That makes the authors and reviewers who participate in those initiatives the best people to ask the question: what about practical reproducibility is still a challenge? Our support for the reproducibility initiatives allowed us to recruit the right people to attend, so the workshop doubled as a Chameleon User Meeting. We held the workshop during SC24, since Chameleon supported the SC24 AE initiative, though at the explicit wish of the SC24 organizers it was not officially affiliated with the conference. It absolutely blew me away in terms of what we learned and the level of detail and insight the discussions generated. Given my background, I was mainly focused on reproducibility requirements for research infrastructure, but the discussions were much broader, including tools, best practices, incentives, and even how AE initiatives are organized. I did not expect to talk about that, but this is what the community wanted to discuss, so that’s where we went. Everything is summarized in the workshop report, which went through many iterations, including public comments from all workshop contributors. I am deeply grateful for those comments, in particular because they were as active and opinionated as the workshop discussions themselves.
David: What were the most interesting or surprising results/insights from the workshop?
Kate: There is a wealth of information in the report, and I think different things will be interesting to different people. One thing that might be of immediate help to AE organizers is that we came up with two checklists for packaging artifacts – one for the essentials and the other for matters of style – that we put in the appendix and formatted so that they can be used separately from the report. These checklists can simplify the organization of AE initiatives enormously, both as a guideline for authors and as an evaluation sheet for reviewers. For me personally, one of the most interesting things was a discussion of how to decide when, and to what extent, an experiment has been reproduced. Different approaches were proposed, including a discussion with your favorite AI. AI also came up in discussing the complexities of setting up an experimental environment correctly – right now, a huge challenge for practical reproducibility. To add to my reply to your earlier question: if Claude could make those challenges disappear for me, that would really put practical reproducibility within reach.
Among other insights, it was interesting to understand the different dynamics between authors and reviewers of papers versus experiments. It is generally much easier to read a paper than to write one (even if you don’t count the work of obtaining the result itself), but this is not necessarily true of the difference in effort between authors and reviewers of digital artifacts. Reviewing digital artifacts also offers more opportunities for contribution, and thus more room for dialogue and collaboration between authors and reviewers. Right now, those opportunities are not always leveraged, but there is potential for that to change.
David: The workshop is over and the report is written; what are the next steps – for your group, and for the community at large?
Kate: First of all, we want to continue engaging with this community – those are amazing practitioners, collectively developing insight into how the research process – and the requirements for research infrastructure – are changing. We want to roll out the red carpet for every AE in systems and HPC, help them to the extent we can on Chameleon, and continue learning from them. I hope that the report – and in particular, the experiment packaging checklists – will make reproducibility initiatives easier to organize and smoother to run. We hope the community will help evolve them, and, to that end, we made them available and modifiable via GitHub so that each new reproducibility initiative can add its own insights as they arise and continue the discussion there.
Secondly, the challenge is how to create that digital library of experiments we were talking about before – something that is integrated with whatever research infrastructure the artifacts can be executed on. In Chameleon, we have a system called Trovi, which is a nucleus of this type of service: it holds hundreds of artifacts representing experiments packaged as part of various reproducibility initiatives or as part of classes taught on Chameleon; users can try them out, modify them, and otherwise use them to build their own experiments. Research infrastructure is a logical place for those artifacts to reside because you need compute resources to interpret digital artifacts: you don’t need much in the way of additional equipment to read a paper, but to run experiments you are going to need hardware, and research infrastructure provides that. We hope that the number of digital artifacts continues to grow; we also aim to integrate with other types of research infrastructure, eventually giving users a broader set of platforms to choose from.
And lastly, some of the insights gained from the workshop are immediately actionable for my group. For example, we now have pilot projects using AI to create experimental environments. If we can have some clever AI take on the complexities of environment configuration, that would make a huge difference to how well and how easily we can reproduce experiments; that would really put practical reproducibility within reach. It would also make a huge difference to how experiments are constructed in the first place – to the general methodology. Considering where we are now, even a prototype that works only a fraction of the time but is easy to try would help, and I think many would be willing to try it. We are also looking at many other recommendations from the workshop; several of them motivate new tools that could significantly improve experimentation.
Further information
Kate's earlier blog article laying out some of the motivations for the workshop: Practical Reproducibility: Building a More Robust Research Ecosystem.
A paper describing the notion of "practical reproducibility": Three Pillars of Practical Reproducibility.
The workshop report.
Author and interviewer bios
Kate Keahey is one of the pioneers of infrastructure cloud computing. She created the Nimbus project, recognized as the first open source Infrastructure-as-a-Service implementation, and continues to work on research aligning cloud computing concepts with the needs of scientific datacenters and applications. To facilitate such research for the community at large, Kate leads the Chameleon project, providing a deeply reconfigurable, large-scale, and open experimental platform for computer science research. To foster the recognition of contributions to science made by software projects, Kate co-founded and serves as co-Editor-in-Chief of the SoftwareX journal, a new format designed to publish software contributions. Kate is a Scientist at Argonne National Laboratory and a Senior Scientist at The University of Chicago Consortium for Advanced Science and Engineering (UChicago CASE).
David E. Bernholdt is a Distinguished R&D Staff Member in the Computer Science and Mathematics Division of Oak Ridge National Laboratory. His research interests relate to how we write scientific software, especially for high-performance computers, broadly interpreted. This includes programming models, programming languages, domain-specific languages, and runtime environments, as well as software engineering, software productivity, and software stewardship and sustainability. He leads Programming Environment and Tools for the Oak Ridge Leadership Computing Facility and has worked in numerous computational science projects over the years, especially in the area of fusion energy.