Most scientists are unprepared for the ongoing shift towards computational science for data exploration.
The real heroes of science are not the scientists but all the other people who make it work in the background. Without these supporting roles, most (applied) sciences wouldn’t produce many results. In the past, technical departments created support positions such as lab assistants, lab managers, and craftspeople to assist scientists in their research.
Today, you don’t need to set up real experiments; you can test almost any hypothesis with dedicated simulations first. The steadily decreasing cost of compute power offered by third-party providers is one of the main reasons computational approaches have become so attractive. Creating realistic yet synthetic data with nothing but your computer has never been easier.
Thus, within the last 10 to 15 years, science has moved heavily towards computational methods in almost all areas, regrettably without creating the necessary support structure. After spending almost 10 years in academic settings, I realize this is partly done to save money, but I am also not sure the majority of scientists have even noticed the shift. “Why does it take so long, it’s just code” is one of the phrases you hear thrown around far too often.
So, with the increasing demand for computational methods and without such an up-to-date supporting cast, researchers face new challenges. The greatest burden for scientists has become understanding compute models, high-performance and cloud computing systems, and coding in general.
While you can expect some of that knowledge in technical fields, you are unlikely to find it in the majority of people in the social sciences. Unfortunately for them, machine learning is entering their fields as well (sentiment analysis, language processing, …). To be productive in such an environment, they need to be able to work with these tools, too.
From talking to colleagues and from my own experience, the hardest parts are:
Setting up the coding environment, first on your own system and then on the server where your compute-heavy task actually runs.
Understanding the target system’s technical infrastructure. Too many people use traditional high-performance computers as if they were ordinary machines.
Writing robust and reusable code to speed up later development.
This is a lot to handle, especially in rather non-technical fields. Even in technical disciplines, you are unlikely to find many graduates able to tackle all of this right from the get-go.
Unfortunately, most college programs are not up to date, and graduates don’t know much about computing unless they are interested in it for its own sake. This leaves PhD students woefully unprepared for their scientific endeavors in a modern world. Too many scientists do not even version-control their code…
I have helped (or at least tried to help) other people with their coding and setup in HPC environments, and most of their issues come down to not understanding their compute systems well enough.
Open Source to the Rescue?
Luckily, there is plenty of open source software that scientists can use; at least you don’t have to reinvent the wheel every time. However, this sounds better than it really is, because most of this code is poorly documented and a real pain to get working (has anybody tried building VisIt? ;)).
So while open source is great and aligns well with a good scientific spirit, it is still plenty of work to get it running. And with the typical half-life of an academic at one institution being 3 to 4 years, you lose all that hard-earned knowledge, and the next PhD student or postdoc starts back at zero. Sadly, collaboration is far too hard in such scenarios.
The complexity lies not only in the science but, to a large part, in the software and the system design around it.
More often than not, this keeps research velocity low and frustrates scientists. You don’t want to spend your time fixing software and build systems; you want to make cool discoveries!
SciOps — Help Scientists Concentrate on the Science
So, with dedicated software development positions unlikely to be created, science needs a different approach to the problem. Ideally, scientists should spend most of their time working on scientific progress.
To create such a supporting environment, we need new tools and new ways of working together. Researchers should focus on their code without caring (too much) about the actual target systems or the deployment of their code. Version control is enforced automatically as part of the necessary documentation of their work. Simulation and analysis scripts, written in the programming language of the scientist’s choice, are defined by their input and output data interfaces. Ideally, publication-ready plots are generated automatically for input-output relationships.
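As a minimal sketch of what such an input/output contract could look like (the file names, parameter names, and the toy “simulation” here are all hypothetical, not part of any existing tool), a script declares nothing but the data it reads and the results it writes, so a deployment layer could run it unchanged on a laptop or an HPC node:

```python
import json
from pathlib import Path


def run_simulation(input_path: str, output_path: str) -> dict:
    """A script defined purely by its data interfaces:
    read parameters from input_path, write results to output_path."""
    params = json.loads(Path(input_path).read_text())
    # Toy "simulation": exponential decay sampled over n steps.
    n, rate = params["steps"], params["decay_rate"]
    series = [(i, (1 - rate) ** i) for i in range(n)]
    results = {"series": series, "final": series[-1][1]}
    Path(output_path).write_text(json.dumps(results))
    return results


if __name__ == "__main__":
    # Hypothetical parameter file; any orchestration tool could supply it.
    Path("params.json").write_text(json.dumps({"steps": 5, "decay_rate": 0.5}))
    out = run_simulation("params.json", "results.json")
    print(out["final"])  # → 0.0625
```

Because the script touches only its declared input and output files, the surrounding tooling is free to decide where and how it runs, and the input-output pair is exactly the data you would want to plot and archive.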
Imagine how much faster people could do science if all the painstaking aspects were removed from the process. Moving them out of the way, and the corresponding tools, is what I call SciOps.
Originally published on medium.