How Not to Cite Software (and how to be cited)

Posted by s.aragon on 31 March 2017 - 9:00am

By Will Usher, Senior Researcher: Infrastructure Systems Modeller, University of Oxford

Plagiarism is a serious issue, and we are all familiar with the horror stories of students unceremoniously ejected from courses for copying essays. Any undergraduate degree worth its salt teaches students how to cite work correctly, acceptable bounds on quotation and how to attribute ideas and concepts to their sources. But in the growing world of open-source research software, best practices have yet to be universally understood, as I recently found out.

During my PhD at University College London, I became involved in the heady enthusiasm of the Research Software Programming group, attending and then helping out at Software Carpentry workshops. As a consequence, I was keen to apply my new knowledge of Python, version control and software development to my research. As luck would have it, I discovered an existing Python library on GitHub, which implemented several Global Sensitivity Analysis routines I could make use of. As I used the library, I started adding bits and pieces, and so by the end of the PhD I had made a considerable contribution to the package.

It's probably safe to say that SALib (sensitivity analysis library) is the go-to Python library for the unfortunately still-far-too-niche use of global sensitivity analysis in modelling, and our user group has grown considerably over time. The Python library is available on PyPI, conda, and GitHub and has good documentation. We've recently been through a peer-review process with JOSS (Journal of Open Source Software), which further improved the state of our research code and produced suggestions on how best to allow contributions from fellow users and developers. We decided to release the library under an MIT License and didn't think too much beyond that.

About a year ago, we discovered a Python sensitivity analysis library which had copied entire modules of our code. The developers had cherry-picked code from several related Python packages (including ours), wrapped it in a GUI with a nice logo and then made their package available on a website. Our first reaction was anger. Here, a bunch of academics, who should know better, had stolen our code and passed it off as their own. On closer inspection, the library was a jumble of bits of other people’s code and a long script pulling them all together. What a mess! The library was released under a GPL license, and made no mention of the code upon which it was built. After calming down, we sent them a polite e-mail, pointing out that they were welcome to make use of our library, even bits of the library, but that the terms of the MIT License mean that they must attribute the authors. Specifically, here's the condition included in an MIT License:

"The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software."

The authors replied, apologised and said they would sort it out. They did so by copying another of our files, the one detailing how to make contributions to our repository, which included a 'thank you' we had written to all our contributing authors! At this point, suspecting incompetence, we gave up and forgot about it.

The story didn't end there: in February 2016, a paper appeared in an Elsevier journal specialising in modelling and software. The paper described the same software, GUI, copied code and all. This time, the paper gave a link to our library, but didn't mention at all that our code (and others’) made up a significant proportion of their GUI’s functionality. What is most frustrating in this case is that it is the journal which has failed here. Even a cursory investigation into the source code would indicate that the library is made up of others' work, which is incorrectly cited.

So what's the big deal? Does it really matter if other people copy our code? After all, we've “given it away” online, and having more users helps build our reputation and creates success.

To help avoid situations like this, I've compiled, with the help of @Neil (Neil Chue Hong) at the Software Sustainability Institute (via a Twitter chat), a set of guidelines for when you want to make use of someone else's code and when you want to encourage others to use your code correctly.

Put a clear statement about how your software should be acknowledged in your README and ideally a CITATION file. A readme file is a place to tell users what to do, how to install your library, and how to use it best. This includes how to cite your software.
Don’t copy bits of code. Instead, import libraries whenever possible. For example, we've put a lot of effort into testing, packaging and ensuring that SALib can be deployed. Cherry-picking individual files from a library breaks the (automated) link to future bug fixes and improvements in the code. An advantage of writing installable packages that download their dependencies from repositories is that you then do not need to worry about incorporating the licenses of those dependencies in your software. If you do cherry-pick code, check that your license is compatible with the license of the code, and place the code’s license in a LICENSE folder in your repository.
If someone publishes a paper which completely fails to credit you, contact the editors. If that fails, contact the publisher. Likewise, if there is a public repository for the offending software, create an issue or pull request to correct the attribution. Unfortunately, understanding of what "open-source" means, and of how to deal with the license obligations of open-source software, is lagging behind the provision of "software" journals, as is clear in the anecdote above.
Consider adding a copyright and licensing statement (such as a link to LICENSE) in the header of each file in your library. This doesn't provide any extra legal protection, but it makes the need for attribution clearer as most people copy code blindly rather than maliciously.
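If you do add a licensing line to each file, it is easy for new files to slip through without one. As a rough illustration (not something SALib itself ships), here is a minimal Python sketch that lists the source files whose opening lines do not mention the license. The marker text "See LICENSE", the function name files_missing_header and the choice to look at only the first five lines are all assumptions made for this example; adapt them to whatever wording your own headers use.

```python
from pathlib import Path
import tempfile

# Hypothetical marker text: change it to match the wording of your own headers.
HEADER_MARKER = "See LICENSE"


def files_missing_header(root, marker=HEADER_MARKER, lines_to_check=5):
    """Return relative paths of .py files whose first few lines lack the marker."""
    root = Path(root)
    missing = []
    for path in sorted(root.rglob("*.py")):
        with path.open(encoding="utf-8") as handle:
            # Read at most `lines_to_check` lines; short files yield empty strings.
            head = [next(handle, "") for _ in range(lines_to_check)]
        if not any(marker in line for line in head):
            missing.append(path.relative_to(root).as_posix())
    return missing


if __name__ == "__main__":
    # Small self-contained demonstration using a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp)
        (demo / "with_header.py").write_text(
            "# Copyright (c) the authors. See LICENSE for terms.\nx = 1\n",
            encoding="utf-8",
        )
        (demo / "without_header.py").write_text("x = 2\n", encoding="utf-8")
        for name in files_missing_header(demo):
            print("missing license header:", name)
```

Run against the root of a repository, a check like this could sit alongside your tests so that a missing header is noticed before a release rather than after someone has copied the file.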

In summary, citations are the currency by which academics measure success, and it should be recognised that this holds for software as well as journal papers. Evidently, there’s a way to go before software is recognised as the concrete research output it is, as there is still a gap in understanding how best to credit authors for their work in software.