By Steve Crouch, SSI Research Software Group lead.
This guide is the first in the Unit Testing for Scale and Profit series.
Demonstrating that a process generates the right results is important in any field of research, whether it’s research software generating those results or not. Automation, where possible, enables us to define a potentially complex process in a repeatable way that is quicker and far less prone to error than doing it manually. In this guide we’ll look into techniques of automated testing to improve the predictability of a software change, make development more productive, and help us produce code that works as expected and yields desired results. We'll use Python for illustration purposes, but the concepts and approaches can be readily applied to many other languages.
There are many reasons why automating the testing of your software is a good idea. Once written, automated tests can be run many times, for instance whenever we change our code. So as our software evolves and code is perhaps extended, tidied, fixed, updated to use new libraries, or optimised, running tests gives us confidence the code continues to do what it's supposed to do. And not just for ourselves: when others make use of your code, running these tests can help them build confidence in your code too. Another advantage is that running automated versions of manual tests (where those tests are conducive to automation) is much faster.
So when writing software we need to ask ourselves some key questions:
- Does the code we develop work the way it should do?
- Can we (and others) verify these assertions for themselves?
- And, perhaps most importantly, to what extent are we confident of the accuracy of results that appear in publications?
If we are unable to demonstrate that our software fulfils these criteria, why should anyone use it?
Note: You will need Python 3.7 or above if you wish to follow the coding examples.
What About Unit Testing in Other Languages?
Other unit testing frameworks exist for Python, including nose2 and unittest, and the approach to unit testing translates to other languages as well, e.g. pFUnit for Fortran, JUnit for Java (the original unit testing framework), Catch2 for C++, etc.
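To give a flavour of what such a framework looks like, here is a minimal sketch using Python's built-in unittest framework; the add function and its test values are purely illustrative assumptions, not part of this guide's example code:

```python
import unittest

def add(a, b):
    """A trivial function to test - purely illustrative."""
    return a + b

class TestAdd(unittest.TestCase):
    def test_add_positive(self):
        # Check a straightforward case.
        self.assertEqual(add(2, 3), 5)

    def test_add_negative(self):
        # Check behaviour with a negative input.
        self.assertEqual(add(-1, 1), 0)

# Run with: python -m unittest <this file>
```

unittest uses classes and assertion methods such as assertEqual; as we'll see shortly, pytest lets us write the same kind of test case as plain functions with bare assert statements.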
How should we test?
We can and should extensively test our software manually, and manual testing is well suited to testing aspects such as graphical user interfaces and reconciling visual outputs against inputs. However, even with a good test plan, manual testing is very time consuming and prone to error. Another style of testing is automated testing, where we write code that tests the functions of our software. Since computers are very good and efficient at automating repetitive tasks, we should take advantage of this wherever possible.
There are three main types of automated testing:
- Unit tests are tests for fairly small and specific units of functionality, e.g. determining that a particular function returns output as expected given specific inputs.
- Functional or integration tests work at a higher level, and test functional paths through your code, e.g. given some specific inputs, a set of interconnected functions across a number of modules (or the entire code) produce the expected result. These are particularly useful for exposing faults in how functional units interact.
- Regression testing is a special case of testing that makes sure your program’s output and behaviour haven’t changed. For example, after making changes to your code to add new functionality or fix a bug, you may re-run your unit or integration tests to make sure they haven't broken anything. You may also add a new, specific regression test to highlight if a particular bug has returned.
A collection of automated tests is often referred to as a test suite.
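To make the distinction concrete, here is an illustrative sketch (the functions are hypothetical, not part of this guide's example code) of how a unit test differs from an integration-style test:

```python
def clean(text):
    # Normalise a string: strip surrounding whitespace, lowercase it.
    return text.strip().lower()

def count_words(text):
    # Count whitespace-separated words.
    return len(text.split())

def count_clean_words(text):
    # A function that connects the two units above.
    return count_words(clean(text))

# A unit test checks one small unit of functionality in isolation:
def test_clean():
    assert clean('  Hello ') == 'hello'

# An integration-style test checks a functional path through
# several connected units:
def test_count_clean_words():
    assert count_clean_words('  Hello   World ') == 2
```

Note that even if both unit tests for clean and count_words pass, the integration-style test could still fail if the two units are wired together incorrectly, which is exactly the kind of fault integration tests are there to expose.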
For the purposes of this guide, we’ll focus on unit and regression testing, but the principles and practices we’ll talk about can be built on and applied to functional and integration tests as well. Overall, a good guiding principle behind testing is to fail fast: prioritising the identification of failure, which is where unit testing can really help us, affords us the opportunity to find and resolve issues early, in particular before they can lead to published results.
Writing tests using a Unit Testing Framework
Keeping these things in mind, here’s a different approach that builds on the ideas we’ve seen so far but uses a unit testing framework. In such a framework we define the tests we want to run as functions, and the framework automatically runs each of these functions in turn, summarising the outputs. And unlike our previous approach, it will run every test regardless of any test failures it encounters.
Most people don’t enjoy writing tests, so if we want them to actually do it, it must be easy to:
- Add or change tests
- Understand the tests that have already been written
- Run those tests, and
- Understand those tests’ results
Test results must also be reliable. If a testing tool says that code is working when it’s not or reports problems when there actually aren’t any, people will lose faith in it and stop using it.
To illustrate, let's use an implementation of the factorial function, which computes the product of all positive integers up to and including a given non-negative integer. In reality, our code will likely be more complex than this, and in fact Python already has a built-in factorial function, but let's assume it doesn't: an implementation of factorial is simple enough for us to quickly reason about its behaviour for the purposes of this guide.
In a new directory called mymath, place this Python code in a file named factorial.py:
```python
def factorial(n):
    """
    Calculate the factorial of a given number.

    :param int n: The factorial to calculate
    :return: The resultant factorial
    """

    if n == 0 or n == 1:
        return 1
    else:
        return n * factorial(n-1)
```
So, factorial(3) will give us 6, factorial(5) gives us 120. You'll notice we have also included a Python docstring - a special type of Python comment - at the head of the function, briefly describing what the function does, its input parameter, and what it returns, which is good practice.
Now let's see what some unit tests might look like. Create a new directory called tests, and a new file within that directory called test_factorial.py:
```python
from mymath.factorial import factorial

def test_factorial_3():
    assert factorial(3) == 6

def test_factorial_5():
    assert factorial(5) == 120

def test_factorial_10():
    assert factorial(10) == 3628800
```
Each of these test functions is, in a general sense, a test case - a specification of:
- Inputs, e.g. the numbers we pass to our factorial function
- Execution conditions - what we need to do to set up the testing environment to run our test, e.g. in this case, we need to import the factorial function from our mymath source code. We could include this import statement within each test function, but since we are testing the same function in all of them, for brevity we'll include it at the top of the script.
- Testing procedure, e.g. call our factorial function with an input number and confirm that it equals our expected output. Here, we use Python's assert statement to do this, which will raise an AssertionError and fail the test if this condition does not hold
- Expected outputs, e.g. the numbers to which we compare the result of calling the factorial function
And here, we’re defining each of these things for a test case we can run independently that requires no manual intervention.
Going back to our list of requirements, how easy is it to run these tests? Well, these tests are written to be used by a Python package called pytest. Pytest is a testing framework that allows you to write test cases using Python. You can use it to test things like Python functions, database operations, or even things like service APIs - essentially anything that has inputs and expected outputs.
We’ll continue to use pytest to write unit tests in this guide, but what you learn can scale to more complex functional testing for applications or libraries.
First we need to set up our environment so we can run these tests using pytest. We'll be using a Python virtual environment, a useful way to install and manage Python packages for a particular project that keeps all the packages needed for that project separate from other projects to avoid any confusion. Fortunately, Python 3 has built-in support for virtual environments. To create and make use of a new virtual environment, run the following (in the same directory that contains the mymath directory):
```
$ python3 -m venv venv
$ source venv/bin/activate
```
Now we have our virtual environment set up, we can install pytest using the Python pip package manager:
```
$ pip3 install pytest
```
Running the Tests
Now we can run these tests using pytest:
```
$ python3 -m pytest
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/user
collected 3 items

tests/test_factorial.py ...                                              [100%]

============================== 3 passed in 0.02s ===============================
```
So what's happening here? When started without any arguments, pytest does a number of things to look for tests you have written. By default, it will recursively check in directories (including the current one) for files that begin with test_ and end with .py, and if found, it looks for functions whose names also start with the letters test_ and runs each one. It will even find test methods matching the same pattern within classes beginning with Test. See the pytest documentation on good practices if you'd like to know more about how pytest finds tests, and other file layouts you can use to arrange your tests.
Notice the ... after our test script:
- If the function completes without an assertion being triggered, we count the test as a success (indicated as .).
- If an assertion fails, or we encounter an error, we count the test as a failure (indicated as F). The error is included in the output so we can see what went wrong.
If we have many tests, we essentially get a report indicating which tests succeeded or failed. Going back to our list of requirements, do we think these results are easy to understand?
Retesting Refactored Code
It's fair to assume our code will likely change over time, and as it does so we should check that the behaviour of new features and functions is correct. It's also possible that we (or others) will come across incorrect behaviour that requires fixing. In either event, we should add new tests to check behaviour.
For example, perhaps someone points out that running factorial(10000) leads to an error:
```
...
  File "/Users/user/tmp/factorial/mymath/factorial.py", line 8, in factorial
    if n == 0 or n == 1:
RecursionError: maximum recursion depth exceeded in comparison
```
In this case, perhaps being able to deal with large numbers is important. So, we reimplement the function to avoid this error, refactoring it to use an iterative programming technique as opposed to a recursive one, by changing factorial.py to the following:
```python
def factorial(n):
    """
    Calculate the factorial of a given number.

    :param int n: The factorial to calculate
    :return: The resultant factorial
    """

    factorial = 1
    for i in range(1, n + 1):
        factorial = factorial * i

    return factorial
```
Note we use the term refactoring here: we want to change how the function arrives at its result to avoid the error, but don't want to change the purpose or expected behaviour of the function (hence the docstring hasn't changed).
Now here's where our tests are really valuable and save us time. Once we've refactored our code, we can rerun them to check if the functional behaviour is the same, in what is known as regression testing. The good news is that they all pass, which gives us confidence that this new implementation works as expected.
Pytest can’t think of test cases for us, so we still have to decide what to test and how many tests to run. Our best guide here is economics: we want the tests that are most likely to give us useful information that we don’t already have. For example, we've tested with input values such as 3, 5, and 10, so there's probably not much point testing for 7, 9, or 15, since it’s hard to think of a bug that would show up in one of those cases but not in the others.
This is what we should be doing: trying to think of test cases that are as different from each other as possible, so that we force the code we’re testing to execute in all the different ways it can – to ensure our tests have a high degree of code coverage.
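Boundary values are often a good source of such distinct cases. For our factorial function, 0 and 1 are edge cases: the loop body runs zero times or exactly once, behaviour our existing tests for 3, 5, and 10 never exercise. A sketch of what such tests might look like follows; the iterative factorial is reproduced inline here so the example is self-contained, and the test names are our own suggestions:

```python
def factorial(n):
    """Iterative factorial, as refactored earlier in the guide."""
    factorial = 1
    for i in range(1, n + 1):
        factorial = factorial * i
    return factorial

# Boundary cases: for n=0 the loop body never runs, and for n=1
# it runs exactly once - paths our mid-range tests don't exercise.
def test_factorial_0():
    assert factorial(0) == 1

def test_factorial_1():
    assert factorial(1) == 1
```

Tests like these would have kept passing across the switch from the recursive implementation (where 0 and 1 hit the explicit base case) to the iterative one, making them useful regression tests as well.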
A simple way to check the code coverage for a set of tests is to have pytest tell us how many statements in our code are being tested. We can find this out by installing a Python package called pytest-cov into our virtual environment, which pytest will then make use of:
```
$ pip3 install pytest-cov
$ python3 -m pytest --cov=mymath.factorial tests/test_factorial.py
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/user
plugins: cov-3.0.0
collected 3 items

tests/test_factorial.py ...                                              [100%]

---------- coverage: platform linux, python 3.8.10-final-0 -----------
Name                  Stmts   Miss  Cover
-----------------------------------------
mymath/factorial.py       5      0   100%
-----------------------------------------
TOTAL                     5      0   100%

============================== 3 passed in 0.03s ===============================
```
This looks great! At this stage, all statements in our code are being tested. But as the number of statements in our code grows, fewer of them will be covered by our tests, and that is when we need to think about whether writing new tests makes sense. As we've mentioned, this is an economic decision as well as a technical one. In reality, beyond basic examples, achieving 100% test coverage often isn't practical, and with limited effort it makes sense to prioritise testing the parts of the code that contribute to the accuracy of generated results.
Testing for Errors
Now the factorial function itself is only defined for non-negative integers, and we can see that our code doesn't deal with negative numbers: when we run factorial(-1), for example, we get 1, which is an invalid result since factorial is undefined for negative integers. We might decide to check for this at the start of our function and raise an error if it occurs. Insert the following lines immediately before factorial = 1:
```python
    ...
    if n < 0:
        raise ValueError('Only use non-negative integers.')
    ...
```
And so when we now run this with factorial(-1) we get:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/tmp/factorial-ref2/mymath/factorial.py", line 11, in factorial
    raise ValueError('Only use non-negative integers.')
ValueError: Only use non-negative integers.
```
Since we've changed our code, we should always check our tests still pass:
```
$ python3 -m pytest --cov=mymath.factorial tests/test_factorial.py
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/user
plugins: cov-3.0.0
collected 3 items

tests/test_factorial.py ...                                              [100%]

---------- coverage: platform linux, python 3.8.10-final-0 -----------
Name                  Stmts   Miss  Cover
-----------------------------------------
mymath/factorial.py       7      1    86%
-----------------------------------------
TOTAL                     7      1    86%

============================== 3 passed in 0.03s ===============================
```
Which they do. But now we see that only 86% of our statements are covered: we've changed our code, and the correct behaviour in a particular circumstance is to raise an error. This looks like fundamental behaviour we should add a test case for, but how would we write it? Fortunately, pytest can help us write these types of test cases. Let's add a new one to tests/test_factorial.py (be sure to add import pytest at the top of the script):
```python
import pytest

...

def test_factorial_negative1():
    with pytest.raises(ValueError):
        factorial(-1)
```
So here, we now import the pytest library and use it explicitly to test for the presence of a ValueError when we invoke factorial with a negative number. And when we run our tests again:
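As an aside, pytest.raises also accepts a match argument (a regular expression) that checks the error message itself, guarding against a ValueError raised for some unrelated reason being accepted by the test. A sketch of this follows, with the guide's factorial reproduced inline so the example stands alone; the test name is our own suggestion:

```python
import pytest

def factorial(n):
    """Iterative factorial with the input check added earlier."""
    if n < 0:
        raise ValueError('Only use non-negative integers.')
    result = 1
    for i in range(1, n + 1):
        result = result * i
    return result

def test_factorial_negative_message():
    # The test only passes if a ValueError is raised AND its
    # message matches the given regular expression.
    with pytest.raises(ValueError, match='non-negative'):
        factorial(-1)
```

Matching on a fragment of the message rather than the whole string keeps the test from breaking on minor rewording of the error text.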
$ python3 -m pytest --cov=mymath.factorial tests/test_factorial.py ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 rootdir: /home/user plugins: cov-3.0.0 collected 4 items tests/test_factorial.py .... [100%] ---------- coverage: platform linux, python 3.8.10-final-0 ----------- Name Stmts Miss Cover ----------------------------------------- mymath/factorial.py 7 0 100% ----------------------------------------- TOTAL 7 0 100% ============================== 4 passed in 0.04s ===============================
We can see that not only do they all now pass, but our statement coverage is now back to 100%.
In this guide, we've taken a look at how we can write unit tests to help us automatically check that our software is functioning correctly, as well as what makes good test cases and how we can determine how much of our code is actually being tested. In the next guide in this series, we'll look at how we can parameterise our unit tests so we can run them many times over different test input data.
If you're after a more general overview of automation that is largely language agnostic, look at our high-level guide on testing your software, which covers automation from the build process to unit testing and continuous integration and what it can give you.
Want to discuss this post with us? Send us an email or contact us on Twitter @SoftwareSaved.