Ask Steve! - Testing questions about testing

Author(s)
Steve Crouch

Software Team Lead

Posted on 5 December 2011



Today’s post comes courtesy of Mike Jackson, also from the Software Sustainability Institute. If the Institute were the Dukes of Hazzard television show, with Steve as Bo Duke, then Mike Jackson would surely be Luke Duke. In this post, Mike answers a testing question about testing frameworks in Python.

Software testing is a vital part of software development. It not only allows us to demonstrate that our software satisfies its requirements, but also helps to ensure that it is both correct and robust. Automated software testing provides a safety net during development, allowing us to fix bugs and make enhancements and extensions secure in the knowledge that, if we break anything, the tests will catch it. After all, there are few things worse than fixing a bug only to discover later that, in doing so, we’ve introduced a new one.

Philip Maechling of the Southern California Earthquake Center (SCEC), at USC, recently contacted the Institute with questions about software testing. Philip and his colleagues develop scientific software that outputs computational results into files. These files are typically simple ASCII text files, but contain series of floating point numbers, e.g. time series. Their acceptance testing involves comparing these files to existing reference result files.

Philip posed two questions:

  • Many unit test frameworks (e.g. JUnit and PyUnit) are focused around instantiating an object, or other software module, within a test class, calling methods on that module, then checking the values returned against expected values. While file comparisons can be done with such frameworks, they are complicated by the need for floating point comparisons (which are tricky at the best of times), and by differences in header information or non-significant file contents. So, are you aware of any testing tools designed to support tests based on file comparisons?
  • In our file-based comparison tests, we often use the same reference files in multiple tests. In some testing circles, a directory of tests and expected test results are collected into a datastore called an “oracle”. When you want to know the correct results, you look up your test and find the expected result in this oracle. Are you aware of any software unit or acceptance testing tools that support the idea of a test oracle? The concept is simple, and we have implemented a couple of our own oracle datastores, but we seem to re-invent this each project. If there is a standard solution, I am interested in trying it out.

Question 2 is a generalisation of question 1, using a set of reference files across multiple tests. As Philip comments, these reference files can be termed an “oracle”. More generally, “oracle” can refer to anything that validates the outputs of a test, i.e. checks the outputs of the software during the test against the expected outputs. So, for example, in a PyUnit test that compares the outputs of a function, for some specific inputs, to some hard-coded values, the comparison code and its hard-coded values serve as the oracle. If a developer tests a GUI and assesses the correctness of its behaviour, then they are serving as the oracle. Douglas Hoffman’s paper A Taxonomy for Test Oracles, from Quality Week 1998, gives an overview and taxonomy of oracles.
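
To make the hard-coded oracle concrete, here is a minimal PyUnit sketch; the mean function and its expected value are purely illustrative, not taken from Philip’s code:

    import unittest

    def mean(values):
        # Hypothetical function under test.
        return sum(values) / len(values)

    class MeanTest(unittest.TestCase):
        def test_mean(self):
            # The hard-coded expected value, 2.5, and the floating point
            # assertion together act as the oracle for this test.
            self.assertAlmostEqual(2.5, mean([1.0, 2.0, 3.0, 4.0]))

    if __name__ == '__main__':
        unittest.main()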

For question 1, an internet search did not reveal any Python frameworks that explicitly support tests that compare floating point data files for equality. Even if such a framework were available, the developer would still need to customise it to the structure and content of their specific files. Two frameworks that come close to Philip’s requirements are Cram and TextTest. Cram is a framework for testing command-line applications: it runs commands and compares their outputs to expected outputs, using pattern matching and regular expressions. TextTest is similar, but also has support for GUI testing. Outputs are compared directly, though filters are provided to handle run-dependent content and floating point differences outwith user-defined tolerances.

One can envisage at least two general approaches to comparing output files of floating point values to reference files. The first, sketched in code after the list below, is to:

  • Write a convertor that can be used to convert the output file data format into a simpler format containing just the floating point data.
  • Write a validator that takes in two floating point data sets and compares these, applying rounding or allowing for equality within defined tolerances.
  • Write each test to load the expected results from the reference files and the actual results from the output files, apply the convertor to both sets of results, then use the validator to compare the two.
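
A minimal sketch of this approach, assuming the output files hold whitespace-separated numbers with ‘#’ comment lines as headers; the file names and tolerance below are placeholders, not part of Philip’s setup:

    import unittest

    TOLERANCE = 1e-6

    def convert(filename):
        # Convert a results file into a flat list of floats, skipping
        # the assumed '#' header/comment lines.
        values = []
        with open(filename) as f:
            for line in f:
                if line.startswith('#'):
                    continue
                values.extend(float(token) for token in line.split())
        return values

    def validate(expected, actual, tolerance=TOLERANCE):
        # Check that two lists of floats are equal within a tolerance.
        if len(expected) != len(actual):
            return False
        return all(abs(e - a) <= tolerance for e, a in zip(expected, actual))

    class TimeSeriesTest(unittest.TestCase):
        def test_time_series(self):
            # 'reference.txt' and 'output.txt' are placeholder file names.
            self.assertTrue(validate(convert('reference.txt'),
                                     convert('output.txt')))

    if __name__ == '__main__':
        unittest.main()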

The second, again with a sketch after the list, is to:

  • Manually convert the reference files into template files. Regular expressions can be used both to handle parts of the files that may vary across test runs (e.g. headers) and to specify expected floating point values.
  • Write a validator which compares an output file to a reference file, applying the regular expressions in the reference file to assess whether the output file matches.
  • Write each test to apply the validator, comparing the output files to the reference files.
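
A minimal sketch of such a validator; the template format, in which each line of the reference file is a regular expression matching the corresponding output line, is an assumption made for illustration:

    import re

    def validate(template_filename, output_filename):
        # Match each output line against the regular expression on the
        # corresponding line of the (assumed) template file.
        with open(template_filename) as t, open(output_filename) as o:
            patterns = t.read().splitlines()
            lines = o.read().splitlines()
        if len(patterns) != len(lines):
            return False
        return all(re.match(pattern + r'$', line)
                   for pattern, line in zip(patterns, lines))

    # Example template lines: a run-dependent header and two floating
    # point values expected to 4 decimal places.
    #
    #   # Run at .*
    #   1\.5000\d*\s+2\.2500\d*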

Personally, I prefer the former solution as it avoids messing around with regular expressions.

For either approach, a number of Python libraries and resources can be used to construct a solution (a short example using a few of them follows the list). These include:

  • PyUnit, Python’s unit test library. This has assertion methods (assertAlmostEqual and assertNotAlmostEqual) for comparing floating point values to a given number of decimal places or within a given tolerance.
  • Python difflib library. This provides functions to compare two files and return the lines for which they differ. This is similar to the output from CVS and SVN “diff” commands. Cram uses difflib.
  • Python re regular expression library. Cram uses re.
  • Python filecmp file comparison library.
  • An introduction to writing regular expressions for floating point numbers.
  • TextTest (source code) and Cram (source code) are both open source products and it might be possible to reuse their functionality for comparing script files.
  • Hamcrest library for building “matchers” which are useful for expressing custom comparisons. It has been ported to many languages including Python.
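
As a brief illustration of how a few of these fit together (the values and lines below are invented for the example), assertAlmostEqual, PyHamcrest’s close_to matcher and difflib can each express a floating point or line-by-line comparison:

    import difflib
    import unittest

    from hamcrest import assert_that, close_to  # PyHamcrest

    class ComparisonExamples(unittest.TestCase):
        def test_almost_equal(self):
            # unittest's floating point assertion, to 3 decimal places.
            self.assertAlmostEqual(1.2345, 1.2346, places=3)

        def test_close_to(self):
            # Hamcrest matcher: equal within a delta of 0.001.
            assert_that(1.2345, close_to(1.2346, 0.001))

        def test_diff(self):
            # difflib reports the lines on which two line sequences differ,
            # much like the output of a 'diff' command.
            before = ['1.0 2.0\n', '3.0 4.0\n']
            after = ['1.0 2.0\n', '3.0 4.1\n']
            self.assertTrue(list(difflib.unified_diff(before, after)))

    if __name__ == '__main__':
        unittest.main()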