Testing

Testing is an essential part of creating high-quality software. Many software developers understand the importance of testing, but they do not really understand how to write tests or what the limitations of testing are. This chapter provides deeper insight into testing.

Structure of Tests

Many books talk about the importance of testing and how to write tests. However, few books talk about how to structure tests. Before writing any test, you need to answer an important question: do you want the testing code to be sent to the users (or customers)? In most cases, the answer is no. The program used to generate the product sent to customers is often called the “production code”. You do not want to include test code inside production code for many reasons. First, including testing code may make your product bigger (in terms of storage) and slower (in terms of execution time). Worse, testing often needs specific inputs and the expected outputs for these test cases. Will you also give these inputs and outputs to your customers? If you do not, why do you include testing code in the product?

Mixing production code and testing code is a common mistake among beginning software developers. Worse, many of them add a lot of “debugging messages” as they test their code. When the program is ready for production and release, what will they do with these debugging messages? In almost all cases, your customers do not understand the debugging messages, and you have to remove the code that prints them. Here is the problem: when you remove that code, there is a high risk that one of two things happens. In the first scenario, you do not remove all the code printing the debugging messages. Your customers are annoyed by the mysterious messages, and your competitors learn more about your product than you want to reveal. The second scenario is worse. You accidentally remove more code than you intend and actually add bugs to your program. Your program no longer works and your customers are very angry at your product.

The correct way of creating tests is to separate testing code from the production code in different files (probably in different directories). The production code does what it needs to do as the product. The testing code tests the production code. When your product is ready, exclude the testing code, the inputs, and the expected outputs before releasing your product. Figure [fig:teststructure01] illustrates this concept.

../_images/teststructure01.png

Figure 32 Production code and test code should be completely separated.

Use Assert Correctly

Another common mistake among beginning software developers is to use assert incorrectly. This is what they often do.

# This is production code
....
assert(condition that should be true)
....

This is one example of how assert may be used:

def func(x):
    assert (x > 1)

There are three problems when you do this. First, if the condition is not true, the program immediately stops. There is no second chance. Imagine that your program is a text editor. When a user wants to save the content to a file, your program should not erase an existing file. You put assert there, claiming that the file must not exist. If a user accidentally chooses a file name that already exists, the program stops and everything typed by the user is lost. You will definitely have a user who will never buy any product from you again. Some people argue, “I put assert there because I know it must be true.” That argument is self-contradictory. If you know that the condition must be true, you will not put assert there. Would you write the following?

assert (1 > 0)

You would not.

If you are absolutely certain the condition must be true, you do not need to put assert there.

Some people say, “I will remove all assert statements before releasing the product.” This can be done easily using the grep -v command. This, however, brings the second problem: what would happen if some of the assert statements actually do something useful? Consider the following case:

assert (x = y)

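In a language such as C, where an assignment is an expression, this assigns y’s value to x. (Python rejects this line as a syntax error, but the assignment expression assert (x := y), available since Python 3.8, has the same effect.)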

What you really want to do is probably to ensure that x and y are the same:

assert (x == y)

The earlier assert statement assigns y’s value to x. Hence, their values are the same and the intended assert is actually true. If you remove the assert and no longer assign y’s value to x, the program may not work any more. Human eyes are not good at detecting the difference between = and ==. This is a difficult bug to fix because the program is correct while the assert is kept and wrong once the assert is removed. These problems caused by misusing assert were observed in the authors’ classes. They actually happened multiple times.
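The same trap can be sketched in Python. This is a minimal illustration, not the authors’ classroom code: it assumes Python 3.8 or later, where the assignment expression := is allowed inside assert, and it uses the built-in compile() with its optimize argument to imitate stripping the assert statements (optimize=1 removes them, as running python3 -O does). The names SOURCE and run are made up for this example.

```python
# SOURCE is a tiny hypothetical program whose assert has a side effect.
SOURCE = """
x = 0
y = 7
assert (x := y)   # checks that y is non-zero AND assigns y's value to x
result = x
"""

def run(optimize):
    # optimize=0 keeps assert statements; optimize=1 removes them
    namespace = {}
    code = compile(SOURCE, "<demo>", "exec", optimize=optimize)
    exec(code, namespace)
    return namespace["result"]

with_assert = run(0)      # the assert runs, so x receives y's value
without_assert = run(1)   # the assert is gone, so x keeps its old value
print(with_assert, without_assert)
```

The two results differ, which is exactly the situation described above: the program behaves correctly only while the assert is present.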

The third problem is, perhaps, somewhat philosophical. It is about the attitude toward creating good software. Software is deployed in many safety-critical systems. Mistakes in these systems may cause significant financial losses, bodily injuries, or even deaths. Allowing a program to stop suddenly is simply an unacceptable way of thinking. Using assert is an irresponsible way of writing production code. It expresses the attitude “Something is not expected. My program commits suicide and the rest is not my problem any more.” A more responsible way of writing good software is to handle the problem without stopping the program.
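For the text editor example above, handling the problem might look like the following sketch. The function name save_text and its True/False return convention are assumptions made for illustration; a real editor would probably ask the user for another name or for permission to overwrite.

```python
import os
import tempfile

def save_text(path, text):
    """Save text to path; return False instead of stopping if path exists."""
    try:
        # mode 'x' creates a new file and refuses to overwrite an existing one
        with open(path, 'x') as fhd:
            fhd.write(text)
    except FileExistsError:
        # The program keeps running and the user's text is still in memory.
        return False
    return True

# Usage: the second save is refused, but nothing is lost and nothing crashes.
folder = tempfile.mkdtemp()
target = os.path.join(folder, 'notes.txt')
first = save_text(target, 'first draft')
second = save_text(target, 'second draft')
print(first, second)
```

The caller decides what to do when the function reports a refusal, so the existing file is protected without ending the program.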

If assert has these problems, does that mean assert should not be used at all? You should not use assert in production code. You can use assert in the testing code, as shown in Figure [fig:teststructure02].

../_images/teststructure02.png

Use assert in the testing code. Do not put assert in production code.

In this example, the testing code uses assert to check whether the output from function 1 is correct. This is particularly important before sending it to the input of function 2. This further emphasizes the importance of separating production code from testing code.

Exception

To be added later

Pytest

Python has a testing tool called pytest. It automatically looks for files whose names start with test_ and end with .py, and executes the functions whose names start with test_. The following is a simple example.

#!/usr/bin/python3
# test_example1.py

def returntrue():
    return True

def returnfalse():
    return False

def test_truefalse():
    assert returntrue() == True
    assert returnfalse() == False
    

Running pytest gets the following output:

$ pytest
============ test session starts =============
platform linux -- Python 3.5.2, pytest-4.3.0, py-1.8.0, pluggy-0.9.0
rootdir: /..., inifile:
collected 1 item

test_example1.py .

Here, the rootdir is the directory where you run pytest. This output means the tests have passed.

Change the program as follows: returntrue returns False and returnfalse returns True:

#!/usr/bin/python3
# test_example2.py

def returntrue():
    return False

def returnfalse():
    return True

def test_truefalse():
    assert returntrue() == True
    assert returnfalse() == False
    

Running pytest gets this result, showing the test has failed.

======================== FAILURES ========================
____________________ test_truefalse ______________________

    def test_truefalse():
>       assert returntrue() == True
E       assert False == True
E        +  where False = returntrue()

test_example2.py:11: AssertionError
==================== 1 failed in 0.04 seconds ====================

When you create tests, it is important that you check whether the tests may fail. Passing tests is not the purpose of testing. The purpose of testing is to detect mistakes in production code and correct the code. If a test can never fail, this test is useless.

Next, consider a slightly more complex program that tests two functions, f1 and f2:

#!/usr/bin/python3
# test_example3.py

def f1(x):
    return (x + 1)

def f2(x):
    return (x - 1)

def test_f1f2():
    assert f1(3) == 4
    assert f1(-3) == -2
    assert f2(3) == 4
    
    

This is the output of pytest:

======================== FAILURES ========================
_______________________ test_f1f2 ________________________

    def test_f1f2():
        assert f1(3) == 4
        assert f1(-3) == -2
>       assert f2(3) == 4
E       assert 2 == 4
E        +  where 2 = f2(3)

test_example3.py:13: AssertionError

The result says that the calls f1(3) and f1(-3) have passed the tests. The call f2(3) has failed because f2(3) returns 2 but the test program expects 4.

Earlier this chapter said testing code and production code should be separated. The examples for pytest above violated this rule. The following examples will follow the rule separating the testing code from the tested code. Consider a file with two functions: f1 and f2:

# funcs.py
def f1(x):
    print (x + 1)

def f2(x):
    print (x + 1)
    print (x * x)
    print (3 * x)
    

This is the output when calling the functions interactively:

$ python3
Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import funcs
>>> funcs.f1(3)
4
>>> funcs.f2(5)
6
25
15

A test program calls funcs.f1(3) and funcs.f2(5).

#!/usr/bin/python3
# test_example4.py
import funcs

def test_f1f2():
    funcs.f1(3)
    funcs.f2(5)
    
    
    

Running pytest, however, shows no output from these functions. What is happening? Instead of printing to the computer screen (also called the console), pytest captures the output. The reason is that pytest does not want to print too many things: if too much is printed to the screen, important messages are likely to be overlooked. Another reason is that pytest often runs automatically when no human is watching (more details about this later in the section on Continuous Integration). There are two solutions. The first is to tell pytest to print to the screen by adding -s after pytest. This defeats pytest’s intention of not printing to the screen.

The second is a better and more general solution: it saves the captured output to a file. It is actually quite straightforward:

#!/usr/bin/python3
# test_example5.py

import funcs

def test_f1f2(capsys):
    funcs.f1(3)
    funcs.f2(5)
    captureout, captureerr = capsys.readouterr()
    capturename = 'capturedstdout'
    # the with statement closes the file when the block ends
    with open(capturename, 'w') as fhd:
        fhd.write(captureout)
    

    
    

The function test_f1f2() has an input, capsys, for capturing the outputs. This input is provided automatically by pytest. To get the captured output, the test calls capsys.readouterr(). This function returns what has been written to the standard output (also called stdout) and to the standard error (also called stderr). Separating stdout from stderr gives more flexibility: the former carries the normal output and the latter carries the error messages. By default, both stdout and stderr are shown on the computer screen, but it is possible to save them to different files.
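The separation of stdout and stderr can also be tried outside pytest. The sketch below does not use capsys; it assumes only the standard library’s contextlib.redirect_stdout and contextlib.redirect_stderr, and the function noisy is a made-up stand-in for a function such as funcs.f1 that also reports a warning.

```python
import contextlib
import io
import sys

def noisy(x):
    print(x + 1)                                  # normal output goes to stdout
    print('warning: demo only', file=sys.stderr)  # error message goes to stderr

out_buffer = io.StringIO()
err_buffer = io.StringIO()
# Redirect the two streams into two separate in-memory buffers.
with contextlib.redirect_stdout(out_buffer), contextlib.redirect_stderr(err_buffer):
    noisy(3)

captured_out = out_buffer.getvalue()
captured_err = err_buffer.getvalue()
```

After the with block, captured_out holds only the normal output and captured_err holds only the warning, so each can be written to its own file or checked separately.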

The testing program test_example5.py saves the standard output to a file called capturedstdout. All outputs by the print statements in funcs are saved in this file. The content of the file is (as expected)

4
6
25
15

When the outputs are saved to a file, testing can be done automatically, without a human watching, by following these steps:

  1. write the expected output to a file

  2. capture the output from pytest and save it to another file

  3. compare the two files

The following program shows these steps:

#!/usr/bin/python3
# test_example6.py

import filecmp
import funcs

def test_f1f2(capsys):
    funcs.f1(3)
    funcs.f2(5)
    captureout, captureerr = capsys.readouterr()
    capturename = 'capturedstdout'
    # leaving the with block closes the file, ensuring that it has been saved
    with open(capturename, 'w') as fhd:
        fhd.write(captureout)
    assert filecmp.cmp(capturename, 'funcsexpected')

    

    
    

The test uses filecmp.cmp to compare the contents of two files. As explained, if a test cannot fail, it is useless. If funcs.f2(5) is replaced by funcs.f2(4), pytest fails. This suggests that pytest is actually testing funcs.f2.
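One detail of filecmp.cmp is worth knowing: by default (shallow=True) it first compares the files’ os.stat signatures and treats files with identical signatures as equal; passing shallow=False always compares the contents. The sketch below is a stand-alone illustration with temporary files. The helper write_file and the file names are made up, and the “different” text is what funcs.f1(3) followed by funcs.f2(4) would print.

```python
import filecmp
import os
import tempfile

def write_file(folder, name, text):
    # hypothetical helper: create a small text file and return its path
    path = os.path.join(folder, name)
    with open(path, 'w') as fhd:
        fhd.write(text)
    return path

folder = tempfile.mkdtemp()
expected = write_file(folder, 'expected', '4\n6\n25\n15\n')
same = write_file(folder, 'same', '4\n6\n25\n15\n')
different = write_file(folder, 'different', '4\n5\n16\n12\n')

# shallow=False forces a comparison of the actual contents
match = filecmp.cmp(same, expected, shallow=False)
mismatch = filecmp.cmp(different, expected, shallow=False)
print(match, mismatch)
```

Note that the two texts have the same length, so comparing file sizes alone would not tell them apart.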

Test Integer Partition

Only “toy” examples for pytest have been shown so far. By capturing the outputs, pytest can compare expected outputs with the actual outputs. Thus, it becomes much easier to test many cases using a single test program. The following program tests integer partition using three different values: 3, 4, and 5. The expected outputs are stored in a separate directory called expected and the files are called, not surprisingly, 3, 4, and 5.

#!/usr/bin/python3
# test_intpart1.py

import intpart
import filecmp

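def run_test(args, expected, capsys):
    # args is a string of command-line arguments, such as '3' or '-o 5'
    capturename = 'capturedstdout'
    # argparse expects a list of strings, so split the argument string
    pargs = intpart.checkArgs(args.split())
    intpart.partition(pargs)
    captured = capsys.readouterr()
    with open(capturename, 'w') as fhd:
        fhd.write(captured.out)
    assert filecmp.cmp(capturename, expected)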

def test_intpart(capsys):
    run_test('3', 'expected/3', capsys)
    run_test('4', 'expected/4', capsys)
    run_test('5', 'expected/5', capsys)

    

Test Coverage

Test coverage is the percentage of the code that has been executed by the tests. Getting test coverage is pretty easy with a Python tool called coverage: simply add coverage run before the program to be tested (and the necessary arguments).

$ coverage run intpart.py -e -o 5
-e and -o cannot be both set
$ coverage report
Name         Stmts   Miss  Cover
--------------------------------
intpart.py      50     26    48%

The coverage run command executes the Python program intpart.py with the arguments -e -o 5. Since it is not possible to partition an integer using only even numbers and only odd numbers at the same time, intpart.py stops without partitioning the input value 5. The coverage is 48%.

To see which lines have been tested, run coverage annotate. It creates the file intpart.py,cover, as shown below:

  #!/usr/bin/python3
  # intpart.py

> import sys
> import argparse

> def printArray(arr, ind):
!   for i in range(0, ind - 1):
!     print (str(arr[i]) + ' + ', end='')
!   print (str(arr[ind - 1]))

> def partitionHelp(arr, ind, left, odd, even, order, notself):
!   if (left == 0):
      # if notself and ind is 1, then only one number is printed
      # this one number is itself
!     if ((not notself) or (ind > 1)):
!       printArray(arr, ind)
!   '''
!   There are four conditions when this i is used
!   1. not odd and not even:
!   2. odd and i is odd
!   3. even and i is even
!   '''
!   for i in range(1, left + 1):
!     if (order and (ind != 0) and (arr[ind - 1] > i)):
        # orders do not matter
        # the numbers must not be decreasing
!       continue
!     if ((not odd) and (not even)):
!       arr[ind] = i
!     elif (odd and (i % 2)):
!       arr[ind] = i
!     elif (even and ((i % 2) == 0)):
!       arr[ind] = i
!     else:
!       continue # do not use this value of i
!     partitionHelp(arr, ind + 1, left - i, odd, even, order, notself)

> def partition(args):
    # print (args)
>   odd = args.odd
>   even = args.even
>   order = args.order
>   val = args.value
>   notself = args.notself
>   if (odd and even):
>     sys.exit('-e and -o cannot be both set')
!   if (even and (val % 2)):
!     sys.exit('-e cannot partition an odd number')
!   print('== Partition ' + str(val) + ' ==')
!   if (odd):
!     print('== Using only odd numbers ==')
!   if (even):
!     print('== Using only even numbers ==')
!   arr = [0] * val
!   partitionHelp(arr, 0, val, odd, even, order, notself)

> def checkArgs(args = None):
>   parser = argparse.ArgumentParser(description='parse arguments')
>   parser.add_argument('-o', '--odd', action='store_true',
>                       help = 'odd numbers only', default = False)
>   parser.add_argument('-e', '--even',action='store_true',
>                       help = 'even numbers only', default = False)
>   parser.add_argument('-r', '--order',action='store_true',
>                       help = 'orders do not matter', default = False)
>   parser.add_argument('-s', '--notself',action='store_true',
>                       help = 'not to include itself', default = False)
>   parser.add_argument('value', type = int,
>                       help = 'number to parition')
>   pargs = parser.parse_args(args)
>   return pargs

> if __name__== "__main__":
>   args = checkArgs(sys.argv[1:])
>   partition(args)

If a line starts with >, this line has been tested. If a line starts with !, this line has not been tested.

Next, run intpart.py again without -e or -o. This time coverage report says 80% of the code is covered, and coverage annotate generates the following file:

  #!/usr/bin/python3
  # intpart.py

> import sys
> import argparse

> def printArray(arr, ind):
>   for i in range(0, ind - 1):
>     print (str(arr[i]) + ' + ', end='')
>   print (str(arr[ind - 1]))

> def partitionHelp(arr, ind, left, odd, even, order, notself):
>   if (left == 0):
      # if notself and ind is 1, then only one number is printed
      # this one number is itself
>     if ((not notself) or (ind > 1)):
>       printArray(arr, ind)
>   '''
>   There are four conditions when this i is used
>   1. not odd and not even:
>   2. odd and i is odd
>   3. even and i is even
>   '''
>   for i in range(1, left + 1):
>     if (order and (ind != 0) and (arr[ind - 1] > i)):
        # orders do not matter
        # the numbers must not be decreasing
!       continue
>     if ((not odd) and (not even)):
>       arr[ind] = i
!     elif (odd and (i % 2)):
!       arr[ind] = i
!     elif (even and ((i % 2) == 0)):
!       arr[ind] = i
!     else:
!       continue # do not use this value of i
>     partitionHelp(arr, ind + 1, left - i, odd, even, order, notself)

> def partition(args):
    # print (args)
>   odd = args.odd
>   even = args.even
>   order = args.order
>   val = args.value
>   notself = args.notself
>   if (odd and even):
!     sys.exit('-e and -o cannot be both set')
>   if (even and (val % 2)):
!     sys.exit('-e cannot partition an odd number')
>   print('== Partition ' + str(val) + ' ==')
>   if (odd):
!     print('== Using only odd numbers ==')
>   if (even):
!     print('== Using only even numbers ==')
>   arr = [0] * val
>   partitionHelp(arr, 0, val, odd, even, order, notself)

> def checkArgs(args = None):
>   parser = argparse.ArgumentParser(description='parse arguments')
>   parser.add_argument('-o', '--odd', action='store_true',
>                       help = 'odd numbers only', default = False)
>   parser.add_argument('-e', '--even',action='store_true',
>                       help = 'even numbers only', default = False)
>   parser.add_argument('-r', '--order',action='store_true',
>                       help = 'orders do not matter', default = False)
>   parser.add_argument('-s', '--notself',action='store_true',
>                       help = 'not to include itself', default = False)
>   parser.add_argument('value', type = int,
>                       help = 'number to parition')
>   pargs = parser.parse_args(args)
>   return pargs

> if __name__== "__main__":
>   args = checkArgs(sys.argv[1:])
>   partition(args)

Many lines change from ! (not tested) to > (tested). The second test has 80% coverage, higher than the first test.

The differences are shown in Figure [fig:comparecoverage].

../_images/comparecoverage.png

Compare the two coverage outputs.

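Here are tests that together can achieve 100% coverage. The -a (--append) option tells coverage to add each run to the data already collected instead of replacing it:

coverage run -a intpart.py -o 5
coverage run -a intpart.py -e 6
coverage run -a intpart.py -o 8
coverage run -a intpart.py -r 5
coverage run -a intpart.py -s 4
coverage run -a intpart.py
coverage run -a intpart.py -h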

Is higher coverage always better? This is a common misunderstanding. The purpose of testing is to discover problems. If a test can discover problems, it is a good test even if it covers only a small percentage of the code. The next common misunderstanding may surprise you: even if your tests have 100% coverage, the program may still have undiscovered problems. “How can that be possible with 100% test coverage?” you may ask.

The misunderstanding arises from the definition of coverage. Test coverage is defined by whether each line has been tested; it is not defined by whether each scenario has been tested. What is the difference? Consider the following example:

if (condition1):
    # A
else:
    # B
...
if (condition2):
    # C
else:
    # D
...

The flow can be expressed in Figure [fig:twoconditions].

../_images/twoconditions.png

There are four possible scenarios of the two conditions:

  1. condition1: True, condition2: True

  2. condition1: True, condition2: False

  3. condition1: False, condition2: True

  4. condition1: False, condition2: False

If one test has True for condition1 and False for condition2, then A and D have been tested. If another test has False for condition1 and True for condition2, then B and C have been tested. These two test cases have tested A, B, C, and D. In other words, all code has been tested and the coverage is 100%.

“Wait. How can it be possible to get 100% coverage with only two tests? There are four scenarios,” you may ask.

That is the problem: scenarios 1 and 4 have never been executed, yet the coverage is reported as 100%.

What does this mean? Does this mean coverage is not helpful? Test coverage is very helpful, but it is more than a percentage. You should check whether the specific lines you want to test have been covered. Do not focus on the coverage number only.
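The gap between covering every line and covering every scenario can be made concrete with a small sketch. The function adjust and its bug are invented for illustration; the comments A, B, C, and D correspond to the four blocks in the example above.

```python
def adjust(x, condition1, condition2):
    if condition1:
        divisor = 0            # A
    else:
        divisor = 1            # B
    if condition2:
        return x / divisor     # C: breaks only if A ran before it
    else:
        return x               # D

# Two tests execute every line (A and D, then B and C): 100% line coverage.
result_a_d = adjust(6, True, False)
result_b_c = adjust(6, False, True)

# The untested scenario (condition1 True, condition2 True) still hides a bug.
try:
    adjust(6, True, True)
    hidden_bug = False
except ZeroDivisionError:
    hidden_bug = True

print(result_a_d, result_b_c, hidden_bug)
```

Both tests pass and every line has been executed, yet the combination of A followed by C divides by zero.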

Continuous Integration

Have you ever had a situation in which your software has a bug that is not discovered for a long time? The existence of the bug and its ability to escape detection may surprise your entire team. “How did that happen?”, you may ask. This may happen for a wide variety of reasons. One possible reason is that the software has not been well tested and, as a result, the bug is not detected. Testing is an important method for detecting bugs. However, testing can be tedious. It is common that someone makes a “small” change and skips testing. The person believes the change is so small that it cannot possibly have any bug. If several people add a few small changes here and there, the software is soon full of “small” bugs. Why do people not test their programs immediately after they have made changes? One reason is that testing requires additional work.

Is it possible to fully automate testing without any additional effort? Continuous Integration (CI) does exactly that. You still need to write the tests; there is no way to avoid that part. The previous sections explain how to write tests. What continuous integration does is automatically invoke these tests whenever you push your program to the shared repository. This section uses Travis-CI for continuous integration.

../_images/travis01.png

Travis-CI.com is a website supporting continuous integration.

Using Travis-CI is quite simple. You need to add a special file named .travis.yml to the repository. It tells Travis-CI what to do when changes are pushed to GitHub. This is a simple example:

$ python3 --version
Python 3.5.2

The content of this file should be pretty easy to understand: it says that the repository uses Python as the language and that version 3.5 is required. When changes are pushed, Travis-CI runs pytest, which automatically looks for Python programs whose names start with test_ (such as test_intpart). If the tests run successfully, the result appears green. If a test fails, it appears red. You are encouraged to modify the test, make it fail, and see what a failure looks like.
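The listing above shows only the Python version and not the file itself. A minimal sketch of a .travis.yml that matches the description (Python as the language, version 3.5, run pytest) might look like this; it is an assumption based on Travis-CI’s standard language, python, and script keys, not a copy of the authors’ file.

```yaml
# .travis.yml (hypothetical reconstruction from the description)
language: python
python:
  - "3.5"
script:
  - pytest
```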

../_images/travis02.png

Running pytest successfully in Travis-CI.

Limitations of Testing

It is important to understand that testing has limitations. It is practically impossible to test every possible scenario. Every if is a condition. A condition divides a program into two different paths depending on whether the condition is true or not. A non-trivial program can easily have hundreds of conditions. As a simple explanation of why it is not possible to test every possible scenario, consider a program with 100 independent conditions, i.e., whether one condition is true or false does not affect any other condition. There are \(2^{100}\) possible scenarios. The fastest computer in the world can perform several hundred quadrillion calculations per second (one quadrillion is \(10^{15}\)); this is a parallel computer but let’s not worry about that detail. Take \(10^{15}\) as a round figure and assume that one scenario is checked per calculation. Because \(\log_2(10) \approx 3.3\), \(10^{15} \approx 2^{50}\). Thus, the computer can check about \(2^{50}\) scenarios per second. How many can it check per minute? \(2^{50} \times 60\). How about per day (86,400 seconds)? \(2^{50} \times 86400 \approx 2^{66}\). In order to test \(2^{100}\) possible scenarios, this computer needs to work for \(2^{100} / 2^{66} = 2^{34}\) days, which is tens of millions of years (much longer than the history of humans).

In fact, the situation is even worse. The estimation above considers only two possible cases for each condition: true or false. If the condition compares two numbers, then “every possible scenario” means every possible pair of these two numbers. What does this mean? It is completely impossible to test every possible scenario of a program’s execution paths controlled by the conditions.
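The arithmetic can be checked with a few lines of Python. This is only a sketch of the estimate above, using the same round figure of \(2^{50}\) checks per second; the variable names are chosen for this example.

```python
import math

scenarios = 2 ** 100            # 100 independent true/false conditions
checks_per_second = 2 ** 50     # roughly 10**15, the round figure used above
seconds_per_day = 86400

checks_per_day = checks_per_second * seconds_per_day
days = scenarios / checks_per_day
years = days / 365.25

print('log2(checks per day) = %.1f' % math.log2(checks_per_day))
print('days needed          = %.2e' % days)
print('years needed         = %.2e' % years)
```

Without rounding the checks per day to \(2^{66}\), the result is somewhat below \(2^{34}\) days, but it is still tens of millions of years.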

Does this mean there is no need to test because it is not possible to test all possible scenarios? No. What it means is that testing has to be done carefully so that it can detect as many problems (i.e., bugs) as possible. Good tests should consider many different scenarios.

Halting Problems: Unsolvable Problems

Is it possible to check whether a program is correct by using another program? More specifically, is it possible to write a program, called \(p\), that has three inputs: (1) another program \(q\), (2) input data \(i\), and (3) expected output \(e\), and decides whether program \(q\) produces \(e\) when taking the input \(i\), without running the program \(q\)? This problem is at least as difficult as another problem:

Is it possible to write a program, called \(p\), that has two inputs: (1) another program \(q\) and (2) input data \(i\), and decides whether program \(q\) will eventually stop or run forever (i.e., enter an infinite loop) when taking the input \(i\)? This is called the halting problem in computation theory. It turns out that the halting problem cannot be solved.

This is not a formal proof but it gives the intuition why the halting problem cannot be solved.

First, it is possible to write a program \(p\) that takes another program \(q\) as the input. An operating system is such a program \(p\). A compiler is also such a program \(p\). Second, \(p\) cannot simply run \(q\) with input \(i\). The reason is that if \(q\) does not stop, \(p\) runs indefinitely without an answer. We do not know whether \(q\) takes a very long time and will eventually stop, or whether it will never stop. Third, since \(p\) can take a program \(q\) as the input, \(p\) can also take itself as the input.

So far, nothing is surprising. Next, we will explain why such a program \(p\) is not possible. The “proof” uses contradiction: we first assume that it is possible to create such a program \(p\) and then show that a contradiction occurs. Thus, the assumption (it is possible to create such a program \(p\)) is wrong.

If it is possible to create \(p\), then it is possible to create \(\tilde{p}\) by changing \(p\) slightly:

  • If \(p\)’s answer is yes, \(\tilde{p}\) enters an infinite loop and thus cannot answer the question.

  • If \(p\)’s answer is no, \(\tilde{p}\) says yes.

Next, run \(\tilde{p}\) on itself: ask \(p\) whether \(\tilde{p}\) halts when its input is \(\tilde{p}\).

Now, there is a problem. If \(p\)’s answer is yes (\(\tilde{p}\) halts), \(\tilde{p}\) enters an infinite loop and does not halt. If \(p\)’s answer is no (\(\tilde{p}\) does not halt), \(\tilde{p}\) says yes and halts. In other words, whatever \(p\) answers, \(\tilde{p}\) does exactly the opposite. What does this mean? How can this be possible? Why do we get a contradiction?

We get a contradiction because we have made a wrong assumption: it is possible to create \(p\). The truth is that it is not possible to create a program \(p\) that takes a program \(q\) and an input \(i\) and decides whether \(q\) halts when given the input \(i\).
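The construction of \(\tilde{p}\) can be sketched in Python. Everything here is hypothetical: no correct halting decider can exist, so the sketch plugs in a deliberately simple candidate, claims_never_halts, and shows that the contrary program proves it wrong. The names make_contrary, contrary, and claims_never_halts are invented for this illustration.

```python
def make_contrary(halts):
    """Build the contrary program from a claimed halting decider."""
    def contrary(program):
        if halts(program, program):
            while True:        # the decider said "halts", so loop forever
                pass
        return 'halted'        # the decider said "never halts", so halt at once
    return contrary

def claims_never_halts(program, data):
    # A candidate decider that always answers "no"; it is necessarily wrong.
    return False

contrary = make_contrary(claims_never_halts)

# The candidate predicts that contrary, run on itself, never halts ...
prediction = claims_never_halts(contrary, contrary)
# ... but contrary(contrary) returns immediately, contradicting the prediction.
outcome = contrary(contrary)
print(prediction, outcome)
```

A candidate that answers yes would instead send contrary(contrary) into an infinite loop, which is again the opposite of its answer; that case is not run here because it would never finish.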

What does this mean for testing? Deciding whether a program halts is no more difficult than deciding whether the program produces a correct output. If it is impossible to solve the halting problem, it is impossible to write a program \(p\) that decides whether program \(q\), given input \(i\), produces the expected output \(e\).

Another unsolvable problem is determining whether two programs (that are not identical) have the same outputs when given the same inputs. Precisely, it is not possible to create a program \(p\) that takes two inputs, (1) a program \(q\) and (2) another program \(r\), and answers whether \(q\) and \(r\) have the same output for every possible input.