Cover Your Ass

…or, what coverage is, and what it isn’t

Code Coverage is a simple and (for some reason) controversial topic. Ask any group of developers, and you’re likely to get many conflicting opinions on how to do it properly. Some will say “well 100% or bust”, others will say it doesn’t matter at all. Some absolute crackpot (totally not me) may come at you with a line like “well actually, there’s between 7 and 13 different types of coverage, depending on how you measure it”, and what’s wild is that they would all be right. And also all wrong. Each tells a version of the truth, because coverage is simultaneously a really important and powerful tool for increasing the safety and quality of code, as well as an absolutely useless measure of code safety and quality.

Confused? Good! Now let’s begin our journey!

p.s. this is Part 1 of our series on understanding code coverage. This piece will be an overview, while Part 2 will be practical using Ape Framework’s new coverage feature. But more on that later.

What is Coverage Anyways?

“Code coverage is defined as a percentage measure of the degree to which the source code of a program is executed when a particular test suite is run.” - Wikipedia.

Uh, thanks Wikipedia. Very helpful.

So what does that mean? It means that when you write a test, running it activates particular areas of the code that you're trying to verify. It’s usually measured on a percentage basis, meaning if you have covered 80% of the entire source code then 80% has been activated while testing. Getting to 100% means you are activating the entire code in some way by running your tests. Is it the right way? Who knows, all it tells you is you got 100%. A perfect score. That’s all we care about. Our work is done. We are totally safe now.....

Right?

Annoyingly, the answer is no.

Code coverage metrics, as a percentage score, really tell you very little about whether your code is doing the right thing, which is what we care about at the end of the day. The only thing that 80% coverage means is that 80% of the code did something during the test, not whether it's what you wanted it to do or whether it was done well.

To make things even more confusing, most people only ever talk about "statement coverage", which is just one of many types we can use (don't worry, we'll get into them later).

Code coverage is imperfect but useful, and should be combined with other tests, fuzzing, and old-fashioned eyes-on-the-code to work out whether your program is actually doing what you want it to. And whatever you do, always keep Goodhart's Law in mind:

“When a measure becomes a target, it ceases to be a good measure.” - C. Goodhart

Types of Coverage

Now, as the totally rational, sane person quoted in the beginning said (that’s totally not me), there are actually many different types of coverage we can use, so let’s look into the differences and the shortcomings of each:

Statement Coverage

Fundamentally, a coverage metric measures the hit rate of a particular type of structure present in our code, i.e. what fraction of those structures were hit at least once while the tests ran. For statement coverage, that structure is a single statement in source code.

A statement is computer science jargon for an element of your program that expresses a complete “thought”, but just to keep things simple we are just going to think of it as a single line of source code.

Allow me to present a Python program as an example:

    1  def function1(value: int) -> int:
    2
    3      if value == 0:
    4          value = 1
    5
    6      return value

This is a pretty simple program, but what if I told you that you could get 100% statement coverage with just one test? That’s pretty efficient, right?

As I hope you can see, if we make the call with value = 0, we will actually be able to achieve 100% coverage of this code, since we will hit the if statement on line 3, the assignment statement on line 4, and the return statement on line 6. Of course, we’re calling function1, so the definition on line 1 counts as hit too, and we really don’t care about the whitespace on lines 2 and 5 (because they’re not actually programmatic statements).
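
If you want to see this for yourself, here is a minimal sketch that uses Python's built-in sys.settrace hook to record which lines of function1 actually run. The helper names (executed_lines, statement_coverage) and the hard-coded STATEMENT_LINES set are my own assumptions for illustration; a real coverage tool works out the executable lines for you.

```python
import sys


def function1(value: int) -> int:

    if value == 0:
        value = 1

    return value


# Lines the walkthrough counts as statements, numbered from the def line.
# The def line itself is left out to keep the tracer simple (an assumption
# of this sketch, not a rule of statement coverage).
STATEMENT_LINES = {3, 4, 6}


def executed_lines(func, *args) -> set:
    """Run func(*args) and return the executed line numbers, relative to the def line."""
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        # Only follow frames that belong to the function under test.
        if frame.f_code is not code:
            return None
        if event == "line":
            hit.add(frame.f_lineno - code.co_firstlineno + 1)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)  # always restore whatever tracer was active
    return hit


def statement_coverage(func, test_inputs) -> float:
    """Percentage of STATEMENT_LINES hit by at least one test input."""
    hit = set()
    for value in test_inputs:
        hit |= executed_lines(func, value)
    return 100 * len(hit & STATEMENT_LINES) / len(STATEMENT_LINES)
```

Calling statement_coverage(function1, [0]) reports full coverage from a single test, while a single non-zero input such as 5 never reaches line 4 and so leaves a gap.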

Branch Coverage

That’s great! But what are we missing? Well, we begin to see how 100% statement coverage might not be telling the whole story. Looking at the if statement on line 3 gives a clue: what if this condition were false? The statement coverage metric only records that line 3 was hit; it doesn’t say anything about how it was hit. That concept is actually a different form of coverage called branch coverage (sometimes called edge coverage). Branch coverage checks that when we hit a statement that can cause a branching point in our code (literally a point where it has a binary choice to proceed down one path or another), we trigger both the positive and negative scenario of this condition (e.g. value == 0 and value != 0).

Going back to our program, we only have one branch condition to measure, and we’ve only taken one of the branches in our test case of calling with value = 0, so this means we’ve only achieved 50% coverage of our branch conditions. Adding another test case where we call with value = 1 will be enough to satisfy the other prong of that branch point by not triggering the condition, which will skip the assignment on line 4 and proceed right to line 6.
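
To make the 50% versus 100% arithmetic concrete, here is a small sketch with a hand-probed copy of function1 that records which way the branch went. The names function1_probed and branch_coverage are hypothetical, and inserting probes by hand like this is only for illustration; it is not how any particular tool does it.

```python
def function1_probed(value: int, outcomes: set) -> int:
    """Copy of function1 with probes recording which side of the branch ran."""
    if value == 0:
        outcomes.add(True)   # probe: the condition was true
        value = 1
    else:
        outcomes.add(False)  # probe: the condition was false
    return value


def branch_coverage(test_values) -> float:
    """Percentage of the two branch outcomes seen across all test values."""
    outcomes = set()
    for value in test_values:
        function1_probed(value, outcomes)
    return 100 * len(outcomes) / 2  # one branch point, two possible outcomes


only_zero = branch_coverage([0])       # 50.0: only the true side was taken
zero_and_one = branch_coverage([0, 1])  # 100.0: both sides were taken
```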

Alright, let’s talk about a slightly more complicated example:

    1  def function2(value: int) -> int:
    2
    3      if value == 0 or value > 10:
    4          value = 1
    5
    6      return value

This time by executing both value = 0 and value = 1 test cases, we are able to achieve 100% statement coverage as well as 100% branch coverage too. However, I’m sure you’re able to see that the additional logic on our if statement might be cause for concern. But how? We have 100% coverage!

Condition Coverage

Of course, the solution to any problem in computer science is that if your initial idea doesn’t work out, just add more abstraction. Clearly just checking that we proceeded down both paths of the branching if statement is not enough to fully exercise the logic of the condition inside of it. This may or may not be important to you depending on your needs, and indeed many coverage tools conveniently stop at statement and branch coverage.

But consider this: what if our understanding of function2 was that it was supposed to be functionally identical to function1? By only checking our two test cases value = 0 and value = 1, we would see that is indeed the case since they produce the same output. They would also produce the same output for value up to 10, and actually below 0 as well. But if we pass in value = 11 all of a sudden the two functions will return different values.
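
A quick sketch of that mismatch, using the two functions from this article plus a hypothetical helper (first_disagreement) that scans candidate inputs for the first one where they differ:

```python
def function1(value: int) -> int:
    if value == 0:
        value = 1
    return value


def function2(value: int) -> int:
    if value == 0 or value > 10:
        value = 1
    return value


def first_disagreement(candidates):
    """Return the first input where the two functions differ, or None."""
    for value in candidates:
        if function1(value) != function2(value):
            return value
    return None


from_tests = first_disagreement([0, 1])          # None: our two tests see no difference
from_scan = first_disagreement(range(-5, 20))    # 11: the first input above 10
```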

That isn’t good. But how do we detect this with coverage?

Good question. First, let’s talk about the condition in our if statement. The condition is a compound expression involving or. In function1 further above, we only had the expression value == 0, which we needed to evaluate both true and false to achieve full branch coverage. Now we have two expressions: value == 0 and value > 10. With our two test cases we’ve evaluated the first expression both true (value = 0) and false (value = 1), but the second expression has only ever evaluated false; we have never made it true. For the purposes of computing branch coverage, this is enough, since if either expression of an or expression evaluates true, then the whole expression evaluates true, and if both evaluate false, the whole expression evaluates false.

But if instead we look at a metric called condition coverage, we need to look a little deeper into the branch statement and determine if it contains a compound expression, and if it does, we need to check that we’ve achieved the positive and negative cases for all of the sub-expressions. Before adding our value = 11 test case, we would have had only 75% coverage of the or compound expression (positive and negative for value == 0 but only negative for value > 10).
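
Here is a rough sketch of how that 75% falls out for the compound condition in function2. The probe helper and the function2_probed and condition_coverage names are hypothetical; each sub-expression is wrapped so that its true/false outcome is recorded, and Python's short-circuiting of or is preserved (with value = 0 the second sub-expression is never evaluated at all).

```python
def probe(seen: set, name: str, result: bool) -> bool:
    """Record the outcome of one sub-expression and pass it through unchanged."""
    seen.add((name, result))
    return result


def function2_probed(value: int, seen: set) -> int:
    """Copy of function2 with each sub-expression of the condition probed."""
    if probe(seen, "value == 0", value == 0) or probe(seen, "value > 10", value > 10):
        value = 1
    return value


def condition_coverage(test_values) -> float:
    """Percentage of (sub-expression, outcome) pairs seen across all tests."""
    seen = set()
    for value in test_values:
        function2_probed(value, seen)
    return 100 * len(seen) / 4  # 2 sub-expressions x 2 outcomes each


two_tests = condition_coverage([0, 1])        # 75.0: value > 10 was never true
three_tests = condition_coverage([0, 1, 11])  # 100.0
```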

With value = 11 added as a third test case, we now have 100% coverage across all 3 coverage metrics!

Other Types of Coverage

Of course branch, statement and condition coverage are only 3 examples of coverage that produce useful checks to ensure you’ve explored as many paths in your program as possible. There are other types of coverage to check for, but they can have diminishing returns vs. the amount of time you spend achieving 100% coverage on them. What’s more important is that you’ve exercised your code under both the anticipated scenarios that you expect your users to use, as well as all of the edge cases and failure modes that you want to protect your code against.

A good suggestion would be to design a test suite that covers all of the anticipated scenarios for your users that you want the code to function for, as well as checks the failure modes and edge cases that you don’t want the code to handle (by failing gracefully). Once you’ve done that, the next step would be to measure the coverage of this test suite and see where the gaps are. Those gaps are the key, because they show you places in your codebase where you are missing something: missing a test (perhaps not understanding a requirement or under-specification in your design document… you do have one, right anon?), or you made the code too complex (such that it has more edge cases than you expected), or otherwise made a mistake somewhere along the way.

But I promised to tell you more about other types of coverage, so here is a list (again, thanks Wikipedia for the categories):

  • Function coverage - higher level, but basically have you called every function possible

  • Modified Condition/Decision Coverage (MC/DC) - you’ve shown that each sub-expression in a compound branching point can independently change the outcome, e.g. if (a or b) and c

  • Parameter Value Coverage - you’ve checked all the common values that the function’s parameters might take in practice (great use case for fuzzing since this is usually pretty hard)

  • Linear Code Sequence and Jump coverage - not only every if statement in your source code, but every type of branch condition that the final executable supports (for example in EVM programs, have you called the fallback method? triggered empty value checks? external call failures? etc.)

  • Path coverage - have you covered every possible path in the final executable? (often fairly low level for common usage, albeit more exhaustive)

  • Entry/Exit coverage - have you entered into every possible function and triggered every possible return statement in that function?

  • Loop coverage - for every loop, have you checked breaking out early, skipping iteration entirely, and executing the full loop iterations without hitting the break condition (if one exists)?

  • State coverage - have you explored all possible states that the program can take? (in EVM programs often this isn’t possible without some form of Symbolic Execution, which is an exhaustive search of all possible states)

  • Data-flow coverage - has every variable been initialized and used?
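
To give a feel for one of these, here is a minimal sketch of the MC/DC idea for the (a or b) and c example from the list. It brute-forces the truth table and, for each condition, finds pairs of inputs that differ only in that condition and flip the overall decision. The function names are my own, and this ignores short-circuit evaluation, so treat it as an illustration rather than a full MC/DC checker.

```python
from itertools import product


def decision(a: bool, b: bool, c: bool) -> bool:
    """The compound decision under test."""
    return (a or b) and c


def independence_pairs(index: int) -> list:
    """Pairs of inputs that differ only in condition `index` and flip the decision."""
    pairs = []
    for inputs in product([False, True], repeat=3):
        if inputs[index]:
            continue  # start from the False side so each pair is listed once
        flipped = inputs[:index] + (True,) + inputs[index + 1:]
        if decision(*inputs) != decision(*flipped):
            pairs.append((inputs, flipped))
    return pairs


pairs_for_a = independence_pairs(0)  # a only matters when b is False and c is True
pairs_for_b = independence_pairs(1)  # b only matters when a is False and c is True
pairs_for_c = independence_pairs(2)  # c matters whenever (a or b) is True
```

Picking one pair per condition and sharing inputs where possible, four tests are enough here instead of all eight combinations: (False, False, True), (True, False, True), (False, True, True) and (True, False, False).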

Using Coverage Tools in Practice

Now that we have a good grasp of the taxonomy of different types of coverage criteria and how they work, it’s important to know how one measures coverage in practice. You might be thinking “why do I care?” but it’s important to understand this, as some ways of collecting coverage information actually change your program!

Whoa there! That’s not cool! If the tool modifies my program, isn’t there danger that it could make it not match the production behavior?

Yes! That is a risk you are taking!

You see, some coverage tools will modify the program you are measuring coverage for by inserting probes (extra variables or counters) at particular points in the program that record whether those points are hit or not. Usually these probes don’t affect the result of your program, but they can slow it down and have other performance effects. In EVM programs, where the execution cost of a program (i.e. gas) is actually one of the critical resources that gets tracked, even adding harmless variables incurs extra resource usage that could modify the behavior of the program, e.g. make it fail when it should be passing. Therefore, for measuring the coverage of EVM programs, it is recommended not to use these approaches as they will impact the correctness of the results.
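
As a loose analogy in Python (not EVM code), the sketch below compares function1 with a hand-instrumented copy. The instrumented version returns the same values, but it compiles to more bytecode instructions, which stands in here for the extra gas that probes would cost. The hits dictionary, function1_instrumented and instruction_count are all hypothetical names for this illustration.

```python
import dis

hits = {}  # probe counters, keyed by a label for each statement


def function1(value: int) -> int:
    if value == 0:
        value = 1
    return value


def function1_instrumented(value: int) -> int:
    """Same logic as function1, with a counter probe before each statement."""
    hits["if"] = hits.get("if", 0) + 1              # probe
    if value == 0:
        hits["assign"] = hits.get("assign", 0) + 1  # probe
        value = 1
    hits["return"] = hits.get("return", 0) + 1      # probe
    return value


def instruction_count(func) -> int:
    """Number of bytecode instructions the function compiles to."""
    return sum(1 for _ in dis.get_instructions(func))


same_results = all(function1(v) == function1_instrumented(v) for v in range(-3, 15))
extra_cost = instruction_count(function1_instrumented) - instruction_count(function1)
```

Here same_results is True while extra_cost is greater than zero: the answers match, but the instrumented program is no longer the program you will ship.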

Other tools, like Ape’s new coverage feature, will instead take execution traces and post-process this data into the relevant coverage metrics (fortunately for EVM programs, this is a relatively easy thing to do).
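
To illustrate the general idea of trace post-processing (this is not Ape’s actual implementation or data format), here is a toy sketch. It assumes a made-up source map from program counters to source lines and two made-up execution traces, then derives line coverage from them without touching the program itself.

```python
# Hypothetical source map: program counter -> source line of the original code.
SOURCE_MAP = {0: 3, 2: 3, 4: 4, 6: 4, 8: 6, 10: 6}

# Hypothetical traces: the program counters visited during each test run.
TRACE_NONZERO = [0, 2, 8, 10]        # condition false, assignment skipped
TRACE_ZERO = [0, 2, 4, 6, 8, 10]     # condition true, assignment executed


def line_coverage(traces, source_map) -> float:
    """Percentage of mapped source lines touched by at least one trace."""
    all_lines = set(source_map.values())
    hit_lines = {
        source_map[pc]
        for trace in traces
        for pc in trace
        if pc in source_map  # ignore counters with no source mapping
    }
    return 100 * len(hit_lines) / len(all_lines)


one_trace = line_coverage([TRACE_NONZERO], SOURCE_MAP)               # 2 of 3 lines
both_traces = line_coverage([TRACE_NONZERO, TRACE_ZERO], SOURCE_MAP)  # 3 of 3 lines
```

Because the traces are recorded from the unmodified program, the thing being measured is the same thing that runs in production.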

In our next post, we’ll walk through in more detail how to use Ape’s coverage feature effectively when preparing your EVM-based contracts for production deployment.

For now though, I hope you found this article useful in understanding exactly what coverage is and how it works!


For the latest on all things Ape follow us on: apeworx.io | Discord | Twitter | Bluesky
