I want to advocate for better testing. I want people to pay more attention to the types of test they write because, in my experience, people are not explicitly aware of this, don’t consciously think about it enough, and mostly write tests a certain way out of habit or as influenced by their tooling. I’m not talking about the scope of the test when I say “types” (unit, integration, component, end-to-end, etc) but more so what the test is set up to manipulate and assert. Specifically:

  • Behaviour
  • State
  • Interactions

I believe it’s easy to fall into the trap of assuming you are objectively testing “everything” because you are covering every branch, case and method, yet if you aren’t paying attention to what type of tests you’re writing, you run the risk of producing less valuable (or outright worthless) tests despite that cosy objectivity.

Where this sits worst for me is the “super-high-objectivity” approach: every class isolated, all dependencies mocked, every method called, every possible branch validated against the mocks. I’ve been there; it feels good, but it’s a very false sense of security.

Note that, as with many of my other posts, my lens and observations here come largely from a C# object-oriented mindset and environment. I think they apply broadly; it’s mostly my gripes about tooling and attitudes, and my comments from personal experience, that are narrowed by this lens.

State-based testing

State-based testing is centred on asserting an expected state of the system after some action. Often you’ll have an implicit starting state and/or be required to seed an initial state.

I believe this is a very functional way of asserting things - we’re not treating the software like a series of classes and methods, we’re saying that, given some starting state and some actions, we should reach some objective state. I put parameters into the machine, it speaks back, and I don’t care about anything else.

Think of state testing an API - when I make a POST to create a resource, I might expect that it sits in a database at rest, and that is something I can assert. I shouldn’t care how much middleware gets hit on the way down, or how many classes sit between my API handler and the code that writes the data to the database; what is actually meaningful is that the API call I made has stored some data. You might choose to go a step further and assert state in terms of the public interface only - can this API be used to validate the state I have added or manipulated in the system by doing a subsequent GET? Can I try to retrieve a user that does not exist, then create them, then retrieve them again and get different results?
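
As a rough illustration, here’s a minimal sketch of that last idea - asserting state through the public interface only. It assumes xUnit, ASP.NET Core’s WebApplicationFactory (with the API project exposing its Program class to tests) and a hypothetical /users resource; none of those specifics are prescribed by anything above:

```csharp
// A minimal sketch of a state-based API test, assuming xUnit, ASP.NET Core's
// WebApplicationFactory and a hypothetical /users resource.
using System.Net;
using System.Net.Http.Json;
using Microsoft.AspNetCore.Mvc.Testing;
using Xunit;

public class UserApiStateTests : IClassFixture<WebApplicationFactory<Program>>
{
    private readonly HttpClient _client;

    public UserApiStateTests(WebApplicationFactory<Program> factory)
        => _client = factory.CreateClient();

    [Fact]
    public async Task Creating_a_user_makes_it_retrievable()
    {
        // Starting state: the user does not exist.
        var before = await _client.GetAsync("/users/jane");
        Assert.Equal(HttpStatusCode.NotFound, before.StatusCode);

        // Action: create the user through the public interface.
        var create = await _client.PostAsJsonAsync("/users", new { Name = "jane" });
        Assert.Equal(HttpStatusCode.Created, create.StatusCode);

        // End state: the same public interface now returns different results.
        var after = await _client.GetAsync("/users/jane");
        Assert.Equal(HttpStatusCode.OK, after.StatusCode);
    }
}
```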

Behaviour-based testing

This is ultimately what you’re paid for (as far as the user sees). I think that’s incredibly powerful to keep in mind: pay close attention to testing the things you’re paid to do, because your stakeholders don’t give a hoot that you’re making a function call with very specific parameters. They care that when they click a button it generates a PDF of the data they’re looking at and sends it on an email to someone (or something like that, ask your favourite PO for inspiration).

This is where I’ve found mocking to be quite useless - we want to test as much real code as possible, whether that’s a class in isolation, a few components together, a full API or even multiple deployed services working together. That’s not to say there’s no room for faking it - I’ll often use in-memory analogues for third-party systems or databases to assert my behaviours, but crucially I shouldn’t be held to the specific way I tie everything up, so long as that detail doesn’t actually matter.
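
To show what I mean by an in-memory analogue, here’s a hedged sketch of a hand-rolled test double. IDogRepo and MemoryDogRepo are hypothetical names (I reuse them in the heuristics later), not types from any particular library:

```csharp
// A hand-rolled in-memory analogue used in place of a mocking framework.
// IDogRepo and MemoryDogRepo are hypothetical.
using System.Collections.Generic;
using System.Threading.Tasks;

public interface IDogRepo
{
    Task AddDog(string name);
    Task<IReadOnlyCollection<string>> GetDogs();
}

public sealed class MemoryDogRepo : IDogRepo
{
    // Exposed so state-based tests can assert e.g. repo.Dogs.Contains("Fido").
    public List<string> Dogs { get; } = new();

    public Task AddDog(string name)
    {
        Dogs.Add(name);
        return Task.CompletedTask;
    }

    public Task<IReadOnlyCollection<string>> GetDogs()
        => Task.FromResult<IReadOnlyCollection<string>>(Dogs.AsReadOnly());
}
```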

Interaction-based testing

This is the testing you’re very likely used to when you rely heavily on mocks. Not that you can’t achieve state or behaviour based testing with mocking frameworks but in my experience they’re not usually worth it as they don’t offer much more than writing some code yourself where test doubles are required (or plugging in real dependencies where you can!).

Interaction-based testing is all about asserting that the interaction between two components was as expected - either that we’ve called a function with some given parameters, or that we’ve set up a canned response in the mock and some knock-on effect happened.

These are the most code-aware tests, and that makes them the least valuable. You can create an API to save dogs to a database a thousand different ways; who cares how you do it - the thing that matters is that the API saves dogs to a database. Interaction tests say “actually, it matters that the API calls the repository with this specific shape of a dog object, and if the repository throws an exception when adding the dog then the API should return a bad request response”. There’s a subtlety in that example - we don’t actually care that “when the database errors, we return a bad request”; we actually care about some behaviour like preventing users from overwriting dogs with the same name, and it is much more valuable to write a test for that case than to isolate the code from the reason it is written.
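
To make the contrast concrete, here’s roughly what that interaction style looks like - a sketch assuming Moq and xUnit, with a deliberately minimal, hypothetical DogController built on the IDogRepo from earlier. Every assertion is about calls, not about outcomes a user could observe:

```csharp
// A sketch of the interaction style described above, assuming Moq and xUnit.
// DogController is a hypothetical, deliberately minimal controller.
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;
using Moq;
using Xunit;

public class DogController
{
    private readonly IDogRepo _repo;
    public DogController(IDogRepo repo) => _repo = repo;

    public async Task<IActionResult> Post(string name)
    {
        try { await _repo.AddDog(name); return new OkResult(); }
        catch (InvalidOperationException) { return new BadRequestResult(); }
    }
}

public class DogControllerInteractionTests
{
    [Fact]
    public async Task Post_calls_repository_with_expected_dog()
    {
        var repo = new Mock<IDogRepo>();
        var controller = new DogController(repo.Object);

        await controller.Post("Fido");

        // Pins the implementation: AddDog must be called exactly once with
        // exactly this argument, regardless of what the user can observe.
        repo.Verify(r => r.AddDog("Fido"), Times.Once);
    }

    [Fact]
    public async Task Post_returns_bad_request_when_repository_throws()
    {
        var repo = new Mock<IDogRepo>();
        repo.Setup(r => r.AddDog(It.IsAny<string>()))
            .ThrowsAsync(new InvalidOperationException());
        var controller = new DogController(repo.Object);

        var result = await controller.Post("Fido");

        // Asserts the knock-on effect of a canned failure rather than the real
        // rule (e.g. "don't overwrite a dog with the same name").
        Assert.IsType<BadRequestResult>(result);
    }
}
```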

Interactions don’t just have to be at the unit level - you can consider testing interactions between bigger components, via their HTTP API boundaries or distributed messaging. The more stable and well-documented those interfaces are, the better that tends to go.

So why should I care?

So, thinking more about this can hopefully lead you to conclude what sorts of tests are worth your while and which aren’t. Take logging, for example: the frankly immature view that focuses on interaction testing could lead you to make a seemingly justifiable case for asserting logger calls in your methods - it’s very easy to see that a method on ILogger is called, and damned if you can’t verify that using a mock!

All you’re doing is building a double-entry bookkeeping system for your logger. It’s maybe not zero value, but the cost of doing this compared with the value it delivers is drastically imbalanced - you restrict your agility for the sake of green ticks, and you make all your code much harder to maintain, refactor and test because you are asserting on an incredibly prescriptive and concrete internal detail. You could then consider only validating that a logger is called at all, but then what does that even ensure? Can it prevent important logs being removed?
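
For the avoidance of doubt, this is the kind of bookkeeping I mean - a sketch assuming Moq, xUnit and Microsoft.Extensions.Logging, with a hypothetical DogService and a made-up message. Note how much ceremony it takes just to pin a string you already wrote once:

```csharp
// A sketch of the logger "double-entry bookkeeping" described above, assuming
// Moq, xUnit and Microsoft.Extensions.Logging. DogService is hypothetical.
using System;
using Microsoft.Extensions.Logging;
using Moq;
using Xunit;

public class DogService
{
    private readonly ILogger<DogService> _logger;
    public DogService(ILogger<DogService> logger) => _logger = logger;
    public void AddDog(string name) => _logger.LogInformation("Added dog {Name}", name);
}

public class DogServiceLoggingTests
{
    [Fact]
    public void Adding_a_dog_logs_the_expected_message()
    {
        var logger = new Mock<ILogger<DogService>>();
        var service = new DogService(logger.Object);

        service.AddDog("Fido");

        // The test now holds a second copy of the log message; any rewording,
        // level change or refactor breaks it without any behaviour changing.
        logger.Verify(l => l.Log(
            LogLevel.Information,
            It.IsAny<EventId>(),
            It.Is<It.IsAnyType>((state, _) => state.ToString() == "Added dog Fido"),
            It.IsAny<Exception>(),
            It.IsAny<Func<It.IsAnyType, Exception, string>>()),
            Times.Once);
    }
}
```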

What state is there to verify? Well, nothing really, if your logs aren’t doing anything more than producing output. An adjacent use case I do think is worthwhile is verifying the state of an audit log for actions that should be audited. That’s no longer purely a development concern; some end users rely on it.

What behaviour is there to verify? Nothing, really. The logging is incidental to the behaviour of the system - sure, it’s critical for you to operationally support the software that ultimately delivers value, but the point here is that even your logger tests won’t verify that you’re actually writing valuable logs that do that. They’re just going to assert that you’re able to copy and paste a string into two places. Stick to code reviews, team standards and actual live system testing to ensure your logs are worthwhile; a mock won’t ever do that for you.

Speaking more broadly and reiterating what I wrote before, when we test interactions we’re often testing implementation detail. Implementation detail can sometimes be critical, but I urge you to consider implementation details as opinions about how to reach an end goal, whereas behaviour and state are the end goal. Focussing on testing “opinions” is brittle - as things are refactored, those tests need rethinking and rewriting. Testing the end goal, however, should only be invalidated when requirements change.

With heavy interaction testing I’ve rarely been able to breathe without breaking some tests (and I have to underline here - I’m not talking about breaking the software or its behaviour; everything behaves perfectly correctly, and the things that matter still work exactly as they used to). With a state-and-behaviour-oriented test suite, I can often make significant refactoring efforts without changing a single test, which is what valuable tests are supposed to do - let you validate that your changes haven’t broken anything meaningful.

I touched on this earlier with my dog API example, but if we’re validating interactions rather than behaviour or state, I struggle to make a judgement call on what code can justifiably be removed - it is not possible to tell from the tests alone what the system actually intends versus how it merely happened to be written.

Sure, some cases are much harder to test without interaction testing, and those are the cases where such testing is useful. Where deployed components adhere to some contract between them, there is value in testing that contract and ensuring it does not break, especially if you cannot scaffold the two components together in a test (e.g. integration with a third-party API).

Okay, but can I have some practical advice please?

It’s hard to generalise, and I think that’s a facet of good testing - by contrast, I feel like I could very easily write crisp guidance for interaction testing, because it’s incredibly objective when you examine the things being tested. I have a couple of heuristics anyway:

  • Avoid asserting specifically that Subject Under Test calls IDependency.xyz() with given parameters
    • At a low level specifically this is pretty much “covering implementation detail”
    • I don’t care that IDogRepo.AddDog("Fido") was called, I care that MemoryDogRepo.Dogs.Contains("Fido")
  • Avoid asserting that when IDependency.xyz() returns a specific thing, something else is called
    • Think of the case where IDependency needs disposing after usage. The interaction-style way to assert this would be to assert that xyz was called (dependency used) and then assert that dependency.Dispose() was called. A meaningful difference: we should instead assert that the dependency has been disposed. Calling the method is meaningless, but the state of it being disposed afterwards is important. What if this is refactored to use DisposeAsync()? The interaction test falls down, but the state-based test holds.
  • Avoid asserting the exact number of calls
    • Validating a specific number of invocations on everything makes your tests awful to read and doesn’t give much value. If it ever truly matters, then assert it there and there only.
    • I would build a test double that models the real-world behaviour. Back to the dog example, what would the real DogRepository do if you tried to add the same dog twice? Would we still have one dog, or would it insert a duplicate? Would it actually model an upsert and change some details, or just shrug it off? Would it throw an exception or return an error? What you should verify is how your code reacts to that, not that it never calls it twice (see the sketch after this list).
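
Pulling those heuristics together, here’s a hedged sketch that reuses the hypothetical IDogRepo and DogController from earlier: the double models what a real repository would do with a duplicate, and the test asserts how the calling code reacts (state and behaviour) rather than how many times anything was called:

```csharp
// A sketch combining the heuristics above: the double models real behaviour
// (duplicates are rejected) and the test asserts the caller's reaction and
// the resulting state, not call counts. Reuses the hypothetical IDogRepo and
// DogController from the earlier sketches.
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;
using Xunit;

public sealed class StrictMemoryDogRepo : IDogRepo
{
    public List<string> Dogs { get; } = new();

    public Task AddDog(string name)
    {
        // Model what the real repository would do with a duplicate,
        // rather than counting how many times AddDog gets called.
        if (Dogs.Contains(name))
            throw new InvalidOperationException($"Dog '{name}' already exists.");
        Dogs.Add(name);
        return Task.CompletedTask;
    }

    public Task<IReadOnlyCollection<string>> GetDogs()
        => Task.FromResult<IReadOnlyCollection<string>>(Dogs.AsReadOnly());
}

public class DogControllerBehaviourTests
{
    [Fact]
    public async Task Adding_the_same_dog_twice_is_rejected_and_not_duplicated()
    {
        var repo = new StrictMemoryDogRepo();
        var controller = new DogController(repo);

        await controller.Post("Fido");
        var second = await controller.Post("Fido");

        // State: still exactly one Fido. Behaviour: the caller was told no.
        Assert.Single(repo.Dogs);
        Assert.IsType<BadRequestResult>(second);
    }
}
```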

Behaviour and state tests are tests you can truly write ahead of the implementation, and I mean ahead of beginning the implementation. HTTP APIs are often a brilliant target for this, since you don’t have to implement anything in order to make your test suite call POST /path with some body object. I can very often write the API call, the assertions on that HTTP response, and the persistence and/or other boundaries (think ports and adapters) needed to express all of my test cases for a feature before I begin that feature. It doesn’t require that my server handles the path yet, and it doesn’t require that I’ve figured out the SQL query to join my tables for the data I want. I can set up and expect things at a higher level because I’m not validating internal implementation detail.
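
Here’s a hedged sketch of what that can look like, again assuming xUnit, WebApplicationFactory and the hypothetical IDogRepo/MemoryDogRepo from earlier, swapped in at the persistence port. Nothing about POST /dogs needs to exist yet for this test to be written - it just won’t pass until it does:

```csharp
// A sketch of a behaviour/state test written ahead of the implementation,
// assuming xUnit, WebApplicationFactory and the hypothetical MemoryDogRepo
// from earlier swapped in at the persistence port. POST /dogs and its route
// are assumptions; the test fails until the feature exists, which is the point.
using System.Net;
using System.Net.Http.Json;
using Microsoft.AspNetCore.Mvc.Testing;
using Microsoft.AspNetCore.TestHost;
using Microsoft.Extensions.DependencyInjection;
using Xunit;

public class AddDogFeatureTests : IClassFixture<WebApplicationFactory<Program>>
{
    private readonly WebApplicationFactory<Program> _factory;

    public AddDogFeatureTests(WebApplicationFactory<Program> factory) => _factory = factory;

    [Fact]
    public async Task Posting_a_dog_stores_it_and_returns_created()
    {
        var repo = new MemoryDogRepo();
        var client = _factory
            .WithWebHostBuilder(b => b.ConfigureTestServices(s => s.AddSingleton<IDogRepo>(repo)))
            .CreateClient();

        var response = await client.PostAsJsonAsync("/dogs", new { Name = "Fido" });

        // Behaviour: the public interface reports success.
        Assert.Equal(HttpStatusCode.Created, response.StatusCode);
        // State: the dog is at rest behind the persistence port.
        Assert.Contains("Fido", repo.Dogs);
    }
}
```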

I don’t believe it’s super valuable to organise tests based on whether they are state-, behaviour- or interaction-based (like you might separate unit and integration tests, or fast and slow tests) - these are just different techniques you apply to assert outcomes. Different test classes might naturally focus on one kind or another based on what they are testing, and a comprehensive test suite against the same target might include all types.

On mocking libraries in particular: while mocking libraries aren’t strictly the enemy in a vacuum, I find that they really encourage an over-reliance on interaction testing (they make it so easy!), and you would gain a lot more value and understanding from building your own test doubles and reusing them primarily in stateful and behavioural testing. Wean yourself off of mocking! Test things that matter!

Further reading

I wanted to provide a repository of tests with an unimplemented API and no mocking libraries to underline what this can look like. Wanted to. Didn’t. Maybe in a follow-up article but in the meantime I can suggest some further reading: