Testing applications with database

🏠 Go home
Testing 2022-08-17

This post is about testing of an application which uses a database. And I have one big assumption about the testing. I want my tests to be parallelizable and independent. I want to run my tests in parallel and possibly parametrize each one in the property-based testing manner. I want to stress out that this is not an easy task in general, because for a subset of the test suit, I want to be able to allocate TCP ports, manipulate the external DBMS or things like mock-server.

If you start thinking about all the obstacles of testing a running application or just a component of the application that needs to interact with another stateful app, you really start feeling the meaning of Fowler's test pyramid and why we want the biggest portion of the test suit to reside in the unit-tests category. Unfortunately, it's not usually possible to test our data access layer without the actual database.

I was recently thinking about different ways to test code in the data access layer, let me show you what I got.

1. Injecting mock repository / DAO instances

About the naming

First things first, let's start with terminology. I personally call interfaces or classes providing access to the data storage a repository but there is actually a standard for the naming which basically says I do it wrong. Martin Fowler defines a repository as a collection-like interface for accessing domain objects. Therefore, if we were strict about the naming, none of these would conform to the repository pattern definition.

It turns out I rather usually use Java EE's (although, I don't think they are specific only to Java EE) data access objects (DAO). They are also interfaces for managing the persistence but they don't seem to have a strictly given structure. DAO can have arbitrary methods like getSomethingByName etc.

Alright! That's it regarding the naming. So whenever I talk about repositories, I mean generally a layer which manages persistence for the app - usually a set of interfaces or abstract classes, and their implementations.

Mocking / faking / spying ...

The first, the fastest, the cleanest and the easiest way to test a stateful code is to use a test double instead of real production repository instances. We can have as many such tests as we want. They will be probably fast, therefore, we can easily use the property based testing or parametrize such tests to cover a great number of cases. The point of these tests is to check the correctness the entity that depends on the persistence. But we are not really testing the persistence itself.

I put every technique that doesn't use the actual database under this category. We can use spying to check a method on the repository has been called under certain circumstances. Or maybe mock with preprogrammed expectations to check what's happening with the tested entity under given state of the storage. But the point is, we don't need the actual database and we can manage everything ourself in memory. Test codebase covered by such tests will be very likely easy to deal with so I expect we want mostly these kinds of tests.

For some databases, there are in-memory implementations we can use to simulate the actual TCP communication. Such libraries empower us with an ability to substitute the non-production implementation on lower level, thus a bigger portion of our application get tested. I personally used such technique only for testing a component depending on Redis and to be honest I didn't see the gain so I personally don't use this approach often. Anyway, such testing would still be in the mocking category.

Benefits

Disadvantages

Obviously, we won't be able to test production implementations of our repositories but it is not the intention so I'd cal it a specifics of the approach rather then its disadvantage.

2. Single database for whole test suit, each test in a transaction which is always rollbacked

This one and all the following categories are about testing against a real, running instance of the database. Evidently, we need a running instance for these tests which is the first "obstacle". Also, we need to make sure the migration script was triggered on the database.

Running all the tests on a single database instance but each one in its own transaction which is always rollbacked at the end feels like the first step of testing against a real database. The benefit should be obvious, we are connecting to the database and actually triggering SQL queries there. I'm using this technique to test each method / function of the repository implementation or combination of them.

Benefits

Disadvantages

3. Single database for whole test suit with cleanup

This one is about running tests on the same instance and database. It is probably the worst of all the options in terms of maintenance. We need to make sure tests won't interfere with each other by nature of the behaviour we are testing. The only suitable kind a test I can think of is something like Postgres's pub-sub.

The benefit of this approach is the speed, because we need to run the migration script only once.

Benefits

Disadvantages

4. A database per test

Creating a new database per test on a single DBMS instance is the way to go for testing bigger chunks of the application. We need to run migration script for each test, therefore it is not suitable for parametrised or property-based testing. But we get the desired isolation in here, therefore we can spawn the whole application and run UI tests or large component tests while being able to run it concurrently / in parallel with other tests.

I was recently dealing with these kind of tests. The main problem in my case is the performance because the migration script takes the same amount of time the whole test suit takes. At the end, it is not really a big problem because the tested behaviour is mostly IO-bound, thus I'm able to run tests concurrently. Still, I have some ideas how to improve the performance:

Benefits

Disadvantages

5. New database instance per test

To be honest, I have that one only for the sake of completeness and I don't have any practical experience with the approach.

In theory, running a new DBMS instance per test gives us the best isolation we could possibly get and with the containerisation nowadays it should be easily manageable in practice. Although, I haven't done any testing on that matter and I only assume, I strongly believe this option would be probably the slowest one. Personally, I haven't once encountered such a testing problem to be forced to spawn a whole new database instance per test unit.