As ComplyAdvantage grows in both scale and complexity, our business demands more of the checks, balances and standards needed to keep things manageable and keep delivering. Common examples include feature flags, contract testing and other advanced practices.

One technique that is, for reasons of expediency, often considered quite late in the development cycle is a solid data access layer in your application or service.

Without a formally defined data access layer, data access is ad-hoc, or relies on rapid application development frameworks and direct use of various object-relational mappers (ORMs), or database queries embedded directly in application code.

Such a position is usually acceptable in small systems where outgrowing a single database is very unlikely, but for systems that need to scale, this can be a real problem.

As we shall see, these ad-hoc approaches are limited in terms of flexibility, and can be hard to test and hard to reason about in the long term.

Code examples are in typed Python; hopefully these will be easy enough to read for non-Python programmers.

A typical example

Firstly, as a counterpoint to the intended pattern, we will use a typical rapid application development staple as our example: the Active Record pattern called directly in application code.

from typing import List


class BlogComment:

    _db = None  # in a real ORM, the connection is typically held at class level

    def __init__(self, db, id, post_id, username, comment):
        self._db = db
        self.id = id
        self.post_id = post_id
        self.username = username
        self.comment = comment

    def save(self):
        # Writes the current state of the object to the database.
        self._db.write("blog_comment", self.id, self.username, self.comment)

    @property
    def post(self):
        # A hidden database query behind an innocent-looking property.
        params = {"id": self.post_id}
        return self._db.query("blog_post", params)

    @classmethod
    def query(cls, params) -> "List[BlogComment]":
        # Caller-supplied params are passed straight through to the database.
        return cls._db.query_many("blog_comment", params)

Note that real ORMs are far more sophisticated than this, but the interface is essentially the same. We include a few common Active Record patterns here, such as:

  1. Object properties (such as id and username) that can be modified at any time.
  2. A save method that modifies the database.
  3. A post property that performs a database query under the bonnet.
  4. A query method where the caller may specify arbitrary ORM parameters.

This style limits the flexibility, testability and ease of future refactoring of the application's data access code.

We will expand on the reasons why in the following sections.

This article is neutral on the usefulness (in general) of ORMs. It should be noted that use of ORMs can be a very good idea in some circumstances, when hidden behind a properly defined data access layer, and regarded as an implementation detail.

Unpredictable queries

The query class method exposes the params parameter, which allows for any kind of filter to be specified in the application code. Think of Java Hibernate HQL or Django keyword arguments filters, or whatever the equivalent is in your favourite ORM, if you have one.

This feature creates a leaky abstraction, where database specifics must be reasoned about in your application. You cannot know what conditions, joins (in the case of SQL databases), and other filters are being applied at query time.

For example, we could join across ten tables without knowing this in the data access code, as this is specified somewhere completely different to the model.

It makes it very difficult to modify the data access layer without having to search and touch all sorts of other places in the codebase, and violates the principle of separation of concerns.
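To make the leak concrete, here is a hypothetical sketch: the `query` stub and the double-underscore lookup syntax below are illustrative (mimicking Django-style ORM filters) and not from any real library.

```python
# Hypothetical stub standing in for an ORM entry point: the caller's filter
# dict silently dictates how many tables get joined at query time.
def query(params: dict) -> dict:
    # A real ORM would translate each "__" hop into a SQL join; here we just
    # count the hops to show how much is decided outside the data layer.
    joins_implied = max(key.count("__") for key in params)
    return {"joins_implied": joins_implied, "params": params}

# Application code, far from any data access module, decides on a multi-table
# traversal simply by choosing a filter key:
result = query({"post__author__country": "GB"})
```

The join structure of the underlying query is now determined by a string in application code, which is exactly the coupling a data access layer should prevent.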

Hard to mock, fake and test

We have to call the class methods of BlogComment in our application code to access the underlying database, and the post property secretly queries the database to provide a "seamless web of objects" experience to the developer.

Such an interface makes it difficult to create mocks and fakes for these model objects, and those mocks are liable to break when we change the application code. Even worse, we somehow have to anticipate the incoming queries to the query method, which introduces a lot of fragility into our unit tests; we can expect many failures whenever we change the filters and queries in the application.
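As a hypothetical sketch of that fragility (using `unittest.mock`; the function name and filter keys are invented for illustration), the test has to mirror the exact ORM parameters the application happens to use:

```python
from unittest.mock import MagicMock

# Stand-in for an Active Record class; in real code this would be the ORM model.
orm_model = MagicMock()
orm_model.query.return_value = ["comment-1", "comment-2"]

def visible_comments_for(post_id):
    # Hypothetical application code with the filter details baked in.
    return orm_model.query({"post_id": post_id, "deleted": False})

comments = visible_comments_for(42)

# The assertion is coupled to the query internals rather than to behaviour:
# add or rename a single filter key in the application and this test breaks.
orm_model.query.assert_called_once_with({"post_id": 42, "deleted": False})
```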

Is there a better way?

If we build our data access layer differently, and stick to a few rules, we can mitigate or eliminate many of these problems. These are as follows:

Use immutable model objects

All data should be modelled as basic, immutable “plain-old Python objects” (POPOs), “plain-old Java objects” (POJOs, or record classes in newer versions of Java), and other POxOs, with the following properties:

  • Properties must not be mutable or changeable by application code.
  • Models must not contain embedded or "secret" queries.
  • Models may contain other model objects either directly or in immutable collections.

The above conditions more-or-less completely rule out ActiveRecord-style objects.

Create a well-defined data interface

All data must be accessible only via a well-factored data interface with the following properties:

  • It accepts only domain-specific immutable model types* or other immutable objects, either individually or in collections as function arguments.
  • It returns only domain-specific immutable data types* or other immutable objects, either individually or in collections as return types.
  • It can have more than one implementation, such as a SQL DB and fake implementation that can be used interchangeably from a correctness point of view.

* Some exceptions are possible, such as simple, domain-specific search criteria objects or data transfer objects in the case of complex atomic writes.
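As a hypothetical sketch of such a search-criteria object (the names are illustrative), it stays immutable and deliberately narrow, so callers cannot smuggle arbitrary filters into the data layer:

```python
from typing import NamedTuple, Optional

class BlogPostSearchCriteria(NamedTuple):
    # Only the filters the domain actually needs; no raw ORM parameters.
    username: Optional[str] = None
    title_contains: Optional[str] = None
    limit: int = 20

criteria = BlogPostSearchCriteria(username="alice", limit=10)
```

Because the criteria type enumerates every supported filter, the data access implementation can see at a glance exactly which query shapes it must support.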

Dependency injection

All application code must be able to accept any concrete implementation of the data access interface, via formal or informal dependency injection.

Typically, we can either use a dependency injection framework, or simply pass the concrete implementation of the data access layer to the application code when the application is initialised, e.g.:

def main():
    blog_post_provider = SQLBlogPostProvider()
    app = MyApplication(blog_post_provider)
    app.listen("0.0.0.0", 80)

In general, this practice has many benefits beyond data access layers: it enables clean code, makes unit testing much easier, provides a nice framework for separating concerns, and centralises configuration in one place.

A good example

An example of immutable model objects and data interfaces without any unexpected behaviour is as follows:

from abc import ABC, abstractmethod
from datetime import datetime
from typing import List, NamedTuple
from uuid import UUID


class BlogComment(NamedTuple):
    id: UUID
    username: str
    comment: str

class BlogPost(NamedTuple):
    id: UUID
    title: str
    content: str
    username: str
    timestamp: datetime
    comments: List[BlogComment]

class BlogPostProvider(ABC):

    @abstractmethod
    def get_one(self, id: UUID) -> BlogPost:
        pass

    @abstractmethod
    def get_for_user(self, username: str) -> List[BlogPost]:
        pass
    
    @abstractmethod
    def submit_post(self, post: BlogPost):
        pass

    @abstractmethod
    def submit_comment(self, post_id: UUID, comment: BlogComment):
        pass

class SQLBlogPostProvider(BlogPostProvider):
    pass  # implementation omitted

class FakeBlogPostProvider(BlogPostProvider):
    pass  # implementation omitted

Note that the providers can use any technology to fulfil the interface contract, and can connect to any kind of SQL or NoSQL database using direct queries, ORMs, or basically any other framework.
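As a hypothetical, cut-down sketch of what a concrete provider might look like (using sqlite3 purely as an example backend, with the model trimmed to a few fields and error handling omitted for brevity):

```python
import sqlite3
from typing import NamedTuple
from uuid import UUID, uuid4

# Cut-down model for this sketch; the real one carries more fields.
class BlogPost(NamedTuple):
    id: UUID
    title: str
    content: str
    username: str

class SQLiteBlogPostProvider:
    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS blog_post "
            "(id TEXT PRIMARY KEY, title TEXT, content TEXT, username TEXT)"
        )

    def submit_post(self, post: BlogPost) -> None:
        # Accepts only the immutable model type; SQL stays an internal detail.
        self._conn.execute(
            "INSERT INTO blog_post VALUES (?, ?, ?, ?)",
            (str(post.id), post.title, post.content, post.username),
        )

    def get_one(self, id: UUID) -> BlogPost:
        row = self._conn.execute(
            "SELECT id, title, content, username FROM blog_post WHERE id = ?",
            (str(id),),
        ).fetchone()
        return BlogPost(UUID(row[0]), row[1], row[2], row[3])

provider = SQLiteBlogPostProvider(sqlite3.connect(":memory:"))
post_id = uuid4()
provider.submit_post(BlogPost(post_id, "Hello", "First post", "alice"))
fetched = provider.get_one(post_id)
```

Nothing about sqlite3, the table layout or the SQL statements leaks past the method signatures, which is what makes the backend swappable later.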

We can then inject the concrete provider implementations in place of the interface in application code, and use this to consume and update data in the application, with the following benefits:

Ease of unit testing

We can easily mock the provider, or create a feature-complete in-memory fake version (as hinted at in the code above) for use in unit tests. This implementation can conform completely to the interface, giving us a robust set of testing artefacts that will not need updating if we decide to change filters and other code in the application.
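As a hypothetical, cut-down sketch of such a fake (the model is trimmed to a few fields here, and the application function is invented for illustration):

```python
from typing import Dict, List, NamedTuple
from uuid import UUID, uuid4

# Cut-down model for this sketch; the real one carries more fields.
class BlogPost(NamedTuple):
    id: UUID
    title: str
    content: str
    username: str

class FakeBlogPostProvider:
    """In-memory fake honouring the same contract as a real provider."""

    def __init__(self):
        self._posts: Dict[UUID, BlogPost] = {}

    def submit_post(self, post: BlogPost) -> None:
        self._posts[post.id] = post

    def get_one(self, id: UUID) -> BlogPost:
        return self._posts[id]

    def get_for_user(self, username: str) -> List[BlogPost]:
        return [p for p in self._posts.values() if p.username == username]

# Application code under test depends only on the interface shape:
def latest_titles_for(provider, username: str) -> List[str]:
    return [post.title for post in provider.get_for_user(username)]

fake = FakeBlogPostProvider()
fake.submit_post(BlogPost(uuid4(), "Hello", "First post", "alice"))
fake.submit_post(BlogPost(uuid4(), "Other", "Not alice", "bob"))
```

The test exercises behaviour ("which titles does alice have?") rather than query internals, so refactoring the real provider's SQL cannot break it.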

Ease of integration testing

We can test our data access provider separately from the rest of the application during integration tests and be confident of correct function in production with the concrete provider in place.

This neatly bounds the blast-radius of integration tests, and gives us much better "bang for buck" for our combined unit and integration test suites.

Enhanced ability to refactor

We can at any point add properties to the model objects, create alternative domain models and smoothly migrate application and data access code from one set of models to another.

One scenario that occurs occasionally at ComplyAdvantage as we scale is that we wish to move to an entirely new data storage technology (such as moving from a SQL database to a different kind of system). We can achieve this with a minimum of effort by replacing the provider implementation(s) with a new version based on the new data store.

Possibility of live data store migration

If we adhere to these principles, it is even possible to achieve full migration from one datastore to another with no downtime whatsoever by creating a proxy provider that reads and writes from old and new providers:

class MigrationProxyBlogPostProvider(BlogPostProvider):

    def __init__(
            self,
            old: BlogPostProvider,
            new: BlogPostProvider,
            read_from_new: bool,
    ):
        self._old = old
        self._new = new
        self._read_from_new = read_from_new

    def get_one(self, id: UUID) -> BlogPost:
        if self._read_from_new:
            return self._new.get_one(id)
        else:
            return self._old.get_one(id)

    def get_for_user(self, username: str) -> List[BlogPost]:
        if self._read_from_new:
            return self._new.get_for_user(username)
        else:
            return self._old.get_for_user(username)
    
    def submit_post(self, post: BlogPost):
        self._old.submit_post(post)
        self._new.submit_post(post)

    def submit_comment(self, post_id: UUID, comment: BlogComment):
        self._old.submit_comment(post_id, comment)
        self._new.submit_comment(post_id, comment)

We can use this proxy provider to implement the interface in application code. The sequence of deployment / migration steps is as follows:

  1. Write to old provider only (pre-migration).
  2. Write to both old and new providers, read from old.
  3. Backfill missing data in the new datastore from the old datastore until all data is migrated.
  4. Write to both old and new providers, read from new.
  5. Write to new provider only (migration complete).

This allows a full migration of data from one datastore to another, assuming we can, in general, deploy services without downtime using Kubernetes or similar orchestration systems.
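As a hypothetical sketch, the five steps above can be driven purely by configuration at deployment time; the flag names and stub classes here are illustrative stand-ins for real provider implementations.

```python
class StubProvider:
    # Stand-in for a concrete BlogPostProvider implementation.
    def __init__(self, name: str):
        self.name = name

class StubMigrationProxy:
    # Stand-in for a migration proxy that dual-writes and toggles reads.
    def __init__(self, old, new, read_from_new: bool):
        self.old, self.new, self.read_from_new = old, new, read_from_new

def build_provider(old, new, dual_write: bool, read_from_new: bool):
    if not dual_write:
        # Steps 1 and 5: a single provider, no proxy required.
        return new if read_from_new else old
    # Steps 2 and 4 (with the backfill of step 3 running alongside):
    # dual writes, with reads toggled by configuration.
    return StubMigrationProxy(old, new, read_from_new)

old, new = StubProvider("old"), StubProvider("new")
step_1 = build_provider(old, new, dual_write=False, read_from_new=False)
step_2 = build_provider(old, new, dual_write=True, read_from_new=False)
step_4 = build_provider(old, new, dual_write=True, read_from_new=True)
step_5 = build_provider(old, new, dual_write=False, read_from_new=True)
```

Each rollout step is then a configuration change plus a redeploy, with no application code modified between steps.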

A real benefit in mission-critical systems!

Conclusion

As we have seen, having a good data access layer with limited mutability allows us to develop, test and reason about our code much more effectively, limiting the blast radius of code changes and offering a much greater degree of flexibility for future modification.

This is a very valuable property to have in a fast moving, rapidly scaling organisation such as ComplyAdvantage.