Modular Data Retrieval, an Architecture

How to architect your data stack for rapid, governable, iterative data set creation

John Raines
11 min read · Jan 27, 2022

This is part 2 of a 2-part series on Modular Data Retrieval (MDR), and this one is a bit of a read. It’s probably helpful to have read the (much briefer) part 1, Modular Data Retrieval, an Introduction. If you’ve read that and you’re still interested — you want even more! — then I think this article will be worth it for you.

At the end of that piece, I laid out 5 criteria that should be met for data retrieval to be modular:

  1. Data retrieval occurs via a middle layer between the user and the data.
  2. Data components (e.g. feature groups) exhibit a consistent interface, which allows a data set-creation process to build data sets agnostic of the specific data components being used. This interface is declared in some way to the middle layer.
  3. Data components have a stable definition between data sets. That is, data from a data component is consistent across data sets that it appears in.
  4. Changes in a data component can and should be versioned.
  5. Data components correspond to data that is relevant to the entire domain, while data filtering for specific use cases occurs during data set creation.

Also, part 1 established an example scenario that we will reference repeatedly to demonstrate the architecture.

In this scenario, you work with data about ICU patients to make predictive models that could increase awareness of a worsening patient condition.

Also, as in part 1, I’m using the term retrieval broadly to mean a process that brings data into working memory for use in some pipeline or application.

With this introduction in place, let’s walk through some of the practicalities of a modular data retrieval architecture. For now, we’ll skip to criterion #2 from the list above, since the requirements of criterion #1 are easier to see by looking at the data components themselves.

Data Component Engineering

The first step in data component engineering is feature creation in its classic sense. Ensure that you can obtain the data that you need, in the format you require. This may involve some cleaning, some transforming, and then persisting. For my ICU predictive model example, I may want to clean and persist blood pressure measurements on a 15-minute interval with fields for timestamp and for systolic and diastolic pressures. As I persist, I’ll also want to include a patient ID field to help my middle layer join the data from various components as it retrieves each one individually. Persistence can take any form that allows us to retrieve the data efficiently, most likely via some kind of query.
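As a concrete illustration, here’s a minimal sketch of that cleaning step: bucketing raw blood pressure readings into 15-minute intervals, keyed by patient ID, before persisting. The function name and record shape are mine, not part of any particular tool.

```python
from datetime import datetime, timedelta

def clean_blood_pressures(raw_readings, interval_minutes=15):
    """Bucket raw readings into fixed intervals, keeping the most recent
    reading per (patient, interval). Each raw reading is a dict with
    patient_id, timestamp, systolic, and diastolic fields."""
    buckets = {}
    for r in raw_readings:
        ts = r["timestamp"]
        # Floor the timestamp to the start of its interval.
        floored = ts - timedelta(
            minutes=ts.minute % interval_minutes,
            seconds=ts.second,
            microseconds=ts.microsecond,
        )
        key = (r["patient_id"], floored)
        # Keep only the latest reading within each interval.
        if key not in buckets or ts > buckets[key]["timestamp"]:
            buckets[key] = r
    return [
        {
            "patient_id": pid,
            "timestamp": slot,
            "systolic": r["systolic"],
            "diastolic": r["diastolic"],
        }
        for (pid, slot), r in sorted(buckets.items())
    ]
```

The output rows carry the patient ID alongside every measurement, which is what later lets a middle layer join this component with others.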

Note that at this stage, you should persist features for the domain, and not for any specific use case.

Some data sets may make use of all of the records in my blood pressure data, but there may be other data sets that use only a small subset of these records — for example, a model that is only concerned with making predictions within the first hour after a patient returns from surgery. We will get maximum re-use from our features if they are persisted at the domain level, and we can use our middle layer to apply use-case filters in retrieval. Importantly — and this is fundamental to MDR — we’re not persisting data sets, and we’re not even persisting the individual features from a single data set. We’re persisting domain data in a format that can be readily used to construct multiple data sets. Note: This requires careful definition of what constitutes our domain. See the section below, “Finding the Balance,” for discussion of the considerations we should make.

With persistence squared away, the next part is what makes this data I’ve persisted a data component within a modular data retrieval architecture: I need to persist metadata about this component. The metadata I persist should include something like:

  • A unique name
  • The version of the data under this name
  • Where the data is persisted
  • What universal identifier (or other means) can be used to relate this data to other data (e.g. the patient ID)
  • A human-readable description of the data for documentation purposes
  • Any other attributes that a middle layer will require in order to provide the functionality you expect from it, such as a list of machine-readable descriptors describing the data returned when this component is used for data retrieval. Such a list can be used for automation and data validation.

For any given data component, all of this information comprises the metadata we want to record. For the system as a whole, when I require that this same metadata be provided for every data component, then I have defined an interface. If you’re currently using something like a feature store as a middle-layer, there may be a number of fields that are required in order to create a new “feature”. These fields are the data component interface. And the interface is what the middle layer (your feature store) uses to construct new data sets out of your components.
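One way to picture that interface is as a simple record type that every component must satisfy. The field names below are illustrative — your feature store or middle layer will have its own required fields — but the shape is the point.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataComponentMetadata:
    """The data component interface: the metadata every component must
    declare so the middle layer can retrieve and join it. Field names
    here are illustrative, not any tool's standard."""
    name: str           # unique name, e.g. "blood_pressures"
    version: str        # the version of the data under this name
    location: str       # where the data is persisted
    join_key: str       # universal identifier used to relate components
    description: str    # human-readable documentation
    fields: tuple = ()  # machine-readable descriptors of returned data

bp_meta = DataComponentMetadata(
    name="blood_pressures",
    version="1.0.0",
    location="s3://icu-domain/blood_pressures/v1",
    join_key="patient_id",
    description="Systolic/diastolic readings on a 15-minute interval.",
    fields=("patient_id", "timestamp", "systolic", "diastolic"),
)
```

Because every component declares the same fields, the middle layer never needs component-specific code to work with a new one.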

Middle-Layer Data Set Creation

With some data components in place, let’s now talk about how our middle layer creates data sets from components by combining them into a single collection of data, the data set. There are all sorts of middle-layers that do this: feature stores for ML, data virtualization platforms for BI, cloud data hubs for everything else imaginable, or even your own custom software or library that you’re building right now. What does it essentially mean to use a middle layer for data set creation?

For starters, it means not writing low-level code to retrieve all my individual data components and relate their specific information so that all the data joins into a single data set. Instead, I will use a library or an app, which allows me to create a list of names and versions of my data components. That app will then use the metadata of my data components to join the data into a data set.

The benefits of this are enormous. More than I can list here. But for starters: this ensures that I get exactly the data I expected, that the data has been engineered in a separate, code-reviewed process, that it is versioned data, and that someday soon I can iterate on this present data set by simply bumping the version number of this data component. It also decreases the amount of information that must be provided in order to create a data set, clearing the way for more automated data set creation.
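That “bump the version number” iteration can be as small as a one-line change to a declarative spec. Here’s a sketch under the assumption of a simple dict-shaped spec (the shape is mine, not any particular tool’s format):

```python
def bump_component_version(spec, component_name, new_version):
    """Iterate on a data set by bumping one component's version in a
    declarative spec, rather than rewriting any retrieval code.
    Returns a new spec; the original is left untouched."""
    return {
        **spec,
        "components": [
            {**c, "version": new_version} if c["name"] == component_name else dict(c)
            for c in spec["components"]
        ],
    }

spec = {
    "name": "icu_deterioration_features",
    "components": [
        {"name": "blood_pressures", "version": "1.0.0"},
        {"name": "heart_rates", "version": "2.3.0"},
    ],
}
new_spec = bump_component_version(spec, "blood_pressures", "1.1.0")
```

Nothing else about the data set definition changes — which is exactly what makes this kind of iteration safe to automate.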

The Middle-Layer Data Set Creation Process

In order for this whole process to work, my middle layer needs to be able to do a few things under the hood:

  1. Access the metadata for a data component when provided the name and version of the component
  2. Use the metadata of the various data components (particularly, the parts that tell where the data is persisted) to retrieve data for all components.
  3. Use the universal identifier information from the metadata of the components to relate (often join) the data from the various components into a single data set.
  4. Provide ways of performing basic filtering to limit which data is retrieved. (Ideally, provide optimizations for filtering across multiple components when possible.)

With the exception of #4, the middle-layer can rely on all necessary information being in the metadata as part of the data component interface we’ve designed.
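Steps 1 through 3 can be sketched in a few lines. This is a toy in-memory version — a real middle layer would query real stores — and the registry/loader shapes are assumptions of mine:

```python
def build_dataset(registry, loaders, components):
    """Minimal middle-layer sketch: resolve metadata, retrieve each
    component's rows, and relate them on the shared identifier.
    `registry` maps (name, version) -> metadata dict; `loaders` maps a
    persisted location to a callable returning rows (lists of dicts)."""
    joined = {}
    for name, version in components:
        meta = registry[(name, version)]      # step 1: metadata lookup
        rows = loaders[meta["location"]]()    # step 2: retrieval
        key_field = meta["join_key"]
        for row in rows:                      # step 3: relate by identifier
            record = joined.setdefault(row[key_field], {key_field: row[key_field]})
            for k, v in row.items():
                if k != key_field:
                    record[k] = v
    return list(joined.values())
```

Note that the caller supplies only names and versions; everything else comes from the component metadata.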

Also, one thing about that interface that I’ve been vague about up until now is where the data is persisted. The where could specify a table in a data warehouse. It could be a file location on a local computer. It could be a partitioned Parquet data set in S3, or an API endpoint. It could be any information that your middle layer can use to retrieve data. One of the benefits of using a middle layer is that it decouples us from a single data source. As long as all the data we retrieve can be related to each other, it can come from any source that our middle layer can handle.
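In practice, that decoupling often looks like dispatching on the location string. A minimal sketch, assuming locations are plain paths or URLs (only local CSV files are actually implemented here; the other branches are placeholders for whatever clients your middle layer supports):

```python
import csv
from urllib.parse import urlparse

def resolve_loader(location):
    """Pick a retrieval strategy from a component's persisted-location
    string. Local files are handled; remote schemes are stubs."""
    scheme = urlparse(location).scheme
    if scheme in ("", "file"):
        def load():
            with open(urlparse(location).path or location, newline="") as f:
                return list(csv.DictReader(f))
        return load
    if scheme in ("s3", "http", "https"):
        raise NotImplementedError(f"add a client for {scheme}:// sources")
    raise ValueError(f"unknown location scheme: {scheme!r}")
```

New sources then only require a new branch (or plugin) in the middle layer, never changes to the components themselves.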

Essential Middle-Layer Filtering Operations

Now #4 from above, performing basic filtering, needs a bit more attention. According to our full definition of modular data retrieval, “data components correspond to data that is relevant to the entire domain, while data filtering for specific use cases occurs during data set creation.”

Let’s flesh that out.

In our working example, our domain would be “ICU patients”. For our blood pressure data component, we persist blood pressure measurements for all ICU patients — the entire domain. But, we might have a specific use case that requires us to look at only blood pressure measurements within 1 hour of a patient arriving to the ICU from surgery. For this use case, we need to filter our data, and ideally, our middle-layer will allow us to evaluate a filter like this.

One of the most powerful filtering operations that a middle layer can perform is to filter the data retrieved for some data component by a filtering criterion applied to a different data component. Example: Imagine that we have a post_op_minutes data component. Our middle layer should allow us to retrieve the data component blood_pressures where post_op_minutes <= 60. Note: This requires some thoughtful data and middle-layer architecture. In the section below, “Finding the Balance,” we’ll discuss the considerations we should make.
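The semantics of that cross-component filter can be sketched as a semi-join on the shared identifier. This in-memory version is illustrative only — a real middle layer would ideally push the predicate down into the source query rather than filtering after retrieval:

```python
def filter_by_other_component(target_rows, filter_rows, key, predicate):
    """Cross-component filtering sketch: keep rows of the target
    component whose join key appears in a row of another component
    satisfying `predicate` (e.g. post_op_minutes <= 60)."""
    eligible = {r[key] for r in filter_rows if predicate(r)}
    return [r for r in target_rows if r[key] in eligible]

blood_pressures = [
    {"patient_id": "p1", "systolic": 120},
    {"patient_id": "p2", "systolic": 140},
]
post_op = [
    {"patient_id": "p1", "post_op_minutes": 45},
    {"patient_id": "p2", "post_op_minutes": 300},
]
first_hour = filter_by_other_component(
    blood_pressures, post_op, "patient_id",
    lambda r: r["post_op_minutes"] <= 60,
)
```

A time-windowed use case like ours would really need a composite key of patient and time, which is part of the architectural thoughtfulness the note above refers to.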

And now that we’ve used a middle layer to create a data set, another benefit of Modular Data Retrieval becomes evident: data set creation can produce a data set definition artifact that contains all the information required to reproduce the data set — most likely, your middle layer stores this somewhere in JSON or YAML format. This artifact is machine-readable and therefore easy to programmatically modify or update within your middle layer. Enjoy.
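Such an artifact might look like the following. The field names are hypothetical — every middle layer will have its own schema — but the round trip (serialize, store, reload, modify) is the essential property:

```python
import json

# A hypothetical data set definition artifact: everything required to
# reproduce the data set, in machine-readable form.
definition = {
    "dataset": "first_hour_post_op",
    "components": [
        {"name": "blood_pressures", "version": "1.0.0"},
        {"name": "post_op_minutes", "version": "2.1.0"},
    ],
    "filters": [
        {"component": "post_op_minutes", "field": "post_op_minutes",
         "op": "<=", "value": 60},
    ],
}

# Serialize for storage, then reload for programmatic modification.
artifact = json.dumps(definition, indent=2)
reloaded = json.loads(artifact)
```

Because the artifact is plain data, it can be diffed, reviewed, versioned in git, and generated by other tooling.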

Finding the Balance

I’ve mentioned above that there are a few points at which careful planning should be done. Modular data retrieval depends on a combination of domain definition, data architecture, and middle-layer complexity. The burden of meeting all five requirements of modular data retrieval will be spread across these three aspects, so finding the right balance in assigning that burden is key to making modular data retrieval work in your stack for you.

For example, you could do almost no domain definition or data architecture at all and still achieve modular data retrieval — you’ll just require a complex middle layer that can perform [modular] data virtualization on top of your legacy database, or potentially [modularly] mesh together diverse data sources. But you’ll likely struggle with how diverse your domain is and you’ll find it difficult to make filtering efficient when retrieving data for any given use case.

Here are a few considerations to make when designing each of these layers.

Domain definition

Domain definition enables modular data retrieval by A) limiting the scope of the data and B) providing a perspective from which data architecture can occur.

Data components within a well-defined domain tend to relate to each other as being different aspects of a single entity. In our example domain, which is ICU patients, the data is unified by the patient as an entity. Each datum is an aspect of the patient at some point in time, such as age or blood pressure. A domain that is large or poorly defined for modular data retrieval will often contain several or even many different first-class entities. When this is the case, the way that different data components relate to each other becomes more complex, and that complexity typically needs to be handled either by the middle layer or by architectural convention.

Additionally, the domain definition offers a perspective for data architecture. Imagine that our domain was not ICU patients, but ICU people containing not just patient entities, but also nurse and physician entities. Data architected around a nurse entity might include blood pressures for all patients assigned to that nurse. The need to fully support data retrieval for both entities may require a more generic, relational data model. By not clarifying the domain’s perspective, we may be limiting how much of the modular-data-retrieval burden can be put onto data architecture, which will shift even more of it to the middle layer.

Data architecture

Data architecture can have a strong effect on how complex our middle layer must be in order to accommodate modular data retrieval. It may attempt to model data components explicitly, or it may require a middle-layer to provide those via virtualization. Architecture must provide a means for relating data components, typically via some sort of unique key. This relational aspect enables data components to be compiled into data sets and it enables data retrieval filtering to be efficiently applied across the retrieval of multiple data components. But beyond providing this basic functionality, there are a number of other choices that will enable (or hinder) modular data retrieval.

Depending on the use cases or the nature of the data, the architecture may be flattened into a table or stored as an object, and the use case may determine if either choice places more burden on the middle-layer. The architecture may be centralized or distributed. Our data architecture may even be layered, itself, utilizing various forms of operational logic beneath an interface that is presented to our middle-layer for data retrieval.

Middle-layer complexity

Any of the burden for achieving modular data retrieval which has not been assumed by a careful domain definition and data architecture will be pushed to our middle-layer. It’s important to note that even the most sophisticated middle-layer cannot always accomplish everything required to make modular data retrieval practical.

The amount of complexity in the middle-layer is precisely the amount of difficulty involved in taking a list of data components and filtering logic, and returning a complete data set. This generally boils down to three operations of varying complexity:

  • Translating the filtering logic that was declared by the user into something that can be used to filter data component retrieval as efficiently as possible.
  • Retrieving data for data components (ideally with filtering in place)
  • Combining the data of each component into a single set of data or records of some form. Note that this final combined form need not exist as a table or as a file or object in adjacent/linked memory or a single location in a filesystem. What matters is that the middle-layer provides data components that have been related into a single data set for use as such by the user. Many times that will be a table or a set of local records. But it doesn’t have to be.

Each of these bullet points can be made easier or more difficult by domain definition and data architecture.

Conclusion

The benefits of Modular Data Retrieval are manifold, or else no one would bother: governance, efficiency, understandability, reusability, quality, automatability. If your work requires frequent generation of data sets within a well-defined domain, then it’s worth taking the time to carefully plan and implement MDR within your data stack. If so, I hope this article is some help to you.


John Raines

For money, I’m a software engineer who primarily works in machine learning platform design. For free, I read fantasy novels and raise children (my own).