Modular Data Retrieval (for Machine Learning, et al.), an Introduction
Grasp this key concept for feature stores and data virtualization platforms
There are a growing number of tools that simplify the process of data set creation. There are feature stores for ML, data virtualization platforms for BI, and cloud data hubs for everything imaginable. All of these tools put a necessary middle layer between the user who consumes data as data sets and the engineers who pipe, clean, and persist data.
But just adding a tool into the middle of a stack is rarely the way to get the most benefit out of it. Modular Data Retrieval (MDR) is a concept that enables rapid, governable, iterative data set creation via a middle-layer tool. If you understand MDR, you can maximize the benefit of your middle layer. (You could even build your own middle layer.)
This post is part 1 of a 2-part series on MDR, and the main purpose of this post is to introduce the broad concept. In part 2, Modular Data Retrieval, an Architecture, I’ll walk through some of the lower-level architectural concepts that enable MDR.
What do I mean by “Modular Data Retrieval”?
Let me describe a data retrieval scenario and use that to demonstrate Modular Data Retrieval. And for the purpose of this article, I’m using the term retrieval broadly to mean a process that brings data (or a pointer to it) into working memory for use in some pipeline or application.
In this scenario, you work with data about ICU patients to make predictive models that could increase awareness of a worsening patient condition. (This is an entirely fictional and overly simplistic scenario, but it has more interesting stakes than most of us tend to experience in our day-to-day.)
After doing your work for a little while, you note that the data sets you construct for research and machine learning have a lot in common. For example, your data set for predicting a dangerous drop in serum potassium levels looks incredibly similar to your data set for predicting development of pulmonary embolism. That’s not because the two conditions are necessarily physiologically related, but because much of the data you collect is broadly relevant across your domain. It includes things like:
- Patient demographic information (age, sex, etc.)
- Major co-morbid conditions
- Vital signs and other biometrics
- Physical assessments
- Clinical lab results
Each of these bullet points would be a feature group: a collection of features. The features in a group are conceptually cohesive and are often produced by the same source applications or devices.
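To make that concrete, here’s one minimal way a feature group might be represented in code. This is a hypothetical sketch, not the API of any particular feature store; the names and versions are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureGroup:
    """A named, versioned collection of conceptually related features."""
    name: str                  # e.g. "vital_signs"
    version: str               # e.g. "2.1.0"
    features: tuple[str, ...]  # the feature names this group produces


demographics = FeatureGroup(
    name="demographics",
    version="1.0.0",
    features=("age", "sex"),
)

vital_signs = FeatureGroup(
    name="vital_signs",
    version="2.1.0",
    features=("heart_rate", "resp_rate", "systolic_bp", "spo2"),
)
```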
Your data sets are very similar, but they also differ in key ways. You notice that almost all of your data sets include most of these feature groups, but many exclude one or two of them. Also, almost all of your data sets include one or two more specialized features that aren’t as widely used. High-cardinality categorical features such as pharmaceuticals, for example, may be dimensionally reduced in different ways for different use cases. For your pulmonary embolism predictive model, you’re using pharmaceutical embeddings that were trained within the context of adverse blood-clotting incidents. Your low-potassium model will use a different pharmaceutical embedding.
Making your data sets modular means designing the data set-creation process to treat features or feature groups as independent modules from which a data set is built.
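Continuing the sketch above, that mix-and-match might look like the following. Again, this is hypothetical: assemble_data_set stands in for whatever your middle layer exposes, and the embedding group names are made up.

```python
# Shared, domain-wide feature groups (defined in the sketch above).
common_groups = [demographics, vital_signs]

# Use-case-specific modules: each model swaps in its own
# pharmaceutical embedding, trained for its own context.
pe_embedding = FeatureGroup(
    name="pharma_embedding_clotting",
    version="1.0.0",
    features=("pharma_vec",),
)
low_k_embedding = FeatureGroup(
    name="pharma_embedding_potassium",
    version="1.0.0",
    features=("pharma_vec",),
)


def assemble_data_set(groups):
    """Hypothetical middle-layer call: joins each group's data on a shared key."""
    ...


# The same process builds both data sets; only the module list differs.
pe_data_set = assemble_data_set(common_groups + [pe_embedding])
low_k_data_set = assemble_data_set(common_groups + [low_k_embedding])
```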
This ability to mix-and-match different feature groups is really only the beginning of what modular data retrieval has to offer. It’s the part of the architecture you see most clearly at first glance. Before we wrap this introduction up, let’s look at a more complete definition of MDR…
A Fuller Definition
In order for data retrieval to be modular, it must satisfy the following requirements:
- Data retrieval occurs via a middle layer between the user and the data.
- Data components (e.g. feature groups) exhibit a consistent interface, which allows a data set-creation process to build data sets agnostic of the specific data components being used. This interface is declared in some way to the middle layer. (A sketch follows this list.)
- Data components have a stable definition between data sets. That is, data from a data component is consistent across data sets that it appears in.
- Changes in a data component can and should be versioned.
- Data components correspond to data that is relevant to the entire domain, while data filtering for use cases occurs during data set creation.
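Here’s a minimal sketch of what that consistent, middle-layer-facing interface could look like in Python. Everything here is an assumption for illustration: the DataComponent protocol, the retrieve signature, and the patient_id join key are stand-ins, not any real feature store’s API.

```python
from typing import Protocol

import pandas as pd


class DataComponent(Protocol):
    """The consistent interface each data component declares to the middle layer."""
    name: str
    version: str

    def retrieve(self, entity_ids: list[str]) -> pd.DataFrame:
        """Return one row per entity, keyed on a shared join column."""
        ...


def build_data_set(
    components: list[DataComponent], entity_ids: list[str]
) -> pd.DataFrame:
    """The middle layer joins components without knowing what's inside them."""
    frames = [c.retrieve(entity_ids).set_index("patient_id") for c in components]
    return pd.concat(frames, axis=1).reset_index()
```

Because build_data_set only depends on the interface, any component that satisfies it can be dropped into any data set, which is exactly the agnosticism the second requirement asks for.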
And in case that list seems a bit abstract, let me provide at least one “so what?” for each requirement above, in order:
- Data sets are defined by simply listing the names (and versions) of the data components (i.e. features). They don’t even need to be defined by code, necessarily: a middle layer could look like a specialized code library, an app that reads yaml data set configs (an example follows this list), the GUI for your feature store, or something else.
- (A) You can create new data sets or iterate on old ones much faster. (B) The process of building a completely new data set from existing data components is even highly automatable.
- Errors or conflicting definitions of data don’t creep into multiple data set retrievals via repeated, copy-pasted code.
- Historical data sets can be re-created. “Improvements” can be empirically validated. Governance.
- The re-usability of data components across different use cases is maximized.
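To ground the first “so what” above: a data set definition read by such a middle layer could be as simple as a config file. The format below is hypothetical; the component names and versions echo the examples earlier in this post, and the filter key is illustrative of the per-use-case filtering described in the last requirement.

```yaml
# pe_model.yaml -- a hypothetical data set definition
data_set: pulmonary_embolism_model
components:
  - name: demographics
    version: 1.0.0
  - name: vital_signs
    version: 2.1.0
  - name: clinical_labs
    version: 3.0.2
  - name: pharma_embedding_clotting
    version: 1.0.0
filters:
  population: icu_admissions
```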
Next Steps: A Modular Data Retrieval Architecture
So what does an MDR architecture look like? Hop over to part 2, Modular Data Retrieval, an Architecture to go deeper.