There’s No ML Without Data

By Carm Taglienti, Chief Data Officer and Distinguished Engineer

Businesses generate large amounts of data every day. The more data your business generates, the more potential there is for confusion surrounding that data: What is the data telling you? Do you have the data you need? And most importantly, are you asking the right questions? 

If your business is looking to adopt Machine Learning Operations (MLOps), data is a good place to start. Why? Because ML depends on data and how that data fits into answering your business’s questions. For a machine to learn what you want it to learn, you need to be sure you are equipping your machine with the appropriate data.

In this blog, we will explore how to approach ML from a data-driven perspective.

Start with the “what” and the “why.”

Trying to get a machine to learn without data is like trying to get a student to learn without books (or, in today’s day and age, without Google).

Data makes ML possible. So, when deciding how to sort through and manage your business’s data, first identify what data is required to run your ML algorithms and to apply Artificial Intelligence (AI) techniques that help you understand your business more effectively.

To identify the right data for your ML algorithms, consider the “what” and the “why” behind your ML model: What is it that you expect the machine to learn, and why are you training the machine to be able to make decisions? The answers to these questions will assist you in classifying your data and creating your dataset. 

That said, it can be difficult to pinpoint the expected outcome of your ML model and why exactly you want to train a machine to be able to make decisions. If you are struggling to define the “what” and the “why” for your MLOps, it can be helpful to approach these questions from a business perspective. 

In other words, think about what business outcome you are looking to achieve and what kinds of questions your business is looking to answer. The answers to these questions will help to inform the “what” and the “why” for your ML processes and the data that is needed for your ML algorithms. 

One important consideration in this process is agility. ML pipelines that automate data collection and curation can create productive feedback loops within the business. Those loops, in turn, improve both what the model learns and the efficiency of the operational pipelines that drive the ML process.

Collecting the right data

Once you have identified the “what” and the “why” for your MLOps, the challenge is obtaining and collecting the data that will position your machine to make the desired decisions. To overcome this challenge, identify where that data might exist, and then create pipelines that allow you to collect it quickly and conveniently. This will help move the decision-making process forward.
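
To make the idea of a collection pipeline concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any particular product or platform: the function names (read_csv_text, select_fields, collect) and the two sample exports are hypothetical, and a real pipeline would read from your actual systems rather than from in-memory strings.

```python
import csv
import io
from typing import Callable, Iterable, Iterator

Record = dict  # one row of data, keyed by column name


def read_csv_text(text: str) -> Iterator[Record]:
    """Parse CSV text into row dictionaries (a stand-in for a real source)."""
    yield from csv.DictReader(io.StringIO(text))


def select_fields(records: Iterable[Record], fields: list[str]) -> Iterator[Record]:
    """Keep only the columns the ML question needs; absent columns become None."""
    for record in records:
        yield {name: record.get(name) for name in fields}


def collect(
    sources: list[Callable[[], Iterable[Record]]], fields: list[str]
) -> list[Record]:
    """Pull every source and return one combined, uniformly shaped dataset."""
    dataset = []
    for source in sources:
        dataset.extend(select_fields(source(), fields))
    return dataset


# Hypothetical sample data standing in for two business systems.
CRM_EXPORT = "customer_id,region,plan\n1,east,basic\n2,west,premium\n"
BILLING_EXPORT = "customer_id,monthly_spend\n1,20\n2,95\n"

dataset = collect(
    sources=[
        lambda: read_csv_text(CRM_EXPORT),
        lambda: read_csv_text(BILLING_EXPORT),
    ],
    fields=["customer_id", "region", "monthly_spend"],
)
```

The design choice worth noting is that the list of fields comes from the “what” and the “why”: the pipeline gathers only what the business question calls for, and every record ends up with the same shape regardless of which system it came from. Gaps (None values) remain visible so they can be handled deliberately later.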

The general idea here is that ML is all about helping the machine understand the data patterns that might be associated with the types of business outcomes or questions that you (and the business) want to understand better.


The modern data estate: A foundation for success

When we work to identify what data a client will need to drive its ML toward a given business purpose, we do so in the context of what is called a modern data estate.

A modern data estate serves as the groundwork that preps the organization for a modern data platform. It consists of five elements: 

  • Security & governance
  • Smart, scalable storage
  • Flexible compute & language
  • Intelligent automation & enrichment
  • Democratic discoverability & sharing

Part of the MLOps process is defining the series of data collection steps that need to be automated, and then what needs to happen to prepare that data for ML use. The modern data estate model allows us to do this more expediently, more effectively and more consistently.

This is because the modern data estate looks at data from the perspective of velocity and agility. Velocity is the speed at which data is captured, based on the lifetime value of that data (for example, tweets have to be consumed quickly because they have a short data value lifetime). Agility means allowing for iteration and adaptation as you learn more about what data is required to address your AI/ML needs. In a situation in which data is being captured in real time, the modern data estate allows you to consume that data in real time. Or perhaps you want to dynamically augment the data or impute missing data to drive your MLOps more effectively. The modern data estate makes this possible and fast, so that businesses can get to their desired outcomes sooner.
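
As a small illustration of what imputing missing data can look like, the sketch below fills gaps in a numeric column with the average of the observed values. Mean imputation is just one simple technique chosen for the example, not a recommendation for every dataset, and the function name impute_mean and the sample figures are hypothetical.

```python
from statistics import mean
from typing import Optional


def impute_mean(values: list[Optional[float]]) -> list[float]:
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: the column has no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical monthly spend figures with one missing entry.
monthly_spend = [20.0, None, 95.0, 35.0]
filled = impute_mean(monthly_spend)
print(filled)  # [20.0, 50.0, 95.0, 35.0]
```

Whether a simple average is appropriate depends on the data and on the question you are asking, which is one more reason to start from the “what” and the “why.”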

More resources for data and ML 

If your business has questions, ML could be the solution to help you find your answers. But putting together a dataset and driving a successful ML operation is easier said than done. Here are a few additional resources to help guide you toward success.

Drive outcomes with Insight.
From defining use cases to implementing and maintaining solutions, Insight has the expertise you need to realize success with ML initiatives. See how we can help.