Key Concepts and Overview
Model enrichment adds external signals, features and samples to your first-party data to boost your AI models’ performance. This external data is not already present in your application, and it supplies the additional context needed to make accurate predictions. Because ML models need to reflect real-world conditions, they must include appropriate context, up-to-date information and relevant signals to accurately model the real-world tasks they are meant to solve.
Arcus’ Model Enrichment offering leverages the Arcus Data Exchange to consume data that transparently improves your models’ performance. The Arcus Data Exchange is a two-sided auction that connects your ML models to high value external data.
You can get started by using the Arcus Model SDK, the Python client library that integrates directly with common ML frameworks and builds atop the exchange to provide model enrichment. Today, the Model SDK includes support for PyTorch. It takes fewer than 10 lines of code to integrate your model with Arcus. The data you consume from the exchange is automatically integrated into your ML workflows for training and inference, all while preserving the privacy of your original first-party data.
Let’s establish the steps for integrating your model with Arcus and connecting it to high value data sources that transparently improve its performance.
To enrich your models with Arcus, the first step is to create a model enrichment project in the Arcus platform. When you create a new project, you’ll be prompted to provide some basic information and requirements about your application. This information will be used to match your model to the most relevant data candidates. Once you’ve created a project and wrapped your model with the Model SDK, you can:
- Run a trial to discover and evaluate different data candidates provided by the exchange.
- Evaluate the performance of each data candidate. Based on the results from your trial and other information for each data candidate, you can select the best data candidate for your model.
- Once you’ve made a selection, you can use the data candidate in your training and serving workflows to power your AI application. Arcus ensures that the provided external data is up-to-date and available for use in your model.
Trialing is the first step for model enrichment. An Arcus trial consists of a few steps that help you discover and experimentally validate external data that improves your model performance while meeting your application requirements. This process requires no additional code or logic on your part beyond wrapping your model with the Arcus Model SDK.
- Discovery: When starting a trial, the Arcus Data Exchange first matches your model and your first-party data to multiple external data candidates, which are each composed of multiple underlying datasets.
- Each data candidate contains additional features and samples that are compatible with your existing first-party data and application while providing novel and relevant information that improves your model’s performance.
- The exchange determines which data candidates are most likely to improve your model performance by examining summary information about your task or use case and summary metrics and semantic types of the first-party data already present in your model. The exchange also takes into account the intrinsic data quality, compatibility and relevance of the external data to ensure that the matched candidates are most likely to improve your model’s performance.
- Throughout the discovery process, none of your first-party data is directly shared with Arcus. The Model SDK extracts metadata, summary metrics and semantic types to determine which external data is most relevant and compatible with your application. This process ensures that your data is kept private and secure in your infrastructure. You can also specify your own application requirements such as the freshness or liveness of the data you require.
- At the end of the discovery process, the exchange produces a list of the data candidates most likely to improve your model’s performance and returns it to the model to begin the enrichment process.
- Enrichment: After matching your model to a set of external data candidates, it’s important to empirically validate the performance of each candidate so you can determine which candidate is best for your application. To do this, the Arcus Model SDK runs multiple training and evaluation loops against each matched candidate to measure how each candidate affects your model’s performance.
- Arcus records key performance metrics for your model, such as accuracy and loss, for the enriched model with each data candidate and compares them to the baseline performance of your model with no enrichment.
- This enrichment process provides you with a set of metrics for each candidate that you can use to evaluate it. For example, you may examine how your validation accuracy changes when your model is enriched with a given data candidate.
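The two trial stages above can be sketched in plain Python. This is an illustration only and does not use the Arcus Model SDK: `summarize_column` and `compare_to_baseline` are hypothetical helper names, and the metric values are made-up numbers, not real trial results. The first function shows the kind of summary metadata that can describe first-party data without including raw rows. The second shows how enriched metrics might be compared against an unenriched baseline.

```python
from statistics import mean, pstdev


def summarize_column(name, values):
    """Return summary metadata for one column (hypothetical helper).

    Only counts, a coarse semantic type and aggregate statistics are
    returned; the raw values themselves are not part of the summary.
    """
    present = [v for v in values if v is not None]
    summary = {
        "name": name,
        "count": len(values),
        "missing": len(values) - len(present),
    }
    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        summary.update(
            semantic_type="numeric",
            mean=mean(present),
            std=pstdev(present),
        )
    else:
        summary.update(semantic_type="categorical", distinct=len(set(present)))
    return summary


def compare_to_baseline(baseline, enriched_by_candidate):
    """Compute per-candidate metric deltas relative to the baseline.

    A positive accuracy delta or a negative loss delta means the
    enriched model did better than the unenriched one on that metric.
    """
    return {
        candidate: {metric: metrics[metric] - baseline[metric] for metric in baseline}
        for candidate, metrics in enriched_by_candidate.items()
    }


# Illustrative, made-up inputs (assumptions, not real Arcus output).
age_summary = summarize_column("age", [34, 51, None, 29])
baseline_metrics = {"val_accuracy": 0.81, "val_loss": 0.47}
trial_metrics = {
    "candidate_a": {"val_accuracy": 0.86, "val_loss": 0.39},
    "candidate_b": {"val_accuracy": 0.80, "val_loss": 0.49},
}
deltas = compare_to_baseline(baseline_metrics, trial_metrics)
```

In this sketch, `candidate_a` shows a positive accuracy delta and a negative loss delta, while `candidate_b` is slightly worse than the baseline on both metrics, which is the kind of per-candidate comparison a trial is meant to surface.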
After this trialing process is complete, you can evaluate the relative performance of each data candidate and select the one that best suits your application needs. In addition to the model performance metrics (e.g. accuracy or loss), Arcus also provides information on other relevant factors, such as:
- Freshness: How frequently the candidate is updated. Datasets in the exchange are automatically updated on a periodic basis. However, depending on your application’s specific freshness requirements you may prefer different data candidates.
- Candidate Summary: Each data candidate is composed of various underlying data sources provided by the exchange. For example, a data candidate may be composed of demographic data, economic indicators and consumer behavior data. Arcus provides summary information for the data candidate so you can understand the datasets that comprise a given candidate.
- Cost: The cost of a given data candidate is determined dynamically by the auction process and by the underlying data sources that make up the candidate. You may prefer different data candidates based on their relative cost and performance results.
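One way to weigh these factors together is sketched below. This is a minimal illustration, not how Arcus ranks or prices candidates: the `CandidateResult` fields, the `select_candidate` helper, the selection rule and all numbers are assumptions made for the example. The rule filters out candidates that miss a freshness or budget requirement or fail to beat the baseline, then picks the most accurate remaining candidate, preferring the cheaper one on ties.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class CandidateResult:
    """Hypothetical record of one candidate's trial outcome."""
    name: str
    val_accuracy: float
    update_interval_days: int  # freshness: how often the data is refreshed
    cost: float                # in whatever unit the candidate is priced in


def select_candidate(
    results: Iterable[CandidateResult],
    baseline_accuracy: float,
    max_update_interval_days: int,
    budget: float,
) -> Optional[CandidateResult]:
    """Pick the most accurate candidate that meets freshness and cost limits."""
    eligible = [
        r for r in results
        if r.val_accuracy > baseline_accuracy
        and r.update_interval_days <= max_update_interval_days
        and r.cost <= budget
    ]
    if not eligible:
        return None  # no candidate satisfies the requirements
    # Highest accuracy first; on ties, the lower cost wins.
    return max(eligible, key=lambda r: (r.val_accuracy, -r.cost))


# Illustrative, made-up trial results.
results = [
    CandidateResult("candidate_a", val_accuracy=0.86, update_interval_days=30, cost=120.0),
    CandidateResult("candidate_b", val_accuracy=0.85, update_interval_days=1, cost=200.0),
    CandidateResult("candidate_c", val_accuracy=0.88, update_interval_days=7, cost=500.0),
]
choice = select_candidate(
    results, baseline_accuracy=0.81, max_update_interval_days=7, budget=300.0
)
```

Here `candidate_a` is excluded for being refreshed too rarely and `candidate_c` for exceeding the budget, so `candidate_b` is selected even though it is not the most accurate overall. A different application, with looser freshness needs or a larger budget, could reasonably land on a different candidate.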
Finally, to use one of the data candidates in your model, make a data selection by choosing the candidate you want. The selected data candidate is then used across your training and serving workflows.