Key Concepts and Overview
Model enrichment adds external signals, features, and samples alongside your first-party data to boost your AI models’ performance. This external data provides context and information that isn’t already in your application but is essential for making accurate predictions. Because ML models need to reflect real-world context, they must incorporate the appropriate context, up-to-date information, and relevant signals to accurately model the tasks they are intended to solve.
Arcus’ Model Enrichment offering leverages the Arcus Data Platform to consume data that transparently improves your models’ performance. You can get started by using the Arcus Model SDK, the Python client library that integrates directly with common ML frameworks and builds atop the platform to provide model enrichment. Today, the Model SDK includes support for PyTorch. It takes less than 10 lines of code to integrate your model with Arcus. The data you consume from the platform is automatically integrated into your ML workflows for training and inference, all while preserving the privacy of your original first-party data.
Let’s walk through the steps to integrate your model with Arcus and connect it to high-value data sources that transparently improve its performance.
To enrich your models with Arcus, the first step is to create a model enrichment project in the Arcus platform. When you create a new project, you’ll be prompted to provide some basic information and requirements about your application. This information will be used to match your model to the most relevant data candidates. Once you’ve created a project and wrapped your model with the Model SDK, you can:
- Start the trialing process, composed of Matching, Composition, and Validation, which allows you to assess the value of the data and signals in the different data candidates provided by the platform.
- Make a selection and use the chosen data candidate in your training and serving workflows to power your AI application. Arcus ensures that the provided external data stays up-to-date and available for use in your model.
The trialing process is how you assess the value of external data and signals for your AI application. It requires no additional code or logic on your part beyond wrapping your model with the Arcus Model SDK.
Matching: When starting a trial, the Arcus Data Platform first matches your model and your first-party data to multiple external datasets.
- Each dataset contains additional features and samples that are compatible with your existing first-party data and application while providing novel and relevant information that improves your model’s performance.
- The platform determines which datasets are most likely to improve your model’s performance by examining summary information about your task or use case and summary metrics and semantic types of the first-party data already present in your model. The platform also takes into account the intrinsic data quality, compatibility and relevance of the external data to ensure that the matched datasets are most likely to improve your model’s performance.
- Throughout the matching process, none of your first-party data is directly shared with Arcus. The Model SDK extracts metadata, summary metrics and semantic types to determine which external data is most relevant and compatible with your application. This process ensures that your data is kept private and secure in your infrastructure. You can also specify your own application requirements such as the freshness or liveness of the data.
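To make the privacy model concrete, here is a minimal sketch of the kind of client-side summarization described above: only aggregate metrics and coarse semantic types leave your infrastructure, never raw values. The function names and the exact statistics chosen are illustrative assumptions, not the actual Arcus Model SDK API.

```python
from statistics import mean, stdev

def infer_semantic_type(values):
    """Coarse semantic typing: numeric vs. categorical (illustrative)."""
    if all(isinstance(v, (int, float)) for v in values):
        return "numeric"
    return "categorical"

def extract_column_metadata(name, values):
    """Summarize a first-party column without exposing raw records.

    Only the semantic type, row count, and aggregate statistics are
    produced; the underlying values stay private in your infrastructure.
    """
    meta = {
        "name": name,
        "semantic_type": infer_semantic_type(values),
        "row_count": len(values),
    }
    if meta["semantic_type"] == "numeric":
        meta["mean"] = mean(values)
        meta["stdev"] = stdev(values) if len(values) > 1 else 0.0
    else:
        meta["cardinality"] = len(set(values))
    return meta

# Example: a first-party feature is reduced to shareable summary metadata.
ages = [34, 29, 41, 52, 38]
meta = extract_column_metadata("customer_age", ages)
```

A matching service could then compare this metadata against the metadata of external datasets to judge compatibility and relevance without ever seeing the raw column.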
Composition: After matching your model to a set of external datasets that are compatible with your first-party data, Arcus automatically composes these datasets together. This step aims to assemble all-encompassing, relevant data candidates tailored to your specific task.
- During the composition phase, Arcus ensures that any gaps or missing information in the composed data are addressed. In cases where certain data elements are absent, relevant synthetic data is introduced to complete the data candidate. This approach aims to produce comprehensive and relevant data candidates that encompass all necessary information, ensuring the optimal performance of your AI application.
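The composition step can be sketched as a join across datasets plus gap-filling. The code below is an illustrative toy, assuming a shared join key and using mean imputation as a simple stand-in for the platform’s synthetic-data step; none of these names come from the Arcus API.

```python
def compose(first_party, external, key):
    """Left-join external rows onto first-party rows by a shared key."""
    index = {row[key]: row for row in external}
    composed = []
    for row in first_party:
        merged = dict(row)
        merged.update(index.get(row[key], {}))  # no match leaves a gap
        composed.append(merged)
    return composed

def fill_gaps(rows, column):
    """Impute missing numeric values with the column mean (a simple
    stand-in for introducing relevant synthetic data)."""
    present = [r[column] for r in rows if r.get(column) is not None]
    fallback = sum(present) / len(present) if present else 0.0
    for r in rows:
        if r.get(column) is None:
            r[column] = fallback
    return rows

# Toy data: one first-party row has no matching external record.
first_party = [{"zip": "94103", "spend": 120.0}, {"zip": "10001", "spend": 80.0}]
external = [{"zip": "94103", "median_income": 112_000}]
rows = fill_gaps(compose(first_party, external, "zip"), "median_income")
```

The result is a complete data candidate: every first-party row carries the external feature, with gaps filled rather than dropped.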
Validation: After matching your model to a set of external data candidates, we then empirically validate the performance of each candidate so you can determine which candidate is best for your application. To do this, the Arcus Model SDK runs multiple training and evaluation loops against each matched candidate to measure how each candidate affects your model’s performance.
- Arcus records key performance metrics, such as accuracy and loss, for your model enriched with each data candidate and compares them to the baseline performance of your model with no enrichment.
- This validation process provides a set of metrics for each candidate that you can use to evaluate the impact the data has on your specific AI application. For example, you may examine how your validation accuracy changes when your model is enriched with a given data candidate.
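The validation logic above amounts to running the same train-and-evaluate loop against each candidate and reporting each one’s lift over the unenriched baseline. Here is a hedged sketch; `train_and_eval` is a hypothetical stand-in for one full training and evaluation loop, and the toy scoring function exists only so the example runs.

```python
def run_trials(train_and_eval, base_data, candidates):
    """Measure each candidate's lift over the unenriched baseline.

    `train_and_eval(data)` stands in for one training + evaluation
    loop that returns a validation accuracy.
    """
    baseline = train_and_eval(base_data)
    report = {}
    for name, candidate in candidates.items():
        enriched = base_data + candidate  # the composed data candidate
        accuracy = train_and_eval(enriched)
        report[name] = {"accuracy": accuracy, "lift": accuracy - baseline}
    return baseline, report

# Toy evaluator: "accuracy" grows with the amount of available signal.
toy_eval = lambda data: min(1.0, 0.6 + 0.05 * len(data))
baseline, report = run_trials(
    toy_eval,
    ["f1", "f2"],
    {"cand_a": ["x1", "x2"], "cand_b": ["x1"]},
)
```

The resulting report mirrors what the trial gives you: a per-candidate metric alongside its delta from the baseline, so candidates can be compared directly.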
After this trialing process is complete, you can evaluate the relative performance of each data candidate and select the one that best suits your application needs. In addition to comparing the model performance metrics (e.g. accuracy or loss), Arcus also provides other relevant factors such as:
- Freshness: How frequently the candidate is updated. Datasets in the platform are automatically updated on a periodic basis; however, depending on your application’s freshness requirements, you may prefer some candidates over others.
- Candidate Summary: Each data candidate is composed of various underlying data sources provided by the platform. For example, a data candidate may be composed of demographic data, economic indicators and consumer behavior data. Arcus provides summary information for the data candidate so you can understand the datasets that comprise a given candidate.
- Cost: The cost of a given data candidate is determined dynamically by the auction process and by the underlying data sources that make up the candidate. You may prefer different data candidates based on their relative cost and performance results.
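These three factors can be combined into a simple selection rule. The sketch below ranks candidates by lift per unit cost among those meeting a freshness requirement; the weighting is purely illustrative and is not Arcus’ actual selection logic.

```python
def select_candidate(candidates, max_staleness_days):
    """Pick the candidate with the best lift-per-cost among those that
    meet the freshness requirement. The scoring rule is illustrative."""
    eligible = [
        c for c in candidates
        if c["update_interval_days"] <= max_staleness_days
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["lift"] / c["cost"])

# Hypothetical trial results for two candidates.
candidates = [
    {"name": "demographics_econ", "lift": 0.04, "cost": 200.0,
     "update_interval_days": 30},
    {"name": "consumer_behavior", "lift": 0.03, "cost": 100.0,
     "update_interval_days": 7},
]
best = select_candidate(candidates, max_staleness_days=14)
```

In this toy case the monthly-updated candidate is excluded by the 14-day freshness constraint, so the weekly-updated candidate is selected despite its smaller absolute lift.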
Finally, to consume one of the data candidates for use in your model, you can proceed to make a data selection and choose the data candidate to use. This selected data candidate is then used across your training and serving workflows.