Arcus improves the performance of AI models and prompts by enriching them with high-value external data and signals from the Arcus Data Exchange. The performance of your AI applications depends on the quality of the data they use and on how well that data represents the real-world causal factors you are modeling.
Let’s establish a few key concepts that introduce how Arcus can help you improve your ML performance.
The Arcus Data Exchange is a dynamic auction that connects AI models and prompts to data distributed by Data Publishers. The exchange connects two parties, which interact with each other through the auction system:
- Data Consumers bring AI models and generative prompts to the Arcus Data Exchange, which automatically matches them to high-value data. These AI applications are matched to additional signals, features and samples that improve their performance without any additional effort on your part.
- Data Publishers publish and monetize datasets through the Arcus Data Exchange. The auction process matches these datasets to the models and prompts that consume the data to improve their AI outcomes.
The Arcus Data Exchange uses matching algorithms to determine the most valuable external data for a given AI application. This matching process produces a set of external data candidates, each composed of multiple underlying datasets, that the exchange has determined are most likely to improve the AI application’s performance. To ensure that only the highest-quality and most relevant candidates are matched, the exchange generates candidates for your application using three core factors:
- Data Quality: The exchange determines a data quality index to measure the value of datasets on the exchange. This index uses the past performance of candidates and evaluates their inherent quality with metrics like data cleanliness, accuracy, and distribution.
- Data Compatibility: The exchange determines which candidates are compatible with your requirements and first-party data, as well as which are most likely to improve the performance of your task. Compatibility is assessed using factors such as:
  - Data schema: The client libraries infer the semantic types of your first-party data to understand what your features mean, such as whether they are locations, people, business names, etc. A semantic type may be expressed in multiple forms, which the exchange automatically matches and joins across compatible datasets and with your first-party data.
  - Data freshness: A measure of how frequently a dataset is updated. For applications that rely on live signals and context, freshness is crucial to ensure the information fed to your models and prompts is accurate for your task.
- Data Relevance: The value of data depends on the application for which it’s used. The exchange ranks candidates by their relevance to your task, which depends on an understanding of your task and first-party data. The exchange builds this understanding using privacy-preserving algorithms so your data stays in your infrastructure. It also takes into account the completeness and distribution of your data to determine which additional samples to use. With this understanding of your first-party data, the exchange then uses data valuation algorithms to determine the relevance of each candidate to your application.
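As a rough illustration of the semantic type inference described under Data schema, the sketch below guesses a type for each column by pattern matching. It is not the Arcus client library's method: the `PATTERNS` table, the function names, and the 0.9 threshold are all assumptions made for this example, and types such as people or business names would need far richer signals than regular expressions.

```python
import re

# Hypothetical patterns, chosen only for illustration.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_zip_code": re.compile(r"^\d{5}(-\d{4})?$"),
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}


def infer_semantic_type(values, threshold=0.9):
    """Return the first type whose pattern matches at least `threshold`
    of the non-empty values, or None if no pattern qualifies."""
    non_empty = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not non_empty:
        return None
    for name, pattern in PATTERNS.items():
        matches = sum(1 for v in non_empty if pattern.match(v))
        if matches / len(non_empty) >= threshold:
            return name
    return None


def infer_schema(columns):
    """Map each column name to its inferred semantic type (or None)."""
    return {col: infer_semantic_type(vals) for col, vals in columns.items()}
```

Columns that share an inferred type, such as two ZIP code columns with different names, then become candidate join keys between first-party and external data.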
The Arcus Data Exchange scores the data quality, compatibility, and relevance of external data candidates on the exchange to rank the most valuable candidates for your application. Arcus then tightly integrates with your application to seamlessly enrich your existing ML workflows with this external data.
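To make the ranking step concrete, here is a minimal sketch that combines the three scores into a single value and sorts candidates by it. The source does not say how the exchange combines scores, so the weighted sum, the weights themselves, the `Candidate` class, and the assumption that each score is normalized to [0, 1] are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    quality: float        # assumed normalized to [0, 1]
    compatibility: float  # assumed normalized to [0, 1]
    relevance: float      # assumed normalized to [0, 1]


# Hypothetical weights; the real exchange's scoring is not documented here.
WEIGHTS = {"quality": 0.3, "compatibility": 0.3, "relevance": 0.4}


def score(candidate):
    """Weighted sum of the three per-candidate scores."""
    return (
        WEIGHTS["quality"] * candidate.quality
        + WEIGHTS["compatibility"] * candidate.compatibility
        + WEIGHTS["relevance"] * candidate.relevance
    )


def rank_candidates(candidates, top_k=None):
    """Return candidates sorted from highest to lowest combined score."""
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked if top_k is None else ranked[:top_k]
```

A weighted sum is only one option. A scheme that first filters out incompatible candidates and then ranks the remainder by relevance would fit the description equally well.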
The Arcus Data Exchange improves your ML models’ and prompts’ performance by adding valuable external signals directly in your application through a process called enrichment. Let’s establish some terms around enrichment:
- First-party data: The data that you already have in your application. This is the data you use to train your ML models or in your prompts for generative AI applications.
- External data: Data that you don’t have but that is valuable to your application. This data is provided to your models and applications through the exchange.
- Enrichment: The process of adding the external data and signals to your models and prompts in order to improve your AI outcomes. This ensures your models and prompts reflect the real-world factors that influence the outcome of your applications.
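For tabular data, enrichment in the sense defined above can be pictured as a left join of external fields onto first-party rows. The sketch below assumes rows are plain dictionaries that share a join key. The `enrich` function and its behavior are illustrative assumptions, not the Arcus API.

```python
def enrich(first_party_rows, external_rows, key):
    """Left-join external fields onto first-party rows by `key`.

    Every first-party row is kept. Rows with no external match are
    returned unchanged, and first-party values win if a field name
    appears in both sources.
    """
    external_by_key = {row[key]: row for row in external_rows}
    enriched = []
    for row in first_party_rows:
        extra = external_by_key.get(row[key], {})
        enriched.append({**extra, **row})
    return enriched
```

Letting first-party values win on conflicts is a design choice for this sketch. It keeps the data you already trust authoritative while external fields only add new columns.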
The trialing process is how you assess the value of the external data and signals to your AI applications. During the trialing process, your models are trained and validated using the different data candidates provided by the exchange. The process runs multiple training loops, one per data candidate, and measures key performance metrics, such as model accuracy and loss, that show how your model changes when enriched with each candidate. Based on these empirical results, you can understand the real performance of your applications with the enriched data before deciding what data to consume in your application.
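The trialing loop can be sketched as follows: train one model per candidate, score each on the same validation set, and compare against a baseline trained on first-party features only. Everything here is an assumption made for illustration, including the function names and the choice of a nearest-centroid classifier as a stand-in for your own model. Rows are assumed to be dictionaries that already have each candidate's features joined on.

```python
from statistics import mean


def train_centroids(rows, features, label):
    """Toy 'model': the mean feature vector of each class."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label], []).append([row[f] for f in features])
    return {cls: [mean(col) for col in zip(*vecs)] for cls, vecs in by_class.items()}


def predict(centroids, row, features):
    """Predict the class whose centroid is closest (squared Euclidean)."""
    x = [row[f] for f in features]
    return min(
        centroids,
        key=lambda cls: sum((a - b) ** 2 for a, b in zip(x, centroids[cls])),
    )


def accuracy(centroids, rows, features, label):
    """Fraction of rows whose predicted class equals the true label."""
    return mean(
        1.0 if predict(centroids, row, features) == row[label] else 0.0
        for row in rows
    )


def run_trials(train, valid, base_features, candidates, label):
    """Train and validate once per candidate, plus a first-party baseline.

    `candidates` maps a candidate name to the list of extra feature
    names that candidate contributes. Returns {name: validation accuracy}.
    """
    results = {}
    for name, extra in {"baseline": [], **candidates}.items():
        features = base_features + extra
        model = train_centroids(train, features, label)
        results[name] = accuracy(model, valid, features, label)
    return results
```

Comparing each candidate's validation accuracy with the baseline entry shows which candidates actually help. A candidate that does not beat the baseline is probably not worth consuming.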