Arcus improves the performance of AI models and prompts by enriching them with high-value external data and signals. The performance of your AI applications depends on the quality of the data they use and on how well that data represents the real-world causal factors you are modeling.
This page introduces a few key concepts that explain how Arcus can help you improve your ML performance.
The Arcus Data Platform connects AI models and prompts to the data they need. Two roles interact on the platform:
- Data Consumers bring AI models and generative prompts to the Arcus Data Platform, which automatically matches them to high-value data. These AI applications are matched to additional signals, features, and samples that improve their performance without any additional effort on your part.
- Data Publishers publish and monetize datasets through the Arcus Data Platform. Through the discovery process, these datasets are matched to models and prompts that consume the data to improve their AI outcomes.
The Arcus Data Platform uses powerful matching algorithms to determine the most valuable external data for a given AI application. This matching process produces a set of data candidates, each composed of multiple underlying datasets, that the platform has determined are most likely to improve the AI application’s performance. To ensure that only the highest-quality, most relevant candidates are matched, the platform generates candidates for your application using a few core factors:
- Data Quality: The platform determines a data quality index to measure the value of datasets on the platform. This index uses the past performance of candidates and evaluates their inherent quality with metrics like data cleanliness, accuracy, and distribution.
- Data Compatibility: The platform determines which candidates are compatible with your requirements and first-party data, as well as which are most likely to improve the performance of your task. This uses factors such as:
  - Data schema: The client libraries infer the semantic types of your first-party data to understand what your features mean, such as whether they are locations, people, business names, etc. Semantic types may be expressed in multiple forms, which the platform automatically matches and joins across compatible datasets and with your first-party data.
  - Data freshness: A measure of how frequently a dataset is updated. For applications that rely on live signals and context, freshness is crucial to ensure the information fed to your models and prompts is accurate for your task.
- Data Relevance: The value of data depends on the application for which it’s used. The platform ranks candidates by their relevance to your task, which depends on an understanding of your task and first-party data. The platform builds this understanding using privacy-preserving algorithms so your data stays in your infrastructure. It also takes into account the completeness and distribution of your data to determine which additional samples to use. With this understanding of your first-party data, the platform then uses data valuation algorithms to determine the relevance of each candidate to your application.
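As an illustration of the freshness factor above, the sketch below estimates how often a dataset is updated from a list of its update dates. The function names, the choice of the median gap, and the sample dates are assumptions made for this example; the platform's actual freshness metric is not specified here.

```python
from datetime import date
from statistics import median


def update_interval_days(update_dates):
    """Return the median number of days between consecutive updates.

    Hypothetical freshness measure: smaller values mean fresher data.
    Returns None when fewer than two update dates are available.
    """
    ordered = sorted(update_dates)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return median(gaps) if gaps else None


def is_fresh_enough(update_dates, max_interval_days):
    """Check whether a dataset is updated at least every max_interval_days."""
    interval = update_interval_days(update_dates)
    return interval is not None and interval <= max_interval_days


# Toy update histories (illustrative only).
daily_feed = [date(2024, 1, day) for day in (1, 2, 3, 4)]
quarterly_dump = [date(2023, 1, 1), date(2023, 4, 1), date(2023, 7, 1)]

print(is_fresh_enough(daily_feed, max_interval_days=7))
print(is_fresh_enough(quarterly_dump, max_interval_days=7))
```

An application that relies on live signals would set a small `max_interval_days`, while a slowly changing task could tolerate a much larger one.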
The Arcus Data Platform scores the data quality, compatibility, and relevance of external data candidates on the platform to rank the most valuable candidates for your application. Arcus then tightly integrates with your application to seamlessly enrich your existing ML workflows with this external data.
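One way to picture how the three scores combine into a ranking is a weighted sum. The following is a minimal sketch only: the `Candidate` fields, the weights, the compatibility threshold, and the dataset names are all hypothetical, and the platform's real scoring algorithm is not documented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float        # hypothetical data quality index in [0, 1]
    compatibility: float  # hypothetical compatibility score in [0, 1]
    relevance: float      # hypothetical relevance score in [0, 1]


# Assumed weights for illustration; not taken from the platform.
WEIGHTS = {"quality": 0.3, "compatibility": 0.3, "relevance": 0.4}


def combined_score(candidate):
    """Weighted sum of the three factor scores."""
    return (
        WEIGHTS["quality"] * candidate.quality
        + WEIGHTS["compatibility"] * candidate.compatibility
        + WEIGHTS["relevance"] * candidate.relevance
    )


def rank_candidates(candidates, min_compatibility=0.5):
    """Drop incompatible candidates, then sort the rest by combined score."""
    compatible = [c for c in candidates if c.compatibility >= min_compatibility]
    return sorted(compatible, key=combined_score, reverse=True)


# Toy candidates (illustrative only).
candidates = [
    Candidate("weather_signals", quality=0.9, compatibility=0.8, relevance=0.6),
    Candidate("foot_traffic", quality=0.7, compatibility=0.9, relevance=0.9),
    Candidate("legacy_census", quality=0.95, compatibility=0.2, relevance=0.8),
]

ranked = rank_candidates(candidates)
for candidate in ranked:
    print(candidate.name, round(combined_score(candidate), 2))
```

In this sketch, compatibility acts as a filter before ranking, so a high-quality dataset that cannot be joined with your first-party data never reaches the top of the list.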
The Arcus Data Platform improves your ML models’ and prompts’ performance by adding valuable external signals directly in your application through a process called enrichment. Let’s establish some terms around enrichment:
- First-party data: The data that you already have in your application. This is the data you use to train your ML models or in your prompts for generative AI applications.
- External data: Data that you don’t have but that is valuable to your application. This data is provided to your models and applications through the automatic data discovery process on the platform.
- Enrichment: The process of adding the external data and signals to your models and prompts in order to improve your AI outcomes. This ensures your models and prompts reflect the real-world factors that influence the outcome of your applications.
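Conceptually, enrichment resembles joining external columns onto your first-party rows. The sketch below assumes both sides share a join key (here a hypothetical `zip` column) and uses made-up values; `enrich` is an illustrative helper, not part of the Arcus client libraries.

```python
def enrich(first_party, external, key):
    """Left-join external columns onto first-party rows by a shared key.

    Every first-party row is kept. External columns are filled with None
    when no external row matches, and first-party values win on conflicts.
    """
    lookup = {row[key]: row for row in external}
    external_columns = {col for row in external for col in row if col != key}

    enriched = []
    for row in first_party:
        match = lookup.get(row[key], {})
        merged = dict(row)
        for col in external_columns:
            merged.setdefault(col, match.get(col))
        enriched.append(merged)
    return enriched


# Toy data (illustrative only).
first_party = [
    {"zip": "94105", "sales": 120},
    {"zip": "10001", "sales": 80},
]
external = [
    {"zip": "94105", "median_income": 105000},
]

for row in enrich(first_party, external, key="zip"):
    print(row)
```

Keeping every first-party row, even when no external match exists, means enrichment adds signals without shrinking the data you already train on.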
The trialing process is how you assess the value of the external data and signals to your AI applications. During the trialing process, your models are trained and validated using the different data candidates matched to your application via the Data Platform. This runs multiple training loops, one per data candidate, and measures key performance metrics, such as model accuracy and loss, that show how your model changes when enriched with each candidate. Based on these empirical results, you can understand the real performance of your applications with the enriched data before deciding which data to consume in your application.
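The trialing loop can be sketched as training the same model once per candidate and comparing validation metrics against a baseline. This is a minimal sketch under stated assumptions: a tiny nearest-centroid classifier stands in for your real model, accuracy is the only metric, and the candidate names and toy values are hypothetical.

```python
from statistics import mean


def fit_centroids(rows, labels):
    """Fit a nearest-centroid classifier: one mean feature vector per label."""
    grouped = {}
    for features, label in zip(rows, labels):
        grouped.setdefault(label, []).append(features)
    return {
        label: [mean(column) for column in zip(*group)]
        for label, group in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the feature vector."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=squared_distance)


def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, row) == label for row, label in zip(rows, labels))
    return hits / len(labels)


def add_column(rows, column):
    """Append one external feature value to each row."""
    return [row + [value] for row, value in zip(rows, column)]


def run_trials(train_x, train_y, valid_x, valid_y, candidates):
    """Train once without enrichment, then once per candidate column."""
    baseline = fit_centroids(train_x, train_y)
    results = {"baseline": accuracy(baseline, valid_x, valid_y)}
    for name, (train_col, valid_col) in candidates.items():
        model = fit_centroids(add_column(train_x, train_col), train_y)
        results[name] = accuracy(model, add_column(valid_x, valid_col), valid_y)
    return results


# Toy first-party data and two hypothetical candidates (illustrative only).
train_x, train_y = [[1.0], [2.0], [1.2], [2.2]], [0, 0, 1, 1]
valid_x, valid_y = [[1.1], [2.1], [1.9], [1.3]], [0, 1, 0, 1]
candidates = {
    "candidate_a": ([0.0, 0.1, 1.0, 0.9], [0.1, 0.9, 0.0, 1.0]),
    "candidate_b": ([0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]),
}

results = run_trials(train_x, train_y, valid_x, valid_y, candidates)
best = max(results, key=results.get)
print(results, best)
```

Comparing each candidate against the unenriched baseline on held-out data is what makes the result empirical: a candidate is only worth consuming if it improves validation metrics, not just training metrics.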