A core aspect of using Arcus Model Enrichment is running trials. The Arcus Data Platform matches your first-party data and model to the data candidates that are most likely to improve your model’s performance. These candidates are selected to maximize data quality, compatibility, and relevance to your application.
After a trial has completed, you can validate each candidate by seeing how it performs empirically in your application. This way, you understand exactly how much value each data candidate adds for your task. During a trial, Arcus runs training loops for your model enriched with each data candidate and measures the key performance metrics that matter to your application, such as validation accuracy and loss. Based on these results, you evaluate the impact of each data candidate on your model’s performance and choose the best candidate for your application. The results of your trial are visible in the Arcus UI and through the Arcus Model SDK.
Let’s walk through how to run a trial with the Arcus Model SDK. Before we get started, you should create a model enrichment project on the Arcus platform (request early access here) and have your Project ID and API Key ready.
First, you configure your environment to connect the Model SDK to your Arcus Project. To do this, you wrap your existing model with an Arcus `Model` object. This `Model` object makes no changes to your original model, but configures it to consume the external data that will be used during the trial.
Let’s look at an example using PyTorch. In a few lines of code, we’ll initialize an Arcus `Config` object and use it to wrap our existing model with an Arcus `Model` object. The `Config` object takes in your Project ID and API Key, which you can find in the Arcus platform.
```python
import arcus

# Initialize your original PyTorch model
my_model = MyModel()

# Set the Config object
arcus_config = arcus.model.shared.Config(
    api_key='MY_API_KEY',
    project_id='MY_PROJECT_ID',
)

# Wrap the model with an Arcus Model
arcus_model = arcus.model.torch.Model(my_model, arcus_config)
```
The `arcus_model` object contains your existing model and provides the full underlying API, but with the added functionality of combining your first-party data with external data provided by the platform.
Before running a trial, the `arcus_model` object will not contain any external data and will behave exactly as the original `my_model` object. This is because it has not yet been matched with any data candidates from the platform. At the start of our trial, the platform matches the model and first-party data to external data candidates, which are provided to the `arcus_model` to train and validate the model. The trialing process will then run training and validation loops to evaluate the performance of each data candidate.
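The wrapping behavior described above can be pictured with a plain-Python delegation sketch. This is not the Arcus implementation: `TinyModel`, `EnrichedModel`, and `external_features` are hypothetical names, used only to show how a wrapper can leave the original model untouched and behave identically until external data is attached.

```python
class TinyModel:
    """Stand-in for an existing model: scales the sum of its input features."""

    def __init__(self, scale):
        self.scale = scale

    def predict(self, features):
        return self.scale * sum(features)


class EnrichedModel:
    """Hypothetical wrapper that holds the original model plus optional external data."""

    def __init__(self, model):
        self._model = model            # the original model is stored, never modified
        self.external_features = None  # nothing is attached before a trial

    def __getattr__(self, name):
        # Invoked only when normal lookup fails, so the wrapper exposes
        # the wrapped model's attributes and methods unchanged.
        if name == "_model":
            raise AttributeError(name)
        return getattr(self._model, name)

    def predict(self, features):
        if self.external_features is None:
            # No external data yet: behave exactly like the original model.
            return self._model.predict(features)
        # External data attached: extend the inputs before predicting.
        return self._model.predict(list(features) + list(self.external_features))


my_model = TinyModel(scale=2.0)
wrapped = EnrichedModel(my_model)
before = wrapped.predict([1.0, 2.0])   # same result as my_model.predict([1.0, 2.0])
wrapped.external_features = [3.0]      # simulate a matched data candidate
after = wrapped.predict([1.0, 2.0])    # now includes the external feature
```

The original `my_model` object is still usable on its own after wrapping, which mirrors the statement that the wrapper makes no changes to it.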
Now that you’ve configured the model, you are ready to run a trial with Arcus. Using PyTorch Lightning to train the model, the original model training code might look like the following:
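```python
import pytorch_lightning as pl

# Wrap the original model in a LightningModule and train on first-party data
my_lightning_module = MyLightningModule(my_model)
my_trainer = pl.Trainer()
my_trainer.fit(
    my_lightning_module,
    train_dataloader,
    val_dataloader,
)
```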
This snippet uses a PyTorch Lightning `Trainer` to train the model. The trainer runs over the first-party data contained in the `train_dataloader` and `val_dataloader` objects and trains the model using the `MyLightningModule` class, which contains the model’s training and validation loops.
Now, using the wrapped `arcus_model`, you can modify these three lines of code to run a trial with Arcus. During this trial, the platform matches and composes high-value, relevant data candidates that are compatible with your model and first-party data, and passes them to the `arcus_model` to train and validate the model. The metrics reported in the `MyLightningModule` class are posted to Arcus, where they are used to evaluate and select the best data candidate.
```python
arcus_module = MyLightningModule(arcus_model)
arcus_trainer = arcus.model.torch.Trainer()
trial = arcus_trainer.trial(
    arcus_module,
    train_dataloader,
    val_dataloader,
)
```
The `arcus.model.torch.Trainer` class implements all of the existing functionality of the PyTorch Lightning `Trainer` class, with the additional functionality needed to connect to the platform and run a trial. This makes it easy to modify the original code to enrich the model with external data. You run a trial by calling `trial()` on the Arcus `Trainer`.
Under the hood, `trial()` performs the following steps:
- The model connects to the Arcus Data Platform to retrieve the matched data candidates.
  - Each matched data candidate is composed from several datasets to produce relevant, high-value data for you to evaluate.
- For each matched data candidate:
  - The Arcus `Trainer` merges the external data candidate with your first-party data and then trains and validates the enriched model.
  - During this process, the Arcus `Trainer` reports validation metrics back to Arcus, which will be used to evaluate and select the best data candidate.
    - For best results, `MyLightningModule` should report at least one of validation loss or validation accuracy.
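The steps above can be sketched as a plain-Python loop. This is only an illustration under stated assumptions, not the Arcus implementation: the helper names (`merge_on_key`, `fit_slope`, `validation_loss`, `run_trial`), the `id` join key, the candidate layout, and the one-parameter "model" are all hypothetical, and the data is invented.

```python
from statistics import mean


def merge_on_key(first_party, external, key="id"):
    """Left-join external columns onto first-party rows using a shared key."""
    lookup = {row[key]: row for row in external}
    # First-party values win if both sides define the same column.
    return [{**lookup.get(row[key], {}), **row} for row in first_party]


def fit_slope(rows, feature, target="y"):
    """'Train' a one-parameter model, target ~ slope * feature, by least squares."""
    num = sum(r.get(feature, 0.0) * r[target] for r in rows)
    den = sum(r.get(feature, 0.0) ** 2 for r in rows)
    return num / den if den else 0.0


def validation_loss(rows, feature, slope, target="y"):
    """Mean squared error of the fitted model on held-out rows."""
    return mean((r[target] - slope * r.get(feature, 0.0)) ** 2 for r in rows)


def run_trial(train_rows, val_rows, candidates, key="id"):
    """Train and validate once per candidate; return the metrics for each."""
    results = {}
    for name, candidate in candidates.items():
        feature = candidate["feature"]
        train = merge_on_key(train_rows, candidate["rows"], key)
        val = merge_on_key(val_rows, candidate["rows"], key)
        slope = fit_slope(train, feature)
        results[name] = {"val_loss": validation_loss(val, feature, slope)}
    return results


# Invented first-party data and two invented external candidates.
train_rows = [{"id": 1, "y": 2.0}, {"id": 2, "y": 4.0}]
val_rows = [{"id": 3, "y": 6.0}]
candidates = {
    "informative": {"feature": "g", "rows": [{"id": 1, "g": 1.0}, {"id": 2, "g": 2.0}, {"id": 3, "g": 3.0}]},
    "constant": {"feature": "n", "rows": [{"id": 1, "n": 1.0}, {"id": 2, "n": 1.0}, {"id": 3, "n": 1.0}]},
}
trial_results = run_trial(train_rows, val_rows, candidates)
```

In this toy setup the candidate whose feature tracks the target yields a lower validation loss than the constant one, which is the kind of per-candidate comparison a trial is meant to surface.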
Once you have run a trial, you can quantify exactly how much each data candidate affects the model’s performance, using empirical data. You can use the metrics recorded during the trial to evaluate and select a data candidate for use in your ML application’s training and serving workflows.
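As a rough illustration of that selection step, the sketch below picks a candidate from a dictionary of recorded metrics and compares each candidate with a baseline run on first-party data only. The function names, the `val_loss` and `val_acc` metric names, and all numbers are hypothetical; the actual shape of the results returned by the Arcus Model SDK is not shown here.

```python
def best_candidate(results, metric="val_loss", higher_is_better=False):
    """Return the name of the candidate with the best value for `metric`."""
    scored = {name: m[metric] for name, m in results.items() if metric in m}
    if not scored:
        raise ValueError(f"no candidate reported {metric!r}")
    pick = max if higher_is_better else min
    return pick(scored, key=scored.get)


def change_from_baseline(results, baseline, metric="val_loss"):
    """Difference in `metric` between each candidate and the baseline run."""
    return {
        name: m[metric] - baseline[metric]
        for name, m in results.items()
        if metric in m
    }


# Hypothetical metrics, invented purely for illustration.
baseline = {"val_loss": 0.50, "val_acc": 0.80}
results = {
    "candidate_a": {"val_loss": 0.25, "val_acc": 0.90},
    "candidate_b": {"val_loss": 0.75, "val_acc": 0.70},
}

print(best_candidate(results))                                    # lowest validation loss
print(best_candidate(results, "val_acc", higher_is_better=True))  # highest validation accuracy
print(change_from_baseline(results, baseline))                    # negative loss change = improvement
```

Comparing against a baseline matters because the candidate with the lowest loss is only worth adopting if it also beats the model trained without external data.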