
What is an eval?

An eval is a way of measuring how well your agent performs against a set of criteria. There are two types of evals:
  • Offline evals use historical data to assess agent performance. They are helpful for preventing regressions when an agent is edited.
  • Online evals use real-time data to assess agent performance. They are helpful for monitoring the real-world performance and outputs of your agent.
Currently, Lindy only supports offline evals for your agents, meaning you can check whether the historical tasks you specify pass. Tasks will not be scored in real time.

Key Terminology

In Lindy, an eval is a reference task combined with a set of scorers. The reference task is typically one that performed well, so that you can catch regressions against it. Think of evals as tests that you want to pass every time you deploy a new iteration of an agent. A scorer defines how the eval should be scored.
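To make the terminology concrete, here is a minimal sketch of how a reference task and its scorers relate. This is not Lindy's data model or API: the class names, field names, and the example task ID are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Scorer:
    """Hypothetical scorer: a named instruction describing how to grade."""
    name: str
    prompt: str


@dataclass
class Eval:
    """Hypothetical eval: one reference task plus the scorers applied to it."""
    reference_task_id: str
    scorers: list[Scorer] = field(default_factory=list)

    def add_scorer(self, scorer: Scorer) -> None:
        self.scorers.append(scorer)


# Illustrative values only; "task-123" is a made-up identifier.
example_eval = Eval(reference_task_id="task-123")
example_eval.add_scorer(
    Scorer(name="tone", prompt="Is the reply polite and concise?")
)
```

The point of the sketch is only the relationship: one reference task, one or more scorers attached to it.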

Creating an eval

To create an offline eval, click the test tube icon under any step.
Creating an offline eval by clicking the test tube icon
This switches you to eval creation mode, where you can add scorers at the step level and the task level.
Eval creation mode with step and task level scorers
Create a scorer by clicking New Scorer. LLM-as-a-judge is the only scoring mechanism we support, so to construct your scorer, simply tell the LLM how you want the step to be evaluated.
Creating a new scorer with LLM judge configuration
After setting up the scorer prompt, you’ll want to test the scorer to see the score this step would have received using your prompt. You can adjust your prompt or the model if necessary to ensure the score matches what you expect.
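For readers unfamiliar with the LLM-as-a-judge pattern, the following is a rough sketch of the general idea, not a description of how Lindy implements it. It assumes a 0 to 1 score scale and a `judge` callable that sends a prompt to some model and returns its text reply. Both are assumptions, as are all function names. `fake_judge` is a stand-in so the example runs without a model.

```python
import re
from typing import Callable


def build_judge_prompt(instruction: str, step_output: str) -> str:
    """Combine the scorer instruction with the step output to be graded."""
    return (
        "You are grading one step of an agent run.\n"
        f"Grading instruction: {instruction}\n"
        f"Step output:\n{step_output}\n"
        "Reply with a single number between 0 and 1."
    )


def parse_score(reply: str) -> float:
    """Take the first number in the judge's reply and clamp it to [0, 1]."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"No numeric score found in reply: {reply!r}")
    return min(1.0, max(0.0, float(match.group())))


def score_step(
    instruction: str, step_output: str, judge: Callable[[str], str]
) -> float:
    """Ask the judge to grade the step and return the parsed score."""
    return parse_score(judge(build_judge_prompt(instruction, step_output)))


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same reply."""
    return "Score: 0.8"


score = score_step(
    "Did the step summarize the email accurately?",
    "Summary: the customer asks for a refund.",
    fake_judge,
)
```

Testing a scorer amounts to running this kind of loop on a known step and checking whether the number that comes back matches your own judgment. If it does not, the instruction is usually the first thing to adjust.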

Monitoring and running your evals

To monitor and run your evals, go to the evals tab.
Evals tab showing summary of eval runs and changes
Here, you can see a summary of all of your eval runs and the changes from the previous run. You can also see views for evals that have regressed (the score decreased) or improved (the score increased).
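The regressed and improved views boil down to comparing each eval's score in the current run with its score in the previous run. A minimal sketch of that comparison follows. It assumes scores are plain numbers keyed by an eval identifier, and the function name and bucket names are hypothetical rather than anything the evals tab exposes.

```python
def classify_changes(
    previous: dict[str, float], current: dict[str, float]
) -> dict[str, list[str]]:
    """Bucket each eval in the current run by how its score changed."""
    result: dict[str, list[str]] = {
        "regressed": [],
        "improved": [],
        "unchanged": [],
        "new": [],
    }
    for eval_id, score in current.items():
        if eval_id not in previous:
            result["new"].append(eval_id)  # no earlier score to compare with
        elif score < previous[eval_id]:
            result["regressed"].append(eval_id)  # score decreased
        elif score > previous[eval_id]:
            result["improved"].append(eval_id)  # score increased
        else:
            result["unchanged"].append(eval_id)
    return result


# Illustrative scores only.
changes = classify_changes(
    previous={"eval-a": 0.9, "eval-b": 0.5, "eval-c": 0.7},
    current={"eval-a": 0.6, "eval-b": 0.8, "eval-c": 0.7, "eval-d": 1.0},
)
```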
Eval runs consume credits from your account.
Note: running an evaluation is a safe simulation. It does not execute real actions; it simply simulates how the run would behave with the current version of the agent.