Evaluation Playground

Interactive Model Testing

Select a benchmark question to test how your models query the Memanto memory layer, then evaluate the answer against the ground truth with a secondary LLM Judge.

Context Question

Choose a ground-truth question from the benchmarks. This sets the Agent ID.

Inference LLM

The model generating the initial answer.

API keys are not stored locally.

Judge LLM

The model evaluating the accuracy of the answer.

Ready to evaluate
Review the context question and click Send to test the agent.
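The two-step flow behind the Send button can be sketched as follows. This is a minimal illustration, not the actual Memanto client: the function names, the judge prompt format, and the stub models are all assumptions made for the example.

```python
# Sketch of the playground's evaluate loop: an inference LLM answers the
# benchmark question, then a judge LLM grades the answer against ground
# truth. All names here are illustrative, not the real Memanto API.

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Ground truth: {truth}\n"
    "Candidate answer: {answer}\n"
    "Reply CORRECT or INCORRECT."
)

def evaluate(question, ground_truth, inference_llm, judge_llm):
    """Run the flow: generate an answer, then judge its accuracy."""
    answer = inference_llm(question)          # step 1: inference LLM
    verdict = judge_llm(JUDGE_PROMPT.format(  # step 2: judge LLM
        question=question, truth=ground_truth, answer=answer))
    return answer, verdict.strip().upper() == "CORRECT"

def stub_inference(question):
    # Toy model standing in for a real API-backed inference LLM.
    return "Paris"

def stub_judge(prompt):
    # Toy judge: approve if the candidate answer echoes the ground truth.
    fields = dict(line.split(": ", 1)
                  for line in prompt.splitlines() if ": " in line)
    ok = fields["Ground truth"] in fields["Candidate answer"]
    return "CORRECT" if ok else "INCORRECT"

answer, correct = evaluate("Capital of France?", "Paris",
                           stub_inference, stub_judge)
```

A real deployment would replace the stubs with wrappers around the selected Inference and Judge model APIs, keeping the keys in memory only.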