Experiments

Foil’s experiment feature lets you run A/B tests on prompts, models, and configurations to find what works best.

What Can You Test?

  • Prompts - Compare different system prompts or instructions
  • Models - Test gpt-4o vs. gpt-4o-mini vs. Claude
  • Parameters - Temperature, max tokens, etc.
  • Tools - Different tool configurations
  • Full workflows - Compare entire agent implementations

Creating an Experiment

  1. Go to Experiments in the dashboard
  2. Click Create Experiment
  3. Configure variants (name, traffic weight, configuration)
  4. Set metrics to track (signal names)
  5. Start the experiment

Using Variants in Your Code

const { Foil } = require('@getfoil/foil-js');
const OpenAI = require('openai');

const openai = new OpenAI();
const foil = new Foil({
  apiKey: process.env.FOIL_API_KEY,
  agentName: 'customer-support'
});

async function handleQuery(query, userId) {
  // Get experiment assignment
  const assignment = await foil.getExperimentVariant('prompt-test-v2', userId);

  return await foil.trace(async (ctx) => {
    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [
        { role: 'system', content: assignment.config.systemPrompt },
        { role: 'user', content: query }
      ]
    });

    return response.choices[0].message.content;
  }, {
    // Tag the trace so results can be attributed to this variant
    properties: {
      experimentId: assignment.experimentId,
      variant: assignment.variant
    }
  });
}

Experiment Configuration

Field             | Type   | Description
name              | string | Experiment identifier
description       | string | What you’re testing
variants          | array  | List of variants to test
variants[].name   | string | Variant name
variants[].weight | number | Traffic percentage (0-100)
variants[].config | object | Variant-specific configuration
metrics           | array  | Signal names to track
minimumSampleSize | number | Required samples per variant
maximumDuration   | number | Auto-stop after N days
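A configuration built from these fields might look like the sketch below. The field names come from the table above; the variant names, prompts, and metric names are made-up examples, and how you submit the object (dashboard form vs. API) is not shown here:

```javascript
// Hypothetical experiment configuration using the fields from the table above.
const experimentConfig = {
  name: 'prompt-test-v2',
  description: 'Concise vs. detailed system prompt for support queries',
  variants: [
    { name: 'control', weight: 50, config: { systemPrompt: 'You are a helpful support agent.' } },
    { name: 'detailed', weight: 50, config: { systemPrompt: 'You are a helpful support agent. Cite the relevant docs page in every answer.' } }
  ],
  metrics: ['thumbs_up', 'resolved'],
  minimumSampleSize: 500,
  maximumDuration: 14
};

// Variant weights are traffic percentages, so they should sum to 100.
const totalWeight = experimentConfig.variants.reduce((sum, v) => sum + v.weight, 0);
if (totalWeight !== 100) {
  throw new Error(`Variant weights sum to ${totalWeight}, expected 100`);
}
```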

Viewing Results

The experiment results page in the dashboard shows:
  • Variant performance comparison
  • Statistical significance (p-value)
  • Metric breakdowns
  • Traffic distribution
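To make the significance number concrete, here is a minimal two-proportion z-test in plain JavaScript. The conversion counts are made up, and this is a sketch of the general idea, not Foil’s actual statistics engine (dashboards typically use more robust methods):

```javascript
// Standard normal CDF approximation (Abramowitz & Stegun 26.2.17).
function normalCdf(z) {
  const t = 1 / (1 + 0.2316419 * Math.abs(z));
  const d = 0.3989423 * Math.exp(-z * z / 2);
  const tail = d * t * (0.3193815 + t * (-0.3565638 + t * (1.781478 + t * (-1.821256 + t * 1.330274))));
  return z > 0 ? 1 - tail : tail;
}

// Two-sided p-value comparing the conversion rates of two variants.
function twoProportionPValue(successesA, totalA, successesB, totalB) {
  const pA = successesA / totalA;
  const pB = successesB / totalB;
  const pooled = (successesA + successesB) / (totalA + totalB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / totalA + 1 / totalB));
  const z = (pA - pB) / se;
  // Probability of seeing a difference at least this extreme by chance.
  return 2 * (1 - normalCdf(Math.abs(z)));
}

// Example with made-up counts: 120/1000 vs. 150/1000 conversions.
const p = twoProportionPValue(120, 1000, 150, 1000);
```

A small p-value (conventionally < 0.05) means the observed difference between variants is unlikely to be random noise.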

Best Practices

  • Isolate what you’re testing. If you change both the prompt and the model, you won’t know which caused the difference.
  • Use user IDs or session IDs for assignment so each user sees the same variant consistently.
  • Don’t end experiments early. Wait until you have enough samples and the p-value is meaningful (typically < 0.05).
  • Track more than one metric. A variant might improve one metric while hurting another, so watch quality, user satisfaction, and business metrics.
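Consistent assignment is usually implemented with deterministic hashing: hash the user ID together with the experiment name and map the result into the traffic-weight buckets, so the same user always lands on the same variant. The sketch below shows the general technique (presumably similar to what getExperimentVariant does internally; the FNV-1a hash is chosen here for brevity, not because Foil uses it):

```javascript
// FNV-1a 32-bit hash: a stable, dependency-free string hash.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Deterministically assign a user to a variant based on traffic weights.
function assignVariant(experimentName, userId, variants) {
  // The same (experiment, user) pair always hashes to the same bucket.
  const bucket = fnv1a(`${experimentName}:${userId}`) % 100;
  let cumulative = 0;
  for (const v of variants) {
    cumulative += v.weight;
    if (bucket < cumulative) return v.name;
  }
  return variants[variants.length - 1].name; // guard against rounding gaps
}

const variants = [
  { name: 'control', weight: 50 },
  { name: 'detailed', weight: 50 }
];
const assigned = assignVariant('prompt-test-v2', 'user-42', variants);
```

Because assignment depends only on the hash, no per-user state needs to be stored, and re-running the function during retries or across servers returns the same variant.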

Next Steps