AI onboarding email sequences for SaaS in one week

Most teams trying “AI onboarding email sequences for SaaS” are stuck on one thing: they do not have a code-complete, reviewable slice they can actually ship.

Contents

TL;DR: who this is for and what ships in a week
What success looks like: KPIs and decision owners
Repo quickstart: fork, run, and see a mock send
Event contract and schema gates
Prompt spec and versioning as code
Generation layer: modern API usage with retries and token accounting
Validator suite: layered checks with tests
ESP integrations and safe sending
Canary rollout and experimentation
Observability: metrics, dashboards, and logs
Deliverability warm up and DMARC / ESP guidance
Privacy, legal, and compliance checklist
Cost control and model economics
Operator runbooks and escalation paths
Tests, CI, and release flow for prompt changes
Appendix: artifacts, author, and trust signals
Next steps: choose your first slice and ship

This guide fixes that by walking you to a narrow, production-safe pipeline you can stand up in about a week: one onboarding email, one ESP, 1 percent canary, tested and observable.

TL;DR: who this is for and what ships in a week

Personas

Growth PM or lifecycle marketer who owns activation and trial conversion
Backend / infra engineer who owns events, APIs, and CI
Deliverability specialist or email ops who owns domain reputation
Privacy / legal reviewer who needs clear artifacts, not slides

Scope for week one

1 AI generated onboarding email template (for example, “Day 1: Welcome & first action”)
Single ESP integration (SendGrid or Amazon SES)
JSON event contract with tests
Generation service with retry, token accounting, and idempotent ESP send
Validator stack wired in: syntax, banned phrases, PII, semantic checks
1 percent canary rollout with automated decision rules
Prometheus metrics, Grafana dashboard exports, alert rules
Legal & deliverability artifacts: DPIA outline, DMARC example, footer templates
CI pipeline for prompt changes with fixtures

Timeline (hypothetical, assuming one engineer + one marketer)

Day 1–2: fork repo, hook up ESP sandbox, wire events and schema tests
Day 3–4: prompts, validators, metrics, first end to end run to mock ESP
Day 5–7: canary flag, dashboards, legal/deliverability review, limited live traffic

Everything below is written to support that thin slice. You can widen later.

What success looks like: KPIs and decision owners

You are not just “using a model.” You are changing a production communication channel. Treat it like a feature rollout.

Core KPIs

Activation rate for the targeted onboarding step (for example, accounts that reach “Aha” event in 7 days)
Complaint rate delta (ESP reported spam complaints vs baseline template)
Bounce rate delta (hard/soft bounce rate vs baseline)
Generation error rate (fraction of sends that fall back due to validation or model error)
Token cost per email (prompt tokens and completion tokens, each multiplied by its own per-token price for your chosen model)

All metrics should be sliced by cohort = control | ai_canary.

Decision owners

Decision	Metric trigger	Primary owner	Consulted
Flip canary on/off	Complaint_rate_delta, bounce_rate_delta, generation_error_rate	Growth PM	Deliverability, infra
Revert prompt version	Activation drop or quality alerts	Growth PM	Infra
Pause all AI sends	PII incident or reputation risk	Deliverability / Security	Legal, PM
Approve new attributes for personalization	Data catalog / DPIA review	Privacy / Legal	PM, data

Sample SLA and rollback ownership

If complaint_rate_delta exceeds a configurable threshold (for example, two times baseline level) over a rolling window, deliverability can disable AI sends without waiting for PM approval.
If generation_error_rate exceeds a threshold (for example, 5 percent of attempts) in a short window, infra switches traffic to deterministic templates and opens an incident.
Prompt changes ship behind a feature flag; if activation does not improve within a predefined test window, PM reverts to the previous prompt version.

Repo quickstart: fork, run, and see a mock send

You want your growth PM to see a real generated email, with logs and metrics, within an hour. The structure below is designed for a public Git repo, a downloadable zip, and a Docker or Colab experience.

Suggested repo layout

ai-onboarding-pipeline/
  README.md
  requirements.txt
  docker/
    Dockerfile
    docker-compose.yml
  notebooks/
    power_check.ipynb
  src/
    config.py
    events/schema.py
    events/onboarding_schema.json
    events/samples/
    prompts/
      onboarding_day1.yaml
    generation/
      client.py
      models.py
    validation/
      syntax.py
      banned_phrases.py
      pii.py
      semantic.py
      pipeline.py
    esp/
      sendgrid_client.py
      ses_client.py
      mock_esp.py
    rollout/
      cohorts.py
      canary_policy.py
    observability/
      metrics.py
      logging_config.py
  dashboards/
    grafana_onboarding.json
    prometheus_rules.yml
  legal_deliverability/
    dpia_outline.md
    dmarc_example.txt
    dmarc_parser.py
    footers/
      base_footer_en.html
  .github/workflows/
    ci.yml
  tests/
    test_schema.py
    test_prompts.py
    test_validation.py
    test_esp.py
    test_rollout.py

Docker / docker compose quickstart

# docker/Dockerfile
FROM python:3.12-slim

WORKDIR /app
COPY . /app

RUN pip install --no-cache-dir -r requirements.txt

ENV PYTHONUNBUFFERED=1

CMD ["python", "-m", "src.esp.mock_esp"]

# docker/docker-compose.yml
version: "3.9"
services:
  pipeline:
    build:
      context: ..
      dockerfile: docker/Dockerfile
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ESP_MODE=mock
    ports:
      - "8000:8000"

Colab / Jupyter demo flow

Structure your notebook to do this in order:

Load a sample event payload from events/samples/trial_signup.json
Run schema validation
Call the generation module to produce subject + body
Run validators and inspect failures
Send to mock ESP endpoint and show sample response
Display generated metrics from a short run

Event contract and schema gates

The event schema is the backbone. Most failures later in the pipeline start with messy or drifting events.

JSON Schema for onboarding event

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "OnboardingEmailContext",
  "type": "object",
  "required": ["user_id", "email", "plan", "signup_ts", "product_usage"],
  "properties": {
    "user_id": { "type": "string", "minLength": 1 },
    "email": { "type": "string", "format": "email" },
    "locale": { "type": "string", "pattern": "^[a-z]{2}(-[A-Z]{2})?$" },
    "plan": { "type": "string", "enum": ["free", "trial", "pro", "enterprise"] },
    "signup_ts": { "type": "string", "format": "date-time" },
    "product_usage": {
      "type": "object",
      "required": ["has_completed_tutorial", "events_last_24h"],
      "properties": {
        "has_completed_tutorial": { "type": "boolean" },
        "events_last_24h": { "type": "integer", "minimum": 0 }
      }
    },
    "consents": {
      "type": "object",
      "properties": {
        "email_marketing": { "type": "boolean" }
      }
    }
  },
  "additionalProperties": false
}

Python schema gate and test harness

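# src/events/schema.py
from jsonschema import Draft202012Validator
import json
from pathlib import Path

SCHEMA_PATH = Path(__file__).with_name("onboarding_schema.json")
SCHEMA = json.loads(SCHEMA_PATH.read_text())
# "format" keywords (email, date-time) are only enforced when a format checker is
# passed. Some formats need optional extras: pip install "jsonschema[format]".
VALIDATOR = Draft202012Validator(
  SCHEMA, format_checker=Draft202012Validator.FORMAT_CHECKER
)

def validate_event(payload: dict) -> None:
  # Sort on stringified path segments so mixed str/int paths compare safely.
  errors = sorted(VALIDATOR.iter_errors(payload), key=lambda e: [str(p) for p in e.path])
  if errors:
    messages = [f"{'/'.join(map(str, e.path))}: {e.message}" for e in errors]
    raise ValueError(f"Event schema validation failed: {messages}")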

# tests/test_schema.py
import json
from pathlib import Path
import pytest
from src.events.schema import validate_event

def load_sample(name: str) -> dict:
  return json.loads((Path(__file__).parents[1] / "src/events/samples" / name).read_text())

def test_valid_event_passes():
  payload = load_sample("trial_signup.json")
  validate_event(payload)

def test_missing_email_fails():
  payload = load_sample("trial_signup.json")
  payload.pop("email", None)
  with pytest.raises(ValueError):
    validate_event(payload)

CI job to block schema drift

In .github/workflows/ci.yml add a job that runs pytest tests/test_schema.py for any change under src/events/. Require this job for merge. Any incompatible change fails the pull request before it reaches production.

Prompt spec and versioning as code

Prompts are code. Treat them like code.

Prompt spec structure

# src/prompts/onboarding_day1.yaml
version: "1.1.0"
status: "canary"  # canary | stable | archived
owner: "growth@example.com"
model_hint: "gpt-4.1-mini"
locale: "en-US"

input_contract:
  schema_ref: "events/onboarding_schema.json"
  fixture_inputs:
    - "events/samples/trial_signup.json"

style_guidelines:
  tone: "concise, practical, friendly, no hype"
  banned_phrases:
    - "limited time offer"
    - "act now"
  required_elements:
    - "one clear CTA link"
    - "short summary of feature value"
    - "preheader text"

system_prompt: |
  You write onboarding emails for a SaaS product.
  Constraints:
  - Do not make guarantees about uptime or security beyond what is given.
  - Respect locale and plan.
  - Avoid urgency or false scarcity.

user_template: |
  Write a welcome email for the following user context (JSON):

  {{ event_json }}

  Return JSON with keys: subject, preheader, html_body.

expected_output_checks:
  subject:
    max_length: 80
  html_body:
    must_include:
      - "<a "
      - "Get started"

Prompt fixtures and CI

# tests/test_prompts.py
import json
from pathlib import Path
import yaml
from src.generation.client import generate_email

PROMPTS_DIR = Path("src/prompts")

def test_prompt_fixtures_generate_valid_shape(monkeypatch):
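  # Use a cheap stub model in CI. This assumes ModelClient honors
  # MODEL_PROVIDER_MODE=stub and returns canned JSON; add that switch to the
  # client before relying on this test, or CI will call the real provider.
  monkeypatch.setenv("MODEL_PROVIDER_MODE", "stub")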

  for prompt_file in PROMPTS_DIR.glob("*.yaml"):
    spec = yaml.safe_load(prompt_file.read_text())
    for fixture in spec["input_contract"]["fixture_inputs"]:
      # Fixture paths in the spec are already relative to src/ (events/samples/...).
      payload = json.loads((Path("src") / fixture).read_text())
      result = generate_email(spec, payload)
      assert set(result.keys()) == {"subject", "preheader", "html_body"}
      assert len(result["subject"]) <= spec["expected_output_checks"]["subject"]["max_length"]

Pull requests that change a prompt must pass these fixture tests before merging.

Generation layer: modern API usage with retries and token accounting

This is how you interface with a model provider in 2026 without surprising costs or brittle behavior.

Model client abstraction

# src/generation/models.py
from dataclasses import dataclass
from typing import Dict, Any, Tuple
import os
import time
import logging

from openai import OpenAI  # official SDK

log = logging.getLogger(__name__)

@dataclass
class ModelResponse:
  content: str
  prompt_tokens: int
  completion_tokens: int
  model: str
  latency_ms: float

class ModelClient:
  def __init__(self, model: str):
    self.model = model
    self.client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

  def generate(self, system_prompt: str, user_prompt: str) -> ModelResponse:
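    start = time.perf_counter()  # monotonic clock, safe for measuring durations
    response = self.client.responses.create(
      model=self.model,
      input=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
      ],
      max_output_tokens=800,
      temperature=0.4
    )
    latency_ms = (time.perf_counter() - start) * 1000
    # output_text joins the text parts of the response; indexing output[0]
    # breaks when the first output item is not a message.
    out = response.output_text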
    usage = response.usage
    log.info(
      "model.generate",
      extra={
        "model": self.model,
        "prompt_tokens": usage.input_tokens,
        "completion_tokens": usage.output_tokens,
        "latency_ms": latency_ms,
      },
    )
    return ModelResponse(
      content=out,
      prompt_tokens=usage.input_tokens,
      completion_tokens=usage.output_tokens,
      model=self.model,
      latency_ms=latency_ms,
    )

Generation with retry and idempotency hook

# src/generation/client.py
import json
import logging
import uuid
from typing import Dict, Any
import yaml

from tenacity import retry, wait_exponential, stop_after_attempt

from .models import ModelClient
from src.validation.pipeline import validate_generated_email

log = logging.getLogger(__name__)

def load_prompt_spec(path: str) -> Dict[str, Any]:
  import pathlib
  p = pathlib.Path(path)
  return yaml.safe_load(p.read_text())

def make_idempotency_key(user_id: str, template_id: str) -> str:
  return f"{template_id}:{user_id}"

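# reraise=True surfaces the original exception after the last attempt
# instead of wrapping it in tenacity.RetryError.
@retry(wait=wait_exponential(multiplier=0.5, min=1, max=8),
       stop=stop_after_attempt(3),
       reraise=True)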
def generate_email(prompt_spec: Dict[str, Any], event: Dict[str, Any]) -> Dict[str, str]:
  system_prompt = prompt_spec["system_prompt"]
  user_template = prompt_spec["user_template"]
  user_prompt = user_template.replace("{{ event_json }}", json.dumps(event, sort_keys=True))

  client = ModelClient(prompt_spec["model_hint"])
  resp = client.generate(system_prompt, user_prompt)

  try:
    parsed = json.loads(resp.content)
  except json.JSONDecodeError as e:
    log.warning(
      "generation.invalid_json",
      extra={"error": str(e), "raw": resp.content[:300]},
    )
    raise

  email = {
    "subject": parsed.get("subject", "").strip(),
    "preheader": parsed.get("preheader", "").strip(),
    "html_body": parsed.get("html_body", ""),
  }

  validate_generated_email(email, event, resp)

  return email

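The retry decorator re-runs the whole function on any exception. That covers transient model errors, but it also regenerates after invalid JSON or a validator failure, and every attempt costs tokens. Idempotency is enforced at the send layer, keyed on a stable key built from template id and user id (make_idempotency_key).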

Token accounting

Store measured prompt_tokens and completion_tokens as Prometheus histograms and per send logs. Vendors often price input and output tokens separately, so cost per email is then:

cost_per_email = avg_prompt_tokens * input_price_per_token + avg_completion_tokens * output_price_per_token

Use hypothetical ranges while planning capacity. For instance, suppose you see 300 prompt tokens and 250 completion tokens per email. Convert your vendor's listed prices (commonly quoted per 1k or per 1M tokens) to per-token figures, then multiply out for expected monthly email volume.
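The arithmetic above can be sketched as a small helper. This is a minimal sketch, assuming input and output tokens are priced separately and quoted per 1M tokens; `TokenPrices`, `cost_per_email`, `monthly_cost`, and the `retry_overhead` parameter are illustrative names, and the prices in the usage note are placeholders rather than any vendor's real rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrices:
    """Hypothetical prices in USD per 1 million tokens; use your vendor's list."""
    input_per_million: float
    output_per_million: float


def cost_per_email(prompt_tokens: float, completion_tokens: float,
                   prices: TokenPrices) -> float:
    """Cost of one generation, pricing input and output tokens separately."""
    total = (prompt_tokens * prices.input_per_million
             + completion_tokens * prices.output_per_million)
    return total / 1_000_000


def monthly_cost(emails_per_month: int, prompt_tokens: float,
                 completion_tokens: float, prices: TokenPrices,
                 retry_overhead: float = 0.0) -> float:
    """Projected monthly spend.

    retry_overhead is the assumed fraction of extra generations caused by
    retries (0.1 means 10 percent more calls than emails sent).
    """
    per_email = cost_per_email(prompt_tokens, completion_tokens, prices)
    return emails_per_month * per_email * (1 + retry_overhead)
```

For example, `monthly_cost(10_000, 300, 250, TokenPrices(1.0, 2.0))` projects spend for 10,000 emails at placeholder prices. Feed it the rolling averages from your token histograms rather than guesses once you have real traffic.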

Validator suite: layered checks with tests

Validation is where you keep AI from hurting your reputation. Use a layered “validation pyramid” and emit metrics at each layer.

Syntactic validators

# src/validation/syntax.py
from typing import Dict

def check_lengths(email: Dict[str, str]) -> None:
  if len(email["subject"]) > 80:
    raise ValueError("Subject too long")
  if len(email["html_body"]) > 12000:
    raise ValueError("HTML body too long")

def check_basic_html(email: Dict[str, str]) -> None:
  body = email["html_body"]
  if "<script" in body.lower():
    raise ValueError("Script tags not allowed")

Banned phrase engine

# src/validation/banned_phrases.py
from typing import Dict, List

DEFAULT_BANNED = [
  "100% guaranteed",
  "act now",
  "risk-free",
]

def check_banned_phrases(email: Dict[str, str],
                         extra_banned: List[str] | None = None) -> None:
  phrases = set(DEFAULT_BANNED + (extra_banned or []))
  haystack = (email["subject"] + " " + email["html_body"]).lower()
  hits = [p for p in phrases if p.lower() in haystack]
  if hits:
    raise ValueError(f"Banned phrases detected: {hits}")

Keep this list small and opinionated. Let marketers tune it per template instead of global hardcoding.

PII detection: regex + NER

# src/validation/pii.py
import re
from typing import Dict

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s\-]{7,}\d")

def detect_pii(text: str) -> dict:
  return {
    "emails": EMAIL_RE.findall(text),
    "phones": PHONE_RE.findall(text),
  }

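def check_pii(email: Dict[str, str], user: Dict[str, str]) -> None:
  # Scan the subject as well as the body; a leaked address is just as harmful there.
  text = email.get("subject", "") + " " + email["html_body"]
  pii = detect_pii(text)
  user_email = (user.get("email") or "").lower()
  # Allow the user's own email if that is consistent with your template style.
  # Compare case-insensitively so "User@Example.com" still matches.
  found_emails = [e for e in pii["emails"] if e.lower() != user_email]
  if found_emails:
    raise ValueError("Unexpected email addresses in output")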

For higher accuracy, add a small local NER model or hosted classifier and treat it as an extra signal. Regex covers many obvious incidents with low overhead. For logs, use a redact-before-log pattern so PII does not land in plain text.
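One way to apply redaction before anything is written is a standard library logging filter. This is a minimal sketch, not a complete PII scrubber: it only masks email addresses, and `RedactingFilter`, `redact`, and `REDACTED_EXTRA_KEYS` are hypothetical names. The extra keys listed match the `extra=` fields used in the snippets in this guide (`raw`, `error`, `body`); adjust them to your own log schema.

```python
import logging
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical list of `extra=` keys that may carry raw model or ESP text.
REDACTED_EXTRA_KEYS = ("raw", "error", "body")


def redact(text: str) -> str:
    """Replace anything that looks like an email address with a marker."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


class RedactingFilter(logging.Filter):
    """Redact the rendered message and selected extra fields in place."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its args first so PII in args is covered too.
        record.msg = redact(record.getMessage())
        record.args = ()
        for key in REDACTED_EXTRA_KEYS:
            value = getattr(record, key, None)
            if isinstance(value, str):
                setattr(record, key, redact(value))
        return True  # never drop the record, only rewrite it


def install_redaction(handler: logging.Handler) -> None:
    """Attach the filter to a handler so every record it emits is redacted."""
    handler.addFilter(RedactingFilter())
```

Attach the filter to handlers rather than to a single logger: logger-level filters do not run for records that propagate up from child loggers, while handler-level filters see everything the handler writes.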

Semantic checks

# src/validation/semantic.py
from typing import Dict
from dataclasses import dataclass

@dataclass
class SemanticResult:
  toxicity_score: float
  off_policy: bool

def semantic_guardrails(email: Dict[str, str]) -> SemanticResult:
  # Placeholder: call your safety classifier here.
  # In tests, stub this so CI is deterministic.
  return SemanticResult(toxicity_score=0.0, off_policy=False)

def check_semantic(email: Dict[str, str]) -> None:
  result = semantic_guardrails(email)
  if result.off_policy:
    raise ValueError("Semantic safety violation")

Validation pipeline and metrics

# src/validation/pipeline.py
from typing import Dict
import logging

from .syntax import check_lengths, check_basic_html
from .banned_phrases import check_banned_phrases
from .pii import check_pii
from .semantic import check_semantic
from src.observability.metrics import VALIDATION_COUNTER

log = logging.getLogger(__name__)

def validate_generated_email(email: Dict[str, str],
                             event: Dict[str, str],
                             model_resp) -> None:
  layers = [
    ("syntax", lambda: (check_lengths(email), check_basic_html(email))),
    ("banned_phrases", lambda: check_banned_phrases(email)),
    ("pii", lambda: check_pii(email, event)),
    ("semantic", lambda: check_semantic(email)),
  ]
  for name, fn in layers:
    try:
      fn()
      VALIDATION_COUNTER.labels(layer=name, status="pass").inc()
    except Exception as e:
      VALIDATION_COUNTER.labels(layer=name, status="fail").inc()
      log.warning("validation.failed", extra={"layer": name, "error": str(e)})
      raise

Validator tests

# tests/test_validation.py
import pytest
from src.validation.syntax import check_lengths
from src.validation.banned_phrases import check_banned_phrases

def test_subject_length_violation():
  email = {"subject": "x" * 200, "html_body": "<p>hi</p>"}
  with pytest.raises(ValueError):
    check_lengths(email)

def test_banned_phrase_detected():
  email = {"subject": "Act now", "html_body": "<p>hi</p>"}
  with pytest.raises(ValueError):
    check_banned_phrases(email)

Use fixtures for borderline cases and involve marketing to tune false positives over time.

ESP integrations and safe sending

Your model is only half the story. ESP idempotency and error handling prevent double sends and broken campaigns.

SendGrid integration with idempotency

# src/esp/sendgrid_client.py
import os
import logging
import requests
from typing import Dict

log = logging.getLogger(__name__)

SENDGRID_API_URL = "https://api.sendgrid.com/v3/mail/send"

class SendGridClient:
  def __init__(self):
    self.api_key = os.environ["SENDGRID_API_KEY"]

  def send_email(self, email: Dict[str, str],
                 to_email: str,
                 idempotency_key: str) -> Dict:
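    # SendGrid's v3 Mail Send endpoint does not document an Idempotency-Key
    # header, so deduplicate on your side before calling it. The key travels as
    # a custom argument so it appears in event webhooks for reconciliation.
    headers = {
      "Authorization": f"Bearer {self.api_key}",
      "Content-Type": "application/json",
    }
    payload = {
      "personalizations": [{"to": [{"email": to_email}]}],
      "from": {"email": os.environ["FROM_EMAIL"]},
      "subject": email["subject"],
      "content": [{"type": "text/html", "value": email["html_body"]}],
      "custom_args": {"idempotency_key": idempotency_key},
    }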
    resp = requests.post(SENDGRID_API_URL, json=payload, headers=headers, timeout=10)
    if resp.status_code not in (200, 202):
      log.error(
        "sendgrid.send_failed",
        extra={"status": resp.status_code, "body": resp.text[:300]},
      )
      raise RuntimeError("SendGrid send failed")
    log.info(
      "sendgrid.send_success",
      extra={"idempotency_key": idempotency_key, "status": resp.status_code},
    )
    return {"status": resp.status_code}

SES example

# src/esp/ses_client.py
import os
import logging
from typing import Dict
import boto3

log = logging.getLogger(__name__)

class SESClient:
  def __init__(self):
    self.client = boto3.client("ses", region_name=os.environ.get("AWS_REGION"))

  def send_email(self, email: Dict[str, str],
                 to_email: str,
                 idempotency_key: str) -> Dict:
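    params = {
      "Source": os.environ["FROM_EMAIL"],
      "Destination": {"ToAddresses": [to_email]},
      "Message": {
        "Subject": {"Data": email["subject"]},
        "Body": {"Html": {"Data": email["html_body"]}},
      },
    }
    # boto3 rejects None for ConfigurationSetName, so only pass it when set.
    config_set = os.environ.get("SES_CONFIG_SET")
    if config_set:
      params["ConfigurationSetName"] = config_set
    # SES has no idempotency parameter; the key is only logged for reconciliation.
    resp = self.client.send_email(**params)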
    message_id = resp.get("MessageId")
    log.info(
      "ses.send_success",
      extra={
        "idempotency_key": idempotency_key,
        "message_id": message_id,
      },
    )
    return {"message_id": message_id}

ESP tests

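# tests/test_esp.py
import pytest
from src.esp.sendgrid_client import SendGridClient

def test_sendgrid_handles_non_2xx(mocker, monkeypatch):  # mocker needs pytest-mock
  # The client reads these at construction and send time.
  monkeypatch.setenv("SENDGRID_API_KEY", "test-key")
  monkeypatch.setenv("FROM_EMAIL", "noreply@example.com")
  client = SendGridClient()
  mock_post = mocker.patch("src.esp.sendgrid_client.requests.post")
  mock_post.return_value.status_code = 500
  mock_post.return_value.text = "error"
  email = {"subject": "Hi", "html_body": "<p>hi</p>"}
  # pytest.raises fails the test if no exception is raised; try/except would not.
  with pytest.raises(RuntimeError):
    client.send_email(email, "user@example.com", "key123")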

Wire idempotency keys through your end to end span and store them alongside ESP message ids for later reconciliation.
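Storing keys next to ESP message ids also gives you a place to enforce send-once behavior yourself. The sketch below is one possible shape, assuming a single SQLite file is acceptable for a week-one slice; `SendLedger` and `send_once` are hypothetical names, and `send_fn` stands in for whichever ESP client call you use.

```python
import sqlite3
from typing import Callable, Optional


class SendLedger:
    """Tracks which idempotency keys have been claimed and their ESP message ids."""

    def __init__(self, db_path: str = ":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS sends ("
            "idempotency_key TEXT PRIMARY KEY, esp_message_id TEXT)"
        )
        self.conn.commit()

    def claim(self, key: str) -> bool:
        """Return True only for the first caller to claim this key."""
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO sends (idempotency_key) VALUES (?)", (key,)
        )
        self.conn.commit()
        return cur.rowcount == 1

    def record_message_id(self, key: str, message_id: str) -> None:
        self.conn.execute(
            "UPDATE sends SET esp_message_id = ? WHERE idempotency_key = ?",
            (message_id, key),
        )
        self.conn.commit()

    def release(self, key: str) -> None:
        """Drop a claim so a later retry can send again."""
        self.conn.execute("DELETE FROM sends WHERE idempotency_key = ?", (key,))
        self.conn.commit()


def send_once(ledger: SendLedger, key: str,
              send_fn: Callable[[], str]) -> Optional[str]:
    """Call send_fn at most once per key; return the message id or None if skipped."""
    if not ledger.claim(key):
        return None  # already sent, or another worker is sending
    try:
        message_id = send_fn()
    except Exception:
        ledger.release(key)  # the send failed, allow a retry
        raise
    ledger.record_message_id(key, message_id)
    return message_id
```

Releasing the claim on failure favors delivery over strict deduplication: if the ESP accepted the message but the response timed out, a retry can send twice. If duplicates are worse than a missed email for your use case, keep the claim and reconcile manually instead.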

Canary rollout and experimentation

You do not flip all onboarding traffic to AI in one go. You start with a small, stable canary.

Cohort assignment with stable hashing

# src/rollout/cohorts.py
import hashlib

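def assign_cohort(user_id: str, experiment_name: str, canary_percent: float) -> str:
  # canary_percent is on a 0-100 scale, so a 1 percent canary is canary_percent=1.0.
  key = f"{experiment_name}:{user_id}"
  h = hashlib.sha256(key.encode("utf-8")).hexdigest()
  # Divide by 2**32 (not 0xFFFFFFFF) so the bucket lies in [0, 100) and
  # canary_percent=100 admits every user.
  bucket = int(h[:8], 16) / 2**32 * 100
  return "ai_canary" if bucket < canary_percent else "control"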

Automated canary decision rules

Implement a daily job that reads metrics and applies a simple rule set such as the following matrix, using hypothetical thresholds:

If complaint_rate_delta > threshold for 2 consecutive days, set canary_percent to 0 and revert prompt version.
If bounce_rate_delta > threshold, restrict to known engaged users or pause.
If activation_delta is positive and safety metrics are stable, gradually raise canary_percent.

Map each outcome to a runbook step so operators know which toggle to flip.
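The rule set above can be expressed as a pure function that the daily job calls with recent metrics. This is a sketch under stated assumptions: all thresholds are hypothetical inputs, `DailyMetrics`, `CanaryPolicy`, `Decision`, and `decide` are illustrative names, percentages are on a 0 to 100 scale, and the bounce rule takes the "pause" branch rather than restricting to engaged users.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DailyMetrics:
    complaint_rate_delta: float  # ai_canary minus control
    bounce_rate_delta: float
    activation_delta: float


@dataclass(frozen=True)
class CanaryPolicy:
    complaint_threshold: float
    bounce_threshold: float
    step_percent: float = 1.0   # how much to ramp per good day
    max_percent: float = 10.0   # ceiling for automated ramping


@dataclass(frozen=True)
class Decision:
    canary_percent: float
    action: str  # "rollback" | "pause" | "ramp" | "hold"


def decide(history: List[DailyMetrics], current_percent: float,
           policy: CanaryPolicy) -> Decision:
    """Apply the rules to a history ordered oldest to newest, one entry per day."""
    if not history:
        return Decision(current_percent, "hold")
    today = history[-1]
    last_two = history[-2:]
    # Complaints over threshold on two consecutive days: stop and revert the prompt.
    if len(last_two) == 2 and all(
        d.complaint_rate_delta > policy.complaint_threshold for d in last_two
    ):
        return Decision(0.0, "rollback")
    # Bounces over threshold: pause (this sketch does not model engaged-only sends).
    if today.bounce_rate_delta > policy.bounce_threshold:
        return Decision(0.0, "pause")
    # Positive activation with stable safety metrics: ramp gradually.
    complaints_ok = today.complaint_rate_delta <= policy.complaint_threshold
    if today.activation_delta > 0 and complaints_ok:
        ramped = min(current_percent + policy.step_percent, policy.max_percent)
        return Decision(ramped, "ramp")
    return Decision(current_percent, "hold")
```

Keeping the decision pure (metrics in, decision out) makes it easy to unit test in tests/test_rollout.py and to log the exact inputs behind every automated flag change.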

Power check notebook

In notebooks/power_check.ipynb, parameterize:

Baseline activation rate
Desired relative lift
Significance level (alpha)
Power (1 minus beta)

Use a standard two proportion test formula to estimate required sample sizes. The aim is not perfect statistics but to avoid tests with so little traffic that you draw false comfort from noise.
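A minimal version of that calculation, using the normal approximation for a two-sided two proportion z-test, might look like the following. It assumes equal-sized arms, which a 1 percent canary is not, so treat the result as a floor for the smaller arm; `sample_size_per_arm` is an illustrative name, not part of any library.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate: float, relative_lift: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm to detect the lift.

    Normal approximation for a two-sided two proportion z-test with
    equal-sized arms.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("Rates must be strictly between 0 and 1")
    if p1 == p2:
        raise ValueError("relative_lift must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)
```

Call it with your own numbers, for example `sample_size_per_arm(0.20, 0.10)` for a 20 percent baseline and a 10 percent relative lift, then compare the result with how many users a 1 percent canary actually receives per week.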

Observability: metrics, dashboards, and logs

Without observability, you will only discover problems when a big customer complains.

Prometheus instrumentation

# src/observability/metrics.py
from prometheus_client import Counter, Histogram

GEN_LATENCY = Histogram(
  "ai_onboarding_generation_latency_ms",
  "Model generation latency",
  ["model"],
  buckets=(50, 100, 200, 400, 800, 1600, 3200),
)

TOKENS = Histogram(
  "ai_onboarding_tokens",
  "Tokens per email",
  ["type", "model"],  # type = prompt | completion
  buckets=(50, 100, 200, 400, 800, 1600),
)

VALIDATION_COUNTER = Counter(
  "ai_onboarding_validation_events_total",
  "Validation events by layer and status",
  ["layer", "status"],
)

ESP_SENDS = Counter(
  "ai_onboarding_esp_sends_total",
  "ESP send outcomes",
  ["provider", "status"],  # status = success | failure
)

Hook these into the generation and ESP layers. Export metrics via an HTTP endpoint for Prometheus scraping.

Grafana dashboard exports

Include JSON definitions for panels such as:

Generation latency by model over time
Validation failures by layer (stacked bar)
ESP sends success vs failure rate
Complaint and bounce rates by cohort

Operators should be able to import a JSON file and see a working dashboard in minutes.

Prometheus alert rules

# dashboards/prometheus_rules.yml
groups:
  - name: ai-onboarding
    rules:
      - alert: HighGenerationErrors
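        # Failed attempts as a fraction of all attempts (every attempt passes
        # through the syntax layer once). 0.05 is a hypothetical threshold.
        expr: sum(rate(ai_onboarding_validation_events_total{status="fail"}[5m])) / sum(rate(ai_onboarding_validation_events_total{layer="syntax"}[5m])) > 0.05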
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "AI onboarding validation failures increased"
      - alert: ESPFailures
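        # rate() is per second, so compare a failure ratio instead of an absolute
        # count that canary-sized traffic would never reach. 0.05 is hypothetical.
        expr: sum(rate(ai_onboarding_esp_sends_total{status="failure"}[5m])) / sum(rate(ai_onboarding_esp_sends_total[5m])) > 0.05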
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "ESP send failures for AI onboarding"

Sample log schema

{
  "ts": "2026-02-01T10:15:20Z",
  "event": "onboarding_email_sent",
  "user_id": "u_123",
  "template_id": "onboarding_day1",
  "cohort": "ai_canary",
  "idempotency_key": "onboarding_day1:u_123",
  "model": "gpt-4.1-mini",
  "prompt_tokens": 320,
  "completion_tokens": 260,
  "validation_status": "pass",
  "esp_provider": "sendgrid",
  "esp_status": "success"
}

Apply redaction to any field that might include PII beyond the fields you consciously log.

Deliverability warm up and DMARC / ESP guidance

AI content does not give you a pass on deliverability basics. If anything, you should be more conservative.

IP / domain warm up

Start AI traffic on the same authenticated domain and IP pool as your existing onboarding if possible.
Keep initial canary volume small relative to your daily onboarding traffic so patterns do not trigger filters.
Use seed addresses across major ISPs (Gmail, Outlook, Yahoo) to monitor placement.

DMARC, SPF, DKIM, BIMI basics

Ensure SPF includes your ESP.
DKIM signing should be enabled on the sending domain.
DMARC should be configured with a policy aligned with your maturity (for example, start with a monitoring policy while you instrument reporting).

Include a DMARC example for your DNS administrator:

_dmarc.example.com. IN TXT "v=DMARC1; p=none; rua=mailto:dmarc-reports@example.com"

DMARC report parsing script

# legal_deliverability/dmarc_parser.py
import glob
import xml.etree.ElementTree as ET

def parse_aggregate_reports(path_pattern: str):
  for file in glob.glob(path_pattern):
    tree = ET.parse(file)
    root = tree.getroot()
    for record in root.findall("record"):
      source_ip = record.find("row/source_ip").text
      disposition = record.find("row/policy_evaluated/disposition").text
      yield {"source_ip": source_ip, "disposition": disposition}

Use this to spot unexpected sending sources and alignment issues.
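To turn parsed records into something an operator can scan, you can count records from IPs you do not recognize. A small sketch, assuming records shaped like the dicts yielded by the parser (`source_ip` and `disposition` keys) and a hand-maintained allowlist; `unexpected_sources` and `known_ips` are hypothetical names, and the addresses in the usage note are documentation-range placeholders.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def unexpected_sources(records: Iterable[Dict[str, str]],
                       known_ips: Set[str]) -> Counter:
    """Count records per (source_ip, disposition) for IPs outside the allowlist.

    This counts report records, not messages: the records here carry no
    per-record message count.
    """
    counts: Counter = Counter()
    for record in records:
        ip = record["source_ip"]
        if ip not in known_ips:
            counts[(ip, record["disposition"])] += 1
    return counts
```

For example, `unexpected_sources(records, known_ips={"192.0.2.10"}).most_common(10)` lists the most frequent unknown senders. Keep the allowlist in version control so changes to your ESP's sending ranges are reviewed.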

Privacy, legal, and compliance checklist

Legal reviewers do not want a marketing deck. They want a list of decisions and artifacts.

Privacy checklist for personalization

Document each attribute used in the prompt (plan, locale, usage flags) and its source system.
Ensure consent for marketing email exists and is enforced at query time.
Define retention for raw model outputs and logs; avoid storing full content longer than needed.
Limit PII passed into prompts. Prefer segment tags over free form descriptive text that includes identifiers.
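The consent and data minimization items above can be enforced in one small gate that runs before any prompt is built. A minimal sketch, assuming events follow the onboarding schema in this guide; `PROMPT_ATTRIBUTES` and `build_prompt_context` are illustrative names, and the allowlist is an example that your privacy review should own.

```python
from typing import Any, Dict, Optional

# Hypothetical allowlist: the only event fields the model is allowed to see.
# Identifiers such as email and user_id are deliberately excluded.
PROMPT_ATTRIBUTES = ("plan", "locale", "product_usage")


def build_prompt_context(event: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the minimized context for the prompt, or None without consent."""
    consents = event.get("consents") or {}
    # Require an explicit True; a missing or null flag counts as no consent.
    if consents.get("email_marketing") is not True:
        return None
    return {key: event[key] for key in PROMPT_ATTRIBUTES if key in event}
```

A `None` result means the caller should skip generation entirely rather than fall back to the full event. Serializing this context instead of the raw event keeps the recipient's address out of the model request.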

Footer templates

<!-- legal_deliverability/footers/base_footer_en.html -->
<table role="presentation" width="100%" cellpadding="0" cellspacing="0">
  <tr>
    <td align="center" style="font-size:12px;color:#888;padding:16px">
      You are receiving this email because you signed up for {{product_name}} with
      {{user_email}}.
      <br/>
      <a href="{{manage_preferences_url}}">Manage preferences</a> |
      <a href="{{unsubscribe_url}}">Unsubscribe</a>
      <br/>
      {{company_name}}, {{company_address}}
    </td>
  </tr>
</table>

DPIA outline

Description of processing: AI generation of onboarding emails using limited behavioral and account data.
Purpose: increase activation while respecting consent and privacy.
Data categories: identifiers (email, user id), product usage metrics, plan details.
Risks: PII leakage in content or logs, unintended profiling, cross border transfers.
Safeguards: validation pipeline, PII redaction, strict access controls on logs, model provider data handling review.
Residual risk and approval: signoff from data protection lead.

Cost control and model economics

AI onboarding is cheap on a per email basis at small scale, but it can surprise you when traffic grows.

Token accounting pattern

Record prompt and completion tokens per send as metrics and logs.
Compute rolling averages and store them as reference values.
Simulate monthly cost with simple formulas using your vendor price list and projected volume.

Personalization depth vs cost vs brittleness

Heavy personalization that uses many user attributes tends to increase prompt size and maintenance cost.
Segment based personalization (for example, by plan + activity cluster) can be generated once and cached per segment.
Use caching for high volume cohorts: you can generate a template per segment and insert only basic identifiers at send time.

A practical pattern is to generate copy at the cohort level and reserve direct per user model calls for key lifecycle moments or high value users.
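Cohort-level generation with per-user insertion can be sketched as a cache keyed by segment. Everything here is illustrative: the activity cut-offs in `activity_cluster` are arbitrary, `generate` stands in for your model call, and `{{first_name}}` is a hypothetical placeholder (first name is not part of the event schema in this guide).

```python
import html
from typing import Callable, Dict, Tuple

SegmentKey = Tuple[str, str]  # (plan, activity cluster)


def activity_cluster(events_last_24h: int) -> str:
    """Bucket usage into coarse clusters; the cut-offs are arbitrary examples."""
    if events_last_24h == 0:
        return "inactive"
    return "light" if events_last_24h < 10 else "active"


class SegmentCopyCache:
    """Generates copy once per segment and reuses it for every user in it."""

    def __init__(self, generate: Callable[[SegmentKey], str]):
        self._generate = generate
        self._cache: Dict[SegmentKey, str] = {}

    def get(self, plan: str, events_last_24h: int) -> str:
        key = (plan, activity_cluster(events_last_24h))
        if key not in self._cache:
            self._cache[key] = self._generate(key)  # one model call per segment
        return self._cache[key]


def render(template: str, first_name: str) -> str:
    """Insert the per-user identifier at send time, escaped for HTML."""
    return template.replace("{{first_name}}", html.escape(first_name))
```

With this shape, model cost scales with the number of segments rather than the number of users, and the cached copy can go through the validators once before any send. Invalidate the cache whenever the prompt version changes.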

Operator runbooks and escalation paths

Incidents will happen. The value is in how quickly you can detect and unwind them.

Runbook: high complaints spike

Alert fires from complaint_rate_delta.
Deliverability confirms in ESP dashboard.
Immediate step: set feature flag to route all traffic to control template.
Revert prompt version to last stable, keep AI off until a new experiment plan is written.
Review content with marketing and legal, check banned phrases and promises.

Runbook: model error spike

Alert fires from validation failures or model errors.
Infra checks model provider status and internal changes.
Immediate step: increase logging detail, route traffic to deterministic templates.
If problem is provider outage, keep AI off and schedule a postmortem to consider multi model fallback.

Runbook: suspected PII leak

Security receives report or sees validation failures due to PII detection.
Pause AI sends and freeze relevant logs.
Engage legal and data protection; follow your incident response playbook.
Audit prompts for data passed into the model and logs for stored content.
Document changes: stricter PII filters, log redaction updates, reduced attribute set.

Escalation matrix

Severity 1: PII incident or major deliverability hit. Security or deliverability leads the incident, with legal, PM, and infra.
Severity 2: Significant performance regression without user harm. Infra and PM lead with deliverability consulting.
Severity 3: Quality issues or minor anomalies. PM and marketing iterate on prompts and validators.

Tests, CI, and release flow for prompt changes

Prompt changes should feel like code changes: small, reviewable, and tested.

CI pipeline outline

# .github/workflows/ci.yml
name: CI

on:
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: pytest tests -q

Include:

Schema validation tests
Prompt fixture tests with stub model
Validator tests
ESP mock tests

Prompt release checklist

Open PR with prompt spec changes and rationale in description.
Include example outputs for one or two representative events.
Ensure CI passes all prompt and validation tests.
Have growth PM and deliverability sign off in comments.
Tag prompt version as canary and ship behind canary flag.
Promote to stable once metrics look acceptable.

Appendix: artifacts, author, and trust signals

Exported artifacts checklist

Grafana dashboard JSON for onboarding metrics
Prometheus alert rule files
Sanitized sample logs and example ESP responses
DPIA outline document and footer HTML templates
Notebook for power checks and example parameter sets

Author credentials and change log

Signed by a senior operator who has shipped and supported email and messaging pipelines in production. Change log should live in CHANGELOG.md with entries like:

v1.0.0 initial repo, single template, SendGrid sandbox support
v1.1.0 added SES integration, semantic validator stub, power check notebook
v1.2.0 DMARC parser example, updated footers for legal feedback

Next steps: choose your first slice and ship

You have three practical options:

Choose a single onboarding template, single ESP, and 1 percent canary if you want a working pipeline in a week and can iterate later.
Choose multi template rollout only if you already have strong observability and deliverability processes and can extend them.
Defer AI entirely if you cannot support incident response, DMARC monitoring, or schema validation. In that case, invest in those basics first.

Pick the narrowest slice that still tests AI personalization where it matters for your SaaS. Wire it with contracts, validators, metrics, and a kill switch. Then you can scale with confidence rather than hope.