Mitigating Prompt Sensitivity: Manufacturing Robustness Through Diverse Preambles

Models behave differently based on how a question is phrased --- a "cynical senior dev" and a "curious student" get different answers to the same problem. Using NeMo Data Designer, we built a pipeline that generates hundreds of diverse prompt preambles with controlled variation across tone, strictness, verbosity, and answer format, then validates each one for compliance. These preambles feed into a YAML-driven training mixture pipeline that prepends diverse instructions to existing SFT data at scale. This approach is now used in Nemotron training mixtures to address the prompt-format brittleness observed in internal testing.

What Is Prompt Sensitivity?

A prompt to an LLM typically has three distinct components: the preamble (high-level instructions), the problem (the actual question or task), and the format instruction (how to structure the answer). Prompt sensitivity is the phenomenon where a model's accuracy changes significantly based on how the preamble and format instruction are phrased, even when the underlying problem is identical.

                ANATOMY OF A PROMPT + RESPONSE
                ================================

 ┌──────────────────────────────────────────────────────────────┐
 │  PREAMBLE  (variable — what we diversify)            [GREEN] │
 │                                                              │
 │  "Answer the following multiple choice question"             │
 ├──────────────────────────────────────────────────────────────┤
 │  PROBLEM  (fixed — from the source dataset)            [RED] │
 │                                                              │
 │  What is the capital of France?                              │
 │  A) London                                                   │
 │  B) Berlin                                                   │
 │  C) Paris                                                    │
 │  D) Madrid                                                   │
 ├──────────────────────────────────────────────────────────────┤
 │  FORMAT INSTRUCTION  (variable — what we diversify)  [GREEN] │
 │                                                              │
 │  "The last line of your response should be \boxed{LETTER}"   │
 └──────────────────────────────────────────────┬───────────────┘
                                                │
                                                ▼
 ┌──────────────────────────────────────────────────────────────┐
 │  ROLLOUT / OUTPUT  (model's response)                        │
 │                                                              │
 │  <think> ... reasoning ... </think>                          │
 │                                                              │
 │  \boxed{C}                                                   │
 └──────────────────────────────────────────────────────────────┘

The preamble and format instruction (green) are the parts we can vary freely without changing the problem. The problem itself (red) comes from the source dataset and stays fixed. When models are trained on data with only one preamble style and one format instruction, they become brittle --- they can solve the problem, but small wording changes cause them to misformat their response, triggering scoring failures.

When we evaluated early Nemotron checkpoints on internal STEM benchmarks with varied prompt phrasings, we observed accuracy swings of up to 15 percentage points depending solely on how the question was phrased. The three phrasings below alone span 8 points:

"Select the best answer"           → 82% accuracy
"Choose the correct option"        → 78% accuracy
"Which of the following is true?"  → 74% accuracy

Same questions. Same model. Same knowledge. Different scores. This is a well-documented phenomenon across the industry --- models overfit to the prompt format seen during training.
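
One simple way to put a number on this effect is the gap between the best and worst phrasing. The snippet below is a minimal sketch using only the three figures listed above; `accuracy_spread` is a hypothetical helper name, not part of any benchmark tooling.

```python
# Accuracy per phrasing, copied from the three examples above (percent).
accuracy_by_phrasing = {
    "Select the best answer": 82,
    "Choose the correct option": 78,
    "Which of the following is true?": 74,
}


def accuracy_spread(scores: dict[str, float]) -> float:
    """Gap between the best and worst phrasing, in percentage points."""
    return max(scores.values()) - min(scores.values())


spread = accuracy_spread(accuracy_by_phrasing)
print(f"spread: {spread} percentage points")
```

Tracking this spread before and after training on diversified prompts gives a single scalar for how much sensitivity remains.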

The root cause is straightforward: the training data lacks prompt diversity. If every STEM MCQ in your SFT dataset starts with "Answer the following question and place your answer in \boxed{}", the model learns that specific format perfectly but becomes brittle to anything else.

The fix is equally simple in principle --- expose the model to the same problems with many different phrasings --- but doing this manually at the scale of thousands of training examples is impractical. We need preambles that span a wide diversity space:

Sentence types: imperative ("Select the answer"), interrogative ("Which option is correct?"), declarative ("The correct answer is to be placed in...")
Tones: formal, neutral, concise, encouraging, strict
Strictness levels: from "here's a question" to "you MUST follow this exact format"
Verbosity: one-liners vs. detailed multi-sentence instructions
Answer formats: \boxed{}, \boxed{LETTER}, Answer: A/B/C/D, ((X)), <final_answer>X</final_answer>, and dozens more

Covering the full combinatorial space of these dimensions manually is intractable, and this is exactly the kind of structured diversity problem that synthetic data generation is designed to solve. Data Designer's sampler-driven approach lets us define the diversity dimensions declaratively; the framework handles the combinatorics and produces validated preamble variations at a volume that would be impractical to write and check by hand.

Goal

Reduce LLM sensitivity to prompt phrasing by generating diverse, high-quality prompts for both SFT and RL training data using Data Designer.
Improve model robustness and generalization across different instruction styles, tones, and structures.
Specifically target variations in preambles and format instructions while keeping the core problem unchanged.

Pipeline Architecture: QA Preamble Generation

The pipeline below shows one specific instantiation: generating diverse preambles for QA/MCQ datasets in order to reduce the model's sensitivity to how the question prompt is phrased. The same architecture can be adapted for Math, Code, or any domain where prompt diversity is needed.

                              PROMPT SENSITIVITY PIPELINE
                              ==========================
                              (QA Preamble Generation)

             ┌─────────────────────────────────────────────────────────────────────────────────────┐
             │                              STEP 1: SEED EXAMPLES                                  │
             │                                                                                     │
             │   10 regex-paired format templates × 5 preamble anchors = 50 seed rows              │
             │   Loaded via DataFrameSeedSource with SamplingStrategy.SHUFFLE                      │
             └─────────────────────────────────────────┬───────────────────────────────────────────┘
                                                       │
                                                       ▼
             ┌─────────────────────────────────────────────────────────────────────────────────────┐
             │                           STEP 2: DIVERSITY SAMPLERS                                │
             │                                                                                     │
             │   sentence_type (3)    tone (5)         strictness_level (3)                        │
             │   imperative           formal            low                                        │
             │   interrogative        neutral           medium                                     │
             │   declarative          concise           high                                       │
             │                        encouraging                                                  │
             │                        strict            verbosity_level (3)                        │
             │                                          concise / standard / detailed              │
             │   domain_context (3)   preamble_format_order (8)                                    │
             │   general              P + F + {problem},  F + P + {problem},                       │
             │   STEM                 P + {problem} + F,  F + {problem} + P,                       │
             │   humanities           {problem} + P + F,  {problem} + F + P,                       │
             │                        PF + {problem},     {problem} + PF                           │
             │                                                                                     │
             │                                          Combinatorial space:                       │
             │                                          3 × 5 × 3 × 3 × 3 × 8 = 3,240              │
             └─────────────────────────────────────────┬───────────────────────────────────────────┘
                                                       │
                                                       ▼
             ┌─────────────────────────────────────────────────────────────────────────────────────┐
             │                          STEP 3: LLM GENERATION COLUMNS                             │
             │                                                                                     │
             │   Three LLMTextColumnConfigs conditioned on the samplers + seed rows:               │
             │     • preamble           — generic instruction (no format requirements)             │
             │     • format_instruction — paraphrased from seed; must match output_regex           │
             │     • user_prompt        — composed via preamble_format_order                       │
             └─────────────────────────────────────────┬───────────────────────────────────────────┘
                                                       │
                                                       ▼
             ┌─────────────────────────────────────────────────────────────────────────────────────┐
             │                             STEP 4: QUALITY JUDGES                                  │
             │                                                                                     │
             │   format_compliance (binary): does format_instruction enforce the format?           │
             │   regex_alignment   (binary): does format_instruction align with output_regex?      │
             │   order_coherence   (binary): does the assembled user_prompt read coherently?       │
             │   preamble_quality  (0-3):    does preamble match tone, verbosity, clarity?         │
             │                                                                                     │
             │   ~15-20% of generations fail at least one binary gate.                             │
             │   Filter binary gates to 1 and preamble_quality ≥ 2.                                │
             └─────────────────────────────────────────┬───────────────────────────────────────────┘
                                                       │
                                                       ▼
             ┌─────────────────────────────────────────────────────────────────────────────────────┐
             │                     STEP 5: INTEGRATION INTO TRAINING MIXTURES                      │
             │                                                                                     │
             │   YAML-driven mixture script that operates at production scale:                     │
             │   ├─ Load base SFT dataset (STEM MCQ, Math, Code)                                   │
             │   ├─ Sample user_prompts from the generated pool                                    │
             │   ├─ Substitute {problem} with each record's question                               │
             │   ├─ majority_percentage: 25% (canonical format) / 75% (diverse variations)         │
             │   └─ Pack sequences to 128k tokens for training                                     │
             └─────────────────────────────────────────────────────────────────────────────────────┘

The 6 samplers create a combinatorial diversity space of 3,240 unique style combinations; crossed with the 50 seed rows from the format-template × preamble cross-product, that gives 162,000 possible configurations. A run of 1,000 records cannot enumerate this space, but because each record draws its sampler values and seed row at random, the records spread across it instead of clustering around a few dominant styles.
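
The arithmetic behind these numbers can be checked in a few lines. This is a sketch that assumes every sampler draws uniformly and independently for each record (an assumption about the sampling, not something the pipeline guarantees); `expected_distinct` is a hypothetical helper.

```python
import math

# Sampler cardinalities from the pipeline diagram.
sampler_sizes = {
    "sentence_type": 3,
    "tone": 5,
    "strictness_level": 3,
    "verbosity_level": 3,
    "domain_context": 3,
    "preamble_format_order": 8,
}
seed_rows = 10 * 5  # format templates x preamble anchors

style_combos = math.prod(sampler_sizes.values())
total_configurations = style_combos * seed_rows


def expected_distinct(cells: int, draws: int) -> float:
    """Expected number of distinct cells hit by uniform, independent draws."""
    return cells * (1 - (1 - 1 / cells) ** draws)


covered = expected_distinct(style_combos, 1000)
print(style_combos, total_configurations, round(covered))
```

Under the uniform-sampling assumption, 1,000 records are expected to touch on the order of 860 of the 3,240 style combinations, so most records carry a style combination that few or no other records share.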

Step 1: Curated Seed Examples

The seed is the cross-product of two small hand-written sets: 10 regex-paired format templates (covered in the Regex-Paired Format Templates section below) and 5 generic preamble anchors. Together they form 50 seed rows. Templates carry the format-instruction seed plus its paired output_regex; preambles act as style anchors so the LLM knows what a generic instruction line looks like.

import itertools
import pandas as pd
import data_designer.config as dd
from data_designer.interface import DataDesigner

# 10 regex-paired format templates (full list in the source under "Try For Yourself")
MCQ_FORMAT_TEMPLATES = [
    {"format_key": "boxed_letter",
     "seed_format_instruction": "Choose exactly one letter. Place your answer in \\boxed{LETTER} format.",
     "output_regex": r"\\boxed\{([A-Za-z])\}"},
    # ... 9 more
]

# 5 generic preamble anchors (no format requirements)
SEED_PREAMBLES = [
    "Choose the correct answer from the options below.",
    "Read the question carefully and select the best option.",
    "Answer the following multiple choice question.",
    "Consider each option and pick the correct one.",
    "Solve the problem and select the right choice.",
]

# Cross-product: every format template × every preamble = 50 seed rows
seed_df = pd.DataFrame([
    {**fmt, "seed_preamble": preamble}
    for fmt, preamble in itertools.product(MCQ_FORMAT_TEMPLATES, SEED_PREAMBLES)
])

config = dd.DataDesignerConfigBuilder(model_configs=[
    # qwen3-235b-a22b is hosted on build.nvidia.com under the "nvidia" provider
    dd.ModelConfig(alias="gen", model="qwen/qwen3-235b-a22b", provider="nvidia"),
])
config.with_seed_dataset(
    dd.DataFrameSeedSource(df=seed_df),
    sampling_strategy=dd.SamplingStrategy.SHUFFLE,
)

With SamplingStrategy.SHUFFLE, each generated record sees a random seed row (format_key, seed_format_instruction, output_regex, and seed_preamble) alongside its sampled style parameters.

Step 2: Six Dimensions of Diversity

Each sampler controls one axis of variation:

config.add_column(dd.SamplerColumnConfig(
    name="sentence_type",
    sampler_type=dd.SamplerType.CATEGORY,
    params=dd.CategorySamplerParams(values=["imperative", "interrogative", "declarative"]),
))

config.add_column(dd.SamplerColumnConfig(
    name="tone",
    sampler_type=dd.SamplerType.CATEGORY,
    params=dd.CategorySamplerParams(values=["formal", "neutral", "concise", "encouraging", "strict"]),
))

config.add_column(dd.SamplerColumnConfig(
    name="strictness_level",
    sampler_type=dd.SamplerType.CATEGORY,
    params=dd.CategorySamplerParams(values=["low", "medium", "high"]),
))

config.add_column(dd.SamplerColumnConfig(
    name="verbosity_level",
    sampler_type=dd.SamplerType.CATEGORY,
    params=dd.CategorySamplerParams(values=["concise", "standard", "detailed"]),
))

config.add_column(dd.SamplerColumnConfig(
    name="domain_context",
    sampler_type=dd.SamplerType.CATEGORY,
    params=dd.CategorySamplerParams(values=["general", "STEM", "humanities"]),
))

config.add_column(dd.SamplerColumnConfig(
    name="preamble_format_order",
    sampler_type=dd.SamplerType.CATEGORY,
    params=dd.CategorySamplerParams(values=[
        "P + F + {problem}", "F + P + {problem}",
        "P + {problem} + F", "F + {problem} + P",
        "{problem} + P + F", "{problem} + F + P",
        "PF + {problem}", "{problem} + PF",
    ]),
))

The last sampler — preamble_format_order — controls how the preamble (P), format instruction (F), and {problem} placeholder get arranged in the final user prompt. It prevents the model from overfitting to a single positional layout.

This is the power of Data Designer's sampler approach: you define the diversity dimensions — including positional order — and the framework handles the combinatorics.
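
To make the order strings concrete, here is a deterministic sketch of what each one means. The pipeline itself composes the prompt with an LLM column (Step 3); `compose_prompt` is a hypothetical helper for illustration only, and joining components with blank lines is an assumption.

```python
def compose_prompt(order: str, preamble: str, format_instruction: str) -> str:
    """Assemble a user prompt from an order string such as 'P + {problem} + F'."""
    pieces = {
        "P": preamble,
        "F": format_instruction,
        "PF": f"{preamble} {format_instruction}",  # merged into one paragraph
        "{problem}": "{problem}",  # kept as a literal placeholder
    }
    parts = [pieces[token.strip()] for token in order.split("+")]
    return "\n\n".join(parts)


example = compose_prompt(
    "P + {problem} + F",
    "Select the best option.",
    "Put the chosen letter inside double brackets: ((X)).",
)
print(example)
```

Each of the 8 sampler values maps to a different arrangement of the same three components, with {problem} left untouched for later substitution.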

Step 3: LLM Generation Columns

Three LLM text columns turn the sampled style parameters and seed rows into the actual training-ready prompts. Each column is a separate generation call, with each downstream column able to reference the values produced upstream.

preamble: a generic instruction line with no format requirements. The model conditions on the style samplers (sentence type, tone, strictness, verbosity, domain) and a seed preamble, but never on the answer format, so the same preamble can be paired with any answer format downstream:

config.add_column(dd.LLMTextColumnConfig(
    name="preamble",
    model_alias="gen",
    prompt=(
        "You are writing a concise instruction line to accompany a "
        "{{ domain_context }} question.\n"
        "Keep it neutral and generic; do NOT include output formatting requirements.\n\n"
        "Constraints:\n"
        "- Sentence type: {{ sentence_type }}\n"
        "- Tone: {{ tone }}\n"
        "- Strictness: {{ strictness_level }}\n"
        "- Verbosity: {{ verbosity_level }}\n\n"
        "Seed (paraphrase; do not copy): \"{{ seed_preamble }}\"\n\n"
        "Return ONLY the instruction line text."
    ),
))

format_instruction: paraphrases seed_format_instruction while staying compatible with its paired output_regex. The regex is the contract: the instruction can be reworded freely, but a model output that follows it must still match the regex so that RL reward extraction works.

user_prompt: composes preamble + format_instruction + {problem} (literal placeholder) in the order chosen by the preamble_format_order sampler. This is the final template into which each training record's question is substituted at mixture time.

The separation matters: keeping preamble and format_instruction as distinct columns means each diversity dimension is varied independently, and the judges in the next step can score each component on its own terms. See the full source under Try For Yourself for the format_instruction and user_prompt configs.

Step 4: Quality Judges

Four LLM judges score each generated row across format and quality dimensions. Three binary gates catch hard failures; one rubric judge scores tone and clarity.

format_compliance (binary): Does format_instruction actually enforce the required output pattern?

config.add_column(dd.LLMJudgeColumnConfig(
    name="format_compliance",
    model_alias="gen",
    prompt=(
        "Does this instruction enforce the format?\n\n"
        "Instruction: {{ format_instruction }}\n"
        "Regex: {{ output_regex }}"
    ),
    scores=[dd.Score(
        name="format_match",
        description="Format enforced?",
        options={1: "Yes", 0: "No"},
    )],
))

regex_alignment (binary): Does format_instruction align with the paired output_regex from the seed template? This catches drift where the generated instruction sounds plausible but won't be matched by the extraction regex at training/eval time.

order_coherence (binary): Does the assembled user_prompt read coherently in the sampled preamble_format_order? Some orderings (e.g. {problem} + PF) only work if the preamble and format instruction flow naturally after the question.
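
A cheap programmatic pre-check can complement this judge by confirming that the {problem} placeholder survived generation and sits where the sampled order says it should. This is a sketch, not part of the described pipeline; the function names are hypothetical.

```python
from typing import Optional

PLACEHOLDER = "{problem}"


def placeholder_position(user_prompt: str) -> Optional[str]:
    """Return 'start', 'middle', or 'end' for a single placeholder, else None."""
    if user_prompt.count(PLACEHOLDER) != 1:
        return None
    before, after = (part.strip() for part in user_prompt.split(PLACEHOLDER))
    if before and after:
        return "middle"
    if before:
        return "end"
    if after:
        return "start"
    return None  # placeholder with no instructions around it


def expected_position(order: str) -> str:
    """Where the placeholder should sit for an order like 'P + {problem} + F'."""
    tokens = [token.strip() for token in order.split("+")]
    index = tokens.index(PLACEHOLDER)
    if index == 0:
        return "start"
    return "end" if index == len(tokens) - 1 else "middle"


def order_is_respected(user_prompt: str, order: str) -> bool:
    return placeholder_position(user_prompt) == expected_position(order)


print(order_is_respected("Pick one.\n{problem}\nEnd with ((X)).", "P + {problem} + F"))
```

Rows that fail this check would break substitution at mixture time regardless of how the LLM judge scores them, so it makes sense to drop them before spending judge calls.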

preamble_quality (0-3 rubric): Does the preamble match the requested tone, verbosity, and clarity?

config.add_column(dd.LLMJudgeColumnConfig(
    name="preamble_quality",
    model_alias="gen",
    prompt=(
        "Evaluate this instruction line.\n\n"
        "Instruction: {{ preamble }}\n"
        "Tone: {{ tone }}\n"
        "Verbosity: {{ verbosity_level }}"
    ),
    scores=[dd.Score(
        name="quality",
        description="Preamble quality",
        options={3: "Excellent", 2: "Good", 1: "Fair", 0: "Poor"},
    )],
))

Roughly 15-20% of generations fail at least one binary gate. The downstream filter keeps rows that pass all three binary judges and score ≥ 2 on preamble_quality.
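
The filter itself is a one-line predicate. The sketch below assumes the judge outputs have already been flattened into one integer field per judge, named after the judge columns; the real output structure may nest scores differently, and the toy rows are made up for illustration.

```python
BINARY_GATES = ("format_compliance", "regex_alignment", "order_coherence")
MIN_PREAMBLE_QUALITY = 2


def keep_row(row: dict) -> bool:
    """True if the row passes all three binary gates and the quality threshold."""
    passes_gates = all(row[gate] == 1 for gate in BINARY_GATES)
    return passes_gates and row["preamble_quality"] >= MIN_PREAMBLE_QUALITY


# Toy rows standing in for generated records (scores are illustrative only).
rows = [
    {"id": 0, "format_compliance": 1, "regex_alignment": 1, "order_coherence": 1, "preamble_quality": 3},
    {"id": 1, "format_compliance": 1, "regex_alignment": 0, "order_coherence": 1, "preamble_quality": 3},
    {"id": 2, "format_compliance": 1, "regex_alignment": 1, "order_coherence": 1, "preamble_quality": 1},
]
kept = [row for row in rows if keep_row(row)]
print([row["id"] for row in kept])
```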

Step 5: Integration into Training Mixtures

The previous four steps produce a pool of validated prompts. They're useful as a standalone artifact, but the actual goal is to use them as instruction variations on top of existing SFT data — without rebuilding that data from scratch.

This is what the mixture script does. It treats the prompts generated by Data Designer (DD) as a diversity layer that gets applied at training-data assembly time:

Compose the training mixture from existing JSONL shards. Production SFT data already exists as JSONL files (open STEM, HLE, internal MCQ collections, etc.). Each shard contributes a configured percentage of the final mixture, sampled with a fixed seed for reproducibility.
Apply preamble variations on top. For each record, draw a Bernoulli flip against majority_percentage. With that probability (25% in the example), prepend the single canonical preamble, the format the model is "officially" expected to handle; the config calls this the majority preamble even when it covers fewer than half the records. Otherwise (75%), sample a random preamble from the DD-generated pool. The model thus sees the canonical format often enough to lock it in, while the variations keep it from overfitting to that one format.
Detect MCQ-like records with regex heuristics. Non-MCQ records (free-response, code, math without options) shouldn't get an MCQ preamble. The pipeline identifies MCQ records by matching \n(A), \n(B) patterns in the user turn or \boxed{} in the assistant turn, and skips everything else.
Pack sequences for training. Concatenate records up to max_seq_length (128k here), with shuffle_before and shuffle_after enabled so the model cannot learn from a fixed neighbor ordering inside packed sequences.

Below is the YAML config that drives this:

mixture:
  target: 1000000
  seed: 13
  files:
    - path: /path/to/openstem.jsonl
      percent: 70.6
    - path: /path/to/openstem_15p.jsonl
      percent: 10.6
    - path: /path/to/hle.jsonl
      percent: 16.3
    - path: /path/to/hle_15p.jsonl
      percent: 2.5

preamble:
  augment: true
  majority_preamble: |-
    Answer the following multiple choice question. The last line of your
    response should be in the following format: 'Answer: \boxed{A/B/C/D}'
    (e.g. 'Answer: \boxed{A}').

    {problem}
  majority_percentage: 25.0
  variations:
    path: prompts/stem_prompts_1000.jsonl
    field: preamble_text  # JSONL field holding the assembled prompt, including the {problem} placeholder

pack:
  enabled: true
  shuffle_before: true
  shuffle_after: true
  max_seq_length: 128000
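
As a sanity check on the mixture section, the shard percentages should sum to 100 and translate into whole record counts. The sketch below uses the numbers from the config above; rounding each shard to the nearest record is an assumption about how the script resolves percentages.

```python
# Target size and shard percentages copied from the mixture config above.
TARGET = 1_000_000
shard_percent = {
    "openstem.jsonl": 70.6,
    "openstem_15p.jsonl": 10.6,
    "hle.jsonl": 16.3,
    "hle_15p.jsonl": 2.5,
}

total_percent = sum(shard_percent.values())
assert abs(total_percent - 100.0) < 1e-6, f"percentages sum to {total_percent}"

# Assumption: each shard contributes round(target * percent / 100) records.
records_per_shard = {
    name: round(TARGET * percent / 100) for name, percent in shard_percent.items()
}
print(records_per_shard)
```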

The majority_percentage is the main knob to tune. Setting it too high (e.g., 90%) means the model rarely sees variations and prompt-format brittleness persists; setting it too low (e.g., 5%) starves the model of the canonical format it's expected to perform well on. In internal testing, 25% canonical / 75% varied struck the right balance for QA mixtures — the canonical format stays sharp while format-robustness improves.
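
The per-record logic of this step can be sketched as follows. This is a minimal illustration, not the production mixture script: `MCQ_PATTERN` is a hypothetical heuristic that looks only at the user turn, the canonical prompt is shortened, and the pool holds a single made-up variation.

```python
import random
import re

# Canonical prompt from the YAML config, shortened here for readability.
MAJORITY_PREAMBLE = "Answer the following multiple choice question.\n\n{problem}"
MAJORITY_PERCENTAGE = 25.0

# Hypothetical heuristic: option markers such as "\n(A)" or "\nA)" in the user turn.
MCQ_PATTERN = re.compile(r"\n\(?[A-D]\)")


def is_mcq(question: str) -> bool:
    return MCQ_PATTERN.search(question) is not None


def apply_preamble(question: str, variations: list[str], rng: random.Random) -> str:
    """Wrap an MCQ question in the canonical or a sampled prompt template."""
    if not is_mcq(question):
        return question  # leave free-response, code, and math records untouched
    if rng.random() * 100 < MAJORITY_PERCENTAGE:
        template = MAJORITY_PREAMBLE
    else:
        template = rng.choice(variations)
    # str.replace, not str.format: templates may contain braces like \boxed{LETTER}.
    return template.replace("{problem}", question)


rng = random.Random(13)
pool = ["{problem}\n\nPlace your answer in \\boxed{LETTER} format."]
mcq = "What is the capital of France?\nA) London\nB) Berlin\nC) Paris\nD) Madrid"
print(apply_preamble(mcq, pool, rng))
```

Using str.replace for substitution matters in practice: many format instructions contain literal braces, which str.format would try to interpret as fields.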

The end result of this step is a 1M-record packed training mixture where each problem appears with one of 1,000+ different instruction phrasings — assembled from existing data with no manual preamble authoring required.

Regex-Paired Format Templates: Enabling Both SFT and RL

The key design decision that makes this pipeline work for both SFT and RL is pairing every format instruction with an extraction regex. Each template defines a human-readable format instruction (which gets paraphrased by Data Designer for diversity) and a regex pattern (which stays fixed for automated answer extraction).

- prompt: 'End your response with ''Correct Option: A/B/C/D/...''.'
  output_regex: 'Correct Option:\s*([A-Za-z])'

- prompt: 'Put the chosen letter inside double brackets: ((X)).'
  output_regex: '\(\(([A-Za-z])\)\)'

- prompt: 'Wrap your final answer letter in XML-style tags: <final_answer>X</final_answer>.'
  output_regex: '<final_answer>\s*([A-Za-z])\s*</final_answer>'

- prompt: 'Finish by enclosing the correct option letter in double asterisks (like **A**).'
  output_regex: '\*\*([A-Za-z])\*\*'

- prompt: 'Conclude by stating ''Correct Answer >> A/B/C/D/...''.'
  output_regex: 'Correct Answer >> ([A-Za-z])'

The full source under Try For Yourself includes 10 distinct format templates spanning \boxed{}, brackets, parentheses, XML tags, markdown bold, arrows, and plain text, and the set is easy to extend by adding more (prompt, output_regex) pairs. This dual-use design means:

For SFT: The paraphrased format instructions add diversity to the training data. The model sees the same problem with many different answer format requirements, building robustness.
For RL: The paired regex lets the reward function extract the answer from model output regardless of which format was requested. The RL environment (e.g., NeMo-Gym) uses the regex to verify correctness without brittle string matching.
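
On the RL side, the paired regex turns answer checking into a few lines of code. The sketch below is one possible reward function, not the NeMo-Gym implementation: taking the last match, uppercasing the letter, and returning a binary reward are all assumptions, and `extract_answer` and `reward` are hypothetical names.

```python
import re
from typing import Optional


def extract_answer(response: str, output_regex: str) -> Optional[str]:
    """Return the last letter captured by the template's regex, or None."""
    matches = re.findall(output_regex, response)
    return matches[-1].upper() if matches else None


def reward(response: str, output_regex: str, gold: str) -> float:
    """1.0 if the extracted letter equals the gold letter, else 0.0."""
    return 1.0 if extract_answer(response, output_regex) == gold.upper() else 0.0


# Two regexes taken from the format templates in this post.
boxed = r"\\boxed\{([A-Za-z])\}"
double_brackets = r"\(\(([A-Za-z])\)\)"

print(reward("<think>...</think>\n\\boxed{C}", boxed, "C"))   # 1.0
print(reward("The answer is ((b)).", double_brackets, "B"))  # 1.0
print(reward("I think it is C.", boxed, "C"))                # 0.0
```

The third call shows why the format gate matters: a correct answer in the wrong format earns no reward, so an instruction that drifts from its regex penalizes the model for following it.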

The preamble (generic instruction) and format instruction (answer format) are generated as separate LLM columns, then composed into a final user_prompt with the {problem} placeholder arranged in one of 8 placement orders (P + F + {problem}, {problem} + P + F, etc.). This separation lets you vary each dimension independently and prevents positional overfitting.

Key Takeaways

Samplers make diversity systematic. Six categorical samplers (3-8 values each) create a 3,240-combination space, multiplied by 50 seed rows from the format-template × preamble cross-product. No human annotator covers that surface area consistently.
Seed examples are style anchors, not templates. The LLM needs to see what a preamble is, but the samplers control what each preamble says. Without seeds, the LLM guesses at the format; without samplers, it converges to a narrow style.
Format compliance is a hard gate. A format_instruction that drifts from its paired output_regex will confuse the model during training and break extraction at eval time. Binary judges (format_compliance + regex_alignment) catch this; across all three binary gates, roughly 15-20% of generations fail at least one.
The value is in the pipeline, not the individual records. Any single preamble is easy to write by hand. The value is generating 1,000+ diverse, validated preambles automatically and integrating them into million-record training mixtures with controlled majority/variation ratios.
Regex-paired templates unify SFT and RL. The same format templates serve double duty: paraphrased instructions add SFT diversity, while the paired regex enables RL reward parsing. One pipeline, both training paradigms.
Majority percentage controls the tradeoff. Setting majority_percentage: 25 means the model sees the canonical format 25% of the time and diverse variations 75% of the time. This ratio was tuned empirically --- too much diversity degrades canonical-format performance; too little doesn't build robustness.

Try For Yourself

Full source: prompt sensitivity pipeline

import itertools
import pandas as pd
import data_designer.config as dd
from data_designer.interface import DataDesigner

# --- Format templates with paired regexes ---
MCQ_FORMAT_TEMPLATES = [
    {"format_key": "boxed_letter",
     "seed_format_instruction": "Choose exactly one letter. Place your answer in \\boxed{LETTER} format.",
     "output_regex": r"\\boxed\{([A-Za-z])\}"},
    {"format_key": "correct_option",
     "seed_format_instruction": "End your response with 'Correct Option: A/B/C/D/...'.",
     "output_regex": r"Correct Option:\s*([A-Za-z])"},
    {"format_key": "double_brackets",
     "seed_format_instruction": "Put the chosen letter inside double brackets: ((X)).",
     "output_regex": r"\(\(([A-Za-z])\)\)"},
    {"format_key": "xml_tags",
     "seed_format_instruction": "Wrap your final answer letter in XML-style tags: <final_answer>X</final_answer>.",
     "output_regex": r"<final_answer>\s*([A-Za-z])\s*</final_answer>"},
    {"format_key": "double_asterisks",
     "seed_format_instruction": "Finish by enclosing the correct option letter in double asterisks (like **A**).",
     "output_regex": r"\*\*([A-Za-z])\*\*"},
    {"format_key": "square_brackets",
     "seed_format_instruction": "Give the letter choice at the end: [Answer: X] where X is the correct letter.",
     "output_regex": r"\[Answer:\s*([A-Za-z])\]"},
    {"format_key": "angle_brackets",
     "seed_format_instruction": "Provide the correct option enclosed in angle brackets: <A>.",
     "output_regex": r"<([A-Z])>"},
    {"format_key": "correct_answer_arrow",
     "seed_format_instruction": "Conclude by stating 'Correct Answer >> A/B/C/D/...'.",
     "output_regex": r"Correct Answer >> ([A-Za-z])"},
    {"format_key": "curly_braces",
     "seed_format_instruction": "End your answer by writing the correct option: 'Answer Choice: {X}'.",
     "output_regex": r"Answer Choice:\s*\{([A-Za-z])\}"},
    {"format_key": "selected_option",
     "seed_format_instruction": "Provide the selected option: Selected Option -> X.",
     "output_regex": r"Selected Option\s*->\s*([A-Za-z])"},
]

SEED_PREAMBLES = [
    "Choose the correct answer from the options below.",
    "Read the question carefully and select the best option.",
    "Answer the following multiple choice question.",
    "Consider each option and pick the correct one.",
    "Solve the problem and select the right choice.",
]

# Cross-product seed data
seed_df = pd.DataFrame([
    {**fmt, "seed_preamble": preamble}
    for fmt, preamble in itertools.product(MCQ_FORMAT_TEMPLATES, SEED_PREAMBLES)
])

# --- Model + config ---
config = dd.DataDesignerConfigBuilder(model_configs=[
    dd.ModelConfig(alias="gen", model="qwen/qwen3-235b-a22b", provider="nvidia"),
])
config.with_seed_dataset(
    dd.DataFrameSeedSource(df=seed_df),
    sampling_strategy=dd.SamplingStrategy.SHUFFLE,
)

# --- Samplers ---
for name, values in [
    ("sentence_type", ["imperative", "interrogative", "declarative"]),
    ("tone", ["formal", "neutral", "concise", "encouraging", "strict"]),
    ("strictness_level", ["low", "medium", "high"]),
    ("verbosity_level", ["concise", "standard", "detailed"]),
    ("domain_context", ["general", "STEM", "humanities"]),
    ("preamble_format_order", [
        "P + F + {problem}", "F + P + {problem}",
        "P + {problem} + F", "F + {problem} + P",
        "{problem} + P + F", "{problem} + F + P",
        "PF + {problem}", "{problem} + PF",
    ]),
]:
    config.add_column(dd.SamplerColumnConfig(
        name=name,
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(values=values),
    ))

# --- LLM columns ---
config.add_column(dd.LLMTextColumnConfig(
    name="preamble",
    model_alias="gen",
    prompt=(
        "You are writing a concise instruction line to accompany a "
        "{{ domain_context }} question.\n"
        "Keep it neutral and generic; do NOT include output formatting requirements.\n\n"
        "Constraints:\n"
        "- Sentence type: {{ sentence_type }}\n"
        "- Tone: {{ tone }}\n"
        "- Strictness: {{ strictness_level }}\n"
        "- Verbosity: {{ verbosity_level }}\n\n"
        "Seed (paraphrase; do not copy): \"{{ seed_preamble }}\"\n\n"
        "Return ONLY the instruction line text."
    ),
))

config.add_column(dd.LLMTextColumnConfig(
    name="format_instruction",
    model_alias="gen",
    prompt=(
        "You are writing a concise format instruction for a multiple-choice question.\n"
        "The format must be compatible with this output pattern:\n\n"
        "- Output regex: {{ output_regex }}\n\n"
        "Constraints:\n"
        "- Tone: {{ tone }}\n"
        "- The answer must appear at the end of the response.\n\n"
        "Seed (paraphrase; keep meaning): \"{{ seed_format_instruction }}\"\n\n"
        "Return ONLY the format instruction."
    ),
))

config.add_column(dd.LLMTextColumnConfig(
    name="user_prompt",
    model_alias="gen",
    prompt=(
        "Concatenate these components in the specified order:\n\n"
        "- P (Preamble): {{ preamble }}\n"
        "- F (Format Instruction): {{ format_instruction }}\n"
        "- {problem}: Placeholder (keep as literal text)\n\n"
        "Order: {{ preamble_format_order }}\n\n"
        "Rules: arrange per order, merge if 'PF', newline before/after {problem}, "
        "no labels like 'P:' or 'F:'.\n\n"
        "Return ONLY the assembled prompt text."
    ),
))

# --- Judges ---
config.add_column(dd.LLMJudgeColumnConfig(
    name="format_compliance",
    model_alias="gen",
    prompt="Does this instruction enforce the format?\n\nInstruction: {{ format_instruction }}\nRegex: {{ output_regex }}",
    scores=[dd.Score(name="format_match", description="Format enforced?",
                     options={1: "Yes", 0: "No"})],
))

config.add_column(dd.LLMJudgeColumnConfig(
    name="regex_alignment",
    model_alias="gen",
    prompt="Does the instruction align with this regex?\n\nRegex: {{ output_regex }}\nInstruction: {{ format_instruction }}",
    scores=[dd.Score(name="aligned", description="Regex aligned?",
                     options={1: "Aligned", 0: "Not aligned"})],
))

config.add_column(dd.LLMJudgeColumnConfig(
    name="order_coherence",
    model_alias="gen",
    prompt="Is this user prompt coherent, and does it follow the requested order?\n\nOrder: {{ preamble_format_order }}\nUser prompt: {{ user_prompt }}",
    scores=[dd.Score(name="coherent", description="Ordering coherent?",
                     options={1: "Coherent", 0: "Incoherent"})],
))

config.add_column(dd.LLMJudgeColumnConfig(
    name="preamble_quality",
    model_alias="gen",
    prompt="Evaluate this instruction line.\n\nInstruction: {{ preamble }}\nTone: {{ tone }}\nVerbosity: {{ verbosity_level }}",
    scores=[dd.Score(name="quality", description="Preamble quality",
                     options={3: "Excellent", 2: "Good", 1: "Fair", 0: "Poor"})],
))

# --- Run ---
data_designer = DataDesigner()
results = data_designer.create(
    config_builder=config,
    num_records=1000,
    dataset_name="prompt-sensitivity-mcq",
)

Key Resources:

NeMo Data Designer on GitHub

Want to learn more about NeMo Data Designer? Check out our documentation and start building your own synthetic data pipelines today.