Common Types#

Shared data containers and type definitions used throughout TROPT: trigger placeholders, model I/O wrappers, target containers, and slice keys.

tropt.common.OPTIMIZED_TRIGGER_PLACEHOLDER: str = '{{OPTIMIZED_TRIGGER}}'#

tropt.common.DEFAULT_INIT_TRIGGER = '! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !'#
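
As a rough illustration of how the two constants relate, the sketch below substitutes the placeholder in a prompt template with the default initial trigger. The constants are copied locally so the snippet runs without TROPT installed; `fill_trigger` and the template text are hypothetical and not part of the documented API.

```python
# Local copies of the two documented constants (values taken from this page).
OPTIMIZED_TRIGGER_PLACEHOLDER = "{{OPTIMIZED_TRIGGER}}"
DEFAULT_INIT_TRIGGER = "! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !"


def fill_trigger(template: str, trigger: str) -> str:
    """Hypothetical helper: replace the placeholder with a concrete trigger string."""
    if OPTIMIZED_TRIGGER_PLACEHOLDER not in template:
        raise ValueError("template does not contain the trigger placeholder")
    return template.replace(OPTIMIZED_TRIGGER_PLACEHOLDER, trigger)


# Assumed example template; any string containing the placeholder would do.
template = "Summarize this text. {{OPTIMIZED_TRIGGER}}"
filled = fill_trigger(template, DEFAULT_INIT_TRIGGER)
```

How TROPT itself performs this substitution is not stated here; the sketch only shows the string-level idea.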

Data Containers#

class tropt.common.SliceKey(value)[source]#

Bases: str, Enum

Enum for standardized slice keys used in input embeddings.

APPENDED = 'appended'#: Optional tokens appended (=prefilled) at the end (e.g., target outputs for LMs)

INPUT_AFTER = 'input_after'#: Tokens after the trigger (including chat template, if exists)

INPUT_BEFORE = 'input_before'#: Tokens before the trigger (including chat template, if exists)

INPUT_LAST_TOKEN = 'input_last_token'#: The last token of the input sequence (e.g., typically the end of the chat template before generation starts)

TRIGGER = 'trigger'#: The optimized trigger tokens
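
Because SliceKey derives from both str and Enum, its members behave like their string values in comparisons. The sketch below mirrors the documented members in a local class (so it runs without TROPT) and shows that behaviour; it is an illustration, not the library source.

```python
from enum import Enum


class SliceKey(str, Enum):
    """Local mirror of the documented enum, for illustration only."""
    APPENDED = "appended"
    INPUT_AFTER = "input_after"
    INPUT_BEFORE = "input_before"
    INPUT_LAST_TOKEN = "input_last_token"
    TRIGGER = "trigger"


# Members compare equal to their string values because of the str mixin.
same_as_str = SliceKey.TRIGGER == "trigger"

# Lookup by value returns the member itself.
member = SliceKey("input_before")

# An unknown value is rejected, which catches typos in slice keys early.
try:
    SliceKey("triger")
    typo_rejected = False
except ValueError:
    typo_rejected = True
```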

class tropt.common.MessageTargets(**data)[source]#

Bases: BaseModel

Targets for a single selected message.

Parameters:

target_response_strs (str | None)
target_response_toks (Int[Tensor, 'target_seq_len'] | None)
target_response_logits (Float[Tensor, 'target_seq_len vocab_size'] | None)
target_vectors (Float[Tensor, 'd_model'] | None)
target_directions (Float[Tensor, 'd_model'] | None)
target_gradient (Float[Tensor, 'n_params'] | None)
target_class_idx (int | None)
true_class_idx (int | None)

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

target_class_idx: Optional[int]#: Target class index for this message. Used by targeted-misclassification losses on classifier outputs.

target_directions: Optional[Float[Tensor, 'd_model']]#: Target direction in activation space for this message. Used by steering losses (e.g., representation engineering).

target_gradient: Optional[Float[Tensor, 'n_params']]#: Target weight-gradient to align with, flattened over the trainable params. Precompute the target weight-gradient externally and pass it here.

target_response_logits: Optional[Float[Tensor, 'target_seq_len vocab_size']]#

Target logits from a reference (e.g., jailbroken) model, one per target position.

Used by distillation-style losses (e.g., FLRT, https://arxiv.org/abs/2407.17447).

target_response_strs: Optional[str]#: Raw text target response for this message.

target_response_toks: Optional[Int[Tensor, 'target_seq_len']]#: Tokenized target response for this message.

target_vectors: Optional[Float[Tensor, 'd_model']]#: Target embedding vector for this message.

true_class_idx: Optional[int]#: True (current) class index for this message; the class to steer away from. Used by untargeted-misclassification losses on classifier outputs.

class tropt.common.Targets(**data)[source]#

Bases: BaseModel

Targets for all templates. Each field has an initial n_templates dimension.

Typically only one or two of these fields need to be provided depending on the loss function being used. For example, a standard LM jailbreak only needs target_response_strs to provide the target outputs.

Parameters:

target_response_strs (Annotated[List[str], 'n_templates'] | None)
target_response_toks (Int[Tensor, 'n_templates target_seq_len'] | Annotated[List[Int[Tensor, 'target_seq_len']], 'n_templates'] | None)
target_response_logits (Annotated[List[Float[Tensor, 'target_seq_len vocab_size']], 'n_templates'] | None)
target_vectors (Float[Tensor, 'n_templates d_model'] | None)
target_directions (Float[Tensor, 'n_templates d_model'] | None)
target_gradient (Float[Tensor, 'n_templates n_params'] | None)
target_class_idx (Annotated[List[int], 'n_templates'] | None)
true_class_idx (Annotated[List[int], 'n_templates'] | None)

check_field_lengths()[source]#

Return type:: Targets

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

property n_templates: int#
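
The page does not describe how check_field_lengths() and n_templates are implemented. One plausible reading, given that every field carries a leading n_templates dimension, is a consistency check over the fields that were provided. The sketch below uses a hypothetical `TargetsSketch` dataclass with plain lists instead of tensors; all names in it are assumptions.

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Optional


@dataclass
class TargetsSketch:
    """Hypothetical stand-in for Targets; plain lists replace tensors."""
    target_response_strs: Optional[List[str]] = None
    target_class_idx: Optional[List[int]] = None
    true_class_idx: Optional[List[int]] = None

    def _provided_lengths(self) -> Dict[str, int]:
        # Leading length of every field that was actually set.
        return {
            f.name: len(getattr(self, f.name))
            for f in fields(self)
            if getattr(self, f.name) is not None
        }

    @property
    def n_templates(self) -> int:
        lengths = self._provided_lengths()
        if not lengths:
            raise ValueError("no target field was provided")
        return next(iter(lengths.values()))

    def check_field_lengths(self) -> "TargetsSketch":
        # All provided fields must agree on the number of templates.
        lengths = self._provided_lengths()
        if len(set(lengths.values())) > 1:
            raise ValueError(f"inconsistent n_templates across fields: {lengths}")
        return self


ok = TargetsSketch(
    target_response_strs=["Sure, here is", "Of course"],
    target_class_idx=[1, 0],
).check_field_lengths()

try:
    TargetsSketch(target_response_strs=["a", "b"], true_class_idx=[3]).check_field_lengths()
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```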

select_indices(indices)[source]#

Return a new Targets with only the selected template indices.

Return type:: Targets
Parameters:: indices (List[int])

select_message(idx)[source]#

Return a MessageTargets instance for the selected template index.

Return type:: MessageTargets
Parameters:: idx (int)
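
To make the difference between select_indices() and select_message() concrete, here is a minimal sketch using hypothetical dataclasses with list-valued fields in place of tensors. It assumes select_indices() keeps the n_templates dimension while select_message() drops it; the real implementation may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MessageTargetsSketch:
    """Hypothetical per-message container (no n_templates dimension)."""
    target_response_strs: Optional[str] = None
    target_class_idx: Optional[int] = None


@dataclass
class TargetsSketch:
    """Hypothetical per-template container (leading n_templates dimension)."""
    target_response_strs: Optional[List[str]] = None
    target_class_idx: Optional[List[int]] = None

    def select_indices(self, indices: List[int]) -> "TargetsSketch":
        # Keep only the requested templates; unset fields stay None.
        def pick(values):
            return None if values is None else [values[i] for i in indices]
        return TargetsSketch(pick(self.target_response_strs), pick(self.target_class_idx))

    def select_message(self, idx: int) -> MessageTargetsSketch:
        # Take a single entry per field, dropping the n_templates dimension.
        def pick(values):
            return None if values is None else values[idx]
        return MessageTargetsSketch(pick(self.target_response_strs), pick(self.target_class_idx))


targets = TargetsSketch(target_response_strs=["A", "B", "C"])
subset = targets.select_indices([0, 2])
single = targets.select_message(1)
```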

target_class_idx: Optional[Annotated[List[int]]]#

Target class indices, one per template.

List of length n_templates. Used by: targeted-misclassification losses on classifier outputs.

target_directions: Optional[Float[Tensor, 'n_templates d_model']]#

Target directions in activation space, one per template.

Shape: (n_templates, d_model). Used by: steering losses (e.g., refusal suppression). Note: if you need per-layer directions, store them as (n_templates, n_layers, d_model) and update this annotation accordingly.

target_gradient: Optional[Float[Tensor, 'n_templates n_params']]#

Target weight-gradients, one per template (flattened over the trainable params).

Shape: (n_templates, n_params). Used by: gradient-matching losses. Precompute the per-template target weight-gradients externally and stack across templates.

target_response_logits: Optional[Annotated[List[Float[Tensor, 'target_seq_len vocab_size']]]]#

Target logits from a reference (e.g., jailbroken) model, one per target position.

Used by distillation-style losses (e.g., FLRT, https://arxiv.org/abs/2407.17447); one tensor per template.

List of length n_templates, each of (potentially different) shape (target_seq_len, vocab_size).

target_response_strs: Optional[Annotated[List[str]]]#

Raw text target outputs, one per template.

List of length n_templates. Used by: language models for target matching. Will be tokenized internally to produce target_response_toks if those are not provided directly.

If used with prefill-based losses, model computations are run automatically with the prefilled target response.

Note: any special tokens that must precede the actual response for the target model are the caller’s responsibility to include here. For example, thinking models (Qwen3, DeepSeek-R1, etc.) that were trained to begin every response with a <think>…</think> block typically need an empty block (e.g. "<think>\n\n</think>\n\n") prepended to the target string to suppress reasoning before the desired prefix.
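
The note above can be sketched as a small preprocessing step applied to each target string before it is passed as target_response_strs. The helper name `with_empty_think_block` is hypothetical, and the exact block a given model expects should be checked against its chat template.

```python
# The empty reasoning block quoted in the note above.
EMPTY_THINK_BLOCK = "<think>\n\n</think>\n\n"


def with_empty_think_block(target: str) -> str:
    """Hypothetical helper: prepend an empty reasoning block unless one is already present."""
    if target.startswith("<think>"):
        return target
    return EMPTY_THINK_BLOCK + target


raw_targets = ["Sure, here is", "<think>\n\n</think>\n\nOf course"]
target_response_strs = [with_empty_think_block(t) for t in raw_targets]
```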

target_response_toks: Union[Int[Tensor, 'n_templates target_seq_len'], Annotated[List[Int[Tensor, 'target_seq_len']]], None]#

Tokenized target outputs, one per template.

Shape: (n_templates, target_seq_len), or a list of length n_templates whose elements each have (potentially different) shape (target_seq_len,). Used by: language models for computing cross-entropy loss.

target_vectors: Optional[Float[Tensor, 'n_templates d_model']]#

Target embedding vectors, one per template.

Shape: (n_templates, d_model). Used by: encoder models for similarity-based losses.

to_device(device)[source]#

Return type:: Targets
Parameters:: device (device | str)

true_class_idx: Optional[Annotated[List[int]]]#

True (current) class indices, one per template; the class to steer away from.

List of length n_templates. Used by: untargeted-misclassification losses on classifier outputs.

Model I/O#

class tropt.common.ModelInput(**data)[source]#

Bases: BaseModel

Standardized input container returned by InputsManager.get_triggered_inputs().

This provides a uniform interface for model inputs, which can then be used to compute different losses independently of the underlying model type/implementation.

By convention, each such object carries the data of a single message, without mixing multiple messages.

Shape Notation:

bsz: batch size (typically n_candidates for a single message)
seq_len: total sequence length
trigger_seq_len: number of trigger tokens
d_model: embedding dimension

Examples

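>>> import torch
>>> from tropt.common import MessageTargets, ModelInput
>>> # Token-level input
>>> target_ids = torch.randint(0, 1000, (5,))  # placeholder target token IDs
>>> token_input = ModelInput(
...     input_trigger_ids=torch.randint(0, 1000, (4, 20)),
...     input_embeds=torch.randn(4, 100, 768),
...     input_attention_mask=torch.ones(4, 100, dtype=torch.long),
...     message_targets=MessageTargets(target_response_toks=target_ids)
... )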

>>> # Text-level input
>>> text_input = ModelInput(
...     input_texts=["Text with trigger 1", "Text with trigger 2"],
...     message_targets=MessageTargets(target_response_strs="Response 1")
... )

Parameters:

input_texts (Annotated[List[str], 'bsz'] | None)
input_trigger_strs (Annotated[List[str], 'bsz'] | None)
input_ids (Int[Tensor, 'bsz seq_len'] | None)
input_trigger_ids (Int[Tensor, 'bsz trigger_seq_len'] | None)
input_embeds (Float[Tensor, 'bsz seq_len d_model'] | None)
input_attention_mask (Int[Tensor, 'bsz seq_len'] | None)
input_prefix_cache_kwargs (Dict[str, Any] | None)
input_slices (Dict[SliceKey, slice | None] | None)
message_targets (MessageTargets | None)

input_attention_mask: Optional[Int[Tensor, 'bsz seq_len']]#

Binary attention mask for the input sequence. Passed to HuggingFace models to indicate valid token positions.

Shape: (batch_size, total_sequence_length).

input_embeds: Optional[Float[Tensor, 'bsz seq_len d_model']]#: Full input embeddings with trigger embeddings inserted, and potentially prefilled target tokens. Could be passed to model as inputs. Shape: (batch_size, total_sequence_length, embedding_dimension).

input_ids: Optional[Int[Tensor, 'bsz seq_len']]#: Token IDs of the full input sequence (prompt + trigger), plus optionally target tokens.

input_prefix_cache_kwargs: Optional[Dict[str, Any]]#: Keyword arguments for HuggingFace’s prefix caching (KV cache optimization).

input_slices: Optional[Dict[SliceKey, Optional[slice]]]#

Position slices marking different regions in the input sequence.

A dictionary mapping SliceKey to slice objects. Used to extract specific regions (trigger, input_before, input_after, appended) from model outputs like logits or hidden states.

Example

>>> input_slices = {
...     SliceKey.TRIGGER: slice(10, 30),
...     SliceKey.INPUT_BEFORE: slice(0, 10),
...     SliceKey.INPUT_AFTER: slice(30, 50),
...     SliceKey.APPENDED: slice(50, 60),
... }

Critical for loss functions that need to identify specific token positions in the output (e.g., target output region for cross-entropy loss).
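
As a small illustration of how such slices pick out regions, the sketch below applies the example slices to a plain Python list standing in for one row of per-position model outputs. The local `SliceKey` class mirrors part of the documented enum; the data is made up.

```python
from enum import Enum


class SliceKey(str, Enum):
    """Local mirror of four documented members, for illustration only."""
    TRIGGER = "trigger"
    INPUT_BEFORE = "input_before"
    INPUT_AFTER = "input_after"
    APPENDED = "appended"


input_slices = {
    SliceKey.INPUT_BEFORE: slice(0, 10),
    SliceKey.TRIGGER: slice(10, 30),
    SliceKey.INPUT_AFTER: slice(30, 50),
    SliceKey.APPENDED: slice(50, 60),
}

# Stand-in for one row of per-position outputs (one value per token position).
per_position = list(range(60))

trigger_part = per_position[input_slices[SliceKey.TRIGGER]]
appended_part = per_position[input_slices[SliceKey.APPENDED]]

# In this example the four regions tile the sequence with no gaps or overlaps.
covered = sorted(i for s in input_slices.values() for i in per_position[s])
```

With a batched tensor, the same slice object would typically index the sequence dimension, e.g. `logits[:, input_slices[SliceKey.APPENDED]]`.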

input_texts: Optional[Annotated[List[str]]]#: List of complete text strings with triggers inserted, of length batch_size.

input_trigger_ids: Optional[Int[Tensor, 'bsz trigger_seq_len']]#

Token IDs of the trigger candidates. Shape: (batch_size, trigger_sequence_length).

Used by some losses to compute trigger-specific metrics (e.g., perplexity of trigger).

input_trigger_strs: Optional[Annotated[List[str]]]#: List of trigger strings used in the inputs, of length batch_size.

message_targets: Optional[MessageTargets]#

Target data required by loss functions.

A MessageTargets instance containing the target data for a single message. The specific fields used depend on the loss function (e.g., target_response_strs for text-based losses, target_directions for steering losses).

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

to_dict()[source]#

Convert to a dictionary, excluding None values.

Return type:: Dict[str, Any]
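
The "excluding None values" behaviour can be pictured with a minimal stand-in. The dataclass below is hypothetical and carries only two of the documented fields; it is not the library's implementation.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List, Optional


@dataclass
class ModelInputSketch:
    """Hypothetical stand-in with two of the documented fields."""
    input_texts: Optional[List[str]] = None
    input_trigger_strs: Optional[List[str]] = None

    def to_dict(self) -> Dict[str, Any]:
        # Keep only the fields that were actually set.
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }


as_dict = ModelInputSketch(input_texts=["Text with trigger 1"]).to_dict()
```

On a Pydantic v2 model, `model_dump(exclude_none=True)` produces a comparable result; whether TROPT's to_dict() is built on it is not stated here.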

class tropt.common.ModelOutput(**data)[source]#

Bases: BaseModel

Standardized output container for all model types in TROPT.

This provides a uniform interface for model outputs, which can then be used to compute different losses independently of the underlying model type/implementation.

Shape Notation:

bsz: batch size (number of candidates or samples in a batch)
n_layers: number of model layers
n_heads: number of attention heads per layer
seq_len: total sequence length
response_len: length of generated response (variable per sample)
full_seq_len: full sequence length including prompt and generation
vocab_size: vocabulary size
d_model: model embedding dimension

Examples

>>> # Encoder model output
>>> encoder_output = ModelOutput(output_embeddings=torch.randn(4, 768))

>>> # Language model output with logits
>>> lm_output = ModelOutput(
...     full_logits=torch.randn(2, 50, 32000),
...     generated_response_strs=["Response 1", "Response 2"]
... )

>>> # Full output with hidden states and attentions
>>> full_output = ModelOutput(
...     full_logits=logits,
...     full_hidden_states=hidden_states,
...     full_attentions=attentions,
...     generated_response_strs=responses
... )

Parameters:

output_embeddings (Float[Tensor, 'bsz d_model'] | None)
full_logits (Float[Tensor, 'bsz seq_len vocab_size'] | None)
prefill_response_logits (Float[Tensor, 'bsz response_seq_len vocab_size'] | None)
full_hidden_states (Float[Tensor, 'bsz n_layers seq_len d_model'] | None)
full_attentions (Float[Tensor, 'bsz n_layers n_heads seq_len seq_len'] | None)
generated_response_ids (List[Int[Tensor, 'response_len']] | None)
generated_response_strs (List[str] | None)
generated_response_logits (List[Float[Tensor, 'response_len vocab_size']] | Float[Tensor, 'bsz response_len vocab_size'] | None)
response_first_token_logprobs (List[Dict[str, float]] | None)
output_class_logits (Float[Tensor, 'bsz n_classes'] | None)
full_ids (Int[Tensor, 'bsz full_seq_len'] | None)
full_strs (List[str] | None)

full_attentions: Optional[Float[Tensor, 'bsz n_layers n_heads seq_len seq_len']]#

Attention weights from all layers.

Note: Typically requires stacking tuple outputs from HuggingFace models: torch.stack(outputs.attentions, dim=1)

full_hidden_states: Optional[Float[Tensor, 'bsz n_layers seq_len d_model']]#

Hidden states from all layers.

Note: Typically requires stacking tuple outputs from HuggingFace models: torch.stack(outputs.hidden_states, dim=1)

full_ids: Optional[Int[Tensor, 'bsz full_seq_len']]#: Full template token IDs (prompt + generation; includes optional padding).

full_logits: Optional[Float[Tensor, 'bsz seq_len vocab_size']]#: Full sequence logits from language models; including both inputs and outputs (prefilled and generated).

full_strs: Optional[List[str]]#: Full template strings (prompt + generation).

generated_response_ids: Optional[List[Int[Tensor, 'response_len']]]#: Generated token IDs from language model generation. Response lengths may vary across samples.

generated_response_logits: Union[List[Float[Tensor, 'response_len vocab_size']], Float[Tensor, 'bsz response_len vocab_size'], None]#: Logits for generated tokens from language model generation. Notably, this differs from prefill_response_logits, which holds the logits computed over a prefilled (typically target) response. In particular, this excludes any prefilled tokens. Response lengths may vary across samples.

generated_response_strs: Optional[List[str]]#: Generated text strings from language model generation.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

output_class_logits: Optional[Float[Tensor, 'bsz n_classes']]#: Classification logits from classifier models (pre-softmax). Shape: (batch_size, num_classes)

output_embeddings: Optional[Float[Tensor, 'bsz d_model']]#: Pooled output embeddings from encoder models. Shape: (batch_size, d_model)

prefill_response_logits: Optional[Float[Tensor, 'bsz response_seq_len vocab_size']]#: Logits corresponding to the prefilled response portion of the sequence.

response_first_token_logprobs: Optional[List[Dict[str, float]]]#

Sparse log-probabilities for the first generated token.

List of length bsz, where each element is a dict mapping token strings to their log-probability. For API models this is typically the top-k returned by the provider (e.g. top-20 from OpenAI); for HF models it can cover the full vocabulary.
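
Because the per-sample dicts may be sparse (top-k only), code that reads them has to handle tokens that are absent. The sketch below shows one way to do that; the data, the helper `first_token_prob`, and the choice of a fallback value are all assumptions for illustration.

```python
import math
from typing import Dict, List

# Made-up data in the documented layout: one dict per sample,
# mapping token string -> log-probability.
response_first_token_logprobs: List[Dict[str, float]] = [
    {"Sure": math.log(0.7), "I": math.log(0.2)},
    {"I": math.log(0.9), "Sorry": math.log(0.05)},
]


def first_token_prob(logprobs: Dict[str, float], token: str, floor: float = 0.0) -> float:
    """Hypothetical helper: probability of `token`, or `floor` if it is not in the dict."""
    if token not in logprobs:
        return floor
    return math.exp(logprobs[token])


probs = [first_token_prob(lp, "Sure") for lp in response_first_token_logprobs]
```

A missing token only means it fell outside the returned top-k, so the true probability is bounded above by the smallest returned one rather than being exactly zero.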

to_dict()[source]#

Convert to a dictionary, excluding None values.

Return type:: Dict[str, Any]