Common Types#
Shared data containers and type definitions used throughout TROPT: trigger placeholders, model I/O wrappers, target containers, and slice keys.
- tropt.common.OPTIMIZED_TRIGGER_PLACEHOLDER: str = '{{OPTIMIZED_TRIGGER}}'#
- tropt.common.DEFAULT_INIT_TRIGGER = '! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !'#
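The two constants above can be pictured with a minimal sketch. It assumes the placeholder is swapped for a trigger by plain string substitution; how TROPT performs this internally is not documented here. The `fill_trigger` helper and the example template are hypothetical, and the constant values are copied locally so the snippet runs without TROPT installed.

```python
# Values copied from the constants documented above.
OPTIMIZED_TRIGGER_PLACEHOLDER = "{{OPTIMIZED_TRIGGER}}"
DEFAULT_INIT_TRIGGER = "! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !"


def fill_trigger(template: str, trigger: str = DEFAULT_INIT_TRIGGER) -> str:
    """Hypothetical helper: substitute a trigger string for the placeholder."""
    if OPTIMIZED_TRIGGER_PLACEHOLDER not in template:
        raise ValueError("template does not contain the trigger placeholder")
    return template.replace(OPTIMIZED_TRIGGER_PLACEHOLDER, trigger)


# Hypothetical template; the placeholder marks where the trigger goes.
template = "Summarize this article. {{OPTIMIZED_TRIGGER}}"
filled = fill_trigger(template)
custom = fill_trigger(template, trigger="please be brief")
```

Raising on a missing placeholder is a design choice of this sketch: a template without the marker would otherwise pass through silently with no trigger in it.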
Data Containers#
- class tropt.common.SliceKey(value)[source]#
Bases: str, Enum
Enum for standardized slice keys used in input embeddings.
- APPENDED = 'appended'#
Optional tokens appended (i.e., prefilled) at the end (e.g., target outputs for LMs)
- INPUT_AFTER = 'input_after'#
Tokens after the trigger (including chat template, if exists)
- INPUT_BEFORE = 'input_before'#
Tokens before the trigger (including chat template, if exists)
- INPUT_LAST_TOKEN = 'input_last_token'#
The last token of the input sequence (e.g., typically the end of the chat template before generation starts)
- TRIGGER = 'trigger'#
The optimized trigger tokens
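Because SliceKey derives from both str and Enum, its members behave like their string values in comparisons. The sketch below uses a local mirror of the enum, with member names and values copied from the listing above, so that it runs without TROPT; it is an illustration of the str/Enum pattern rather than the library's own code.

```python
from enum import Enum


class SliceKey(str, Enum):
    """Local mirror of the documented enum (values copied from the listing above)."""

    APPENDED = "appended"
    INPUT_AFTER = "input_after"
    INPUT_BEFORE = "input_before"
    INPUT_LAST_TOKEN = "input_last_token"
    TRIGGER = "trigger"


# The str mixin makes members compare equal to their plain string values.
equals_plain_string = SliceKey.TRIGGER == "trigger"

# Looking a member up by value returns the member itself.
looked_up = SliceKey("input_before")

# Members are genuine str instances, so they serialize as ordinary strings.
is_string = isinstance(SliceKey.APPENDED, str)

# Iteration follows definition order.
member_names = [member.name for member in SliceKey]
```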
- class tropt.common.MessageTargets(**data)[source]#
Bases: BaseModel
Targets for a single selected message.
- Parameters:
target_response_strs (str | None)
target_response_toks (Int[Tensor, 'target_seq_len'] | None)
target_response_logits (Float[Tensor, 'target_seq_len vocab_size'] | None)
target_vectors (Float[Tensor, 'd_model'] | None)
target_directions (Float[Tensor, 'd_model'] | None)
target_gradient (Float[Tensor, 'n_params'] | None)
target_class_idx (int | None)
true_class_idx (int | None)
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- target_class_idx: Optional[int]#
Target class index for this message. Used by targeted-misclassification losses on classifier outputs.
- target_directions: Optional[Float[Tensor, 'd_model']]#
Target direction in activation space for this message. Used by steering losses (e.g., representation engineering).
- target_gradient: Optional[Float[Tensor, 'n_params']]#
Target weight-gradient to align with, flattened over the trainable params. Precompute the target weight-gradient externally and pass it here.
- target_response_logits: Optional[Float[Tensor, 'target_seq_len vocab_size']]#
Target logits from a reference (e.g., jailbroken) model, one per target position.
Used by distillation-style losses (e.g., FLRT, https://arxiv.org/abs/2407.17447).
- target_response_strs: Optional[str]#
Raw text target response for this message.
- target_response_toks: Optional[Int[Tensor, 'target_seq_len']]#
Tokenized target response for this message.
- target_vectors: Optional[Float[Tensor, 'd_model']]#
Target embedding vector for this message.
- true_class_idx: Optional[int]#
True (current) class index for this message; the class to steer away from. Used by untargeted-misclassification losses on classifier outputs.
- class tropt.common.Targets(**data)[source]#
Bases: BaseModel
Targets for all templates. Each field has an initial n_templates dimension.
Typically only one or two of these fields need to be provided depending on the loss function being used. For example, a standard LM jailbreak only needs target_response_strs to provide the target outputs.
- Parameters:
target_response_strs (Annotated[List[str], 'n_templates'] | None)
target_response_toks (Int[Tensor, 'n_templates target_seq_len'] | Annotated[List[Int[Tensor, 'target_seq_len']], 'n_templates'] | None)
target_response_logits (Annotated[List[Float[Tensor, 'target_seq_len vocab_size']], 'n_templates'] | None)
target_vectors (Float[Tensor, 'n_templates d_model'] | None)
target_directions (Float[Tensor, 'n_templates d_model'] | None)
target_gradient (Float[Tensor, 'n_templates n_params'] | None)
target_class_idx (Annotated[List[int], 'n_templates'] | None)
true_class_idx (Annotated[List[int], 'n_templates'] | None)
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- property n_templates: int#
- select_indices(indices)[source]#
Return a new Targets with only the selected template indices.
- Return type:
Targets
- Parameters:
indices (List[int])
- select_message(idx)[source]#
Return a MessageTargets instance for the selected template index.
- Return type:
MessageTargets
- Parameters:
idx (int)
- target_class_idx: Optional[Annotated[List[int]]]#
Target class indices, one per template.
List of length n_templates. Used by: targeted-misclassification losses on classifier outputs.
- target_directions: Optional[Float[Tensor, 'n_templates d_model']]#
Target directions in activation space, one per template.
Shape: (n_templates, d_model). Used by: steering losses (e.g., refusal suppression). Note: if you need per-layer directions, store them as (n_templates, n_layers, d_model) and update this annotation accordingly.
- target_gradient: Optional[Float[Tensor, 'n_templates n_params']]#
Target weight-gradients, one per template (flattened over the trainable params).
Shape: (n_templates, n_params). Used by: gradient-matching losses. Precompute the per-template target weight-gradients externally and stack them across templates.
- target_response_logits: Optional[Annotated[List[Float[Tensor, 'target_seq_len vocab_size']]]]#
Target logits from a reference (e.g., jailbroken) model, one per target position.
Used by distillation-style losses (e.g., FLRT, https://arxiv.org/abs/2407.17447); one tensor per template.
List of length n_templates, each of (potentially different) shape (target_seq_len, vocab_size).
- target_response_strs: Optional[Annotated[List[str]]]#
Raw text target outputs, one per template.
List of length n_templates. Used by: language models for target matching. Tokenized internally to produce target_response_toks if those are not provided directly. With prefill-based losses, model computations are run automatically with the target response prefilled.
Note: any special tokens that must precede the actual response for the target model are the caller’s responsibility to include here. For example, thinking models (Qwen3, DeepSeek-R1, etc.) that were trained to begin every response with a <think>…</think> block typically need an empty block (e.g., "<think>\n\n</think>\n\n") prepended to the target string to suppress reasoning before the desired prefix.
- target_response_toks: Union[Int[Tensor, 'n_templates target_seq_len'],Annotated[List[Int[Tensor, 'target_seq_len']]],None]#
Tokenized target outputs, one per template.
Shape: (n_templates, target_seq_len), or a list of length n_templates, each of (potentially different) shape (target_seq_len,). Used by: language models for computing cross-entropy loss.
- target_vectors: Optional[Float[Tensor, 'n_templates d_model']]#
Target embedding vectors, one per template.
Shape: (n_templates, d_model). Used by: encoder models for similarity-based losses.
- true_class_idx: Optional[Annotated[List[int]]]#
True (current) class indices, one per template; the class to steer away from.
List of length n_templates. Used by: untargeted-misclassification losses on classifier outputs.
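The relationship between Targets (a leading n_templates axis on every field) and MessageTargets (one template's values) can be sketched with a small stand-in. `TargetsSketch` below is hypothetical: it is a plain dataclass with only two list-valued fields, it returns a dict where the real select_message returns a MessageTargets, and the way it derives n_templates is an assumption. The example also prepends the empty think block quoted in the target_response_strs note.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List, Optional

# Empty reasoning block, as quoted in the target_response_strs note.
EMPTY_THINK_BLOCK = "<think>\n\n</think>\n\n"


@dataclass
class TargetsSketch:
    """Hypothetical stand-in for Targets, limited to two list-valued fields."""

    target_response_strs: Optional[List[str]] = None
    target_class_idx: Optional[List[int]] = None

    @property
    def n_templates(self) -> int:
        # Assumption: the size is read from the first field that was provided.
        for f in fields(self):
            value = getattr(self, f.name)
            if value is not None:
                return len(value)
        return 0

    def select_indices(self, indices: List[int]) -> "TargetsSketch":
        # Keep the leading n_templates axis, restricted to `indices`.
        picked = {}
        for f in fields(self):
            value = getattr(self, f.name)
            picked[f.name] = None if value is None else [value[i] for i in indices]
        return TargetsSketch(**picked)

    def select_message(self, idx: int) -> Dict[str, Any]:
        # Drop the n_templates axis; a dict stands in for MessageTargets here.
        selected = {}
        for f in fields(self):
            value = getattr(self, f.name)
            selected[f.name] = None if value is None else value[idx]
        return selected


# Hypothetical target prefixes, one per template.
raw_targets = ["Sure, here is", "Of course:", "Certainly,"]
targets = TargetsSketch(
    target_response_strs=[EMPTY_THINK_BLOCK + text for text in raw_targets]
)
subset = targets.select_indices([0, 2])
message = targets.select_message(1)
```

Fields that were never provided stay None through both selection methods, which mirrors the statement above that only one or two fields are typically filled for a given loss.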
Model I/O#
- class tropt.common.ModelInput(**data)[source]#
Bases: BaseModel
Standardized input container returned by InputsManager.get_triggered_inputs().
This provides a uniform interface for model inputs, which can then be used to compute different losses regardless of the underlying model type or implementation.
By convention, each such object carries the data of a single message, without mixing multiple messages.
- Shape Notation:
bsz: batch size (typically n_candidates for a single message)
seq_len: total sequence length
trigger_seq_len: number of trigger tokens
d_model: embedding dimension
Examples
>>> # Token-level input
>>> token_input = ModelInput(
...     input_trigger_ids=torch.randint(0, 1000, (4, 20)),
...     input_embeds=torch.randn(4, 100, 768),
...     input_attention_mask=torch.ones(4, 100, dtype=torch.long),
...     message_targets=MessageTargets(target_response_toks=target_ids)
... )
>>> # Text-level input
>>> text_input = ModelInput(
...     input_texts=["Text with trigger 1", "Text with trigger 2"],
...     message_targets=MessageTargets(target_response_strs="Response 1")
... )
- Parameters:
input_texts (Annotated[List[str], 'bsz'] | None)
input_trigger_strs (Annotated[List[str], 'bsz'] | None)
input_ids (Int[Tensor, 'bsz seq_len'] | None)
input_trigger_ids (Int[Tensor, 'bsz trigger_seq_len'] | None)
input_embeds (Float[Tensor, 'bsz seq_len d_model'] | None)
input_attention_mask (Int[Tensor, 'bsz seq_len'] | None)
input_prefix_cache_kwargs (Dict[str, Any] | None)
input_slices (Dict[SliceKey, slice | None] | None)
message_targets (MessageTargets | None)
- input_attention_mask: Optional[Int[Tensor, 'bsz seq_len']]#
Binary attention mask for the input sequence.
Passed to HuggingFace models to indicate valid token positions.
Shape: (batch_size, total_sequence_length).
- input_embeds: Optional[Float[Tensor, 'bsz seq_len d_model']]#
Full input embeddings with trigger embeddings inserted, and potentially prefilled target tokens. Can be passed to the model as inputs. Shape: (batch_size, total_sequence_length, embedding_dimension).
- input_ids: Optional[Int[Tensor, 'bsz seq_len']]#
Token IDs of the full input sequence (prompt + trigger), plus optionally target tokens.
- input_prefix_cache_kwargs: Optional[Dict[str,Any]]#
Keyword arguments for HuggingFace’s prefix caching (KV cache optimization).
- input_slices: Optional[Dict[SliceKey,Optional[slice]]]#
Position slices marking different regions in the input sequence.
A dictionary mapping SliceKey to slice objects (or None). Used to extract specific regions (trigger, input_before, input_after, appended) from model outputs like logits or hidden states.
Example
>>> input_slices = {
...     SliceKey.TRIGGER: slice(10, 30),
...     SliceKey.INPUT_BEFORE: slice(0, 10),
...     SliceKey.INPUT_AFTER: slice(30, 50),
...     SliceKey.APPENDED: slice(50, 60)
... }
Critical for loss functions that need to identify specific token positions in the output (e.g., target output region for cross-entropy loss).
- input_texts: Optional[Annotated[List[str]]]#
List of complete text strings with triggers inserted, of length batch_size.
- input_trigger_ids: Optional[Int[Tensor, 'bsz trigger_seq_len']]#
Token IDs of the trigger candidates. Shape: (batch_size, trigger_sequence_length).
Used by some losses to compute trigger-specific metrics (e.g., perplexity of trigger).
- input_trigger_strs: Optional[Annotated[List[str]]]#
List of trigger strings used in the inputs, of length batch_size.
- message_targets: Optional[MessageTargets]#
Target data required by loss functions.
A MessageTargets instance containing the target data for a single message. The specific fields used depend on the loss function (e.g., target_response_strs for text-based losses, target_directions for steering losses).
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
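The role of the input_slices field of ModelInput can be illustrated with a pure-Python sketch that cuts regions out of a token sequence. It uses a local mirror of SliceKey and a plain list in place of a tensor, and the `extract_region` helper is hypothetical. For a tensor shaped (bsz, seq_len, ...), the analogous indexing would presumably be along the sequence axis, e.g. `logits[:, region]`.

```python
from enum import Enum
from typing import Dict, List, Optional


class SliceKey(str, Enum):
    """Local mirror of the documented enum, so the snippet runs without TROPT."""

    APPENDED = "appended"
    INPUT_AFTER = "input_after"
    INPUT_BEFORE = "input_before"
    INPUT_LAST_TOKEN = "input_last_token"
    TRIGGER = "trigger"


def extract_region(
    sequence: List[str],
    input_slices: Dict[SliceKey, Optional[slice]],
    key: SliceKey,
) -> List[str]:
    """Hypothetical helper: return the items of one region, or [] if it is absent."""
    region = input_slices.get(key)
    if region is None:
        return []
    return sequence[region]


# A stand-in sequence of 60 positions, matching the slice example above.
tokens = [f"tok{i}" for i in range(60)]
input_slices = {
    SliceKey.TRIGGER: slice(10, 30),
    SliceKey.INPUT_BEFORE: slice(0, 10),
    SliceKey.INPUT_AFTER: slice(30, 50),
    SliceKey.APPENDED: slice(50, 60),
}

trigger_tokens = extract_region(tokens, input_slices, SliceKey.TRIGGER)
appended_tokens = extract_region(tokens, input_slices, SliceKey.APPENDED)
missing_region = extract_region(tokens, input_slices, SliceKey.INPUT_LAST_TOKEN)
```

Treating a missing key and a None value the same way keeps callers simple, since the field's type allows either.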
- class tropt.common.ModelOutput(**data)[source]#
Bases: BaseModel
Standardized output container for all model types in TROPT.
This provides a uniform interface for model outputs, which can then be used to compute different losses regardless of the underlying model type or implementation.
- Shape Notation:
bsz: batch size (number of candidates or samples in a batch)
n_layers: number of model layers
n_heads: number of attention heads per layer
seq_len: total sequence length
response_len: length of generated response (variable per sample)
full_seq_len: full sequence length including prompt and generation
vocab_size: vocabulary size
d_model: model embedding dimension
Examples
>>> # Encoder model output
>>> encoder_output = ModelOutput(output_embeddings=torch.randn(4, 768))
>>> # Language model output with logits
>>> lm_output = ModelOutput(
...     full_logits=torch.randn(2, 50, 32000),
...     generated_response_strs=["Response 1", "Response 2"]
... )
>>> # Full output with hidden states and attentions
>>> full_output = ModelOutput(
...     full_logits=logits,
...     full_hidden_states=hidden_states,
...     full_attentions=attentions,
...     generated_response_strs=responses
... )
- Parameters:
output_embeddings (Float[Tensor, 'bsz d_model'] | None)
full_logits (Float[Tensor, 'bsz seq_len vocab_size'] | None)
prefill_response_logits (Float[Tensor, 'bsz response_seq_len vocab_size'] | None)
full_hidden_states (Float[Tensor, 'bsz n_layers seq_len d_model'] | None)
full_attentions (Float[Tensor, 'bsz n_layers n_heads seq_len seq_len'] | None)
generated_response_ids (List[Int[Tensor, 'response_len']] | None)
generated_response_strs (List[str] | None)
generated_response_logits (List[Float[Tensor, 'response_len vocab_size']] | Float[Tensor, 'bsz response_len vocab_size'] | None)
response_first_token_logprobs (List[Dict[str, float]] | None)
output_class_logits (Float[Tensor, 'bsz n_classes'] | None)
full_ids (Int[Tensor, 'bsz full_seq_len'] | None)
full_strs (List[str] | None)
- full_attentions: Optional[Float[Tensor, 'bsz n_layers n_heads seq_len seq_len']]#
Attention weights from all layers.
Note: Typically requires stacking tuple outputs from HuggingFace models:
torch.stack(outputs.attentions, dim=1)
- full_hidden_states: Optional[Float[Tensor, 'bsz n_layers seq_len d_model']]#
Hidden states from all layers.
Note: Typically requires stacking tuple outputs from HuggingFace models:
torch.stack(outputs.hidden_states, dim=1)
- full_ids: Optional[Int[Tensor, 'bsz full_seq_len']]#
Full template token IDs (prompt + generation; includes optional padding).
- full_logits: Optional[Float[Tensor, 'bsz seq_len vocab_size']]#
Full sequence logits from language models, including both inputs and outputs (prefilled and generated).
- full_strs: Optional[List[str]]#
Full template strings (prompt + generation).
- generated_response_ids: Optional[List[Int[Tensor, 'response_len']]]#
Generated token IDs from language model generation. Response lengths may vary across samples.
- generated_response_logits: Union[List[Float[Tensor, 'response_len vocab_size']],Float[Tensor, 'bsz response_len vocab_size'],None]#
Logits for generated tokens from language model generation. Notably, this differs from prefill_response_logits, which holds the logits w.r.t. a prefilled (mostly target) response. In particular, this excludes any prefilled tokens. Response lengths may vary across samples.
- generated_response_strs: Optional[List[str]]#
Generated text strings from language model generation.
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- output_class_logits: Optional[Float[Tensor, 'bsz n_classes']]#
Classification logits from classifier models (pre-softmax). Shape: (batch_size, num_classes).
- output_embeddings: Optional[Float[Tensor, 'bsz d_model']]#
Pooled output embeddings from encoder models. Shape: (batch_size, d_model).
- prefill_response_logits: Optional[Float[Tensor, 'bsz response_seq_len vocab_size']]#
Logits corresponding to the prefilled response portion of the sequence.
- response_first_token_logprobs: Optional[List[Dict[str,float]]]#
Sparse log-probabilities for the first generated token.
List of length bsz, where each element is a dict mapping token strings to their log-probability. For API models this is typically the top-k returned by the provider (e.g., top-20 from OpenAI); for HF models it can cover the full vocabulary.
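Since response_first_token_logprobs is sparse, a consumer has to decide what to do when the token of interest is missing from the returned top-k. The sketch below is one possible convention, not TROPT's: the hypothetical `first_token_prob` helper falls back to a floor log-probability, which defaults to negative infinity (probability zero). The sample values are made up for illustration.

```python
import math
from typing import Dict, List


def first_token_prob(
    logprobs: Dict[str, float],
    token: str,
    floor_logprob: float = -math.inf,
) -> float:
    """Hypothetical helper: probability of `token`, using a floor if it is absent."""
    return math.exp(logprobs.get(token, floor_logprob))


# Made-up data in the documented layout: one dict per batch element.
response_first_token_logprobs: List[Dict[str, float]] = [
    {"Sure": math.log(0.6), "I": math.log(0.3)},
    {"I": math.log(0.9), "Sorry": math.log(0.05)},
]

# Probability that each response starts with the token "Sure".
sure_probs = [
    first_token_prob(logprobs, "Sure")
    for logprobs in response_first_token_logprobs
]

# A finite floor gives absent tokens a small non-zero probability instead of zero.
floored_prob = first_token_prob(
    response_first_token_logprobs[1], "Sure", floor_logprob=math.log(1e-6)
)
```

With API top-k output, a zero from this helper only means the token fell outside the returned set, so a finite floor may be the safer choice when the value feeds a log-based loss.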