Losses#

Loss Resolution#

The unified entry point for computing any loss — models call this instead of invoking loss functions directly. See Common Types for ModelInput / ModelOutput.

tropt.loss.resolution.resolve_and_compute_loss(model_output, model_input, loss_func)[source]#

Universal loss computation via automatic argument matching.

This function invokes loss_func with the model input (which includes the targets and slices) and the model output, and returns the computed loss tensor of shape (bsz,).

Notes

Since loss functions in TROPT declare their required arguments following the naming convention of ModelOutput and ModelInput fields, this function can automatically resolve which data to provide to the loss function by inspecting its __call__ signature.

In case insufficient arguments are available (e.g., because the model does not provide the required access), a LossResolutionError is raised with details on what is missing.

Parameter Naming Convention:
Loss functions should name their parameters exactly as they appear in ModelOutput and ModelInput, for example:

- full_logits: From ModelOutput
- output_embeddings: From ModelOutput
- full_attentions: From ModelOutput
- input_trigger_ids: From ModelInput
- input_slices: From ModelInput
- message_targets: From ModelInput
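The name-matching idea described in these notes can be illustrated with a small, self-contained sketch. This is not TROPT's implementation: ToyModelOutput, ToyModelInput, ToyEmbeddingLoss, and resolve_and_compute are hypothetical stand-ins, and the real function may handle defaults, nested targets, and error details differently.

```python
import inspect
from dataclasses import dataclass, fields
from typing import Any, Optional


class LossResolutionError(Exception):
    """Stand-in for tropt.loss.resolution.LossResolutionError."""


@dataclass
class ToyModelOutput:
    # Hypothetical subset of ModelOutput fields.
    full_logits: Optional[Any] = None
    output_embeddings: Optional[Any] = None


@dataclass
class ToyModelInput:
    # Hypothetical subset of ModelInput fields.
    input_slices: Optional[Any] = None
    message_targets: Optional[Any] = None


def resolve_and_compute(model_output, model_input, loss_func):
    # Collect every field that was actually populated (not None).
    available = {}
    for container in (model_input, model_output):
        for f in fields(container):
            value = getattr(container, f.name)
            if value is not None:
                available[f.name] = value

    # Match the loss's declared parameter names against the available fields.
    kwargs = {}
    signature = inspect.signature(loss_func.__call__)
    for name, param in signature.parameters.items():
        if param.kind in (inspect.Parameter.VAR_POSITIONAL,
                          inspect.Parameter.VAR_KEYWORD):
            continue
        if name in available:
            kwargs[name] = available[name]
        elif param.default is inspect.Parameter.empty:
            raise LossResolutionError(
                f"Loss function requires parameter '{name}' but it was not "
                "found in model_output or model_input."
            )
    return loss_func(**kwargs)


class ToyEmbeddingLoss:
    # Parameter names mirror the container field names.
    def __call__(self, output_embeddings, message_targets):
        return [abs(e - t) for e, t in zip(output_embeddings, message_targets)]


out = ToyModelOutput(output_embeddings=[1.0, 2.0])
inp = ToyModelInput(message_targets=[0.5, 2.5])
losses = resolve_and_compute(out, inp, ToyEmbeddingLoss())  # one value per sample
```

In this sketch, a loss whose required parameter is absent from both containers fails before it is ever called, which is where an informative LossResolutionError message is most useful.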

Parameters:

model_output (ModelOutput) – Standardized model output containing available data
model_input (ModelInput) – Standardized model input containing triggers, slices, message targets
loss_func (BaseLoss) – The loss function to compute (must have proper __call__ signature)

Return type:

Float[Tensor, 'bsz']

Returns:

Loss tensor of shape (bsz,) containing per-sample losses

Raises:

LossResolutionError – If a required parameter is not found in the model data
TypeError – If loss function signature is invalid

Examples

>>> import torch
>>> from tropt.common import MessageTargets
>>> from tropt.loss.resolution import resolve_and_compute_loss
>>> # ModelOutput, ModelInput, and the loss classes are assumed to be imported already;
>>> # target_vecs and target_ids are target tensors defined elsewhere.
>>> # Encoder model with SimilarityLoss(output_embeddings, target_vectors)
>>> output = ModelOutput(output_embeddings=torch.randn(4, 768))
>>> input_data = ModelInput(message_targets=MessageTargets(target_vectors=target_vecs))
>>> loss = resolve_and_compute_loss(output, input_data, SimilarityLoss())

>>> # Language model with PrefillCELoss(prefill_response_logits, message_targets)
>>> output = ModelOutput(prefill_response_logits=torch.randn(2, 50, 32000))
>>> input_data = ModelInput(
...     message_targets=MessageTargets(target_response_toks=target_ids)
... )
>>> loss = resolve_and_compute_loss(output, input_data, PrefillCELoss())

class tropt.loss.resolution.LossResolutionError[source]#

Bases: Exception

Raised when required data is missing for loss computation.

This exception indicates that a loss function requires specific model outputs or inputs that were not provided. The error message should clearly state what parameter is missing and where it should come from.

Examples

>>> raise LossResolutionError(
...     "Loss function requires parameter 'full_logits' but it was not "
...     "found in model_output or model_input."
... )

Loss Class Interfaces#

class tropt.loss.BaseLoss[source]#

Bases: ABC

Base class for all loss functions.

contains_loss_type(loss_type)[source]#

Returns True if this loss is of the given type. Composite losses (e.g., CombinedLoss) may override this method with different logic.

Return type: bool
Parameters: loss_type (type)

get_loss_log_dict()[source]#

Returns a loggable dict of the last computed loss value, keyed by loss class name. Useful for verbose loss logging in optimizers.

Return type: dict

is_differentiable: ClassVar[bool] = True#: Whether this loss supports backpropagation. Set to False for losses that use external models, text generation, or other non-differentiable operations.

require_attentions: ClassVar[bool] = False#: Whether this loss requires the model to return attention weights.

require_first_token_logprobs: ClassVar[bool] = False#: Whether this loss requires first-token log-probabilities from generation.

require_generation: ClassVar[bool] = False#: Whether this loss requires autoregressive generation.

require_gradients: ClassVar[bool] = False#: Whether the loss-ranking path (otherwise run under torch.no_grad) must keep a live autograd graph for this loss. Set to True by losses whose value is itself a gradient (e.g., gradient matching).

require_hidden_states: ClassVar[bool] = False#: Whether this loss requires the model to provide the forward pass’s hidden states.

require_target_prefill: ClassVar[bool] = False#: Whether this loss requires the model to prefill the target response tokens (appending them to the input, as a response prefix).
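As an illustration of how these class-level flags might be consumed, the sketch below has a caller inspect a loss before running the model. Everything here is hypothetical: ToyBaseLoss only mirrors some of the documented defaults, and plan_forward_pass and its dictionary keys are invented for this example rather than taken from TROPT.

```python
from typing import ClassVar


class ToyBaseLoss:
    # Stand-in for BaseLoss; defaults mirror the documented class attributes.
    is_differentiable: ClassVar[bool] = True
    require_attentions: ClassVar[bool] = False
    require_generation: ClassVar[bool] = False
    require_hidden_states: ClassVar[bool] = False
    require_target_prefill: ClassVar[bool] = False


class ToyAttentionLoss(ToyBaseLoss):
    # A subclass opts in by overriding the relevant flag.
    require_attentions: ClassVar[bool] = True


class ToyJudgeLoss(ToyBaseLoss):
    # E.g. a loss that scores generated text with an external model.
    is_differentiable: ClassVar[bool] = False
    require_generation: ClassVar[bool] = True


def plan_forward_pass(loss):
    """Return a hypothetical dict of what the model call should provide."""
    return {
        "output_attentions": loss.require_attentions,
        "output_hidden_states": loss.require_hidden_states,
        "run_generation": loss.require_generation,
        "prefill_targets": loss.require_target_prefill,
        "can_backprop": loss.is_differentiable,
    }


attention_plan = plan_forward_pass(ToyAttentionLoss())
judge_plan = plan_forward_pass(ToyJudgeLoss())
```

Declaring requirements as class attributes lets the caller decide what to compute without instantiating anything expensive or running the loss first.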

class tropt.loss.TriggerLogitBasedLoss[source]#

Bases: BaseLoss

Loss computed on full-sequence logits (full_logits) sliced to trigger positions. Useful for optimizing properties of the triggers directly.

class tropt.loss.AttentionBasedLoss[source]#

Bases: BaseLoss

Loss computed on model attention weights (full_attentions).

require_attentions: ClassVar[bool] = True#: Whether this loss requires the model to return attention weights.

class tropt.loss.EmbeddingBasedLoss[source]#

Bases: BaseLoss

Loss computed on model embeddings, compared against the given target vectors.

Requires the target vectors (shape: (n_templates, d_model)) to be provided in the targets dict.

class tropt.loss.TextBasedLoss[source]#

Bases: BaseLoss

Marker base for losses that operate on text fields (e.g. input_texts, generated_response_strs).

is_differentiable: ClassVar[bool] = False#: Whether this loss supports backpropagation. Set to False for losses that use external models, text generation, or other non-differentiable operations.

class tropt.loss.SteeringActivationLoss(targeted_layers=slice(None, None, None), steer_away=False, slc_name=SliceKey.INPUT_LAST_TOKEN, do_cosine_sim=False, apply_square=False, apply_abs=False)[source]#

Bases: HiddenStateBasedLoss

Encourages hidden activations at specific layers/positions to align with a target direction.

- Each message has a target direction vector (optionally its own unique one), given as target_directions with shape (n_templates, d_model).
- The direction is applied across all targeted positions and layers.

By default, the loss steers towards the direction (maximizing alignment):
- Minimizing the loss maximizes alignment (dot product) with the target direction.
- Set steer_away=True to steer away (e.g., for refusal suppression).

References:
- Proposed as ‘refusal direction suppression’ combined with GCG: https://aclanthology.org/2025.naacl-long.302/
- Proposed for adapting attacks (e.g., GCG) to evade probe-based classifiers: https://arxiv.org/abs/2412.09565

Parameters:

targeted_layers (slice) – Which layers to apply steering on (default: all layers)
steer_away (bool) – Whether to minimize alignment instead of maximizing (default: False = steer towards)
slc_name (SliceKey) – Which token positions to apply steering on (default: “input_last_token”)
do_cosine_sim (bool) – Whether to use cosine similarity instead of dot product (default: False)
apply_square (bool) – Whether to square the similarity scores (default: False)
apply_abs (bool) – Whether to take the absolute value of the similarity scores (default: False).

apply_abs: bool = False#

apply_square: bool = False#

do_cosine_sim: bool = False#

slc_name: SliceKey = 'input_last_token'#

steer_away: bool = False#

targeted_layers: slice = slice(None, None, None)#
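To make the sign convention and the similarity options of SteeringActivationLoss concrete, here is a minimal single-vector sketch in plain Python. It is an illustration only: steering_loss is a hypothetical helper, the order in which apply_square and apply_abs are applied is an assumption, and the reduction over layers and token positions that the real loss performs is not modeled.

```python
import math


def steering_loss(activation, direction, steer_away=False,
                  do_cosine_sim=False, apply_square=False, apply_abs=False):
    """Toy version of the steering loss for one activation vector."""
    # Dot product between the activation and the target direction.
    sim = sum(a * d for a, d in zip(activation, direction))
    if do_cosine_sim:
        # Normalize by both vector norms (assumes neither is the zero vector).
        norm = math.sqrt(sum(a * a for a in activation)) * math.sqrt(
            sum(d * d for d in direction))
        sim = sim / norm
    # Assumed order: square first, then absolute value.
    if apply_square:
        sim = sim ** 2
    if apply_abs:
        sim = abs(sim)
    # Steering towards: lower loss means higher alignment, hence the minus sign.
    # Steering away: the similarity itself is minimized.
    return sim if steer_away else -sim


towards = steering_loss([3.0, 4.0], [1.0, 0.0])
away = steering_loss([3.0, 4.0], [1.0, 0.0], steer_away=True)
cosine = steering_loss([3.0, 4.0], [1.0, 0.0], do_cosine_sim=True)
```

With the dot product, scaling the activation also scales the loss, whereas the cosine variant depends only on the angle between the activation and the direction.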

class tropt.loss.CombinedLoss(loss_funcs, weights=None)[source]#

Bases: BaseLoss

Combines multiple losses with given weights.

Parameters:

loss_funcs (List[BaseLoss])
weights (Optional[List[float]])

contains_loss_type(loss_type)[source]#

Check if the CombinedLoss contains a loss of the specified type.

Return type: bool
Parameters: loss_type (type)

get_loss_log_dict()[source]#

Returns a loggable dict of the last computed loss value (of all the component losses), keyed by loss class name. Useful for verbose loss logging in optimizers.

Return type: dict

property is_differentiable: bool#

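The following sketch shows one plausible way a weighted combination like CombinedLoss could behave. It is written under several assumptions that the reference above does not confirm: that missing weights default to 1.0, that the combined value is a weighted sum, and that the combination is differentiable only when every component is. All Toy* names are hypothetical.

```python
class ToyLoss:
    is_differentiable = True

    def __call__(self, values):
        raise NotImplementedError

    def contains_loss_type(self, loss_type):
        return isinstance(self, loss_type)


class ToyMean(ToyLoss):
    def __call__(self, values):
        return sum(values) / len(values)


class ToyMax(ToyLoss):
    is_differentiable = False  # pretend this one cannot be back-propagated

    def __call__(self, values):
        return max(values)


class ToyCombinedLoss(ToyLoss):
    def __init__(self, loss_funcs, weights=None):
        self.loss_funcs = list(loss_funcs)
        # Assumption: missing weights default to 1.0 for every component.
        if weights is None:
            weights = [1.0] * len(self.loss_funcs)
        self.weights = list(weights)
        if len(self.weights) != len(self.loss_funcs):
            raise ValueError("need exactly one weight per loss")

    def __call__(self, values):
        # Assumption: the combined value is a weighted sum of the components.
        return sum(w * f(values) for w, f in zip(self.weights, self.loss_funcs))

    def contains_loss_type(self, loss_type):
        # Delegate so nested combinations are searched as well.
        return any(f.contains_loss_type(loss_type) for f in self.loss_funcs)

    @property
    def is_differentiable(self):
        # Assumption: differentiable only if every component is.
        return all(f.is_differentiable for f in self.loss_funcs)


combined = ToyCombinedLoss([ToyMean(), ToyMax()], weights=[1.0, 0.5])
value = combined([1.0, 2.0, 3.0])
```

Exposing is_differentiable as a property rather than a class attribute lets the answer depend on the components supplied at construction time.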