Why Helios?¶
Helios is built around two principles: explicitness and simplicity. It removes training boilerplate without hiding what is happening. Every step in the training loop is visible and overridable, which makes the code easy to follow and debug.
The goal of this page is to help you decide whether Helios is a good fit for your project. It covers:
A comparison between Helios and other popular frameworks, and
A description of its registry system.
Comparison with Other Frameworks¶
Compared to larger frameworks, Helios prioritises explicitness and simplicity over automation. Rather than inferring behaviour from annotations or hooks, Helios requires you to state your intent explicitly by either overriding functions or opting in to features in code that you own.
The table below contrasts Helios with PyTorch Lightning and Ignite across several areas that matter for research and engineering:
| Feature | Helios | Lightning | Ignite |
|---|---|---|---|
| Mixed precision | Helper functions | Automatic | Manual |
| Reproducible resume | Built-in | Partial | Manual |
| Distributed training | | | Manual |
| Boilerplate style | Explicit | Hook-based | Event-based |
| Training unit | First-class | Epoch-based | Epoch-based |
| Gradient accumulation | Iteration-aware | Epoch-based | Manual |
| Registry system | Built-in | None | None |
| Learning curve | Low | Medium | High |
| Debuggability | High | Medium | High |
Mixed precision: Helios exposes create_scaler(), autocast(), and clip_gradients() as explicit helper functions. You opt in deliberately and interact with the scaler directly, so the behaviour is always clear. Lightning applies mixed precision automatically based on a trainer flag, while Ignite leaves it entirely to the user.
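To make "interacting with the scaler directly" concrete, here is a minimal pure-Python sketch of dynamic loss scaling, the general technique behind gradient scalers in mixed-precision training. It is not Helios or PyTorch code: the class name `DynamicLossScaler` and its methods are hypothetical, and the default constants are only assumed to resemble typical scaler defaults.

```python
import math


class DynamicLossScaler:
    """Hypothetical scaler illustrating dynamic loss scaling (not a Helios API)."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale_value = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0  # consecutive steps with finite gradients

    def scale(self, loss):
        # Multiply the loss so that small gradients survive reduced precision.
        return loss * self.scale_value

    def unscale(self, grads):
        # Divide gradients by the same factor before the optimiser uses them.
        return [g / self.scale_value for g in grads]

    def update(self, grads):
        """Return True if the step is safe to apply, and adjust the scale."""
        if any(not math.isfinite(g) for g in grads):
            # Overflow: shrink the scale and tell the caller to skip the step.
            self.scale_value *= self.backoff_factor
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps == self.growth_interval:
            # A long run of finite gradients: try a larger scale.
            self.scale_value *= self.growth_factor
            self._good_steps = 0
        return True


scaler = DynamicLossScaler(init_scale=8.0, growth_interval=2)
applied = [scaler.update([1.0]), scaler.update([2.0]), scaler.update([math.inf])]
```

In this run the first two steps are applied and the scale grows from 8 to 16; the third step overflows, is skipped, and the scale backs off to 8. Because every call is written out in user code, there is no hidden point at which a step is silently dropped.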
Reproducible resume: When resuming from a checkpoint, Helios guarantees that three things are restored to the exact state they were in when the checkpoint was saved:
The training state (internal and user-defined),
the RNG state,
and the sequence of batches.
The last point is the key differentiator: Helios uses resumable samplers by default, so the dataloader picks up from exactly the same position in the dataset. Lightning saves core training state and RNG state but does not provide the batch sequence guarantee by default. Ignite provides no built-in resumption support.
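The batch-sequence guarantee can be illustrated with a small, self-contained sketch. The `ResumableSampler` class below is hypothetical and uses only the standard library; it is not Helios's sampler, only an assumption about how such a sampler could work: the shuffle order is derived from a seed and the epoch number, and the position within the epoch is part of the saved state.

```python
import random


class ResumableSampler:
    """Hypothetical sampler whose position can be checkpointed and restored."""

    def __init__(self, dataset_size: int, seed: int = 0) -> None:
        self.dataset_size = dataset_size
        self.seed = seed
        self.epoch = 0
        self.position = 0  # number of indices already served this epoch

    def _order(self) -> list[int]:
        # The permutation depends only on (seed, epoch), so it can be rebuilt
        # after a restart without storing the full index list.
        rng = random.Random(self.seed * 1_000_003 + self.epoch)
        order = list(range(self.dataset_size))
        rng.shuffle(order)
        return order

    def __iter__(self):
        order = self._order()
        while self.position < self.dataset_size:
            index = order[self.position]
            self.position += 1
            yield index
        # Epoch finished: advance to the next permutation.
        self.epoch += 1
        self.position = 0

    def state_dict(self) -> dict[str, int]:
        return {"seed": self.seed, "epoch": self.epoch, "position": self.position}

    def load_state_dict(self, state: dict[str, int]) -> None:
        self.seed = state["seed"]
        self.epoch = state["epoch"]
        self.position = state["position"]


# Uninterrupted run over one epoch.
reference = list(ResumableSampler(dataset_size=10, seed=42))

# Interrupted run: serve four indices, checkpoint, resume in a fresh object.
first = ResumableSampler(dataset_size=10, seed=42)
iterator = iter(first)
head = [next(iterator) for _ in range(4)]
checkpoint = first.state_dict()

second = ResumableSampler(dataset_size=10, seed=42)
second.load_state_dict(checkpoint)
tail = list(second)
```

Here `head + tail` equals `reference`: the resumed run sees exactly the indices the uninterrupted run would have seen, in the same order. Restoring only the RNG state would not be enough, because the position inside the epoch would be lost.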
Distributed training: Both Helios and Lightning support torchrun with automatic device detection. Ignite requires manual process group setup.
Boilerplate style: Lightning relies on hook names and decorator-based injection; Ignite uses an event system. Helios uses explicit function overrides with clear call-site visibility.
Training unit: In Helios, the training unit is a first-class concept. EPOCH and ITERATION are distinct modes that govern the entire training loop, including checkpoint frequency, stopping conditions, and gradient accumulation behaviour. Both Lightning and Ignite are fundamentally epoch-based; Lightning exposes max_steps as a secondary option, but iteration-based training in either framework requires working around the default design.
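As a rough illustration of what it means for the training unit to govern the stopping condition, the sketch below counts batches under each mode. The enum name `TrainingUnit` and the function `run` are assumptions made for this example, not the Helios API.

```python
from enum import Enum


class TrainingUnit(Enum):  # hypothetical name, for illustration only
    EPOCH = "epoch"
    ITERATION = "iteration"


def run(unit: TrainingUnit, total: int, batches_per_epoch: int) -> tuple[int, int]:
    """Return (batches processed, epochs completed) when the loop stops."""
    if total <= 0 or batches_per_epoch <= 0:
        raise ValueError("total and batches_per_epoch must be positive")
    processed = 0
    epochs_completed = 0
    while True:
        for _ in range(batches_per_epoch):
            processed += 1
            # Iteration mode may stop in the middle of an epoch.
            if unit is TrainingUnit.ITERATION and processed == total:
                return processed, epochs_completed
        epochs_completed += 1
        # Epoch mode only stops on an epoch boundary.
        if unit is TrainingUnit.EPOCH and epochs_completed == total:
            return processed, epochs_completed


epoch_result = run(TrainingUnit.EPOCH, total=3, batches_per_epoch=10)
iteration_result = run(TrainingUnit.ITERATION, total=25, batches_per_epoch=10)
```

With these inputs, epoch mode stops after 30 batches and 3 complete epochs, while iteration mode stops after exactly 25 batches, partway through the third epoch. The same unit would also drive choices such as checkpoint frequency.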
Gradient accumulation: Helios handles gradient accumulation through the TrainingState rather than a dedicated parameter. In epoch mode, the trainer does not intervene: the iteration count is unchanged and the model decides when to call backward() by inspecting current_iteration. In iteration mode, the trainer automatically scales the total iteration count by the accumulation factor, so requesting \(N\) iterations with accumulation \(M\) truly means \(N\) backward passes at the target batch size. The distinction between current_iteration (effective, complete iterations) and global_iteration (raw forward passes) makes the accounting explicit. See Gradient Accumulation for full details.
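The iteration-mode accounting can be sketched as a small simulation. The function below is hypothetical and does not call Helios; it only assumes the relationship described above, namely that \(N\) requested iterations with accumulation \(M\) yield \(N \times M\) raw passes, with one effective iteration completing every \(M\) raw passes.

```python
def simulate_iteration_mode(requested_iterations: int, accumulation_steps: int):
    """Return (current_iteration, global_iteration, boundaries) after training."""
    # The total loop length is scaled by the accumulation factor.
    total_raw_passes = requested_iterations * accumulation_steps
    current_iteration = 0  # effective, complete iterations
    global_iteration = 0   # raw forward passes
    boundaries = []        # raw pass numbers at which an effective iteration ends
    for _ in range(total_raw_passes):
        global_iteration += 1
        if global_iteration % accumulation_steps == 0:
            current_iteration += 1
            boundaries.append(global_iteration)
    return current_iteration, global_iteration, boundaries


result = simulate_iteration_mode(requested_iterations=3, accumulation_steps=4)
# result == (3, 12, [4, 8, 12])
```

Requesting 3 iterations with an accumulation factor of 4 gives 12 raw passes and exactly 3 effective iterations, completing at raw passes 4, 8, and 12.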
The Registry System¶
Helios provides typed global registries that map string names to types. Registries exist for all major components, such as the model, dataset, optimiser, and scheduler.
Any component can be registered with the @REGISTRY.register decorator and instantiated by name using the corresponding factory function:
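```python
import helios.model as hlm


# Register the class under its own name in the model registry.
@hlm.MODEL_REGISTRY.register
class MyModel(hlm.Model):
    ...


# Instantiate the registered class by its string name.
model = hlm.create_model("MyModel", save_name="my_model")
```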
This enables config-file-driven experiments where the model, dataset, optimiser, and scheduler are all selected by string name, with no changes to training code. Swapping any component is a one-line change in the configuration.
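For readers unfamiliar with the pattern, the following standalone sketch shows how a name-to-type registry and a factory function can fit together. It reuses the names `MODEL_REGISTRY` and `create_model` purely for readability; the `Registry` class, its methods, and the `config` layout are assumptions for illustration and are not the Helios implementation.

```python
class Registry:
    """Hypothetical registry mapping class names to types."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._entries: dict[str, type] = {}

    def register(self, cls: type) -> type:
        # Used as a decorator: store the class under its own name and
        # return it unchanged so the definition still works normally.
        if cls.__name__ in self._entries:
            raise ValueError(f"{cls.__name__!r} is already registered in {self.name}")
        self._entries[cls.__name__] = cls
        return cls

    def get(self, type_name: str) -> type:
        if type_name not in self._entries:
            raise KeyError(f"{type_name!r} is not registered in {self.name}")
        return self._entries[type_name]


MODEL_REGISTRY = Registry("model")


def create_model(type_name: str, **kwargs):
    # Look the class up by name and forward the remaining arguments.
    return MODEL_REGISTRY.get(type_name)(**kwargs)


@MODEL_REGISTRY.register
class MyModel:
    def __init__(self, save_name: str) -> None:
        self.save_name = save_name


# A configuration (for example, parsed from a file) selects the model by name.
config = {"model": {"type": "MyModel", "args": {"save_name": "my_model"}}}
model = create_model(config["model"]["type"], **config["model"]["args"])
```

Swapping the model then amounts to changing the `"type"` string in the configuration, provided the replacement class has been registered before `create_model` is called.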
If your source tree spans multiple packages, use update_all_registries() to scan and auto-register all decorated classes at startup:
```python
import helios.core as hlc

hlc.update_all_registries("my_package")
```
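Decorator-based registration only takes effect once the module containing the decorated class has been imported, which is why a scanning step is useful. The sketch below shows one way such a scan could be written with the standard library. The helper `import_submodules` is hypothetical, and this is an assumption about the general approach rather than a description of how update_all_registries() is implemented.

```python
import importlib
import pkgutil


def import_submodules(package_name: str) -> list[str]:
    """Import a package and everything beneath it; return the imported names."""
    package = importlib.import_module(package_name)
    imported = [package.__name__]
    # Plain modules have no __path__, so there is nothing further to walk.
    search_path = getattr(package, "__path__", None)
    if search_path is None:
        return imported
    prefix = package.__name__ + "."
    for module_info in pkgutil.walk_packages(search_path, prefix):
        # Importing a module runs its top-level code, including any
        # registration decorators applied to classes defined there.
        importlib.import_module(module_info.name)
        imported.append(module_info.name)
    return imported
```

Calling `import_submodules("my_package")` once at startup would import every module under `my_package`, so that any class decorated with a registry's `register` method is available by name before the configuration is read.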
When to Choose Helios¶
Helios is a good fit if:
You need exact reproducibility. Pausing and resuming training must produce results identical to an uninterrupted run.
You run config-file-driven experiments and want to swap components by string name.
You want to be able to read and follow every line of the training loop.
Your training logic is non-standard (for example, a GAN, a multi-phase curriculum, or a custom distributed setup) and does not fit neatly into a higher-level abstraction.
When Lightning Might Be a Better Fit¶
Lightning may be more appropriate if:
You want a batteries-included experience and would prefer not to write any training boilerplate at all.
You are working within an ecosystem that already depends on Lightning, such as a team codebase or a set of third-party integrations built around it.