Quick Reference

Here is where you will find detailed explanations regarding the various modules that Helios offers. The topics are not sorted in any particular order and are meant to be used as a reference for developers.

Reproducibility

One of the main features of Helios is the ability to maintain reproducibility even if training runs are stopped and restarted. The mechanisms through which reproducibility is ensured are split into several groups, which we discuss in more detail below. At the end is a short summary explaining how to use Helios correctly to ensure reproducibility.

Warning

While every effort is made to ensure reproducibility, there are some limitations to what Helios can do. As Helios depends on PyTorch, it is bound to the same reproducibility limitations. For more information, see the reproducibility documentation from PyTorch.

Random Number Generation

To ensure that sequences of random numbers are maintained, Helios provides an automatic seeding system that is invoked as part of the start-up process by the Trainer. The seed value used to initialise the random number generators can be assigned by setting the random_seed parameter of the trainer.

Note

If no value is assigned, the default seed is used. Helios uses 6691 as its default value for seeding RNGs.
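
For example, a minimal sketch of overriding the seed at construction time (all other trainer arguments are omitted and depend on your setup):

import helios.trainer as hlt

# Override the default seed of 6691.
trainer = hlt.Trainer(random_seed=42)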

Random numbers may be required throughout the training process, so Helios will automatically seed the following generators (a rough code equivalent is shown after the list):

  • PyTorch: through torch.manual_seed.

  • PyTorch CUDA: through torch.cuda.manual_seed_all. Note that this is only done if CUDA is available.

  • Python: the built-in random module, through random.seed.
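
For reference, the automatic seeding is roughly equivalent to the following sketch, using the standard calls listed above with the default seed:

import random

import torch

seed = 6691  # or whatever was assigned to random_seed

# Seed PyTorch's CPU generator.
torch.manual_seed(seed)

# Seed every CUDA device, but only when CUDA is available.
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

# Seed Python's built-in random module.
random.seed(seed)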

NumPy RNG

Starting with NumPy 1.17, the legacy numpy.random interface (functions such as numpy.random.rand) is considered frozen and will receive no further updates. The NumPy documentation states that new code should instead rely on the new Generator class. To facilitate the seeding, saving, and restoring of the NumPy generator, Helios provides a wrapper class called DefaultNumpyRNG. This class is automatically created by the trainer and initialised with the default seed (unless a different seed is specified). The generator can be accessed through get_default_numpy_rng() as shown below:

from helios.core import rng

generator = rng.get_default_numpy_rng().generator

# Use the generator as necessary. For example, we can retrieve a uniform random float
# between 0 and 1.
generator.uniform(0, 1)

Warning

Helios does not initialise NumPy's legacy random generator at any point. You are strongly encouraged to use the provided NumPy generator instead.

The state of the RNGs that Helios seeds is automatically stored whenever checkpoints are saved, so model classes do not have to handle this themselves. Likewise, whenever checkpoints are loaded, the RNG state is automatically restored.

DataLoaders and Samplers

The next major block pertains to how the dataloaders and samplers are handled by Helios. This is split into two sets: the worker processes and the way datasets are sampled.

Worker Processes

When the dataloader for a dataset is created, Helios passes in a custom function to seed each worker process so the random sequences remain the same. The code is adapted from the official PyTorch documentation as shown here:

import torch

from helios.core import rng

def _seed_worker(worker_id: int) -> None:
    worker_seed = torch.initial_seed() % 2**32
    rng.seed_rngs(worker_seed, skip_torch=True)

This ensures that all the RNGs that Helios supports are correctly seeded in the worker processes.

Note

This function is passed in internally as an argument to worker_init_fn of the PyTorch DataLoader class. At this time it is not possible to override this function, though it may be considered for a future release.

Samplers

A critical component of ensuring reproducibility is making sure that the order in which batches are retrieved from the dataset stays the same even if a training run is stopped and restarted. PyTorch does not provide a built-in system for this, so Helios implements it through the ResumableSampler base class. The goal is to provide a way to do the following (a sketch of this interface appears after the list):

  1. The sampler must have a way of setting the starting iteration. For example, suppose that for a given epoch the sampler would have produced a sequence of \(N\) batches numbered \(0, 1, \ldots, N - 1\). We need the sampler to provide a way for us to set the starting batch to a given index \(n_i\) such that the sequence of batches continues from that starting point.

  2. The sampler must have a way of setting the current epoch. This is to allow the samplers to re-shuffle between epochs (if shuffling is used) and to guarantee that the resulting shuffled list is consistent.
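
To make the two requirements concrete, here is an illustrative sketch. The method names set_epoch and set_start_iter are placeholders and do not necessarily match the ResumableSampler API; a real sampler would also derive from ResumableSampler (or ResumableDistributedSampler) rather than from PyTorch's Sampler.

import torch
from torch.utils.data import Sampler


class ShuffledResumableSamplerSketch(Sampler):
    # Illustration only: shuffles deterministically per epoch and can resume.
    def __init__(self, dataset_size: int, seed: int = 6691) -> None:
        self._size = dataset_size
        self._seed = seed
        self._epoch = 0
        self._start_iter = 0

    def set_epoch(self, epoch: int) -> None:
        # Requirement 2: re-shuffle consistently for each epoch.
        self._epoch = epoch

    def set_start_iter(self, start_iter: int) -> None:
        # Requirement 1: continue the sequence from a given starting point.
        self._start_iter = start_iter

    def __iter__(self):
        # Seeding with seed + epoch makes the shuffled order reproducible.
        g = torch.Generator()
        g.manual_seed(self._seed + self._epoch)
        order = torch.randperm(self._size, generator=g).tolist()
        # Skip whatever was already consumed before the restart.
        return iter(order[self._start_iter:])

    def __len__(self) -> int:
        return self._size - self._start_iter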

Helios contains 3 samplers that provide this functionality. These are:

By default, the sampler is automatically selected using the following logic:

It is possible to override this by providing your own sampler, in which case you should set the sampler field of the DataLoaderParams.

Warning

The sampler must derive from either ResumableSampler or ResumableDistributedSampler.

Checkpoints

The final mechanism Helios uses to ensure reproducibility is the way checkpoints are saved, specifically the data that is stored in them when they are created. By default, the trainer will write the following data:

  • The state of all supported RNGs.

  • The current TrainingState.

  • The state of the model (if any).

  • The paths to the log file and Tensorboard folder (if using).

If training is stopped and restarted, then Helios will look in the folder where checkpoints are stored and load the last checkpoint. This checkpoint is identified as the file with the highest epoch and/or iteration number. Upon loading, the trainer will do the following:

  • Restore the state of all supported RNGs.

  • Load the saved training state.

  • Provide the loaded state to the model (if any).

  • Restore the file and Tensorboard loggers to continue writing to their original locations (if using).

Note

Any weights contained in the saved checkpoint are automatically mapped to the correct device when the checkpoint is loaded.

TL;DR

Below is a quick summary to ensure you use Helios’ reproducibility system correctly:

  1. Helios provides a default random seed, but you can override it by setting random_seed in the Trainer.

  2. If you need random numbers, you can use Python's built-in random module, torch.random, and torch.cuda.random. If you need a NumPy RNG, use get_default_numpy_rng().

  3. Seeding of workers for dataloaders is automatically handled by Helios, so you don’t have to do extra work.

  4. Helios ships with custom samplers that ensure reproducibility in the event training stops. The choice of sampler is automatically handled, but you may override it by setting sampler.

  5. Checkpoints created by Helios automatically store the RNG state alongside training state. No more work is required on your part beyond saving the state of your model.

Stopping Training

In certain cases, it is desirable for training to halt under specific conditions. For example,

  • Either the validation metric or the loss function has reached a threshold beyond which further training isn't necessary.

  • The loss function is returning invalid values.

  • The validation metric has not improved in the last \(N\) validation cycles.

Helios provides a way for training to halt if a condition is met. The behaviour is dependent on the choice of training unit, but in general, the following options are available:

  • If you wish to stop training after \(N\) validation cycles because the metric hasn’t improved, then you can use have_metrics_improved() in conjunction with the early_stop_cycles argument of the Trainer.

  • If you wish to stop training for any other reason, you can use should_training_stop().

We will now discuss each of these in more detail.

Stopping After \(N\) Validation Cycles

Helios will perform validation cycles based on the frequency assigned to valid_frequency in the Trainer. The value specifies:

  • The number of epochs between each cycle if the training unit is set to EPOCH or,

  • The number of iterations between each cycle if the training unit is set to ITERATION.

After the validation cycle is completed, the trainer will call have_metrics_improved(). If early_stop_cycles has been assigned when the trainer was created, then the following logic applies:

  • If the function returns true, then the early stop counter resets to 0 and training continues.

  • If the function returns false, then the early stop counter increases by one. If the counter is greater than or equal to the value given to early_stop_cycles, then training stops.

Note

If you wish to use the early stop system, you must assign early_stop_cycles.

Note

The call to have_metrics_improved() is performed after checking if a checkpoint should be saved. If your validation and checkpoint frequencies are the same, then you’re guaranteed that a checkpoint will be saved before the early stop check happens.
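
As an example, here is a minimal sketch of the early stop hook, assuming have_metrics_improved() is implemented on the model and returns a bool, and using placeholder attributes for the validation scores and the best value seen so far:

def have_metrics_improved(self) -> bool:
    # Placeholders: self._val_scores holds the latest validation results and
    # self._best_accuracy (initialised to float("-inf") in __init__) holds the
    # best accuracy seen so far.
    if self._val_scores["accuracy"] > self._best_accuracy:
        self._best_accuracy = self._val_scores["accuracy"]
        return True
    return False

Paired with, say, early_stop_cycles=5, training would stop after five consecutive validation cycles without an improvement in accuracy.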

Stopping on a Condition

The function used to determine if training should stop for reasons that are not related to the early stop system is should_training_stop(). As there are various places in which it would be desirable for training to halt, Helios checks this function at the following times:

Note

The behaviour of the training batch is consistent regardless of the training unit.

Note

Remember: the choice of training unit affects the place where validation cycles are performed:

  • If training by epochs, then validation cycles occur at the end of every epoch.

  • If training by iterations, then validation cycles will occur after the training batch finishes on an iteration number that is a multiple of the validation frequency. In this case, the early stop checks would occur after the check to see if training should halt.
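
For example, here is a minimal sketch of halting when the loss becomes invalid, assuming should_training_stop() is implemented on the model and that the loss tensor lives in a placeholder attribute:

import torch


def should_training_stop(self) -> bool:
    # Placeholder: self._loss_items holds the loss tensors, as in the gradient
    # accumulation examples below.
    loss = self._loss_items["loss"]
    # Stop as soon as the loss becomes NaN or infinite.
    return bool(torch.isnan(loss).any() or torch.isinf(loss).any())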

Gradient Accumulation

The Trainer provides native support for performing gradient accumulation while training. The behaviour is dependent on the choice of training unit, and the logic is the following:

  • If EPOCH is used, then gradient accumulation has no effect on the trainer. Specifically, the iteration count does not change, and neither does the total number of epochs.

  • If ITERATION is used, then accumulating by \(N_g\) steps with a total number of iterations \(N_i\) will result in \(N_g \cdot N_i\) total training iterations.

Training by Epoch

To better understand the behaviour of each unit type, let's look at an example. First, let's set the training unit to be epochs. Then, suppose that we want to train a network for 5 epochs and the batch size results in 10 iterations per epoch. We want to accumulate gradients for 2 iterations, effectively emulating a batch size that results in 5 iterations per epoch. In this case, the total number of iterations that the dataset loop has to run for remains unchanged. We're still going to go through all 10 batches, but the difference is that we only want to compute backward passes on batches 2, 4, 6, 8, and 10. Since this is the responsibility of the model, the trainer doesn't have to do any special handling, which results in the following data being stored in the TrainingState:

  • current_iteration and global_iteration will both have the same value, which will correspond to \(n_e \cdot n_i\) where \(n_e\) is the current epoch number and \(n_i\) is the batch number in the dataset.

  • global_epoch will contain the current epoch number.

Let's suppose that we want to perform the backward pass in the on_training_batch_end() function of the model. Then we would do something like this:

def on_training_batch_end(
    self, state: TrainingState, should_log: bool = False
) -> None:
    # Suppose that our loss tensor is stored in self._loss_items and the number of
    # accumulation steps is stored in self._accumulation_steps
    if state.current_iteration % self._accumulation_steps == 0:
        self._loss_items["loss"].backward()
        self._optimizer.step()
    ...

Note

In the example above, we could’ve just as easily used state.global_iteration as they both have the same value.

Training by Iteration

Now let's see what happens when we switch to training by iterations. In this case, suppose we want to train a network for 10k iterations. We want to emulate a batch size that is twice our current size, so we want to accumulate by 2. If we were to run the training loop for 10k iterations performing backward passes every second iteration, we would've performed a total of 5k backward passes, which is half of what we want. Remember: we want to train for 10k iterations at double the batch size that we have. This means that, in order to get the same number of backward passes, we need to double the total iteration count to 20k. This way, we would get the 10k backward passes that we want.

In order to simplify things, the trainer will automatically handle this calculation for you, which results in the following data being stored in the TrainingState:

  • current_iteration is the real iteration count that accounts for gradient accumulation. In our example, this number would only increase every second iteration, and it is used to determine when training should stop.

  • global_iteration is the total number of iterations that have been performed. In our example, this would be twice the value of the current iteration.

  • global_epoch is the current epoch number.

Like before, suppose that we want to perform the backward pass in the on_training_batch_end() function of the model. Then we would do something like this:

def on_training_batch_end(
    self, state: TrainingState, should_log: bool = False
) -> None:
    # Suppose that our loss tensor is stored in self._loss_items and the number of
    # accumulation steps is stored in self._accumulation_steps
    if state.global_iteration % self._accumulation_steps == 0:
        self._loss_items["loss"].backward()
        self._optimizer.step()
    ...

Warning

Unlike the epoch case, we cannot use state.current_iteration, as that only advances once per accumulation cycle (it tracks the effective number of iterations we have completed).

Checkpoint Saving

As mentioned in Reproducibility, Helios will automatically save checkpoints whenever both chkpt_frequency and chkpt_root are set in the Trainer. The data for checkpoints is stored in a dictionary that always contains the following keys:

  • training_state: contains the current TrainingState object.

  • model: contains the state of the model as returned by state_dict(). Note that by default this is an empty dictionary.

  • rng: contains the state of the supported RNGs.

  • version: contains the version of Helios used to generate the checkpoint.

The following keys may optionally appear in the dictionary:

  • log_path: appears only when file logging is enabled and contains the path to the log file.

  • run_path: appears only when Tensorboard logging is enabled and contains the path to the directory where the data is stored.
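
For reference, a saved checkpoint can be inspected directly with torch.load; the path below is a placeholder:

import torch

# weights_only=False is needed on recent PyTorch versions because the
# checkpoint stores Python objects (such as the training state), not just tensors.
chkpt = torch.load(
    "chkpt/cifar10_epoch_3_iter_100.pth", map_location="cpu", weights_only=False
)

# Always present.
print(chkpt["training_state"], chkpt["version"])
print(chkpt["rng"])
print(chkpt["model"])  # an empty dictionary by default

# Present only when the corresponding logger is enabled.
print(chkpt.get("log_path"), chkpt.get("run_path"))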

The name of the checkpoint is determined as follows:

<run-name>_epoch_<epoch>_iter_<iteration>_<additional-metadata>.pth

Where:

  • <run-name> is the value assigned to run_name in the trainer.

  • <epoch> and <iteration> are the values stored in global_epoch and global_iteration, respectively.

The <additional-metadata> field is used to allow users to append additional information to the checkpoint name for easier identification later on. This data is retrieved from the append_metadata_to_chkpt_name() function of the model. For example, suppose we want to add the value of the accuracy metric we computed for validation. Then we would do something like this:

def append_metadata_to_chkpt_name(self, chkpt_name: str) -> str:
    # Suppose the accuracy is stored in self._val_scores
    accuracy = round(self._val_scores["accuracy"], 4)
    return f"accuracy_{accuracy}"

This will append the string to the end of the checkpoint name. Say our run name is cifar10 and we’re saving on iteration 100 and epoch 3. Then the checkpoint name would be:

cifar10_epoch_3_iter_100_accuracy_0.89.pth

Note

You do not have to add the .pth extension to the name when you append metadata. This is handled automatically by the trainer.

Note

If distributed training is used, then only the process with global rank 0 will save checkpoints.

Migrating Checkpoints

The version key stored in the checkpoints generated by Helios acts as a fail-safe to prevent future changes from breaking previously generated checkpoints. Helios guarantees compatibility between checkpoints generated within the same major revision. In other words, checkpoints generated by version 1.0 will be compatible with version 1.1. Compatibility between major versions is not guaranteed. Should you wish to migrate your checkpoints to a newer version of Helios, you may do so either by manually calling migrate_checkpoints_to_current_version() or by running the module directly from the command line as follows:

python -m helios.chkpt_migrator <chkpt-root>
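
The equivalent Python call is sketched below. The import path is inferred from the command above, and the signature (taking the checkpoint root directory) is an assumption; check the API reference for the actual one.

from pathlib import Path

from helios.chkpt_migrator import migrate_checkpoints_to_current_version

# "chkpt" is a placeholder for the directory that holds the checkpoints.
migrate_checkpoints_to_current_version(Path("chkpt"))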

Logging

The Trainer has several flags that control logging. These are:

  • enable_tensorboard which is paired with run_path,

  • enable_file_logging which is paired with log_path, and

  • enable_progress_bar.

The _path arguments determine the root directories where the corresponding logs will be saved.

Warning

If a flag is paired with a path, then you must provide the corresponding path when the flag is enabled. In other words, if you set enable_tensorboard, then you must also provide run_path.

Note

If the given path doesn’t exist, it will be created automatically.
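
For example, here is a minimal sketch that enables both logging back-ends (the paths are placeholders and all other constructor arguments are omitted):

import helios.trainer as hlt

trainer = hlt.Trainer(
    run_name="cifar10",
    enable_file_logging=True,
    log_path="logs",         # root directory for the file logs
    enable_tensorboard=True,
    run_path="runs",         # root directory for the Tensorboard data
    enable_progress_bar=True,
)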

The names for the logs are determined as follows:

<run-name>_<current-date/time>

Where <run-name> is the value assigned to the run_name argument and <current-date/time> is the string representation of the current date and time with the format MonthDay_Hour-Minute-Second. This allows multiple training runs with the same names to save to different logs, which can be useful when tweaking hyper-parameters.

The enable_progress_bar flag determines whether a progress bar is shown on the screen while training is ongoing. The progress bar is only shown on the screen and does not appear in the file log (if enabled). The behaviour of the progress bar depends on the choice of training unit:

  • If epochs are used, then two progress bars are displayed: one that tracks the number of epochs and another that tracks the iterations within the current epoch.

  • If iterations are used, then a single progress bar is shown that tracks the number of iterations.

The progress bar is also shown during validation, in which case it tracks the number of iterations in the validation set.

CUDA

Helios provides several conveniences for handling training on GPUs through CUDA as well as distributed training. These are:

  • Automatic detection and selection of GPUs to train on,

  • Automatic mapping of checkpoints to the correct device,

  • Support for torchrun,

  • Ability to set certain CUDA flags.

The Trainer has two flags that can be used to control the behaviour when using CUDA. These are:

  • enable_deterministic: uses deterministic training.

  • enable_cudnn_benchmark: enables the use of CuDNN benchmarking for faster training.

Note

enable_deterministic can also be used when training on the CPU.

Note

CuDNN benchmarking is enabled only during training. It is disabled automatically during validation to avoid non-deterministic issues.

Device Selection

When the trainer is created, there are two arguments that can be used to determine which device(s) will be used for training: gpus and use_cpu. The logic for determining the device is the following (a short sketch follows the list):

  • If use_cpu is true, then the CPU will be used for training.

  • Otherwise, the choice of devices is determined by gpus. If no value is assigned, and CUDA is not available, then the CPU will be used.

  • If gpus is not assigned and CUDA is available, then Helios will automatically use all available GPUs in the system, potentially triggering distributed training if more than one is found.

  • If gpus is set, then the indices it contains determine the devices that will be used for training.
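
A couple of minimal sketches of these arguments (all other constructor arguments are omitted):

import helios.trainer as hlt

# Force training on the CPU regardless of CUDA availability.
cpu_trainer = hlt.Trainer(use_cpu=True)

# Train on GPUs 0 and 1; more than one index triggers distributed training,
# as described in the notes below.
gpu_trainer = hlt.Trainer(gpus=[0, 1], enable_cudnn_benchmark=True)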

Note

If torchrun is used, then Helios will automatically detect the GPU assigned to the process as if it had been assigned through gpus.

Note

If multiple GPUs are found, or if more than one index is provided to gpus, then Helios will automatically launch distributed training.

Warning

gpus must be set to a list of indices that represent the IDs of the GPU(s) to use.

Model Functions

The Model class provides several callbacks that can be used for training, validation, and testing. Below is a list of all available callbacks along with their use in the training loops.

Training Functions

The order in which the training functions are called roughly corresponds to the following code:

model.on_training_start()
model.train()
for epoch in epochs:
    model.on_training_epoch_start()

    for batch in dataloader:
        model.on_training_batch_start()
        model.train_step()
        model.on_training_batch_end()

    model.on_training_epoch_end()
model.on_training_end()

Validation Functions

The order in which the validation functions are called roughly corresponds to the following code:

model.eval()
model.on_validation_start()
for batch in dataloader:
    model.on_validation_batch_start()
    model.valid_step()
    model.on_validation_batch_end()

model.on_validation_end()

Testing Functions

The order of the testing functions is identical to the one shown for validation:

model.eval()
model.on_testing_start()
for batch in dataloader:
    model.on_testing_batch_start()
    model.test_step()
    model.on_testing_batch_end()

model.on_testing_end()

Exception Handling

By default, the main functions of Trainer (those being fit() and test()) will automatically catch any unhandled exceptions and re-raise them as RuntimeError. Depending on the situation, it may be desirable for certain exceptions to be passed through untouched. To accommodate this, the trainer holds two lists of exception types: one for training and one for testing.

If an exception is raised and said exception is found in the training list (for fit()) or testing list (for test()), then the exception is passed through unchanged. Any other exceptions use the default behaviour.

For example, suppose we had a custom exception called MyException and we wanted that exception to be passed through when training because we’re going to handle it ourselves. We would then do the following:

import helios.trainer as hlt

trainer = hlt.Trainer(...)
trainer.train_exceptions.append(MyException)

try:
    trainer.fit(...)
except MyException as e:
    ...

The same logic applies for testing. This functionality is particularly useful when paired with plug-ins.

Synchronization

Helios provides some synchronization wrappers found in the distributed module:

The trainer also provides another way to synchronize values: the multi-processing queue. When distributed training is not launched through torchrun, Helios uses spawn to create the processes for each GPU. This triggers a copy of the arguments passed to the handler, which in this case are the trainer, model, and datamodule. This presents a problem if we need to return values to the main process once training is complete. To facilitate this task, the trainer will create a queue that can be accessed through queue.

Note

If training isn’t distributed or if it was started through torchrun, then the queue is set to None.

The queue can then be used by either the Model, the DataModule, or any plug-in through their reference to the trainer.
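
As an example, here is a sketch of pushing a final metric back to the main process from the model's on_training_end() callback. The self.trainer reference and the stored value are placeholders; the queue itself behaves like a standard multiprocessing queue:

def on_training_end(self) -> None:
    # The queue is None when training is not distributed or was launched
    # through torchrun, so guard the access.
    if self.trainer.queue is not None:
        self.trainer.queue.put({"accuracy": self._val_scores["accuracy"]})

Once fit() returns, the main process can then read the value back with trainer.queue.get().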