Quick Reference¶
This page contains detailed explanations of the various modules that Helios offers. The topics are not sorted in any particular order and are meant to serve as a reference for developers.
Reproducibility¶
One of the key features of Helios is the ability to maintain reproducibility even if training runs are stopped and restarted. The mechanisms through which reproducibility is ensured fall into several groups, which we discuss in more detail below. At the end is a short summary explaining how to use Helios correctly to ensure reproducibility.
Warning
While every effort is made to ensure reproducibility, there are limits to what Helios can do. Since Helios depends on PyTorch, it is bound by the same reproducibility limitations. For more information, see PyTorch's reproducibility documentation.
Random Number Generation¶
To ensure that sequences of random numbers are maintained, Helios provides an automatic seeding system that is invoked as part of the start-up process by the Trainer. The seed value used to initialise the random number generators can be assigned by setting the random_seed parameter of the trainer.
Note
If no value is assigned, the default seed is used. Helios has a default value of 6691 for seeding RNGs.
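For instance, a minimal sketch of overriding the default seed when constructing the trainer (other arguments omitted):

import helios.trainer as hlt

# Seed all supported RNGs with 42 instead of the default 6691.
trainer = hlt.Trainer(random_seed=42)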
Random numbers may be required throughout the training process, so Helios will automatically seed the following generators (a rough sketch of the equivalent calls follows the list):

- PyTorch: through torch.manual_seed.
- PyTorch CUDA: through torch.cuda.manual_seed_all. Note that this is only done if CUDA is available.
- Python's built-in random module: through random.seed.
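For reference, the automatic seeding roughly corresponds to the following calls. This is a simplified sketch rather than Helios' actual implementation:

import random

import torch

seed = 6691  # the default; replace with the value passed to random_seed
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)
random.seed(seed)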
NumPy RNG¶
Starting with NumPy 1.16, numpy.random.rand is considered legacy and will receive no further updates. The NumPy documentation states that newer code should rely instead on the new Generator class. In order to facilitate the seeding, saving, and restoring of the NumPy generators, Helios provides a wrapper class called DefaultNumpyRNG. This class is automatically created by the trainer and initialised with the default seed (unless a different seed is specified). The generator can be accessed through get_default_numpy_rng() as seen below:
from helios.core import rng
generator = rng.get_default_numpy_rng().generator
# Use the generator as necessary. For example, we can retrieve a uniform random float
# between 0 and 1.
generator.uniform(0, 1)
Warning
Helios does not initialise the legacy random generator from NumPy at any point. You are strongly encouraged to use the provided NumPy generator instead.
The state of the RNGs that Helios seeds is automatically stored whenever checkpoints are saved, so model classes do not have to handle this themselves. Likewise, whenever checkpoints are loaded, the RNG state is automatically restored.
DataLoaders and Samplers¶
The next major block pertains to how the dataloaders and samplers are handled by Helios. This is split into two parts: the worker processes and the way datasets are sampled.
Worker Processes¶
When the dataloader for a dataset is created, Helios passes in a custom function to seed each worker process so the random sequences remain the same. The code is adapted from the official PyTorch documentation as shown here:
def _seed_worker(worker_id: int) -> None:
    # Derive a per-worker seed from the base seed that PyTorch assigns to this worker.
    worker_seed = torch.initial_seed() % 2**32
    # Seed the remaining supported RNGs; torch itself is already seeded per worker.
    rng.seed_rngs(worker_seed, skip_torch=True)
This ensures that all the RNGs that Helios supports are correctly seeded in the worker processes.
Note
This function is passed in internally as an argument to worker_init_fn
of the
PyTorch DataLoader
class. At this time it is not possible to override this
function, though it may be considered for a future release.
Samplers¶
A critical component of ensuring reproducibility is having a way to guarantee that the order in which batches are retrieved from the dataset stays the same even if a training run is stopped. PyTorch does not provide a built-in system for this, so Helios implements it through the ResumableSampler base class. The goal is to provide a way to do the following (see the sketch after this list):

- The sampler must have a way of setting the starting iteration. For example, suppose that for a given epoch the sampler would've produced a sequence of \(N\) batches numbered \(0, 1, \ldots, N - 1\). We need the sampler to provide a way for us to set the starting batch to a given number \(n_i\) such that the sequence of batches continues from that starting point.
- The sampler must have a way of setting the current epoch. This allows the samplers to re-shuffle between epochs (if shuffling is used) while guaranteeing that the resulting shuffled list is consistent.
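As an illustration only, the two requirements could be satisfied by a sampler along these lines. The method names (set_epoch, set_start_iter) are illustrative and are not necessarily Helios' exact API:

import torch
from torch.utils.data import Sampler


class IllustrativeResumableSampler(Sampler):
    # Schematic only: shows the two requirements, not Helios' actual classes.
    def __init__(self, data_len: int, seed: int = 6691) -> None:
        self._data_len = data_len
        self._seed = seed
        self._epoch = 0
        self._start_index = 0

    def set_epoch(self, epoch: int) -> None:
        # Re-shuffling is driven by the epoch so restarted runs agree.
        self._epoch = epoch

    def set_start_iter(self, start_index: int) -> None:
        # Skip the samples consumed before the restart. A real implementation
        # would account for the batch size to skip whole batches.
        self._start_index = start_index

    def __iter__(self):
        g = torch.Generator()
        # Seeding with seed + epoch keeps the shuffle deterministic per epoch.
        g.manual_seed(self._seed + self._epoch)
        indices = torch.randperm(self._data_len, generator=g).tolist()
        return iter(indices[self._start_index:])

    def __len__(self) -> int:
        return self._data_len - self._start_index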
Helios contains three samplers that provide this functionality: ResumableRandomSampler, ResumableSequentialSampler, and ResumableDistributedSampler.
By default, the sampler is automatically selected using the following logic:

- If training is distributed, use ResumableDistributedSampler.
- If training is not distributed, then check if shuffling is required. If it is, use ResumableRandomSampler. Otherwise use ResumableSequentialSampler.
It is possible to override this by providing your own sampler, in which case you should set the sampler field of the DataLoaderParams.
Warning
The sampler must derive from either ResumableSampler or ResumableDistributedSampler.
Checkpoints¶
The final mechanism Helios has to ensure reproducibility is the way checkpoints are saved, specifically the data that is stored in them when they are created. By default, the trainer will write the following data:

- The state of all supported RNGs.
- The current TrainingState.
- The state of the model (if any).
- The paths to the log file and Tensorboard folder (if using).
If training is stopped and restarted, Helios will look in the folder where checkpoints are stored and load the last checkpoint, which is identified as the file with the highest epoch and/or iteration number. Upon loading, the trainer will do the following:
- Restore the state of all supported RNGs.
- Load the saved training state.
- Provide the loaded state to the model (if any).
- Restore the file and Tensorboard loggers to continue writing to their original locations (if using).
Note
Any weights contained in the saved checkpoint are automatically mapped to the correct device when the checkpoint is loaded.
TL;DR¶
Below is a quick summary to ensure you use Helios' reproducibility system correctly:

- Helios provides a default random seed, but you can override it by setting random_seed in the Trainer.
- If you need RNG, you can use Python's built-in random module, torch.random, and torch.cuda.random. If you need to use a NumPy RNG, use get_default_numpy_rng().
- Seeding of workers for dataloaders is automatically handled by Helios, so you don't have to do extra work.
- Helios ships with custom samplers that ensure reproducibility in the event training stops. The choice of sampler is automatically handled, but you may override it by setting sampler.
- Checkpoints created by Helios automatically store the RNG state alongside training state. No more work is required on your part beyond saving the state of your model.
Stopping Training¶
In certain cases, it is desirable for training to halt under specific conditions. For example:

- Either the validation metric or the loss function has reached a specific threshold after which training isn't necessary.
- The loss function is returning invalid values.
- The validation metric has not improved in the last \(N\) validation cycles.
Helios provides a way for training to halt if a condition is met. The behaviour is dependent on the choice of training unit, but in general, the following options are available:

- If you wish to stop training after \(N\) validation cycles because the metric hasn't improved, then you can use have_metrics_improved() in conjunction with the early_stop_cycles argument of the Trainer.
- If you wish to stop training for any other reason, you can use should_training_stop().
We will now discuss each of these in more detail.
Stopping After \(N\) Validation Cycles¶
Helios will perform validation cycles based on the frequency assigned to valid_frequency in the Trainer. The value specifies:

- The number of epochs between each cycle if the training unit is set to EPOCH, or
- The number of iterations between each cycle if the training unit is set to ITERATION.
After the validation cycle is completed, the trainer will call have_metrics_improved(). If early_stop_cycles has been assigned when the trainer was created, then the following logic applies (see the sketch after this list):

- If the function returns true, then the early stop counter resets to 0 and training continues.
- If the function returns false, then the early stop counter increases by one. If the counter is greater than or equal to the value given to early_stop_cycles, then training stops.
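For example, a sketch of wiring this up might look like the following. The assumption here is that have_metrics_improved() is overridden on your Model subclass and takes no arguments, so check the API reference for the exact signature:

import helios.trainer as hlt

# Validate every 2 units (epochs or iterations, depending on the training unit)
# and stop if the metric hasn't improved for 5 consecutive validation cycles.
trainer = hlt.Trainer(valid_frequency=2, early_stop_cycles=5)  # other arguments omitted

# Inside your model subclass (signature assumed for illustration):
def have_metrics_improved(self) -> bool:
    # Suppose the latest validation accuracy is stored in self._val_scores.
    current = self._val_scores["accuracy"]
    if current > getattr(self, "_best_accuracy", 0.0):
        self._best_accuracy = current
        return True
    return False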
Note
If you wish to use the early stop system, you must assign early_stop_cycles.
Note
The call to have_metrics_improved()
is performed
after checking if a checkpoint should be saved. If your validation and checkpoint
frequencies are the same, then you’re guaranteed that a checkpoint will be saved
before the early stop check happens.
Stopping on a Condition¶
The function used to determine if training should stop for reasons that are not related to the early stop system is should_training_stop(). As there are various places in which it would be desirable for training to halt, Helios checks this function at the following times (see the sketch after this list):

- After a training batch is complete. Specifically, this check is done after on_training_batch_start(), train_step(), and on_training_batch_end() have been called.
- After a validation cycle has been completed.
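As an illustration, the model could use this hook to halt when the loss becomes invalid. This sketch assumes should_training_stop() is overridden on your Model subclass and takes no arguments:

import torch

# Inside your model subclass (signature assumed for illustration):
def should_training_stop(self) -> bool:
    # Stop training if the most recent loss is NaN or infinite.
    loss = self._loss_items["loss"]
    return not bool(torch.isfinite(loss).all())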
Note
The check performed after each training batch behaves the same regardless of the training unit.
Note
Remember: the choice of training unit affects the place where validation cycles are performed:
If training by epochs, then validation cycles occur at the end of every epoch.
If training by iterations, then validation cycles will occur after the training batch finishes on an iteration number that is a multiple of the validation frequency. In this case, the early stop checks would occur after the check to see if training should halt.
Gradient Accumulation¶
The Trainer provides native support for performing gradient accumulation while training. The behaviour is dependent on the choice of training unit, and the logic is the following:

- If EPOCH is used, then gradient accumulation has no effect on the trainer. Specifically, the iteration count does not change, and neither does the total number of epochs.
- If ITERATION is used, then accumulating by \(N_g\) steps with a total number of iterations \(N_i\) will result in \(N_g \cdot N_i\) total training iterations.
Training by Epoch¶
To better understand the behaviour of each unit type, let's look at an example. First, let's set the training unit to be epochs. Then, suppose that we want to train a network for 5 epochs and the batch size results in 10 iterations per epoch. We want to accumulate gradients for 2 iterations, effectively emulating a batch size that results in 5 iterations per epoch. In this case, the total number of iterations that the dataset loop has to run for remains unchanged. We're still going to go through all 10 batches, but the difference is that we only want to compute backward passes on batches 2, 4, 6, 8, and 10. Since this is the responsibility of the model, the trainer doesn't have to do any special handling, which results in the following data being stored in the TrainingState:

- current_iteration and global_iteration will both have the same value, which will correspond to \(n_e \cdot n_i\) where \(n_e\) is the current epoch number and \(n_i\) is the batch number in the dataset.
- global_epoch will contain the current epoch number.
Let's suppose that we want to perform the backward pass in the on_training_batch_end() function of the model. Then we would do something like this:
def on_training_batch_end(
    self, state: TrainingState, should_log: bool = False
) -> None:
    # Suppose that our loss tensor is stored in self._loss_items and the number of
    # accumulation steps is stored in self._accumulation_steps
    if state.current_iteration % self._accumulation_steps == 0:
        self._loss_items["loss"].backward()
        self._optimizer.step()
        ...
Note
In the example above, we could’ve just as easily used state.global_iteration
as
they both have the same value.
Training by Iteration¶
Now let’s see what happens when we switch to training by iterations. In this case, suppose we want to train a network for 10k iterations. We want to emulate a batch size that is twice our current size, so we want to accumulate by 2. If we were to run the training loop for 10k iterations performing backward passes every second iteration, we would’ve performed at total of 5k backward passes, which is half of what we want. Remember: we want to train for 10k iterations at double the batch size that we have. This means that, in order to get the same number of backward passes, we need to double the total iteration count to 20k. This way, we would get the 10k backward passes that we want.
In order to simplify things, the trainer will automatically handle this calculation for you, which results in the following data being stored in the TrainingState:

- current_iteration is the real iteration count that accounts for gradient accumulation. In our example, this number would only increase every second iteration, and it is used to determine when training should stop.
- global_iteration is the total number of iterations that have been performed. In our example, this would be twice the value of the current iteration.
- global_epoch is the current epoch number.
Like before, suppose that we want to perform the backward pass in the
on_training_batch_end()
function of the model. Then we
would do something like this:
def on_training_batch_end(
    self, state: TrainingState, should_log: bool = False
) -> None:
    # Suppose that our loss tensor is stored in self._loss_items and the number of
    # accumulation steps is stored in self._accumulation_steps
    if state.global_iteration % self._accumulation_steps == 0:
        self._loss_items["loss"].backward()
        self._optimizer.step()
        ...
Warning
Unlike the epoch case, we cannot use state.current_iteration
as that keeps
track of the number of complete iterations we have done.
Checkpoint Saving¶
As mentioned in Reproducibility, Helios will automatically save checkpoints whenever both chkpt_frequency and chkpt_root are set in the Trainer.
The data for checkpoints is stored in a dictionary that always contains the following keys:

- training_state: contains the current TrainingState object.
- model: contains the state of the model as returned by state_dict(). Note that by default this is an empty dictionary.
- rng: contains the state of the supported RNGs.
- version: contains the version of Helios used to generate the checkpoint.
The following keys may optionally appear in the dictionary (an example of inspecting these keys follows the list):

- log_path: appears only when file logging is enabled and contains the path to the log file.
- run_path: appears only when Tensorboard logging is enabled and contains the path to the directory where the data is stored.
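Since checkpoints use the .pth extension, you can inspect these keys directly with torch.load, assuming the checkpoint was written with torch.save (the file name below is hypothetical):

import torch

chkpt = torch.load("cifar10_epoch_3_iter_100.pth", map_location="cpu")
# On PyTorch 2.6+ you may need to pass weights_only=False, since the checkpoint
# stores regular Python objects alongside the model weights.
print(list(chkpt.keys()))  # e.g. ['training_state', 'model', 'rng', 'version', ...]
print(chkpt["version"])    # version of Helios that generated the checkpoint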
The name of the checkpoint is determined as follows:
<run-name>_<epoch>_<iteration>_<additional-metadata>.pth
Where:

- <run-name> is the value assigned to run_name in the trainer.
- <epoch> and <iteration> are the values stored in global_epoch and global_iteration, respectively.
The <additional-metadata>
field is used to allow users to append additional
information to the checkpoint name for easier identification later on. This data is
retrieved from the append_metadata_to_chkpt_name()
function from the model. For example, suppose we want to add the value of the accuracy
metric we computed for validation. Then we would do something like this:
def append_metadata_to_chkpt_name(self, chkpt_name: str) -> str:
    # Suppose the accuracy is stored in self._val_scores
    accuracy = round(self._val_scores["accuracy"], 4)
    return f"accuracy_{accuracy}"
This will append the string to the end of the checkpoint name. Say our run name is
cifar10
and we’re saving on iteration 100 and epoch 3. Then the checkpoint name would
be:
cifar10_epoch_3_iter_100_accuracy_0.89.pth
Note
You do not have to add the pth
extension to the name when you append metadata. This
will be automatically handled by the trainer.
Note
If distributed training is used, then only the process with global rank 0 will save checkpoints.
Migrating Checkpoints¶
The version
key stored in the checkpoints generated by Helios acts as a fail-safe to
prevent future changes from breaking previously generated checkpoints. Helios guarantees
compatibility between checkpoints generated within the same major revision. In other
words, checkpoints generated by version 1.0 will be compatible with version 1.1.
Compatibility between major versions is not guaranteed. Should you wish to migrate
your checkpoints to a newer version of Helios, you may do so by either manually calling
migrate_checkpoints_to_current_version()
or by using the
script directly from the command line as follows:
python -m helios.chkpt_migrator <chkpt-root>
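If you prefer to call the function from Python instead, a sketch might look like this. The module path and argument are assumptions based on the CLI invocation above, so check the API reference for the exact signature:

# Assumed import path and signature; verify against the Helios API reference.
from helios.chkpt_migrator import migrate_checkpoints_to_current_version

migrate_checkpoints_to_current_version("path/to/chkpt-root")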
Logging¶
The Trainer has several sets of flags that control logging. These are:

- enable_tensorboard, which is paired with run_path,
- enable_file_logging, which is paired with log_path, and
- enable_progress_bar.

The _path arguments determine the root directories where the corresponding logs will be saved.
Warning
If a flag is paired with a path, then you must provide the corresponding path when the flag is enabled. In other words, if you set enable_tensorboard, then you must also provide run_path.
Note
If the given path doesn’t exist, it will be created automatically.
The names for the logs are determined as follows:
<run-name>_<current-date/time>
Where <run-name> is the value assigned to the run_name argument and <current-date/time> is the string representation of the current date and time in the format MonthDay_Hour-Minute-Second. This allows multiple training runs with the same name to save to different logs, which can be useful when tweaking hyper-parameters.
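To illustrate the scheme, the snippet below builds a name in that style; the exact date format Helios uses internally is an assumption:

from datetime import datetime

run_name = "cifar10"
# "%b%d_%H-%M-%S" yields e.g. "Jan01_12-30-45" (MonthDay_Hour-Minute-Second).
log_name = f"{run_name}_{datetime.now().strftime('%b%d_%H-%M-%S')}"
print(log_name)  # e.g. cifar10_Jan01_12-30-45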
The enable_progress_bar
flag determines whether a progress bar is shown on the screen
while training is ongoing. The progress bar is only shown on the screen and does not
appear in the file log (if enabled). The behaviour of the progress bar depends on the
choice of training unit:
If epochs are used, then two progress bars are displayed: one that tracks the number of epochs and another that tracks the iterations within the current epoch.
If iterations are used, then a single progress bar is shown that tracks the number of iterations.
The progress bar is also shown during validation, in which case it tracks the number of iterations in the validation set.
CUDA¶
Helios provides several conveniences for handling training on GPUs through CUDA, as well as distributed training. These are:

- Automatic detection and selection of GPUs to train on,
- Automatic mapping of checkpoints to the correct device,
- Support for torchrun,
- The ability to set certain CUDA flags.
The Trainer has two flags that can be used to control the behaviour when using CUDA. These are:

- enable_deterministic: uses deterministic training.
- enable_cudnn_benchmark: enables the use of CuDNN benchmarking for faster training.
Note
enable_deterministic
can also be used when training on the CPU.
Note
CuDNN benchmarking is enabled only during training. It is automatically disabled during validation to avoid non-determinism issues.
Device Selection¶
When the trainer is created, there are two arguments that can be used to determine which device(s) will be used for training: gpus and use_cpu. The logic for determining the device is the following (see the example after this list):
- If use_cpu is true, then the CPU will be used for training.
- Otherwise, the choice of devices is determined by gpus. If no value is assigned and CUDA is not available, then the CPU will be used.
- If gpus is not assigned and CUDA is available, then Helios will automatically use all available GPUs in the system, potentially triggering distributed training if more than one is found.
- If gpus is set, then the indices it contains determine the devices that will be used for training.
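For example, a few common configurations might look like this (other Trainer arguments omitted):

import helios.trainer as hlt

# Force training on the CPU regardless of CUDA availability.
trainer = hlt.Trainer(use_cpu=True)

# Train on GPUs 0 and 1; more than one index triggers distributed training.
trainer = hlt.Trainer(gpus=[0, 1])

# Single GPU with deterministic training and CuDNN benchmarking enabled.
trainer = hlt.Trainer(
    gpus=[0],
    enable_deterministic=True,
    enable_cudnn_benchmark=True,
)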
Note
If torchrun
is used, then Helios will automatically detect the GPU assigned to the
process as if it was assigned to gpus
.
Note
If multiple GPUs are found, or if more than one index is provided to gpus
, then
Helios will automatically launch distributed training.
Warning
gpus
must be set to a list of indices that represent the IDs of the GPU(s) to use.
Model Functions¶
The Model class provides several callbacks that can be used for training, validation, and testing. Below is an overview of the available callbacks along with where they are called in the training loops.
Training Functions¶
The order in which the training functions are called roughly corresponds to the following code:
model.on_training_start()
model.train()
for epoch in range(num_epochs):
    model.on_training_epoch_start()
    for batch in dataloader:
        model.on_training_batch_start()
        model.train_step()
        model.on_training_batch_end()
    model.on_training_epoch_end()
model.on_training_end()
Validation Functions¶
The order in which the validation functions are called roughly corresponds to the following code:
model.eval()
model.on_validation_start()
for batch in dataloader:
    model.on_validation_batch_start()
    model.valid_step()
    model.on_validation_batch_end()
model.on_validation_end()
Testing Functions¶
The order of the testing functions is identical to the one shown for validation:
model.eval()
model.on_testing_start()
for batch in dataloader:
    model.on_testing_batch_start()
    model.test_step()
    model.on_testing_batch_end()
model.on_testing_end()
Exception Handling¶
By default, the main functions of the Trainer (those being fit() and test()) will automatically catch any unhandled exceptions and re-raise them as RuntimeError. Depending on the situation, it may be desirable for certain exceptions to be passed through untouched. In order to accommodate this, the trainer has two lists of exception types: one for training and one for testing.

- If an exception is raised and it is found in the training list (for fit()) or the testing list (for test()), then the exception is passed through unchanged.
- Any other exceptions use the default behaviour.
For example, suppose we had a custom exception called MyException
and we wanted that
exception to be passed through when training because we’re going to handle it ourselves.
We would then do the following:
import helios.trainer as hlt

trainer = hlt.Trainer(...)
# Allow MyException to propagate out of fit() unchanged.
trainer.train_exceptions.append(MyException)

try:
    trainer.fit(...)
except MyException as e:
    ...
The same logic applies for testing. This functionality is particularly useful when paired with plug-ins.
Synchronization¶
Helios provides some synchronization wrappers, which can be found in the distributed module.
The trainer also provides another way to synchronize values through the multi-processing
queue. When using distributed training that isn’t through torchrun
, Helios uses
spawn
to create the processes for each GPU. This triggers a copy of the arguments
passed in to the handler, which in this case are the trainer, model, and datamodule. This
presents a problem in the event that we need to return values back to the main process
once training is complete. To facilitate this task, the trainer will create a queue that can
be accessed through queue
.
Note
If training isn’t distributed or if it was started through torchrun
, then the
queue
is set to None
.
The queue can then be used by either the Model
, the
DataModule
, or any plug-in through their reference to
the trainer.
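As an illustration, the sketch below returns a value to the main process through the queue; the self.trainer attribute used for the model's reference to the trainer is an assumption:

# Inside your model subclass, e.g. at the end of training:
def on_training_end(self) -> None:
    # The queue only exists for spawned distributed training.
    if self.trainer.queue is not None:
        self.trainer.queue.put({"best_accuracy": self._best_accuracy})

# In the main process once fit() returns:
# if trainer.queue is not None:
#     results = trainer.queue.get()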