Training a Classifier with Helios
#################################

As a simple example, we're going to implement the `Training a Classifier `_ tutorial from
PyTorch. For the sake of brevity, we will assume that the reader is familiar with the
process of training networks using PyTorch and focus exclusively on the steps necessary to
accomplish the same task using Helios.

.. note::
   The code for this tutorial is available `here `__.

Project Structure
=================

The first thing we're going to do is create a folder where our virtual environment and the
code will live.

.. code-block:: bash

   mkdir classifier
   cd classifier

Next, let's create a virtual environment and install Helios. All necessary dependencies
will be installed automatically.

.. code-block:: bash

   python3 -m venv .venv     # For Windows, replace with python
   . .venv/bin/activate      # . .venv/Scripts/activate for Windows.
   pip install helios
   touch cifar.py

With that done, let's begin by defining how our data will be managed.

Managing Datasets
=================

We start by setting up our data. In the tutorial, the dataset is downloaded through the
``torchvision`` package, so we have to make sure we download it as well. In Helios, datasets
are managed through the :py:class:`~helios.data.datamodule.DataModule`. Note that this class
shares some similarities with the corresponding class from PyTorch Lightning, so if you're
familiar with that it will be easier to follow along. First, let's add our imports.

.. code-block:: python

   import os
   import pathlib
   import typing

   import torch
   import torch.nn.functional as F
   import torchvision
   import torchvision.transforms.v2 as T
   from torch import nn

   import helios.core as hlc
   import helios.data as hld
   import helios.model as hlm
   import helios.optim as hlo
   import helios.trainer as hlt
   from helios.core import logging

These are all the imports we'll need for this tutorial. Next, let's create a new class for
our data:

.. code-block:: python

   class CIFARDataModule(hld.DataModule):
       def __init__(self, root: pathlib.Path) -> None:
           super().__init__()
           self._root = root / "data"

The datamodule takes as an argument the root directory where the datasets will be
downloaded. Next, let's add the code to download the data:

.. code-block:: python

   def prepare_data(self) -> None:
       torchvision.datasets.CIFAR10(root=self._root, train=True, download=True)
       torchvision.datasets.CIFAR10(root=self._root, train=False, download=True)

The :py:meth:`~helios.data.datamodule.DataModule.prepare_data` function will be called
automatically by the Trainer before training starts. If we were training with multiple
GPUs, this would be called *prior* to the creation of the distributed context. Now let's
make the datasets themselves:

.. code-block:: python

   def setup(self) -> None:
       transforms = T.Compose(
           [
               hld.transforms.ToImageTensor(),
               T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
           ]
       )

       params = hld.DataLoaderParams()
       params.batch_size = 4
       params.shuffle = True
       params.num_workers = 2
       params.drop_last = True

       self._train_dataset = self._create_dataset(
           torchvision.datasets.CIFAR10(
               root=self._root, train=True, download=False, transform=transforms
           ),
           params,
       )

       params.drop_last = False
       params.shuffle = False
       self._valid_dataset = self._create_dataset(
           torchvision.datasets.CIFAR10(
               root=self._root, train=False, download=False, transform=transforms
           ),
           params,
       )
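As a quick aside (this is standard ``torchvision`` behaviour, not something specific to
Helios): with a mean and standard deviation of 0.5 per channel, the ``T.Normalize`` call
above maps image values from the ``[0, 1]`` range produced by the preceding transform to
``[-1, 1]``. A minimal sketch of the equivalent arithmetic:

.. code-block:: python

   import torch

   x = torch.rand(3, 32, 32)           # fake image with values in [0, 1]
   y = (x - 0.5) / 0.5                 # what Normalize((0.5,) * 3, (0.5,) * 3) computes
   print(y.min() >= -1, y.max() <= 1)  # tensor(True) tensor(True)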
There are a few things to note about ``setup``:

#. Helios ships with a transform, :py:class:`~helios.data.transforms.ToImageTensor`, that
   automatically converts images (or arrays of images) from their NumPy representation to
   tensors. The class is ultimately equivalent to the following:

   .. code-block:: python

      import torchvision.transforms.v2 as T

      to_image_tensor = T.Compose(
          [T.ToImage(), T.ToDtype(dtype=torch.float32, scale=scale), T.ToPureTensor()]
      )

#. The :py:class:`~helios.data.datamodule.DataLoaderParams` object wraps all of the
   settings used to create the dataloader and sampler pair. This is where you can set
   options like batch size, number of workers, whether the dataset should be shuffled,
   etc.
#. The ``params`` object can be freely re-used without worrying about settings interfering
   with each other. As soon as ``_create_dataset`` is called, the ``params`` object is
   deep-copied to avoid conflicts.

Making the Model
================

Network
-------

With the datasets ready, we can now turn our attention to the network. The code is
identical to the one from the PyTorch tutorial, so we won't explain any details.

.. code-block:: python

   class Net(nn.Module):
       def __init__(self):
           super().__init__()
           self.conv1 = nn.Conv2d(3, 6, 5)
           self.pool = nn.MaxPool2d(2, 2)
           self.conv2 = nn.Conv2d(6, 16, 5)
           self.fc1 = nn.Linear(16 * 5 * 5, 120)
           self.fc2 = nn.Linear(120, 84)
           self.fc3 = nn.Linear(84, 10)

       def forward(self, x: torch.Tensor) -> torch.Tensor:
           x = self.pool(F.relu(self.conv1(x)))
           x = self.pool(F.relu(self.conv2(x)))
           x = torch.flatten(x, 1)  # flatten all dimensions except batch
           x = F.relu(self.fc1(x))
           x = F.relu(self.fc2(x))
           x = self.fc3(x)
           return x

With the network ready, we can implement the other main class from Helios: the model. The
:py:class:`~helios.model.model.Model` class serves as the main holder for the training code
itself. The functionality is provided through different callback functions that are used by
the :py:class:`~helios.trainer.Trainer` at specific points in time. The first one is the
:py:meth:`~helios.model.model.Model.setup` function, which we'll use to initialize all the
necessary members for training. In our case, we need:

* The network itself,
* The optimizer, and
* The loss function.

Following the tutorial, we'll use ``SGD`` for our optimizer and ``CrossEntropyLoss`` for our
loss function. The code would be as follows:

.. code-block:: python

   class ClassifierModel(hlm.Model):
       def __init__(self) -> None:
           super().__init__("classifier")

       def setup(self, fast_init: bool = False) -> None:
           self._net = Net().to(self.device)
           self._criterion = nn.CrossEntropyLoss().to(self.device)
           self._optimizer = hlo.create_optimizer(
               "SGD", self._net.parameters(), lr=0.001, momentum=0.9
           )

A few comments:

#. All classes that derive from :py:class:`~helios.model.model.Model` *must* provide a name
   to the base class. This is used to determine the name that will be given to the
   checkpoints when they are saved (more on this later).
#. Upon training start, the :py:class:`~helios.trainer.Trainer` will automatically set the
   correct ``torch.device`` into the model. This means that any classes that need to be
   moved to the device can do so through the :py:attr:`~helios.model.model.Model.device`
   property.
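Since ``setup`` is where ``Net`` gets instantiated, this is also a convenient place to
double-check the hard-coded ``16 * 5 * 5`` input size of ``fc1``. A quick standalone sanity
check in plain PyTorch, using the ``Net`` class defined above (not part of the tutorial
code):

.. code-block:: python

   import torch

   net = Net()                           # the network defined above
   out = net(torch.randn(4, 3, 32, 32))  # a fake batch of four CIFAR10-sized images
   print(out.shape)                      # torch.Size([4, 10]): one logit per class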
Registries
----------

One of the main features of Helios is the registry system that it ships with. The
registries can be used to write *re-usable* training code for different networks. The idea
is that a single model class can be written which can then create the necessary optimizers,
loss functions, etc. based on settings which can be provided externally through a config
file (for example). Helios ships with the following registries:

* :py:data:`~helios.data.datamodule.DATASET_REGISTRY`,
* :py:data:`~helios.data.samplers.SAMPLER_REGISTRY`,
* :py:data:`~helios.data.transforms.TRANSFORM_REGISTRY`,
* :py:data:`~helios.losses.utils.LOSS_REGISTRY`,
* :py:data:`~helios.metrics.metrics.METRICS_REGISTRY`,
* :py:data:`~helios.model.utils.MODEL_REGISTRY`,
* :py:data:`~helios.nn.utils.NETWORK_REGISTRY`,
* :py:data:`~helios.optim.utils.OPTIMIZER_REGISTRY`,
* :py:data:`~helios.scheduler.utils.SCHEDULER_REGISTRY`

Each registry comes with an associated ``create_`` function that will create the
corresponding type from the registry. By default, the optimizer and scheduler registries
ship with the classes that PyTorch offers for each type. In our example, we could create
the optimizer directly as follows:

.. code-block:: python

   from torch import optim

   self._optimizer = optim.SGD(self._net.parameters(), lr=0.001, momentum=0.9)

Alternatively, we can create it directly through the registry as follows:

.. code-block:: python

   self._optimizer = hlo.create_optimizer(
       "SGD", self._net.parameters(), lr=0.001, momentum=0.9
   )

Note that here we're manually specifying the arguments to the optimizer, but we could have
just as easily stored the arguments in a dictionary (loaded from a file or passed in as a
command-line argument) and then passed them in as follows:

.. code-block:: python

   # These args are passed in externally.
   args = {"lr": 0.001, "momentum": 0.9}
   self._optimizer = hlo.create_optimizer("SGD", self._net.parameters(), **args)

This would allow us to re-use the same model with different combinations of networks and
optimizers, reducing code duplication and allowing the code to be standardised across
combinations of settings.

Checkpoints
-----------

Now that the loss and optimizer have been created, we turn our attention to checkpoints.
The :py:class:`~helios.trainer.Trainer` is designed to automatically save checkpoints at
predetermined intervals. The checkpoints store all the necessary state to ensure training
can be resumed. As part of the stored state, the model is able to add its own state. In our
case, we would like to save the state of the network, optimizer, and loss function. To do
this, we need to override :py:meth:`~helios.model.model.Model.load_state_dict` and
:py:meth:`~helios.model.model.Model.state_dict`. The code is:

.. code-block:: python

   def load_state_dict(
       self, state_dict: dict[str, typing.Any], fast_init: bool = False
   ) -> None:
       self._net.load_state_dict(state_dict["net"])
       self._criterion.load_state_dict(state_dict["criterion"])
       self._optimizer.load_state_dict(state_dict["optimizer"])

   def state_dict(self) -> dict[str, typing.Any]:
       return {
           "net": self._net.state_dict(),
           "criterion": self._criterion.state_dict(),
           "optimizer": self._optimizer.state_dict(),
       }

Similarly to the device, the model *should not* remap any weights from the loaded
checkpoint. Those will be automatically mapped by the :py:class:`~helios.trainer.Trainer`
when the checkpoint is loaded.
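The same pattern extends to any other stateful object the model might own. For instance, if
we later added a learning-rate scheduler (this tutorial doesn't create one, so
``self._scheduler`` below is purely hypothetical), ``state_dict`` would simply gain one more
entry, and ``load_state_dict`` would read it back in the same way:

.. code-block:: python

   def state_dict(self) -> dict[str, typing.Any]:
       return {
           "net": self._net.state_dict(),
           "criterion": self._criterion.state_dict(),
           "optimizer": self._optimizer.state_dict(),
           # Hypothetical: a scheduler created in setup() would be saved like this.
           "scheduler": self._scheduler.state_dict(),
       }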
Training
--------

We can now focus on the training code itself. It is recommended that you read through the
documentation for the :py:class:`~helios.model.model.Model` so you are aware of all the
callbacks available for training, which can be identified by the prefix
``on_training_...``. For our purposes, we're going to need the following:

* We're going to trace the network and log it to tensorboard.
* We need to perform the forward and backward passes.
* We need to log the value of our loss function on each iteration.
* When training is done, we also want to log the final validation score as well as the
  final value of the loss function.

To start, let's add the code to switch our network into training mode:

.. code-block:: python

   def train(self) -> None:
       self._net.train()

Next, let's add the code to trace. Since we only need to do this once when training begins,
we're going to use :py:meth:`~helios.model.model.Model.on_training_start`:

.. code-block:: python

   def on_training_start(self) -> None:
       tb_logger = hlc.get_from_optional(logging.get_tensorboard_writer())

       x = torch.randn((1, 3, 32, 32)).to(self.device)
       tb_logger.add_graph(self._net, x)

The Tensorboard writer is automatically created by the :py:class:`~helios.trainer.Trainer`
if requested to do so. As a result, :py:func:`~helios.core.logging.get_tensorboard_writer`
can return ``None``. We could ensure that it's valid by doing:

.. code-block:: python

   logger = logging.get_tensorboard_writer()
   if logger is not None:
       ...

   # Or alternatively:
   assert logger is not None

This is especially necessary when using linters like Mypy. Since this gets repetitive very
quickly, we can instead use :py:func:`~helios.core.utils.get_from_optional`, which ensures
that the provided value is not ``None`` and returns it in a way that Mypy correctly
identifies.

Now to add the forward and backward passes. These are going to be kept in
:py:meth:`~helios.model.model.Model.train_step`:

.. code-block:: python

   def train_step(self, batch: typing.Any, state: hlt.TrainingState) -> None:
       inputs, labels = batch
       inputs = inputs.to(self.device)
       labels = labels.to(self.device)

       self._optimizer.zero_grad()
       outputs = self._net(inputs)
       loss = self._criterion(outputs, labels)
       loss.backward()
       self._optimizer.step()

       self._loss_items["loss"] = loss

There are a few things to unpack here, so let's go one by one:

#. The type of the ``batch`` parameter is determined by our dataset. In the case of the
   CIFAR10 dataset, the batch is a tuple of tensors containing the inputs and labels (a
   concrete shape check follows this list). Note that the base model class imposes no
   restrictions on what the batch is.
#. Since the base model class makes no assumptions on the type of the batch, we need to
   move the components of the batch to the target device ourselves. This gives maximum
   flexibility since you can choose what (if anything) gets moved. Note that, similarly to
   the creation of the network itself, we use the
   :py:attr:`~helios.model.model.Model.device` property.
#. We're going to store the returned loss in the ``_loss_items`` dictionary. This allows
   the model to automatically gather the tensors for us if we were doing distributed
   training.
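To make the first point concrete, here is what a batch from this dataset looks like. The
snippet below uses ``torchvision`` directly and is independent of Helios; it assumes the
same float conversion that :py:class:`~helios.data.transforms.ToImageTensor` performs:

.. code-block:: python

   import torch
   import torchvision
   import torchvision.transforms.v2 as T

   # Roughly the conversion ToImageTensor applies (scale=True assumed here).
   transforms = T.Compose([T.ToImage(), T.ToDtype(torch.float32, scale=True)])
   dataset = torchvision.datasets.CIFAR10(
       root="data", train=True, download=True, transform=transforms
   )
   loader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True)

   inputs, labels = next(iter(loader))
   print(inputs.shape, inputs.dtype)  # torch.Size([4, 3, 32, 32]) torch.float32
   print(labels.shape, labels.dtype)  # torch.Size([4]) torch.int64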
Now let's look at the logging code:

.. code-block:: python

   def on_training_batch_end(
       self,
       state: hlt.TrainingState,
       should_log: bool = False,
   ) -> None:
       super().on_training_batch_end(state, should_log)

       if should_log:
           root_logger = logging.get_root_logger()
           tb_logger = hlc.get_from_optional(logging.get_tensorboard_writer())

           loss_val = self._loss_items["loss"]
           root_logger.info(
               f"[{state.global_epoch + 1}, {state.global_iteration:5d}] "
               f"loss: {loss_val:.3f}, "
               f"running loss: {loss_val / state.running_iter:.3f} "
               f"avg time: {state.average_iter_time:.2f}s"
           )
           tb_logger.add_scalar("train/loss", loss_val, state.global_iteration)
           tb_logger.add_scalar(
               "train/running loss",
               loss_val / state.running_iter,
               state.global_iteration,
           )

Let's examine each part independently:

#. The call to ``super().on_training_batch_end`` will automatically gather any tensors
   stored in the ``_loss_items`` dictionary if we're in distributed mode, so we don't have
   to do it manually ourselves.
#. When the :py:class:`~helios.trainer.Trainer` is created, we can specify the interval at
   which logging should occur. Since
   :py:meth:`~helios.model.model.Model.on_training_batch_end` is called at the end of
   *every* batch, the ``should_log`` flag is used to indicate when logging should happen.

.. note::
   In our example, we're performing both the forward and backward passes in
   :py:meth:`~helios.model.model.Model.train_step`. That being said, it is possible to
   split the forward and backward passes and have them occur in
   :py:meth:`~helios.model.model.Model.train_step` and
   :py:meth:`~helios.model.model.Model.on_training_batch_end` if it makes sense for your
   workflow.
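To make the note above a bit more tangible, here is one way such a split *could* look. This
is a hypothetical sketch, not code used anywhere in this tutorial; only the location of the
forward and backward passes changes relative to the version above:

.. code-block:: python

   def train_step(self, batch: typing.Any, state: hlt.TrainingState) -> None:
       inputs, labels = batch
       inputs = inputs.to(self.device)
       labels = labels.to(self.device)

       # Forward pass only; the loss tensor keeps its graph for the backward pass.
       self._optimizer.zero_grad()
       outputs = self._net(inputs)
       self._loss_items["loss"] = self._criterion(outputs, labels)

   def on_training_batch_end(
       self,
       state: hlt.TrainingState,
       should_log: bool = False,
   ) -> None:
       # Backward pass and optimizer step happen at the end of the batch instead.
       self._loss_items["loss"].backward()
       self._optimizer.step()

       super().on_training_batch_end(state, should_log)
       if should_log:
           ...  # same logging as in the version above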
The rest of ``on_training_batch_end`` is pretty self-explanatory, with us just grabbing the
Tensorboard logger as before. Note that we also call
:py:func:`~helios.core.logging.get_root_logger`, so let's discuss how Helios manages
logging.

Logging
-------

By default, Helios provides two loggers:

* :py:class:`~helios.core.logging.RootLogger`: logs to a file and to stdout.
* :py:class:`~helios.core.logging.TensorboardWriter`: wraps the PyTorch Tensorboard writer
  class.

.. note::
   The :py:class:`~helios.core.logging.RootLogger` will *always* be created with stream
   output by default. This behaviour *cannot* be changed, as it is used to correctly
   forward error messages that may occur during training. Logging to a file can be toggled
   on/off based on the arguments provided to the :py:class:`~helios.trainer.Trainer` upon
   construction.

The creation of these loggers is handled by the :py:class:`~helios.trainer.Trainer`, and
will be performed before training starts. If training is distributed, both loggers are
designed to only log on the process whose rank is 0. In the event that training occurs over
multiple nodes, then logging is performed on the process whose *global* rank is 0. The
loggers can be obtained through :py:func:`~helios.core.logging.get_root_logger` and
:py:func:`~helios.core.logging.get_tensorboard_writer`.

.. warning::
   Only the :py:class:`~helios.core.logging.RootLogger` is guaranteed to exist. In the
   event that the trainer is created with Tensorboard logging disabled,
   :py:func:`~helios.core.logging.get_tensorboard_writer` will return ``None``.

Now that we have logged the training losses, let's add the code to log the final validation
result as well as the final loss value.

.. code-block:: python

   def on_training_end(self) -> None:
       total = self._val_scores["total"]
       correct = self._val_scores["correct"]
       accuracy = 100 * correct // total

       writer = hlc.get_from_optional(logging.get_tensorboard_writer())
       writer.add_hparams(
           {"lr": 0.001, "momentum": 0.9, "epochs": 2},
           {"hparam/accuracy": accuracy, "hparam/loss": self._loss_items["loss"].item()},
       )

We will explain how validation works in the next section. The code itself is
self-explanatory: we compute the final accuracy and then log it to the Tensorboard writer.

Validation
----------

Similarly to the suite of callbacks used for training, the
:py:class:`~helios.model.model.Model` class has a set of functions for both validation and
testing. In our example, we want to perform validation, so let's first add a function to
switch our network to evaluation mode:

.. code-block:: python

   def eval(self) -> None:
       self._net.eval()

The :py:class:`~helios.model.model.Model` contains a dictionary for validation scores
similar to the one we used earlier for loss values. In our example, we need to keep track
of the number of labels we have seen and how many of those have been predicted correctly.
To do this, we're going to assign these fields before validation starts:

.. code-block:: python

   def on_validation_start(self, validation_cycle: int) -> None:
       super().on_validation_start(validation_cycle)
       self._val_scores["total"] = 0
       self._val_scores["correct"] = 0

Calling :py:meth:`~helios.model.model.Model.on_validation_start` on the base class
automatically clears out the ``_val_scores`` dictionary to ensure we don't accidentally
overwrite or carry over values. After setting the fields we care about, let's perform the
validation step:

.. code-block:: python

   def valid_step(self, batch: typing.Any, step: int) -> None:
       images, labels = batch
       images = images.to(self.device)
       labels = labels.to(self.device)

       outputs = self._net(images)
       _, predicted = torch.max(outputs.data, 1)
       self._val_scores["total"] += labels.size(0)
       self._val_scores["correct"] += (predicted == labels).sum().item()

The :py:meth:`~helios.model.model.Model.valid_step` function is analogous to
:py:meth:`~helios.model.model.Model.train_step`. Like before, we receive the batch from our
dataset and we are responsible for moving the data to the appropriate device using
:py:attr:`~helios.model.model.Model.device`. The rest of the code is identical to the
PyTorch tutorial, with the only difference being that we assign the results to the fields
we added before validation began. Finally, we need to compute the final accuracy score and
log it:

.. code-block:: python

   def on_validation_end(self, validation_cycle: int) -> None:
       root_logger = logging.get_root_logger()
       tb_logger = hlc.get_from_optional(logging.get_tensorboard_writer())

       total = self._val_scores["total"]
       correct = self._val_scores["correct"]
       accuracy = 100 * correct // total

       root_logger.info(f"[Validation {validation_cycle}] accuracy: {accuracy}")
       tb_logger.add_scalar("val", accuracy, validation_cycle)

Creating the Trainer
====================

Now that we have all of our training code ready, all that is left is to create the trainer
and train our network. For the sake of simplicity, we're going to do this in the main block
of our script. The trainer requires two things to train:

#. The model we want to use.
#. The datamodule with our datasets.

Let's make those first:

.. code-block:: python

   if __name__ == "__main__":
       datamodule = CIFARDataModule(pathlib.Path.cwd())
       model = ClassifierModel()

Now let's create the trainer itself:

.. code-block:: python

   trainer = hlt.Trainer(
       run_name="cifar10",
       train_unit=hlt.TrainingUnit.EPOCH,
       total_steps=2,
       valid_frequency=1,
       chkpt_frequency=1,
       print_frequency=10,
       enable_tensorboard=True,
       enable_file_logging=True,
       enable_progress_bar=True,
       enable_deterministic=True,
       chkpt_root=pathlib.Path.cwd() / "chkpt",
       log_path=pathlib.Path.cwd() / "logs",
       run_path=pathlib.Path.cwd() / "runs",
   )

The :py:class:`~helios.trainer.Trainer` constructor takes a long list of arguments that
provide control over various aspects of training. You're encouraged to read through the
list of parameters for more details. Let's go over each of the arguments we set in our
example, starting with the training unit.
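Before diving into the units themselves, here is how the same call would look if we drove
training by iteration count instead of by epoch; the next section explains the difference.
The sketch below is purely illustrative and the numbers are arbitrary:

.. code-block:: python

   trainer = hlt.Trainer(
       run_name="cifar10",
       train_unit=hlt.TrainingUnit.ITERATION,
       total_steps=10_000,     # stop after 10k iterations, regardless of epochs
       valid_frequency=2_000,  # validate every 2k iterations
       chkpt_frequency=2_000,  # save a checkpoint every 2k iterations
       print_frequency=10,     # always measured in iterations, see below
       # ... remaining arguments as in the epoch-based example above.
   )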
Training Units
--------------

The :py:class:`~helios.trainer.Trainer` provides two ways of training networks based on the
*training unit*. These are:

#. :py:attr:`~helios.trainer.TrainingUnit.ITERATION`: used when the network needs to be
   trained for :math:`N` iterations.
#. :py:attr:`~helios.trainer.TrainingUnit.EPOCH`: used when the network needs to be trained
   for :math:`N` epochs.

The choice of training unit determines the behaviour of certain portions of the training
loop, which we will discuss next.

Training by Epoch
^^^^^^^^^^^^^^^^^

This is the most common case for training. In this mode, the training loop will run until
the number of epochs specified by ``total_steps`` has been reached, and it has the
following behaviour:

* ``valid_frequency`` and ``chkpt_frequency`` are measured in epochs. For example, say that
  we want to train for 10 epochs and we want to perform validation every second epoch. This
  means that validation will occur on epochs 2, 4, 6, 8, and 10. Likewise, if we want to
  save checkpoints every second epoch, then checkpoints will be saved on epochs 2, 4, 6, 8,
  and 10.
* Early stopping is performed on epochs. See :ref:`stopping-training`.
* Gradient accumulation has no effect on the number of epochs. See
  :ref:`gradient-accumulation`.

.. note::
   ``print_frequency`` **always** refers to the number of iterations between logging calls.
   This is *independent* of the training unit.

Training by Iteration
^^^^^^^^^^^^^^^^^^^^^

In this mode, the training loop will run until the number of iterations specified by
``total_steps`` has been reached, *regardless* of how many epochs (complete or fractional)
are performed. It has the following behaviour:

* ``valid_frequency`` and ``chkpt_frequency`` are measured in iterations. For example, say
  that we want to train for 10k iterations and we want to perform validation every 2k
  iterations. This means that validation will occur on iterations 2k, 4k, 6k, 8k, and 10k.
  Likewise, if we want to save checkpoints every 2k iterations, then checkpoints will be
  saved on iterations 2k, 4k, 6k, 8k, and 10k.
* Early stopping is performed on iterations. See :ref:`stopping-training`.
* Gradient accumulation multiplies the total number of iterations. See
  :ref:`gradient-accumulation`.

Enabling Logging and Checkpoints
--------------------------------

The next 3 arguments of the trainer cover the various kinds of logging that are available.
As mentioned previously, the :py:class:`~helios.trainer.Trainer` will *always* create the
:py:class:`~helios.core.logging.RootLogger` with output to stdout. That said, we can add
logging to a file and to Tensorboard by setting the corresponding flags:

* ``enable_tensorboard``: enables the Tensorboard writer.
* ``enable_file_logging``: adds a file stream to the log.

.. warning::
   If either ``enable_tensorboard`` or ``enable_file_logging`` is set, then you **must**
   also set ``run_path`` or ``log_path``, respectively. These should be set to a directory
   where the logs will be saved. Note that if the directory doesn't exist, it will be
   created automatically.

The final logging flag determines whether a progress bar is displayed while training is
ongoing. See :ref:`logging` for more details. Finally, since we want to save checkpoints,
we also assign the path that the checkpoints are saved to using ``chkpt_root``.

.. warning::
   If ``chkpt_frequency`` is not 0, then you **must** set ``chkpt_root`` to the directory
   where checkpoints are saved. Note that if the directory doesn't exist, it will be
   created automatically. See :ref:`checkpoint-saving`.
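With the paths used in our example, the project folder will end up looking roughly like
this after a run (the exact file names inside each directory depend on the run name and the
trainer settings, so they are omitted here):

.. code-block:: text

   classifier/
   ├── .venv/     # virtual environment from the setup step
   ├── cifar.py
   ├── chkpt/     # checkpoints (chkpt_root)
   ├── data/      # CIFAR10 download (the datamodule's root / "data")
   ├── logs/      # log files (log_path)
   └── runs/      # Tensorboard events (run_path)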
We also set ``enable_deterministic`` to indicate to PyTorch that we want to use
deterministic operations while training. This belongs to a set of flags that configure the
environment when the trainer is created.

Launching Training
==================

The final step is to start training. With the trainer created, all that we have to do is
this:

.. code-block:: python

   trainer.fit(model, datamodule)

And that's it! Helios will automatically configure the training environment and run the
training loop for the specified number of epochs. Validation will be performed and a
checkpoint saved every epoch. Helios provides more functionality than what is shown here,
so you are encouraged to read through the quick reference guide for more details.
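As a quick recap, with the virtual environment from the start of the tutorial active, the
whole experiment is launched from the project folder with the commands below (the second
step assumes the ``tensorboard`` package is available in the environment):

.. code-block:: bash

   python cifar.py

   # Optional: inspect the Tensorboard logs once the run has produced data.
   tensorboard --logdir runs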