aegis_sim.checkpoint

Checkpoint for resumable simulations.

A Checkpoint captures the full simulation state so it can be restored later, allowing a simulation to be resumed from exactly where it left off.

How checkpointing works

Checkpoints are saved periodically during a simulation, controlled by the CHECKPOINT_RATE parameter (in steps). When CHECKPOINT_RATE is 0 (the default), no checkpoints are saved. The checkpoint is written to <output_dir>/checkpoint and replaced on each save, so the primary file always holds the latest state; save() keeps the previous checkpoint next to it as a .bak backup.

An initial checkpoint is written before the simulation loop starts, so there is always something to resume from even if the program crashes on the first step. Subsequent checkpoints are written at the end of each step where step % CHECKPOINT_RATE == 0, overwriting the initial one.
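The periodic schedule described above can be sketched as a small helper. This is only an illustration: checkpoint_steps is a hypothetical name, not part of aegis_sim, and it assumes steps are counted from 1. The initial checkpoint written before the loop is not included in the result.

```python
from typing import List


def checkpoint_steps(total_steps: int, checkpoint_rate: int) -> List[int]:
    """Return the steps at whose end a periodic checkpoint would be written."""
    if checkpoint_rate <= 0:
        return []  # CHECKPOINT_RATE == 0 disables checkpointing
    # Assumption: steps run from 1 to total_steps inclusive.
    return [s for s in range(1, total_steps + 1) if s % checkpoint_rate == 0]


print(checkpoint_steps(1000, 300))  # [300, 600, 900]
```

With these numbers, a crash at step 950 would leave the step-900 checkpoint as the resume point.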

The save is atomic (write to a temp file, then rename) so a crash mid-write cannot corrupt the checkpoint.
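A standalone sketch of the write-then-rename pattern, using a plain dict in place of a Checkpoint. The helper atomic_pickle and the demo are illustrative and not part of aegis_sim. The demo forces a failure during serialization to show that the file written earlier survives.

```python
import os
import pickle
import tempfile
from pathlib import Path


def atomic_pickle(obj, path: Path) -> None:
    """Pickle obj to a temp file in the target directory, then rename it."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
        # On POSIX, a successful os.replace is an atomic rename.
        os.replace(tmp_name, path)
    except BaseException:
        # The target was never touched; only the temp file needs cleaning up.
        Path(tmp_name).unlink(missing_ok=True)
        raise


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "checkpoint"
        atomic_pickle({"step": 100}, target)
        try:
            atomic_pickle(lambda: None, target)  # lambdas cannot be pickled
        except Exception:
            pass  # the failed save must not damage the existing file
        with open(target, "rb") as f:
            return pickle.load(f)


print(demo())  # {'step': 100}
```

The temp file is created in the same directory as the target so that the rename stays within one filesystem.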

Each checkpoint captures everything needed to reconstruct the simulation:

  • population: the full Population instance (genomes, ages, phenotypes, etc.)
  • eggs: unhatched offspring (Population instance, or None), so in-progress incubation is preserved when INCUBATION_PERIOD > 0
  • step: the simulation step at which the checkpoint was taken
  • rng_state: the numpy Generator (variables.rng) bit-generator state, so the resumed simulation produces the same random sequence it would have if it had never been interrupted
  • legacy_rng_state: the legacy np.random global state (used by some submodels like envdrift)
  • random_seed: the original seed, kept for bookkeeping
  • envdrift_map: the environmental drift XOR map (or None if envdrift is off)
  • predator_population_size: the current predator count, so predation dynamics continue correctly when PREDATION_RATE > 0
  • resource_capacity: the current available resource amount, so resource dynamics continue correctly
  • final_config: the fully resolved parameter dict, so the resumed run uses the exact same parameters without needing the original config file
  • custom_config_path: path to the original config file, used to locate the output directory on resume
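To illustrate why both RNG states are stored, the following sketch saves and restores a numpy Generator state and the legacy global state, then checks that the restored generators continue with the same draws. The helper resume_generator is a hypothetical name, and the sketch assumes the saved state came from the default bit generator (PCG64).

```python
import numpy as np


def resume_generator(saved_state: dict) -> np.random.Generator:
    """Build a Generator and load a previously saved bit-generator state."""
    rng = np.random.default_rng()
    # Assumption: saved_state was taken from a PCG64 bit generator,
    # which is what default_rng() creates.
    rng.bit_generator.state = saved_state
    return rng


rng = np.random.default_rng(42)
rng.random(5)  # advance the stream, as a running simulation would

# What a checkpoint would store.
saved_state = rng.bit_generator.state
legacy_state = np.random.get_state()

# Draws the uninterrupted run would make next.
uninterrupted = rng.random(3)
legacy_uninterrupted = np.random.rand(3)

# "Resume": restore both states and draw again.
resumed = resume_generator(saved_state)
np.random.set_state(legacy_state)

print(np.array_equal(resumed.random(3), uninterrupted))         # True
print(np.array_equal(np.random.rand(3), legacy_uninterrupted))  # True
```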

CLI usage

-c is always required. -r, -o, and -p are mutually exclusive:

    aegis sim -c config.yml              # fresh run
    aegis sim -c config.yml -o           # fresh run, overwrite existing output
    aegis sim -c config.yml -p pickle    # seed run (new sim from saved population)
    aegis sim -c config.yml -r           # resume from checkpoint
    aegis sim -c config.yml -r --extend 1500  # resume and extend to 1500 steps

When -r is used and no output directory exists yet, AEGIS falls back to a fresh run. If the output directory exists but contains no checkpoint, an error is raised.
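The -r fallback rule can be sketched as follows. This models only the resume branch; resolve_resume is a hypothetical helper, not the actual AEGIS CLI code. It treats a checkpoint.bak file as a valid checkpoint, matching what find_latest accepts.

```python
import tempfile
from pathlib import Path


def resolve_resume(output_dir: Path) -> str:
    """Decide what `-r` does for a given output directory."""
    if not output_dir.exists():
        return "fresh"  # nothing to resume from: fall back to a fresh run
    candidates = [output_dir / "checkpoint", output_dir / "checkpoint.bak"]
    if not any(p.exists() for p in candidates):
        raise FileNotFoundError(f"No checkpoint file found in {output_dir}")
    return "resume"


with tempfile.TemporaryDirectory() as tmp:
    odir = Path(tmp) / "config"
    print(resolve_resume(odir))  # fresh
    odir.mkdir()
    (odir / "checkpoint").write_bytes(b"")
    print(resolve_resume(odir))  # resume
```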

Output truncation on resume

Between checkpoints, recorders append data to output files. If the simulation crashes at step 783 but the last checkpoint was taken at step 700, the output files contain data for steps 700–783 that will be re-recorded on resume. To prevent duplicates, truncate_for_resume(checkpoint_step) is called during resume initialization. It:

  • Truncates per-step files (popsize, resources, eggs) to step - 1 lines
  • Truncates rate-based files (progress, spectra, genotypes, phenotypes, popgen) to the number of recordings that occurred before the checkpoint step, plus any header lines
  • Deletes TE files whose collection window started at or after the checkpoint
  • Handles envdriftmap separately (uses step % rate == 0 without the step-1 special case)

Snapshot files (feather format) are named by step number, so re-recording simply overwrites them. One-time files (config, summary) are also overwritten.
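As an illustration of the per-step case only, a file with one line per step can be cut back to step - 1 lines as sketched below. The helper truncate_to_lines and the file name popsize.csv are assumptions for this example, as is the absence of a header line; the real truncate_for_resume also handles rate-based, TE, and envdriftmap files, which are not modeled here.

```python
import tempfile
from pathlib import Path


def truncate_to_lines(path: Path, n_lines: int) -> None:
    """Keep only the first n_lines lines of the file at path."""
    with open(path, "rb+") as f:
        for _ in range(n_lines):
            if not f.readline():
                break  # file is shorter than n_lines; nothing to cut
        f.truncate(f.tell())


def demo(crash_step: int = 783, checkpoint_step: int = 700) -> int:
    with tempfile.TemporaryDirectory() as tmp:
        popsize = Path(tmp) / "popsize.csv"
        # One line per completed step, up to the crash.
        popsize.write_text("".join(f"{100 + s}\n" for s in range(1, crash_step + 1)))
        truncate_to_lines(popsize, checkpoint_step - 1)
        return len(popsize.read_text().splitlines())


print(demo())  # 699
```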

Extending a simulation

--extend N overrides STEPS_PER_SIMULATION to N after loading the checkpoint, allowing a completed simulation to continue beyond its original target. N must be greater than the checkpoint step.
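A minimal sketch of the override and its validation, assuming the resolved parameters are held in a plain dict as final_config is. The function apply_extend is hypothetical, and the error type raised by the real CLI is not specified in this document.

```python
from typing import Optional


def apply_extend(final_config: dict, checkpoint_step: int,
                 extend_to: Optional[int]) -> dict:
    """Return a config whose STEPS_PER_SIMULATION reflects --extend, if given."""
    if extend_to is None:
        return final_config  # no --extend: keep the original target
    if extend_to <= checkpoint_step:
        raise ValueError(
            f"--extend {extend_to} must be greater than "
            f"the checkpoint step {checkpoint_step}"
        )
    # Copy rather than mutate, so the loaded checkpoint stays unchanged.
    return {**final_config, "STEPS_PER_SIMULATION": extend_to}


config = {"STEPS_PER_SIMULATION": 1000}
print(apply_extend(config, 1000, 1500)["STEPS_PER_SIMULATION"])  # 1500
```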

Checkpointing vs seeding

Seeding (-p) starts a new simulation using a saved Population as the initial population. It resets the step counter to 1, creates fresh output, and does not restore RNG state or configuration.

Checkpointing (-r) resumes an existing simulation from the exact point it was saved. The step counter, RNG state, configuration, and output directory are all preserved.

import pickle
import logging
import pathlib
import tempfile
import numpy as np

from aegis_sim.dataclasses.population import Population


class Checkpoint:
    """Immutable snapshot of simulation state at a given step.

    Attributes:
        population: The full Population instance at the time of capture.
        eggs: Unhatched offspring (Population instance), or None.
        step: Simulation step number when the checkpoint was taken.
        rng_state: State dict of ``numpy.random.Generator.bit_generator``.
        random_seed: The original random seed used to initialize the simulation.
        legacy_rng_state: State tuple from ``numpy.random.get_state()``.
        envdrift_map: Boolean ndarray (the XOR map), or None if envdrift is disabled.
        predator_population_size: Current predator count (float), for predation dynamics.
        resource_capacity: Current available resource amount (float).
        final_config: Fully resolved parameter dict (default + species + config + overrides).
        custom_config_path: Path to the original ``.yml`` config file.
    """

    def __init__(
        self,
        population: Population,
        eggs,
        step: int,
        rng_state: dict,
        random_seed: int,
        legacy_rng_state: tuple,  # np.random.get_state() returns a tuple
        envdrift_map,
        predator_population_size: float,
        resource_capacity: float,
        final_config: dict,
        custom_config_path: pathlib.Path,
    ):
        self.population = population
        self.eggs = eggs
        self.step = step
        self.rng_state = rng_state
        self.random_seed = random_seed
        self.legacy_rng_state = legacy_rng_state
        self.envdrift_map = envdrift_map
        self.predator_population_size = predator_population_size
        self.resource_capacity = resource_capacity
        self.final_config = final_config
        self.custom_config_path = custom_config_path

    @classmethod
    def capture(cls, population, eggs, variables, submodels, parametermanager):
        """Capture current simulation state into a Checkpoint."""
        from aegis_sim.submodels.resources.resources import resources

        return cls(
            population=population,
            eggs=eggs,
            step=variables.steps,
            rng_state=variables.rng.bit_generator.state,
            random_seed=variables.random_seed,
            legacy_rng_state=np.random.get_state(),
            envdrift_map=submodels.architect.envdrift.map,
            predator_population_size=submodels.predation.N,
            resource_capacity=resources.capacity,
            final_config=parametermanager.final_config,
            custom_config_path=variables.custom_config_path,
        )

    def save(self, path: pathlib.Path):
        """Serialize checkpoint to disk using atomic write with backup.

        Keeps the previous checkpoint as ``<path>.bak`` so that a SIGKILL
        during the rename window cannot leave the user with zero valid
        checkpoints.  Only promotes the current checkpoint to backup if
        it can be successfully unpickled; a corrupt file is deleted instead.
        """
        path.parent.mkdir(parents=True, exist_ok=True)
        backup_path = path.with_suffix(".bak")
        # Temp file lives in the target directory so the final rename
        # stays on the same filesystem.
        fd, tmp_path = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
        try:
            with open(fd, "wb") as f:
                pickle.dump(self, f)
            # Only back up the current checkpoint if it's valid
            if path.exists():
                if self._is_valid_checkpoint(path):
                    path.replace(backup_path)
                else:
                    logging.warning(
                        f"Existing checkpoint at {path} is corrupt; "
                        "discarding instead of backing up."
                    )
                    path.unlink()
            pathlib.Path(tmp_path).replace(path)
        except BaseException:
            # Clean up the temp file on any failure, including interrupts.
            pathlib.Path(tmp_path).unlink(missing_ok=True)
            raise
        logging.debug(f"Checkpoint saved at step {self.step} to {path}")

    @staticmethod
    def _is_valid_checkpoint(path: pathlib.Path) -> bool:
        """Return True if the file at *path* can be unpickled."""
        try:
            with open(path, "rb") as f:
                pickle.load(f)
            return True
        except Exception:
            # Any failure to unpickle (UnpicklingError, EOFError, ...) means
            # the file is not usable as a backup.
            return False

    @classmethod
    def load(cls, path: pathlib.Path) -> "Checkpoint":
        """Deserialize checkpoint from disk, falling back to backup if corrupt."""
        backup_path = path.with_suffix(".bak")
        try:
            return cls._load_single(path)
        except (pickle.UnpicklingError, EOFError) as primary_err:
            if backup_path.exists():
                logging.warning(
                    f"Primary checkpoint at {path} is corrupt ({primary_err}); "
                    f"falling back to backup at {backup_path}."
                )
                return cls._load_single(backup_path)
            raise

    @classmethod
    def _load_single(cls, path: pathlib.Path) -> "Checkpoint":
        """Load and validate a single checkpoint file."""
        with open(path, "rb") as f:
            checkpoint = pickle.load(f)
        if not isinstance(checkpoint, cls):
            raise TypeError(f"Expected Checkpoint, got {type(checkpoint).__name__}")
        logging.info(f"Checkpoint loaded from {path} (step {checkpoint.step})")
        return checkpoint

    @classmethod
    def find_latest(cls, odir: pathlib.Path) -> pathlib.Path:
        """Find the checkpoint file in an output directory.

        Returns the primary checkpoint path if it exists. If only the backup
        exists (e.g. after a SIGKILL during save), returns the backup path.

        Args:
            odir: The simulation output directory (e.g. ``temp/test_config``).

        Returns:
            Path to the checkpoint file.

        Raises:
            FileNotFoundError: If no checkpoint file is found.
        """
        checkpoint_path = odir / "checkpoint"
        backup_path = checkpoint_path.with_suffix(".bak")
        if checkpoint_path.exists():
            return checkpoint_path
        if backup_path.exists():
            logging.warning(
                f"No primary checkpoint in {odir}, using backup {backup_path}."
            )
            return backup_path
        raise FileNotFoundError(f"No checkpoint file found in {odir}")
class Checkpoint:

Immutable snapshot of simulation state at a given step.

Attributes:

  • population: The full Population instance at the time of capture.
  • eggs: Unhatched offspring (Population instance), or None.
  • step: Simulation step number when the checkpoint was taken.
  • rng_state: State dict of numpy.random.Generator.bit_generator.
  • random_seed: The original random seed used to initialize the simulation.
  • legacy_rng_state: State tuple from numpy.random.get_state().
  • envdrift_map: Boolean ndarray (the XOR map), or None if envdrift is disabled.
  • predator_population_size: Current predator count (float), for predation dynamics.
  • resource_capacity: Current available resource amount (float).
  • final_config: Fully resolved parameter dict (default + species + config + overrides).
  • custom_config_path: Path to the original .yml config file.

Checkpoint( population: aegis_sim.dataclasses.population.Population, eggs, step: int, rng_state: dict, random_seed: int, legacy_rng_state: tuple, envdrift_map, predator_population_size: float, resource_capacity: float, final_config: dict, custom_config_path: pathlib.Path)
@classmethod
def capture(cls, population, eggs, variables, submodels, parametermanager):

Capture current simulation state into a Checkpoint.

def save(self, path: pathlib.Path):

Serialize checkpoint to disk using atomic write with backup.

Keeps the previous checkpoint as <path>.bak so that a SIGKILL during the rename window cannot leave the user with zero valid checkpoints. Only promotes the current checkpoint to backup if it can be successfully unpickled — a corrupt file is deleted instead.

@classmethod
def load(cls, path: pathlib.Path) -> Checkpoint:

Deserialize checkpoint from disk, falling back to backup if corrupt.

@classmethod
def find_latest(cls, odir: pathlib.Path) -> pathlib.Path:

Find the checkpoint file in an output directory.

Returns the primary checkpoint path if it exists. If only the backup exists (e.g. after a SIGKILL during save), returns the backup path.

Args: odir: The simulation output directory (e.g. temp/test_config).

Returns: Path to the checkpoint file.

Raises: FileNotFoundError: If no checkpoint file is found.