aegis_sim.checkpoint
====================

Checkpoint for resumable simulations.

A Checkpoint captures the full simulation state so it can be restored later,
allowing a simulation to be resumed from exactly where it left off.
How checkpointing works
-----------------------
Checkpoints are saved periodically during a simulation, controlled by the
CHECKPOINT_RATE parameter (in steps). When CHECKPOINT_RATE is 0 (the default),
no checkpoints are saved. The checkpoint is written to ``<output_dir>/checkpoint``
and overwritten each time, so only the latest state is kept on disk.

An initial checkpoint is written before the simulation loop starts, so there
is always something to resume from even if the program crashes on the first
step. Subsequent checkpoints are written at the end of each step where
``step % CHECKPOINT_RATE == 0``, overwriting the initial one.
The save is atomic (write to a temp file, then rename) so a crash mid-write cannot corrupt the checkpoint.
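The scheduling rule above can be sketched as follows. This is an illustrative sketch, not the actual AEGIS driver loop; ``should_checkpoint`` is a hypothetical helper name:

```python
CHECKPOINT_RATE = 100  # in steps; 0 (the default) disables checkpointing


def should_checkpoint(step: int, rate: int = CHECKPOINT_RATE) -> bool:
    """Return True when a checkpoint should be written at the end of this step."""
    # rate == 0 means checkpointing is off; otherwise save every `rate` steps.
    return rate > 0 and step % rate == 0
```

The initial pre-loop checkpoint is written unconditionally, so this predicate only governs the per-step saves.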
Each checkpoint captures everything needed to reconstruct the simulation:

- **population**: the full Population instance (genomes, ages, phenotypes, etc.)
- **eggs**: unhatched offspring (Population instance, or None), so in-progress
  incubation is preserved when INCUBATION_PERIOD > 0
- **step**: the simulation step at which the checkpoint was taken
- **rng_state**: the numpy Generator (``variables.rng``) bit-generator state,
  so the resumed simulation produces the same random sequence it would have
  if it had never been interrupted
- **legacy_rng_state**: the legacy ``np.random`` global state (used by some
  submodels like envdrift)
- **random_seed**: the original seed, kept for bookkeeping
- **envdrift_map**: the environmental drift XOR map (or None if envdrift is off)
- **predator_population_size**: the current predator count, so predation
  dynamics continue correctly when PREDATION_RATE > 0
- **resource_capacity**: the current available resource amount, so resource
  dynamics continue correctly
- **final_config**: the fully resolved parameter dict, so the resumed run uses
  the exact same parameters without needing the original config file
- **custom_config_path**: path to the original config file, used to locate the
  output directory on resume
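The ``rng_state`` round trip is the part worth demonstrating: numpy's ``Generator`` exposes its bit-generator state as a plain dict that can be captured and reinstalled. A minimal sketch of the restore side (the variable names here are illustrative, not AEGIS internals):

```python
import numpy as np

rng = np.random.default_rng(42)
saved_state = rng.bit_generator.state   # what capture() stores in rng_state
expected = rng.random()                 # the next draw after the save point

# On resume: create a fresh Generator and reinstall the saved state.
restored = np.random.default_rng()
restored.bit_generator.state = saved_state

# The resumed stream is identical to the uninterrupted one.
assert restored.random() == expected
```

The legacy global state works analogously via ``np.random.get_state()`` / ``np.random.set_state()``.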
CLI usage
---------
``-c`` is always required. ``-r``, ``-o``, and ``-p`` are mutually exclusive::

    aegis sim -c config.yml                   # fresh run
    aegis sim -c config.yml -o                # fresh run, overwrite existing output
    aegis sim -c config.yml -p pickle         # seed run (new sim from saved population)
    aegis sim -c config.yml -r                # resume from checkpoint
    aegis sim -c config.yml -r --extend 1500  # resume and extend to 1500 steps

When ``-r`` is used and no output directory exists yet, AEGIS falls back to a
fresh run. If the output directory exists but contains no checkpoint, an error
is raised.
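The fallback rules above can be sketched as a small decision function. This is an assumed shape, not the actual CLI code; ``resolve_run_mode`` is a hypothetical name, and the on-disk filenames mirror ``Checkpoint.find_latest``:

```python
import pathlib


def resolve_run_mode(odir: pathlib.Path, resume: bool) -> str:
    """Decide between a fresh run and a resume, per the -r fallback rules."""
    if not resume or not odir.exists():
        return "fresh"  # no -r, or -r with no output directory yet
    # Output directory exists: a checkpoint (or its backup) must be present.
    if not (odir / "checkpoint").exists() and not (odir / "checkpoint.bak").exists():
        raise FileNotFoundError(f"No checkpoint file found in {odir}")
    return "resume"
```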
Output truncation on resume
---------------------------
Between checkpoints, recorders append data to output files. If the sim crashes
at step 783 but the last checkpoint was at step 700, the output files contain
data for steps 700–783 that will be re-recorded on resume. To prevent
duplicates, ``truncate_for_resume(checkpoint_step)`` is called during resume
initialization. It:

- Truncates per-step files (popsize, resources, eggs) to ``step - 1`` lines
- Truncates rate-based files (progress, spectra, genotypes, phenotypes, popgen)
  to the number of recordings that occurred before the checkpoint step, plus
  any header lines
- Deletes TE files whose collection window started at or after the checkpoint
- Handles envdriftmap separately (uses ``step % rate == 0`` without the
  step-1 special case)

Snapshot files (feather format) are named by step number, so re-recording
simply overwrites them. One-time files (config, summary) are also overwritten.
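The per-step case reduces to keeping the first ``step - 1`` lines of a file. A minimal sketch of that operation (``truncate_lines`` is an assumed helper name, not the actual recorder API):

```python
import pathlib


def truncate_lines(path: pathlib.Path, keep: int) -> None:
    """Rewrite *path*, keeping only its first *keep* lines."""
    lines = path.read_text().splitlines(keepends=True)
    path.write_text("".join(lines[:keep]))
```

For a popsize file with one line per step and a checkpoint at step 700, this would be called with ``keep=699``, so steps 700 onward are re-recorded exactly once.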
Extending a simulation
----------------------
``--extend N`` overrides ``STEPS_PER_SIMULATION`` to N after loading the
checkpoint, allowing a completed simulation to continue beyond its original
target. N must be greater than the checkpoint step.
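The override and its validity check can be sketched as follows. ``apply_extend`` is a hypothetical name; the ``STEPS_PER_SIMULATION`` key and the greater-than constraint come from the text above:

```python
def apply_extend(final_config: dict, checkpoint_step: int, extend: int) -> dict:
    """Override the step target after loading a checkpoint."""
    if extend <= checkpoint_step:
        raise ValueError(
            f"--extend {extend} must be greater than the checkpoint step {checkpoint_step}"
        )
    final_config["STEPS_PER_SIMULATION"] = extend
    return final_config
```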
Checkpointing vs seeding
------------------------
Seeding (``-p``) starts a *new* simulation using a saved Population as the
initial population. It resets the step counter to 1, creates fresh output,
and does not restore RNG state or configuration.

Checkpointing (``-r``) *resumes* an existing simulation from the exact point
it was saved. The step counter, RNG state, configuration, and output directory
are all preserved.
import pickle
import logging
import pathlib
import tempfile

import numpy as np

from aegis_sim.dataclasses.population import Population


class Checkpoint:
    """Immutable snapshot of simulation state at a given step.

    Attributes:
        population: The full Population instance at the time of capture.
        eggs: Unhatched offspring (Population instance), or None.
        step: Simulation step number when the checkpoint was taken.
        rng_state: State dict of ``numpy.random.Generator.bit_generator``.
        random_seed: The original random seed used to initialize the simulation.
        legacy_rng_state: State tuple from ``numpy.random.get_state()``.
        envdrift_map: Boolean ndarray (the XOR map), or None if envdrift is disabled.
        predator_population_size: Current predator count (float), for predation dynamics.
        resource_capacity: Current available resource amount (float).
        final_config: Fully resolved parameter dict (default + species + config + overrides).
        custom_config_path: Path to the original ``.yml`` config file.
    """

    def __init__(
        self,
        population: Population,
        eggs,
        step: int,
        rng_state: dict,
        random_seed: int,
        legacy_rng_state: dict,
        envdrift_map,
        predator_population_size: float,
        resource_capacity: float,
        final_config: dict,
        custom_config_path: pathlib.Path,
    ):
        self.population = population
        self.eggs = eggs
        self.step = step
        self.rng_state = rng_state
        self.random_seed = random_seed
        self.legacy_rng_state = legacy_rng_state
        self.envdrift_map = envdrift_map
        self.predator_population_size = predator_population_size
        self.resource_capacity = resource_capacity
        self.final_config = final_config
        self.custom_config_path = custom_config_path

    @classmethod
    def capture(cls, population, eggs, variables, submodels, parametermanager):
        """Capture current simulation state into a Checkpoint."""
        from aegis_sim.submodels.resources.resources import resources

        return cls(
            population=population,
            eggs=eggs,
            step=variables.steps,
            rng_state=variables.rng.bit_generator.state,
            random_seed=variables.random_seed,
            legacy_rng_state=np.random.get_state(),
            envdrift_map=submodels.architect.envdrift.map,
            predator_population_size=submodels.predation.N,
            resource_capacity=resources.capacity,
            final_config=parametermanager.final_config,
            custom_config_path=variables.custom_config_path,
        )

    def save(self, path: pathlib.Path):
        """Serialize checkpoint to disk using atomic write with backup.

        Keeps the previous checkpoint as ``<path>.bak`` so that a SIGKILL
        during the rename window cannot leave the user with zero valid
        checkpoints. Only promotes the current checkpoint to backup if
        it can be successfully unpickled — a corrupt file is deleted instead.
        """
        path.parent.mkdir(parents=True, exist_ok=True)
        backup_path = path.with_suffix(".bak")
        fd, tmp_path = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
        try:
            with open(fd, "wb") as f:
                pickle.dump(self, f)
            # Only back up the current checkpoint if it's valid
            if path.exists():
                if self._is_valid_checkpoint(path):
                    path.replace(backup_path)
                else:
                    logging.warning(f"Existing checkpoint at {path} is corrupt; discarding instead of backing up.")
                    path.unlink()
            pathlib.Path(tmp_path).replace(path)
        except BaseException:
            pathlib.Path(tmp_path).unlink(missing_ok=True)
            raise
        logging.debug(f"Checkpoint saved at step {self.step} to {path}")

    @staticmethod
    def _is_valid_checkpoint(path: pathlib.Path) -> bool:
        """Return True if the file at *path* can be unpickled."""
        try:
            with open(path, "rb") as f:
                pickle.load(f)
            return True
        except Exception:
            return False

    @classmethod
    def load(cls, path: pathlib.Path) -> "Checkpoint":
        """Deserialize checkpoint from disk, falling back to backup if corrupt."""
        backup_path = path.with_suffix(".bak")
        try:
            return cls._load_single(path)
        except (pickle.UnpicklingError, EOFError) as primary_err:
            if backup_path.exists():
                logging.warning(
                    f"Primary checkpoint at {path} is corrupt ({primary_err}); "
                    f"falling back to backup at {backup_path}."
                )
                return cls._load_single(backup_path)
            raise

    @classmethod
    def _load_single(cls, path: pathlib.Path) -> "Checkpoint":
        """Load and validate a single checkpoint file."""
        with open(path, "rb") as f:
            checkpoint = pickle.load(f)
        if not isinstance(checkpoint, cls):
            raise TypeError(f"Expected Checkpoint, got {type(checkpoint).__name__}")
        logging.info(f"Checkpoint loaded from {path} (step {checkpoint.step})")
        return checkpoint

    @classmethod
    def find_latest(cls, odir: pathlib.Path) -> pathlib.Path:
        """Find the checkpoint file in an output directory.

        Returns the primary checkpoint path if it exists. If only the backup
        exists (e.g. after a SIGKILL during save), returns the backup path.

        Args:
            odir: The simulation output directory (e.g. ``temp/test_config``).

        Returns:
            Path to the checkpoint file.

        Raises:
            FileNotFoundError: If no checkpoint file is found.
        """
        checkpoint_path = odir / "checkpoint"
        backup_path = checkpoint_path.with_suffix(".bak")
        if checkpoint_path.exists():
            return checkpoint_path
        if backup_path.exists():
            logging.warning(
                f"No primary checkpoint in {odir}, using backup {backup_path}."
            )
            return backup_path
        raise FileNotFoundError(f"No checkpoint file found in {odir}")