Enabling cgroups v2

The JAliEn job-pilot can use cgroups v2 to better box in each job, preventing misbehaving jobs from overusing resources and interfering with other payloads. Support for this feature depends on the OS/distribution and the LRMS, but generally requires:

  • EL9 and
  • HTCondor 23.1+ or
  • Slurm 22.05+*

*Slurm requires a workaround that involves whole-node scheduling and enabling lingering on the WNs. Lingering can be enabled by running touch /var/lib/systemd/linger/$USER, where $USER is replaced with the user associated with ALICE, e.g. aliprod.
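The lingering step can be sketched as a small shell helper. This is a minimal sketch, assuming it runs as root on each WN and that the ALICE account is aliprod. The function names enable_linger and linger_enabled, and their optional directory argument, are hypothetical and exist only so the step can be tried against a scratch directory first. The loginctl enable-linger command in the final comment is systemd's own tool for the same purpose, mentioned as an alternative and not as something this guide prescribes.

```shell
#!/bin/sh
# Hypothetical helper: create the linger marker file for a user.
# The optional second argument overrides the marker directory
# (default: the systemd path named in the text above).
enable_linger() {
  user="$1"
  dir="${2:-/var/lib/systemd/linger}"
  touch "$dir/$user"
}

# Hypothetical helper: succeed if the marker file for the user exists.
linger_enabled() {
  [ -e "${2:-/var/lib/systemd/linger}/$1" ]
}

# Example (as root on a WN, assuming the account is aliprod):
#   enable_linger aliprod && linger_enabled aliprod && echo "linger enabled"
# Alternative using systemd's own tool:
#   loginctl enable-linger aliprod
```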

NOTE: EL9 will delegate resource controllers for memory and pids by default, but not for cpu, cpuset and io. In order for JAliEn to access these, the following must be added to the file /etc/systemd/system/user@.service.d/delegate.conf:

[Service]
Delegate=cpu cpuset io memory pids

Reboot the node afterwards so that the change takes effect.
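Once the node is back up, it can be useful to confirm that the controllers were actually delegated. The following is a minimal sketch, not part of JAliEn: the helper name missing_controllers is hypothetical, and the cgroup path in the usage comment assumes the usual systemd layout for user sessions and the aliprod account. It prints each controller from the Delegate= line that is absent from a given cgroup.controllers file.

```shell
#!/bin/sh
# Hypothetical helper: print every controller from the Delegate= line above
# that is missing from the given cgroup.controllers file (one per line).
missing_controllers() {
  file="$1"
  # cgroup.controllers holds a single space-separated line.
  present=" $(cat "$file") "
  for c in cpu cpuset io memory pids; do
    # Match on surrounding spaces so "cpu" is not found inside "cpuset".
    case "$present" in
      *" $c "*) ;;
      *) echo "$c" ;;
    esac
  done
}

# Example (after the reboot; path layout and account are assumptions):
#   stat -fc %T /sys/fs/cgroup    # "cgroup2fs" indicates cgroups v2
#   uid=$(id -u aliprod)
#   missing_controllers \
#     "/sys/fs/cgroup/user.slice/user-$uid.slice/user@$uid.service/cgroup.controllers"
```

No output means all five controllers are available to the user's systemd instance; any names printed are still missing.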