Pauli documentation and maintenance info

Package installation

When configuring packages, ensure that the compilers are either the system compilers or that the resulting module file declares an explicit dependence on the compiler it was built with. Start with

module list
module unload X

and then load the modules that the build will depend on. For some packages we have prebuilt compilation / installation scripts. Have a look at

/root/builds/build-all/script_gcc/

but most are built by hand and installed into directories under

/opt/ohpc/pub/

After installation, a module file needs to be created. Our module files are in

/opt/ohpc/pub/modulefiles/
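
Before running configure, it can help to confirm which compilers the build will actually pick up. The following is only a sketch: it assumes that system compilers live under /usr/bin or /bin and that anything else comes from a loaded module. The function names and the list of tools are illustrative, not part of our build scripts.

```shell
# Return success if the given path looks like a system location.
# Assumption: system compilers are installed under /usr/bin or /bin.
is_system_path() {
    case "$1" in
        /usr/bin/*|/bin/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report where each compiler resolves from in the current environment.
check_compilers() (
    for tool in gcc g++ gfortran; do
        tool_path=$(command -v "$tool") || { echo "$tool: not found"; continue; }
        if is_system_path "$tool_path"; then
            echo "$tool: $tool_path (system)"
        else
            echo "$tool: $tool_path (NOT system: the module file needs an explicit dependence)"
        fi
    done
)
```

Run check_compilers after loading the modules the build depends on.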

Security updates

Run the security updates as follows:

yum updateinfo list security
yum update --security

Backup information

We use Bacula for backups. The configuration file is in

/etc/bacula/bacula-dir.conf

The tape pools are defined in

/etc/bacula/pools.conf

Some additional info is here: https://green-phys.org/pauli/

Running bconsole starts the Bacula console for the tape backups.

The most useful commands are:

 list media
 messages
 status all
 status schedule
 mount / umount
 label barcodes pool=TapesWeakly
 enable schedule=WeeklyCycleInc

Recent command history is stored in .bconsole_history.
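
bconsole also reads commands from standard input, so a routine check can be run without an interactive session. This is a minimal sketch: the function name and the BCONSOLE override variable are hypothetical and exist only so the command list can be inspected without contacting the director.

```shell
# Feed a fixed list of console commands to bconsole on stdin.
# BCONSOLE is a hypothetical override, e.g. BCONSOLE=cat for a dry run.
bacula_report() {
    printf '%s\n' 'messages' 'status schedule' 'list media' 'quit' \
        | "${BCONSOLE:-bconsole}"
}
```

With BCONSOLE=cat bacula_report the commands are printed instead of executed.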

Node reboot and IPMI

Pauli has an IPMI network for the nodes. That’s the 192.168.1.1XX network: node 19, for instance, is at 192.168.1.119. IPMI on these nodes can only be reached from the master. IPMI requires a password, which is stored in /root/sec_data/ along with other secure data.
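
The address scheme can be captured in a small helper. This is a sketch under assumptions: it handles node numbers 1 to 99 only, it assumes the BMCs accept the lanplus interface, and IPMI_USER and IPMI_PASSFILE are hypothetical variables standing in for the credentials kept in /root/sec_data/.

```shell
# Map a node number to its IPMI address (node 19 -> 192.168.1.119).
ipmi_addr() (
    case "$1" in
        [1-9]|[1-9][0-9]) echo "192.168.1.$((100 + $1))" ;;
        *) echo "usage: ipmi_addr NODE_NUMBER (1-99)" >&2; exit 1 ;;
    esac
)

# Run a chassis power action (status, cycle, on, off) against a node.
# Assumption: lanplus works on these BMCs; the user and password file are placeholders.
ipmi_power() (
    addr=$(ipmi_addr "$1") || exit 1
    ipmitool -I lanplus -H "$addr" \
        -U "${IPMI_USER:?set IPMI_USER}" \
        -f "${IPMI_PASSFILE:?set IPMI_PASSFILE}" \
        chassis power "${2:-status}"
)
```

From the master, ipmi_power 19 status queries node 19 and ipmi_power 19 cycle power-cycles it.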

Checking kernel and bootloader

Sometimes, after security updates, the bootloader misbehaves and defaults to an unintended kernel. Use grubby to inspect the bootloader configuration and set the right default kernel:

  ls -l /boot/vmlinuz-*
  grubby --default-kernel
  grubby --default-index
  grubby --set-default=/boot/vmlinuz-5.4.234-1.el8.elrepo.x86_64
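
A quick way to spot a changed default after an update is to compare the bootloader's default kernel with the running one. This is a sketch with illustrative function names; it assumes kernel images are named vmlinuz-<version>, as in the listing above.

```shell
# Extract the kernel version from an image path such as /boot/vmlinuz-<version>.
kernel_from_path() (
    base=${1##*/}
    echo "${base#vmlinuz-}"
)

# Compare the default boot kernel with the kernel that is currently running.
check_default_kernel() (
    default=$(kernel_from_path "$(grubby --default-kernel)")
    running=$(uname -r)
    if [ "$default" = "$running" ]; then
        echo "default kernel matches running kernel: $running"
    else
        echo "running $running, but the next boot will use $default"
    fi
)
```

A mismatch is expected right after a kernel update; it only needs fixing if the default is not the kernel we want.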

Unsticking a node stuck in drain state

If a node is stuck in drain state, but has otherwise recovered, it can be ‘unstuck’ with scontrol:

scontrol update NodeName=pauli18 State=DOWN Reason="undraining"
scontrol update NodeName=pauli18 State=RESUME
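
When several nodes are stuck, the same two commands can be generated per node and reviewed before they are run. This is a sketch: the function name is illustrative and it only prints the commands. sinfo -R lists the nodes that are down or drained together with the recorded reason.

```shell
# Print the undrain command pair for each node name given as an argument.
undrain_cmds() (
    for node in "$@"; do
        printf 'scontrol update NodeName=%s State=DOWN Reason="undraining"\n' "$node"
        printf 'scontrol update NodeName=%s State=RESUME\n' "$node"
    done
)
```

Inspect the output of undrain_cmds pauli18 first, then pipe it to sh once it looks right.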

Changing the partition wall time

If someone needs more time on a partition, update the partition properties, such as the wall-time limit, with scontrol:

scontrol update partition=ludicrous MaxTime=4-00:00:00
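
MaxTime above uses the days-hours:minutes:seconds form, so 4-00:00:00 is four days. A small helper can build that string from a number of hours; this is a sketch and the function name is illustrative.

```shell
# Convert a whole number of hours to the D-HH:MM:SS form used by MaxTime.
slurm_time() (
    hours=$1
    printf '%d-%02d:00:00\n' "$((hours / 24))" "$((hours % 24))"
)
```

For example, scontrol update partition=ludicrous MaxTime="$(slurm_time 96)" sets four days, and scontrol show partition ludicrous displays the resulting limit.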

IOMMU and the GPU node

The GPU direct memory access needed for direct file access from the GPU appears to be incompatible with some of the IOMMU settings. The solution is to disable IOMMU in the BIOS of the machine. The master has IOMMU disabled, but the GPU node has it enabled by default.

Updating and packaging the node image

Our node images are managed by Warewulf. We are currently running Rocky Linux release 8.7 (Green Obsidian), and the original images are located in /opt/ohpc/admin/images/. The Warewulf version we run is ancient (the exact version is unclear), but some configuration is in /etc/warewulf/.