"Show me your tools" session: Slurm queue system tools
Ole Holm Nielsen, Senior HPC Officer, DTU Fysik
Ole.H.Nielsen@fysik.dtu.dk
Useful tools for the daily monitoring and administration of a Slurm queue system.
Slurm in brief
Slurm is a queue system (workload manager) for Linux clusters.
Slurm Workload Manager: https://slurm.schedmd.com/overview.html
The Slurm queue system is used worldwide by, among others:
  Universities, national laboratories, and over 60% of TOP-500 supercomputer sites.
  The HPC facilities of Danish universities: SDU, AU, KU, DTU (Niflheim).
  All universities in Sweden and Norway.
A list of Slurm tools is available on the download page: https://slurm.schedmd.com/download.html
20 September 2018
My Wiki page on installation and configuration
https://wiki.fysik.dtu.dk/niflheim/SLURM
My Slurm tools
Resources on GitHub: https://github.com/OleHolmNielsen/Slurm_tools
  pestat: Prints a Slurm cluster nodes status with 1 line per node and job info.
  slurmreportmonth: Generates monthly accounting statistics from Slurm using the sreport command.
  slurmacct: Generates accounting statistics from Slurm as an alternative to the sreport command.
  showuserjobs: Prints the current node status and batch job status broken down by userid.
  slurmibtopology: Infiniband topology tool for Slurm.
  Slurm trigger scripts.
  Scripts for managing nodes.
  Scripts for managing jobs.
  Scripts for managing Slurm accounts and users.
pestat
Prints a Slurm cluster nodes status. Used by sysadmins as well as users.
Usage: pestat [-p partition(s)] [-u username] [-g groupname] [-q qoslist] [-s statelist] [-n/-w hostlist] [-j joblist] [-G] [-N] [-f | -F | -m free_mem | -M free_mem ] [-1|-2] [-d] [-E] [-C|-c] [-V] [-h]
where:
  -p partition: Select only partition <partition>
  -u username: Print only user <username>
  -g groupname: Print only users in UNIX group <groupname>
  -q qoslist: Print only QOS in the qoslist <qoslist>
  -R reservationlist: Print only node reservations <reservationlist>
  -s statelist: Print only nodes with state in <statelist>
  -n/-w hostlist: Print only nodes in hostlist
  -j joblist: Print only nodes in job <joblist>
  -G: Print GRES (Generic Resources) in addition to JobId
  -N: Print JobName in addition to JobId
  -f: Print only nodes that are flagged by * (unexpected load etc.)
  -F: Like -f, but only nodes flagged in RED are printed.
  -m free_mem: Print only nodes with free memory LESS than free_mem MB
  -M free_mem: Print only nodes with free memory GREATER than free_mem MB (under-utilized)
  -d: Omit nodes with states: down drained
  -1: Default: Only 1 line per node (unique nodes in multiple partitions are printed once only)
  -2: 2..N lines per node which participates in multiple partitions
  -E: Job EndTime is printed after each jobid/user
"pestat -F" for "flagged" nodes
pestat -u <username>
pestat: Job name and end time
pestat: Jobs that do not fully utilize RAM
showuserjobs: Overview of user jobs
Used by sysadmins as well as users.
showuserjobs -G
showuserjobs -p <partition>
Tools for monitoring batch jobs
  psjob: Do a ps (process status) on a job's node-list, but exclude system processes: psjob <jobid>
  showjob: Show status of Slurm job(s).
  sbadjob: Print a warning about bad jobs hanging indefinitely in the queue.
  warn_maxjobs: Issue warnings about the number of Slurm jobs approaching MaxJobCount.
  schedjobs: Stop or start job scheduling in ALL Slurm partitions.
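The kind of check that warn_maxjobs performs can be sketched in Python. This is not the script from the Slurm_tools repository, only a minimal illustration: the function names, the 80% threshold, and the sample text are assumptions. MaxJobCount is a real slurm.conf parameter that appears in the output of "scontrol show config"; the current job count would come from the queue, for example by counting the lines of "squeue -h".

```python
import re


def parse_max_job_count(config_text):
    """Extract MaxJobCount from text shaped like 'scontrol show config' output."""
    match = re.search(r"^MaxJobCount\s*=\s*(\d+)", config_text, re.MULTILINE)
    if match is None:
        raise ValueError("MaxJobCount not found in configuration text")
    return int(match.group(1))


def maxjobs_warning(num_jobs, max_job_count, threshold=0.8):
    """Return a warning string when num_jobs reaches threshold * max_job_count.

    The default threshold of 0.8 is an assumption chosen for illustration.
    Returns None when the job count is below the threshold.
    """
    if num_jobs >= threshold * max_job_count:
        percent = 100.0 * num_jobs / max_job_count
        return (f"WARNING: {num_jobs} jobs is {percent:.0f}% "
                f"of MaxJobCount={max_job_count}")
    return None


# Made-up sample resembling a fragment of 'scontrol show config' output.
sample_config = (
    "MaxArraySize            = 1001\n"
    "MaxJobCount             = 10000\n"
)

if __name__ == "__main__":
    limit = parse_max_job_count(sample_config)
    print(maxjobs_warning(9000, limit))
```

Keeping the parsing and the threshold test in separate functions means the logic can be exercised without a running Slurm installation.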
psjob example
Health checks for job scheduling
Cron job on the admin node that checks jobs once an hour.
The emergency brake: Stop jobs from being started, e.g. in case of power and cooling problems.
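One way to implement such an emergency brake is to set every partition to State=DOWN, which lets running jobs continue and new jobs queue, but prevents queued jobs from starting. The Python sketch below only builds the "scontrol update" commands and does not execute them. It is not necessarily how the schedjobs script works; the function names and the partition names in the example are hypothetical. The partition list is assumed to come from "sinfo -h -o %R".

```python
def parse_partitions(sinfo_text):
    """Return unique partition names from text shaped like 'sinfo -h -o %R' output."""
    names = []
    for line in sinfo_text.splitlines():
        name = line.strip()
        if name and name not in names:
            names.append(name)
    return names


def partition_state_commands(partitions, state):
    """Build one 'scontrol update' argument list per partition.

    state must be "DOWN" (stop starting jobs) or "UP" (resume scheduling).
    The commands are returned, not executed, so this can be used as a dry run.
    """
    if state not in ("UP", "DOWN"):
        raise ValueError("state must be 'UP' or 'DOWN'")
    return [
        ["scontrol", "update", f"PartitionName={name}", f"State={state}"]
        for name in partitions
    ]


if __name__ == "__main__":
    # Hypothetical partition names, as 'sinfo -h -o %R' might print them.
    sample = "xeon8\nxeon16\nxeon8\n"
    for cmd in partition_state_commands(parse_partitions(sample), "DOWN"):
        print(" ".join(cmd))
```

On a real cluster each argument list could be passed to subprocess.run by a user with Slurm administrator rights. Calling the same function with "UP" produces the commands that release the brake.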