“Show me your tools” session: Slurm queue system tools


1 “Show me your tools” session: Slurm queue system tools
Ole Holm Nielsen, Senior HPC Officer, DTU Fysik
Useful tools for the daily monitoring and administration of a Slurm queue system.

2 Slurm in brief
Slurm is a queue system for Linux clusters.
Slurm Workload Manager:
The Slurm queue system is used worldwide by, among others, universities, national laboratories, and over 60% of the TOP500 supercomputer sites.
Slurm is used at the HPC facilities of Danish universities: SDU, AU, KU, DTU (Niflheim).
All universities in Sweden and Norway.
A list of Slurm tools is available on the download page:
20 September 2018

3 My Wiki page on installation and configuration

4 My Slurm tools
Resources on GitHub:
pestat: Prints a Slurm cluster nodes status with 1 line per node and job info.
slurmreportmonth: Generate monthly accounting statistics from Slurm using the sreport command.
slurmacct: Generate accounting statistics from Slurm as an alternative to the sreport command.
showuserjobs: Print the current node status and batch jobs status broken down into userids.
slurmibtopology: Infiniband topology tool for Slurm.
Slurm triggers scripts.
Scripts for managing nodes.
Scripts for managing jobs.
Scripts for managing Slurm accounts and users.

5 pestat
Prints a Slurm cluster nodes status. Used by sysadmins as well as users.
Usage: pestat [-p partition(s)] [-u username] [-g groupname] [-q qoslist] [-s statelist] [-n/-w hostlist] [-j joblist] [-G] [-N] [-f | -F | -m free_mem | -M free_mem ] [-1|-2] [-d] [-E] [-C|-c] [-V] [-h]
where:
  -p partition: Select only partition <partition>
  -u username: Print only user <username>
  -g groupname: Print only users in UNIX group <groupname>
  -q qoslist: Print only QOS in the qoslist <qoslist>
  -R reservationlist: Print only node reservations <reservationlist>
  -s statelist: Print only nodes with state in <statelist>
  -n/-w hostlist: Print only nodes in hostlist
  -j joblist: Print only nodes in job <joblist>
  -G: Print GRES (Generic Resources) in addition to JobId
  -N: Print JobName in addition to JobId
  -f: Print only nodes that are flagged by * (unexpected load etc.)
  -F: Like -f, but only nodes flagged in RED are printed.
  -m free_mem: Print only nodes with free memory LESS than free_mem MB
  -M free_mem: Print only nodes with free memory GREATER than free_mem MB (under-utilized)
  -d: Omit nodes with states: down drained
  -1: Default: Only 1 line per node (unique nodes in multiple partitions are printed once only)
  -2: 2..N lines per node which participates in multiple partitions
  -E: Job EndTime is printed after each jobid/user

6 “pestat -F” for “flagged” nodes
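
The idea of flagging nodes with unexpected load can be illustrated with a small Python sketch. This is not pestat's actual logic: the rule used here (CPU load differing from the number of allocated CPUs by more than a tolerance), the `tolerance` value, the node names and the sample data are all assumptions made for illustration. The input lines follow the layout produced by `sinfo -N -h -o "%N %T %C %O"` (node name, state, CPUs as allocated/idle/other/total, CPU load).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeLoad:
    name: str
    state: str
    alloc_cpus: int
    total_cpus: int
    cpu_load: float


def parse_sinfo_line(line: str) -> Optional[NodeLoad]:
    """Parse one line shaped like the output of: sinfo -N -h -o "%N %T %C %O"."""
    name, state, cpus, load = line.split()
    alloc, _idle, _other, total = (int(x) for x in cpus.split("/"))
    try:
        cpu_load = float(load)
    except ValueError:
        # Nodes that do not respond may report "N/A"; this sketch skips them.
        return None
    return NodeLoad(name, state, alloc, total, cpu_load)


def is_flagged(node: NodeLoad, tolerance: float = 2.0) -> bool:
    """Hypothetical rule: the load is far from the number of allocated CPUs."""
    return abs(node.cpu_load - node.alloc_cpus) > tolerance


# Invented sample data, not taken from a real cluster.
SAMPLE = """\
a001 allocated 24/0/0/24 23.98
a002 mixed 8/16/0/24 0.03
a003 idle 0/24/0/24 0.01
a004 idle 0/24/0/24 11.70
a005 down 0/0/24/24 N/A
"""

parsed = (parse_sinfo_line(line) for line in SAMPLE.splitlines() if line.strip())
nodes = [node for node in parsed if node is not None]
flagged = [node.name for node in nodes if is_flagged(node)]
print(flagged)
```

In this sample, a002 has 8 allocated CPUs but almost no load, and a004 is idle yet shows a load of about 12, so both would be worth a closer look.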

7 pestat -u <username>

8 pestat: Job name and end time

9 pestat: Jobs that do not fully utilize RAM
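
A rough Python sketch of the same kind of check, in the spirit of `pestat -M free_mem`: list busy nodes whose free memory exceeds a threshold. It is not how pestat is implemented. The threshold, the set of "busy" states, the node names and the sample data are assumptions; the input layout matches `sinfo -N -h -o "%N %T %m %e"` (node name, state, memory size in MB, free memory in MB).

```python
BUSY_STATES = {"allocated", "mixed"}  # assumption: states that imply running jobs


def parse_node_memory(line):
    """Parse one line shaped like the output of: sinfo -N -h -o "%N %T %m %e"."""
    name, state, total_mb, free_mb = line.split()
    if not free_mb.isdigit():
        # Free memory may be reported as "N/A" for unreachable nodes; skip those.
        return None
    return {
        "name": name,
        "state": state,
        "total_mb": int(total_mb),
        "free_mb": int(free_mb),
    }


def underutilized(nodes, free_mem_mb):
    """Return names of busy nodes with MORE than free_mem_mb MB of free memory."""
    return [
        n["name"]
        for n in nodes
        if n["state"] in BUSY_STATES and n["free_mb"] > free_mem_mb
    ]


# Invented sample data, not taken from a real cluster.
SAMPLE = """\
a001 allocated 256000 201334
a002 mixed 256000 12040
a003 idle 256000 250112
a004 down 256000 N/A
"""

parsed = (parse_node_memory(line) for line in SAMPLE.splitlines() if line.strip())
nodes = [n for n in parsed if n is not None]
print(underutilized(nodes, free_mem_mb=100000))
```

Idle nodes are left out on purpose: a large amount of free memory is only interesting when jobs are actually running on the node.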

10 showuserjobs: Overview of user jobs
Used by sysadmins as well as users.

11 showuserjobs -G

12 showuserjobs -p <partition>

13 Tools for monitoring batch jobs
psjob: Do a ps (process status) on a job's node-list, but exclude system processes: psjob <jobid>
showjob: Show status of Slurm job(s).
sbadjob: Print a warning about bad jobs hanging indefinitely in the queue.
warn_maxjobs: Issue warnings about the number of Slurm jobs approaching MaxJobCount.
schedjobs: Stop or start job scheduling in ALL Slurm partitions.
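
The kind of check that warn_maxjobs performs can be sketched in Python as follows. This is an illustration under stated assumptions, not the script's actual code: the 80% warning fraction, the function names and the sample configuration text are invented. The `MaxJobCount` value is read from text shaped like the output of `scontrol show config`; the current job count would have to come from elsewhere, for example by counting lines of `squeue -h -o %i`.

```python
import re


def parse_max_job_count(config_text):
    """Extract MaxJobCount from text shaped like `scontrol show config` output."""
    match = re.search(r"^MaxJobCount\s*=\s*(\d+)", config_text, re.MULTILINE)
    if match is None:
        raise ValueError("MaxJobCount not found in configuration text")
    return int(match.group(1))


def maxjobs_warning(njobs, max_job_count, fraction=0.8):
    """Return a warning string when njobs reaches fraction * max_job_count.

    The default fraction of 0.8 is an assumption chosen for this sketch.
    """
    if njobs >= fraction * max_job_count:
        return (
            f"WARNING: {njobs} jobs in the queue, "
            f"approaching MaxJobCount={max_job_count}"
        )
    return None


# Invented excerpt in the layout of `scontrol show config`.
SAMPLE_CONFIG = """\
MaxArraySize            = 1001
MaxJobCount             = 10000
MaxStepCount            = 40000
"""

limit = parse_max_job_count(SAMPLE_CONFIG)
print(maxjobs_warning(8500, limit))
print(maxjobs_warning(1200, limit))
```

Run from cron, a non-empty return value could be mailed to the administrators before the limit is reached and new submissions start to fail.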

14 psjob example

15 Health checks for job scheduling
Cron job on the admin node that checks jobs once an hour:
The emergency brake: stop jobs from being started, e.g. in case of power and cooling problems.
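
A minimal Python sketch of such an emergency brake, assuming it is done by setting every partition to state DOWN with `scontrol update PartitionName=... State=DOWN` (in state DOWN no new jobs are started, while running jobs continue). This is not necessarily how the schedjobs script works. The partition names are invented, and the sketch only builds and prints the commands instead of executing them.

```python
import shlex


def brake_commands(partitions, stop=True):
    """Build one scontrol command per partition.

    stop=True sets State=DOWN (no new jobs are started);
    stop=False sets State=UP (scheduling resumes).
    """
    state = "DOWN" if stop else "UP"
    return [
        ["scontrol", "update", f"PartitionName={name}", f"State={state}"]
        for name in partitions
    ]


# Invented partition names; on a real cluster they could be listed
# with `sinfo -h -o %R`.
PARTITIONS = ["xeon8", "xeon16", "xeon24"]

# Dry run: print the commands. To execute them, pass each list to
# subprocess.run(cmd, check=True) on a host with Slurm admin rights.
for cmd in brake_commands(PARTITIONS, stop=True):
    print(shlex.join(cmd))
```

Calling `brake_commands(PARTITIONS, stop=False)` yields the matching commands for releasing the brake once the power or cooling problem is resolved.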

