Installation Planning
SLURM (Simple Linux Utility for Resource Management) is an open-source, high-performance, scalable cluster management and job scheduling system, widely used on large compute clusters and supercomputers. It manages a cluster's compute resources (CPU, memory, GPU, and so on) and schedules jobs according to user requirements, improving overall cluster utilization.
- Master (control) node:
  - 172.16.45.29 (Donau)
- Compute nodes:
  - 172.16.45.2 (920)
  - 172.16.45.4 (920)
This walkthrough uses CentOS 8 as the example; other RPM-based Linux distributions are similar.
Creating Accounts
```bash
#! Remove any existing database packages
yum remove mariadb-server mariadb-devel -y
#! Remove any existing Slurm and Munge packages
yum remove slurm munge munge-libs munge-devel -y
#! Remove old users
userdel -r slurm
userdel -r munge
#! Create users
export MUNGEUSER=1051
groupadd -g $MUNGEUSER munge
useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
export SLURMUSER=1052
groupadd -g $SLURMUSER slurm
useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
#! Passwordless SSH login # run on the control node; see https://builtin.com/articles/ssh-without-password
ssh-keygen
#! Copy the key to the compute nodes
ssh-copy-id 172.16.45.2
ssh-copy-id 172.16.45.4
```
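Munge will later require the munge UID/GID (and Slurm the slurm user) to be identical on every node, so it is worth verifying the accounts line up before going further. A minimal check, assuming the passwordless SSH set up above:

```bash
# Local UIDs/GIDs on the control node
id munge; id slurm
# Compare against each compute node
for host in 172.16.45.2 172.16.45.4; do
    echo "== $host =="
    ssh "$host" 'id munge; id slurm'
done
```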
Munge
Munge is an authentication service for creating and validating user credentials, used primarily in large high-performance computing (HPC) clusters. It is designed to be highly scalable and to provide secure, reliable authentication in complex cluster environments.
https://github.com/dun/munge
What Munge Does
Munge allows a process to authenticate another local or remote process within a group of hosts that share common users (UIDs) and groups (GIDs). These hosts form a security realm that shares a secret cryptographic key.
Munge manages trust between hosts by defining security realms. Hosts within the same realm can trust one another, while hosts in different realms require additional authentication.
Munge simplifies identity management in HPC clusters: with Munge, administrators can avoid configuring complex SSH keys or Kerberos on every node.
How Munge Works
Munge implements authentication by generating and validating credentials. When one process needs to access another, it asks the Munge daemon for a credential. Munge validates the requester's identity and generates a credential containing the requester's UID, GID, and other metadata. The process being accessed then validates this credential to confirm the requester's identity.
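Once Munge is installed (installation follows below), this flow is easy to observe from the shell: munge wraps an optional payload in a credential, and unmunge validates it and prints the embedded identity. A small illustrative sketch:

```bash
# Encode an arbitrary payload into a credential, then validate it.
# unmunge prints STATUS, the encoding host, UID/GID, TTL, and the payload.
echo "hello from $(hostname)" | munge | unmunge
```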
Advantages of Munge
- High performance: Munge is designed to handle a large volume of authentication requests.
- Scalability: Munge scales easily to large clusters.
- Security: Munge provides several security mechanisms to prevent unauthorized access.
- Ease of use: Munge is relatively simple to configure and manage.
Installation
```bash
#! all nodes
yum install epel-release -y
yum install munge munge-libs munge-devel -y
```
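A quick way to confirm the packages are present on every node (assumes the passwordless SSH configured earlier):

```bash
for host in 172.16.45.2 172.16.45.4; do
    echo "== $host =="
    ssh "$host" 'rpm -q munge munge-libs munge-devel'
done
```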
Generate the secret key on the management node:

```bash
yum install rng-tools -y
rngd -r /dev/urandom
/usr/sbin/create-munge-key -r
# Overwrite the generated key with 1024 bytes of fresh randomness
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
scp /etc/munge/munge.key root@172.16.45.2:/etc/munge
scp /etc/munge/munge.key root@172.16.45.4:/etc/munge
```
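Every node must hold a byte-identical munge.key; a mismatched key is a common cause of credential errors later. A quick consistency check (md5sum is just illustrative; any checksum tool works):

```bash
md5sum /etc/munge/munge.key
for host in 172.16.45.2 172.16.45.4; do
    ssh "$host" 'md5sum /etc/munge/munge.key'
done
```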
```bash
#! all nodes
chown -R munge: /etc/munge/ /var/log/munge/
chmod 0700 /etc/munge/ /var/log/munge/
systemctl enable munge
systemctl start munge
```
Test on the master node:

```
# munge -n
MUNGE:AwQFAAD9xUgg77lK2Ts72xayqCe4IETD9sp4ZEJD8ZTCbDekcojBef1fveBK8YweUi/7ImJMUdw3rO+gl3P02K5cHJAJX0Xq74rhW+1EgZgJZcIxHy4Z3qmsPWk4rVzhJfKGgUQ=:
# munge -n | munge
MUNGE:AwQFAACLbOsTGZWeENLUthY0WyyVWQ1HVEBbGIWEAobpAaLI2T1oMbHKjMO6zOvCTIKZcEPB/0CBhYxbpekFQwK7jeN7RMIxuZ+9dZFUF6jLEh0gbiLIpvgL1z3kGGwZNR+FMR6D/b1pUFPL4Mt9QQd4zjAIOvVnWCoXyE3XTfI64ZIbGJCZypMRj6nD7G2zgEVQ+v23vSPb81mnfC7ne1FaLIdNu9Iy8ZsESaxXJDrVoKFf/3Nax+Iw/LvauIbjF/Ps/Ok6aDcIAoPbOFWfbO7L2rovQzHt/3ABwwzH4yOGDdj9aWyqcyuqegDp/d8l6iJ7TIg=:
# munge -n | ssh 172.16.45.2 unmunge
Authorized users only. All activities may be monitored and reported.
STATUS: Success (0)
ENCODE_HOST: ??? (172.16.45.29)
ENCODE_TIME: 2024-12-10 16:16:55 +0800 (1733818615)
DECODE_TIME: 2024-12-10 16:16:52 +0800 (1733818612)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 0
```
Install Slurm
```bash
#! all nodes
yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel libibmad libibumad perl-ExtUtils-MakeMaker perl-devel gcc mariadb-devel pam-devel rpm-build -y
wget https://download.schedmd.com/slurm/slurm-24.05.4.tar.bz2
rpmbuild -ta slurm-24.05.4.tar.bz2
cd /root/rpmbuild/RPMS/aarch64/
yum --nogpgcheck localinstall * -y
```
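rpmbuild takes a while and a failure is easy to miss in the scrollback, so it is worth confirming the packages actually exist before running localinstall:

```bash
# Should list slurm, slurm-slurmctld, slurm-slurmd, slurm-slurmdbd, etc.
ls -1 /root/rpmbuild/RPMS/aarch64/slurm-*.rpm
```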
```bash
#! all nodes
mkdir -p /var/log/slurm/
chown slurm: /var/log/slurm/
vi /etc/slurm/slurm.conf
```

Contents of /etc/slurm/slurm.conf:

```
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster
SlurmctldHost=Donau(172.16.45.29)
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=rabbitmq-node1 NodeAddr=172.16.45.2 CPUs=128 State=UNKNOWN
NodeName=gczxagenta2 NodeAddr=172.16.45.4 CPUs=128 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```
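slurm.conf must be identical on all nodes; after editing it on the control node, one straightforward way to push it out:

```bash
for host in 172.16.45.2 172.16.45.4; do
    scp /etc/slurm/slurm.conf root@"$host":/etc/slurm/slurm.conf
done
```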
On the control node:

```bash
mkdir /var/spool/slurmctld
chown slurm: /var/spool/slurmctld
chmod 755 /var/spool/slurmctld
touch /var/log/slurm/slurmctld.log
chown slurm: /var/log/slurm/slurmctld.log
touch /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log
chown slurm: /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log
```
On the compute nodes:

```bash
mkdir /var/spool/slurm
chown slurm: /var/spool/slurm
chmod 755 /var/spool/slurm
touch /var/log/slurm/slurmd.log
chown slurm: /var/log/slurm/slurmd.log
```
Test the configuration on all nodes. The NTP setup matters here because Slurm and Munge both require synchronized clocks across the cluster:

```
# slurmd -C    # confirm it prints the node description with no errors
NodeName=rabbitmq-node1 CPUs=128 Boards=1 SocketsPerBoard=128 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=514413
UpTime=12-07:19:32
# yum install ntp -y
# chkconfig ntpd on
# ntpdate pool.ntp.org
# systemctl start ntpd
```
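Note that CentOS 8 dropped the ntp package from the base repositories in favor of chrony; if `yum install ntp` is unavailable, an equivalent setup would be:

```bash
yum install chrony -y
systemctl enable --now chronyd
chronyc makestep    # step the clock immediately instead of slewing
```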
On the compute nodes:

```bash
systemctl enable slurmd.service
systemctl start slurmd.service
systemctl status slurmd.service
# The control node's slurmctld is not running yet, so errors here are expected at this point.
```
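If firewalld is running, the ports from slurm.conf must be reachable between nodes: 6817/tcp (slurmctld) on the control node, 6818/tcp (slurmd) on compute nodes, plus 6819/tcp (slurmdbd, configured later). A sketch; on an isolated cluster network some sites simply stop firewalld instead:

```bash
# Control node
firewall-cmd --permanent --add-port=6817/tcp --add-port=6819/tcp
# Compute nodes
firewall-cmd --permanent --add-port=6818/tcp
# Apply on every node afterwards
firewall-cmd --reload
```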
Install MariaDB on the Master Node
```bash
yum install mariadb-server mariadb-devel -y
systemctl enable mariadb
systemctl start mariadb
systemctl status mariadb
mysql
```

```
MariaDB [(none)]> GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost' IDENTIFIED BY '1234' WITH GRANT OPTION;
MariaDB [(none)]> SHOW VARIABLES LIKE 'have_innodb';
MariaDB [(none)]> FLUSH PRIVILEGES;
MariaDB [(none)]> CREATE DATABASE slurm_acct_db;
MariaDB [(none)]> quit;
```

Then tune InnoDB for slurmdbd:

```bash
vi /etc/my.cnf.d/innodb.cnf
```

```
[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
```

Restart MariaDB so the new InnoDB log file size takes effect (the old log files must be moved aside first):

```bash
systemctl stop mariadb
mv /var/lib/mysql/ib_logfile? /tmp/
systemctl start mariadb
```
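To confirm the new InnoDB settings took effect after the restart (the buffer pool size is reported in bytes; 1073741824 corresponds to 1024M):

```bash
mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size'; SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';"
```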
Configure slurmdbd:

```bash
vim /etc/slurm/slurmdbd.conf
```

```
#
# Example slurmdbd.conf file.
#
# See the slurmdbd.conf man page for more information.
#
# Archive info
#ArchiveJobs=yes
#ArchiveDir="/tmp"
#ArchiveSteps=yes
#ArchiveScript=
#JobPurge=12
#StepPurge=1
#
# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
DbdAddr=localhost
DbdHost=localhost
#DbdPort=7031
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=verbose
#DefaultQOS=normal,standby
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
#PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
#
# Database info
StorageType=accounting_storage/mysql
#StorageHost=localhost
#StoragePort=1234
DbdPort=6819
StoragePass=1234
StorageLoc=slurm_acct_db
```
```bash
chown slurm: /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
touch /var/log/slurmdbd.log
chown slurm: /var/log/slurmdbd.log
systemctl enable slurmdbd
systemctl start slurmdbd
systemctl status slurmdbd
systemctl enable slurmctld.service
systemctl start slurmctld.service
systemctl status slurmctld.service
```
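One caveat: the slurm.conf above still says `AccountingStorageType=accounting_storage/none`, so slurmctld will not record anything into this slurmdbd yet. Wiring accounting up would look roughly like this (a sketch; adjust host and port to your layout):

```bash
# In /etc/slurm/slurm.conf on all nodes:
#   AccountingStorageType=accounting_storage/slurmdbd
#   AccountingStorageHost=localhost
#   AccountingStoragePort=6819
systemctl restart slurmctld
# Register the cluster (ClusterName=cluster in slurm.conf) with the accounting database
sacctmgr add cluster cluster
```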
Verification

```
# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      2   idle gczxagenta2,rabbitmq-node1
# srun -N2 -l /bin/hostname
0: gczxagenta2
1: rabbitmq-node1
```
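Beyond srun, a small batch job exercises the full submit/schedule/output path end to end. A minimal sketch (the script and output names are arbitrary):

```bash
cat > hello.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=2
#SBATCH --output=hello_%j.out
srun hostname
EOF
sbatch hello.sbatch    # prints: Submitted batch job <jobid>
squeue                 # the job should appear, then drain from the queue
cat hello_*.out        # expect one hostname line per node
```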
The Many Pitfalls
- fatal error: EXTERN.h: running `yum -y install perl-devel` usually resolves this.
- Do not deploy the management node and a compute node on the same machine.
- munged: Error: Logfile is insecure: group-writable permissions set on "/var/log"
  munged is sometimes strict about permissions on the log file and its parent directories; setting them to, for example, 755 resolves this.
- error: auth_p_get_host: Lookup failed for 172.16.45.34
  Add IP-to-hostname mappings to /etc/hosts, for example:
```
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
172.16.45.29 Donau
172.16.45.18 rabbitmq-node2
172.16.45.2 rabbitmq-node1
172.16.45.34 Donau2
172.16.45.4 gczxagenta2
```
- error: Configured MailProg is invalid: safe to ignore.
- _read_slurm_cgroup_conf: No cgroup.conf file (/etc/slurm/cgroup.conf), using defaults: safe to ignore.
- srun: error: Task launch for StepId=12.0 failed on node : Invalid node
  Check whether any node IPs or names are duplicated.