Installing a Slurm Cluster on Linux



Installation Plan

SLURM (Simple Linux Utility for Resource Management) is an open-source, high-performance, scalable cluster management and job scheduling system, widely used on large compute clusters and supercomputers. It manages a cluster's compute resources (CPUs, memory, GPUs, and so on) and schedules jobs according to user requirements, improving overall cluster utilization.

  • master (control) node:

    • 172.16.45.29 (920)

  • compute nodes:

    • 172.16.45.2 (920)
    • 172.16.45.4 (920)


This guide uses CentOS 8, an RPM-based Linux distribution, as the example.
Create Accounts
#! Remove any existing database packages
yum remove mariadb-server mariadb-devel -y
#! Remove any existing Slurm and Munge packages
yum remove slurm munge munge-libs munge-devel -y
#! Remove existing users
userdel -r slurm
userdel -r munge
#! Create users
export MUNGEUSER=1051
groupadd -g $MUNGEUSER munge
useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
export SLURMUSER=1052
groupadd -g $SLURMUSER slurm
useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
#! Passwordless SSH login # run on the control node; see https://builtin.com/articles/ssh-without-password
ssh-keygen
#! Copy the key to the compute nodes
ssh-copy-id 172.16.45.2
ssh-copy-id 172.16.45.4
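To confirm that passwordless login works, each of the following should print the remote hostname without prompting for a password:

ssh 172.16.45.2 hostname
ssh 172.16.45.4 hostname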
Munge

Munge is an authentication service for creating and validating user credentials, used primarily in large high-performance computing (HPC) clusters. It is designed to be highly scalable and to provide secure, reliable authentication in complex cluster environments.
https://github.com/dun/munge
What Munge Does


  • Authentication
Munge allows a process to authenticate another local or remote process within a group of hosts that share the same users (UIDs) and groups (GIDs). These hosts form a security realm that shares a secret key.

  • Security realms
Munge manages trust between hosts by defining security realms. Hosts within the same realm trust one another, while hosts in different realms require additional authentication.

  • Simplified identity management
Munge simplifies identity management in HPC clusters: with Munge, administrators can avoid configuring complex SSH keys or Kerberos on every node.
How Munge Works

Munge implements authentication by generating and validating credentials. When one process needs to access another, it asks the local munged daemon for a credential. The daemon validates the requester's identity and issues a credential containing the requester's UID, GID, and other metadata. The process being accessed then validates this credential to confirm the requester's identity.
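A minimal round trip on a single host illustrates this, assuming munged is installed and running: encode a credential with munge, decode it with unmunge, and inspect the identity fields it carries.

munge -n | unmunge | grep -E 'STATUS|UID|GID'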
Advantages of Munge


  • High performance: Munge is designed to handle a large volume of authentication requests.
  • Scalability: Munge scales easily to large clusters.
  • Security: Munge provides multiple security mechanisms to prevent unauthorized access.
  • Ease of use: Munge is relatively simple to configure and manage.
Installation
#! All nodes
yum install epel-release -y
yum install munge munge-libs munge-devel -y
Generate the secret key on the management node
yum install rng-tools -y
rngd -r /dev/urandom
/usr/sbin/create-munge-key -r
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
scp /etc/munge/munge.key root@172.16.45.2:/etc/munge
scp /etc/munge/munge.key root@172.16.45.4:/etc/munge
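Optionally, confirm the key is byte-identical on every node by comparing checksums; this sketch reuses the root SSH access set up earlier:

md5sum /etc/munge/munge.key
for h in 172.16.45.2 172.16.45.4; do ssh root@$h md5sum /etc/munge/munge.key; done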
#! All nodes
chown -R munge: /etc/munge/ /var/log/munge/
chmod 0700 /etc/munge/ /var/log/munge/
systemctl enable munge
systemctl start munge
#! Test on the master node
# munge -n
MUNGE:AwQFAAD9xUgg77lK2Ts72xayqCe4IETD9sp4ZEJD8ZTCbDekcojBef1fveBK8YweUi/7ImJMUdw3rO+gl3P02K5cHJAJX0Xq74rhW+1EgZgJZcIxHy4Z3qmsPWk4rVzhJfKGgUQ=:
# munge -n | munge
MUNGE:AwQFAACLbOsTGZWeENLUthY0WyyVWQ1HVEBbGIWEAobpAaLI2T1oMbHKjMO6zOvCTIKZcEPB/0CBhYxbpekFQwK7jeN7RMIxuZ+9dZFUF6jLEh0gbiLIpvgL1z3kGGwZNR+FMR6D/b1pUFPL4Mt9QQd4zjAIOvVnWCoXyE3XTfI64ZIbGJCZypMRj6nD7G2zgEVQ+v23vSPb81mnfC7ne1FaLIdNu9Iy8ZsESaxXJDrVoKFf/3Nax+Iw/LvauIbjF/Ps/Ok6aDcIAoPbOFWfbO7L2rovQzHt/3ABwwzH4yOGDdj9aWyqcyuqegDp/d8l6iJ7TIg=:
# munge -n | ssh 172.16.45.2 unmunge
Authorized users only. All activities may be monitored and reported.
STATUS:          Success (0)
ENCODE_HOST:     ??? (172.16.45.29)
ENCODE_TIME:     2024-12-10 16:16:55 +0800 (1733818615)
DECODE_TIME:     2024-12-10 16:16:52 +0800 (1733818612)
TTL:             300
CIPHER:          aes128 (4)
MAC:             sha256 (5)
ZIP:             none (0)
UID:             root (0)
GID:             root (0)
LENGTH:          0
Install Slurm
#! All nodes
yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel libibmad libibumad perl-ExtUtils-MakeMaker perl-devel gcc mariadb-devel pam-devel rpm-build -y
wget https://download.schedmd.com/slurm/slurm-24.05.4.tar.bz2
rpmbuild -ta slurm-24.05.4.tar.bz2
cd /root/rpmbuild/RPMS/aarch64/
yum --nogpgcheck localinstall * -y
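To confirm the locally built packages actually landed, list what was installed:

rpm -qa | grep slurm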
#! All nodes
mkdir -p /var/log/slurm/
chown slurm: /var/log/slurm/
# vi /etc/slurm/slurm.conf
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster
SlurmctldHost=Donau(172.16.45.29)
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
# COMPUTE NODES
NodeName=rabbitmq-node1 NodeAddr=172.16.45.2 CPUs=128 State=UNKNOWN
NodeName=gczxagenta2 NodeAddr=172.16.45.4 CPUs=128 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
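As the config header notes, the same slurm.conf must be present on every node. One way to push it from the control node, reusing the root SSH access set up earlier:

for h in 172.16.45.2 172.16.45.4; do scp /etc/slurm/slurm.conf root@$h:/etc/slurm/; done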
Control node
mkdir /var/spool/slurmctld
chown slurm: /var/spool/slurmctld
chmod 755 /var/spool/slurmctld
touch /var/log/slurm/slurmctld.log
chown slurm: /var/log/slurm/slurmctld.log
touch /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log
chown slurm: /var/log/slurm/slurm_jobacct.log /var/log/slurm/slurm_jobcomp.log
Compute nodes
mkdir /var/spool/slurmd    # matches SlurmdSpoolDir in slurm.conf
chown slurm: /var/spool/slurmd
chmod 755 /var/spool/slurmd
touch /var/log/slurm/slurmd.log
chown slurm: /var/log/slurm/slurmd.log
Test the configuration on all nodes:
# slurmd -C    # confirm there are no errors
NodeName=rabbitmq-node1 CPUs=128 Boards=1 SocketsPerBoard=128 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=514413
UpTime=12-07:19:32
# yum install ntp -y
# chkconfig ntpd on
# ntpdate pool.ntp.org
# systemctl start ntpd
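Note that on CentOS 8 the classic ntp package is no longer in the base repositories; chrony is the stock replacement and provides the same time synchronization across nodes:

yum install chrony -y
systemctl enable --now chronyd
chronyc makestep    # step the clock immediately if it is far off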
Compute nodes
systemctl enable slurmd.service
systemctl start slurmd.service
systemctl status slurmd.service
# The control node is not running yet at this point, so an error here is expected.

Install MariaDB on the master node
yum install mariadb-server mariadb-devel -y
systemctl enable mariadb
systemctl start mariadb
systemctl status mariadb
mysql
MariaDB [(none)]> GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost' IDENTIFIED BY '1234' with grant option;
MariaDB [(none)]> SHOW VARIABLES LIKE 'have_innodb';
MariaDB [(none)]> FLUSH PRIVILEGES;
MariaDB [(none)]> CREATE DATABASE slurm_acct_db;
MariaDB [(none)]> quit;
# vi /etc/my.cnf.d/innodb.cnf
[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
# systemctl stop mariadb
mv /var/lib/mysql/ib_logfile? /tmp/
systemctl start mariadb
# vim /etc/slurm/slurmdbd.conf
#
# Example slurmdbd.conf file.
#
# See the slurmdbd.conf man page for more information.
#
# Archive info
#ArchiveJobs=yes
#ArchiveDir="/tmp"
#ArchiveSteps=yes
#ArchiveScript=
#JobPurge=12
#StepPurge=1
#
# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
DbdAddr=localhost
DbdHost=localhost
#DbdPort=7031
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=verbose
#DefaultQOS=normal,standby
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
#PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
#
# Database info
StorageType=accounting_storage/mysql
#StorageHost=localhost
#StoragePort=1234
DbdPort=6819
StoragePass=1234
StorageLoc=slurm_acct_db
# chown slurm: /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
touch /var/log/slurmdbd.log
chown slurm: /var/log/slurmdbd.log
systemctl enable slurmdbd
systemctl start slurmdbd
systemctl status slurmdbd
systemctl enable slurmctld.service
systemctl start slurmctld.service
systemctl status slurmctld.service
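Once slurmdbd is up, the accounting connection can be sanity-checked with sacctmgr. Note that the example slurm.conf above still sets AccountingStorageType=accounting_storage/none; for jobs to actually be recorded it would need to point at slurmdbd instead (accounting_storage/slurmdbd with AccountingStorageHost set):

sacctmgr show cluster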
Verification
# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      2   idle gczxagenta2,rabbitmq-node1
# srun -N2 -l /bin/hostname
0: gczxagenta2
1: rabbitmq-node1
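With sinfo and srun working, a batch submission exercises the full scheduler path as well. A minimal sketch (the script name hello.sh is hypothetical; the debug partition comes from the slurm.conf above):

#!/bin/bash
#SBATCH --job-name=hello         # job name shown in squeue
#SBATCH --partition=debug        # partition defined in slurm.conf
#SBATCH --nodes=2                # use both compute nodes
#SBATCH --output=hello_%j.out    # %j expands to the job ID

srun hostname                    # each allocated node prints its hostname

Submit it with sbatch hello.sh and watch it in the queue with squeue.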
Common Pitfalls


  • fatal error: EXTERN.h: running yum -y install perl-devel usually resolves this
  • Do not deploy the control node and a compute node on the same machine
  • munged: Error: Logfile is insecure: group-writable permissions set on "/var/log"
    munged can be strict about log directory and file permissions at startup; for example, 755
  • error: auth_p_get_host: Lookup failed for 172.16.45.34
Add IP-to-hostname mappings to the hosts file, for example:
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
172.16.45.29 Donau
172.16.45.18 rabbitmq-node2
172.16.45.2 rabbitmq-node1
172.16.45.34 Donau2
172.16.45.4 gczxagenta2
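After editing /etc/hosts, confirm the lookup succeeds in both directions for the address from the error message:

getent hosts 172.16.45.34
getent hosts Donau2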

  • error: Configured MailProg is invalid: this error can be ignored
  • _read_slurm_cgroup_conf: No cgroup.conf file (/etc/slurm/cgroup.conf), using defaults: this message can be ignored
  • srun: error: Task launch for StepId=12.0 failed on node : Invalid node
Check whether any node IPs or names are duplicated; a quick check follows below.
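A quick way to review the addresses the controller has registered for each node, where duplicate or stale entries show up:

scontrol show node | grep -E 'NodeName|NodeAddr'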
