diff --git a/README.md b/README.md index 515e732e16d1d6aac9e5c983e85f729f24dd3506..2afdd0d8ea2ee292d62597829bae8c0c82250399 100644 --- a/README.md +++ b/README.md @@ -32,9 +32,15 @@ | openEuler | 20.03 LTS | x86_64 |最小化安装| | Kylin | v10 sp1 | x86_64 |最小化安装| -根目录的磁盘空间利用率高于85%会触发Kubelet的镜像垃圾回收机制,将导致服务不可用。请确保根目录有足够的磁盘空间,建议大于500GB +注意: + +1. 根目录的磁盘空间利用率高于85%会触发Kubelet的镜像垃圾回收机制,将导致服务不可用。请确保根目录有足够的磁盘空间,建议大于500GB + +2. 建议参照上述备注要求安装操作系统,如最小化安装,否则可能有软件包冲突,导致服务不可用 -建议参照上述备注要求安装操作系统,如最小化安装,否则可能有软件包冲突,导致服务不可用 +3. 本工具只支持在同一种操作系统内部署 + +4. harbor默认安装在本机localhost的/data目录,建议额外配置一块磁盘挂载到/data目录,以避免占用根目录的磁盘空间。 ### 角色说明 @@ -62,7 +68,6 @@ useradd -g HwHiAiUser -u 1000 -d /home/HwHiAiUser -m HwHiAiUser -s /bin/bash 3. 存储(NFS、CephFS、OceanStore) 4. 容器镜像仓(harbor) -harbor默认安装在本机localhost的/data目录,建议额外配置一块磁盘挂载到/data目录,以避免占用根目录的磁盘空间。 ## 下载本工具 @@ -314,7 +319,7 @@ HIAI_GROUP_ID: 1000 5. 部署完成后,Mindx DL平台的URL访问地址即为"https://\:\"。 -### 步骤4:检查集群状态 +### 步骤4:检查集群连通性状态 如果inventory_file内配置了非localhost的远程ip,根据ansible官方建议,请用户自行使用SSH密钥的方式连接到远程机器,可参考[[connection_details; Ansible Documentation](https://docs.ansible.com/ansible/latest/user_guide/connection_details.html#setting-up-ssh-keys)] @@ -330,7 +335,7 @@ localhost | SUCCESS => { "changed": false, "ping": "pong" } -worker1_ipaddres | SUCCESS => { +192.0.3.100 | SUCCESS => { "ansible_facts": { "discovered_interpreter_python": "/usr/bin/python3" }, @@ -341,7 +346,7 @@ worker1_ipaddres | SUCCESS => { 当所有节点都能ping通,则表示inventory_file文件中所有节点连通性正常。否则,请检查节点的ssh连接和inventory_file文件配置是否正确 -各个节点应保持时间同步,不然可能会出现不可预知异常,时间同步服务应当由网络管理员提供支持。无法获取网络管理员提供支持的时间同步服务时,本工具也提供了可选的时间同步服务,依次执行01、99这2个子任务即可。 +各个节点应保持时间同步,不然可能会出现不可预知异常,时间同步服务应当由网络管理员提供支持。无法获取网络管理员提供支持的时间同步服务时,本工具也提供了可选的时间同步服务,依次执行01、99这2个子任务即可 ```bash root@master:~/ascend-hccl-controller# ansible-playbook -i inventory_file playbooks/01.resource.yaml playbooks/99.chrony.yaml @@ -494,13 +499,13 @@ mindx-dl redis-deploy-85dbb68c56-cfxhq 1/1 Running 1 root@master:~/ascend-hccl-controller# ansible worker -i inventory_file -m shell -a "docker pull :/mindx/npu-exporter:" ``` - apigw中的apigw-business部署在某个worker节点(k8s集群中没有worker节点时部署在master节点),apigw-business的镜像也是apigw。apigw-business由k8s调度到某个worker节点,apigw镜像会从harbor仓自动拉取,故不建议执行上面的命令手动拉取apigw镜像。k8s自动拉取的apigw镜像不会强制更新,如需更新同tag名的apigw镜像,请先删除环境上已存在的旧apigw镜像。 + apigw中的apigw-business部署在某个worker节点,apigw-business的镜像也是apigw。apigw-business由k8s调度到某个worker节点,apigw镜像会从harbor仓自动拉取,故不建议执行上面的命令手动拉取apigw镜像。k8s自动拉取的apigw镜像不会强制更新,如需更新同tag名的apigw镜像,请先删除环境上已存在的旧apigw镜像。 注: 1. MindX DL平台组件安装时依赖harbor。安装过程会制作镜像并上传到harbor中 -2. 只支持安装MindX DL平台组件,当前包括14个平台组件(apigw、cluster-manager、data-manager、dataset-manager、image-manager、model-manager、inference-manager、train-manager、user-manager、alarm-manager、hccl-controller、volcano、npu-exporter、device-plugin)。其中npu-exporter、device-plugin部署在所有worker节点,apigw中的apigw-business部署在某个worker节点(k8s集群中没有worker节点时部署在master节点),apigw中的apigw及其他组件都部署在master节点 +2. 只支持安装MindX DL平台组件,当前包括14个平台组件(apigw、cluster-manager、data-manager、dataset-manager、image-manager、model-manager、inference-manager、train-manager、user-manager、alarm-manager、hccl-controller、volcano、npu-exporter、device-plugin)。其中npu-exporter、device-plugin部署在所有worker节点,apigw中的apigw-business部署在某个worker节点,apigw中的apigw及其他组件都部署在master节点 3. npu-exporter、device-plugin组件包内的部分版本由于安全整改,可能没有Dockerfile和yaml文件,需要获取到对应版本的文件并重新打包,获取地址:[链接](https://gitee.com/ascend/mindxdl-deploy/tags)。NPU驱动和固件、MindX DL平台组件、Toolbox的版本需要配套使用,请参阅官方文档获取配套的软件包 @@ -602,23 +607,23 @@ playbooks/ 如果需要重新部署DL平台,手动清除k8s系统及DL平台残留的mysql数据库目录后,只需分别依次执行06-16这些子任务(这些子任务都跟k8s相关)即可,不必执行01-05、17这些子任务 -4. 
(可选)各个节点应保持时间同步,不然可能会出现不可预知异常,时间同步服务应当由网络管理员提供支持。无法获取网络管理员提供支持的时间同步服务时,本工具也提供了可选的时间同步服务,执行01、99这2个子任务即可。 +4. (可选)各个节点应保持时间同步,不然可能会出现不可预知异常,时间同步服务应当由网络管理员提供支持。无法获取网络管理员提供支持的时间同步服务时,本工具也提供了可选的时间同步服务,执行01、99这2个子任务即可 ```bash root@master:~/ascend-hccl-controller# ansible-playbook -i inventory_file playbooks/01.resource.yaml playbooks/99.chrony.yaml ``` -## 节点扩缩容 -目标:用此脚本搭建平台后,针对已有集群实现节点扩缩容 +## worker节点扩缩容 + +如果用户已完整执行过以上安装步骤,本工具支持在现有k8s集群上添加、删除worker节点。 -playbooks中97.worker_join.yaml任务可将worker节点加入已有集群 98.node_banish任务则可将对应节点踢出已有集群 +查阅“步骤2:配置集群信息”的inventory_file文件和“步骤3:配置安装信息”的group_vars/all.yaml文件,确保这2个配置文件同上一次使用本工具时的配置完全一致 -### 加入节点操作: +### 添加worker节点操作: -1.在inventory_file中填写原有集群[harbor]、[master]字段内容(若集群由此脚本安装保持原有配置不变即可) +1. 配置待新增的worker节点信息 -[harbor]为目标集群harbor对应ip, [master]为目标集群master信息 +在inventory_file文件中,建议将[master_backup]和[worker]组注释掉;增加[worker_join]组,写入需要添加的worker节点信息。本工具只支持添加跟现有OS相同的worker节点 -2.在inventory_file [worker_join]写入对应节点信息 ```ini [harbor] localhost ansible_connection=local @@ -627,37 +632,106 @@ localhost ansible_connection=local localhost ansible_connection=local [master] -localhost ansible_connection=local set_hostname="master" kube_interface="enp125s0f1" apiserver_advertise_address="195.0.3.99" +localhost ansible_connection=local set_hostname="master" kube_interface="enp125s0f0" -[master_backup] -192.0.3.100 set_hostname="master-backup-1" kube_interface="enp125s0f1" apiserver_advertise_address="195.0.3.100" +# [master_backup] +# 192.0.3.100 set_hostname="master-backup-1" kube_interface="enp125s0f0" +# 192.0.3.101 set_hostname="master-backup-2" kube_interface="enp125s0f0" -[worker] -192.0.2.50 set_hostname="worker-1" +# [worker] +# 192.0.2.50 set_hostname="worker-1" +# 192.0.2.51 set_hostname="worker-2" +# 192.0.2.52 set_hostname="worker-3" + +# 上一次使用本工具时,已经部署了一个k8s集群(3 master + 3 worker) +# 建议将[master_backup]和[worker]组注释掉,避免在这些节点执行重复的耗时操作 [worker_join] -192.0.2.52 set_hostname="worker-3" +192.0.2.53 set_hostname="worker-4" +192.0.2.54 set_hostname="worker-5" + +# 增加[worker_join]组,写入需要添加的worker节点信息。此处即会添加2个worker节点到现有的k8s集群 +``` + +2. 检查待新增的worker节点连通性状态 + +如果inventory_file内配置了非localhost的远程ip,根据ansible官方建议,请用户自行使用SSH密钥的方式连接到远程机器,可参考[[connection_details; Ansible Documentation](https://docs.ansible.com/ansible/latest/user_guide/connection_details.html#setting-up-ssh-keys)] + +在工具目录中执行: + +```bash +root@master:~/ascend-hccl-controller# ansible -i inventory_file worker_join -m ping + +192.0.2.53 | SUCCESS => { + "ansible_facts": { + "discovered_interpreter_python": "/usr/bin/python3" + }, + "changed": false, + "ping": "pong" +} +192.0.2.54 | SUCCESS => { + "ansible_facts": { + "discovered_interpreter_python": "/usr/bin/python3" + }, + "changed": false, + "ping": "pong" +} ``` -3.参照步骤3 填写group_vars目录中的all.yaml文件 -若集群由此脚本安装保持原有配置不变即可 +当新增加的[worker_join]组节点都能ping通,则表示其连通性正常。否则,请检查新增加的[worker_join]组节点的ssh连接和inventory_file文件配置是否正确 -若配置丢失请根据harbor信息正确填写harbor相关字段 并保持MYSQL_PASSWORD、REDIS_PASSWORD、APIGW_LOADBALANCER_IP字段非空(任意内容) +各个节点应保持时间同步,不然可能会出现不可预知异常,时间同步服务应当由网络管理员提供支持。无法获取网络管理员提供支持的时间同步服务时,本工具也提供了可选的时间同步服务,依次执行01、99这2个子任务即可 -4.执行以下命令即可将节点加入集群 ```bash -root@master:~/ascend-hccl-controller# ansible-playbook -i inventory_file playbooks/97.node_join.yaml +root@master:~/ascend-hccl-controller# ansible-playbook -i inventory_file playbooks/01.resource.yaml playbooks/99.chrony.yaml ``` -若inventory_file已有[worker]节点,则会为[worker]中第一个节点添加apigw-business调度label -若无worker节点,则会为[worker_join]中第一个节点添加apigw-business调度label +3. 
执行添加worker节点命令 + +```bash +root@master:~/ascend-hccl-controller# ansible-playbook -i inventory_file playbooks/97.worker_join.yaml +``` +4. 存储配置 +- 4.1 使用NFS方案时,需要找到nfs_server节点的/etc/exports文件,并在该文件的行末尾追加新增worker节点ip和权限 +比如,现有的k8s集群(3 master + 3 worker),默认配置可能为如下 +```bash +# nfs_server节点的/etc/exports文件 +/data/atlas_dls 192.0.3.99(rw,sync,no_root_squash) 192.0.3.100(rw,sync,no_root_squash) 192.0.3.101(rw,sync,no_root_squash) 192.0.2.50(rw,sync,no_root_squash) 192.0.2.51(rw,sync,no_root_squash) 192.0.2.52(rw,sync,no_root_squash) +``` -### 放逐节点操作: +追加新增的2个worker节点ip和权限后,配置为如下 +```bash +# nfs_server节点的/etc/exports文件 +/data/atlas_dls 192.0.3.99(rw,sync,no_root_squash) 192.0.3.100(rw,sync,no_root_squash) 192.0.3.101(rw,sync,no_root_squash) 192.0.2.50(rw,sync,no_root_squash) 192.0.2.51(rw,sync,no_root_squash) 192.0.2.52(rw,sync,no_root_squash) 192.0.2.53(rw,sync,no_root_squash) 192.0.2.54(rw,sync,no_root_squash) +``` + +nfs_server节点的/etc/exports文件修改完成后,重启nfs-server服务 +```bash +systemctl restart nfs-server +``` + +- 4.2 使用OceanStore方案时,需要在新增worker节点上手动挂载oceanstore存储 + +事前准备:已安装好oceanstore的dpc客户端(由oceanstore存储完成) + +```bash +mkdir /dl # 创建oceanstore的挂载目录 +chown 9000:9000 /dl # 使用hostpath方式,需要将挂载目录属主设置为9000 +mount -t dpc /dl # ,由oceanstore存储提供,不可与“/dl”同名 +# 使用autofs设置dpc开机自动挂载,以达到高可用 # 具体操作由oceanstore存储提供 +``` + +- 4.3 使用CephFS方案时,无需额外操作 + +### 删除worker节点操作: + +1. 配置待删除的worker节点信息 + +在inventory_file文件中,之前的配置不变;增加[worker_delete]组,写入需要删除的worker节点信息 -1.在inventory_file[master]字段填写入主节点信息 [node_banish]字段写入对应节点信息 ```ini [harbor] localhost ansible_connection=local @@ -666,34 +740,58 @@ localhost ansible_connection=local localhost ansible_connection=local [master] -localhost ansible_connection=local set_hostname="master" kube_interface="enp125s0f1" apiserver_advertise_address="195.0.3.99" +localhost ansible_connection=local set_hostname="master" kube_interface="enp125s0f0" [master_backup] -192.0.3.100 set_hostname="master-backup-1" kube_interface="enp125s0f1" apiserver_advertise_address="195.0.3.100" +192.0.3.100 set_hostname="master-backup-1" kube_interface="enp125s0f0" +192.0.3.101 set_hostname="master-backup-2" kube_interface="enp125s0f0" [worker] 192.0.2.50 set_hostname="worker-1" 192.0.2.51 set_hostname="worker-2" - -[worker_join] 192.0.2.52 set_hostname="worker-3" -[node_banish] +# 上一次使用本工具时,已经部署了一个k8s集群(3 master + 3 worker) +# 之前的配置不用变,不用注释掉[master_backup]和[worker]组 + +[worker_delete] 192.0.2.51 set_hostname="worker-2" -``` +192.0.2.52 set_hostname="worker-3" -2.执行以下命令即可将节点踢出集群 -```bash -root@master:~/ascend-hccl-controller# ansible-playbook -i inventory_file playbooks/98.node_banish.yaml +# 增加[worker_delete]组,写入需要删除的worker节点信息。此处即会把2个worker节点从现有的k8s集群删除 ``` -**!!!注意!!!** -放逐节点为高危操作:请务必根据集群状态以及实际需求使用此功能 +2. 
检查待删除的worker节点连通性状态 -1.若集群主master节点被放逐出集群,将导致平台、集群整体崩溃 +如果inventory_file内配置了非localhost的远程ip,根据ansible官方建议,请用户自行使用SSH密钥的方式连接到远程机器,可参考[[connection_details; Ansible Documentation](https://docs.ansible.com/ansible/latest/user_guide/connection_details.html#setting-up-ssh-keys)] -2.部分worker、master集群被放逐出集群可能导致平台部分业务不可用 +在工具目录中执行: + +```bash +root@master:~/ascend-hccl-controller# ansible -i inventory_file worker_delete -m ping + +192.0.2.51 | SUCCESS => { + "ansible_facts": { + "discovered_interpreter_python": "/usr/bin/python3" + }, + "changed": false, + "ping": "pong" +} +192.0.2.52 | SUCCESS => { + "ansible_facts": { + "discovered_interpreter_python": "/usr/bin/python3" + }, + "changed": false, + "ping": "pong" +} +``` + +当新增加的[worker_delete]组节点都能ping通,则表示其连通性正常。否则,请检查新增加的[worker_delete]组节点的ssh连接和inventory_file文件配置是否正确 +3. 执行删除worker节点命令 +```bash +root@master:~/ascend-hccl-controller# ansible-playbook -i inventory_file playbooks/98.worker_delete.yaml +``` # FAQ diff --git a/inventory_file b/inventory_file index 1c46c0c530534a324b47f327adbba279abfd4297..face0d063294da559c966276ed7bd44360d34c02 100644 --- a/inventory_file +++ b/inventory_file @@ -10,7 +10,3 @@ localhost ansible_connection=local [master_backup] [worker] - -[worker_join] - -[node_banish] diff --git a/playbooks/01.resource.yaml b/playbooks/01.resource.yaml index d1f99e9069df4a2b1dd015d4fbb902e91d741fb1..52a263ba9b215e9062e25637ae1ed2b159a523ed 100644 --- a/playbooks/01.resource.yaml +++ b/playbooks/01.resource.yaml @@ -5,6 +5,7 @@ - worker - harbor - master_backup + - worker_join roles: - role: mindx.resource vars: diff --git a/playbooks/97.worker_join.yaml b/playbooks/97.worker_join.yaml index dae096a7958d652a7034cbc3d6addc3677872956..f98d73cdd7f3aaed681207fc1c77917309b50dfb 100644 --- a/playbooks/97.worker_join.yaml +++ b/playbooks/97.worker_join.yaml @@ -1,9 +1,8 @@ --- -# node join k8s cluster +# worker join k8s cluster # distribute resources -- hosts: - - worker_join +- hosts: worker_join roles: - role: mindx.resource vars: @@ -17,48 +16,29 @@ - openEuler_20.03_x86_64 - kylin_V10_x86_64 -# install docker -- hosts: - - worker_join - roles: - - role: mindx.docker +# harbor gather_facts +- hosts: harbor -# docker login harbor -- hosts: - - master - - worker_join +# install softwares and login harbor +- hosts: worker_join + gather_facts: False roles: + - role: mindx.docker + - role: mindx.k8s.install + - role: mindx.nfs.client + when: STORAGE_TYPE == "NFS" - role: mindx.harbor.login # set basic config for mindxdl -- hosts: - - worker_join - tasks: - - include_tasks: roles/mindx.basic/tasks/common.yml - - hosts: worker_join gather_facts: False tasks: + - include_tasks: roles/mindx.basic/tasks/common.yml - include_tasks: roles/mindx.basic/tasks/worker.yml -- hosts: - - harbor - - worker_join - tasks: - - name: set HARBOR_IP - include_tasks: task_set_harbor_ip.yaml - -# install k8s -- hosts: - - worker_join - gather_facts: False - roles: - - role: mindx.k8s.install - # worker join k8s - hosts: worker_join gather_facts: False roles: - role: mindx.k8s.worker - role: mindx.k8s.autolabel - diff --git a/playbooks/98.node_banish.yaml b/playbooks/98.node_banish.yaml deleted file mode 100644 index 5c76090be4afe73aad4ac89cbd94541b76d601c2..0000000000000000000000000000000000000000 --- a/playbooks/98.node_banish.yaml +++ /dev/null @@ -1,14 +0,0 @@ ---- -# k8s cluster banish nodes - -# nodes uninstall k8s -- hosts: - - node_banish - roles: - - role: mindx.k8s.uninstall - -# k8s cluster delete node -- hosts: - - node_banish - roles: - - role: 
mindx.k8s.node_delete \ No newline at end of file diff --git a/playbooks/98.worker_delete.yaml b/playbooks/98.worker_delete.yaml new file mode 100644 index 0000000000000000000000000000000000000000..714fd7c7558251be1e91b917268d41708dba23ea --- /dev/null +++ b/playbooks/98.worker_delete.yaml @@ -0,0 +1,5 @@ +--- +# worker deleted from k8s cluster +- hosts: worker_delete + roles: + - role: mindx.k8s.worker_delete diff --git a/playbooks/99.chrony.yaml b/playbooks/99.chrony.yaml index 232e5d29860f89e71e8acfa90f59a7131d29f055..00f8755010d83285d450b378db7199ada0f5a8fd 100644 --- a/playbooks/99.chrony.yaml +++ b/playbooks/99.chrony.yaml @@ -5,5 +5,6 @@ - master - worker - master_backup + - worker_join roles: - role: mindx.chrony diff --git a/playbooks/roles/mindx.dl.install/files/dlinstall b/playbooks/roles/mindx.dl.install/files/dlinstall index 4136c55419bf0f4f81e82319bf4fb477e66a34c6..626037b3e36646aec1346cb7e4561646b23cd16c 100644 --- a/playbooks/roles/mindx.dl.install/files/dlinstall +++ b/playbooks/roles/mindx.dl.install/files/dlinstall @@ -254,7 +254,8 @@ function do_install() ansible worker -i ../inventory_file -m shell -a "docker pull ${image}" fi if [[ "${image}" =~ "apigw" ]]; then - local apigw_business_ip=$(cat ../inventory_file | grep $(kubectl get node -l apigw-selector=apigw-business-worker-node | awk 'END{print $1}') | awk 'END{print $1}') + local apigw_business_hostname=$(kubectl get pod -n mindx-dl -o wide -l app=apigw-business | awk 'END{print $7}') + local apigw_business_ip=$(grep "${apigw_business_hostname}" ../inventory_file | awk 'END{print $1}') [[ -n "${apigw_business_ip}" ]] && ansible ${apigw_business_ip} -i ../inventory_file -m shell -a "docker pull ${image}" fi done diff --git a/playbooks/roles/mindx.harbor.push/defaults/main.yml b/playbooks/roles/mindx.harbor.push/defaults/main.yml index e7301b7861b1e7da5cc0ae4a2382c4c1d758d5cb..c1e7395c9e8bba3860f07a1cff31f9bd275aa855 100644 --- a/playbooks/roles/mindx.harbor.push/defaults/main.yml +++ b/playbooks/roles/mindx.harbor.push/defaults/main.yml @@ -6,4 +6,4 @@ prefabricated: "{{ 1 if images_dir == 'mindx-inner-images' else 2 }}" harbor_path: "{{ manifest_name if images_dir == 'mindx-inner-images' else orig_name.split(':')[0] }}" image_arch: "{{ 'noarch' if images_dir == 'mindx-inner-images' else arch }}" image_prefix: "{{ image.split('/')[-1].split('_')[0] if images_dir == 'mindx-pre-images' else '' }}" -image_usage: "{{ image_prefix if (images_dir == 'mindx-pre-images' and image_prefix in ('inference', 'common')) else '' }}" +image_usage: "{{ image_prefix if (images_dir == 'mindx-pre-images' and image_prefix in ('inference', 'common', 'training', 'development')) else '' }}" diff --git a/playbooks/roles/mindx.harbor.push/tasks/push.yml b/playbooks/roles/mindx.harbor.push/tasks/push.yml index e4484176ecd53564fb0af370184abd3b3fb4d43e..d4b3d448245dc2bd6365647f6f33377e52cd45d3 100644 --- a/playbooks/roles/mindx.harbor.push/tasks/push.yml +++ b/playbooks/roles/mindx.harbor.push/tasks/push.yml @@ -42,6 +42,23 @@ register: image_size when: "'mindx' in images_dir" +- name: retry until mysql is ok + shell: | + kubectl exec {{ mysql_pod.stdout }} -n {{ k8s_namespace }} -- \ + mysql -u image_user -p{{ MYSQL_PASSWORD }} -e \ + "USE dl_platform; \ + SHOW TABLES;" + register: mysql_status + until: mysql_status.stdout.find("image_configs") != -1 + retries: 20 + delay: 6 + environment: + http_proxy: "" + https_proxy: "" + HTTP_PROXY: "" + HTTPS_PROXY: "" + when: "'mindx' in images_dir" + - name: insert image info to mysql shell: | kubectl 
exec {{ mysql_pod.stdout }} -n {{ k8s_namespace }} -- \ diff --git a/playbooks/roles/mindx.k8s.autolabel/tasks/main.yml b/playbooks/roles/mindx.k8s.autolabel/tasks/main.yml index 485c734aed970af83da7dc6e3afa29f5260d8d5a..e4c2f5dfaecffe78f6265cd0595f3b4254554347 100644 --- a/playbooks/roles/mindx.k8s.autolabel/tasks/main.yml +++ b/playbooks/roles/mindx.k8s.autolabel/tasks/main.yml @@ -62,43 +62,3 @@ HTTP_PROXY: "" HTTPS_PROXY: "" when: "'Device d801' in processing_accelerator.stdout" - -- name: label apigw-selector master - shell: kubectl label --overwrite node {{ansible_hostname}} apigw-selector=apigw-business-worker-node - delegate_to: "{{ groups['master'][0] }}" - delegate_facts: true - run_once: true - environment: - http_proxy: "" - https_proxy: "" - HTTP_PROXY: "" - HTTPS_PROXY: "" - when: - - "'worker' not in groups or groups['worker'] | length == 0" - - "'worker_join' not in groups or groups['worker_join'] | length == 0" - -- name: label apigw-selector worker - shell: kubectl label --overwrite node {{ hostvars[groups['worker'][0]]['set_hostname'] }} apigw-selector=apigw-business-worker-node - delegate_to: "{{ groups['master'][0] }}" - delegate_facts: true - run_once: true - environment: - http_proxy: "" - https_proxy: "" - HTTP_PROXY: "" - HTTPS_PROXY: "" - when: "'worker' in groups and groups['worker'] | length != 0" - -- name: label apigw-selector worker_join - shell: kubectl label --overwrite node {{ hostvars[groups['worker_join'][0]]['set_hostname'] }} apigw-selector=apigw-business-worker-node - delegate_to: "{{ groups['master'][0] }}" - delegate_facts: true - run_once: true - environment: - http_proxy: "" - https_proxy: "" - HTTP_PROXY: "" - HTTPS_PROXY: "" - when: - - "'worker' not in groups or groups['worker'] | length == 0" - - "'worker_join' in groups and groups['worker_join'] | length == 0" \ No newline at end of file diff --git a/playbooks/roles/mindx.k8s.master/tasks/main.yml b/playbooks/roles/mindx.k8s.master/tasks/main.yml index 4bda7e42401b2e915b11c51688876659c41b7daa..8b8f88d18bbb6879b90bce2a51579b3b2866ff6b 100644 --- a/playbooks/roles/mindx.k8s.master/tasks/main.yml +++ b/playbooks/roles/mindx.k8s.master/tasks/main.yml @@ -9,7 +9,7 @@ when: "'master' not in groups or groups['master'] | length != 1 or groups['master'][0] != 'localhost'" - name: check k8s - shell: "kubectl cluster-info | grep 'is running at' | wc -l" + shell: "kubectl cluster-info 2>&1 | grep 'is running at' | wc -l" environment: http_proxy: "" https_proxy: "" diff --git a/playbooks/roles/mindx.k8s.node_delete/tasks/main.yml b/playbooks/roles/mindx.k8s.node_delete/tasks/main.yml deleted file mode 100644 index e7d63f36ed53ada90dbc13302797232f10215e61..0000000000000000000000000000000000000000 --- a/playbooks/roles/mindx.k8s.node_delete/tasks/main.yml +++ /dev/null @@ -1,12 +0,0 @@ -- name: message - debug: - msg: "*************************start delete node***************************" - -- name: kubectl delete node - shell: kubectl delete node {{set_hostname}} - delegate_to: "{{ groups['master'][0] }}" - delegate_facts: true - failed_when: false - when: - - inventory_hostname not in groups['master'] - diff --git a/playbooks/roles/mindx.k8s.uninstall/meta/main.yml b/playbooks/roles/mindx.k8s.uninstall/meta/main.yml deleted file mode 100644 index 602e3a16c3a0316f13e8dbe8b61cb8a5c0fa5323..0000000000000000000000000000000000000000 --- a/playbooks/roles/mindx.k8s.uninstall/meta/main.yml +++ /dev/null @@ -1,9 +0,0 @@ -galaxy_info: - role_name: k8s.uninstall - author: ascend - description: developer - company: 
none - license: Apache-2.0 - min_ansible_version: 2.1 - galaxy_tags: - - 'ascend' diff --git a/playbooks/roles/mindx.k8s.uninstall/tasks/main.yml b/playbooks/roles/mindx.k8s.uninstall/tasks/main.yml deleted file mode 100644 index 2328b03aed31a82c67d063614ebb9376eb6db549..0000000000000000000000000000000000000000 --- a/playbooks/roles/mindx.k8s.uninstall/tasks/main.yml +++ /dev/null @@ -1,21 +0,0 @@ -- name: message - debug: - msg: "*************************start node uninstall k8s***************************" - -- name: check kubelet service - shell: systemctl is-active kubelet | grep '^active$' | wc -l - register: kubelet_status - -- name: message - debug: - msg: "kubelet is inactive and may already be uninstalled" - when: kubelet_status.stdout == "0" - -- name: uninstall k8s for node - shell: kubeadm reset -f; iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X; systemctl restart docker - failed_when: false - when: - - ansible_connection != "local" - - inventory_hostname not in groups['master'] - - diff --git a/playbooks/roles/mindx.k8s.worker/tasks/main.yml b/playbooks/roles/mindx.k8s.worker/tasks/main.yml index 7a5650947268e09bd7c5f2aba6cd54e3face0f2a..cb250a0e5a9f1b6f55651d2e8478a0a101c22924 100644 --- a/playbooks/roles/mindx.k8s.worker/tasks/main.yml +++ b/playbooks/roles/mindx.k8s.worker/tasks/main.yml @@ -6,8 +6,7 @@ include_tasks: worker_join.yml when: - ansible_connection != "local" - - (inventory_hostname in groups['worker'] or inventory_hostname in groups['worker_join']) - - inventory_hostname not in groups['master_backup'] + - "'master_backup' not in groups or ('master_backup' in groups and inventory_hostname not in groups['master_backup'])" - name: label worker shell: | diff --git a/playbooks/roles/mindx.k8s.worker/tasks/worker_join.yml b/playbooks/roles/mindx.k8s.worker/tasks/worker_join.yml index 6875cec0506a001437c430135084c462097f68a1..023d9798345d95aa13d4fcb25265f5894553fc65 100644 --- a/playbooks/roles/mindx.k8s.worker/tasks/worker_join.yml +++ b/playbooks/roles/mindx.k8s.worker/tasks/worker_join.yml @@ -1,7 +1,3 @@ -- name: message - debug: - msg: "******************************start join k8s on worker******************************" - - name: check k8s shell: "kubectl cluster-info 2>&1 | grep 'is running at' | wc -l" environment: diff --git a/playbooks/roles/mindx.k8s.worker/tests/test.yml b/playbooks/roles/mindx.k8s.worker/tests/test.yml index 91562e6cd9e2e0b5f2f3def9b778b8689470cc1f..dececc27e0d6048846a5fafaea8e9b5ce9106dc5 100644 --- a/playbooks/roles/mindx.k8s.worker/tests/test.yml +++ b/playbooks/roles/mindx.k8s.worker/tests/test.yml @@ -2,4 +2,4 @@ - hosts: localhost remote_user: root roles: - - mindx.k8sworker + - mindx.k8s.worker diff --git a/playbooks/roles/mindx.k8s.node_delete/meta/main.yml b/playbooks/roles/mindx.k8s.worker_delete/meta/main.yml similarity index 80% rename from playbooks/roles/mindx.k8s.node_delete/meta/main.yml rename to playbooks/roles/mindx.k8s.worker_delete/meta/main.yml index fe98e54e513a9d70f141e0748c9373ea60f1fdeb..12a54bc4609e917220047b8698f4853a0ea31fc3 100644 --- a/playbooks/roles/mindx.k8s.node_delete/meta/main.yml +++ b/playbooks/roles/mindx.k8s.worker_delete/meta/main.yml @@ -1,5 +1,5 @@ galaxy_info: - role_name: mindx.k8s.node_delete + role_name: mindx.k8s.worker_delete author: mindx description: developer license: Apache-2.0 diff --git a/playbooks/roles/mindx.k8s.worker_delete/tasks/main.yml b/playbooks/roles/mindx.k8s.worker_delete/tasks/main.yml new file mode 100644 index 
0000000000000000000000000000000000000000..ca0b2b9bf6c8b69c52faceb5cc324b9f7bc33a9e --- /dev/null +++ b/playbooks/roles/mindx.k8s.worker_delete/tasks/main.yml @@ -0,0 +1,14 @@ +- name: message + debug: + msg: "*************************start delete k8s on worker***************************" + +- name: worker deleted from k8s + include_tasks: worker_delete.yml + when: + - ansible_connection != "local" + - inventory_hostname not in groups['master_backup'] + +- name: message + debug: + msg: "{{ inventory_hostname }} is also a master node, skipping" + when: ansible_connection == "local" or inventory_hostname in groups['master_backup'] diff --git a/playbooks/roles/mindx.k8s.worker_delete/tasks/worker_delete.yml b/playbooks/roles/mindx.k8s.worker_delete/tasks/worker_delete.yml new file mode 100644 index 0000000000000000000000000000000000000000..afc39ffcdfb3540ff216a17100085b3794432755 --- /dev/null +++ b/playbooks/roles/mindx.k8s.worker_delete/tasks/worker_delete.yml @@ -0,0 +1,32 @@ +- name: message + debug: + msg: "*************************start delete k8s on worker***************************" + +- name: check k8s + shell: "kubectl cluster-info 2>&1 | grep 'is running at' | wc -l" + environment: + http_proxy: "" + https_proxy: "" + HTTP_PROXY: "" + HTTPS_PROXY: "" + register: cluster_info + +- name: message + debug: + msg: "k8s is not running" + when: cluster_info.stdout == "0" + +- name: kubeadm reset + shell: kubeadm reset -f; iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X; systemctl restart docker + failed_when: false + +- name: kubectl delete node + shell: kubectl delete node {{ ansible_hostname }} + delegate_to: "{{ groups['master'][0] }}" + delegate_facts: true + failed_when: false + environment: + http_proxy: "" + https_proxy: "" + HTTP_PROXY: "" + HTTPS_PROXY: "" diff --git a/tools/create_storage_dir.sh b/tools/create_storage_dir.sh index b8518947004d21c22b97fa53020c24060e2a335c..6e8f205c50fa1bc00dbc8a5cef4b9e6fe0250313 100644 --- a/tools/create_storage_dir.sh +++ b/tools/create_storage_dir.sh @@ -16,8 +16,8 @@ set -o errexit # 2. oceanstore: -# 由于使用hostpath方式使用oceanstore,需要在所有k8s节点上执行,并建议直接按如下操作创建oceanstore的挂载目录 -# 事前准备:已安装好oceanstore的dpc客户端(由oceanstore存储完成);创建oceanstore的挂载目录,并手动挂载oceanstore存储集群到该目录。 +# 由于使用hostpath方式使用oceanstore,需要在所有k8s节点上手动挂载oceanstore存储 +# 事前准备:已安装好oceanstore的dpc客户端(由oceanstore存储完成) # mkdir /dl # 创建oceanstore的挂载目录 # chown 9000:9000 /dl # 使用hostpath方式,需要将挂载目录属主设置为9000
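
补充一个示意性的 bash 片段,汇总上文 README 与 create_storage_dir.sh 中描述的 OceanStore(hostpath 方式)手动挂载步骤,便于在新增 worker 节点上对照执行。其中 DPC_SOURCE 仅为假设的占位变量,实际挂载源名称由 oceanstore 存储侧提供,不可与"/dl"同名:

```bash
#!/bin/bash
# 示意脚本:在新增 worker 节点上手动挂载 oceanstore(hostpath 方式)
# 前提:该节点已由 oceanstore 存储侧安装好 dpc 客户端
set -e

DPC_SOURCE="<oceanstore提供的挂载源>"   # 占位符,实际名称由 oceanstore 存储侧提供,不可与 /dl 同名

mkdir -p /dl                         # 创建 oceanstore 的挂载目录
chown 9000:9000 /dl                  # hostpath 方式要求将挂载目录属主设置为 9000
mount -t dpc "${DPC_SOURCE}" /dl     # 挂载 dpc 文件系统到 /dl

# 开机自动挂载(例如 autofs)的具体配置由 oceanstore 存储侧提供,此处不展开
```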