Setting Up Kubeflow: a Machine Learning Platform on K8s

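Introduction

Kubeflow is a collection of tools for developing, training, tuning, deploying, and managing machine learning workloads on top of k8s. It integrates open-source projects from many areas of machine learning, such as Jupyter, tfserving, Katib, Fairing, and Argo, and it covers the different stages of the workflow: data preprocessing, model training, model prediction, and service management.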

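I. Environment preparation

k8s version: v1.20.5

docker version: v19.03.15

kfctl version: v1.2.0-0-gbc038f9

kustomize version: v4.1.3

I am not sure whether Kubeflow 1.2.0 is fully compatible with k8s 1.20.5; this is only a test deployment.

For version compatibility, see: https://www.kubeflow.org/docs/distributions/kfctl/overview#minimum-system-requirements

1. Install kfctl

kfctl is the control plane for deploying and managing Kubeflow. The main deployment mode uses kfctl as a CLI, with a KFDef configuration tailored to each flavor of Kubernetes.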

wget https://github.com/kubeflow/kfctl/releases/download/v1.2.0/kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
tar -xvf kfctl_v1.2.0-0-gbc038f9_linux.tar.gz
chmod 755 kfctl
cp kfctl /usr/bin
kfctl version

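2. Install kustomize

Kustomize is a configuration management tool that uses layering to preserve the base settings of applications and components: declarative YAML artifacts, called patches, selectively override the default settings without changing the original files.

Download page: https://github.com/kubernetes-sigs/kustomize/releases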

wget https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2Fv4.1.3/kustomize_v4.1.3_linux_amd64.tar.gz
tar -xzvf kustomize_v4.1.3_linux_amd64.tar.gz
chmod 755 kustomize
mv kustomize /usr/bin/
kustomize version

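II. Deployment with public Internet access

If your servers can reach the public Internet, you can run the installation directly.

This test deployment uses machines in Alibaba Cloud's US West 1 (Silicon Valley) region.

1. Create the Kubeflow working directory

mkdir -p /apps/kubeflow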
cd /apps/kubeflow

2. Configure the StorageClass

# cat storageclass.yaml 
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alicloud-nas
mountOptions:
- nolock,tcp,noresvport
- vers=3
parameters:
  volumeAs: subpath
  server: "*********.us-west-1.nas.aliyuncs.com:/nasroot1/"  # Alibaba Cloud NAS storage is used here
  archiveOnDelete: "false"
provisioner: nasplugin.csi.alibabacloud.com
reclaimPolicy: Retain

3. Set it as the default StorageClass

# kubectl get sc
NAME                       PROVISIONER                       RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
alicloud-nas               nasplugin.csi.alibabacloud.com    Retain          Immediate              false                  24h

# Set the annotation value to "false" to stop it being the default
# kubectl patch storageclass alicloud-nas -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

# kubectl get sc
NAME                       PROVISIONER                       RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
alicloud-nas (default)     nasplugin.csi.alibabacloud.com    Retain          Immediate              false                  24h
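
The listing marks the default class with "(default)". A minimal sketch, assuming that output format, for confirming that exactly one class carries the marker. count_default_sc is a hypothetical helper name, and the sample text stands in for real cluster output.

```shell
# Count the StorageClasses marked "(default)" in `kubectl get sc` output.
# count_default_sc is a hypothetical helper; it reads the listing on stdin.
count_default_sc() {
  awk 'NR > 1 && /\(default\)/ { n++ } END { print n + 0 }'
}

# Sample listing standing in for real cluster output.
sample='NAME                     PROVISIONER                      RECLAIMPOLICY
alicloud-nas (default)   nasplugin.csi.alibabacloud.com   Retain
nfs-client               nfs-client-provisioner           Retain'

default_count=$(printf '%s\n' "$sample" | count_default_sc)
echo "default StorageClasses: $default_count"
```

Against a live cluster the same helper would be fed with `kubectl get sc | count_default_sc`. Any result other than 1 is worth checking before Kubeflow creates its PVCs.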

4. Install and deploy

wget https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_k8s_istio.v1.2.0.yaml
kfctl apply -V -f kfctl_k8s_istio.v1.2.0.yaml

After all the pods have been created, check each of them.

Make sure every pod below is in the Running state (one-off jobs show Completed).

# kubectl get pods -n cert-manager
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-7c75b559c4-c2hhj              1/1     Running   0          23h
cert-manager-cainjector-7f964fd7b5-mxbjl   1/1     Running   0          23h
cert-manager-webhook-566dd99d6-6vvzv       1/1     Running   2          23h

# kubectl get pods -n istio-system
NAME                                                         READY   STATUS      RESTARTS   AGE
cluster-local-gateway-5898bc5c74-822c9                       1/1     Running     0          23h
cluster-local-gateway-5898bc5c74-b5tmr                       1/1     Running     0          23h
cluster-local-gateway-5898bc5c74-fpswf                       1/1     Running     0          23h
istio-citadel-6dffd79d7-4scx7                                1/1     Running     0          23h
istio-galley-77cb9b44dc-6l4lm                                1/1     Running     0          23h
istio-ingressgateway-7bb77f89b8-psqcm                        1/1     Running     0          23h
istio-nodeagent-5qsmg                                        1/1     Running     0          23h
istio-nodeagent-ccc8j                                        1/1     Running     0          23h
istio-nodeagent-gqrsl                                        1/1     Running     0          23h
istio-pilot-67d94fc954-vl2sx                                 2/2     Running     0          23h
istio-policy-546596d4b4-6ct59                                2/2     Running     1          23h
istio-security-post-install-release-1.3-latest-daily-qbrf6   0/1     Completed   0          23h
istio-sidecar-injector-796b6454d9-lv8dg                      1/1     Running     0          23h
istio-telemetry-58f9cd4bf5-8cjj5                             2/2     Running     1          23h
prometheus-7c6d764c48-s29kn                                  1/1     Running     0          23h

# kubectl get pods -n knative-serving
NAME                                READY   STATUS    RESTARTS   AGE
activator-6c87fcbbb6-f4cs2          1/1     Running   0          23h
autoscaler-847b9f89dc-5jvml         1/1     Running   0          23h
controller-55f67c9ddb-67vvc         1/1     Running   0          23h
istio-webhook-db664df87-jn72n       1/1     Running   0          23h
networking-istio-76f8cc7796-9jr2j   1/1     Running   0          23h
webhook-6bff77594b-2r2gx            1/1     Running   0          23h

# kubectl get pods -n kubeflow
NAME                                                     READY   STATUS    RESTARTS   AGE
admission-webhook-bootstrap-stateful-set-0               1/1     Running   4          23h
admission-webhook-deployment-5cd7dc96f5-fw7d4            1/1     Running   2          23h
application-controller-stateful-set-0                    1/1     Running   0          23h
argo-ui-65df8c7c84-qwtc8                                 1/1     Running   0          23h
cache-deployer-deployment-5f4979f45-2xqbf                2/2     Running   2          23h
cache-server-7859fd67f5-hplhm                            2/2     Running   0          23h
centraldashboard-67767584dc-j9ffz                        1/1     Running   0          23h
jupyter-web-app-deployment-8486d5ffff-hmbz4              1/1     Running   0          23h
katib-controller-7fcc95676b-rn98v                        1/1     Running   1          23h
katib-db-manager-85db457c64-jx97j                        1/1     Running   0          23h
katib-mysql-6c7f7fb869-bt87c                             1/1     Running   0          23h
katib-ui-65dc4cf6f5-nhmsg                                1/1     Running   0          23h
kfserving-controller-manager-0                           2/2     Running   0          23h
kubeflow-pipelines-profile-controller-797fb44db9-rqzmg   1/1     Running   0          23h
metacontroller-0                                         1/1     Running   0          23h
metadata-db-6dd978c5b-zzntn                              1/1     Running   0          23h
metadata-envoy-deployment-67bd5954c-zvpf4                1/1     Running   0          23h
metadata-grpc-deployment-577c67c96f-zjt7w                1/1     Running   3          23h
metadata-writer-756dbdd478-dm4j4                         2/2     Running   0          23h
minio-54d995c97b-4rm2d                                   1/1     Running   0          23h
ml-pipeline-7c56db5db9-fprrw                             2/2     Running   1          23h
ml-pipeline-persistenceagent-d984c9585-vrd4g             2/2     Running   0          23h
ml-pipeline-scheduledworkflow-5ccf4c9fcc-9qkrq           2/2     Running   0          23h
ml-pipeline-ui-7ddcd74489-95dvl                          2/2     Running   0          23h
ml-pipeline-viewer-crd-56c68f6c85-tgxc2                  2/2     Running   1          23h
ml-pipeline-visualizationserver-5b9bd8f6bf-4zvwt         2/2     Running   0          23h
mpi-operator-d5bfb8489-gkp5w                             1/1     Running   0          23h
mxnet-operator-7576d697d6-qx7rg                          1/1     Running   0          23h
mysql-74f8f99bc8-f42zn                                   2/2     Running   0          23h
notebook-controller-deployment-5bb6bdbd6d-rclvr          1/1     Running   0          23h
profiles-deployment-56bc5d7dcb-2nqxj                     2/2     Running   0          23h
pytorch-operator-847c8d55d8-z89wh                        1/1     Running   0          23h
seldon-controller-manager-6bf8b45656-b7p7g               1/1     Running   0          23h
spark-operatorsparkoperator-fdfbfd99-9k46b               1/1     Running   0          23h
spartakus-volunteer-558f8bfd47-hskwf                     1/1     Running   0          23h
tf-job-operator-58477797f8-wzdcr                         1/1     Running   0          23h
workflow-controller-64fd7cffc5-zs6wx                     1/1     Running   0          23h
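
Scanning four long listings by eye is error-prone. The sketch below filters a listing down to the pods that are neither Running nor Completed. It assumes the single-namespace column layout shown above (NAME, READY, STATUS, ...); not_ready_pods is a hypothetical helper name, and the sample text stands in for real output.

```shell
# Print "name status" for pods whose STATUS is neither Running nor Completed.
# Reads single-namespace `kubectl get pods` output on stdin.
# not_ready_pods is a hypothetical helper name.
not_ready_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 " " $3 }'
}

# Sample listing standing in for real cluster output.
sample='NAME                                       READY   STATUS             RESTARTS   AGE
cert-manager-7c75b559c4-xmsp6              1/1     Running            0          3m46s
cert-manager-webhook-566dd99d6-fnchp       0/1     ImagePullBackOff   0          3m46s'

pending=$(printf '%s\n' "$sample" | not_ready_pods)
echo "$pending"
```

On a cluster this would be used as `kubectl get pods -n kubeflow | not_ready_pods`, once per namespace. It only looks at the STATUS column, so a Running pod whose containers are not all ready would still pass.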

5. Access the Kubeflow UI

kubectl get svc/istio-ingressgateway -n istio-system
NAME                   TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)                                                                                                                                      AGE
istio-ingressgateway   NodePort   12.80.127.69   <none>        15020:32661/TCP,80:31380/TCP,443:31390/TCP,31400:31400/TCP,15029:30345/TCP,15030:32221/TCP,15031:31392/TCP,15032:31191/TCP,15443:32136/TCP   5h14m

Forward the port to your local machine for testing

kubectl port-forward svc/istio-ingressgateway 80 -n istio-system

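Then open localhost in a local browser. Binding local port 80 usually requires root privileges; to avoid that, forward a higher port (for example 8080:80) and open localhost:8080 instead.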

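III. Offline deployment of Kubeflow

If your machines cannot reach the public Internet, things will not go as smoothly as above.

For example, pods get stuck in ImagePullBackOff.

You then have to check each pod to see which image it uses, and download that image on a machine that does have Internet access.

The rough procedure is:

Prepare the required images → pull them and push them to the internal image registry → change the image addresses in the manifests-1.2.0 project → package manifests-1.2.0 as v1.2.0.tar.gz → deploy

1. Prepare the required images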

quay.io/jetstack/cert-manager-cainjector:v0.11.0
quay.io/jetstack/cert-manager-webhook:v0.11.0
gcr.io/istio-release/citadel:release-1.3-latest-daily
gcr.io/istio-release/proxyv2:release-1.3-latest-daily
gcr.io/istio-release/node-agent-k8s:release-1.3-latest-daily
gcr.io/istio-release/pilot:release-1.3-latest-daily
gcr.io/istio-release/mixer:release-1.3-latest-daily
gcr.io/istio-release/kubectl:release-1.3-latest-daily
quay.io/jetstack/cert-manager-controller:v0.11.0
gcr.io/istio-release/galley:release-1.3-latest-daily
gcr.io/istio-release/sidecar_injector:release-1.3-latest-daily
gcr.io/istio-release/proxy_init:release-1.3-latest-daily
gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0
gcr.io/kfserving/kfserving-controller:v0.4.1
python:3.7
metacontroller/metacontroller:v0.3.0
gcr.io/ml-pipeline/envoy:metadata-grpc
gcr.io/tfx-oss-public/ml_metadata_store_server:v0.21.1
gcr.io/ml-pipeline/persistenceagent:1.0.4
gcr.io/ml-pipeline/scheduledworkflow:1.0.4
gcr.io/ml-pipeline/frontend:1.0.4
mpioperator/mpi-operator:latest
kubeflow/mxnet-operator:v1.0.0-20200625
gcr.io/ml-pipeline/metadata-writer:1.0.4
gcr.io/ml-pipeline/visualization-server:1.0.4
gcr.io/kubeflow-images-public/notebook-controller:vmaster-g6eb007d0
gcr.io/kubeflow-images-public/pytorch-operator:vmaster-g518f9c76
docker.io/seldonio/seldon-core-operator:1.4.0
gcr.io/kubeflow-images-public/tf_operator:vmaster-gda226016
gcr.io/kubeflow-images-public/admission-webhook:v20190520-v0-139-gcee39dbc-dirty-0d8f4c
gcr.io/ml-pipeline/cache-server:1.0.4
mysql:8.0.3
gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/mysql:5.6
gcr.io/kubeflow-images-public/metadata:v0.1.11
gcr.io/kubeflow-images-public/profile-controller:vmaster-ga49f658f
gcr.io/kubeflow-images-public/kfam:vmaster-g9f3bfd00
gcr.io/google_containers/spartakus-amd64:v1.1.0
argoproj/workflow-controller:v2.3.0
gcr.io/kubeflow-images-public/tensorflow-1.15.2-notebook-cpu:1.0.0
gcr.io/kubeflow-images-public/tensorflow-1.15.2-notebook-gpu:1.0.0
gcr.io/kubeflow-images-public/tensorflow-2.1.0-notebook-cpu:1.0.0
gcr.io/kubeflow-images-public/tensorflow-2.1.0-notebook-gpu:1.0.0

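--- The images below are pulled by digest and therefore have no tag; you need to tag them yourself. It is best to pull them by hand, separately.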
gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:ffa3d72ee6c2eeb2357999248191a643405288061b7080381f22875cb703e929
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:f89fd23889c3e0ca3d8e42c9b189dc2f93aa5b3a91c64e8aab75e952a210eeb3
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:b86ac8ecc6b2688a0e0b9cb68298220a752125d0a048b8edf2cf42403224393c
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:e6b142c0f82e0e0b8cb670c11eb4eef6ded827f98761bbf4bea7bdb777b80092
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:75c7918ca887622e7242ec1965f87036db1dc462464810b72735a8e64111f6f7
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:7e6df0fda229a13219bbc90ff72a10434a0c64cd7fe13dc534b914247d1087f4

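Alarmed by the number of images? Don't worry, you certainly won't have to download them one by one by hand.

Write a shell script to handle this repetitive work. The server used to pull the images should be clean; ideally, remove all existing local images before running the script. The save step below works from the output of docker images, so any images already present locally will be saved as well.

A machine with at least 100 GB of disk is recommended.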
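
Lists collected by inspecting pods tend to pick up repeated entries and stray strings that are not images, such as pod names. The sketch below normalises such a list before it is fed to a pull script. clean_image_list is a hypothetical helper name, and the rule "an entry must contain a tag or a digest" is an assumption that fits this list, not a general rule for image references.

```shell
# Normalise an image list read on stdin:
#  - trim surrounding whitespace and drop blank lines
#  - keep only entries pinned by tag (name:tag) or digest (name@sha256:...)
#  - drop repeated entries, keeping the first occurrence
# clean_image_list is a hypothetical helper name.
clean_image_list() {
  awk '
    { gsub(/^[ \t]+|[ \t]+$/, "") }
    $0 == ""     { next }
    $0 !~ /[:@]/ { next }
    !seen[$0]++
  '
}

# Sample input: one duplicate and one pod name that slipped into the list.
sample='python:3.7
gcr.io/ml-pipeline/envoy:metadata-grpc
mxnet-operator-679f456768-rcnfr
python:3.7'

cleaned=$(printf '%s\n' "$sample" | clean_image_list)
echo "$cleaned"
```

With a real file this would be `clean_image_list < images.txt > images.clean.txt`, after which the cleaned file can be reviewed and renamed.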

2. Create pull_images.sh

# vim pull_images.sh 
#!/bin/bash
G=`tput setaf 2`
C=`tput setaf 6`
Y=`tput setaf 3`
Q=`tput sgr0`

echo -e "${C}\n\nImage download script:${Q}"
echo -e "${C}pull_images.sh reads the images listed in images.txt, pulls them, and saves them into images.tar.gz\n\n${Q}"

# Remove images that already exist locally
# echo "${C}start: removing images${Q}"
# for rm_image in $(cat images.txt)
# do 
#  docker rmi $aliNexus$rm_image
# done
# echo -e "${C}end: removal finished\n\n${Q}"

# Create the output directory
mkdir -p images

# Pull every image listed in images.txt
echo "${C}start: pulling images...${Q}"
for pull_image in $(cat images.txt)
do
  echo "${Y}    pulling $pull_image...${Q}"
  docker pull "$pull_image"
done
echo "${C}end: image pull finished...${Q}"

# Save the images. This walks the whole local image store (docker images),
# not just images.txt. docker save writes a plain tar; the .tar.gz suffix is only a name.
IMAGES_LIST=($(docker images | sed '1d' | awk '{print $1}'))
IMAGES_NM_LIST=($(docker images | sed  '1d' | awk '{print $1"-"$2}'| awk -F/ '{print $NF}'))
IMAGES_NUM=${#IMAGES_LIST[*]}
echo "Image list....."
docker images
# docker images | sed '1d' | awk '{print $1}'
for((i=0;i<$IMAGES_NUM;i++))
do
  echo "saving ${IMAGES_LIST[$i]} image..."
  docker save "${IMAGES_LIST[$i]}" -o ./images/"${IMAGES_NM_LIST[$i]}".tar.gz
done
ls images
echo -e "${C}end: save finished\n\n${Q}"

# Pack the images
#tag_date=$(date "+%Y%m%d%H%M")
echo "${C}start: packing images: images.tar.gz${Q}"
tar -czvf images.tar.gz images
echo -e "${C}end: packing finished\n\n${Q}"

# Upload the archive to OSS; if you have no OSS, use any other repository reachable from your internal network
# echo "${C}start: uploading images.tar.gz to OSS${Q}"
# ossutil64 cp images.tar.gz oss://aicloud-deploy/kubeflow-images/
# echo -e "${C}end: upload finished\n\n${Q}"
# Clean up
read -p "${C}Remove the local images (Y/N, default N)?:${Q}" is_clean
 if [ -z "${is_clean}" ];then
   is_clean="N"
 fi
 if [ "${is_clean}" == "Y" ];then
   rm -rf images/*
   rm -rf images.tar.gz
   for clean_image in $(cat images.txt)
   do
     docker rmi "$clean_image"
   done
   echo -e "${C}cleanup finished~\n\n${Q}"
 fi

echo -e "${C}done~\n\n${Q}"
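
The script names each archive after the last path component of the repository plus the tag. Two repositories that end in the same component, such as the serving/cmd/controller and net-istio/cmd/controller images in the list, would map to the same archive name if they share a tag. A sketch of an alternative naming scheme that keeps the full reference follows; image_to_filename is a hypothetical helper and is not part of the script above.

```shell
# Turn a full image reference into a string that is safe to use as a file
# name by replacing "/", ":" and "@" with "_".
# image_to_filename is a hypothetical helper name.
image_to_filename() {
  printf '%s\n' "$1" | tr '/:@' '___'
}

archive_name="$(image_to_filename 'gcr.io/ml-pipeline/frontend:1.0.4').tar"
echo "$archive_name"
```

Because the registry and every path component survive in the name, distinct references produce distinct archive names.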

3. Edit images.txt, the list of images to download

# vim images.txt
quay.io/jetstack/cert-manager-cainjector:v0.11.0
quay.io/jetstack/cert-manager-webhook:v0.11.0
gcr.io/istio-release/citadel:release-1.3-latest-daily
gcr.io/istio-release/proxyv2:release-1.3-latest-daily
gcr.io/istio-release/node-agent-k8s:release-1.3-latest-daily
gcr.io/istio-release/pilot:release-1.3-latest-daily
gcr.io/istio-release/mixer:release-1.3-latest-daily
gcr.io/istio-release/kubectl:release-1.3-latest-daily
quay.io/jetstack/cert-manager-controller:v0.11.0
gcr.io/istio-release/galley:release-1.3-latest-daily
gcr.io/istio-release/sidecar_injector:release-1.3-latest-daily
gcr.io/istio-release/proxy_init:release-1.3-latest-daily
gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0
gcr.io/kfserving/kfserving-controller:v0.4.1
python:3.7
metacontroller/metacontroller:v0.3.0
gcr.io/ml-pipeline/envoy:metadata-grpc
gcr.io/tfx-oss-public/ml_metadata_store_server:v0.21.1
gcr.io/ml-pipeline/persistenceagent:1.0.4
gcr.io/ml-pipeline/scheduledworkflow:1.0.4
gcr.io/ml-pipeline/frontend:1.0.4
mpioperator/mpi-operator:latest
kubeflow/mxnet-operator:v1.0.0-20200625
gcr.io/ml-pipeline/metadata-writer:1.0.4
gcr.io/ml-pipeline/visualization-server:1.0.4
gcr.io/kubeflow-images-public/notebook-controller:vmaster-g6eb007d0
gcr.io/kubeflow-images-public/pytorch-operator:vmaster-g518f9c76
docker.io/seldonio/seldon-core-operator:1.4.0
gcr.io/kubeflow-images-public/tf_operator:vmaster-gda226016
gcr.io/kubeflow-images-public/admission-webhook:v20190520-v0-139-gcee39dbc-dirty-0d8f4c
gcr.io/ml-pipeline/cache-server:1.0.4
mysql:8.0.3
gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/mysql:5.6
gcr.io/kubeflow-images-public/metadata:v0.1.11
gcr.io/kubeflow-images-public/profile-controller:vmaster-ga49f658f
gcr.io/kubeflow-images-public/kfam:vmaster-g9f3bfd00
gcr.io/google_containers/spartakus-amd64:v1.1.0
argoproj/workflow-controller:v2.3.0
gcr.io/kubeflow-images-public/tensorflow-1.15.2-notebook-cpu:1.0.0
gcr.io/kubeflow-images-public/tensorflow-1.15.2-notebook-gpu:1.0.0
gcr.io/kubeflow-images-public/tensorflow-2.1.0-notebook-cpu:1.0.0
gcr.io/kubeflow-images-public/tensorflow-2.1.0-notebook-gpu:1.0.0

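4. Run the script

bash pull_images.sh

The images below are best pulled by hand and then tagged. Alternatively, strip everything except the pull section from the script above and reuse it.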

docker pull gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:ffa3d72ee6c2eeb2357999248191a643405288061b7080381f22875cb703e929
docker pull gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:f89fd23889c3e0ca3d8e42c9b189dc2f93aa5b3a91c64e8aab75e952a210eeb3
docker pull gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:b86ac8ecc6b2688a0e0b9cb68298220a752125d0a048b8edf2cf42403224393c
docker pull gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:e6b142c0f82e0e0b8cb670c11eb4eef6ded827f98761bbf4bea7bdb777b80092
docker pull gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:75c7918ca887622e7242ec1965f87036db1dc462464810b72735a8e64111f6f7
docker pull gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:7e6df0fda229a13219bbc90ff72a10434a0c64cd7fe13dc534b914247d1087f4

5. Tag the images

docker tag <image ID> <reference in the internal registry>
docker tag 3208baba46fc aicloud-harbor.com/library/serving/cmd/activator:v1.2.0
docker tag 4578f31842ab aicloud-harbor.com/library/serving/cmd/autoscaler:v1.2.0
docker tag d1b481df9ac3 aicloud-harbor.com/library/serving/cmd/webhook:v1.2.0
docker tag 9f8e41e19efb aicloud-harbor.com/library/serving/cmd/controller:v1.2.0
docker tag 6749b4c87ac8 aicloud-harbor.com/library/net-istio/cmd/webhook:v1.2.0
docker tag ba7fa40d9f88 aicloud-harbor.com/library/net-istio/cmd/controller:v1.2.0

6. Save the images

docker save <image> -o <archive name>
docker save aicloud-harbor.com/library/serving/cmd/activator:v1.2.0 -o activator-v1.2.0.tar.gz
docker save aicloud-harbor.com/library/serving/cmd/autoscaler:v1.2.0 -o autoscaler-v1.2.0.tar.gz
....

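Then upload the image archives to your internal image registry, such as Harbor; that is what I did here. You can also upload them directly to the deployment servers, but then you must load them on every node. Otherwise, when a pod restarts on a different node, the image pull will fail again...

If the server that pulled the images cannot reach the internal registry, you also need to download the archives to your local machine and then upload them to the internal Harbor server. If your registry is hosted on Alibaba Cloud, this is even easier.

Below is the script for pushing the images. Note that you also need to prepare an images.txt file.

7. Create push_images.sh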

vim push_images.sh
#!/bin/bash
G=`tput setaf 2`
C=`tput setaf 6`
Y=`tput setaf 3`
Q=`tput sgr0`

echo -e "${C}\n\nImage upload script:${Q}"
echo -e "${C}push_images.sh reads the image names in images.txt and pushes the images in images.tar.gz to the internal registry\n\n${Q}"

# Read the internal registry address
read -p "${C}Internal registry address (default aicloud-harbor.com/library):${Q}" nexusAddr
if [ -z "${nexusAddr}" ];then
  nexusAddr="aicloud-harbor.com/library"
fi
# Strip a trailing slash so the address joins with the image name using exactly one "/"
nexusAddr="${nexusAddr%/}"
tar -xzf images.tar.gz
cd images
# load
echo "${C}start: loading images${Q}"
for image_name in *
do
  echo -e "${Y}    loading $image_name...${Q}"
  docker load < "${image_name}"
done
echo -e "${C}end: load finished...\n\n${Q}"

# Push the images (every image in the local store is tagged and pushed)
echo "${C}start: pushing images to harbor...${Q}"
for push_image in $(docker images | sed '1d' | awk '{print $1":"$2}')
do
  echo -e "${Y}    pushing $push_image...${Q}"
  docker tag "$push_image" "$nexusAddr/$push_image"
  docker push "$nexusAddr/$push_image"
  echo "image $nexusAddr/$push_image pushed..."
done

echo -e "${C}end: all images pushed\n\n${Q}"
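
Whether or not the operator types a trailing slash, the target reference should contain exactly one separator between the registry prefix and the image name, since a doubled slash is not a valid image reference. A small sketch of that join in isolation; target_ref is a hypothetical helper name.

```shell
# Join a registry prefix and an image reference with exactly one "/".
# target_ref is a hypothetical helper name.
target_ref() {
  prefix=${1%/}          # drop one trailing slash, if present
  printf '%s/%s\n' "$prefix" "$2"
}

# Both spellings of the prefix give the same result.
target_ref 'aicloud-harbor.com/library/' 'gcr.io/ml-pipeline/frontend:1.0.4'
target_ref 'aicloud-harbor.com/library'  'gcr.io/ml-pipeline/frontend:1.0.4'
```

Keeping the upstream registry inside the path, as here, keeps the upstream registry name as part of each internal path, at the cost of longer image names.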

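8. Change the image addresses in the Kubeflow manifests

First download the whole project locally.

Project page: https://github.com/kubeflow/manifests/releases

Download the v1.2.0 archive. The images referenced inside it need to be changed, because most of them come from registries hosted overseas.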

https://github.com/kubeflow/manifests/archive/v1.2.0.tar.gz

I opened the project in IDEA, which makes project-wide search, replace, and editing convenient.

Unpack the archive to get the manifests-1.2.0 directory, then open it as a project in IDEA.

Use the shortcut Alt+Shift+R, or open the project-wide replace dialog from the IDE menu.

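Then replace each original image address with the one you tagged.

Every image downloaded above has to be replaced, so this is a fair amount of work.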
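
For prefixes that map one-to-one, the replacement can also be scripted rather than done in an IDE. The sketch below rewrites one image prefix in every file under a directory. rewrite_prefix is a hypothetical helper; it escapes only "." in the search string and assumes that neither string contains "|" or "&". The demo runs on a throwaway directory, not on the real manifests.

```shell
# Rewrite an image prefix in every file under a directory.
# Usage: rewrite_prefix DIR FROM TO
# rewrite_prefix is a hypothetical helper name. Only "." in FROM is escaped;
# FROM and TO must not contain "|" or "&".
rewrite_prefix() {
  dir=$1
  from=$(printf '%s' "$2" | sed 's/\./\\./g')
  to=$3
  # grep -F selects the files that contain FROM as a fixed string
  grep -rlF -- "$2" "$dir" | while IFS= read -r file; do
    sed "s|$from|$to|g" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
  done
}

# Demo on a temporary directory standing in for manifests-1.2.0.
demo=$(mktemp -d)
printf 'image: gcr.io/ml-pipeline/frontend:1.0.4\n' > "$demo/deployment.yaml"
rewrite_prefix "$demo" 'gcr.io/ml-pipeline/' 'aicloud-harbor.com/library/gcr.io/ml-pipeline/'
cat "$demo/deployment.yaml"
```

Running it twice with the same arguments would prefix the images twice, so it is safest on a fresh copy of the manifests, followed by a review of the resulting diff.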

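9. Package the project

Package and compress the project, then upload it to a repository that the deployment server can reach with wget. I used Nexus here.

To match the layout of the original archive, I first zipped the project on my local Windows machine, then uploaded it to a Linux server, unpacked it, and repacked it as a tar archive.

rz manifests-1.2.0.zip   # upload command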
mkdir manifests-1.2.0
mv manifests-1.2.0.zip manifests-1.2.0/
cd manifests-1.2.0/
unzip manifests-1.2.0.zip
rm -rf manifests-1.2.0.zip
cd ..
tar -czvf v1.2.0.tar.gz manifests-1.2.0/
curl -u test:*********** --upload-file ./v1.2.0.tar.gz http://nexus.example.com/repository/public-ftp/kubernetes/package/kubeflow/manifests-1.2.0/

10. Create the Kubeflow working directory

mkdir -p /apps/kubeflow
cd /apps/kubeflow

11. Create a StorageClass

# cat StorageClass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-client
provisioner: nfs-client-provisioner  # name of a NAS controller set up on the internal network; Alibaba Cloud NAS also works, as long as it is reachable
reclaimPolicy: Retain
parameters:
  archiveOnDelete: "true"

Then make it the default StorageClass

# kubectl get sc
NAME                     PROVISIONER                                   RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
nfs-client               nfs-client-provisioner                        Retain          Immediate           false                  21h

# Set the annotation value to "false" to stop it being the default
# kubectl patch storageclass nfs-client -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
# kubectl get sc
NAME                     PROVISIONER                                   RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
nfs-client (default)     nfs-client-provisioner                        Retain          Immediate           false                  21h

12. Edit the configuration file

vim kfctl_k8s_istio.v1.2.0.yaml
apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  namespace: kubeflow
spec:
  applications:
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: namespaces/base
    name: namespaces
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: application/v3
    name: application
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/istio-1-3-1-stack
    name: istio-stack
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/cluster-local-gateway-1-3-1
    name: cluster-local-gateway
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: istio/istio/base
    name: istio
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/cert-manager-crds
    name: cert-manager-crds
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/cert-manager-kube-system-resources
    name: cert-manager-kube-system-resources
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/cert-manager
    name: cert-manager
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/add-anonymous-user-filter
    name: add-anonymous-user-filter
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: metacontroller/base
    name: metacontroller
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: admission-webhook/bootstrap/overlays/application
    name: bootstrap
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/spark-operator
    name: spark-operator
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes
    name: kubeflow-apps
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: knative/installs/generic
    name: knative
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: kfserving/installs/generic
    name: kfserving
  # Spartakus is a separate applications so that kfctl can remove it
  # to disable usage reporting
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/kubernetes/application/spartakus
    name: spartakus
  repos:
  - name: manifests
# Note: change this to the address of the project whose image paths have already been replaced
   # uri: https://github.com/kubeflow/manifests/archive/v1.2.0.tar.gz
    uri: http://aicloud-nexus.midea.com/repository/public-ftp/kubernetes/package/kubeflow/manifests-1.2.0/v1.2.0.tar.gz
  version: v1.2-branch

Log in to the deployment server and download the archive you just packaged

wget http://aicloud-nexus.midea.com/repository/public-ftp/kubernetes/package/kubeflow/manifests-1.2.0/v1.2.0.tar.gz
tar -xzvf v1.2.0.tar.gz
cp kfctl_k8s_istio.v1.2.0.yaml ./manifests-1.2.0
cd manifests-1.2.0

13. Deploy

kfctl apply -V -f kfctl_k8s_istio.v1.2.0.yaml

Check all the namespaces

# kubectl get pods -n cert-manager
# kubectl get pods -n istio-system
# kubectl get pods -n knative-serving
# kubectl get pods -n kubeflow

14. Access from a browser

Access the Kubeflow UI

kubectl get svc/istio-ingressgateway -n istio-system
NAME                   TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)                                                                                                                                      AGE
istio-ingressgateway   NodePort   12.80.127.69   <none>        15020:32661/TCP,80:31380/TCP,443:31390/TCP,31400:31400/TCP,15029:30345/TCP,15030:32221/TCP,15031:31392/TCP,15032:31191/TCP,15443:32136/TCP   5h14m

Because the service type is NodePort, the UI can be opened directly in a browser at

node IP + port 31380

http://10.18.3.228:31380
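
The port 31380 comes from the 80:31380/TCP entry in the PORT(S) column, and it may differ on another cluster. A sketch for pulling the NodePort of a given service port out of that column; node_port_for is a hypothetical helper name, and the sample string is a shortened copy of the listing above.

```shell
# Extract the NodePort mapped to a given service port from a PORT(S) string.
# Usage: node_port_for SERVICE_PORT 'port:nodePort/PROTO,...'
# node_port_for is a hypothetical helper name.
node_port_for() {
  printf '%s\n' "$2" | tr ',' '\n' | awk -F'[:/]' -v p="$1" '$1 == p { print $2 }'
}

ports='15020:32661/TCP,80:31380/TCP,443:31390/TCP,31400:31400/TCP'
http_port=$(node_port_for 80 "$ports")
echo "http://<node IP>:$http_port"
```

The same call with 443 returns the HTTPS NodePort.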

15. Test

Create a Notebook Server.

Then check the pods: the three pods below should appear, and if they are all Running, everything is working.

# kubectl get pods -A
NAMESPACE            NAME                                                         READY   STATUS             RESTARTS   AGE
anonymous            ml-pipeline-ui-artifact-ccf49557c-s5jk9                      2/2     Running            0          4m48s
anonymous            ml-pipeline-visualizationserver-866f48bf7b-pfr4l             2/2     Running            0          4m48s
anonymous            test-0                                                       2/2     Running            0          2m13s

IV. Deleting Kubeflow

kfctl delete -V -f kfctl_k8s_istio.v1.2.0.yaml

V. Troubleshooting

1. Deployment hangs at cert-manager

application.app.k8s.io/cert-manager configured
WARN[0161] Encountered error applying application cert-manager:  (kubeflow.error): Code 500 with message: Apply.Run : error when creating "/tmp/kout044650944": Internal error occurred: failed calling webhook "webhook.cert-manager.io": the server is currently unable to handle the request  filename="kustomize/kustomize.go:284"
WARN[0161] Will retry in 26 seconds.                     filename="kustomize/kustomize.go:285"

Solution

First check the pods

# kubectl get pods -n cert-manager
NAME                                       READY   STATUS             RESTARTS   AGE
cert-manager-7c75b559c4-xmsp6              1/1     Running            0          3m46s
cert-manager-cainjector-7f964fd7b5-fnsg7   1/1     Running            0          3m46s
cert-manager-webhook-566dd99d6-fnchp       0/1     ImagePullBackOff   0          3m46s

# kubectl describe pod cert-manager-webhook-566dd99d6-fnchp -n cert-manager
Events:
  Type     Reason       Age                    From               Message
  ----     ------       ----                   ----               -------
  Normal   Scheduled    4m57s                  default-scheduler  Successfully assigned cert-manager/cert-manager-webhook-566dd99d6-fnchp to node6
  Warning  FailedMount  4m26s (x7 over 4m58s)  kubelet            MountVolume.SetUp failed for volume "certs" : secret "cert-manager-webhook-tls" not found
  Warning  Failed       3m53s                  kubelet            Failed to pull image "quay.io/jetstack/cert-manager-webhook:v0.11.0": rpc error: code = Unknown desc = Error response from daemon: Get https://quay.io/v2/: dial tcp 54.197.99.84:443: connect: connection refused
  Warning  Failed       3m37s                  kubelet            Failed to pull image "quay.io/jetstack/cert-manager-webhook:v0.11.0": rpc error: code = Unknown desc = Error response from daemon: Get https://quay.io/v2/: dial tcp 54.156.10.58:443: connect: connection refused
  Normal   Pulling      3m9s (x3 over 3m53s)   kubelet            Pulling image "quay.io/jetstack/cert-manager-webhook:v0.11.0"
  Warning  Failed       3m9s (x3 over 3m53s)   kubelet            Error: ErrImagePull
  Warning  Failed       3m9s                   kubelet            Failed to pull image "quay.io/jetstack/cert-manager-webhook:v0.11.0": rpc error: code = Unknown desc = Error response from daemon: Get https://quay.io/v2/: dial tcp 52.4.104.248:443: connect: connection refused
  Normal   BackOff      2m58s (x4 over 3m52s)  kubelet            Back-off pulling image "quay.io/jetstack/cert-manager-webhook:v0.11.0"
  Warning  Failed       2m46s (x5 over 3m52s)  kubelet            Error: ImagePullBackOff

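The events show that the failure is caused by the image pull failing.

If the image address is correct, delete the pod so that it is recreated and the pull is retried: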

kubectl delete pod cert-manager-webhook-566dd99d6-fnchp -n cert-manager
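
When several pods fail at once, it helps to reduce the events to the list of images that could not be pulled. A sketch, assuming the "Failed to pull image" wording shown in the events above; failed_images is a hypothetical helper name, and the sample is a shortened copy of that output.

```shell
# List the distinct images that the kubelet failed to pull,
# reading `kubectl describe pod` output on stdin.
# failed_images is a hypothetical helper name.
failed_images() {
  sed -n 's/.*Failed to pull image "\([^"]*\)".*/\1/p' | sort -u
}

# Shortened sample of the Events section.
sample='  Warning  Failed   3m53s  kubelet  Failed to pull image "quay.io/jetstack/cert-manager-webhook:v0.11.0": rpc error: code = Unknown
  Warning  Failed   3m37s  kubelet  Failed to pull image "quay.io/jetstack/cert-manager-webhook:v0.11.0": rpc error: code = Unknown
  Normal   Pulling  3m9s   kubelet  Pulling image "quay.io/jetstack/cert-manager-webhook:v0.11.0"'

missing=$(printf '%s\n' "$sample" | failed_images)
echo "$missing"
```

On a cluster the input would come from `kubectl describe pod <pod> -n <namespace>`. The resulting names are the ones to add to images.txt or to fix in the manifests.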

2. After deleting Kubeflow, the PVCs all stay in Terminating

# kubectl get pvc -n kubeflow
NAME             STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
metadata-mysql   Terminating   pvc-4fe5c5f2-a187-4200-95c3-33de0c01f781   10Gi       RWO            nfs-client     23h
minio-pvc        Terminating   pvc-cd2dc964-a448-4c68-b0bb-5bc2183e5203   20Gi       RWO            nfs-client     23h
mysql-pv-claim   Terminating   pvc-514407db-00bd-4767-8043-a31b1a70e47f   20Gi       RWO            nfs-client     23h

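Solution: clear the finalizers on each stuck PVC (shown here for metadata-mysql; repeat for the others)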

# kubectl patch pvc metadata-mysql -p '{"metadata":{"finalizers":null}}' -n kubeflow
persistentvolumeclaim/metadata-mysql patched
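
The patch has to be repeated for every stuck claim. A sketch that collects the names of the Terminating PVCs from the listing, so they can be fed to the patch command in a loop. terminating_pvcs is a hypothetical helper name; it assumes the single-namespace column layout shown above.

```shell
# Print the names of PVCs whose STATUS is Terminating,
# reading single-namespace `kubectl get pvc` output on stdin.
# terminating_pvcs is a hypothetical helper name.
terminating_pvcs() {
  awk 'NR > 1 && $2 == "Terminating" { print $1 }'
}

# Sample listing standing in for real cluster output.
sample='NAME             STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
metadata-mysql   Terminating   pvc-4fe5c5f2-a187-4200-95c3-33de0c01f781   10Gi       RWO            nfs-client     23h
minio-pvc        Terminating   pvc-cd2dc964-a448-4c68-b0bb-5bc2183e5203   20Gi       RWO            nfs-client     23h'

stuck=$(printf '%s\n' "$sample" | terminating_pvcs)
echo "$stuck"

# Against a live cluster (not executed here):
# kubectl get pvc -n kubeflow | terminating_pvcs | while read -r pvc; do
#   kubectl patch pvc "$pvc" -n kubeflow -p '{"metadata":{"finalizers":null}}'
# done
```

Clearing finalizers skips the normal protection checks, so it is best reserved for claims that are known to be unused.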

After Kubeflow is deleted, the PVs are not removed automatically (the StorageClass uses reclaimPolicy: Retain)

# kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                      STORAGECLASS   REASON   AGE
jenkins-home                               60Gi       RWO            Retain           Bound      infrastructure/jenkins     jenkins-home            9d
pvc-0860e679-dd0b-48fc-8326-8a4c993410e6   20Gi       RWO            Retain           Released   kubeflow/minio-pvc         nfs-client              16m
pvc-13e06aac-f688-4d89-a467-93e5c6d6ecf6   20Gi       RWO            Retain           Released   kubeflow/mysql-pv-claim    nfs-client              16m
pvc-3e495907-53c4-468e-9aad-426c2f3e0851   10Gi       RWO            Retain           Released   kubeflow/katib-mysql       nfs-client              16m
pvc-3f59b851-0429-4e75-929b-33c05f8af66f   20Gi       RWO            Retain           Released   kubeflow/mysql-pv-claim    nfs-client              7h42m
pvc-5da0ac9b-c1c4-4aa1-b9ff-128174fe152c   10Gi       RWO            Retain           Released   kubeflow/metadata-mysql    nfs-client              7h42m
pvc-749f2098-8ba2-469c-8d78-f5889e24a9d4   5Gi        RWO            Retain           Released   anonymous/workspace-test   nfs-client              7h35m
pvc-94e61c9f-0b9c-4589-9e33-efb885c84233   20Gi       RWO            Retain           Released   kubeflow/minio-pvc         nfs-client              7h42m
pvc-a291c901-f2be-4994-b0d4-d83341879c3b   10Gi       RWO            Retain           Released   kubeflow/metadata-mysql    nfs-client              16m
pvc-a657f4c5-abce-47b4-8474-4ee4e60826b9   10Gi       RWO            Retain           Released   kubeflow/katib-mysql       nfs-client              7h42m

They have to be deleted manually

# kubectl delete pv pvc-0860e679-dd0b-48fc-8326-8a4c993410e6
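
Deleting the volumes one at a time is tedious when there are many. The sketch below lists only the Released PVs whose claim belonged to a given namespace, leaving bound volumes such as jenkins-home untouched. released_pvs is a hypothetical helper name; it relies on the column order of the listing above, where STATUS is the fifth field and CLAIM the sixth.

```shell
# Print the names of Released PVs whose claim was in the given namespace,
# reading `kubectl get pv` output on stdin.
# Usage: released_pvs NAMESPACE
# released_pvs is a hypothetical helper name.
released_pvs() {
  awk -v ns="$1" 'NR > 1 && $5 == "Released" && index($6, ns "/") == 1 { print $1 }'
}

# Shortened sample of the listing above.
sample='NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                      STORAGECLASS   REASON   AGE
jenkins-home                               60Gi       RWO            Retain           Bound      infrastructure/jenkins     jenkins-home            9d
pvc-0860e679-dd0b-48fc-8326-8a4c993410e6   20Gi       RWO            Retain           Released   kubeflow/minio-pvc         nfs-client              16m
pvc-749f2098-8ba2-469c-8d78-f5889e24a9d4   5Gi        RWO            Retain           Released   anonymous/workspace-test   nfs-client              7h35m'

to_delete=$(printf '%s\n' "$sample" | released_pvs kubeflow)
echo "$to_delete"
```

The printed names can then be passed to kubectl delete pv after a manual review. With a Retain policy the data on the backing storage stays in place and has to be cleaned up separately.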

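If pods such as katib-db, katib-mysql, or metadata-grpc-deployment stay Pending or fail during initialization, the most likely cause is that a persistent volume was not bound or mounted. Use describe to find the exact error.

Check whether the PVs and PVCs are bound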

# kubectl get pvc -A
# kubectl get pv