Install PrimeHub Enterprise on MicroK8S Single Node (Ubuntu)

This document will guide you to install MicroK8s on a single node and PrimeHub Enterprise with a easy script.

Provision a cluster

MicroK8s supports multi-platform, we demonstrate it in the following spec:

Ubuntu 18.04 LTS
Kubernetes 1.19 version
IP address: EXTERNAL-IP
Networking: allow port 80 for HTTP

Requirement

Git

Please follow the os-specific command to install git command

cURL

cURL is a command-line tool that allows us to do HTTP requests from shell. To install cURL, please follow the os-specific method. For example.

Ubuntu

sudo apt update
sudo apt install curl

RHEL/CentOS

yum install curl

Clone PrimeHub Repository

We provide a install script which makes the installation much easier to create a MicroK8s-single-node Kubernetes.

Please make sure the machine has installed cURL.

Download the script primehub-install

git clone https://github.com/InfuseAI/primehub.git

Install PrimeHub required binaries

./primehub/install/primehub-install required-bin

This will install the required commands onto ~/bin. You should append the ~/bin to your PATH variables, or use the following command to append and read from the .bashrc

echo "export PATH=$HOME/bin:$PATH" >> ~/.bashrc
source ~/.bashrc

Install MicroK8s Single Node

Run the create singlenode command:

./primehub/install//primehub-install create singlenode --k8s-version 1.19

After the first execution, you will see the message. Because it adds the user to microk8s group and needs to relogin:

[Require Action] Please relogin this session and run create singlenode again

After relogin, run the same command again to finish the single-node provision:

./primehub-install create singlenode --k8s-version 1.19

During the installation, you might run into troubles or need to modify the default settings, please check the TroubleShooting section.

Quick Verification

Access nginx-ingress with your EXTERNAL-IP:

curl http://${EXTERNAL-IP}

The output will be 404 because no Ingress resources are defined yet:

default backend - 404

Configurations

Configure GPU (optional)

Download and install Nvidia GPU drivers from official website
Enable GPU feature

Please be aware that if MicroK8s v1.21, you need to modify the default_runtime_name and DO NOT enable GPU feature by microk8s enable gpu

if MicroK8s is prior to v1.20 (<= 1.20), enable gpu feature by default
```
microk8s.enable gpu
```

if MicroK8s is 1.21, please follow the steps

Modify the file /var/snap/microk8s/current/args/containerd-template.toml and manually change the default_runtime_name to to nvidia-container-runtime

# default_runtime_name is the default runtime name to use.
- default_runtime_name = "${RUNTIME}"
+ default_runtime_name = "nvidia-container-runtime"

Restart the microk8s
```
microk8s stop
microk8s start
```

Install Nvidia Device Plugin by helm

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install -n kube-system nvidia-device-plugin nvdp/nvidia-device-plugin

Verify GPU

kubectl describe node | grep 'nvidia.com/gpu'

Verify within MicroK8s cluster

deviceplugin_pod=$(kubectl -n kube-system get pod | grep nvidia-device-plugin | awk '{print $1}')

kubectl -n kube-system exec -t ${deviceplugin_pod} nvidia-smi

Ref: https://github.com/NVIDIA/gpu-operator/issues/163#issuecomment-794445253

Configure snap (optional)

snap will update packages automatically. If you plan to use it in a production-ready environment, you could

Set the update process run on a special time window, or delay it before a date
Disable it by setting a wrong proxy

Please see the manual from snapcraft.

Using Self-hosted DNS (Optional)

If your domain name is not hosted by public DNS server, using the self-hosted DNS server instead.

Validate domain name for PrimeHub. regexr.com

# The domain name must match 
[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*

Configure K8S CoreDNS

kubectl edit cm -n kube-system coredns

Please modify the following line and fill your own DNS server

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . <Fill your own DNS server>
        cache 30
        loop
        reload
        loadbalance
    }
...

After changing the config map of coredns, please use the following command to restart coredns and apply the new configuration.

kubectl rollout -n kube-system restart deploy coredns

Reference

https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/

Install PrimeHub

Check available stable versions

./primehub/install/primehub-install

Install the latest stable version by default

./primehub-install create primehub

Or install the specific version such as v3.7.0 as below

./primehub-install create primehub --primehub-version <version>

Please enter the domain name of PrimeHub

The password of KC_PASSWORD and PH_PASSWORD will be auto generated if input empty value.

Enable Model Deployment (Optional)

To manually enable the Model Deployment feature, please modify the file ~/.primehub/config/microk8s/helm_override/primehub.yaml and add following contents at the end of the file.

modelDeployment:
  enabled: true

After the modification of the primehub config file, run the command with a specified version to apply the change.

./primehub-install upgrade primehub

Monitor the PrimeHub installation

Once running the PrimeHub installation, meanwhile, just open another terminal session to run the command to monitor the installation.

watch 'kubectl get pod -n hub'

Once the primehub-bootstrap is running, use the the command to watch the log of primehub bootstrap

kubectl logs -n hub $(kubectl get pod -n hub | grep primehub-bootstrap | cut -d' ' -f1) -f

Apply license

By default, a trial license is applied. See trial license limitations. Please contact InfuseAI for the license inquiry for a valid commercial license

Run the command to show the default license.

$ ./primehub-install license
[PrimeHub License]
  status:
    expired: unexpired
    expired_at: "2038-01-19T03:14:00Z"
    licensed_to: Default
    max_group: 0
    started_at: "2020-01-01T00:00:00Z"

Once you have a commercial license file from InfuseAI. Save the license_crd.yml file under the same folder where primehub-install script is and run the the command to apply license.

./primehub-install apply-license

New to PrimeHub

Initially, PrimeHub has a built-in user phadmin, a built-in group phusers, several instance types/image which are set Global. phadmin can launch a notebook quickly by using these resources.

Now PrimeHub CE is ready, see Launch Notebook to launch your very first JupyterNotebook on PrimeHub. Also see User Guide to have the fundamental knowledge of PrimeHub.

Troubleshooting

Generate a log file for diagnosis.

./primehub-install diagnose

You may run into troubles during the installation, we list some of them, hopefully, you find resolutions here.

Using valid hostname and domain

Validate the hostname of the node with the following regular expression. regexr.com
```
[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*
```

Symptoms

When hostname is invalid, the installation might suspend at the microk8s status phase, because the cluster is not running:

ubuntu@foo_bar:~$ ./primehub-install create singlenode --insecure-registry foo-bar:5000 --k8s-version 1.19
[Search] Folder primehub-v2.6.2
[Not Found] Folder primehub-v2.6.2
[Search] tarball primehub-v2.6.2.tar.gz
[Not Found] tarball primehub-v2.6.2.tar.gz
[Search] primehub helm chart with version: v2.6.2
[Not Found] primehub v2.6.2 in infuseai helm chart
[Skip] Don't need PrimeHub release package when target != primehub
[check] microk8s status

Resolution

You would get microk8s is not running from the status result:

ubuntu@foo_bar:~$ microk8s.status
microk8s is not running. Use microk8s.inspect for a deeper inspection.

If you run the inspect command , it shows everything running:

ubuntu@foo_bar:~$ microk8s.inspect
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-flanneld is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-apiserver is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Service snap.microk8s.daemon-proxy is running
  Service snap.microk8s.daemon-kubelet is running
  Service snap.microk8s.daemon-scheduler is running
  Service snap.microk8s.daemon-controller-manager is running
  Service snap.microk8s.daemon-etcd is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy openSSL information to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster

Building the report tarball
  Report tarball is at /var/snap/microk8s/1489/inspection-report-20200713_093426.tar.gz

You could find the root cause in the kubelet's inspection-report logs:

ubuntu@foo_bar:~/inspection-report/snap.microk8s.daemon-kubelet$ cat systemctl.log
● snap.microk8s.daemon-kubelet.service - Service for snap application microk8s.daemon-kubelet
   Loaded: loaded (/etc/systemd/system/snap.microk8s.daemon-kubelet.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-07-13 09:34:10 UTC; 14s ago
 Main PID: 10180 (kubelet)
    Tasks: 12 (limit: 2329)
   CGroup: /system.slice/snap.microk8s.daemon-kubelet.service
           └─10180 /snap/microk8s/1489/kubelet --kubeconfig=/var/snap/microk8s/1489/credentials/kubelet.config --cert-dir=/var/snap/microk8s/1489/certs --client-ca-file=/var/snap/microk8s/1489/certs/ca.crt --anonymous-auth=false --network-plugin=cni --root-dir=/var/snap/microk8s/common/var/lib/kubelet --fail-swap-on=false --cni-conf-dir=/var/snap/microk8s/1489/args/cni-network/ --cni-bin-dir=/snap/microk8s/1489/opt/cni/bin/ --feature-gates=DevicePlugins=true --eviction-hard=memory.available<100Mi,nodefs.available<1Gi,imagefs.available<1Gi --container-runtime=remote --container-runtime-endpoint=/var/snap/microk8s/common/run/containerd.sock --node-labels=microk8s.io/cluster=true

Jul 13 09:34:24 foo_bar microk8s.daemon-kubelet[10180]: E0713 09:34:24.139863   10180 kubelet.go:2263] node "foo_bar" not found
Jul 13 09:34:24 foo_bar microk8s.daemon-kubelet[10180]: E0713 09:34:24.240093   10180 kubelet.go:2263] node "foo_bar" not found
Jul 13 09:34:24 foo_bar microk8s.daemon-kubelet[10180]: E0713 09:34:24.340308   10180 kubelet.go:2263] node "foo_bar" not found
Jul 13 09:34:24 foo_bar microk8s.daemon-kubelet[10180]: E0713 09:34:24.440529   10180 kubelet.go:2263] node "foo_bar" not found
Jul 13 09:34:24 foo_bar microk8s.daemon-kubelet[10180]: E0713 09:34:24.540736   10180 kubelet.go:2263] node "foo_bar" not found
Jul 13 09:34:24 foo_bar microk8s.daemon-kubelet[10180]: E0713 09:34:24.640915   10180 kubelet.go:2263] node "foo_bar" not found
Jul 13 09:34:24 foo_bar microk8s.daemon-kubelet[10180]: I0713 09:34:24.713484   10180 kubelet_node_status.go:294] Setting node annotation to enable volume controller attach/detach
Jul 13 09:34:24 foo_bar microk8s.daemon-kubelet[10180]: I0713 09:34:24.714871   10180 kubelet_node_status.go:70] Attempting to register node foo_bar
Jul 13 09:34:24 foo_bar microk8s.daemon-kubelet[10180]: E0713 09:34:24.721575   10180 kubelet_node_status.go:92] Unable to register node "foo_bar" with API server: Node "foo_bar" is invalid: metadata.name: Invalid value: "foo_bar": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
Jul 13 09:34:24 foo_bar microk8s.daemon-kubelet[10180]: E0713 09:34:24.741101   10180 kubelet.go:2263] node "foo_bar" not found

Jul 13 09:34:24 foo_bar microk8s.daemon-kubelet[10180]: E0713 09:34:24.721575 10180 kubelet_node_status.go:92] Unable to register node "foo_bar" with API server: Node "foo_bar" is invalid: metadata.name: Invalid value: "foo_bar": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is 'a-z0-9?(.a-z0-9?)*')

Please fix the hostname, domain name and reinstall microk8s. You could destroy it with this command:

$ ./primehub-install destroy singlenode

Duplicated image registry settings causing containerd dead

The install script supports --insecure-registry to create a node with extra docker registry settings.

It is possible that we execute installation command multiple times, in this case , it would have set up duplicated registries in the containerd's configuration file.

./primehub-install create singlenode --insecure-registry foo-bar:5000 --k8s-version 1.19

Symptom

You couldn't run a new pod, it was pending after scheduled to a node without any reason:

ubuntu@foo-bar:~$ kubectl describe pod foobar-6bfcbb6974-c2gt7
Name:           foobar-6bfcbb6974-c2gt7
Namespace:      default
Priority:       0
Node:           foo-bar/
Labels:         pod-template-hash=6bfcbb6974
                run=foobar
Annotations:    <none>
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/foobar-6bfcbb6974
Containers:
  foobar:
    Image:        ubuntu
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-pm69p (ro)
Conditions:
  Type           Status
  PodScheduled   True
Volumes:
  default-token-pm69p:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-pm69p
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  5m48s  default-scheduler  Successfully assigned default/foobar-6bfcbb6974-c2gt7 to foo-bar

The inspect command showed the containered service not running:

ubuntu@foo-bar:~$ microk8s.inspect
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-flanneld is running
 FAIL:  Service snap.microk8s.daemon-containerd is not running
For more details look at: sudo journalctl -u snap.microk8s.daemon-containerd
  Service snap.microk8s.daemon-apiserver is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Service snap.microk8s.daemon-proxy is running
  Service snap.microk8s.daemon-kubelet is running
  Service snap.microk8s.daemon-scheduler is running
  Service snap.microk8s.daemon-controller-manager is running
  Service snap.microk8s.daemon-etcd is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy openSSL information to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster

Building the report tarball
  Report tarball is at /var/snap/microk8s/1489/inspection-report-20200713_095624.tar.gz

Resolution

check containerd-template.toml if there is any duplicated registry settings:

ubuntu@foo-bar:~$ cat /var/snap/microk8s/current/args/containerd-template.toml  | grep -A10  plugins.cri.registry
    [plugins.cri.registry]
      [plugins.cri.registry.mirrors]
        [plugins.cri.registry.mirrors."foo-bar:5000"]
          endpoint = ["http://foo-bar:5000"]
        [plugins.cri.registry.mirrors."foo-bar:5000"]
          endpoint = ["http://foo-bar:5000"]
        [plugins.cri.registry.mirrors."docker.io"]
          endpoint = ["https://registry-1.docker.io"]
        [plugins.cri.registry.mirrors."localhost:32000"]
          endpoint = ["http://localhost:32000"]
  [plugins.diff-service]
    default = ["walking"]
  [plugins.linux]
    shim = "containerd-shim"
    runtime = "${RUNTIME}"
    runtime_root = ""
    no_shim = false
    shim_debug = true
  [plugins.scheduler]

Remove duplicated settings and restart MicroK8s, the pod could be started:

ubuntu@foo-bar:~$ kubectl get pod
NAME                      READY   STATUS              RESTARTS   AGE
foobar-6bfcbb6974-c2gt7   0/1     ContainerCreating   0          10m

Ensure CNI IP Range not overlaid with your Egress network

MicroK8s uses iptables as kube-proxy implementation, the overlay IP ranges might cause unexpected behavior with networking. For example, a client in a pod might not access same IP range external endpoint (banned by iptables).

ubuntu@foo-bar:~$ ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc fq_codel state UP group default qlen 1000
    link/ether 06:62:0d:d6:82:d4 brd ff:ff:ff:ff:ff:ff
    inet 172.31.39.226/20 brd 172.31.47.255 scope global dynamic eth0
       valid_lft 2651sec preferred_lft 2651sec
    inet6 fe80::462:dff:fed6:82d4/64 scope link
       valid_lft forever preferred_lft forever
3: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UNKNOWN group default
    link/ether ba:0f:7c:6d:8d:56 brd ff:ff:ff:ff:ff:ff
    inet 10.1.46.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::b80f:7cff:fe6d:8d56/64 scope link
       valid_lft forever preferred_lft forever
4: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UP group default qlen 1000
    link/ether ee:98:e2:60:4e:18 brd ff:ff:ff:ff:ff:ff
    inet 10.1.46.1/24 scope global cni0
       valid_lft forever preferred_lft forever
    inet6 fe80::ec98:e2ff:fe60:4e18/64 scope link
       valid_lft forever preferred_lft forever

Microk8s uses flannel as the default CNI, it would create two interface flannel.1 and cni0. In the example eth0 IP range(172.31.0.0) is not overlaid with CNI's IP range (10.1.0.0).

Resolution

If the IP Range is overlaid with each other, please fix it by update CNI's IP range configuration /var/snap/microk8s/current/args/flannel-network-mgr-config:

{"Network": "10.1.0.0/16", "Backend": {"Type": "vxlan"}}

We might change it to 10.3.0.0/16 and restart microk8s

{"Network": "10.3.0.0/16", "Backend": {"Type": "vxlan"}}

We have to delete cni0 to make it re-create in the new configuration (flannel.1 would be updated automatically):

sudo ip link delete cni0

You might find all pods in the new IP ranges:

ubuntu@foo-bar:~$ k get pod -A -o wide
NAMESPACE        NAME                                             READY   STATUS             RESTARTS   AGE   IP              NODE      NOMINATED NODE   READINESS GATES
default          foobar-6bfcbb6974-c2gt7                          1/1     Running            1          29m   10.3.49.116     foo-bar   <none>           <none>
ingress-nginx    nginx-ingress-controller-676d5ccd4c-gmm5f        1/1     Running            3          32m   172.31.39.226   foo-bar   <none>           <none>
ingress-nginx    nginx-ingress-default-backend-5b967cf596-crhdp   1/1     Running            2          32m   10.3.49.117     foo-bar   <none>           <none>
kube-system      coredns-9b8997588-wm2n7                          1/1     Running            4          34m   10.3.49.113     foo-bar   <none>           <none>
kube-system      hostpath-provisioner-7b9cb5cdb4-d9hlm            1/1     Running            3          34m   10.3.49.115     foo-bar   <none>           <none>
kube-system      tiller-deploy-969865475-h6gtb                    1/1     Running            2          32m   10.3.49.114     foo-bar   <none>           <none>
metacontroller   metacontroller-0                                 1/1     Running            2          31m   10.3.49.112     foo-bar   <none>           <none>

DNS configuration might reset to the default values when enable/disable microk8s addons

If you update the coredns ConfigMap, please keep a backup to restore it after every microk8s addons enabling or disabling.

Symptom

Applications are not able to resolve some domain names, they are registered in your internal DNS.

ubuntu@foo-bar:~$ kubectl -n kube-system get cm coredns -o yaml

Here is the default configuration, we could customize the core-dns by editing it:

apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . 8.8.8.8 8.8.4.4
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"Corefile":".:53 {\n    errors\n    health\n    ready\n    kubernetes cluster.local in-addr.arpa ip6.arpa {\n      pods insecure\n      fallthrough in-addr.arpa ip6.arpa\n    }\n    prometheus :9153\n    forward . 8.8.8.8 8.8.4.4\n    cache 30\n    loop\n    reload\n    loadbalance\n}\n"},"kind":"ConfigMap","metadata":{"annotations":{},"labels":{"addonmanager.kubernetes.io/mode":"EnsureExists","k8s-app":"kube-dns"},"name":"coredns","namespace":"kube-system"}}
  creationTimestamp: "2020-07-13T09:44:53Z"
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
    k8s-app: kube-dns
  name: coredns
  namespace: kube-system
  resourceVersion: "7079"
  selfLink: /api/v1/namespaces/kube-system/configmaps/coredns
  uid: 7b0c948b-083a-49a0-a18e-49b173d78c5a

Try editing forward like this:

forward . 8.8.8.8

Enable some addons either:

microk8s.enable istio
microk8s.enable gpu

The coredns ConfigMap will be reset to the default values.

Resolution

There is no a good way to tackle it.

Please remember to backup your settings and apply it after every addons enabling or disable.