Pod Scheduling

Pod scheduling is the process of assigning pods to nodes in a
Kubernetes cluster. The Kubernetes scheduler (kube-scheduler) is
responsible for finding a suitable node to run each pod. There are
a few types of Pod Scheduling.
Types of Pod Scheduling

1. Node Name based scheduling
2. Node Label and Node Selector based scheduling
3. Taint and toleration based scheduling

Node Name based scheduling

This is a pod scheduling technique by which you define exactly
which node you want your pods to run on. Let's take the example
below. First, create a YAML file using a dry run.
# kubectl create deploy varelite1 --image=nginx --dry-run=client -o yaml > varelite1.yaml

# cat varelite1.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: varelite1
  name: varelite1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: varelite1
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: varelite1
    spec:
      containers:
      - image: nginx
        name: nginx
        resources: {}
status: {}

# kubectl explain Deployment.spec.template.spec | grep -i nodeName
  nodeName      <string>
    NodeName is a request to schedule this pod onto a specific node. If it is
    FQDN in the hostname field of the kernel (the nodename field of struct

# kubectl explain Deployment.spec.template.spec.nodeName
GROUP:      apps
KIND:       Deployment
VERSION:    v1

FIELD: nodeName <string>

DESCRIPTION:
    NodeName is a request to schedule this pod onto a specific node. If it is
    non-empty, the scheduler simply schedules this pod onto that node, assuming
    that it fits resource requirements.
Now we can go ahead and set the worker node's name in the *nodeName*
field at the place shown above. Once you define the worker node
there, the pods will always run on that node only. If you scale the
deployment up, all new pods will also run only on that worker node.
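The same *nodeName* field also works on a bare Pod without a
Deployment; a minimal sketch (the pod name nodename-demo is just an
example):
apiVersion: v1
kind: Pod
metadata:
  name: nodename-demo
spec:
  nodeName: worker1        # pod is bound directly to worker1, the scheduler is skipped
  containers:
  - image: nginx
    name: nginx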

Also, you can only define one worker node. Please see the YAML
below and deploy it to check this. We don't use this approach in
production. The important thing to note about this deployment is
that the scheduler is not used at all, because we have defined the
nodeName directly.
#  cat varelite1.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: varelite1
  name: varelite1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: varelite1
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: varelite1
    spec:
      nodeName: worker1
      containers:
      - image: nginx
        name: nginx
        resources: {}
status: {}


# kubectl get pods -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP               NODE      NOMINATED NODE   READINESS GATES
varelite1-54c9b6946f-x4nll   1/1     Running   0          31m   172.16.235.154   worker1   <none>           <none>
You can see that it started running on worker1. Let's scale it up
and see where the new pod lands. I changed replicas to 2.
# vim varelite1.yaml
# kubectl apply -f varelite1.yaml
deployment.apps/varelite1 configured

# kubectl get pods -o wide
NAME                         READY   STATUS              RESTARTS   AGE   IP               NODE      NOMINATED NODE   READINESS GATES
varelite1-54c9b6946f-rhtnf   0/1     ContainerCreating   0          4s    <none>           worker1   <none>           <none>
varelite1-54c9b6946f-x4nll   1/1     Running             0          35m   172.16.235.154   worker1   <none>           <none>
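Since nodeName bypasses the scheduler, the pod's events should not
contain the usual 'Scheduled' message from default-scheduler; you
can check this yourself with describe (output not shown here):
# kubectl describe pod varelite1-54c9b6946f-x4nll | grep -A5 -i events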

Node Label and Node Selector based scheduling

You can label the nodes and then, using nodeSelector, control where
the pods are scheduled. Below are the commands to label the nodes
and inspect the labels.
# kubectl label node worker1 role=db
node/worker1 labeled

# kubectl label node worker2 role=db
node/worker2 labeled

# kubectl label node worker2 region=usa
node/worker2 labeled

# kubectl label node worker1 region=usa
node/worker1 labeled

# kubectl label node worker1 zone=delhi
node/worker1 labeled

# kubectl label node worker2 zone=london
node/worker2 labeled

# kubectl get node --show-labels worker1
NAME      STATUS   ROLES    AGE   VERSION   LABELS
worker1   Ready    <none>   45d   v1.27.9   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=worker1,kubernetes.io/os=linux,region=usa,role=db,zone=delhi

# kubectl get node --show-labels worker2
NAME      STATUS   ROLES    AGE   VERSION   LABELS
worker2   Ready    <none>   45d   v1.27.9   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=worker2,kubernetes.io/os=linux,region=usa,role=db,zone=london

# kubectl get node --show-labels worker3
NAME      STATUS   ROLES    AGE   VERSION   LABELS
worker3   Ready    <none>   45d   v1.27.9   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=worker3,kubernetes.io/os=linux

# kubectl get node -l role=db
NAME      STATUS   ROLES    AGE   VERSION
worker1   Ready    <none>   45d   v1.27.9
worker2   Ready    <none>   45d   v1.27.9

# kubectl get node -l region=usa
NAME      STATUS   ROLES    AGE   VERSION
worker1   Ready    <none>   45d   v1.27.9
worker2   Ready    <none>   45d   v1.27.9

# kubectl get node -l zone=delhi
NAME      STATUS   ROLES    AGE   VERSION
worker1   Ready    <none>   45d   v1.27.9

# kubectl explain Deployment.spec.template.spec.nodeSelector
GROUP:      apps
KIND:       Deployment
VERSION:    v1

FIELD: nodeSelector <map[string]string>

DESCRIPTION:
    NodeSelector is a selector which must be true for the pod to fit on a node.
    Selector which must match a node's labels for the pod to be scheduled on
    that node. More info:
    https://kubernetes.io/docs/concepts/configuration/assign-pod-node/


# vim varelite2.yaml

# cat varelite2.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: varelite2
  name: varelite2
spec:
  replicas: 2
  selector:
    matchLabels:
      app: varelite2
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: varelite2
    spec:
      nodeSelector:
          region: usa
      containers:
      - image: nginx
        name: nginx
        resources: {}
status: {}
Now I have defined the nodeSelector as *region=usa*. Let's see
where our pods will be created.
# kubectl create -f varelite2.yaml
deployment.apps/varelite2 created

# kubectl get pods -o wide
NAME                         READY   STATUS              RESTARTS   AGE   IP       NODE      NOMINATED NODE   READINESS GATES
varelite2-7cf7487bf9-sb5jg   0/1     ContainerCreating   0          4s    <none>   worker2   <none>           <none>
varelite2-7cf7487bf9-scczx   0/1     ContainerCreating   0          4s    <none>   worker1   <none>           <none>
So it is creating your pods on worker1 and worker2, as *region=usa*
has been defined on these two worker nodes. Let's scale the replicas
up to 4 and see where they land. As expected, they will only be on
worker1 and worker2.
# vim varelite2.yaml

# kubectl apply -f varelite2.yaml
Warning: resource deployments/varelite2 is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.
deployment.apps/varelite2 configured

# kubectl get pods -o wide
NAME                         READY   STATUS    RESTARTS   AGE    IP               NODE      NOMINATED NODE   READINESS GATES
varelite2-7cf7487bf9-5zqg8   1/1     Running   0          6s     172.16.235.157   worker1   <none>           <none>
varelite2-7cf7487bf9-sb5jg   1/1     Running   0          113s   172.16.189.90    worker2   <none>           <none>
varelite2-7cf7487bf9-scczx   1/1     Running   0          113s   172.16.235.156   worker1   <none>           <none>
varelite2-7cf7487bf9-w2nl4   1/1     Running   0          6s     172.16.189.91    worker2   <none>           <none>

# kubectl scale --replicas=5 deploy varelite2
deployment.apps/varelite2 scaled

# kubectl get pods -o wide
NAME                         READY   STATUS              RESTARTS   AGE     IP               NODE      NOMINATED NODE   READINESS GATES
varelite2-7cf7487bf9-5qqg2   0/1     ContainerCreating   0          2s      <none>           worker1   <none>           <none>
varelite2-7cf7487bf9-5zqg8   1/1     Running             0          92s     172.16.235.157   worker1   <none>           <none>
varelite2-7cf7487bf9-sb5jg   1/1     Running             0          3m19s   172.16.189.90    worker2   <none>           <none>
varelite2-7cf7487bf9-scczx   1/1     Running             0          3m19s   172.16.235.156   worker1   <none>           <none>
varelite2-7cf7487bf9-w2nl4   1/1     Running             0          92s     172.16.189.91    worker2   <none>           <none>
Let's say you want to add an extra worker node to *region=usa*. You
can label it, and any new pods may then come up on that worker node
as well, but the pods that are already running will stay where they
are.
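A minimal sketch, assuming worker3 is the extra node you want to
add to the region (output not shown here):
# kubectl label node worker3 region=usa
# kubectl scale deploy varelite2 --replicas=8
The five existing pods stay on worker1 and worker2; only the newly
created replicas can also land on worker3.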

Note: nodeSelector can be applied at the deployment level as well
as at the namespace level. If you attach a node selector to a
namespace, you don't need to define the same thing in every
deployment. Before doing this you need to enable the feature in the
API server configuration. Edit the file below and add
*PodNodeSelector* to *--enable-admission-plugins=*.
# vim /etc/kubernetes/manifests/kube-apiserver.yaml

# cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep -i 'enable-admission-plugins'
    - --enable-admission-plugins=NodeRestriction,PodNodeSelector
Now you can use the *edit* command to make changes to the
namespace. There you can see its metadata, where the node selector
annotation will go.
# kubectl edit ns default

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
kind: Namespace
metadata:
  creationTimestamp: "2023-12-31T11:16:02Z"
  labels:
    kubernetes.io/metadata.name: default
  name: default
  resourceVersion: "41"
  uid: f049cee0-fdf8-404b-aa51-1d44aa5ddb7d
spec:
  finalizers:
  - kubernetes
status:
  phase: Active

You need to add the following annotation under *metadata* (note
that it is an annotation, not a label):

  annotations:
    scheduler.alpha.kubernetes.io/node-selector: zone=delhi
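As a quicker alternative to editing the namespace, you can set the
same annotation from the command line; a sketch, where zone=delhi
matches the label we put on worker1 earlier:
# kubectl annotate ns default scheduler.alpha.kubernetes.io/node-selector=zone=delhi
# kubectl describe ns default
After this, any new pod created in the default namespace will only
be scheduled on nodes labeled zone=delhi, which in this cluster is
worker1. You can remove the annotation again with
*kubectl annotate ns default scheduler.alpha.kubernetes.io/node-selector-*.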

Taint and toleration based scheduling

Some applications need dedicated tenancy rather than shared
tenancy. It means the application's pods do not want any other pods
to run on the node on which the application is running. This is
where *Taints and Tolerations* come in: they give you dedicated
tenancy.

A taint enables a node to repel pods, preventing them from being
scheduled on the tainted node, while a toleration is a Kubernetes
property which allows a pod to be scheduled on a node with a
matching taint.
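A taint has the general form key=value:effect, where the value is
optional and the effect is one of NoSchedule, PreferNoSchedule or
NoExecute (the angle-bracket names below are placeholders, not real
objects):
# kubectl taint node <node-name> <key>=<value>:<effect>
The matching toleration in the pod spec repeats the same key, value
and effect:
      tolerations:
      - key: <key>
        operator: Equal
        value: <value>
        effect: <effect>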

You may have noticed that no pods run on the master node when you
start an application. This is because the master node has been
tainted.
# kubectl describe node kb-master | grep -i taint
Taints:             node-role.kubernetes.io/control-plane:NoSchedule
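System pods such as kube-proxy still run on the control-plane node
because their DaemonSet carries a toleration matching this taint; a
sketch of such a toleration (not copied from this cluster):
      tolerations:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule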
Let's taint a node. 
# kubectl taint node worker1 money:NoSchedule
node/worker1 tainted

# kubectl describe node worker1 | grep -i taint
Taints:             money:NoSchedule
Now create a deployment and see whether any of its pods land on
worker1.
# kubectl create deploy varelite3 --image=nginx --replicas=4
deployment.apps/varelite3 created

# kubectl get pods -o wide
NAME                        READY   STATUS              RESTARTS   AGE   IP              NODE      NOMINATED NODE   READINESS GATES
varelite3-f8d5ffc5b-4jpd9   1/1     Running             0          9s    172.16.189.93   worker2   <none>           <none>
varelite3-f8d5ffc5b-76498   0/1     ContainerCreating   0          9s    <none>          worker2   <none>           <none>
varelite3-f8d5ffc5b-9t6m6   0/1     ContainerCreating   0          9s    <none>          worker3   <none>           <none>
varelite3-f8d5ffc5b-qx4x7   0/1     ContainerCreating   0          9s    <none>          worker3   <none>           <none>
So no pod is running on worker1, because it is tainted and is no
longer available to ordinary workloads. Now we will define a
toleration so that pods can run on the tainted node *worker1*.
Generate the YAML with a dry run and then add a *tolerations* block
to the pod spec.
# kubectl create deploy varelite4 --image=nginx --dry-run=client -o yaml > varelite4.yaml

# cat varelite4.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: varelite4
  name: varelite4
spec:
  replicas: 1
  selector:
    matchLabels:
      app: varelite4
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: varelite4
    spec:
      tolerations:
        - key: money
          effect: NoSchedule
          operator: Exists
      containers:
      - image: nginx
        name: nginx
        resources: {}
status: {}

# kubectl create -f varelite4.yaml
deployment.apps/varelite4 created

# kubectl get pods -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP               NODE      NOMINATED NODE   READINESS GATES
varelite4-58988784db-r8nbc   1/1     Running   0          7s    172.16.235.159   worker1   <none>           <none>

Now try to scale it up and see what happens next. You will see that scaled pods may be running on other nodes as well.

# kubectl scale deploy varelite4 --replicas=3
deployment.apps/varelite4 scaled

# kubectl get pods -owide
NAME                         READY   STATUS    RESTARTS   AGE   IP               NODE      NOMINATED NODE   READINESS GATES
varelite4-58988784db-6vz2g   1/1     Running   0          8s    172.16.182.27    worker3   <none>           <none>
varelite4-58988784db-992r9   1/1     Running   0          8s    172.16.189.94    worker2   <none>           <none>
varelite4-58988784db-r8nbc   1/1     Running   0          92s   172.16.235.159   worker1   <none>           <none>
So a toleration only allows pods onto the tainted node; it does not
restrict them to it. The scheduler can still place the scaled-up
pods on other, untainted nodes. If you want everything to run only
on worker1, you must combine *Node Label and Node Selector*
scheduling with taints and tolerations.
# kubectl label node worker1 type=money
node/worker1 labeled

# kubectl edit deploy varelite4
deployment.apps/varelite4 edited
I have edited the deployment and added a nodeSelector that matches
the label assigned to worker1.

    spec:
      nodeSelector:
        type: money
# kubectl get pods -owide
NAME                        READY   STATUS    RESTARTS   AGE   IP               NODE      NOMINATED NODE   READINESS GATES
varelite4-84fdc988f-4h8b2   1/1     Running   0          75s   172.16.235.164   worker1   <none>           <none>
varelite4-84fdc988f-j22g6   1/1     Running   0          82s   172.16.235.162   worker1   <none>           <none>
varelite4-84fdc988f-j8zwp   1/1     Running   0          78s   172.16.235.163   worker1   <none>           <none>
So you can see that each pod is running on worker1.
# kubectl describe node | grep -i taint
Taints:             node-role.kubernetes.io/control-plane:NoSchedule
Taints:             money:NoSchedule
Taints:             <none>
Taints:             <none>
You can untaint a node using the command below.
# kubectl taint node worker1 money:NoSchedule-
node/worker1 untainted
Note: When you taint a node on which pods are already running,
there is no impact on the running pods. This is the behavior of the
NoSchedule effect.
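You can verify this yourself: taint a node that already has running
pods, list the pods again, and then remove the taint; the existing
pods should stay in Running state. A quick sketch, assuming worker2
currently has pods on it (output not shown):
# kubectl taint node worker2 money:NoSchedule
# kubectl get pods -o wide
# kubectl taint node worker2 money:NoSchedule-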

Let's taint a node using the *NoExecute* effect instead. This
evicts the pods that were already running on the node before it was
tainted, and the Deployment recreates them elsewhere. Here I have a
deployment *varelite5* with a single pod running on worker3.
# kubectl get pods -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP              NODE      NOMINATED NODE   READINESS GATES
varelite5-658c5ccd8b-ggrdb   1/1     Running   0          12s   172.16.182.29   worker3   <none>           <none>

# kubectl taint node worker3 money:NoExecute
node/worker3 tainted

# kubectl get pods -o wide
NAME                         READY   STATUS              RESTARTS   AGE   IP       NODE      NOMINATED NODE   READINESS GATES
varelite5-658c5ccd8b-wv8lf   0/1     ContainerCreating   0          1s    <none>   worker1   <none>           <none>
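If you want a pod to survive a *NoExecute* taint, give it a
matching toleration. For NoExecute you can also set
*tolerationSeconds*, which keeps the pod on the node for that many
seconds after the taint is applied before it is evicted; a minimal
sketch for the *money* taint used above:
      tolerations:
      - key: money
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 60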
