Pod scheduling is the process of assigning pods to nodes in a
Kubernetes cluster. The Kubernetes scheduler is the service
responsible for finding a node to run each pod. There are a few
types of pod scheduling.
Types of Pod Scheduling
1. Node Name based scheduling
2. Node Label and Node Selector based scheduling
3. Taint and toleration based scheduling
Node Name based scheduling is the technique by which you explicitly
define the node on which you want your pods to run. Let's take the
example below and create a YAML file using a dry run.
# kubectl create deploy varelite1 --image=nginx --dry-run=client -o yaml > varelite1.yaml
# cat varelite1.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: varelite1
  name: varelite1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: varelite1
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: varelite1
    spec:
      containers:
      - image: nginx
        name: nginx
        resources: {}
status: {}
# kubectl explain Deployment.spec.template.spec | grep -i nodeName
nodeName <string>
NodeName is a request to schedule this pod onto a specific node. If it is
FQDN in the hostname field of the kernel (the nodename field of struct
# kubectl explain Deployment.spec.template.spec.nodeName
GROUP: apps
KIND: Deployment
VERSION: v1
FIELD: nodeName <string>
DESCRIPTION:
NodeName is a request to schedule this pod onto a specific node. If it is
non-empty, the scheduler simply schedules this pod onto that node, assuming
that it fits resource requirements.
Now we can go ahead and set the worker node's name in the *nodeName*
field shown above. Once you define the worker node there, the pods
will always run on that node only. If you scale the deployment up,
all new pods will also run only on that worker node.
And you can only define one worker node. Please see the below YAML
and deploy it to check. We don't use this in production.
The important thing to note about this deployment is that the
scheduler is not used at all, because we have directly defined the
nodeName.
# cat varelite1.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: varelite1
  name: varelite1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: varelite1
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: varelite1
    spec:
      nodeName: worker1
      containers:
      - image: nginx
        name: nginx
        resources: {}
status: {}
# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
varelite1-54c9b6946f-x4nll 1/1 Running 0 31m 172.16.235.154 worker1 <none> <none>
You can see that it started running on worker1. Let's scale it up
and see where the new pod lands. I changed replicas to 2.
# vim varelite1.yaml
# kubectl apply -f varelite1.yaml
deployment.apps/varelite1 configured
# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
varelite1-54c9b6946f-rhtnf 0/1 ContainerCreating 0 4s <none> worker1 <none> <none>
varelite1-54c9b6946f-x4nll 1/1 Running 0 35m 172.16.235.154 worker1 <none> <none>
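One quick way to confirm that the scheduler was bypassed (a simple check, using the pod name from the output above) is to look at the pod's events: with nodeName set you will not find a Scheduled event from the default-scheduler, only kubelet events such as Pulling, Created and Started.
# kubectl describe pod varelite1-54c9b6946f-x4nll | grep -A 10 -i events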
The second technique is Node Label and Node Selector based
scheduling: you label the nodes and then use nodeSelector in the pod
spec to control where the pods are scheduled. Below are the related
labeling commands.
# kubectl label node worker1 role=db
node/worker1 labeled
# kubectl label node worker2 role=db
node/worker2 labeled
# kubectl label node worker2 region=usa
node/worker2 labeled
# kubectl label node worker1 region=usa
node/worker1 labeled
# kubectl label node worker1 zone=delhi
node/worker1 labeled
# kubectl label node worker2 zone=london
node/worker2 labeled
# kubectl get node --show-labels worker1
NAME STATUS ROLES AGE VERSION LABELS
worker1 Ready <none> 45d v1.27.9 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=worker1,kubernetes.io/os=linux,region=usa,role=db,zone=delhi
# kubectl get node --show-labels worker2
NAME STATUS ROLES AGE VERSION LABELS
worker2 Ready <none> 45d v1.27.9 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=worker2,kubernetes.io/os=linux,region=usa,role=db,zone=london
# kubectl get node --show-labels worker3
NAME STATUS ROLES AGE VERSION LABELS
worker3 Ready <none> 45d v1.27.9 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=worker3,kubernetes.io/os=linux
# kubectl get node -l role=db
NAME STATUS ROLES AGE VERSION
worker1 Ready <none> 45d v1.27.9
worker2 Ready <none> 45d v1.27.9
# kubectl get node -l region=usa
NAME STATUS ROLES AGE VERSION
worker1 Ready <none> 45d v1.27.9
worker2 Ready <none> 45d v1.27.9
# kubectl get node -l zone=delhi
NAME STATUS ROLES AGE VERSION
worker1 Ready <none> 45d v1.27.9
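Two related label commands that come in handy (a small sketch; we are not actually running the removal here, since the labels are needed later in this demo): -L prints selected label values as extra columns, and suffixing a key with a dash removes that label from a node.
# kubectl get nodes -L role,region,zone
# kubectl label node worker1 zone-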
# kubectl explain Deployment.spec.template.spec.nodeSelector
GROUP: apps
KIND: Deployment
VERSION: v1
FIELD: nodeSelector <map[string]string>
DESCRIPTION:
NodeSelector is a selector which must be true for the pod to fit on a node.
Selector which must match a node's labels for the pod to be scheduled on
that node. More info:
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
# vim varelite2.yaml
# cat varelite2.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: varelite2
  name: varelite2
spec:
  replicas: 2
  selector:
    matchLabels:
      app: varelite2
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: varelite2
    spec:
      nodeSelector:
        region: usa
      containers:
      - image: nginx
        name: nginx
        resources: {}
status: {}
Now I have defined the nodeSelector as *region=usa*. Let's see
where our pods get created.
# kubectl create -f varelite2.yaml
deployment.apps/varelite2 created
# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
varelite2-7cf7487bf9-sb5jg 0/1 ContainerCreating 0 4s <none> worker2 <none> <none>
varelite2-7cf7487bf9-scczx 0/1 ContainerCreating 0 4s <none> worker1 <none> <none>
So it creates your pods on worker1 and worker2, as *region=usa* has
been defined on these two worker nodes.
Let's scale replicas up to 4 and see where they land. As expected,
they will land only on worker1 and worker2.
# vim varelite2.yaml
# kubectl apply -f varelite2.yaml
Warning: resource deployments/varelite2 is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.
deployment.apps/varelite2 configured
# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
varelite2-7cf7487bf9-5zqg8 1/1 Running 0 6s 172.16.235.157 worker1 <none> <none>
varelite2-7cf7487bf9-sb5jg 1/1 Running 0 113s 172.16.189.90 worker2 <none> <none>
varelite2-7cf7487bf9-scczx 1/1 Running 0 113s 172.16.235.156 worker1 <none> <none>
varelite2-7cf7487bf9-w2nl4 1/1 Running 0 6s 172.16.189.91 worker2 <none> <none>
# kubectl scale --replicas=5 deploy varelite2
deployment.apps/varelite2 scaled
# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
varelite2-7cf7487bf9-5qqg2 0/1 ContainerCreating 0 2s <none> worker1 <none> <none>
varelite2-7cf7487bf9-5zqg8 1/1 Running 0 92s 172.16.235.157 worker1 <none> <none>
varelite2-7cf7487bf9-sb5jg 1/1 Running 0 3m19s 172.16.189.90 worker2 <none> <none>
varelite2-7cf7487bf9-scczx 1/1 Running 0 3m19s 172.16.235.156 worker1 <none> <none>
varelite2-7cf7487bf9-w2nl4 1/1 Running 0 92s 172.16.189.91 worker2 <none> <none>
Let's say you want to add an extra worker node to *region=usa*. You
can label it, and any newly created pods may then come up on that
worker node as well, but the pods that are already running will stay
where they are.
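For example (a sketch; worker3 is the node that has no region label yet, and the resulting pod list is omitted here):
# kubectl label node worker3 region=usa
# kubectl scale deploy varelite2 --replicas=8
The new pods may now land on worker3 as well, while the five pods that were already running stay on worker1 and worker2.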
Note: nodeSelector can be applied at the deployment level as well as
at the namespace level. If you assign a node selector to a namespace,
you don't need to define the same thing in every deployment.
Before doing this, you need to enable that feature in the API server
configuration. Edit the below file and add *PodNodeSelector* to
*--enable-admission-plugins=*.
# vim /etc/kubernetes/manifests/kube-apiserver.yaml
# cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep -i 'enable-admission-plugins'
- --enable-admission-plugins=NodeRestriction,PodNodeSelector
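The kube-apiserver runs as a static pod, so the kubelet restarts it automatically when this manifest changes. You can verify it has come back up before proceeding (a quick check; component=kube-apiserver is the label kubeadm puts on the static pod):
# kubectl get pods -n kube-system -l component=kube-apiserver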
Now you can use the *edit* command to make changes to the namespace.
There you can see its metadata, including the labels.
# kubectl edit ns default
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
kind: Namespace
metadata:
  creationTimestamp: "2023-12-31T11:16:02Z"
  labels:
    kubernetes.io/metadata.name: default
  name: default
  resourceVersion: "41"
  uid: f049cee0-fdf8-404b-aa51-1d44aa5ddb7d
spec:
  finalizers:
  - kubernetes
status:
  phase: Active
You need to add the below annotation under metadata (at the same
level as labels):
  annotations:
    scheduler.alpha.kubernetes.io/node-selector: zone=delhi
After this, every pod created in the default namespace will be
scheduled only on nodes labeled zone=delhi.
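Equivalently, you can set the same annotation from the command line instead of editing the namespace object (a sketch using the same zone=delhi selector):
# kubectl annotate ns default scheduler.alpha.kubernetes.io/node-selector=zone=delhi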
Some applications are not suited to shared tenancy, meaning their
pods should not share a node with any other pods. This is where the
*Taint and Tolerations* type of pod scheduling comes in: it provides
dedicated tenancy.
A taint enables a node to repel pods, preventing them from running
on the tainted node, while a toleration is a Kubernetes property
which allows a pod to be scheduled on a node with a matching taint.
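The general shape of the taint command is shown below (a sketch; dedicated=db is a hypothetical key/value, not one used in this demo). The effect must be one of NoSchedule, PreferNoSchedule, or NoExecute, the value part is optional, and appending a trailing dash removes the taint again.
# kubectl taint nodes <node-name> <key>[=<value>]:<effect>
# kubectl taint nodes worker2 dedicated=db:PreferNoSchedule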
You may have noticed that no application pods ever run on the master
node when you start an application. This is because the master
(control-plane) node has been tainted.
# kubectl describe node kb-master | grep -i taint
Taints: node-role.kubernetes.io/control-plane:NoSchedule
Let's taint a node.
# kubectl taint node worker1 money:NoSchedule
node/worker1 tainted
# kubectl describe node worker1 | grep -i taint
Taints: money:NoSchedule
Now create a deployment and see whether any of its pods land on
worker1 or not.
# kubectl create deploy varelite3 --image=nginx --replicas=4
deployment.apps/varelite3 created
# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
varelite3-f8d5ffc5b-4jpd9 1/1 Running 0 9s 172.16.189.93 worker2 <none> <none>
varelite3-f8d5ffc5b-76498 0/1 ContainerCreating 0 9s <none> worker2 <none> <none>
varelite3-f8d5ffc5b-9t6m6 0/1 ContainerCreating 0 9s <none> worker3 <none> <none>
varelite3-f8d5ffc5b-qx4x7 0/1 ContainerCreating 0 9s <none> worker3 <none> <none>
So no pod is running on worker1, as it is tainted and therefore not
available to ordinary applications. Now we will define a toleration
so that pods can run on the tainted node *worker1*.
# kubectl create deploy varelite4 --image=nginx --dry-run=client -o yaml > varelite4.yaml
# cat varelite4.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: varelite4
  name: varelite4
spec:
  replicas: 1
  selector:
    matchLabels:
      app: varelite4
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: varelite4
    spec:
      tolerations:
      - key: money
        effect: NoSchedule
        operator: Exists
      containers:
      - image: nginx
        name: nginx
        resources: {}
status: {}
# kubectl create -f varelite4.yaml
deployment.apps/varelite4 created
# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
varelite4-58988784db-r8nbc 1/1 Running 0 7s 172.16.235.159 worker1 <none> <none>
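As a side note, our taint has only a key (money) and no value, so operator: Exists is the right match. If the taint had been created with a value, for example money=yes:NoSchedule, the toleration would typically use operator: Equal instead (a sketch, not applied in this demo):
      tolerations:
      - key: money
        operator: Equal
        value: "yes"
        effect: NoSchedule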
Now try to scale it up and see what happens. You will see that the new pods may run on other nodes as well.
# kubectl scale deploy varelite4 --replicas=3
deployment.apps/varelite4 scaled
# kubectl get pods -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
varelite4-58988784db-6vz2g 1/1 Running 0 8s 172.16.182.27 worker3 <none> <none>
varelite4-58988784db-992r9 1/1 Running 0 8s 172.16.189.94 worker2 <none> <none>
varelite4-58988784db-r8nbc 1/1 Running 0 92s 172.16.235.159 worker1 <none> <none>
So the tainted node only accepts pods whose tolerations match its
taint, but the scheduler is still free to place those pods on the
other nodes as well. If you want everything to run only on worker1,
you must combine *Node Label and Node Selector* scheduling with
taints and tolerations.
# kubectl label node worker1 type=money
node/worker1 labeled
# kubectl edit deploy varelite4
deployment.apps/varelite4 edited
I have edited the deployment and added a nodeSelector matching the
label which has been assigned to worker1.
    spec:
      nodeSelector:
        type: money
# kubectl get pods -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
varelite4-84fdc988f-4h8b2 1/1 Running 0 75s 172.16.235.164 worker1 <none> <none>
varelite4-84fdc988f-j22g6 1/1 Running 0 82s 172.16.235.162 worker1 <none> <none>
varelite4-84fdc988f-j8zwp 1/1 Running 0 78s 172.16.235.163 worker1 <none> <none>
So you can see that each pod is running on worker1.
# kubectl describe node | grep -i taint
Taints: node-role.kubernetes.io/control-plane:NoSchedule
Taints: money:NoSchedule
Taints: <none>
Taints: <none>
You can untaint a node using the below command.
# kubectl taint node worker1 money:NoSchedule-
node/worker1 untainted
Note: When you taint a node on which pods are already running with
the NoSchedule effect, there is no impact on the running pods; only
new pods are kept away.
Let's taint a node using the *NoExecute* effect instead. That will
evict the pods which were running before the node was tainted, and
they will be recreated on other nodes.
# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
varelite5-658c5ccd8b-ggrdb 1/1 Running 0 12s 172.16.182.29 worker3 <none> <none>
# kubectl taint node worker3 money:NoExecute
node/worker3 tainted
# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
varelite5-658c5ccd8b-wv8lf 0/1 ContainerCreating 0 1s <none> worker1 <none> <none>
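You can see that the pod was evicted from worker3 and recreated on worker1. If you want a pod to tolerate a NoExecute taint only for a limited time before being evicted, you can add tolerationSeconds to the toleration (a sketch using the same money key; not applied in this demo):
      tolerations:
      - key: money
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 300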