Merge "Add specification for dynamic pod binding"
This commit is contained in:
commit
11639b88d8
236
specs/dynamic-pod-binding.rst
Normal file
236
specs/dynamic-pod-binding.rst
Normal file
@ -0,0 +1,236 @@
|
||||
=================================
Dynamic Pod Binding in Tricircle
=================================

Background
===========

Most public cloud infrastructure is built with Availability Zones (AZs).
Each AZ consists of one or more discrete data centers, each with high
bandwidth, low latency network connections, and separate power and
facilities. These AZs offer cloud tenants the ability to operate production
applications and databases; applications deployed across multiple AZs are
more highly available, fault tolerant and scalable than those running in a
single data center.

In production clouds, each AZ is built from modularized OpenStack
deployments, and each OpenStack instance is one pod. Moreover, one AZ can
include multiple pods, and the pods are classified into different
categories. For example, the servers in one pod may serve only general
purposes, while other pods may be built for heavy-load CAD modeling with
GPUs. Pods in one AZ can therefore be divided into different groups, each
group serving a different purpose, with different VM cost and performance.

The concept of "pod" is introduced in the Tricircle to facilitate managing
OpenStack instances among AZs, and it is therefore transparent to cloud
tenants. The Tricircle maintains and manages a pod binding table which
records the mapping relationship between a cloud tenant and pods. When a
cloud tenant creates a VM or a volume, the Tricircle tries to assign a pod
based on the pod binding table.

Motivation
===========

In the resource allocation scenario, suppose a tenant creates a VM in one
pod and a new volume in another pod. If the tenant attempts to attach the
volume to the VM, the operation will fail. In other words, the volume has
to be in the same pod as the VM, otherwise the attachment cannot be
completed. Hence, the Tricircle needs to enforce the pod binding so as to
guarantee that the VM and the volume are created in the same pod.

In the capacity expansion scenario, when the resources in one pod are
exhausted, a new pod of the same type should be added into the AZ. New
resources of this type should then be provisioned in the newly added pod,
which requires dynamically changing the pod binding. The pod binding could
be changed dynamically by the Tricircle, or by the admin through the admin
API for maintenance purposes. For example, during a maintenance (upgrade,
repair) window, all new provisioning requests should be forwarded to the
running pod, not the one under maintenance.

Solution: dynamic pod binding
==============================

Capacity expansion inside one pod is quite a headache: you have to
estimate, calculate, monitor, simulate, test, and do online grey expansion
for controller nodes and network nodes whenever you add new machines to the
pod. This becomes a big challenge as more and more resources are added to
one pod, and eventually you reach the limits of a single OpenStack
instance. If a pod's resources are exhausted, or it reaches the limit for
provisioning new resources, the Tricircle needs to bind the tenant to a new
pod instead of expanding the current pod without limit. The Tricircle needs
to select a proper pod and keep the binding for a duration, during which
the tenant's VMs and volumes will be created in the same pod.

For example, suppose we have two groups of pods, and each group has 3 pods,
i.e.,

GroupA(Pod1, Pod2, Pod3) for general purpose VMs,

GroupB(Pod4, Pod5, Pod6) for CAD modeling.

Tenant1 is bound to Pod1 and Pod4 during the first phase, which lasts for
several months. In the first phase, we can simply assign a weight to each
pod, for example Pod1 with weight 1 and Pod2 with weight 2. This could be
done by adding one new field to the pod table, or with no field at all, by
simply ordering the pods by the time they were created in the Tricircle. In
this case, we use the pod creation time as the weight.

If the tenant wants to allocate a VM/volume for general purposes, Pod1
should be selected. This can be implemented with flavor or volume type
metadata. For a general VM/volume, there is no special tag in the flavor or
volume type metadata.

If the tenant wants to allocate a VM/volume for CAD modeling, Pod4 should
be selected. For a CAD modeling VM/volume, a special tag
"resource: CAD Modeling" in the flavor or volume type metadata determines
the binding, as sketched below.
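
The following sketch illustrates, with made-up pod and request structures,
how such a tag could be matched against flavor or volume type extra specs.
It is only an illustration of the matching rule, not the actual Tricircle
or Nova data model.

::

    # Hypothetical sketch: decide whether a pod can host a resource whose
    # flavor (or volume type) carries a resource affinity tag in its extra
    # specs. The pod and extra-spec structures are illustrative only.

    def pod_matches_affinity(pod, extra_specs):
        """Return True if the pod's affinity tag matches the extra specs."""
        tag = pod.get('resource_affinity_tag') or {}
        if not tag:
            # An untagged pod only serves general purpose requests.
            return not extra_specs
        # Every key-value pair of the pod tag must appear in the extra specs.
        return all(extra_specs.get(k) == v for k, v in tag.items())


    # Example: a CAD flavor matches Pod4 but not a general purpose pod.
    pod4 = {'pod_name': 'Pod4',
            'resource_affinity_tag': {'resource': 'CAD Modeling'}}
    pod1 = {'pod_name': 'Pod1', 'resource_affinity_tag': None}
    cad_specs = {'resource': 'CAD Modeling'}
    assert pod_matches_affinity(pod4, cad_specs)
    assert not pod_matches_affinity(pod1, cad_specs)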

When it is detected that there are no more resources in Pod1 or Pod4, the
Tricircle, based on the resource_affinity_tag, queries the pod table for
available pods which provision the specific type of resources. The field
resource_affinity_tag is a key-value pair. A pod is selected when its
key-value pair matches the flavor extra-spec or volume extra-spec. A tenant
will be bound to one pod in the group of pods sharing the same
resource_affinity_tag. In this case, the Tricircle obtains Pod2 and Pod3
for general purposes, as well as Pod5 and Pod6 for CAD purposes. The
Tricircle then needs to change the binding; for example, tenant1 needs to
be re-bound to Pod2 and Pod5.
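
Conceptually, changing the binding could amount to closing the current
binding record and opening a new one, roughly as in the sketch below. The
record structure and helper name are assumptions of this sketch; the
binding start/end fields are the ones introduced in "Data Model Impact"
later in this spec.

::

    from datetime import datetime

    # Conceptual sketch of re-binding a tenant: the current binding record
    # is closed and a new one is appended, so the history is preserved.

    def rebind_tenant(bindings, tenant_id, old_pod, new_pod):
        """Close the tenant's binding to old_pod and bind it to new_pod."""
        now = datetime.utcnow()
        for binding in bindings:
            if (binding['tenant_id'] == tenant_id
                    and binding['pod_name'] == old_pod
                    and binding['end_binding_time'] is None):
                binding['end_binding_time'] = now
        bindings.append({'tenant_id': tenant_id,
                         'pod_name': new_pod,
                         'start_binding_time': now,
                         'end_binding_time': None})


    bindings = [{'tenant_id': 'tenant1', 'pod_name': 'Pod1',
                 'start_binding_time': datetime(2016, 1, 1),
                 'end_binding_time': None}]
    rebind_tenant(bindings, 'tenant1', 'Pod1', 'Pod2')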

Implementation
===============

Measurement
-------------

To get information on the resource utilization of pods, the Tricircle needs
to conduct some measurements on the pods. The statistics task should be
done in the bottom pods.

For resource usage, cells currently provide an interface to retrieve the
usage of a cell [1]. OpenStack provides details on the capacity of a cell,
including disk and RAM, via the API for showing cell capacities [1].

If OpenStack is not running in cells mode, we can ask Nova to provide an
interface to show the usage details of an AZ. Moreover, an API for usage
queries at the host level is provided for admins [3], through which we can
obtain the details of a host, including CPU, memory, disk, and so on.

Cinder also provides an interface to retrieve the backend pool usage,
including update time, total capacity, free capacity and so on [2].

The Tricircle needs a task that collects the usage from the bottom pods on
a daily basis, to evaluate whether the threshold has been reached or not. A
threshold or headroom could be configured for each pod so that resources
are never driven to 100% exhaustion.

There should be no heavy processing on top, so aggregating the summary
information from the bottom can be done in the Tricircle. After collecting
the details, the Tricircle can judge whether a pod has reached its limit.
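
A minimal sketch of the threshold check such a task might perform is shown
below. It assumes the pool capacities have already been fetched from the
Cinder backend pool API [2]; the 80% headroom value and the shape of the
input are assumptions of this example, not decisions made by this spec.

::

    # Illustrative sketch of the daily threshold check. 'pools' is assumed
    # to be the capacity data already fetched from the Cinder backend pool
    # API [2]; the threshold value is made up for this example.

    DEFAULT_THRESHOLD = 0.8  # stop binding new tenants above 80% usage


    def pod_reaches_limit(pools, threshold=DEFAULT_THRESHOLD):
        """Return True if the pod's storage usage exceeds the threshold.

        pools: list of dicts such as
            [{'total_capacity_gb': 1000, 'free_capacity_gb': 100}, ...]
        A pod whose usage has never been collected (empty list) is treated
        as having plenty of room, as described in "Data Model Impact".
        """
        if not pools:
            return False
        total = sum(p['total_capacity_gb'] for p in pools)
        free = sum(p['free_capacity_gb'] for p in pools)
        if total == 0:
            return True
        used_ratio = (total - free) / float(total)
        return used_ratio >= threshold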

Tricircle
----------

The Tricircle needs a framework to support different binding policies
(filters).

Each pod is one OpenStack instance, including controller nodes and compute
nodes. E.g.,

::

                            +-> controller(s) - pod1 <--> compute nodes <---+
                            |                                               |
    The tricircle           +-> controller(s) - pod2 <--> compute nodes <---+  resource migration, if necessary
    (resource controller)   |    ....                                       |
                            +-> controller(s) - pod{N} <--> compute nodes <-+

The Tricircle selects a pod to decide to which pod's controllers the
requests should be forwarded. The controllers in the selected pod then do
their own scheduling.

The simplest binding filter is as follows. Line up all available pods in a
list and always select the first one. When all the resources in the first
pod have been allocated, remove it from the list. This is quite like how a
production cloud is built: at first, only a few pods are in the list, and
then more and more pods are added when there are not enough resources in
the current cloud. For example,

List1 for the general pool: Pod1 <- Pod2 <- Pod3
List2 for the CAD modeling pool: Pod4 <- Pod5 <- Pod6

If Pod1's resources are exhausted, Pod1 is removed from List1. List1 then
becomes: Pod2 <- Pod3.
If Pod4's resources are exhausted, Pod4 is removed from List2. List2 then
becomes: Pod5 <- Pod6.

If the tenant wants to allocate resources for a general VM, the Tricircle
selects Pod2. If the tenant wants to allocate resources for a CAD modeling
VM, the Tricircle selects Pod5.
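
A sketch of this simplest filter, assuming each pool is kept as an ordered
list of pod names and that resource exhaustion is detected elsewhere (for
instance by the measurement task above):

::

    # Illustrative sketch of the "first pod in the list" binding filter.
    # Pool contents and pod names are taken from the example above.

    class SimpleListBinding(object):
        def __init__(self, pods):
            # Pods are ordered by creation time; the head is always chosen.
            self.pods = list(pods)

        def select_pod(self):
            """Return the pod new resources should be bound to."""
            if not self.pods:
                raise RuntimeError('no pod left in this pool')
            return self.pods[0]

        def mark_exhausted(self, pod):
            """Remove a pod whose resources have been used up."""
            if pod in self.pods:
                self.pods.remove(pod)


    general_pool = SimpleListBinding(['Pod1', 'Pod2', 'Pod3'])
    cad_pool = SimpleListBinding(['Pod4', 'Pod5', 'Pod6'])

    general_pool.mark_exhausted('Pod1')
    cad_pool.mark_exhausted('Pod4')
    assert general_pool.select_pod() == 'Pod2'
    assert cad_pool.select_pod() == 'Pod5'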

Filtering
-------------

For the strategy of selecting pods, we need a series of filters. Before
dynamic pod binding is implemented, the binding criterion is hard coded to
select the first pod in the AZ. Hence, we need to design a series of filter
algorithms. Firstly, we plan to design an ALLPodsFilter which does no
filtering and passes all the available pods. Secondly, we plan to design an
AvailabilityZoneFilter which passes the pods matching the specified
availability zone. Thirdly, we plan to design a ResourceAffinityFilter
which passes the pods matching the specified resource type. Based on the
resource_affinity_tag, the Tricircle can be aware of which type of resource
the tenant wants to provision. In the future, we can add more filters,
which requires adding more information to the pod table.
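
A minimal sketch of what such a filter framework could look like. The base
class, the pod fields (e.g. az_name) and the request_spec structure are
assumptions of this sketch rather than the final Tricircle design.

::

    # Illustrative filter framework; pod and request structures are
    # assumptions for this sketch.

    class BasePodFilter(object):
        def is_pod_passed(self, pod, request_spec):
            raise NotImplementedError()

        def filter_pods(self, pods, request_spec):
            return [pod for pod in pods
                    if self.is_pod_passed(pod, request_spec)]


    class ALLPodsFilter(BasePodFilter):
        """Pass every available pod without any filtering."""
        def is_pod_passed(self, pod, request_spec):
            return True


    class AvailabilityZoneFilter(BasePodFilter):
        """Pass the pods belonging to the requested availability zone."""
        def is_pod_passed(self, pod, request_spec):
            return pod['az_name'] == request_spec['availability_zone']


    class ResourceAffinityFilter(BasePodFilter):
        """Pass the pods whose resource_affinity_tag matches the request."""
        def is_pod_passed(self, pod, request_spec):
            tag = pod.get('resource_affinity_tag') or {}
            extra_specs = request_spec.get('extra_specs') or {}
            if not tag:
                return not extra_specs
            return all(extra_specs.get(k) == v for k, v in tag.items())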

Weighting
-------------

After filtering all the pods, the Tricircle obtains the pods available to a
tenant. The Tricircle then needs to select the most suitable pod for the
tenant. Hence, we need to define a weight function to calculate the
corresponding weight of each pod. Based on the weights, the Tricircle
selects the pod which has the maximum weight value. When calculating the
weight of a pod, we need to design a series of weighers. We first take the
pod creation time into consideration when designing the weight function.
The second weigher is idle capacity, so as to select the pod which has the
most idle capacity. Other metrics will be added in the future, e.g., cost.
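
The weighers could be combined along the following lines. The field names,
the tie-breaking rule and the example data are assumptions made for this
sketch, not decisions taken by this spec.

::

    from datetime import datetime

    # Illustrative weigher sketch: idle capacity is the primary criterion,
    # pod creation time breaks ties (earlier pods preferred, mirroring
    # "creation time as the weight" in the first phase).

    def select_pod(pods):
        """Return the pod with the maximum weight, or None if none is left."""
        def weight(pod):
            return (pod['free_capacity_gb'], -pod['created_at'].toordinal())
        return max(pods, key=weight) if pods else None


    pods = [
        {'pod_name': 'Pod2', 'free_capacity_gb': 500,
         'created_at': datetime(2016, 1, 10)},
        {'pod_name': 'Pod3', 'free_capacity_gb': 800,
         'created_at': datetime(2016, 3, 1)},
    ]
    assert select_pod(pods)['pod_name'] == 'Pod3'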

Data Model Impact
==================

Firstly, we need to add a column "resource_affinity_tag" to the pod table,
which is used to store the key-value pair that is matched against the
flavor extra-spec and volume extra-spec.

Secondly, in the pod binding table, we need to add fields for the binding
start time and the binding end time, so that the history of the binding
relationship can be stored.

Thirdly, we need a table to store the usage of each pod for Cinder/Nova. We
plan to use a JSON object to store the usage information. Hence, even if
the usage structure changes, we don't need to update the table. If the
usage value is null, it means the usage has not been initialized yet. As
mentioned above, the usage could be refreshed on a daily basis. If it is
not initialized yet, it means there are still plenty of resources
available, and the pod can be scheduled as if it had not reached its usage
threshold.
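
A rough sketch of these schema additions, expressed as SQLAlchemy column
definitions. Apart from the concepts named above (the resource_affinity_tag
column, the binding start/end times and the per-pod usage stored as JSON),
the table and column names are illustrative and would be settled during
implementation.

::

    import sqlalchemy as sa

    pod_extra_columns = [
        # 1) pod table: key-value tag matched against flavor or volume type
        #    extra specs, e.g. "resource: CAD Modeling".
        sa.Column('resource_affinity_tag', sa.String(255), nullable=True),
    ]

    pod_binding_extra_columns = [
        # 2) pod binding table: keep the binding history, not only the
        #    current binding.
        sa.Column('start_binding_time', sa.DateTime, nullable=True),
        sa.Column('end_binding_time', sa.DateTime, nullable=True),
    ]

    pod_usage_table = sa.Table(
        # 3) usage of each pod for Nova/Cinder; the usage itself is stored
        #    as a JSON string so its structure can evolve, and NULL means it
        #    has not been collected yet.
        'pod_usage', sa.MetaData(),
        sa.Column('pod_id', sa.String(36), primary_key=True),
        sa.Column('usage', sa.Text, nullable=True),
        sa.Column('updated_at', sa.DateTime, nullable=True),
    )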

Dependencies
=============

None


Testing
========

None


Documentation Impact
=====================

None

Reference
==========

[1] http://developer.openstack.org/api-ref-compute-v2.1.html#showCellCapacities

[2] http://developer.openstack.org/api-ref-blockstorage-v2.html#os-vol-pool-v2

[3] http://developer.openstack.org/api-ref-compute-v2.1.html#showinfo