Add specification for dynamic pod binding
1. What is the problem?
In production clouds, each availability zone (AZ) is built from modularized OpenStack instances. Each OpenStack instance acts as a pod, and one AZ consists of multiple pods. The pods within an AZ are classified into different categories for different purposes, for instance, general purpose, CAD modeling and so on. Each tenant is bound to one pod, where it creates various types of resources. However, such a binding relationship should be dynamic instead of static. For instance, when some resources in the pod are exhausted, the tenant needs to be bound to a new pod in the same AZ.

2. What is the solution to the problem?
To deal with the above problem, the Tricircle dynamically binds tenants to a pod which has available resources. We call this feature dynamic pod binding.

3. What features need to be implemented in the Tricircle to realize the solution?
To realize dynamic pod binding, the following features need to be implemented in the Tricircle.
1) Collect the usage in each pod daily to evaluate whether the threshold is reached or not.
2) Filter and weigh all the available pods for cloud tenants to bind a tenant to a proper pod.
3) Manage and maintain all the active and historical binding relationships.

This spec explains in detail how the Tricircle binds pods to tenants dynamically.

Blueprint: https://blueprints.launchpad.net/tricircle/+spec/dynamic-pod-binding
Change-Id: Ib429a59d3d216e578f9c451d84c1fe9a333cf050
parent da6c08f93e
commit a1602f7e5e
specs/dynamic-pod-binding.rst (normal file, 236 lines added)
@@ -0,0 +1,236 @@
=================================
Dynamic Pod Binding in Tricircle
=================================

Background
===========

Most public cloud infrastructure is built with Availability Zones (AZs).
Each AZ consists of one or more discrete data centers, each with
high-bandwidth and low-latency network connections, and separate power and
facilities. These AZs offer cloud tenants the ability to operate production
applications and databases; applications deployed across multiple AZs are
more highly available, fault tolerant and scalable than those in a single
data center.

In production clouds, each AZ is built from modularized OpenStack instances,
and each OpenStack instance is one pod. Moreover, one AZ can include multiple
pods. The pods are classified into different categories. For example, servers
in one pod are only for general purposes, while other pods may be built
for heavy-load CAD modeling with GPUs. So the pods in one AZ can be divided
into different groups. Different pod groups serve different purposes, and
the cost and performance of their VMs also differ.

The concept "pod" is created for the Tricircle to facilitate managing
OpenStack instances among AZs, and is therefore transparent to cloud
tenants. The Tricircle maintains and manages a pod binding table which
records the mapping relationship between a cloud tenant and pods. When the
cloud tenant creates a VM or a volume, the Tricircle tries to assign a pod
based on the pod binding table.

Motivation
===========

In the resource allocation scenario, suppose a tenant creates a VM in one pod
and a new volume in another pod. If the tenant attempts to attach the
volume to the VM, the operation will fail. In other words, the volume should
be in the same pod as the VM, otherwise the attachment cannot be completed.
Hence, the Tricircle needs to maintain the pod binding so as to guarantee
that the VM and the volume are created in one pod.

In the capacity expansion scenario, when resources in one pod are exhausted,
a new pod of the same type should be added into the AZ. New resources of
this type should then be provisioned in the newly added pod, which requires
a dynamic change of the pod binding. The pod binding could be changed
dynamically by the Tricircle, or by an admin through the admin API for
maintenance purposes. For example, during a maintenance (upgrade, repair)
window, all new provisioning requests should be forwarded to a running pod,
not to the one under maintenance.

Solution: dynamic pod binding
==============================

Capacity expansion inside one pod is quite a headache: you have to
estimate, calculate, monitor, simulate, test, and do online grey expansion
for controller nodes and network nodes whenever you add new machines to the
pod. This becomes a big challenge as more and more resources are added to one
pod, and eventually you will reach the limit of one OpenStack instance. If a
pod's resources are exhausted or reach the limit for provisioning new
resources, the Tricircle needs to bind the tenant to a new pod instead of
expanding the current pod without limit. The Tricircle needs to select a
proper pod and keep the binding for a duration; during this period, VMs and
volumes for one tenant will be created in the same pod.

For example, suppose we have two groups of pods, and each group has 3 pods,
i.e.,

GroupA(Pod1, Pod2, Pod3) for general purpose VM,

GroupB(Pod4, Pod5, Pod6) for CAD modeling.

Tenant1 is bound to Pod1 and Pod4 during the first phase for several months.
In the first phase, we can just add a weight to each pod, for example, Pod1
with weight 1 and Pod2 with weight 2. This could be done by adding one new
field in the pod table, or with no field at all, just ordering the pods by
the time they were created in the Tricircle. In this case, we use the pod
creation time as the weight.

If the tenant wants to allocate a VM/volume for a general VM, Pod1 should be
selected. This can be implemented with flavor or volume type metadata. For a
general VM/volume, there is no special tag in the flavor or volume type
metadata.

If the tenant wants to allocate a VM/volume for a CAD modeling VM, Pod4 should
be selected. For a CAD modeling VM/volume, a special tag
"resource: CAD Modeling" in the flavor or volume type metadata determines the
binding.

When it is detected that there are no more resources in Pod1 and Pod4, the
Tricircle queries the pod table, based on the resource_affinity_tag, for
available pods which provision a specific type of resource. The
resource_affinity_tag field is a key-value pair. A pod is selected when a
matching key-value pair exists in the flavor extra-spec or volume extra-spec.
A tenant will be bound to one pod in one group of pods with the same
resource_affinity_tag. In this case, the Tricircle obtains Pod2 and Pod3 for
general purpose, as well as Pod5 and Pod6 for CAD purpose. The Tricircle then
needs to change the binding; for example, Tenant1 needs to be bound to Pod2
and Pod5.
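
The tag matching described above could look roughly like the following
sketch. It is only an illustration: the in-memory ``PODS`` mapping, the
``TAG_KEY`` constant and the ``matching_pods`` helper are hypothetical names,
and it assumes that a general-purpose pod carries no tag at all.

```python
# Hypothetical pod table content: pod name -> resource_affinity_tag.
# An empty dict stands for a general-purpose pod (an assumption).
PODS = {
    "Pod1": {},
    "Pod2": {},
    "Pod3": {},
    "Pod4": {"resource": "CAD Modeling"},
    "Pod5": {"resource": "CAD Modeling"},
    "Pod6": {"resource": "CAD Modeling"},
}

# Assumed key of the special tag in flavor / volume type extra specs.
TAG_KEY = "resource"


def matching_pods(pods, extra_specs):
    """Return the names of pods whose tag matches the given extra specs.

    A request without the special tag matches only untagged pods;
    a tagged request matches only pods with the same tag value.
    """
    wanted = extra_specs.get(TAG_KEY)
    return [name for name, tag in pods.items() if tag.get(TAG_KEY) == wanted]


general_pods = matching_pods(PODS, {})
cad_pods = matching_pods(PODS, {"resource": "CAD Modeling"})
```

Here ``general_pods`` holds the pods of GroupA and ``cad_pods`` the pods of
GroupB; removing the exhausted Pod1 and Pod4 from these lists leaves the
rebinding candidates.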

Implementation
===============

Measurement
-------------

To get the resource utilization of pods, the Tricircle needs to conduct some
measurements on pods. The statistics task should be done in the bottom pod.

For resource usage, cells currently provide an interface to retrieve the
usage of cells [1]. OpenStack provides details of the capacity of a cell,
including disk and RAM, via the API for showing cell capacities [1].

If OpenStack is not running in cells mode, we can ask Nova to provide
an interface to show the usage details in an AZ. Moreover, an API for usage
queries at the host level is provided for admins [3], through which we can
obtain details of a host, including CPU, memory, disk, and so on.

Cinder also provides an interface to retrieve the backend pool usage,
including updated time, total capacity, free capacity and so on [2].

The Tricircle needs to have one task to collect the usage in the bottom on a
daily basis, to evaluate whether the threshold is reached or not. A threshold
or headroom could be configured for each pod, so that its resources are never
exhausted to 100%.

There should be no heavy processing on top, so summing up the information
from the bottom can be done in the Tricircle. After collecting the details,
the Tricircle can judge whether a pod has reached its limit.
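
As a rough illustration of the threshold check, the sketch below judges a pod
from already collected usage numbers. The shape of the ``usage`` dict, the
resource names and the default threshold of 0.8 are assumptions made for this
example; the spec does not fix any of them.

```python
def usage_ratio(used, total):
    """Fraction of a resource in use; a zero total counts as fully used."""
    return used / total if total else 1.0


def pod_reached_limit(usage, threshold=0.8):
    """Return True if any resource in the pod is at or above the threshold.

    usage: dict mapping a resource name to {"used": ..., "total": ...}
           (hypothetical structure for the collected data).
    threshold: fraction of capacity treated as the limit (assumed value).
    """
    return any(
        usage_ratio(resource["used"], resource["total"]) >= threshold
        for resource in usage.values()
    )


# Hypothetical daily sample collected from one bottom pod.
sample_usage = {
    "ram_mb": {"used": 900, "total": 1000},
    "disk_gb": {"used": 10, "total": 100},
}
limit_reached = pod_reached_limit(sample_usage)
```

Checking every resource separately means a pod counts as full as soon as one
resource, such as RAM, runs short, even if others still have headroom.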

Tricircle
----------

The Tricircle needs a framework to support different binding policies
(filters).

Each pod is one OpenStack instance, including controller nodes and compute
nodes. E.g.,

::

                            +-> controller(s) - pod1 <--> compute nodes <---+
                                                                            |
    The tricircle           +-> controller(s) - pod2 <--> compute nodes <---+ resource migration, if necessary
    (resource controller)               ....                                |
                            +-> controller(s) - pod{N} <--> compute nodes <-+

The Tricircle selects a pod to decide which controller the requests should be
forwarded to. Then the controllers in the selected pod will do their own
scheduling.

The simplest binding filter is as follows. Line up all available pods in a
list and always select the first one. When all the resources in the first pod
have been allocated, remove it from the list. This is quite like how a
production cloud is built: at first, only a few pods are in the list, and then
more and more pods are added if there are not enough resources in the current
cloud. For example,

List1 for general pool: Pod1 <- Pod2 <- Pod3
List2 for CAD modeling pool: Pod4 <- Pod5 <- Pod6

If Pod1's resources are exhausted, Pod1 is removed from List1. List1 is
changed to: Pod2 <- Pod3.
If Pod4's resources are exhausted, Pod4 is removed from List2. List2 is
changed to: Pod5 <- Pod6.

If the tenant wants to allocate resources for a general VM, the Tricircle
selects Pod2. If the tenant wants to allocate resources for a CAD modeling VM,
the Tricircle selects Pod5.
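
The list-based policy above can be sketched in a few lines. The ``PodList``
class and its method names are hypothetical and only mirror the List1/List2
example; they are not part of the Tricircle code base.

```python
from collections import deque


class PodList:
    """Ordered pool of pods; the first pod is always the one selected."""

    def __init__(self, pods):
        self._pods = deque(pods)

    def select(self):
        """Return the first available pod, or None if the pool is empty."""
        return self._pods[0] if self._pods else None

    def mark_exhausted(self, pod):
        """Drop a pod from the pool once its resources are used up."""
        if pod in self._pods:
            self._pods.remove(pod)

    def add(self, pod):
        """Append a newly built pod to the end of the pool."""
        self._pods.append(pod)


general_pool = PodList(["Pod1", "Pod2", "Pod3"])
cad_pool = PodList(["Pod4", "Pod5", "Pod6"])

first_general = general_pool.select()
general_pool.mark_exhausted("Pod1")
cad_pool.mark_exhausted("Pod4")
next_general = general_pool.select()
next_cad = cad_pool.select()
```

After Pod1 and Pod4 are marked as exhausted, ``next_general`` and ``next_cad``
are Pod2 and Pod5, matching the example in the text.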

Filtering
-------------

For the strategy of selecting pods, we need a series of filters. Before
dynamic pod binding is implemented, the binding criteria are hard coded to
select the first pod in the AZ. Hence, we need to design a series of filter
algorithms. Firstly, we plan to design an ALLPodsFilter which does no
filtering and passes all the available pods. Secondly, we plan to design an
AvailabilityZoneFilter which passes the pods matching the specified
availability zone. Thirdly, we plan to design a ResourceAffinityFilter which
passes the pods matching the specified resource type. Based on the
resource_affinity_tag, the Tricircle can be aware of which type of resource
the tenant wants to provision. In the future, we can add more filters, which
requires adding more information to the pod table.
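
One possible shape for such a filter framework is sketched below. Only the
three filter names come from this spec; the ``passes`` method, the
``filter_pods`` helper, the dict-based pod and request records, and the extra
pod "Pod7" in another AZ are assumptions made for illustration.

```python
class ALLPodsFilter:
    """Does no filtering: every available pod passes."""

    def passes(self, pod, request):
        return True


class AvailabilityZoneFilter:
    """Passes pods located in the availability zone named in the request."""

    def passes(self, pod, request):
        return pod["az_name"] == request["az_name"]


class ResourceAffinityFilter:
    """Passes pods whose resource_affinity_tag equals the requested one."""

    def passes(self, pod, request):
        return pod["resource_affinity_tag"] == request["resource_affinity_tag"]


def filter_pods(pods, request, filters):
    """Keep only the pods accepted by every filter in the chain."""
    return [pod for pod in pods
            if all(f.passes(pod, request) for f in filters)]


# Hypothetical pod records; None stands for "no special tag".
PODS = [
    {"name": "Pod2", "az_name": "az1", "resource_affinity_tag": None},
    {"name": "Pod5", "az_name": "az1",
     "resource_affinity_tag": {"resource": "CAD Modeling"}},
    {"name": "Pod7", "az_name": "az2",
     "resource_affinity_tag": {"resource": "CAD Modeling"}},
]
request = {"az_name": "az1",
           "resource_affinity_tag": {"resource": "CAD Modeling"}}
chain = [ALLPodsFilter(), AvailabilityZoneFilter(), ResourceAffinityFilter()]
selected = filter_pods(PODS, request, chain)
```

Keeping each criterion in its own class lets new filters be added later
without touching the existing ones, which is the extensibility the paragraph
above asks for.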

Weighting
-------------

After filtering all the pods, the Tricircle obtains the available pods for a
tenant. The Tricircle needs to select the most suitable pod for the tenant.
Hence, we need to define a weight function to calculate the corresponding
weight of each pod. Based on the weights, the Tricircle selects the pod which
has the maximum weight value. To calculate the weight of a pod, we need
to design a series of weighers. The first one takes the pod creation time
into consideration. The second one is the idle capacity, to select the pod
which has the most idle capacity. Other metrics will be added in the future,
e.g., cost.
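
A minimal sketch of how several weighers might be combined follows. It
assumes that an earlier creation time should yield a higher weight (so that
the oldest pod is preferred, as in the list example), that each weigher's raw
values are normalized to the range 0 to 1 across the candidates before being
summed, and that pods are plain dicts; all function and field names are
hypothetical.

```python
def creation_time_weigher(pod):
    """Older pods weigh more: negate the creation timestamp (assumption)."""
    return -pod["created_at"]


def idle_capacity_weigher(pod):
    """Pods with more idle capacity weigh more."""
    return pod["idle_capacity"]


def normalize(values):
    """Scale raw weights to [0, 1]; identical values all map to 0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0] * len(values)
    return [(value - lowest) / (highest - lowest) for value in values]


def select_pod(pods, weighers):
    """Return the pod with the maximum total weight.

    weighers: list of (weigher_function, multiplier) pairs.
    """
    totals = [0.0] * len(pods)
    for weigh, multiplier in weighers:
        scores = normalize([weigh(pod) for pod in pods])
        totals = [total + multiplier * score
                  for total, score in zip(totals, scores)]
    best_index = max(range(len(pods)), key=totals.__getitem__)
    return pods[best_index]


# Hypothetical candidates left over after filtering.
candidates = [
    {"name": "Pod2", "created_at": 100, "idle_capacity": 10},
    {"name": "Pod3", "created_at": 200, "idle_capacity": 50},
]
oldest = select_pod(candidates, [(creation_time_weigher, 1.0)])
roomiest = select_pod(candidates, [(creation_time_weigher, 1.0),
                                   (idle_capacity_weigher, 2.0)])
```

Normalizing per weigher keeps timestamps and capacities on the same scale, so
the multipliers alone decide how much each metric counts.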

Data Model Impact
==================

Firstly, we need to add a column "resource_affinity_tag" to the pod table,
which is used to store the key-value pair, to match flavor extra-spec and
volume extra-spec.

Secondly, in the pod binding table, we need to add fields for the binding
start time and binding end time, so that the history of the binding
relationship can be stored.
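
To illustrate how start and end times keep the binding history, here is a
small in-memory sketch. The ``PodBinding`` record, its field names and the
``rebind`` helper are hypothetical; an open-ended binding is assumed to be
represented by an empty end time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class PodBinding:
    """One row of a hypothetical pod binding table."""
    tenant_id: str
    pod_id: str
    start_time: datetime
    end_time: Optional[datetime] = None  # None means the binding is active


def rebind(bindings, tenant_id, old_pod_id, new_pod_id, now):
    """Close the active binding to old_pod_id and open one to new_pod_id."""
    for binding in bindings:
        if (binding.tenant_id == tenant_id
                and binding.pod_id == old_pod_id
                and binding.end_time is None):
            binding.end_time = now
    bindings.append(PodBinding(tenant_id, new_pod_id, start_time=now))


def active_bindings(bindings, tenant_id):
    """Return the pods the tenant is currently bound to."""
    return [b.pod_id for b in bindings
            if b.tenant_id == tenant_id and b.end_time is None]


history = [PodBinding("tenant1", "Pod1", datetime(2016, 1, 1))]
rebind(history, "tenant1", "Pod1", "Pod2", datetime(2016, 6, 1))
current = active_bindings(history, "tenant1")
```

The old row is kept with its end time filled in rather than deleted, so both
active and historical bindings remain queryable.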

Thirdly, we need a table to store the usage of each pod for Cinder/Nova.
We plan to use a JSON object to store the usage information. Hence, even if
the usage structure changes, we don't need to update the table. If the usage
value is null, that means the usage has not been initialized yet. As
mentioned above, the usage could be refreshed on a daily basis. If it is not
initialized yet, it means there are still lots of resources available, so
the pod can be scheduled just like a pod that has not reached its usage
threshold.
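
The null handling described above might be sketched as follows. The JSON
layout, the ``pod_is_schedulable`` helper and the default threshold of 0.8 are
assumptions for this example only.

```python
import json


def pod_is_schedulable(usage_json, threshold=0.8):
    """Decide whether a pod may still receive new resources.

    usage_json: JSON text stored in the usage table, or None if the usage
                has not been initialized yet (hypothetical layout:
                {"resource": {"used": ..., "total": ...}, ...}).
    threshold: fraction of capacity treated as the limit (assumed value).
    """
    if usage_json is None:
        # No measurement yet: treat the pod as having plenty of resources.
        return True
    usage = json.loads(usage_json)
    for resource in usage.values():
        if resource["used"] >= threshold * resource["total"]:
            return False
    return True


fresh_pod = pod_is_schedulable(None)
busy_pod = pod_is_schedulable(
    json.dumps({"ram_mb": {"used": 900, "total": 1000}}))
```

Because the usage is stored as opaque JSON text, adding a new resource type to
the collected data needs no schema change, only a new key in the object.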

Dependencies
=============

None


Testing
========

None


Documentation Impact
=====================

None


Reference
==========

[1] http://developer.openstack.org/api-ref-compute-v2.1.html#showCellCapacities

[2] http://developer.openstack.org/api-ref-blockstorage-v2.html#os-vol-pool-v2

[3] http://developer.openstack.org/api-ref-compute-v2.1.html#showinfo