trio2o/specs/newton/dynamic-pod-binding.rst
joehuang 180f68eac8 Remove networking related code from the Trio2o
1. What is the problem?
Networking related code is still in the repository of
the Trio2o project.

2. What is the solution to the problem?
According to the blueprint for the Trio2o cleaning:
https://blueprints.launchpad.net/trio2o/+spec/trio2o-code-cleaning
Networking related code which is forked from the Tricircle
repository should be removed from the Trio2o repository.
After the cleanning, the Trio2o should be able to run independently.
There are lots of things to clean and update, and have to do it in
one huge patch, otherwise the code in Trio2o will not be able to run
and tested properply:
1). Remove netwoking operaion from server controller
2). Update devstack script
3). Update installation guide
4). Update README
5). Remove network folder and network related unit tests
6). Rename Tricircle to Trio2o in all source code

THE MEANING OF FILE OPERATION:
D: delete a file
R: rename a file to another name
A: add a new file
C: copy a file

3. What the features need to be implemented to the Tricircle to realize
the solution?
No new features.

Change-Id: I0b48ee38280e25ba6294ca3d5b7a0673cb368ed4
Signed-off-by: joehuang <joehuang@huawei.com>
2016-11-14 02:12:48 -05:00

237 lines
10 KiB
ReStructuredText

=================================
Dynamic Pod Binding in Tricircle
=================================
Background
===========
Most public cloud infrastructure is built with Availability Zones (AZs).
Each AZ is consisted of one or more discrete data centers, each with high
bandwidth and low latency network connection, separate power and facilities.
These AZs offer cloud tenants the ability to operate production
applications and databases deployed into multiple AZs are more highly
available, fault tolerant and scalable than a single data center.
In production clouds, each AZ is built by modularized OpenStack, and each
OpenStack is one pod. Moreover, one AZ can include multiple pods. Among the
pods, they are classified into different categories. For example, servers
in one pod are only for general purposes, and the other pods may be built
for heavy load CAD modeling with GPU. So pods in one AZ could be divided
into different groups. Different pod groups for different purposes, and
the VM's cost and performance are also different.
The concept "pod" is created for the Tricircle to facilitate managing
OpenStack instances among AZs, which therefore is transparent to cloud
tenants. The Tricircle maintains and manages a pod binding table which
records the mapping relationship between a cloud tenant and pods. When the
cloud tenant creates a VM or a volume, the Tricircle tries to assign a pod
based on the pod binding table.
Motivation
===========
In resource allocation scenario, when a tenant creates a VM in one pod and a
new volume in a another pod respectively. If the tenant attempt to attach the
volume to the VM, the operation will fail. In other words, the volume should
be in the same pod where the VM is, otherwise the volume and VM would not be
able to finish the attachment. Hence, the Tricircle needs to ensure the pod
binding so as to guarantee that VM and volume are created in one pod.
In capacity expansion scenario, when resources in one pod are exhausted,
then a new pod with the same type should be added into the AZ. Therefore,
new resources of this type should be provisioned in the new added pod, which
requires dynamical change of pod binding. The pod binding could be done
dynamically by the Tricircle, or by admin through admin api for maintenance
purpose. For example, for maintenance(upgrade, repairement) window, all
new provision requests should be forwarded to the running one, but not
the one under maintenance.
Solution: dynamic pod binding
==============================
It's quite headache for capacity expansion inside one pod, you have to
estimate, calculate, monitor, simulate, test, and do online grey expansion
for controller nodes and network nodes whenever you add new machines to the
pod. It's quite big challenge as more and more resources added to one pod,
and at last you will reach limitation of one OpenStack. If this pod's
resources exhausted or reach the limit for new resources provisioning, the
Tricircle needs to bind tenant to a new pod instead of expanding the current
pod unlimitedly. The Tricircle needs to select a proper pod and stay binding
for a duration, in this duration VM and volume will be created for one tenant
in the same pod.
For example, suppose we have two groups of pods, and each group has 3 pods,
i.e.,
GroupA(Pod1, Pod2, Pod3) for general purpose VM,
GroupB(Pod4, Pod5, Pod6) for CAD modeling.
Tenant1 is bound to Pod1, Pod4 during the first phase for several months.
In the first phase, we can just add weight in Pod, for example, Pod1, weight 1,
Pod2, weight2, this could be done by adding one new field in pod table, or no
field at all, just link them by the order created in the Tricircle. In this
case, we use the pod creation time as the weight.
If the tenant wants to allocate VM/volume for general VM, Pod1 should be
selected. It can be implemented with flavor or volume type metadata. For
general VM/Volume, there is no special tag in flavor or volume type metadata.
If the tenant wants to allocate VM/volume for CAD modeling VM, Pod4 should be
selected. For CAD modeling VM/Volume, a special tag "resource: CAD Modeling"
in flavor or volume type metadata determines the binding.
When it is detected that there is no more resources in Pod1, Pod4. Based on
the resource_affinity_tag, the Tricircle queries the pod table for available
pods which provision a specific type of resources. The field resource_affinity
is a key-value pair. The pods will be selected when there are matched
key-value in flavor extra-spec or volume extra-spec. A tenant will be bound
to one pod in one group of pods with same resource_affinity_tag. In this case,
the Tricircle obtains Pod2 and Pod3 for general purpose, as well as Pod5 an
Pod6 for CAD purpose. The Tricircle needs to change the binding, for example,
tenant1 needs to be bound to Pod2, Pod5.
Implementation
===============
Measurement
-------------
To get the information of resource utilization of pods, the Tricircle needs to
conduct some measurements on pods. The statistic task should be done in
bottom pod.
For resources usages, current cells provide interface to retrieve usage for
cells [1]. OpenStack provides details of capacity of a cell, including disk
and ram via api of showing cell capacities [1].
If OpenStack is not running with cells mode, we can ask Nova to provide
an interface to show the usage detail in AZ. Moreover, an API for usage
query at host level is provided for admins [3], through which we can obtain
details of a host, including cpu, memory, disk, and so on.
Cinder also provides interface to retrieve the backend pool usage,
including updated time, total capacity, free capacity and so on [2].
The Tricircle needs to have one task to collect the usage in the bottom on
daily base, to evaluate whether the threshold is reached or not. A threshold
or headroom could be configured for each pod, but not to reach 100% exhaustion
of resources.
On top there should be no heavy process. So getting the sum info from the
bottom can be done in the Tricircle. After collecting the details, the
Tricircle can judge whether a pod reaches its limit.
Tricircle
----------
The Tricircle needs a framework to support different binding policy (filter).
Each pod is one OpenStack instance, including controller nodes and compute
nodes. E.g.,
::
+-> controller(s) - pod1 <--> compute nodes <---+
|
The tricircle +-> controller(s) - pod2 <--> compute nodes <---+ resource migration, if necessary
(resource controller) .... |
+-> controller(s) - pod{N} <--> compute nodes <-+
The Tricircle selects a pod to decide where the requests should be forwarded
to which controller. Then the controllers in the selected pod will do its own
scheduling.
One simplest binding filter is as follows. Line up all available pods in a
list and always select the first one. When all the resources in the first pod
has been allocated, remove it from the list. This is quite like how production
cloud is built: at first, only a few pods are in the list, and then add more
and more pods if there is not enough resources in current cloud. For example,
List1 for general pool: Pod1 <- Pod2 <- Pod3
List2 for CAD modeling pool: Pod4 <- Pod5 <- Pod6
If Pod1's resource exhausted, Pod1 is removed from List1. The List1 is changed
to: Pod2 <- Pod3.
If Pod4's resource exhausted, Pod4 is removed from List2. The List2 is changed
to: Pod5 <- Pod6
If the tenant wants to allocate resources for general VM, the Tricircle
selects Pod2. If the tenant wants to allocate resources for CAD modeling VM,
the Tricircle selects Pod5.
Filtering
-------------
For the strategy of selecting pods, we need a series of filters. Before
implementing dynamic pod binding, the binding criteria are hard coded to
select the first pod in the AZ. Hence, we need to design a series of filter
algorithms. Firstly, we plan to design an ALLPodsFilter which does no
filtering and passes all the available pods. Secondly, we plan to design an
AvailabilityZoneFilter which passes the pods matching the specified available
zone. Thirdly, we plan to design a ResourceAffiniyFilter which passes the pods
matching the specified resource type. Based on the resource_affinity_tag,
the Tricircle can be aware of which type of resource the tenant wants to
provision. In the future, we can add more filters, which requires adding more
information in the pod table.
Weighting
-------------
After filtering all the pods, the Tricircle obtains the available pods for a
tenant. The Tricircle needs to select the most suitable pod for the tenant.
Hence, we need to define a weight function to calculate the corresponding
weight of each pod. Based on the weights, the Tricircle selects the pod which
has the maximum weight value. When calculating the weight of a pod, we need
to design a series of weigher. We first take the pod creation time into
consideration when designing the weight function. The second one is the idle
capacity, to select a pod which has the most idle capacity. Other metrics
will be added in the future, e.g., cost.
Data Model Impact
==================
Firstly, we need to add a column “resource_affinity_tag” to the pod table,
which is used to store the key-value pair, to match flavor extra-spec and
volume extra-spec.
Secondly, in the pod binding table, we need to add fields of start binding
time and end binding time, so the history of the binding relationship could
be stored.
Thirdly, we need a table to store the usage of each pod for Cinder/Nova.
We plan to use JSON object to store the usage information. Hence, even if
the usage structure is changed, we don't need to update the table. And if
the usage value is null, that means the usage has not been initialized yet.
As just mentioned above, the usage could be refreshed in daily basis. If it's
not initialized yet, it means there is still lots of resources available,
which could be scheduled just like this pod has not reach usage threshold.
Dependencies
=============
None
Testing
========
None
Documentation Impact
=====================
None
Reference
==========
[1] http://developer.openstack.org/api-ref-compute-v2.1.html#showCellCapacities
[2] http://developer.openstack.org/api-ref-blockstorage-v2.html#os-vol-pool-v2
[3] http://developer.openstack.org/api-ref-compute-v2.1.html#showinfo