diff --git a/specs/dynamic-pod-binding.rst b/specs/dynamic-pod-binding.rst
new file mode 100644
index 0000000..3af0e2b
--- /dev/null
+++ b/specs/dynamic-pod-binding.rst
@@ -0,0 +1,236 @@
=================================
Dynamic Pod Binding in Tricircle
=================================

Background
===========

Most public cloud infrastructure is built with Availability Zones (AZs).
Each AZ consists of one or more discrete data centers, each with
high-bandwidth, low-latency network connections and separate power and
facilities. These AZs give cloud tenants the ability to run production
applications and databases; workloads deployed across multiple AZs are
more highly available, fault tolerant and scalable than those confined
to a single data center.

In production clouds, each AZ is built from modularized OpenStack
deployments, and each OpenStack instance is one pod. One AZ can
therefore include multiple pods, and the pods are classified into
different categories. For example, the servers in one pod may serve
only general-purpose workloads, while other pods may be equipped with
GPUs for heavy CAD modeling. Pods in one AZ can thus be divided into
groups: different pod groups serve different purposes, and the cost and
performance of the VMs they host also differ.

The concept of a "pod" is introduced in the Tricircle to facilitate
managing OpenStack instances among AZs, and it is transparent to cloud
tenants. The Tricircle maintains a pod binding table which records the
mapping between a cloud tenant and pods. When a tenant creates a VM or
a volume, the Tricircle tries to assign a pod based on the pod binding
table.

Motivation
===========

In the resource allocation scenario, suppose a tenant creates a VM in
one pod and a new volume in another pod. If the tenant then attempts to
attach the volume to the VM, the operation will fail. In other words,
the volume has to be in the same pod as the VM, otherwise the
attachment cannot be completed. Hence, the Tricircle needs to enforce
the pod binding so as to guarantee that the VM and the volume are
created in one pod.

In the capacity expansion scenario, when the resources in one pod are
exhausted, a new pod of the same type should be added into the AZ. New
resources of this type should then be provisioned in the newly added
pod, which requires a dynamic change of the pod binding. The pod
binding can be changed dynamically by the Tricircle, or by the admin
through the admin API for maintenance purposes. For example, during a
maintenance (upgrade, repair) window, all new provisioning requests
should be forwarded to the running pod, not to the one under
maintenance.

Solution: dynamic pod binding
==============================

Capacity expansion inside one pod is quite a headache: you have to
estimate, calculate, monitor, simulate, test, and do online grey
expansion of controller nodes and network nodes whenever new machines
are added to the pod. It becomes a big challenge as more and more
resources are added to one pod, and eventually you reach the limits of
a single OpenStack instance. When a pod's resources are exhausted or
hit the provisioning limit, the Tricircle needs to bind the tenant to a
new pod instead of expanding the current pod without limit. The
Tricircle needs to select a proper pod and keep the binding for a
duration; during this period the tenant's VMs and volumes are created
in that same pod.
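
Before the concrete example below, here is a minimal Python sketch of
this lookup-and-rebind flow. All names (db_api, get_active_binding,
pod_is_exhausted, select_pod, rebind_tenant) are hypothetical and only
illustrate the intended behaviour; they are not existing Tricircle code.

::

    import datetime


    def get_bound_pod(db_api, context, tenant_id, resource_affinity_tag):
        """Return the pod bound to the tenant for this resource type.

        If no valid binding exists, or the bound pod has exhausted its
        capacity, select a new pod and record a fresh binding.
        """
        binding = db_api.get_active_binding(context, tenant_id,
                                            resource_affinity_tag)
        if binding and not db_api.pod_is_exhausted(context, binding.pod_id):
            # Reuse the existing binding so the tenant's VMs and volumes
            # keep landing in the same pod.
            return binding.pod_id

        # First request from this tenant, or the bound pod is exhausted:
        # select a new pod, close the old binding and open a new one so
        # the binding history is preserved.
        new_pod_id = db_api.select_pod(context, tenant_id,
                                       resource_affinity_tag)
        db_api.rebind_tenant(context, tenant_id, new_pod_id,
                             start_time=datetime.datetime.utcnow())
        return new_pod_id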

For example, suppose we have two groups of pods, each group containing
three pods:

GroupA (Pod1, Pod2, Pod3) for general purpose VMs,

GroupB (Pod4, Pod5, Pod6) for CAD modeling.

Tenant1 is bound to Pod1 and Pod4 during the first phase, which may
last several months. In the first phase we can simply assign a weight
to each pod, for example Pod1 with weight 1 and Pod2 with weight 2.
This can be done by adding one new field to the pod table, or with no
new field at all by ordering the pods by the time they were created in
the Tricircle. In the latter case, the pod creation time is used as the
weight.

If the tenant wants to allocate a VM or volume for general purposes,
Pod1 should be selected. This can be implemented with flavor or volume
type metadata: for general VMs and volumes, there is no special tag in
the flavor or volume type metadata.

If the tenant wants to allocate a VM or volume for CAD modeling, Pod4
should be selected. For CAD modeling VMs and volumes, a special tag
"resource: CAD Modeling" in the flavor or volume type metadata
determines the binding.

When it is detected that there are no more resources in Pod1 or Pod4,
the Tricircle queries the pod table, based on the resource_affinity_tag,
for available pods that provision the specific type of resources. The
field resource_affinity_tag is a key-value pair; a pod is selected when
its tag matches a key-value pair in the flavor extra specs or volume
type extra specs. A tenant is bound to one pod in each group of pods
sharing the same resource_affinity_tag. In this case, the Tricircle
obtains Pod2 and Pod3 for the general purpose group, as well as Pod5
and Pod6 for the CAD group. The Tricircle then needs to change the
binding; for example, Tenant1 is rebound to Pod2 and Pod5.

Implementation
===============

Measurement
-------------

To obtain information about the resource utilization of pods, the
Tricircle needs to conduct some measurement of the pods. The statistics
task should be done in the bottom pods.

For resource usage, the current cells implementation provides an
interface to retrieve the usage of a cell [1]. OpenStack exposes the
capacity details of a cell, including disk and RAM, via the API for
showing cell capacities [1].

If OpenStack is not running in cells mode, we can ask Nova to provide
an interface that shows the usage details per AZ. Moreover, an API for
usage queries at the host level is provided for admins [3], through
which we can obtain the details of a host, including CPU, memory, disk,
and so on.

Cinder also provides an interface to retrieve the back-end pool usage,
including the updated time, total capacity, free capacity and so on [2].

The Tricircle needs a task that collects the usage from the bottom pods
on a daily basis and evaluates whether the threshold has been reached.
A threshold or headroom can be configured for each pod so that its
resources are never driven to 100% exhaustion.

No heavy processing should run on top, so aggregating the summary
information collected from the bottom can be done in the Tricircle.
After collecting the details, the Tricircle can judge whether a pod has
reached its limit.

Tricircle
----------

The Tricircle needs a framework to support different binding policies
(filters).

Each pod is one OpenStack instance, including controller nodes and
compute nodes. E.g.,

::

                            +-> controller(s) - pod1 <--> compute nodes <---+
                            |                                               |
  The Tricircle ------------+-> controller(s) - pod2 <--> compute nodes <---+ resource migration, if necessary
  (resource controller)     |                 ....                          |
                            +-> controller(s) - pod{N} <--> compute nodes <-+


The Tricircle selects a pod and decides to which pod's controller the
request should be forwarded; the controllers in the selected pod then
do their own scheduling.
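
As a rough illustration of what such a binding policy (filter)
framework could look like, the sketch below shows a base filter class,
the resource affinity filter discussed in the Filtering section, and a
helper that applies a chain of filters. The method names and the
filter_pods helper are assumptions made for illustration, not existing
Tricircle code.

::

    class BasePodFilter(object):
        """Base class of all pod filters: decide whether a pod passes."""

        def is_pod_passed(self, context, pod, request_spec):
            raise NotImplementedError()


    class ResourceAffinityFilter(BasePodFilter):
        """Pass only the pods whose resource_affinity_tag matches the
        key-value pairs carried in the flavor or volume type extra specs."""

        def is_pod_passed(self, context, pod, request_spec):
            wanted = request_spec.get('resource_affinity', {})
            offered = pod.get('resource_affinity_tag', {})
            return all(offered.get(key) == value
                       for key, value in wanted.items())


    def filter_pods(filters, context, pods, request_spec):
        # Apply every configured filter in turn; a pod survives only if
        # it passes all of them.
        for pod_filter in filters:
            pods = [pod for pod in pods
                    if pod_filter.is_pod_passed(context, pod, request_spec)]
        return pods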

The simplest binding filter is as follows: line up all available pods
in a list and always select the first one. When all the resources in
the first pod have been allocated, remove it from the list. This is
quite like how a production cloud is built: at first only a few pods
are in the list, and more pods are added whenever the current cloud
runs short of resources. For example:

List1 for the general pool: Pod1 <- Pod2 <- Pod3

List2 for the CAD modeling pool: Pod4 <- Pod5 <- Pod6

If Pod1's resources are exhausted, Pod1 is removed from List1, which
becomes: Pod2 <- Pod3.

If Pod4's resources are exhausted, Pod4 is removed from List2, which
becomes: Pod5 <- Pod6.

If the tenant then wants to allocate resources for a general purpose
VM, the Tricircle selects Pod2; for a CAD modeling VM, the Tricircle
selects Pod5.

Filtering
-------------

For the pod selection strategy we need a series of filters. Before
dynamic pod binding is implemented, the binding criteria are hard coded
to select the first pod in the AZ, so we need to design a series of
filter algorithms. Firstly, we plan to design an ALLPodsFilter which
does no filtering and passes all the available pods. Secondly, we plan
to design an AvailabilityZoneFilter which passes the pods matching the
specified availability zone. Thirdly, we plan to design a
ResourceAffinityFilter which passes the pods matching the specified
resource type. Based on the resource_affinity_tag, the Tricircle knows
which type of resource the tenant wants to provision. In the future we
can add more filters, which will require adding more information to the
pod table.

Weighting
-------------

After filtering the pods, the Tricircle obtains the pods available to a
tenant and needs to select the most suitable one among them. Hence, we
need to define a weight function to calculate the weight of each pod;
based on the weights, the Tricircle selects the pod with the maximum
weight value. When calculating the weight of a pod, we need to design a
series of weighers. The first weigher takes the pod creation time into
consideration; the second is the idle capacity, so as to select the pod
with the most idle capacity. Other metrics, e.g. cost, will be added in
the future.

Data Model Impact
==================

Firstly, we need to add a column "resource_affinity_tag" to the pod
table, which stores the key-value pair that is matched against the
flavor extra specs and volume type extra specs.

Secondly, in the pod binding table we need to add fields for the
binding start time and end time, so that the history of the binding
relationship can be stored.

Thirdly, we need a table to store the Cinder/Nova usage of each pod. We
plan to store the usage information as a JSON object, so that the table
does not need to be updated even if the usage structure changes. A null
usage value means the usage has not been initialized yet. As mentioned
above, the usage can be refreshed on a daily basis; if it has not been
initialized yet, it is assumed that plenty of resources are still
available and the pod can be scheduled as if it had not reached its
usage threshold.
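
To make the data model changes above more tangible, here is a rough
SQLAlchemy-style sketch of the new usage table. The table and column
names follow the description in this section but are assumptions about
how the schema could look, not a final migration.

::

    import sqlalchemy as sa
    from sqlalchemy.ext import declarative

    Base = declarative.declarative_base()


    class PodResourceUsage(Base):
        """Latest Nova/Cinder usage snapshot of one pod, refreshed daily.

        In addition, the pod table would gain a resource_affinity_tag
        column, and the pod binding table would gain start time and end
        time columns (all names here are illustrative).
        """

        __tablename__ = 'pod_resource_usages'

        id = sa.Column(sa.Integer, primary_key=True)
        pod_id = sa.Column(sa.String(36), nullable=False)
        # Raw usage document (e.g. Nova cell capacities or Cinder pool
        # stats) stored as JSON text, so the structure can evolve
        # without further schema changes.  NULL means "not collected
        # yet", which is treated as "plenty of headroom left".
        usage = sa.Column(sa.Text, nullable=True)
        updated_at = sa.Column(sa.DateTime)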

Dependencies
=============

None


Testing
========

None


Documentation Impact
=====================

None


References
===========

[1] http://developer.openstack.org/api-ref-compute-v2.1.html#showCellCapacities

[2] http://developer.openstack.org/api-ref-blockstorage-v2.html#os-vol-pool-v2

[3] http://developer.openstack.org/api-ref-compute-v2.1.html#showinfo