Merge "Add cross-pod L2 Networking spec file"

This commit is contained in:
Jenkins 2016-06-17 01:30:39 +00:00 committed by Gerrit Code Review
commit 828f55287e

@ -0,0 +1,563 @@
======================================
Cross pod L2 networking in Tricircle
======================================
Background
==========
The Tricircle provides unified OpenStack API gateway and networking automation
functionality. Those main functionalities allow cloud operators to manage
multiple OpenStack instances which are running in one site or multiple sites
as a single OpenStack cloud.
Each bottom OpenStack instance which is managed by the Tricircle is also called
a pod.
The Tricircle has the following components:
* Nova API-GW
* Cinder API-GW
* Neutron API Server with Neutron Tricircle plugin
* Admin API
* XJob
* DB
Nova API-GW provides the functionality to trigger automatic networking creation
when new VMs are being provisioned. The Neutron Tricircle plugin provides the
functionality to create cross OpenStack L2/L3 networking for new VMs. After the
binding of tenant-id and pod is finished in the Tricircle, Cinder API-GW and
Nova API-GW will pass the Cinder API or Nova API request to the appropriate
bottom OpenStack instance.
Please refer to the Tricircle design blueprint[1], especially
'7. Stateless Architecture Proposal', for a detailed description of each
component.
Problem Description
===================
When a user wants to create a network in Neutron API Server, the user can
specify the 'availability_zone_hints' (AZ or az will be used as short for
availability zone) during network creation[5]. In the Tricircle, the
'az_hints' indicate which AZs the network should be spread into. The meaning
of 'az_hints' in the Tricircle is a little different from its meaning in
Neutron[5]. If no 'az_hints' are specified during network creation, the created
network can be spread into any AZ. If a list of 'az_hints' is given during
network creation, the network should be able to be spread into the AZs named
in that list.
When a user creates a VM or Volume, there is also one parameter called
availability zone. The AZ parameter is used for Volume and VM co-location, so
that the Volume and VM will be created in the same bottom OpenStack instance.
When a VM is being attached to a network, the Tricircle will check whether the
VM's AZ is within the scope of the network's AZs. If it is not, the VM creation
will be rejected.
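
The attach-time check described above can be sketched as follows. This is a
minimal illustration rather than Tricircle code: the function name
``vm_az_in_network_scope`` is hypothetical, and an empty hint list is assumed
to mean "any AZ", matching the 'az_hints' semantics given earlier.

```python
from typing import Optional, Sequence


def vm_az_in_network_scope(vm_az: str,
                           az_hints: Optional[Sequence[str]]) -> bool:
    """Return True if a VM in ``vm_az`` may attach to the network.

    Hypothetical helper: an empty or missing ``az_hints`` list means the
    network may be spread into any AZ, so every VM is in scope.
    """
    if not az_hints:
        return True
    return vm_az in az_hints
```

A caller would reject the VM creation request whenever this returns
``False``.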
Currently, the Tricircle supports only one pod in one AZ, and a network can be
associated with only one AZ. That means a tenant's network is currently
presented in only one bottom OpenStack instance, and all VMs connected to the
network are located in that one bottom OpenStack instance.
For the case of more than one pod in one AZ, refer to the dynamic pod
binding[6].
There are lots of use cases where a tenant needs a network that can be spread
out into multiple bottom OpenStack instances in one AZ or multiple AZs.
* Capacity expansion: as tenants add more and more VMs, the capacity of one
OpenStack instance may not be enough, and a new OpenStack instance has to be
added to the cloud. But the tenant still wants to add new VMs into the same
network.
* Cross OpenStack network service chaining. Service chaining is based on
the port-pairs. Leveraging the cross pod L2 networking capability which
is provided by the Tricircle, the chaining could also be done across sites.
For example, with vRouter1 in pod1 and vRouter2 in pod2, these two VMs could
be chained.
* Applications are often required to run in different availability zones to
achieve high availability. An application needs to be designed as
Active-Standby/Active-Active/N-Way to achieve high availability, and some
components inside one application are designed to work as a distributed
cluster. This design typically leads to state replication or heartbeat
among application components (directly, via replicated database services, or
via a privately designed message format). When this kind of application is
deployed in a distributed way into multiple OpenStack instances, cross
OpenStack L2 networking is needed to support heartbeat or state replication.
* When a tenant's VMs are provisioned in different OpenStack instances, there
is E-W (East-West) traffic between these VMs. The E-W traffic should be
visible only to the tenant, so isolation is needed. If the traffic goes
N-S (North-South) through a tenant level VPN, the overhead is too high, and
the orchestration of multiple site to site VPN connections is also
complicated. Therefore cross OpenStack L2 networking to bridge the tenant's
routers in different OpenStack instances can provide more lightweight
isolation.
* In a hybrid cloud, there is a cross L2 networking requirement between the
private OpenStack and the public OpenStack. Cross pod L2 networking will
help VM migration in this case, and it's not necessary to change the
IP/MAC/Security Group configuration during VM migration.
The spec[6] explains how one AZ can support more than one pod, and how
to schedule a proper pod during VM or Volume creation.
This spec deals with the cross OpenStack L2 networking automation in
the Tricircle.
The simplest way to spread out L2 networking to multiple OpenStack instances
is to use the same VLAN. But there are a lot of limitations: (1) the number of
VLAN segments is limited, (2) a VLAN network itself is not well suited to
spreading across multiple sites, although you can use some gateways to do so.
So flexible tenant level L2 networking across multiple OpenStack instances in
one site or in multiple sites is needed.
Proposed Change
===============
Cross pod L2 networking can be divided into three categories,
``Shared VLAN``, ``Shared VxLAN`` and ``Mixed VLAN/VxLAN``.
* Shared VLAN
Network in each bottom OpenStack is VLAN type and has the same VLAN ID.
If we want shared VLAN L2 networking to work in a multi-site scenario, i.e.,
multiple OpenStack instances in multiple sites, a physical gateway needs to
be manually configured to extend one VLAN network to other sites.
*Manual setup of the physical gateway is out of the scope of this spec*
* Shared VxLAN
Network in each bottom OpenStack instance is VxLAN type and has the same
VxLAN ID.
Leverage L2GW[2][3] to implement this type of L2 networking.
* Mixed VLAN/VxLAN
Network in each bottom OpenStack instance may have different types and/or
have different segment IDs.
Leverage L2GW[2][3] to implement this type of L2 networking.
There is another network type called “Local Network”. A “Local Network”
will be presented in only one bottom OpenStack instance, and will not be
presented in different bottom OpenStack instances. If a VM in another pod
tries to attach to the “Local Network”, the attachment should fail.
This network type is quite useful for the scenario in which cross pod L2
networking is not required, and one AZ will not include more than one bottom
OpenStack instance.
Cross pod L2 networking will be able to be established dynamically while a
tenant's VM is being provisioned.
The assumption here is that only one type of L2 networking will work in one
cloud deployment.
A Cross Pod L2 Networking Creation
------------------------------------
A cross pod L2 network can be created with the az_hint attribute of the
network. If az_hint includes one or more AZs, the network will be presented
only in this AZ or these AZs. If az_hint includes no AZ, the network can be
extended to any bottom OpenStack.
There is a special use case for external network creation. For external
network creation, you need to specify the pod_id rather than the AZ in the
az_hint so that the external network will be created in only one specified
pod per AZ.
*Support of External network in multiple OpenStack instances in one AZ
is out of scope of this spec.*
A pluggable L2 networking framework is proposed to deal with the three types
of cross pod L2 networking, and it should be compatible with the
``Local Network``.
1. Type Driver under Tricircle Plugin in Neutron API server
* Type driver to distinguish different types of cross pod L2 networking. So
the Tricircle plugin needs to load the type driver according to the
configuration. The Tricircle can reuse the type driver of ML2 with updates.
* Type driver to allocate VLAN segment id for shared VLAN L2 networking.
* Type driver to allocate VxLAN segment id for shared VxLAN L2 networking.
* Type driver for mixed VLAN/VxLAN to allocate VxLAN segment id for the
network connecting L2GWs[2][3].
* Type driver for Local Network only updating ``network_type`` for the
network to the Tricircle Neutron DB.
When a network creation request is received in Neutron API Server in the
Tricircle, the type driver will be called based on the configured network
type.
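
The type driver dispatch described in item 1 could look roughly like the
sketch below. All class and function names, and the segment id ranges, are
hypothetical; only the ``tenant_network_type`` values come from this spec. The
mixed VLAN/VxLAN entry is treated as record-only for tenant networks, which is
a simplification that leaves out the segment allocation for the network
connecting L2GWs.

```python
from typing import Dict, Optional


class RangeTypeDriver:
    """Allocates segment ids from a fixed, hypothetical range."""

    def __init__(self, network_type: str, first_id: int, last_id: int) -> None:
        self.network_type = network_type
        self._free_ids = iter(range(first_id, last_id + 1))

    def allocate_segment(self) -> Optional[int]:
        segment_id = next(self._free_ids, None)
        if segment_id is None:
            raise RuntimeError("no free %s segment id" % self.network_type)
        return segment_id


class RecordOnlyTypeDriver:
    """Only records the network type; no segment id is allocated on top."""

    def __init__(self, network_type: str) -> None:
        self.network_type = network_type

    def allocate_segment(self) -> Optional[int]:
        return None


def build_type_drivers() -> Dict[str, object]:
    # Keys are the tenant_network_type values named in this spec;
    # the id ranges are illustrative assumptions.
    return {
        "local_network": RecordOnlyTypeDriver("local_network"),
        "shared_vlan": RangeTypeDriver("vlan", 100, 199),
        "shared_vxlan": RangeTypeDriver("vxlan", 5000, 5999),
        "mixed_vlan_vxlan": RecordOnlyTypeDriver("mixed_vlan_vxlan"),
    }


def create_top_network(drivers, name: str, tenant_network_type: str) -> dict:
    """Build the top-level network record using the configured driver."""
    driver = drivers[tenant_network_type]
    return {
        "name": name,
        "network_type": driver.network_type,
        "segment": driver.allocate_segment(),
    }
```

Keeping the drivers behind one lookup keyed by the configured type is what
lets the plugin stay unchanged when a new L2 networking type is added.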
2. Nova API-GW to trigger the bottom networking automation
Nova API-GW is aware that a new VM is being provisioned when a boot VM API
request is received, therefore Nova API-GW is responsible for the network
creation in the bottom OpenStack instances.
Nova API-GW needs to get the network type from Neutron API server in the
Tricircle, and deal with the networking automation based on the network type:
* Shared VLAN
Nova API-GW creates a network in the bottom OpenStack instance in which the VM
will run, using the VLAN segment id, network name and type that are retrieved
from the Neutron API server in the Tricircle.
* Shared VxLAN
Nova API-GW creates a network in the bottom OpenStack instance in which the VM
will run, using the VxLAN segment id, network name and type which are retrieved
from the Tricircle Neutron API server. After the network in the bottom
OpenStack instance is created successfully, Nova API-GW needs to make this
network in the bottom OpenStack instance one of the segments of the network in
the Tricircle.
* Mixed VLAN/VxLAN
Nova API-GW creates a network in each bottom OpenStack instance in which a VM
will run, with the VLAN or VxLAN segment id respectively, network name and type
which are retrieved from the Tricircle Neutron API server. After the networks
in the bottom OpenStack instances are created successfully, Nova API-GW needs
to update the network in the Tricircle with the segmentation information of
the bottom networks.
3. L2GW driver under Tricircle Plugin in Neutron API server
Tricircle plugin needs to support multi-segment network extension[4].
For Shared VxLAN or Mixed VLAN/VxLAN L2 network type, L2GW driver will utilize the
multi-segment network extension in Neutron API server to build the L2 network in the
Tricircle. Each network in the bottom OpenStack instance will be a segment for the
whole cross pod L2 networking in the Tricircle.
After the network in the bottom OpenStack instance has been created
successfully, Nova API-GW will call the Neutron server API to update the
network in the Tricircle with a new segment from the network in the bottom
OpenStack instance.
If the network in the bottom OpenStack instance has been removed successfully,
Nova API-GW will call the Neutron server API to remove that bottom OpenStack
instance's segment from the network in the Tricircle.
When the L2GW driver under the Tricircle plugin in Neutron API server receives
the segment update request, the L2GW driver will start an async job to
orchestrate the L2GW API for L2 networking automation[2][3].
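
The segment bookkeeping in item 3 can be illustrated with plain dictionaries.
This is a sketch under assumptions: the helper names are hypothetical, the
segments are keyed by a pod id for simplicity, and the rule "more than one
segment triggers the L2GW job" is taken from the implementation sections of
this spec. The ``provider:*`` keys follow the Neutron provider network
attribute names.

```python
from typing import Dict


def add_bottom_segment(top_segments: Dict[str, dict], pod_id: str,
                       network_type: str, segmentation_id: int) -> Dict[str, dict]:
    """Return a copy of the top network's segments with one pod's segment set."""
    updated = dict(top_segments)
    updated[pod_id] = {
        "provider:network_type": network_type,
        "provider:segmentation_id": segmentation_id,
    }
    return updated


def remove_bottom_segment(top_segments: Dict[str, dict],
                          pod_id: str) -> Dict[str, dict]:
    """Return a copy of the segments without the given pod's segment."""
    return {pod: seg for pod, seg in top_segments.items() if pod != pod_id}


def needs_l2gw_job(top_segments: Dict[str, dict]) -> bool:
    """An L2GW job is only needed once the network spans several segments."""
    return len(top_segments) > 1
```

Both update helpers return new dictionaries instead of mutating their input,
so a failed bottom operation leaves the previous top-level state untouched.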
Data model impact
-----------------
In the database, we are considering setting physical_network in the top
OpenStack instance to ``bottom_physical_network#bottom_pod_id`` to distinguish
the segmentation information of different bottom OpenStack instances.
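
A small sketch of the proposed ``bottom_physical_network#bottom_pod_id``
encoding is shown below. The helper names are hypothetical, and splitting on
the last ``#`` (so that the pod id itself must not contain ``#``) is an
assumption made for this example, not something the spec defines.

```python
from typing import Tuple

SEPARATOR = "#"


def encode_top_physical_network(bottom_physical_network: str,
                                bottom_pod_id: str) -> str:
    """Combine bottom physical network and pod id into the top-level value."""
    if SEPARATOR in bottom_pod_id:
        raise ValueError("pod id must not contain %r" % SEPARATOR)
    return bottom_physical_network + SEPARATOR + bottom_pod_id


def decode_top_physical_network(value: str) -> Tuple[str, str]:
    """Split a top-level value back into (physical_network, pod_id)."""
    physical_network, sep, pod_id = value.rpartition(SEPARATOR)
    if not sep:
        raise ValueError("missing %r in %r" % (SEPARATOR, value))
    return physical_network, pod_id
```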
REST API impact
---------------
None
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance Impact
------------------
None
Other deployer impact
---------------------
None
Developer impact
----------------
None
Implementation
==============
**Local Network Implementation**
For Local Network, L2GW is not required. In this scenario, no cross pod L2/L3
networking is required.
A user creates network ``Net1`` with a single AZ1 in az_hint. The Tricircle
plugin checks the configuration, and if ``tenant_network_type`` equals
``local_network``, it will invoke the Local Network type driver. The Local
Network driver under the Tricircle plugin will update ``network_type`` in the
database.
For example, a user creates VM1 in AZ1 which has only one pod ``POD1``, and
connects it to network ``Net1``. ``Nova API-GW`` will send a network creation
request to ``POD1`` and the VM will be booted in AZ1 (there should be only one
pod in AZ1).
If a user wants to create VM2 in AZ2, or in ``POD2`` in AZ1, and connect it to
network ``Net1`` in the Tricircle, the request will fail, because ``Net1`` is
a local_network type network and is limited to being present in ``POD1`` in
AZ1 only.
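
The Local Network restriction can be sketched as a pod binding check. The
names ``LocalNetworkError`` and ``bind_local_network`` are hypothetical, and
binding the network to the first pod that uses it is an assumption drawn from
the ``Net1``/``POD1`` example above.

```python
from typing import Optional


class LocalNetworkError(Exception):
    """Raised when a local network is requested from a second pod."""


def bind_local_network(bound_pod: Optional[str], requested_pod: str) -> str:
    """Return the pod a local network is bound to, or reject the request.

    ``bound_pod`` is None until the first VM attaches; after that, only
    requests for the same pod are accepted.
    """
    if bound_pod is None:
        return requested_pod
    if bound_pod != requested_pod:
        raise LocalNetworkError(
            "local network is limited to %s, not %s" % (bound_pod, requested_pod))
    return bound_pod
```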
**Shared VLAN Implementation**
For Shared VLAN, L2GW is not required. This is the simplest form of cross pod
L2 networking, for limited scenarios. For example, with a small number of
networks, all VLANs are extended through a physical gateway to support cross
site VLAN networking, or all pods sit under the same core switch, are
connected by it, and share the same visible VLAN ranges that the core switch
supports.
When a user creates a network called ``Net1``, the Tricircle plugin checks the
configuration. If ``tenant_network_type`` equals ``shared_vlan``, the
Tricircle will invoke the Shared VLAN type driver. The Shared VLAN driver will
create ``segment``, assign VLAN to ``network_type``, and update ``segment``,
``network_type`` and ``physical_network`` in the DB.
A user creates VM1 in AZ1, and connects it to network ``Net1``. If VM1 will be
booted in ``POD1``, ``Nova API-GW`` needs to get the network information and
send a network creation message to ``POD1``. The network creation message
includes ``network_type``, ``segment`` and ``physical_network``.
Then the user creates VM2 in AZ2, and connects it to network ``Net1``. If VM2
will be booted in ``POD2``, ``Nova API-GW`` needs to get the network
information and send a network creation message to ``POD2``. The network
creation message includes ``network_type``, ``segment`` and
``physical_network``.
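
For Shared VLAN, the same creation message is sent to every pod. The sketch
below builds such a message; the function name and the shape of the
``top_network`` record are hypothetical, and mapping the spec's
``network_type``, ``physical_network`` and ``segment`` fields onto the Neutron
``provider:*`` attributes is an assumption made for this example.

```python
from typing import Any, Dict


def build_bottom_network_request(top_network: Dict[str, Any]) -> Dict[str, Any]:
    """Build a network creation body for a bottom pod from the top record."""
    return {
        "network": {
            "name": top_network["name"],
            "provider:network_type": top_network["network_type"],
            "provider:physical_network": top_network["physical_network"],
            "provider:segmentation_id": top_network["segment"],
        }
    }
```

Because the body depends only on the top-level record, ``POD1`` and ``POD2``
receive identical VLAN settings, which is what makes the VLAN shared.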
**Shared VxLAN Implementation**
A user creates network ``Net1``. The Tricircle plugin checks the configuration,
and if ``tenant_network_type`` equals ``shared_vxlan``, it will invoke the
Shared VxLAN driver. The Shared VxLAN driver will allocate ``segment``, assign
VxLAN to ``network_type``, and update the network with ``segment`` and
``network_type`` in the DB.
A user creates VM1 in AZ1, and connects it to network ``Net1``. If VM1 will be
booted in ``POD1``, ``Nova API-GW`` needs to get the network information and
send a network creation message to ``POD1``. The network creation message
includes ``network_type`` and ``segment``.
``Nova API-GW`` should update ``Net1`` in the Tricircle with the segment
information returned by ``POD1``.
Then the user creates VM2 in AZ2, and connects it to network ``Net1``. If VM2
will be booted in ``POD2``, ``Nova API-GW`` needs to get the network
information and send a network creation message to ``POD2``. The network
creation message includes ``network_type`` and ``segment``.
``Nova API-GW`` should update ``Net1`` in the Tricircle with the segment
information returned by ``POD2``.
The Tricircle plugin detects that the network includes more than one segment,
and calls the L2GW driver to start an async job for cross pod networking for
``Net1``. The L2GW driver will create L2GW1 in ``POD1`` and L2GW2 in ``POD2``.
In ``POD1``, L2GW1 will connect the local ``Net1`` and create an L2GW remote
connection to L2GW2, then populate L2GW1 with the MAC/IP information of the
ports which reside in ``POD2``. In ``POD2``, L2GW2 will connect the local
``Net1`` and create an L2GW remote connection to L2GW1, then populate L2GW2
with the MAC/IP information of the ports which reside in ``POD1``.
The L2GW driver in the Tricircle will also detect new port creation/deletion
API requests. If a port (MAC/IP) is created or deleted in ``POD1``, it needs
to refresh the L2GW2 MAC/IP information. If a port (MAC/IP) is created or
deleted in ``POD2``, it needs to refresh the L2GW1 MAC/IP information.
Whether to populate the port (MAC/IP) information should be configurable
according to L2GW capability. MAC/IP information is populated only for the
ports that do not reside in the same pod.
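
The rule for which MAC/IP entries an L2GW receives can be sketched as a
filter over the network's ports. The function name, the port record layout
and the ``populate_enabled`` switch are hypothetical stand-ins for whatever
configuration option and data model an implementation would use.

```python
from typing import Dict, List, Tuple


def remote_mac_ip_entries(ports: List[Dict[str, str]], local_pod: str,
                          populate_enabled: bool = True) -> List[Tuple[str, str]]:
    """Return (MAC, IP) pairs to populate into the L2GW of ``local_pod``.

    Only ports residing in other pods are included; nothing is returned
    when population is disabled for this L2GW.
    """
    if not populate_enabled:
        return []
    return [(port["mac"], port["ip"])
            for port in ports if port["pod"] != local_pod]
```

For a network with one port in each of ``POD1`` and ``POD2``, the call for
``POD1`` yields only the ``POD2`` port, and vice versa, so a port change in
one pod only requires refreshing the gateway in the other.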
**Mixed VLAN/VxLAN**
To achieve cross pod L2 networking, L2GW will be used to connect the L2
networks in different pods. Using L2GW should work for both the Shared VxLAN
and the Mixed VLAN/VxLAN scenario.
When an L2GW is connected with a local network in the same OpenStack instance,
no matter whether it's VLAN, VxLAN or GRE, the L2GW should be able to connect
the local network, and because L2GW is an extension of Neutron, the network
UUID alone should be enough for L2GW to connect the local network.
When an admin user creates a network in the Tricircle, he/she specifies the
network type as one of the network types discussed above. In the phase of
creating the network in the Tricircle, only one record is saved in the
database, and no network will be created in the bottom OpenStack.
After the network in the bottom pod is created successfully, the network
information such as segment id, network name and network type needs to be
retrieved, and this network in the bottom pod is made one of the segments of
the network in the Tricircle.
In the Tricircle, a network could be created by a tenant or an admin. A tenant
has no way to specify the network type and segment id, so the default network
type will be used instead. When a user uses the network to boot a VM,
``Nova API-GW`` checks the network type. For a Mixed VLAN/VxLAN network,
``Nova API-GW`` first creates the network in the bottom OpenStack without
specifying network type and segment ID, then updates the top network with the
bottom network segmentation information returned by the bottom OpenStack.
A user creates network ``Net1``. The plugin checks the configuration, and if
``tenant_network_type`` equals ``mixed_vlan_vxlan``, it will invoke the mixed
VLAN and VxLAN driver. The driver needs to do nothing since the segment is
allocated in the bottom pod.
A user creates VM1 in AZ1, and connects it to the network ``Net1``. The VM is
booted in bottom ``POD1``, and ``Nova API-GW`` creates a network in ``POD1``,
queries the detailed network segmentation information (using the admin role),
gets the network type and segment id, then adds this new segment to ``Net1``
in the Tricircle ``Neutron API Server``.
Then the user creates another VM, VM2, with AZ info AZ2, so the VM should be
booted in bottom ``POD2`` which is located in AZ2. When VM2 is to be booted
in AZ2, ``Nova API-GW`` also creates a network in ``POD2``, queries the
network information including segment and network type, and adds this new
segment to ``Net1`` in the Tricircle ``Neutron API Server``.
The Tricircle plugin detects that ``Net1`` includes more than one network
segment, and calls the L2GW driver to start an async job for cross pod
networking for ``Net1``. The L2GW driver will create L2GW1 in ``POD1`` and
L2GW2 in ``POD2``. In ``POD1``, L2GW1 will connect the local ``Net1`` and
create an L2GW remote connection to L2GW2, then populate L2GW1 with the MAC/IP
information of the ports which reside in ``POD2``. In ``POD2``, L2GW2 will
connect the local ``Net1`` and create an L2GW remote connection to L2GW1, then
populate L2GW2 with the MAC/IP information of the ports which reside in
``POD1``.
The L2GW driver in the Tricircle will also detect new port creation/deletion
API calls. If a port (MAC/IP) is created or deleted in ``POD1``, it needs to
refresh the L2GW2 MAC/IP information. If a port (MAC/IP) is created or deleted
in ``POD2``, it needs to refresh the L2GW1 MAC/IP information.
Whether to populate MAC/IP information should be configurable according to
L2GW capability. MAC/IP information is populated only for the ports that do
not reside in the same pod.
**L3 bridge network**
Current implementation without cross pod L2 networking.
* A special bridge network is created and connected to the routers in
different bottom OpenStack instances. We configure the extra routes of the routers
to route the packets from one OpenStack to another. In current
implementation, we create this special bridge network in each bottom
OpenStack with the same ``VLAN ID``, so we have an L2 network to connect
the routers.
Difference between L2 networking for tenant's VM and for L3 bridging network.
* The creation of the bridge network is triggered when attaching a router
interface or adding a router external gateway.
* The creation of the L2 network for a VM is triggered by ``Nova API-GW`` when
a VM is to be created in one pod and ``Nova API-GW`` finds that there is no
network there. The network will then be created before the VM is booted, since
a network or port parameter is required to boot a VM. The IP/MAC for the VM is
allocated in the ``Tricircle`` top layer, to avoid the IP/MAC collisions that
could occur if they were allocated separately in bottom pods.
After cross pod L2 networking is introduced, the L3 bridge network should
be updated too.
L3 bridge network N-S (North-South):
* For each tenant, one cross pod N-S bridge network should be created for router
N-S inter-connection. Just replace the current shared VLAN N-S bridge network
with the corresponding Shared VxLAN or Mixed VLAN/VxLAN network.
L3 bridge network E-W (East-West):
* When a router interface is attached, for Shared VLAN, the current process
to establish the E-W bridge network is kept. For Shared VxLAN and Mixed
VLAN/VxLAN, if an L2 network is able to expand to the current pod, then just
expand the L2 network to the pod. All E-W traffic will then go out through the
local L2 network, and no bridge network is needed.
* For example, with (Net1, Router1) in ``Pod1`` and (Net2, Router1) in ``Pod2``,
if ``Net1`` is a cross pod L2 network and can be expanded to ``Pod2``, then
``Net1`` will just be expanded to ``Pod2``. After the ``Net1`` expansion (just
like cross pod L2 networking spreads one network into multiple pods), it'll
look like (Net1, Router1) in ``Pod1``, (Net1, Net2, Router1) in ``Pod2``. In
``Pod2``, there is no VM in ``Net1``; it is there only for E-W traffic. Now
the E-W traffic will look like this:
from Net2 to Net1:
Net2 in Pod2 -> Router1 in Pod2 -> Net1 in Pod2 -> L2GW in Pod2 ---> L2GW in
Pod1 -> Net1 in Pod1.
Note: The traffic from ``Net1`` in ``Pod2`` to ``Net1`` in ``Pod1`` can bypass
the L2GW in ``Pod2``. That means outbound traffic can bypass the local L2GW if
the remote VTEP of the L2GW is known to the local compute node, so that a
VxLAN encapsulated packet from the local compute node could be routed to the
remote L2GW directly. It's up to the L2GW implementation. Because the inbound
traffic goes through the L2GW, the inbound traffic to the VM will not be
impacted by VM migration from one host to another.
If ``Net2`` is a cross pod L2 network and can be expanded to ``Pod1`` too, then
``Net2`` will just be expanded to ``Pod1``. After the ``Net2`` expansion (just
like cross pod L2 networking spreads one network into multiple pods), it'll
look like (Net2, Net1, Router1) in ``Pod1``, (Net1, Net2, Router1) in
``Pod2``. In ``Pod1``, there is no VM in ``Net2``; it is there only for E-W
traffic. Now the E-W traffic will look like this:
from ``Net1`` to ``Net2``:
Net1 in Pod1 -> Router1 in Pod1 -> Net2 in Pod1 -> L2GW in Pod1 ---> L2GW in
Pod2 -> Net2 in Pod2.
To limit the complexity, a network's az_hint can be specified only at creation
time, and no update is allowed. If az_hint needs to be updated, you have
to delete the network and create it again.
If the network can't be expanded, then an E-W bridge network is needed. For
example, Net1(AZ1, AZ2, AZ3), Router1; Net2(AZ4, AZ5, AZ6), Router1.
Then a cross pod L2 bridge network has to be established:
Net1(AZ1, AZ2, AZ3), Router1 --> E-W bridge network ---> Router1,
Net2(AZ4, AZ5, AZ6).
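
The E-W decision above (expand the network, or fall back to a bridge network)
can be sketched as follows. The function name and return values are
hypothetical, and judging "able to expand" purely from the network's az_hint
is an assumption; a real implementation may apply further checks.

```python
from typing import Sequence

# Network types that can be expanded to another pod through L2GW.
EXPANDABLE_TYPES = ("shared_vxlan", "mixed_vlan_vxlan")


def ew_connectivity(network_type: str, az_hints: Sequence[str],
                    target_az: str) -> str:
    """Return "expand" if the network can reach ``target_az``, else "bridge".

    Shared VLAN always keeps the existing bridge network process. For the
    L2GW based types, an empty az_hint means the network may go to any AZ.
    """
    if network_type in EXPANDABLE_TYPES and (not az_hints or target_az in az_hints):
        return "expand"
    return "bridge"
```

With the example above, ``Net1`` hinted to AZ1, AZ2 and AZ3 cannot be expanded
to AZ4, so the call returns ``"bridge"`` and an E-W bridge network is used.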
Assignee(s)
------------
Primary assignee:
Other contributors:
Work Items
------------
Dependencies
============
None
Testing
=======
None
Documentation Impact
====================
None
References
==========
[1] https://docs.google.com/document/d/18kZZ1snMOCD9IQvUKI5NVDzSASpw-QKj7l2zNqMEd3g/
[2] https://review.openstack.org/#/c/270786/
[3] https://github.com/openstack/networking-l2gw/blob/master/specs/kilo/l2-gateway-api.rst
[4] http://developer.openstack.org/api-ref-networking-v2-ext.html#networks-multi-provider-ext
[5] http://docs.openstack.org/mitaka/networking-guide/adv-config-availability-zone.html
[6] https://review.openstack.org/#/c/306224/