Retire swift-specs

Depends-On: https://review.opendev.org/732999
Change-Id: I8acff8e7c07f3e0f599d86d503eb4b088c0f8521
@@ -1,7 +0,0 @@
[run]
branch = True
source = swift-specs
omit = swift-specs/tests/*,swift-specs/openstack/*

[report]
ignore_errors = True
.gitignore
@@ -1,51 +0,0 @@
*.py[cod]

# C extensions
*.so

# Packages
*.egg
*.egg-info
dist
build
eggs
parts
bin
var
sdist
develop-eggs
.installed.cfg
lib
lib64

# Installer logs
pip-log.txt

# Unit test / coverage reports
.coverage
.tox
nosetests.xml
.testrepository

# Translations
*.mo

# Mr Developer
.mr.developer.cfg
.project
.pydevproject

# Complexity
output/*.html
output/*/index.html

# Sphinx
doc/build

# pbr generates these
AUTHORS
ChangeLog

# Editors
*~
.*.swp
.mailmap
@@ -1,3 +0,0 @@
# Format is:
# <preferred e-mail> <other e-mail 1>
# <preferred e-mail> <other e-mail 2>
@@ -1,7 +0,0 @@
[DEFAULT]
test_command=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \
             OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \
             OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-60} \
             ${PYTHON:-python} -m subunit.run discover -t ./ . $LISTOPT $IDOPTION
test_id_option=--load-list $IDFILE
test_list_option=--list
@@ -1,20 +0,0 @@
===========================
Contributing to swift-specs
===========================

HowToContribute
---------------

If you would like to contribute to the development of OpenStack,
you must follow the steps in this page:

http://docs.openstack.org/infra/manual/developers.html

GerritWorkFlow
--------------

Once those steps have been completed, changes to OpenStack
should be submitted for review via the Gerrit tool, following
the workflow documented at:

http://docs.openstack.org/infra/manual/developers.html#development-workflow
LICENSE
@@ -1,3 +0,0 @@
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

http://creativecommons.org/licenses/by/3.0/legalcode
@@ -1,5 +0,0 @@
include LICENCE
exclude .gitignore
exclude .gitreview

global-exclude *.pyc
README.rst
@@ -1,82 +1,9 @@
========================
Team and repository tags
========================
This project is no longer maintained.

.. image:: http://governance.openstack.org/badges/swift-specs.svg
   :target: http://governance.openstack.org/reference/tags/index.html
The contents of this repository are still available in the Git
source code management system. To see the contents of this
repository before it reached its end of life, please check out the
previous commit with ``git checkout HEAD^1``.

.. Change things from this point on

======================
Swift Specs Repository
======================

This archive is no longer active. Content is kept for historic purposes.
========================================================================

Documents in this repo are a collection of ideas. They are not
necessarily a formal design for a feature, nor are they docs for a
feature, nor are they a roadmap for future features.

This is a git repository for doing design review on enhancements to
OpenStack Swift. This provides an ability to ensure that everyone
has signed off on the approach to solving a problem early on.

Repository Structure
====================
The structure of the repository is as follows::

    specs/
        done/
        in_progress/

Implemented specs will be moved to :ref:`done-directory`
once the associated code has landed.

The Flow of an Idea from your Head to Implementation
====================================================
First propose a spec to the ``in_progress`` directory so that it can be
reviewed. Reviewers adding a positive +1/+2 review in gerrit are promising
that they will review the code when it is proposed. Spec documents should be
approved and merged as soon as possible, and spec documents in the
``in_progress`` directory can be updated as often as needed. Iterate on it.

#. Have an idea
#. Propose a spec
#. Reviewers review the spec. As soon as 2 core reviewers like something,
   merge it. Iterate on the spec as often as needed, and keep it updated.
#. Once there is agreement on the spec, write the code.
#. As the code changes during review, keep the spec updated as needed.
#. Once the code lands (with all necessary tests and docs), the spec can be
   moved to the ``done`` directory. If a feature needs a spec, it needs
   docs, and the docs must land before or with the feature (not after).

Spec Lifecycle Rules
====================
#. Land quickly: A spec is a living document, and lives in the repository
   not in gerrit. Potential users, ops and developers will look at
   http://specs.openstack.org/openstack/swift-specs/ to get an idea of what's
   being worked on, so they need to be quick to land.

#. Initial version is an idea not a technical design: That way the merits of
   the idea can be discussed and landed and not stuck in gerrit limbo land.

#. Second version is an overview of the technical design: This will aid in the
   technical discussions amongst the community.

#. Subsequent versions improve/enhance technical design: Each of these
   versions should be relatively small patches to the spec to keep rule #1. And
   keeps the spec up to date with the progress of the implementation.

How to ask questions and get clarifications about a spec
========================================================
Naturally you'll want clarifications about the way a spec is written. To ask
questions, propose a patch to the spec (via the normal patch proposal tools)
with your question or your understanding of the confusing part. That will
raise the issue in a patch review and allow everyone to answer or comment.

Learn As We Go
==============
This is a new way of attempting things, so we're going to be low in
process to begin with to figure out where we go from here. Expect some
early flexibility in evolving this effort over time.
Historical content may still be viewed at
http://specs.openstack.org/openstack/swift-specs/
@@ -1,90 +0,0 @@
# -*- coding: utf-8 -*-
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import datetime
import os
import sys

sys.path.insert(0, os.path.abspath('../..'))
# -- General configuration ----------------------------------------------------

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
extensions = [
    'sphinx.ext.autodoc',
    'oslosphinx',
    'yasfb',
]

# Feed configuration for yasfb
feed_base_url = 'http://specs.openstack.org/openstack/swift-specs'
feed_author = 'OpenStack Swift Team'

exclude_patterns = [
    '**/test.rst',
    'template_link.rst',
]

# Optionally allow the use of sphinxcontrib.spelling to verify the
# spelling of the documents.
try:
    import sphinxcontrib.spelling
    extensions.append('sphinxcontrib.spelling')
except ImportError:
    pass

# autodoc generation is a bit aggressive and a nuisance when doing heavy
# text edit cycles.
# execute "export SPHINX_DEBUG=1" in your terminal to disable

# The suffix of source filenames.
source_suffix = '.rst'

# The master toctree document.
master_doc = 'index'

# General information about the project.
project = u'swift-specs'
copyright = u'%s, OpenStack Foundation' % datetime.date.today().year

# If true, '()' will be appended to :func: etc. cross-reference text.
add_function_parentheses = True

# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
add_module_names = True

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'

# -- Options for HTML output --------------------------------------------------

# The theme to use for HTML and HTML Help pages. Major themes that come with
# Sphinx are currently 'default' and 'sphinxdoc'.
# html_theme_path = ["."]
# html_theme = '_theme'
# html_static_path = ['static']

# Output file base name for HTML help builder.
htmlhelp_basename = '%sdoc' % project

# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title, author, documentclass
# [howto/manual]).
latex_documents = [
    ('index',
     '%s.tex' % project,
     u'%s Documentation' % project,
     u'OpenStack Foundation', 'manual'),
]
@@ -1 +0,0 @@
.. include:: ../../CONTRIBUTING.rst
@@ -1,41 +0,0 @@
Swift Design Specifications
===========================


This archive is no longer active. Content is kept for historic purposes.
========================================================================

Documents in this repo are a collection of ideas. They are not
necessarily a formal design for a feature, nor are they docs for a
feature, nor are they a roadmap for future features.

.. toctree::
   :glob:
   :maxdepth: 1

   specs/in_progress/*

Specifications Repository Information
=====================================

.. toctree::
   :glob:
   :maxdepth: 2

   *

Archived Specs
==============

.. toctree::
   :glob:
   :maxdepth: 1

   specs/done/*

Indices and tables
==================

* :ref:`genindex`
* :ref:`search`
@@ -1 +0,0 @@
.. include:: ../../README.rst
@@ -1 +0,0 @@
../../specs/
@@ -1 +0,0 @@
.. include:: ../../template.rst
@@ -1,3 +0,0 @@
oslosphinx
sphinx>=1.1.2,<1.2
yasfb>=0.5.1
setup.cfg
@@ -1,25 +0,0 @@
[metadata]
name = swift-specs
summary = OpenStack Swift Development Specifications
description-file =
    README.rst
author = OpenStack
author-email = openstack-dev@lists.openstack.org
home-page = http://www.openstack.org/
classifier =
    Environment :: OpenStack
    Intended Audience :: Developers
    Operating System :: POSIX :: Linux

[build_sphinx]
source-dir = doc/source
build-dir = doc/build
all_files = 1

[pbr]
warnerrors = True
skip_authors = True
skip_changelog = True

[upload_sphinx]
upload-dir = doc/build/html
setup.py
@@ -1,22 +0,0 @@
#!/usr/bin/env python
# Copyright (c) 2013 Hewlett-Packard Development Company, L.P.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# THIS FILE IS MANAGED BY THE GLOBAL REQUIREMENTS REPO - DO NOT EDIT
import setuptools

setuptools.setup(
    setup_requires=['pbr'],
    pbr=True)
@@ -1,15 +0,0 @@
.. _done-directory:

The ``Done`` Directory
======================

This directory in the specs repo is where specs are moved once the
associated code patch has been merged into its respective repo.

Historical Reference
--------------------

A spec document in this directory is meant only for historical
reference, it does not equate to docs for the feature. Swift's
documentation for implemented features is published
`here <http://docs.openstack.org/developer/swift/>`_.
@@ -1,872 +0,0 @@
::

    This work is licensed under a Creative Commons Attribution 3.0
    Unported License.
    http://creativecommons.org/licenses/by/3.0/legalcode

====================
Erasure Code Support
====================

This is a living document, to be updated as the team iterates on the design;
all details here reflect current thinking and are subject to change as
development progresses. The team makes use of Trello to track the more
real-time discussion activity that, as details and thoughts emerge, is
captured in this document.

The Trello discussion board can be found at this `link <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_.
Major remaining tasks are identified by a number that can be found in a corresponding Trello card. Outstanding
tasks are listed at the end of each section in this document. As this doc is updated and/or Trello cards are
completed, please be sure to update both places.

WIP Revision History:

* 7/25, updated meta picture, specify that object metadata is system, redo reconstructor section
* 7/31, added traceability to trello cards via section numbers and numbered task items, added a bunch of sections
* 8/5, updated middleware section, container_sync section, removed 3.3.3.7 as dup, refactoring section, create common interface to proxy nodes, partial PUT dependency on obj sysmeta patch, sync'd with trello
* 8/23, many updates to reconstructor section based on con-call from 8/22. Also added notes about not deleting on PUT where relevant and updated sections referencing closed Trello cards
* 9/4, added section in reconstructor on concurrency
* 10/7, reconstructor section updates - lots of them
* 10/14, more reconstructor section updates, 2 phase commit intro - misc typos as well from review
* 10/15, few clarifications from F2F review and bigger rewording/implementation change for what was called 2 phase commit
* 10/17, misc clarifying notes on .durable stuff
* 11/13: IMPORTANT NOTE: Several aspects of the reconstructor are being re-worked; the section will be updated ASAP
* 12/16: reconstructor updates, few minor updates throughout.
* 2/3: reconstructor updates
* 3/23: quick scrub to bring things in line w/current implementation
* 4/14: EC has been merged to master. Some parts of this spec are no longer the authority on the design; please review the code on master and the user documentation.

1. Summary
----------
EC is implemented in Swift as a Storage Policy, see `docs <http://docs.openstack.org/developer/swift/overview_policies.html>`_
for complete details on Storage Policies.

EC support impacts many of the code paths and background operations for data stored in a
container that was created with an EC policy; however, this is all transparent to users of
the cluster. In addition to fully leveraging the Storage Policy framework, the EC design
updates the storage policy classes such that new policies, like EC, are subclasses of a
generic base policy class. Major code paths (PUT/GET) are updated to accommodate the
different needs of encode/decode versus replication, and a new daemon, the EC
reconstructor, performs the equivalent job that the replicator performs for replication
policies. The other major daemons remain, for the most part, unchanged, as another key
concept for EC is that EC fragments (see terminology section below) are seen as regular
objects by the majority of services, thus minimizing the impact on the existing code base.

The Swift code base doesn't include any of the algorithms necessary to perform the actual
encoding and decoding of data; that is left to an external library. The Storage Policies
architecture is leveraged to allow EC on a per-container basis, and the object rings still
provide for placement of EC data fragments. Although there are several code paths that are
unique to an operation associated with an EC policy, an external dependency on an erasure
code library is what Swift counts on to perform the low level EC functions. The use of an
external library allows for maximum flexibility, as there are a significant number of
options out there, each with its own pros and cons that can vary greatly from one use case
to another.

2. Problem description
|
||||
======================
|
||||
|
||||
The primary aim of EC is to reduce the storage costs associated with massive amounts of data
|
||||
(both operating costs and capital costs) by providing an option that maintains the same, or
|
||||
better, level of durability using much less disk space. See this `study <http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/big-data-amplidata-storage-paper.pdf>`_
|
||||
for more details on this claim.
|
||||
|
||||
EC is not intended to replace replication as it may not be appropriate for all usage models.
|
||||
We expect some performance and network usage tradeoffs that will be fully characterized once
|
||||
sufficient code is in place to gather empirical data. Current thinking is that what is typically
|
||||
referred to as 'cold storage' is the most common/appropriate use of EC as a durability scheme.
|
||||
|
||||
3. Proposed change
|
||||
==================
|
||||
|
||||
3.1 Terminology
|
||||
-----------------
|
||||
|
||||
The term 'fragment' has been used already to describe the output of the EC process (a series of
|
||||
fragments) however we need to define some other key terms here before going any deeper. Without
|
||||
paying special attention to using the correct terms consistently, it is very easy to get confused
|
||||
in a hurry!
|
||||
|
||||
* segment: not to be confused with the SLO/DLO use of the word; in EC we call a segment a series of consecutive HTTP chunks buffered up before performing an EC operation.
|
||||
* fragment: data and parity 'fragments' are generated when erasure coding transformation is applied to a segment.
|
||||
* EC archive: A concatenation of EC fragments; to a storage node this looks like an object
|
||||
* ec_k: number of EC data fragments (k is commonly used in the EC community for this purpose)
|
||||
* ec_m: number of EC parity fragments (m is commonly used in the EC community for this purpose)
|
||||
* chunk: HTTP chunks received over wire (term not used to describe any EC specific operation)
|
||||
* durable: original data is available (either with or without reconstruction)
|
||||
* quorum: the minimum number of data + parity elements required to be able to guarantee the desired fault tolerance, which is the number of data elements supplemented by the minimum number of parity elements required by the chosen erasure coding scheme. For example, for Reed-Solomon, the minimum number of parity elements required is 1, and thus the quorum_size requirement is ec_ndata + 1. Given that the number of parity elements required is not the same for every erasure coding scheme, consult PyECLib via min_parity_fragments_needed()
|
||||
* fully durable: all EC archives are written and available
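
To make the quorum definition above concrete, here is a minimal Python sketch of the arithmetic it describes; ``min_parity`` stands in for whatever PyECLib's ``min_parity_fragments_needed()`` reports for the chosen scheme (1 for Reed-Solomon), and the names are illustrative rather than code from the implementation::

    def quorum_size(ec_k, min_parity=1):
        # data fragments plus the scheme's minimum parity fragments
        return ec_k + min_parity

    # a 10+4 Reed-Solomon policy can acknowledge a PUT after 11 fragments land
    assert quorum_size(10) == 11
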
|
||||
|
||||
3.2 Key Concepts
|
||||
----------------
|
||||
|
||||
* EC is a Storage Policy with its own ring and configurable set of parameters. The # of replicas for an EC ring is the total number of data plus parity elements configured for the chosen EC scheme.
|
||||
* Proxy server buffers a configurable amount of incoming data and then encodes it via PyECLib; we call this a 'segment' of an object.
|
||||
* Proxy distributes the output of the encoding of a segment to the various object nodes it gets from the EC ring, we call these 'fragments' of the segment
|
||||
* Each fragment carries opaque metadata for use by the PyECLib
|
||||
* Object metadata is used to store meta about both the fragments and the objects
|
||||
* An 'EC Archive' is what's stored on disk and is a collection of fragments appended
|
||||
* The EC archive's container metadata contains information about the original object, not the EC archive
|
||||
* Here is a 50K foot overview:
|
||||
|
||||
.. image:: images/overview.png
|
||||
|
||||
3.3 Major Change Areas
|
||||
----------------------
|
||||
|
||||
**Dependencies/Requirements**
|
||||
|
||||
See template section at the end
|
||||
|
||||
3.3.1 **Storage Policy Classes**
|
||||
|
||||
The feature/ec branch modifies how policies are instantiated in order to
|
||||
support the new EC policy.
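
As a rough illustration of that instantiation change, the sketch below shows one way a generic base policy class could let an EC policy override behaviour such as quorum; the class and attribute names are hypothetical and not the ones on the feature/ec branch::

    class BaseStoragePolicy(object):
        def __init__(self, idx, name):
            self.idx = idx
            self.name = name

        @property
        def quorum(self):
            raise NotImplementedError

    class ReplicationStoragePolicy(BaseStoragePolicy):
        def __init__(self, idx, name, replicas=3):
            super(ReplicationStoragePolicy, self).__init__(idx, name)
            self.replicas = replicas

        @property
        def quorum(self):
            # a simple majority of the replicas
            return self.replicas // 2 + 1

    class ECStoragePolicy(BaseStoragePolicy):
        def __init__(self, idx, name, ec_k, ec_m):
            super(ECStoragePolicy, self).__init__(idx, name)
            self.ec_k = ec_k
            self.ec_m = ec_m

        @property
        def quorum(self):
            # data fragments plus the minimum parity (1 for Reed-Solomon)
            return self.ec_k + 1
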
|
||||
|
||||
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section:
|
||||
|
||||
3.3.1.2: Make quorum a policy based function (IMPLEMENTED)
|
||||
|
||||
3.3.2 **Middleware**
|
||||
|
||||
Middleware remains unchanged. For most middleware (e.g., SLO/DLO) the fact that the
|
||||
proxy is fragmenting incoming objects is transparent. For list endpoints, however, it
|
||||
is a bit different. A caller of list endpoints will get back the locations of all of
|
||||
the fragments. The caller will be unable to re-assemble the original object with this information,
|
||||
however the node locations may still prove to be useful information for some applications.
|
||||
|
||||
3.3.3 **Proxy Server**
|
||||
|
||||
Early on it did not appear that any major refactoring would be needed
to accommodate EC in the proxy; however, that doesn't mean it's not a good
opportunity to review what options might make sense right now. Discussions have included:

* should we consider a clearer line between handling incoming requests and talking to the back-end servers?
  Yes, it makes sense to do this. There is a Trello card tracking this work and it is covered in a section later below.
* should the PUT path be refactored just because it's huge and hard to follow?
  Opportunistic refactoring makes sense; however, it is not felt that a full refactor of PUT
  should be combined with this EC effort. YES! This is active WIP.
* should we consider different controllers (like an 'EC controller')?
  Well, probably... YES! This is active WIP.
|
||||
|
||||
The following summarizes proxy changes to support EC:
|
||||
|
||||
*TODO: there are current discussion underway on Trello that affect both of these flows*
|
||||
|
||||
**Basic flow for a PUT:**
|
||||
#. Proxy opens (ec_k + ec_m) backend requests to object servers
|
||||
#. Proxy buffers HTTP chunks up-to a minimum segment size (defined at 1MB to start with)
|
||||
#. Proxy feeds the assembled segment to PyECLib's encode() to get ec_k + ec_m fragments
|
||||
#. Proxy sends the (ec_k + ec_m) fragments to the object servers to be _appended_ to the previous set
|
||||
#. Proxy then continues with the next set of HTTP chunks
|
||||
#. Object servers store objects which are EC archives (their contents are the concatenation of erasure coded fragments)
|
||||
#. Object metadata changes: for 'etag', we store the md5sum of the EC archive object, as opposed to the non-EC case where we store md5sum of the entire object
|
||||
#. Upon a quorum of responses and some minimal number (2) of commit confirmations, respond to the client
|
||||
#. Upon receipt of the commit message (part of a MIME conversation) storage nodes store 0 byte data file as timestamp.durable for respective object
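
The buffer-and-encode steps in the PUT flow above can be sketched against PyECLib's ``ECDriver`` interface as follows; the 1MB segment size and the ``ec_type`` value are illustrative choices, not settings fixed by this spec::

    from pyeclib.ec_iface import ECDriver

    SEGMENT_SIZE = 1048576  # buffer roughly 1MB of HTTP chunks per segment

    driver = ECDriver(k=4, m=2, ec_type='liberasurecode_rs_vand')

    def encode_segments(chunk_iter):
        """Yield one list of (ec_k + ec_m) fragments per buffered segment."""
        buf = b''
        for chunk in chunk_iter:
            buf += chunk
            while len(buf) >= SEGMENT_SIZE:
                yield driver.encode(buf[:SEGMENT_SIZE])
                buf = buf[SEGMENT_SIZE:]
        if buf:
            yield driver.encode(buf)

Each yielded fragment list maps one-to-one onto the (ec_k + ec_m) open object-server connections, which is what keeps a fragment index tied to a node position.
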
|
||||
|
||||
**Proxy HTTP PUT request handling changes**
|
||||
#. Intercept EC request based on policy type
|
||||
#. Validate ring replica count against (ec_k + ec_m)
|
||||
#. Calculate EC quorum size for min_conns
|
||||
#. Call into PyEClib to encode to client_chunk_size sized object chunks to generate (ec_k + ec_m) EC fragments.
|
||||
#. Queue chunk EC fragments for writing to nodes
|
||||
#. Introduce Multi-phase Commit Conversation
|
||||
|
||||
**Basic flow for a GET:**
|
||||
#. Proxy opens ec_k backend concurrent requests to object servers. See Trello card 3.3.3.3
|
||||
#. Proxy 1) validates that the number of successful connections is >= ec_k, 2) checks that the available fragment archives returned by the object servers are the same version, and
   3) continues searching through the handoff nodes (ec_k + ec_m) if not enough data is found. See Trello card 3.3.3.6
|
||||
#. Proxy reads from the first ec_k fragment archives concurrently.
|
||||
#. Proxy buffers the content to a segment up-to the minimum segment size.
|
||||
#. Proxy feeds the assembled segment to PyECLib's decode() to get the original content.
|
||||
#. Proxy sends the original content to Client.
|
||||
#. Proxy then continues with the next segment of contents.
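
The matching decode side of the GET flow is, in sketch form, just the inverse call; any ec_k of a segment's fragments are enough for ``ECDriver.decode()`` to return the original segment (the helper name is illustrative)::

    def decode_segments(driver, fragment_lists):
        """fragment_lists yields, per segment, at least ec_k fragments."""
        for fragments in fragment_lists:
            yield driver.decode(fragments)
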
|
||||
|
||||
**Proxy HTTP GET request handling changes**
|
||||
|
||||
*TODO - add high level flow*
|
||||
|
||||
*Partial PUT handling*
|
||||
|
||||
NOTE: This is active WIP on trello.
|
||||
|
||||
When a previous PUT fails in the middle, for whatever reason and regardless of how the response
|
||||
was sent to the client, there can be various scenarios at the object servers that require the
|
||||
proxy to make some decisions about what to do. Note that because the object servers will not
|
||||
return data for .data files that don't have a matching .durable file, it's not possible for
the proxy to get un-reconstructable data unless there's a combination of a partial PUT and
|
||||
a rebalance going on (or handoff scenario). Here are the basic rules for the proxy when it
|
||||
comes to interpreting its responses when they are mixed::
|
||||
|
||||
If I have all of one timestamp, feed to PyECLib
|
||||
If PYECLib says OK
|
||||
I'm done, move on to next segment
|
||||
Else
|
||||
Fail the request (had sufficient segments but something bad happened)
|
||||
Else I have a mix of timestamps;
|
||||
Because they all have to be reconstructable, choose the newest
|
||||
Feed to PYECLib
|
||||
If PYECLib says OK
|
||||
I'm done, move on to next segment
|
||||
Else
|
||||
It's possible that the newest timestamp I chose didn't have enough segments yet
|
||||
because, although each object server claims they're reconstructable, maybe
|
||||
a rebalance or handoff situation has resulted in some of those .data files
|
||||
residing elsewhere right now. In this case, I want to look into the
|
||||
available timestamp headers that came back with the GET and see what else
|
||||
is reconstructable and go with that for now. This is really a corner case
|
||||
because we will restrict moving partitions around such that enough archives
|
||||
should be found at any given point in time but someone might move too quickly
|
||||
so now the next check is...
|
||||
Choose the latest available timestamp in the headers and re-issue GET
|
||||
If PYECLib says OK
|
||||
I'm done, move on to next segment
|
||||
Else
|
||||
Fail the request (had sufficient segments but something bad happened) or
|
||||
we can consider going to the next latest header....
|
||||
|
||||
**Region Support**
|
||||
|
||||
For at least the initial version of EC, it is not recommended that an EC scheme span beyond a
single region; neither performance nor functional validation will have been done in such
a configuration.
|
||||
|
||||
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
|
||||
|
||||
* 3.3.3.5: CLOSED
|
||||
|
||||
* 3.3.3.9: Multi-Phase Commit Conversation
|
||||
|
||||
In order to help solve the local data file cleanup problem, a multi-phase commit scheme is introduced
|
||||
for EC PUT operations (last few steps above). The implementation will be via MIME documents such that
|
||||
a conversation between the proxy and the storage nodes is had for every PUT. This provides us with the
|
||||
ability to handle a PUT in one connection and assure that we have "the essence" of a 2 phase commit,
|
||||
basically having the proxy communicate back to the storage nodes once it has confirmation that all
|
||||
fragment archives in the set have been committed. Note that we still require a quorum of data elements
|
||||
of the conversation to complete before signaling status to the client but we can relax that requirement
|
||||
for the commit phase such that only 2 confirmations to that phase of the conversation are required for
|
||||
success. More will be said about this in the reconstructor section.
|
||||
|
||||
Now the storage node has a cheap indicator of the last known durable set of fragment archives for a given
|
||||
object on a successful durable PUT. The reconstructor will also play a role in the managing of the
|
||||
.durable files, either propagating it or creating one post-reconstruction. The presence of a ts.durable
|
||||
file means, to the object server, "there is a set of ts.data files that are durable at timestamp ts."
|
||||
See reconstructor section for more details and use cases on .durable files. Note that the completion
|
||||
of the commit phase of the conversation is also a signal for the object server to go ahead and immediately
|
||||
delete older timestamp files for this object (for EC they are not immediately deleted on PUT). This is
|
||||
critical as we don't want to delete the older object until the storage node has confirmation from the
|
||||
proxy, via the multi-phase conversation, that the other nodes have landed enough for a quorum.
|
||||
|
||||
On the GET side, the implication here is that storage nodes will return the TS with a matching .durable
|
||||
file even if it has a newer .data file. If there exists a .data file on one node without a .durable file but
|
||||
that same timestamp has both a .data and a .durable on another node, the proxy is free to use the .durable
|
||||
timestamp series as the presence of just one .durable in the set indicates that the object has integrity. In
|
||||
the event that a series of .data files exist without a .durable file, they will eventually be deleted by the
|
||||
reconstructor as they will be considered partial junk that is unreconstructable (recall that 2 .durables
|
||||
are required for determining that a PUT was successful).
|
||||
|
||||
Note that the intention is that this section/trello card covers the multi-phase commit
|
||||
implementation at both proxy and storage nodes however it doesn't cover the work that
|
||||
the reconstructor does with the .durable file.
|
||||
|
||||
A few key points on the .durable file:
|
||||
|
||||
* the .durable file means "the matching .data file for this has sufficient fragment archives somewhere, committed, to reconstruct the object"
|
||||
* the proxy server will never have knowledge (on GET or HEAD) of the existence of a .data file on an object server if it doesn't have a matching .durable file
|
||||
* the object server will never return a .data that doesn't have a matching .durable
|
||||
* the only component that messes with .data files that don't have matching .durable files is the reconstructor
|
||||
* when a proxy does a GET, it will only receive fragment archives that have enough present somewhere to be reconstructed
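
A minimal sketch of how an object server could apply those rules when choosing what to serve, assuming on-disk names of the form ``<timestamp>#<frag-index>.data`` and ``<timestamp>.durable`` (this is illustrative, not the final diskfile code)::

    import os

    def choose_data_file(datadir):
        """Return the newest .data file whose timestamp has a .durable marker."""
        names = os.listdir(datadir)
        durable = set(n[:-len('.durable')] for n in names if n.endswith('.durable'))
        candidates = []
        for n in names:
            if n.endswith('.data'):
                timestamp = n.split('#')[0].replace('.data', '')
                if timestamp in durable:
                    candidates.append((timestamp, n))
        if not candidates:
            return None  # nothing durable yet; leave it to the reconstructor
        return max(candidates)[1]
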
|
||||
|
||||
3.3.3.8: Create common interface for proxy-->nodes
|
||||
|
||||
NOTE: This ain't gonna happen as part of the EC effort
|
||||
|
||||
Creating a common module that allows for abstracted access to the a/c/s nodes would not only clean up
|
||||
much of the proxy IO path but would also prevent the introduction of EC from further
|
||||
complicating, for example, the PUT path. Think about an interface that would let proxy code
|
||||
perform generic actions to a back-end node regardless of protocol. The proposed API
|
||||
should be updated here and reviewed prior to implementation, and it is felt that it can be done
in parallel with existing EC proxy work (no dependencies, and that work is small enough that it can
be merged).
|
||||
|
||||
3.3.3.6: Object overwrite and PUT error handling
|
||||
|
||||
What's needed here is a mechanism to assure that we can handle partial write failures. Note: in both cases the client will get a failure back; however, without additional changes,
each storage node that saved an EC fragment archive will effectively have an orphan.
|
||||
|
||||
a) less than a quorum of nodes is written
|
||||
b) quorum is met but not all nodes were written
|
||||
|
||||
and in both cases there are implications to both PUT and GET at both the proxy
|
||||
and object servers. Additionally, the reconstructor plays a role here in cleaning up
|
||||
any old EC archives that result from the scheme described here (see reconstructor
|
||||
for details).
|
||||
|
||||
**High Level Flow**
|
||||
|
||||
* If storing an EC archive fragment, the object server should not delete older .data file unless it has a new one with a matching .durable.
|
||||
* When the object server handles a GET, it needs to send headers to the proxy that include all available timestamps for the .data file
* If the proxy determines it can reconstruct the object with the latest timestamp (can reach quorum) it proceeds
* If quorum can't be reached, find a timestamp where quorum can be reached, kill the existing connections (unless the body of that request was the found timestamp), and make new connections requesting the specific timestamp
* On GET, the object server needs to support requesting a specific timestamp (e.g. ?timestamp=XYZ)
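
A sketch of the proxy side of that flow: given the available timestamps each object server reports in its GET response headers, pick the newest timestamp that can still reach quorum (the data shapes and names here are assumptions for illustration)::

    from collections import Counter

    def best_reconstructable_timestamp(timestamps_by_node, quorum):
        """timestamps_by_node: dict mapping node -> list of .data timestamps."""
        counts = Counter()
        for ts_list in timestamps_by_node.values():
            counts.update(set(ts_list))
        usable = [ts for ts, n in counts.items() if n >= quorum]
        return max(usable) if usable else None
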
|
||||
|
||||
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
|
||||
|
||||
* 3.3.3.1: CLOSED
|
||||
* 3.3.3.2: Add high level GET flow
|
||||
* 3.3.3.3: Concurrent connects to object server on GET path in proxy server
|
||||
* 3.3.3.4: CLOSED
|
||||
* 3.3.3.5: Region support for EC
|
||||
* 3.3.3.6 EC PUTs should not delete old data files (in review)
|
||||
* 3.3.3.7: CLOSED
|
||||
* 3.3.3.8: Create common interface for proxy-->nodes
|
||||
* 3.3.3.9: Multi-Phase Commit Conversation
|
||||
|
||||
3.3.4 **Object Server**
|
||||
|
||||
TODO - add high level flow
|
||||
|
||||
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
|
||||
|
||||
* 3.3.4.1: Add high level Obj Serv modifications
|
||||
* 3.3.4.2: Add trailer support (affects proxy too)
|
||||
|
||||
3.3.5 **Metadata**
|
||||
|
||||
NOTE: Some of these metadata names are different in the code...
|
||||
|
||||
Additional metadata is part of the EC design in a few different areas:
|
||||
|
||||
* New metadata is introduced in each 'fragment' that is opaque to Swift, it is used by PyECLib for internal purposes.
|
||||
* New metadata is introduced as system object metadata as shown in this picture:
|
||||
|
||||
.. image:: images/meta.png
|
||||
|
||||
The object metadata will need to be stored as system metadata.
|
||||
|
||||
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
|
||||
|
||||
* 5.1: Enable sysmeta on object PUT (IMPLEMENTED)
|
||||
|
||||
3.3.6 **Database Updates**
|
||||
|
||||
We don't need/want container updates to be sent out by every storage node
|
||||
participating in the EC set and actually that is exactly how it will work
|
||||
without any additional changes, see _backend_requests() in the proxy
|
||||
PUT path for details.
|
||||
|
||||
3.3.7 **The Reconstructor**
|
||||
|
||||
**Overview**
|
||||
|
||||
The key concepts in the reconstructor design are:
|
||||
|
||||
*Focus on use cases that occur most frequently:*
|
||||
#. Recovery from disk drive failure
|
||||
#. Rebalance
|
||||
#. Ring changes and revertible handoff case
|
||||
#. Bit rot
|
||||
|
||||
* Reconstruction happens at the EC archive level (no visibility into fragment level for either auditing or reconstruction)
|
||||
* Highly leverage ssync to gain visibility into which EC archive(s) are needed (some ssync mods needed; consider renaming the verb REPLICATION since ssync can be syncing in different ways now)
|
||||
* Minimal changes to existing replicator framework, auditor, ssync
|
||||
* Implement as new reconstructor daemon (much reuse from replicator) as there will be some differences and we will want separate logging and daemon control/visibility for the reconstructor
|
||||
* Nodes in the list only act on their neighbors with regards to reconstruction (nodes don't talk to all other nodes)
|
||||
* Once a set of EC archives has been placed, the ordering/matching of the fragment index to the index of the node in the primary partition list must be maintained for handoff node usage
|
||||
* EC archives are stored with their fragment index encoded in the filename
|
||||
|
||||
**Reconstructor framework**
|
||||
|
||||
The current implementation thinking has the reconstructor live as its own daemon so
|
||||
that it has independent logging and controls. Its structure borrows heavily from
|
||||
the replicator.
|
||||
|
||||
The reconstructor will need to do a few things differently than the replicator,
|
||||
above and beyond the obvious EC functions. The major differences are:
|
||||
|
||||
* there is no longer the concept of 2 job processors that either sync or revert, instead there is a job pre-processor that figures out what needs to be done and one job processor carries out the actions needed
|
||||
* syncs only with nodes to the left and right on the partition list (not with all nodes)
|
||||
* for reversion, syncs with as many nodes as needed as determined by the fragment indexes that it is holding; the number of nodes will be equivalent to the number of unique fragment indexes that it is holding. It will use those indexes as indexes into the primary node list to determine which nodes to sync to.
|
||||
|
||||
**Node/Index Pairing**
|
||||
|
||||
The following are some scenarios that help explain why the node/fragment index pairing is so important for both of the operations just mentioned.
|
||||
|
||||
.. image:: images/handoff1.png
|
||||
|
||||
Next Scenario:
|
||||
|
||||
.. image:: images/handoff2.png
|
||||
|
||||
**Fragment Index Filename Encoding**
|
||||
|
||||
Each storage policy now must include a transformation function that diskfile will use to build the
|
||||
filename to store on disk. This is required by the reconstructor for a few reasons. For one, it
|
||||
allows us to store fragment archives of different indexes on the same storage node. This is not
|
||||
done in the happy path but is possible in some circumstances. Without unique filenames for
|
||||
the different EC archive files in a set, we would be at risk of overwriting one archive of index
|
||||
n with another of index m in some scenarios.
|
||||
|
||||
The transformation function for the replication policy is simply a NOP. For reconstruction, the index
|
||||
is appended to the filename just before the .data extension. An example filename for a fragment
|
||||
archive storing the 5th fragment would look like this::
|
||||
|
||||
    1418673556.92690#5.data
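
A sketch of the per-policy transformation function described above: the replication version is a no-op while the EC version folds the fragment index into the name (the function names are illustrative)::

    def replication_data_file_name(timestamp):
        # NOP transformation for replication policies
        return '%s.data' % timestamp

    def ec_data_file_name(timestamp, frag_index):
        # embed the fragment archive's index just before the extension
        return '%s#%d.data' % (timestamp, frag_index)

    assert ec_data_file_name('1418673556.92690', 5) == '1418673556.92690#5.data'
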
|
||||
|
||||
**Diskfile Refactoring**
|
||||
|
||||
In order to more cleanly accommodate some of the low level on-disk storage needs of EC (file names, .durable, etc.)
|
||||
diskfile has some additional layering introduced allowing those functions that need EC specific changes to be
|
||||
isolated. TODO: Add detail here.
|
||||
|
||||
**Reconstructor Job Pre-processing**
|
||||
|
||||
Because any given suffix directory may contain more than one fragment archive index data file,
|
||||
the actions that the reconstructor needs to take are not as simple as either syncing or reverting
|
||||
data as is done with the replicator. Because of this, it is more efficient for the reconstructor
|
||||
to analyze what needs to be done on a per part/suffix/fragment index basis and then schedule a
series of jobs that are executed by a single job processor (as opposed to the two clear-cut
sync and revert scenarios handled by the replicator). The main scenarios that the pre-processor is
looking at:

#) part dir with all FIs matching the local node index: this is the case where everything is where it belongs and we just need to compare hashes and sync if needed; here we sync with our partners
#) part dir with one local FI and a mix of others: here we need to sync with our partners where the FI matches the local one; all others are sync'd with their home nodes and then killed
#) part dir with no local FI and just one or more others: here we sync with just the FIs that exist, nobody else, and then all the local fragment archives are killed
|
||||
|
||||
So the main elements of a job that the job processor is handed include a list of exactly who to talk
|
||||
to, which suffix dirs are out of sync and which fragment index to care about. Additionally the job
|
||||
includes information used by both ssync and the reconstructor to delete, as required, .data files on
|
||||
the source node as needed. Basically the work done by the job processor is a hybrid of what the
|
||||
replicator does in update() and update_deleted().
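
A sketch of that pre-processing decision, reduced to the three scenarios listed above: the fragment indexes found in a partition directory are compared against the node's own index and turned into sync or revert jobs (the job dictionaries here are illustrative, not the real job format)::

    def build_jobs(partition, found_frag_indexes, local_frag_index):
        """Classify a partition dir into sync and revert jobs."""
        jobs = []
        for frag_index in sorted(set(found_frag_indexes)):
            if frag_index == local_frag_index:
                # data that belongs here: compare hashes / sync with partners
                jobs.append({'action': 'sync', 'partition': partition,
                             'frag_index': frag_index})
            else:
                # data that belongs on another primary: push it home, then
                # remove the local copy once the revert succeeds
                jobs.append({'action': 'revert', 'partition': partition,
                             'frag_index': frag_index})
        return jobs
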
|
||||
|
||||
**The Act of Reconstruction**
|
||||
|
||||
Reconstruction can be thought of as being like replication but with an extra step
in the middle. The reconstructor is hard-wired to use ssync to determine what
is missing and desired by the other side; however, before an object is sent over the
wire it needs to be reconstructed from the remaining fragments, as the local
fragment is just that - a different fragment index than what the other end is
asking for.
|
||||
|
||||
Thus there are hooks in ssync for EC based policies. One case would be for
|
||||
basic reconstruction which, at a high level, looks like this:
|
||||
|
||||
* ask PyECLib which nodes need to be contacted to collect other EC archives needed to perform reconstruction
|
||||
* establish a connection to the target nodes and give ssync a DiskFileLike class that it can stream data from. The reader in this class will gather fragments from the nodes and use PyECLib to rebuild each segment before yielding data back to ssync
|
||||
|
||||
Essentially what this means is that data is buffered, in memory, on a per segment basis
|
||||
at the node performing reconstruction and each segment is dynamically reconstructed and
|
||||
delivered to ssync_sender where the send_put() method will ship them on over.
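
In sketch form, the reader handed to ssync looks something like the generator below: for each segment it takes the fragments gathered from the other nodes and asks PyECLib to rebuild only the fragment index the receiving node needs (``ECDriver.reconstruct`` is the PyECLib call; fetching the peer fragments is assumed to happen elsewhere)::

    def rebuilt_fragments(driver, per_segment_fragments, missing_index):
        """Yield the missing fragment archive one rebuilt segment at a time.

        per_segment_fragments yields, for each segment, the >= ec_k fragment
        payloads already fetched from the other primary nodes.
        """
        for available in per_segment_fragments:
            rebuilt = driver.reconstruct(available, [missing_index])
            yield rebuilt[0]
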
|
||||
|
||||
The following picture shows what the ssync changes to enable reconstruction. Note that
|
||||
there are several implementation details not covered here having to do with things like
|
||||
making sure that the correct fragment archive indexes are used, getting the metadata
|
||||
correctly setup for the reconstructed object, deleting files/suffix dirs as needed
|
||||
after reversion, etc., etc.
|
||||
|
||||
.. image:: images/recon.png
|
||||
|
||||
**Reconstructor local data file cleanup**
|
||||
|
||||
NOTE: This section is outdated, needs to be scrubbed. Do not read...
|
||||
|
||||
For the reconstructor cleanup is a bit different than replication because, for PUT consistency
|
||||
reasons, the object server is going to keep the previous .data file (if it existed) just
|
||||
in case the PUT of the most recent didn't complete successfully on a quorum of nodes. That
|
||||
leaves the replicator with many scenarios to deal with when it comes to cleaning up old files:
|
||||
|
||||
a) Assuming a PUT worked (commit received), the reconstructor will need to delete the older
   timestamps on the local node. This can be detected locally by examining the TS.data and
   TS.durable filenames. Any TS.data that is older than TS.durable can be deleted.
|
||||
|
||||
b) Assuming a quorum or better and the .durable file didn't make it to some nodes, the reconstructor
|
||||
will detect this (different hashes, further examination shows presence of local .durable file and
|
||||
remote matching ts files but not remote .durable) and simply push the .durable file to the remote
|
||||
node, basically replicating it.
|
||||
|
||||
c) In the event that a PUT was only partially complete but was still able to get a quorum down,
|
||||
the reconstructor will first need to reconstruct the object and then push the EC archives out
|
||||
such that all participating nodes have one, then it can delete the older timestamps on the local
|
||||
node. Once the object is reconstructed, a TS.durable file is created and committed such that
|
||||
each storage node has a record of the latest durable set much in the same way the multi-phase commit
|
||||
works in PUT.
|
||||
|
||||
d) In the event that a PUT was only partially complete and did not get a quorum,
|
||||
reconstruction is not possible. The reconstructor therefore needs to delete these files
|
||||
but there also must be an age factor to prevent it from deleting in flight PUTs. This should be
|
||||
the default behavior but should be able to be overridden in the event that an admin may want
|
||||
partials kept for some reason (easier DR maybe). Regardless, logging when this happens makes a
|
||||
lot of sense. This scenario can be detected when the reconstructor attempts to reconstruct
|
||||
because it notices it does not have a TS.durable for a particular TS.data and gets enough 409s
|
||||
that it can't feed PyECLib enough data to reconstruct (it will need to feed PyECLib what it gets
|
||||
and PYECLib will tell it if there's not enough though). Whether we delete the .data file, mark it
|
||||
somehow so we don't keep trying to reconstruct is TBD.
|
||||
|
||||
**Reconstructor rebalance**
|
||||
|
||||
Current thinking is that there should be no special handling here above and beyond the changes
|
||||
described in the handoff reversion section.
|
||||
|
||||
**Reconstructor concurrency**
|
||||
|
||||
There are 2 aspects of concurrency to consider with the reconstructor:
|
||||
|
||||
1) concurrency of the daemon
|
||||
|
||||
This means the same for the reconstructor as it does for the replicator, the
|
||||
size of the GreenPool used for the 'update' and 'update_deleted' jobs.
|
||||
|
||||
2) overall parallelism of partition reconstruction
|
||||
|
||||
With regards to node-node communication we have already covered the notion that
|
||||
the reconstructor cannot simply check in with its neighbors to determine what
|
||||
action is should take, if any, on its current run because it needs to know the
|
||||
status of the full stripe (not just the status of one or two other EC archives).
|
||||
|
||||
However, we do not want it to actually take action on all other nodes. In other
|
||||
words, we do want to check in with every node to see if a reconstruction is needed
|
||||
and in the event that it is, we dont want to attempt reconstruction on partner
|
||||
nodes, its left and right neighbors. This will minimize reconstruction races but
|
||||
still provide for redundancy in addressing the reconstruction of an EC archive.
|
||||
|
||||
In the event that a node (HDD) is down, there will be 2 partners for that node per
|
||||
partition working the reconstruction thus if we had 6 primaries, for example,
|
||||
and an HDD dies on node 1. We only want nodes 0 and 2 to add jobs to their local
|
||||
reconstructor even though when they call obj_ring.get_part_nodes(int(partition))
|
||||
to get a list of other members of the stripe they will get back 6 nodes. The local
|
||||
node will make its decision as to whether to add a reconstruction job or not based
|
||||
on its position in the node list.
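
The neighbour rule can be sketched directly off the primary node list; only the two partners of the failed node's index end up queueing reconstruction jobs (the ring lookup itself is stubbed out here)::

    def reconstruction_partners(part_nodes, node_index):
        """Return the left and right neighbours this node repairs for."""
        n = len(part_nodes)
        return [part_nodes[(node_index - 1) % n],
                part_nodes[(node_index + 1) % n]]

    # with 6 primaries, nodes 0 and 2 are the partners of node 1, so only
    # they add jobs when node 1's disk dies
    assert reconstruction_partners(list(range(6)), 1) == [0, 2]
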
|
||||
|
||||
In doing this, we minimize the reconstruction races but still enable all 6 nodes to be
|
||||
working on reconstruction for a failed HDD as the partitions will be distributed
|
||||
amongst all of the nodes therefore the node with the dead HDD will potentially have
|
||||
all other nodes pushing reconstructed EC archives to the handoff node in parallel on
|
||||
different partitions with every partition having at most 2 nodes racing to reconstruct
|
||||
its archives.
|
||||
|
||||
The following picture illustrates the example above.
|
||||
|
||||
.. image:: images/recons_ex1.png
|
||||
|
||||
**SCENARIOS:**
|
||||
|
||||
The following series of pictures illustrate the various scenarios more completely. We will use
|
||||
these scenarios against each of the main functions of the reconstructor which we will define as:
|
||||
|
||||
#. Reconstructor framework (daemon)
|
||||
#. Reconstruction (Ssync changes per spec sequence diagram)
|
||||
#. Reconstructor local data file cleanup
|
||||
#. Rebalance
|
||||
#. Handoff reversion (move data back to primary)
|
||||
|
||||
*TODO: Once designs are proposed for each of the main areas above, map to scenarios below for completeness.*
|
||||
|
||||
.. image:: images/recons1.png
|
||||
.. image:: images/recons2.png
|
||||
.. image:: images/recons3.png
|
||||
.. image:: images/recons4.png
|
||||
.. image:: images/recons5.png
|
||||
.. image:: images/recons6.png
|
||||
.. image:: images/recons7.png
|
||||
.. image:: images/recons8.png
|
||||
.. image:: images/recons9.png
|
||||
.. image:: images/recons10.png
|
||||
|
||||
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
|
||||
|
||||
* 3.3.7.1: Reconstructor framework
|
||||
* 3.3.7.2: Ssync changes per spec sequence diagram
|
||||
* 3.3.7.3: Reconstructor local data file cleanup
|
||||
* 3.3.7.4: Node to node communication and synchronization on stripe status
|
||||
* 3.3.7.5: Reconstructor rebalance
|
||||
* 3.3.7.6: Reconstructor handoff reversion
|
||||
* 3.3.7.7: Add conf file option to never delete un-reconstructable EC archives
|
||||
|
||||
3.3.8 **Auditor**
|
||||
|
||||
Because the auditor already operates on a per storage policy basis, there are no specific
|
||||
auditor changes associated with EC. Each EC archive looks like, and is treated like, a
|
||||
regular object from the perspective of the auditor. Therefore, if the auditor finds bit-rot
|
||||
in an EC archive, it simply quarantines it and the EC reconstructor will take care of the rest
|
||||
just as the replicator does for replication policies. Because quarantine directories are
|
||||
already isolated per policy, EC archives have their own quarantine directories.
|
||||
|
||||
3.3.9 **Performance**
|
||||
|
||||
Lots of considerations, planning, testing, tweaking, discussions, etc., etc. to do here
|
||||
|
||||
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
|
||||
|
||||
* 3.3.9.1: Performance Analysis
|
||||
|
||||
3.3.10 **The Ring**
|
||||
|
||||
I think the only real thing to do here is make rebalance able to move more than 1 replica of a
|
||||
given partition at a time. In my mind, the EC scheme is stored in swift.conf, not in the ring,
|
||||
and the placement and device management doesn't need any changes to cope with EC.
|
||||
|
||||
We also want to scrub ring tools to use the word "node" instead of "replicas" to avoid
|
||||
confusion with EC.
|
||||
|
||||
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
|
||||
|
||||
* 3.3.10.1: Ring changes
|
||||
|
||||
3.3.11 **Testing**
|
||||
|
||||
Since these tests aren't always obvious (or possible) on a per patch basis (because of
|
||||
dependencies on other patches) we need to document scenarios that we want to make sure
|
||||
are covered once the code supports them.
|
||||
|
||||
3.3.11.1 **Probe Tests**
|
||||
|
||||
The `Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ card for this has a good
|
||||
starting list of test scenarios, more should be added as the design progresses.
|
||||
|
||||
3.3.11.2 **Functional Tests**
|
||||
|
||||
To begin with at least, it is believed we just need to make an EC policy the default
and run the existing functional tests (and make sure that happens automatically)
|
||||
|
||||
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
|
||||
|
||||
* 3.3.11.1: Required probe test scenarios
|
||||
* 3.3.11.2: Required functional test scenarios
|
||||
|
||||
3.3.12 **Container Sync**
|
||||
|
||||
Container synch assumes the use of replicas. In the current design, container synch from an EC
|
||||
policy would send only one fragment archive to the remote container, not the reconstructed object.
|
||||
|
||||
Therefore container sync needs to be updated to use an internal client instead of the direct client
|
||||
that would only grab a fragment archive.
|
||||
|
||||
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
|
||||
|
||||
* 3.3.12.1: Container synch from EC containers
|
||||
|
||||
3.3.13 **EC Configuration Helper Tool**
|
||||
|
||||
Script to include w/Swift to help determine what the best EC scheme might be and what the
|
||||
parameters should be for swift.conf.
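
As a hedged sketch of the kind of arithmetic such a tool could report, the snippet below compares candidate schemes by raw-space overhead and failure tolerance; it is purely illustrative and not the proposed tool itself::

    def describe_scheme(ec_k, ec_m):
        return {'fragments': ec_k + ec_m,
                'tolerates_failures': ec_m,
                'space_overhead': round(float(ec_k + ec_m) / ec_k, 2)}

    for k, m in [(4, 2), (10, 4), (3, 3)]:
        print(k, m, describe_scheme(k, m))
    # e.g. a 10+4 scheme stores ~1.4 bytes per user byte vs 3.0 for 3x replication
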
|
||||
|
||||
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
|
||||
|
||||
* 3.3.13.1: EC Configuration Helper Tool
|
||||
|
||||
3.3.14 **SAIO Updates**
|
||||
|
||||
We want to make sure it's easy for the SAIO environment to be used for EC development
and experimentation. Just as we did with policies, we'll want to update both docs
and scripts once we decide exactly what we want it to look like.
|
||||
|
||||
For now lets start with 8 total nodes (4 servers) and a 4+2+2 scheme (4 data, 2 parity, 2 handoffs)
|
||||
|
||||
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
|
||||
|
||||
* 3.3.13.1: SAIO Updates (IMPLEMENTED)
|
||||
|
||||
3.4 Alternatives
|
||||
----------------
|
||||
|
||||
This design is 'proxy centric' meaning that all EC is done 'in line' as we bring data in/out of
|
||||
the cluster. An alternate design might be 'storage node centric' where the proxy is really
|
||||
unaware of EC work and new daemons move data from 3x to EC schemes based on rules that could
|
||||
include factors such as age and size of the object. There was a significant amount of discussion
|
||||
on the two options but the former was eventually chosen for the following main reasons:
|
||||
|
||||
EC is CPU/memory intensive and being 'proxy centric' more closely aligns with how providers are
|
||||
planning/have deployed their HW infrastructure
|
||||
|
||||
Having more intelligence at the proxy and less at the storage node is more closely aligned with
|
||||
general Swift architectural principles
|
||||
|
||||
The latter approach was limited to 'off line' EC meaning that data would always have to make the
|
||||
'trip' through replication before becoming erasure coded which is not as usable for many applications
|
||||
|
||||
The former approach provides for 'in line' as well as 'off line' by allowing the application
|
||||
to store data in a replication policy first and then move that data at some point later to EC by
|
||||
copying the data to a different container. There are thoughts/ideas for alternate means for
|
||||
changing the policy of a container that are not covered here but are recognized to
|
||||
be possible with this scheme, making it even easier for an application to control the data durability
|
||||
policy.
|
||||
|
||||
*Alternate Reconstructor Design*
|
||||
|
||||
An alternate, but rejected, proposal is archived on `Trello. <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_
|
||||
|
||||
Key concepts for the REJECTED proposal were:
|
||||
|
||||
Perform auditing at the fragment level (sub segment) to avoid having the smallest unit of work be an EC archive. This will reduce reconstruction network traffic
|
||||
|
||||
Today the auditor quarantines an entire object; for fragment level rebuild we
|
||||
need an additional step to identify which fragment within the archive is bad and
|
||||
potentially quarantine in a different location to protect the archive from deletion
|
||||
until the Reconstructor is done with it
|
||||
|
||||
Today hashes.pkl only identifies a suffix directory in need of attention. For
|
||||
fragment level rebuild, the reconstructor needs to have additional information as
|
||||
it's not just syncing at the directory level:
|
||||
Needs to know which fragment archive in the suffix dir needs work
|
||||
Needs to know which segment index within the archive is bad
|
||||
Needs to know the fragment index of the archive (the EC archives position within the set)
|
||||
|
||||
Perform reconstruction on the local node, however preserve the push model by having the
|
||||
remote node communicate reconstruction information via a new verb. This will reduce reconstruction
|
||||
network traffic. This could be really bad wrt overloading the local node with reconstruction
|
||||
traffic as opposed to using all the compute power of all systems participating in the partitions
|
||||
kept on the local node.
|
||||
|
||||
*Alternate Reconstructor Design #2*
|
||||
|
||||
The design proposal leverages the REPLICATE verb but introduces a new hashes.pkl format
|
||||
for EC and, for readability, names this file ec_hashes.pkl. The contents of this file will be
|
||||
covered shortly but it essentially needs to contain everything that any node would need to know
|
||||
in order to make a pass over its data and decide whether to reconstruct, delete, or move data.
|
||||
So, for EC, the standard hashes.pkl file and/or functions that operate on it are not relevant.
|
||||
|
||||
The data in ec_hashes.pkl has the following properties:
|
||||
|
||||
* needs to be synchronized across all nodes
|
||||
* needs to have complete information about any given object hash to be valid for that hash
|
||||
* can be complete for some object hashes and incomplete for others
|
||||
|
||||
There are many choices for achieving this ranging from gossip methods to consensus schemes. The
|
||||
proposed design leverages the fact that all nodes have access to a common structure and accessor
|
||||
functions that are assumed to be synchronized (eventually) such that any node position in the list
|
||||
can be used to select a master for one of two operations that require node-node communication:
|
||||
(1) ec_hashes.pkl synchronization and (2) reconstruction.
|
||||
|
||||
*ec_hashes.pkl synchronization*
|
||||
|
||||
At any given point in time there will be one node out of the set of nodes returned from
|
||||
get_part_nodes() that will act as the master for synchronizing ec_hashes.pkl information. The
|
||||
reconstructor, at the start of each pass, will use a bully style algorithm to elect the hash master.
|
||||
When each reconstructor starts a pass it will send an election message to all nodes with a node
|
||||
index lower than its own. If unable to connect with said nodes then it assumes the role of
|
||||
hash master. If any nodes with lower index reply then it continues with the current pass,
|
||||
processing its objects based on current information in its ec_hashes.pkl. This bully-like
|
||||
algorithm won't actually prevent 2 masters from running at the same time (for example nodes 0-2
|
||||
could all be down so node 3 starts as master and then one of the nodes comes back up, it will
|
||||
also start the hash synchronization process). Note that this does not cause functional issues,
|
||||
it's just a bit wasteful but saves us from implementing a more complex consensus algorithm
|
||||
that's not deemed to be worth the effort.
|
||||
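For illustration, the election step might look roughly like this
(``send_election`` is a hypothetical node-to-node call; the real reconstructor
would reuse its existing connection plumbing)::

    def elect_hash_master(my_index, part_nodes, send_election):
        """Bully-style election: contact every node with a lower index;
        if none answers, assume the hash master role for this pass."""
        for node in part_nodes[:my_index]:
            try:
                if send_election(node):  # a lower-index node is alive
                    return False
            except Exception:
                continue                 # unreachable, keep looking
        return True                      # no lower-index node answered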
|
||||
The role of the master will be to:
|
||||
|
||||
#. send REPLICATE to all other nodes in the set
|
||||
#. merge results
|
||||
#. send new variation of REPLICATE to all other nodes
|
||||
#. nodes merge into their ec_hashes.pkl
|
||||
|
||||
In this manner there will be typically one node sending 2 REPLICATE verbs to n other nodes
|
||||
for each pass of the reconstructor, so a total of 2(n-1) REPLICATE messages, i.e. O(n), versus O(1) for
|
||||
replication, where 3 nodes would each send 2 messages for a constant 6 messages per
|
||||
pass. Note that there are distinct differences between the merging done by the master
|
||||
after collecting node pkl files and the merging done at the nodes after receiving the
|
||||
master version. When the master is merging, it is only updating the master copy with
|
||||
new information about the sending node. When a node is merging from master, it is only
|
||||
updating information about all other nodes. In other words, the master is only interested
|
||||
in hearing information from a node about that node itself and any given node is only
|
||||
interested in learning about everybody else. More on these merging rules later.
|
||||
|
||||
At any given point in time the ec_hashes.pkl file on a node can be in a variety of states; it
|
||||
is not required that, even though a synchronized set was sent by the master, the synchronized
|
||||
version be inspected by participating nodes. Each object hash within the ec_hashes.pkl will
|
||||
have information indicating whether that particular entry is synchronized or not, therefore it
|
||||
may be the case that a particular pass of a reconstructor run parses an ec_hashes.pkl file and
|
||||
only find some percentage N of synchronized entries where N started at 100% and dropped from there
|
||||
as changes were made to the local node (objects added, objects quarantined). An example will
|
||||
be provided after defining the format of the file.
|
||||
|
||||
ec_hashes data structure
|
||||
|
||||
{object_hash_0: {TS_0: [node0, node1, ...], TS_n: [node0, node1, ...], ...},
|
||||
object_hash_1: {TS_0: [node0, node1, ...], TS_n: [node0, node1, ...], ...},
|
||||
object_hash_n: {TS_0: [node0, node1, ...], TS_n: [node0, node1, ...], ...}}
|
||||
|
||||
where nodeX takes on values of unknown, not present or present such that a reconstructor
|
||||
parsing its local structure can determine on an object by object basis which TS files
|
||||
exist on which nodes, which ones it is missing on or if it has incomplete information for
|
||||
that TS (a node value for that TS is marked as unknown). Note that although this file format
|
||||
will contain per object information, objects are removed from the file by the local nodes
|
||||
once the local node has *seen* information from all other nodes for that entry. Therefore
|
||||
the file will not contain an entry for every object in the system but instead a transient
|
||||
entry for every object while it's being accepted into the system (having its consistency wrt
|
||||
EC verified).
|
||||
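As a purely illustrative instance of the format above, a local ec_hashes
structure for a 6-node EC set might look like this, where one node's state for
the second object's timestamp is still unknown::

    {'d41d8cd98f00b204e9800998ecf8427e':
         {'1412627846.00000': ['present', 'present', 'not present',
                               'present', 'present', 'present']},
     '9e107d9d372bb6826bd81d3542a419d6':
         {'1412627901.00000': ['present', 'unknown', 'present',
                               'present', 'not present', 'present']}}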
|
||||
The new ec_hashes.pkl is subject to several potential writers including the hash master,
|
||||
its own local reconstructor, the auditor, the PUT path, etc., and will therefore be using
|
||||
the same locking that hashes.pkl uses today. The following illustrates the ongoing
|
||||
updates to ec_hashes.pkl
|
||||
|
||||
.. image:: images/ec_pkl_life.png
|
||||
|
||||
As the ec_hashes.pkl file is updated, the following rules apply:
|
||||
|
||||
As a **hash master** updating a local master file with any single node file:
|
||||
(recall the goal here is to update the master with info about the incoming node)
|
||||
|
||||
* data is never deleted (i.e. if an object hash or TS key exists in master but does not in the incoming dictionary, the entry is left intact)
|
||||
* data can be added (if an object hash or TS key exists in an incoming dictionary but does not exist in master it is added)
|
||||
* where keys match, only the node index in the TS list for the incoming data is affected and that data is replaced in master with the incoming information
|
||||
|
||||
As a **non-master** node merging from the master:
|
||||
(recall that the goal here is to have this node learn about the other nodes in the cluster)
|
||||
|
||||
* an object hash is deleted as soon as all nodes are marked present
|
||||
* data can be added, same as above
|
||||
* where keys match, only the *other* indices in the TS list for the incoming data are affected and that data is replaced with the incoming information (see the sketch below)
|
||||
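The two merge rules could be sketched roughly as follows (an illustration only;
a real implementation would also handle locking and timestamp lists of
differing length)::

    def merge_into_master(master, incoming, sender_index):
        # The master only trusts what the sending node says about itself;
        # existing entries are never deleted.
        for object_hash, ts_map in incoming.items():
            master_ts = master.setdefault(object_hash, {})
            for ts, nodes in ts_map.items():
                states = master_ts.setdefault(ts, ['unknown'] * len(nodes))
                states[sender_index] = nodes[sender_index]

    def merge_from_master(local, from_master, my_index):
        # A node only learns about the *other* nodes from the master copy.
        for object_hash, ts_map in from_master.items():
            local_ts = local.setdefault(object_hash, {})
            for ts, nodes in ts_map.items():
                states = local_ts.setdefault(ts, ['unknown'] * len(nodes))
                for i, state in enumerate(nodes):
                    if i != my_index:
                        states[i] = state
        # an object hash is dropped once every node is marked present
        for object_hash in [h for h, ts_map in local.items()
                            if all(s == 'present'
                                   for nodes in ts_map.values()
                                   for s in nodes)]:
            del local[object_hash]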
|
||||
**Some examples**
|
||||
|
||||
The following are some example scenarios (used later to help explain use cases) and their
|
||||
corresponding ec_hashes data structures.
|
||||
|
||||
.. image:: images/echash1.png
|
||||
.. image:: images/echash2.png
|
||||
|
||||
4. Implementation
|
||||
=================
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
There are several key contributors; torgomatic is the core sponsor.
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
See `Trello discussion board <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_
|
||||
|
||||
Repositories
|
||||
------------
|
||||
|
||||
Using Swift repo
|
||||
|
||||
Servers
|
||||
-------
|
||||
|
||||
N/A
|
||||
|
||||
DNS Entries
|
||||
-----------
|
||||
|
||||
N/A
|
||||
|
||||
5. Dependencies
|
||||
===============
|
||||
|
||||
As mentioned earlier, the EC algorithms themselves are implemented externally in
|
||||
multiple libraries. See the main site for the external work at `PyECLib <https://bitbucket.org/kmgreen2/pyeclib>`_
|
||||
|
||||
PyECLib itself is already an accepted `requirement. <https://review.openstack.org/#/c/76068/>`_
|
||||
|
||||
Work is ongoing to make sure that additional package dependencies for PyECLib are in place.
|
||||
There is a linux package, liberasurecode, that is also being developed as part of this effort
|
||||
and is needed by PyECLib. Getting it added for devstack tempest tests and unittests slaves is
|
||||
currently WIP by tsg
|
||||
|
||||
|
||||
`Trello <https://trello.com/b/LlvIFIQs/swift-erasure-codes>`_ Tasks for this section::
|
||||
|
||||
* 5.1: Enable sysmeta on object PUT (IMPLEMENTED)
|
@ -1,445 +0,0 @@
|
||||
::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=====================================
|
||||
Composite Tokens and Service Accounts
|
||||
=====================================
|
||||
|
||||
This is a proposal for how Composite Tokens can be used by services such
|
||||
as Glance and Cinder to store objects in project-specific accounts yet
|
||||
retain control over how those objects are accessed.
|
||||
|
||||
This proposal uses the "Service Token Composite Authorization" support in
|
||||
the auth_token Keystone middleware
|
||||
(http://git.openstack.org/cgit/openstack/keystone-specs/plain/specs/keystonemiddleware/service-tokens.rst).
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Swift is used by many OpenStack services to store data on behalf of users.
|
||||
There are typically two approaches to where the data is stored:
|
||||
|
||||
* *Single-project*. Objects are stored in a single dedicated Swift account
|
||||
(i.e., all data belonging to all users is stored in the same account).
|
||||
|
||||
* *Multi-project*. Objects are stored in the end-user's Swift account (project).
|
||||
Typically, dedicated container(s) are created to hold the objects.
|
||||
|
||||
There are advantages and limitations with both approaches as described in the
|
||||
following table:
|
||||
|
||||
==== ========================================== ========== ========
|
||||
Item Feature/Topic Single- Multi-
|
||||
Project Project
|
||||
---- ------------------------------------------ ---------- --------
|
||||
1 Fragile to password leak (CVE-2013-1840) Yes No
|
||||
2 Fragile to token leak Yes No
|
||||
3 Fragile to container deletion Yes No
|
||||
4 Fragile to service user deletion Yes No
|
||||
5 "Noise" in Swift account No Yes
|
||||
6 Namespace collisions (user and service No Yes
|
||||
picking same name)
|
||||
7 Guarantee of consistency (service Yes No
|
||||
database vs swift account)
|
||||
8 Policy enforcement (e.g., Image Download) Yes No
|
||||
==== ========================================== ========== ========
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
It is proposed to put service data into an account separate from the end-user's
|
||||
"normal" account. Although the account has a different name, the account
|
||||
is linked to the end-user's project. This solves issues with noise
|
||||
and namespace collisions. To remove fragility and improve consistency
|
||||
guarantees, it is proposed to use the composite token feature to manage
|
||||
access to this account.
|
||||
|
||||
In summary, there are three related changes:
|
||||
|
||||
* Support for Composite Tokens
|
||||
|
||||
* The authorization logic can require authenticated information from the
|
||||
composite tokens
|
||||
|
||||
* Support for multiple reseller prefixes, each with their own configuration
|
||||
|
||||
The effect is that access to the data must be made through the service.
|
||||
In addition, the service can only access the data when it is processing
|
||||
a request from the end-user (i.e., when it has an end-user's token).
|
||||
|
||||
The changes are described one by one in this document. The impatient can
|
||||
skip to "Composite Tokens in the OpenStack Environment" for a complete,
|
||||
example.
|
||||
|
||||
Composite Tokens
|
||||
================
|
||||
|
||||
The authentication system will validate a second token. The token is stored
|
||||
in the X-Service-Token header so is known as the service token (name chosen
|
||||
by Keystone).
|
||||
|
||||
The core function of the token authentication scheme is to determine who the
|
||||
user is, what account is being accessed and what roles apply. Keystoneauth
|
||||
and Tempauth have slightly different semantics, so the tokens are combined
|
||||
in slightly different ways as explained in the following sections.
|
||||
|
||||
Combining Roles in Keystoneauth
|
||||
-------------------------------
|
||||
|
||||
The following rules are used when a service token is present:
|
||||
|
||||
* The user_id is the user_id from the first token (i.e., no change)
|
||||
|
||||
* The account (project) is specified by the first token (i.e., no change)
|
||||
|
||||
* The user roles are initially determined by the first token (i.e., no change).
|
||||
|
||||
* The roles from the service token are made available in service_roles.
|
||||
|
||||
Example 1 - Combining Roles in keystoneauth
|
||||
-------------------------------------------
|
||||
|
||||
In this example, the <token-two> is scoped to a different project than
|
||||
the account/project being accessed::
|
||||
|
||||
Client
|
||||
| <user-token>: project-id: 1234
|
||||
| user-id: 9876
|
||||
| roles: admin
|
||||
| X-Auth-Token: <user-token>
|
||||
| X-Service-Token: <token-two>
|
||||
|
|
||||
| <token-two>: project-id: 5678
|
||||
v user-id: 5432
|
||||
Swift roles: service
|
||||
|
|
||||
v
|
||||
Combined identity information:
|
||||
user_id: 9876
|
||||
project_id: 1234
|
||||
roles: admin
|
||||
service_roles: service
|
||||
|
||||
Combining Groups in Tempauth
|
||||
----------------------------
|
||||
|
||||
The user groups from both tokens are simply combined into one list. The
|
||||
following diagram gives an example of this::
|
||||
|
||||
|
||||
Client
|
||||
| <user-token>: from "joe"
|
||||
|
|
||||
|
|
||||
| X-Auth-Token: <user-token>
|
||||
| X-Service-Token: <token-two>
|
||||
|
|
||||
| <token-two>: from "glance"
|
||||
v
|
||||
Swift
|
||||
|
|
||||
| [filter:tempauth]
|
||||
| user_joesaccount_joe: joespassword .admin
|
||||
| user_glanceaccount_glance: glancepassword servicegroup
|
||||
|
|
||||
v
|
||||
Combined Groups: .admin servicegroup
|
||||
|
||||
Support for multiple reseller prefixes
|
||||
======================================
|
||||
|
||||
The reseller_prefix will now support a list of prefixes. For example,
|
||||
the following supports both ``AUTH_`` and ``SERVICE_`` in keystoneauth::
|
||||
|
||||
[filter:keystoneauth]
|
||||
reseller_prefix = AUTH_, SERVICE_
|
||||
|
||||
For backward compatibility, the default remains as ``AUTH_``.
|
||||
|
||||
All existing configuration options are assumed to apply to the first
|
||||
item in the list. However, to indicate which prefix an option applies to,
|
||||
put the prefix in front of the option name. This applies to the
|
||||
following options:
|
||||
|
||||
* operator_roles (keystoneauth)
|
||||
* service_roles (described below) (keystoneauth)
|
||||
* require_group (described below) (tempauth)
|
||||
|
||||
Other options (logging, storage_url_scheme, etc.) are not specific to
|
||||
the reseller prefix.
|
||||
|
||||
For example, this shows two prefixes and some options::
|
||||
|
||||
[filter:keystoneauth]
|
||||
reseller_prefix = AUTH_, SERVICE_
|
||||
reseller_admin_role = ResellerAdmin <= global, applies to all
|
||||
AUTH_operator_roles = admin <= new style
|
||||
SERVICE_operator_roles = admin
|
||||
allow_overrides = false
|
||||
|
||||
Support for composite authorization
|
||||
===================================
|
||||
|
||||
We will add an option called "service_roles" to keystoneauth. If
|
||||
present, composite tokens must be used and the service_roles must contain the
|
||||
listed roles. Here is an example where the ``AUTH_`` namespace requires the
|
||||
"admin" role be associated with the X-Auth-Token. The ``SERVICE_`` namespace
|
||||
requires that the "admin" role be associated with X-Auth-Token. In
|
||||
addition, it requires that the "service" role be associated with
|
||||
X-Service-Token::
|
||||
|
||||
[filter:keystoneauth]
|
||||
reseller_prefix = AUTH_, SERVICE_
|
||||
AUTH_operator_roles = admin
|
||||
SERVICE_operator_roles = admin
|
||||
SERVICE_service_roles = service
|
||||
|
||||
In tempauth, we will add an option called "require_group". If present,
|
||||
the user or service user must be a member of this group. (since tempauth
|
||||
combines groups from both X-Auth-Token and X-Service-Token, the required
|
||||
group may come from either or both tokens).
|
||||
|
||||
The following shows an example::
|
||||
|
||||
[filter:tempauth]
|
||||
reseller_prefix = AUTH_, SERVICE_
|
||||
SERVICE_require_group = servicegroup
|
||||
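Because tempauth merges the groups from both tokens, the check itself stays
very small; a sketch for clarity::

    def meets_require_group(combined_groups, require_group):
        # require_group is the per-prefix option; the required group may
        # have come from either X-Auth-Token or X-Service-Token
        return not require_group or require_group in combined_groups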
|
||||
Composite Tokens in the OpenStack Environment
|
||||
=============================================
|
||||
|
||||
This section presents a simple configuration showing the flow from client
|
||||
through an OpenStack Service to Swift. We use Glance in this example, but
|
||||
the principle is the same for all services. See later for a more
|
||||
complex service-specific setup.
|
||||
|
||||
The flow is as follows::
|
||||
|
||||
Client
|
||||
| <user-token>: project-id: 1234
|
||||
| user-id: 9876
|
||||
| (request) roles: admin
|
||||
| X-Auth-Token: <user-token>
|
||||
|
|
||||
v
|
||||
Glance
|
||||
|
|
||||
| PUT /v1/SERVICE_1234/container/object
|
||||
| X-Auth-Token: <user-token>
|
||||
| X-Service-Token: <glance-token>
|
||||
|
|
||||
| <glance-token>: project-id: 5678
|
||||
v user-id: 5432
|
||||
Swift roles: service
|
||||
|
|
||||
v
|
||||
Combined identity information:
|
||||
user_id: 9876
|
||||
project-id: 1234
|
||||
roles: admin
|
||||
service_roles: service
|
||||
|
||||
[filter:keystoneauth]
|
||||
reseller_prefix = AUTH_, SERVICE_
|
||||
AUTH_operator_roles = admin
|
||||
AUTH_reseller_admin_roles = ResellerAdmin
|
||||
SERVICE_operator_roles = admin
|
||||
SERVICE_service_roles = service
|
||||
SERVICE_reseller_admin_roles = ResellerAdmin
|
||||
|
||||
The authorization logic is as follows::
|
||||
|
||||
/v1/SERVICE_1234/container/object
|
||||
-------
|
||||
|
|
||||
in?
|
||||
|
|
||||
reseller_prefix = AUTH_, SERVICE_
|
||||
\
|
||||
Yes
|
||||
\
|
||||
Use SERVICE_* configuration
|
||||
|
|
||||
|
|
||||
/v1/SERVICE_1234/container/object
|
||||
----
|
||||
|
|
||||
same as? project-id: 1234
|
||||
\
|
||||
Yes
|
||||
\
|
||||
roles: admin
|
||||
|
|
||||
in? SERVICE_operator_roles = admin
|
||||
\
|
||||
Yes
|
||||
\
|
||||
service_roles: service
|
||||
|
|
||||
in? SERVICE_service_roles = service
|
||||
\
|
||||
Yes
|
||||
\
|
||||
----> swift_owner = True
|
||||
|
||||
|
||||
Other Aspects
|
||||
=============
|
||||
|
||||
Tempurl, FormPOST, Container Sync
|
||||
---------------------------------
|
||||
|
||||
These work on the principle that the secret key is stored in a *privileged*
|
||||
header. No change is proposed as the account controls described in this
|
||||
document continue to use this concept. However, an additional use-case
|
||||
becomes possible: it should be possible to use temporary URLs to
|
||||
allow a client to upload or download objects to or from a service
|
||||
account.
|
||||
|
||||
Service-Specific Accounts
|
||||
-------------------------
|
||||
|
||||
Using a common ``SERVICE_`` namespace means that all OpenStack Services share
|
||||
the same account. A simple alternative is to use multiple accounts -- with
|
||||
corresponding reseller_prefixes and service catalog entries. For example,
|
||||
Glance could use ``IMAGE_`` and Cinder could use ``VOLUME_``. There is nothing
|
||||
in this proposal that limits this option. Here is an example of a
|
||||
possible configuration::
|
||||
|
||||
[filter:keystoneauth]
|
||||
reseller_prefix = AUTH_, IMAGE_, VOLUME_
|
||||
IMAGE_service_roles = glance_service
|
||||
VOLUME_service_roles = cinder_service
|
||||
|
||||
python-swiftclient
|
||||
------------------
|
||||
|
||||
No changes are needed in python-swiftclient to support this feature.
|
||||
|
||||
Service Changes To Use ``SERVICE_`` Namespace
|
||||
---------------------------------------------
|
||||
|
||||
Services (such as Glance, Cinder) need to be enhanced as follows to use
|
||||
the ``SERVICE_`` namespace:
|
||||
|
||||
* Change the path to use the appropriate prefix. Applications have
|
||||
HTTP_X_SERVICE_CATALOG in their environment so it is easy to construct the
|
||||
appropriate path.
|
||||
* Add their token to the X-Service-Token header
|
||||
* They should have the appropriate service role for this token
|
||||
* They should include their service type (e.g., image) as a prefix to any
|
||||
container names they create. This will prevent conflict between services
|
||||
sharing the account.
|
||||
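For illustration only, a service might issue such a request roughly as follows
(the catalog lookup and token handling are placeholders for whatever the
service already does)::

    import requests  # illustrative; services use their own HTTP clients

    def put_backup(swift_endpoint, project_id, user_token, service_token,
                   name, data):
        # swift_endpoint comes from the service catalog; the account is
        # addressed with the SERVICE_ prefix and the container name is
        # prefixed with the service type to avoid collisions.
        url = '%s/v1/SERVICE_%s/image_backups/%s' % (
            swift_endpoint, project_id, name)
        return requests.put(url, data=data, headers={
            'X-Auth-Token': user_token,        # the end-user's token
            'X-Service-Token': service_token,  # the service's own token
        })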
|
||||
Upgrade Implications
|
||||
====================
|
||||
|
||||
The Swift software must be upgraded before Services attempt to use the
|
||||
``SERVICE_`` namespace. Since Services use configurable options
|
||||
to decide how they use Swift, this should be easy to sequence (i.e., upgrade
|
||||
software first, then change the Service's configuration options).
|
||||
|
||||
How Services handle existing legacy data is beyond the scope of this
|
||||
proposal.
|
||||
|
||||
Alternatives
|
||||
============
|
||||
|
||||
*Account ACL*
|
||||
|
||||
An earlier draft proposed extending the account ACL. It also proposed to
|
||||
add a default account ACL concept. On review, it was decided that this
|
||||
was unnecessary for this use-case (though that work might happen in its
|
||||
own right).
|
||||
|
||||
*Co-owner sysmeta*
|
||||
|
||||
An earlier draft proposed new sysmeta that established "co-ownership"
|
||||
rules for containers.
|
||||
|
||||
*policy.xml File*:
|
||||
|
||||
The Keystone Composite Authorization scheme has use cases for other Openstack
|
||||
projects. The OSLO incubator policy checker module may be extended to support
|
||||
roles acquired from X-Service-Token. However, this will only be used in
|
||||
Swift if keystoneauth already uses a policy.xml file.
|
||||
|
||||
If policy files are adopted by keystoneauth, it should be easy to apply. In
|
||||
effect, a different policy.xml file would be supplied for each reseller prefix.
|
||||
|
||||
*Proxy Logging*:
|
||||
|
||||
The proxy-logging middleware logs the value of X-Auth-Token. No change is
|
||||
proposed.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
donagh.mccabe@hp.com
|
||||
|
||||
To be fully effective, changes are needed in other projects:
|
||||
|
||||
* Keystone Middleware. Done
|
||||
|
||||
* OSLO. As mentioned above, probably not needed or depended on.
|
||||
|
||||
* Glance. stuart.mclaren@hp.com will make the Glance changes.
|
||||
|
||||
* Cinder. Unknown.
|
||||
|
||||
* Devstack. The Swift change by itself will probably not require Devstack
|
||||
changes. The Glance and Cinder services may need additional configuration
|
||||
options to enable the X-Service-Token feature.
|
||||
Assignee: Unknown
|
||||
|
||||
* Tempest. In principle, no changes should be needed as the proposal is
|
||||
intended to be transparent to end-users. However, it may be possible
|
||||
that some tests incorrectly access images or volume backups directly.
|
||||
Assignee: Unknown
|
||||
|
||||
* Ceilometer (for ``SERVICE_`` namespace). It is not clear if any
|
||||
changes are needed or desirable.
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* swift/common/middleware/tempauth.py is modified to support multiple
|
||||
reseller prefixes, the require_group options and to process the
|
||||
X-Service-Token header
|
||||
|
||||
* swift/common/middleware/keystoneauth.py is modified to support multiple
|
||||
reseller prefixes and the service_roles option.
|
||||
|
||||
* Write unit tests
|
||||
|
||||
* Write functional tests
|
||||
|
||||
Repositories
|
||||
------------
|
||||
|
||||
No new git repositories will be created.
|
||||
|
||||
Servers
|
||||
-------
|
||||
|
||||
No new servers are created. The keystoneauth middleware is used by the
|
||||
proxy-server.
|
||||
|
||||
DNS Entries
|
||||
-----------
|
||||
|
||||
No DNS entries will need to be created or updated.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* "Service Token Composite Authorization"
|
||||
https://review.openstack.org/#/c/96315
|
@ -1,736 +0,0 @@
|
||||
::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
******************
|
||||
At-Rest Encryption
|
||||
******************
|
||||
|
||||
1. Summary
|
||||
==========
|
||||
|
||||
To better protect the data in their clusters, Swift operators may wish
|
||||
to have objects stored in an encrypted form. This spec describes a
|
||||
plan to add an operator-managed encryption capability to Swift while
|
||||
remaining completely transparent to clients.
|
||||
|
||||
Goals
|
||||
-----
|
||||
|
||||
Swift objects are typically stored on disk as files in a standard
|
||||
POSIX filesystem; in the typical 3-replica case, an object is
|
||||
represented as 3 files on 3 distinct filesystems within the cluster.
|
||||
|
||||
An attacker may gain access to disks in a number of ways. When a disk
|
||||
fails, it may be returned to the manufacturer under warranty; since it
|
||||
has failed, erasing the data may not be possible, but the data may
|
||||
still be present on the platters. When disks reach end-of-life, they
|
||||
are discarded, and if not properly wiped, may still contain data. An
|
||||
insider might steal or clone disks from the data center.
|
||||
|
||||
Goal 1: an attacker who gains read access to Swift's object servers'
|
||||
filesystems should gain as little useful data as possible. This
|
||||
provides confidentiality for users' data.
|
||||
|
||||
Goal 2: when a keymaster implementation allows for secure deletion of keys,
|
||||
then the deletion of an object's key shall render the object irrecoverable.
|
||||
This provides a means to securely delete an object.
|
||||
|
||||
Not Goals / Possible Future Work
|
||||
--------------------------------
|
||||
|
||||
There are other ways to attack a Swift cluster, but this spec does not
|
||||
address them. In particular, this spec does not address these threats:
|
||||
|
||||
* an attacker gains access to Swift's internal network
|
||||
* an attacker compromises the key database
|
||||
* an attacker modifies Swift's code (on the Swift nodes) for evil
|
||||
|
||||
If these threats are mitigated at all, it is a fortunate byproduct, but it is
|
||||
not the intent of this spec to address them.
|
||||
|
||||
|
||||
2. Encryption and Key Management
|
||||
================================
|
||||
|
||||
There are two logical parts to at-rest encryption. The first part is
|
||||
the crypto engine; this performs the actual encryption and decryption
|
||||
of the data and metadata.
|
||||
|
||||
The second part is key management. This is the process by which the
|
||||
key material is stored, retrieved and supplied to the crypto engine.
|
||||
The process may be split with an agent responsible for storing key
|
||||
material safely (sometimes a Hardware Security Module) and an agent
|
||||
responsible for retrieving key material for the crypto engine. Swift
|
||||
will support a variety of key-material retrievers, called
|
||||
"keymasters", via Python's entry-points mechanism. Typically, a Swift
|
||||
cluster will use only one keymaster.
|
||||
|
||||
2.1 Request Path
|
||||
----------------
|
||||
|
||||
The crypto engine and the keymaster shall be implemented as three
|
||||
separate pieces of middleware. The crypto engine shall have both
|
||||
"decrypter" and "encrypter" filter-factory functions, and the
|
||||
keymaster filter shall sit between them. Example::
|
||||
|
||||
[pipeline:main]
|
||||
pipeline = catch_errors gatekeeper ... decrypter keymaster encrypter proxy-logging proxy-server
|
||||
|
||||
The encrypter middleware is responsible for encrypting the object's
|
||||
data and metadata on a PUT or POST request.
|
||||
|
||||
The decrypter middleware is responsible for three things. First, it
|
||||
decrypts the object's data and metadata on an object GET or HEAD
|
||||
response. Second, it decrypts the container listing entries and the
|
||||
container metadata on a container GET or HEAD response. Third, it
|
||||
decrypts the account metadata on an account GET or HEAD response.
|
||||
|
||||
DELETE requests are unaffected by encryption, so neither
|
||||
the encrypter nor decrypter need to do anything. The keymaster may
|
||||
wish to delete any key or keys associated with the deleted entity.
|
||||
|
||||
OPTIONS requests should be ignored entirely by the crypto engine, as
|
||||
OPTIONS requests and responses contain neither user data nor user
|
||||
metadata.
|
||||
|
||||
2.1.1 Large Objects
|
||||
-------------------
|
||||
|
||||
In Swift, large objects are composed of segments, which are plain old
|
||||
objects, and a manifest, which is a special object that ties the
|
||||
segments together. Here, "special" means "has a particular header
|
||||
value".
|
||||
|
||||
Large-object support is implemented in middlewares ("dlo" and "slo").
|
||||
The encrypter/keymaster/decrypter trio must be placed to the right of
|
||||
the dlo and slo middlewares in the proxy's middleware pipeline. This
|
||||
way, the encrypter and decrypter do not have to do any special
|
||||
processing for large objects; rather, each request is for a plain old
|
||||
object, container, or account.
|
||||
|
||||
2.1.2 Etag Validation
|
||||
---------------------
|
||||
|
||||
With unencrypted objects, the object server is responsible for
|
||||
validating any Etag header sent by the client on a PUT request; the
|
||||
Etag header's value is the MD5 hash of the uploaded object data.
|
||||
|
||||
With encrypted objects, the plaintext is not available to the object server, so
|
||||
the encrypter must perform the validation instead by calculating the MD5 hash
|
||||
of the object data and validating this against any Etag header sent by the
|
||||
client - if the two do not match then the encrypter should immediately return a
|
||||
response with status 422.
|
||||
|
||||
Assuming that the computed MD5 hash of plaintext is validated, the encrypter
|
||||
will encrypt this value and pass to the object server to be stored as system
|
||||
metadata. Since the validated value will not be available until the plaintext
|
||||
stream has been completely read, this metadata will be sent using a 'request
|
||||
footer', as described in section 7.2.
|
||||
|
||||
If the client request included an Etag header then the encrypter should also
|
||||
compute the MD5 hash of the ciphertext and include this value in an Etag
|
||||
request footer. This will allow the object server to validate the hash of the
|
||||
ciphertext that it receives, and so complete the end-to-end validation
|
||||
requirement implied by the client sending an Etag: encrypter validates client
|
||||
to proxy communication, object server validates proxy to object server
|
||||
communication.
|
||||
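A sketch of the plaintext validation step (names are illustrative; the real
encrypter would do this while streaming the data on to be encrypted)::

    import hashlib

    def read_validate_plaintext(wsgi_input, client_etag, encrypt_chunk):
        """Compute the MD5 of the plaintext while streaming it into the
        encrypter; return (plaintext_etag, matched)."""
        md5 = hashlib.md5()
        for chunk in iter(lambda: wsgi_input.read(65536), b''):
            md5.update(chunk)
            encrypt_chunk(chunk)   # plaintext handed on for encryption
        etag = md5.hexdigest()
        # a mismatch means the encrypter should return 422 immediately
        return etag, (client_etag is None or client_etag == etag)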
|
||||
|
||||
2.2 Inter-Middleware Communication
|
||||
----------------------------------
|
||||
|
||||
The keymaster is responsible for deciding if any particular resource should be
|
||||
encrypted. This decision is implementation dependent but may be based, for
|
||||
example, on container policy or account name. When a resource is not to be
|
||||
encrypted the keymaster will set the key `swift.crypto.override` in the request
|
||||
environ to indicate to the encrypter middleware that encryption is not
|
||||
required.
|
||||
|
||||
When encryption is required, the keymaster communicates the encryption key to
|
||||
the encrypter and decrypter middlewares by placing a zero-argument callable in
|
||||
the WSGI environment dictionary at the key "swift.crypto.fetch_crypto_keys".
|
||||
When called, this will return the key(s) necessary to process the current
|
||||
request. It must be present on any GET or HEAD request for an account,
|
||||
container, or object which contains any encrypted data or metadata. If
|
||||
encrypted data or metadata is encountered while processing a GET or HEAD
|
||||
request but fetch_crypto_keys is not present *or* it does not return keys when
|
||||
called, then this is an error and the client will receive a 500-series
|
||||
response.
|
||||
|
||||
On a PUT or POST request, the keymaster must place
|
||||
"swift.crypto.fetch_crypto_keys" in the WSGI environment during request
|
||||
processing; that is, before passing the request to the remainder of the
|
||||
middleware pipeline. This is so that the encrypter can encrypt the object's
|
||||
data in a streaming fashion without buffering the whole object.
|
||||
|
||||
On a GET or HEAD request, the keymaster must place
|
||||
"swift.crypto.fetch_crypto_keys" in the WSGI environment before returning
|
||||
control to the decrypter. It need not be done at request-handling time. This
|
||||
lets attributes of the key be stored in sysmeta, for example the key ID in an
|
||||
external database, or anything else the keymaster wants.
|
||||
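For illustration, a keymaster might install the callable along these lines
(``lookup_key`` stands in for whatever key retrieval the keymaster implements)::

    def install_fetch_crypto_keys(env, account, container, obj):
        def fetch_crypto_keys():
            # keymaster-specific lookup; could hit an external key store
            return {'account': lookup_key(account),
                    'container': lookup_key(account, container),
                    'object': lookup_key(account, container, obj)}
        env['swift.crypto.fetch_crypto_keys'] = fetch_crypto_keys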
|
||||
|
||||
3. Cipher Choice
|
||||
================
|
||||
|
||||
3.1. The Chosen Cipher
|
||||
----------------------
|
||||
|
||||
Swift will use AES in CTR mode with 256-bit keys.
|
||||
|
||||
In order to allow for ranged GET requests, the cipher shall be used
|
||||
in counter (CTR) mode.
|
||||
|
||||
The entire object body shall be encrypted as a single byte stream. The
|
||||
initialization vector (IV) used for encrypting the object body will be randomly
|
||||
generated and stored in system metadata.
|
||||
|
||||
|
||||
3.2. Why AES-256-CTR
|
||||
--------------------
|
||||
|
||||
CTR mode basically turns a block cipher into a stream cipher, so
|
||||
dealing with range GET requests becomes much easier. No modification
|
||||
of the client's requested byte ranges is needed. When decrypting, some
|
||||
padding will be required to align the requested data to AES's 16-byte
|
||||
block size, but that can all be done at the proxy level.
|
||||
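To make the block alignment concrete, the decrypter could expand a client
range roughly as follows (a sketch; it assumes the CTR counter is advanced by
whole 16-byte blocks from the stored IV)::

    AES_BLOCK_SIZE = 16

    def align_range(first_byte, last_byte):
        """Expand a client byte range to a block boundary for the backend
        request and report how many leading bytes to discard after
        decryption."""
        aligned_first = (first_byte // AES_BLOCK_SIZE) * AES_BLOCK_SIZE
        skip = first_byte - aligned_first
        block_offset = first_byte // AES_BLOCK_SIZE  # added to the counter
        return aligned_first, last_byte, skip, block_offset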
|
||||
Remember that when a GET request is made, the decrypter knows nothing
|
||||
about the object. The object may or may not be encrypted; it may or
|
||||
may not exist. If Swift were to allow configurable cipher modes, then
|
||||
the requested byte range would have to be expanded to get enough bytes
|
||||
for any supported cipher mode at all, which means taking into account
|
||||
the block size and operating characteristics of every single supported
|
||||
cipher/blocksize/mode. Besides the network overhead (especially for
|
||||
small byteranges), the complexity of the resulting code would make it
|
||||
an excellent home for bugs.
|
||||
|
||||
3.3 Future-Proofing
|
||||
-------------------
|
||||
|
||||
The cipher and mode will be stored in system metadata on every
|
||||
encrypted object. This way, when Swift gains support for other ciphers
|
||||
or modes, existing objects can still be decrypted.
|
||||
|
||||
In general we must assume that any resource (account/container/object metadata
|
||||
or object data) in a Swift cluster may be encrypted using a different cipher,
|
||||
or not encrypted. Consequently, the cipher choice must be stored as metadata of
|
||||
every encrypted resource, along with the IV. Since user metadata may be updated
|
||||
independently of objects, this implies storing encryption related metadata of
|
||||
metadata.
|
||||
|
||||
|
||||
4. Robustness
|
||||
=============
|
||||
|
||||
|
||||
4.1 No Key
|
||||
----------
|
||||
|
||||
If the keymaster fails to add "swift.crypto.fetch_crypto_keys" to the WSGI
|
||||
environment of a GET request, then the client would receive the ciphertext of
|
||||
the object instead of the plaintext, which looks to the client like garbage.
|
||||
However, we can tell if an object is encrypted or not by the presence of system
|
||||
metadata headers, so the decrypter can prevent this by raising an error if no
|
||||
key was provided for the decryption of an encrypted object.
|
||||
|
||||
|
||||
5. Multiple Keymasters
|
||||
======================
|
||||
|
||||
5.1 Coexisting Keymasters
|
||||
-------------------------
|
||||
|
||||
Just as Swift supports multiple simultaneous auth systems, it can
|
||||
support multiple simultaneous keymasters. With auth, each auth system
|
||||
claims a subset of the Swift namespace by looking at accounts starting
|
||||
with their reseller prefix. Similarly, multiple keymasters may
|
||||
partition the Swift namespace in some way and thus coexist peacefully.
|
||||
|
||||
5.2 Keymasters in Core Swift
|
||||
----------------------------
|
||||
|
||||
5.2.1 Trivial Keymaster
|
||||
^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Swift will need a trivial keymaster for functional tests of the crypto
|
||||
engine. The trivial keymaster will not be suitable for production use
|
||||
at all. To that end, it should be deliberately kept as small as
|
||||
possible without regard for any actual security of the keys.
|
||||
|
||||
Perhaps the trivial keymaster could use the SHA-256 of a configurable
|
||||
prefix concatenated with the object's full path for the cryptographic
|
||||
key. That is,::
|
||||
|
||||
key = SHA256(prefix_from_conf + request.path)
|
||||
|
||||
This will allow for testing of the PUT and GET paths, the COPY path
|
||||
(the destination object's key will differ from the source object's),
|
||||
and also the invalid key path (by changing the prefix after an object
|
||||
is PUT).
|
||||
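The derivation is a direct transcription of the formula above, for example::

    import hashlib

    def trivial_object_key(prefix_from_conf, request_path):
        # request_path is e.g. '/v1/AUTH_test/c/o'; changing the prefix in
        # the config after a PUT exercises the invalid-key path
        return hashlib.sha256(
            (prefix_from_conf + request_path).encode('utf-8')).digest()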
|
||||
|
||||
5.2.2 Barbican Keymaster
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Swift will probably want a keymaster that stores things in Barbican at
|
||||
some point.
|
||||
|
||||
|
||||
5.3 Keymaster implementation considerations - informational only
|
||||
----------------------------------------------------------------
|
||||
|
||||
As stated above, Swift will support a variety of keymaster implementations, and
|
||||
the implementation details of any keymaster is beyond the scope of this spec
|
||||
(other than providing a trivial keymaster for testing). However, we include
|
||||
here an *informational* discussion of how keymasters might behave, particularly
|
||||
with respect to managing the choice of when to encrypt a resource (or not).
|
||||
|
||||
The keymaster is ultimately responsible for specifying *whether or not* a
|
||||
resource should be encrypted. The means of communicating this decision is the
|
||||
request environ variable `swift.crypto.override`, as discussed above. (The only
|
||||
exception to this rule may be in the case that the decrypter finds no crypto
|
||||
metadata in the headers, and assumes that the object was never encrypted.)
|
||||
|
||||
If we consider object encryption (as opposed to account or container metadata),
|
||||
a keymaster may choose to specify encryption of objects on a per-account,
|
||||
per-container or per-object basis. If encryption is specified per-account or
|
||||
per-container, the keymaster may base its decision on metadata that it (or some
|
||||
other agent) has previously set on the account or container. For example:
|
||||
|
||||
* an administrator or user might add keymaster-specific system metadata to an
|
||||
account when it is created;
|
||||
* a keymaster may inspect container metadata for a storage policy index that
|
||||
it then maps to an encrypt/don't-encrypt decision;
|
||||
* a keymaster may accept a client supplied header that enables/disables
|
||||
encryption and transform that to system metadata that it subsequently
|
||||
inspects on each request to that resource.
|
||||
|
||||
If encryption is specified per-object then the decision may be based on the
|
||||
object's name or based on client supplied header(s).
|
||||
|
||||
The keymaster is also responsible for specifying *which key* is used when a
|
||||
resource is to be encrypted/decrypted. Again, if we focus on object encryption,
|
||||
the keymaster could choose to use a unique key for each object, or for all
|
||||
objects in the same container, or for all objects in the same account (using a
|
||||
single key for an entire cluster is not disallowed but would not be
|
||||
recommended). The specification of crypto metadata storage below is flexible
|
||||
enough to support any of those choices.
|
||||
|
||||
If a keymaster chooses to specify a unique key for each object then it will
|
||||
clearly need to be capable of managing as many keys as there are objects in the
|
||||
cluster. For performance reasons it should also be capable of retrieving any
|
||||
object's key in a timely fashion when required. A keymaster *might* choose to
|
||||
store encrypted keys in Swift itself: for example, an object's unique key could
|
||||
be encrypted using its container key before being stored, perhaps as object metadata.
|
||||
However, although scalable, such a solution might not provide the desired
|
||||
properties for 'secure deletion' of keys since the deletion of an object in
|
||||
Swift does not guarantee immediate deletion of content on disk.
|
||||
|
||||
For the sake of illustration, consider a *hypothetical* keymaster
|
||||
implementation code-named Vinz. Vinz enables object encryption on a
|
||||
per-container basis:
|
||||
|
||||
* for every object PUT, Vinz inspects the target container's metadata to
|
||||
discover the container's storage policy.
|
||||
* Vinz then uses the storage policy as a key into its own encryption policy
|
||||
configuration.
|
||||
* Containers using storage-policy 'gold' or 'silver' are encrypted, containers
|
||||
using storage policy 'bronze' are not encrypted.
|
||||
* Significantly, the mapping of storage policy to encryption policy is a
|
||||
property of the keymaster alone and could be changed if desired.
|
||||
* Vinz also checks the account metadata for a metadata item
|
||||
'X-Account-Sysmeta-Vinz-Encrypt: always' that a sys admin may have set. If
|
||||
present Vinz will specify object encryption regardless of the container
|
||||
policy.
|
||||
* For objects that are to be encrypted/decrypted, Vinz adds the variable
|
||||
``swift.crypto.fetch_crypto_keys=vinz_fetch_crypto_keys`` to the request
|
||||
environ. Vinz also interacts with Barbican to fetch a key for the object's
|
||||
container which it provides in response to calls to
|
||||
``vinz_fetch_crypto_keys``.
|
||||
* For objects that are not to be encrypted/decrypted, Vinz adds the variable
|
||||
``swift.crypto.override=True`` to the request environ.
|
||||
|
||||
|
||||
6 Encryption of Object Body
|
||||
===========================
|
||||
|
||||
Each object is encrypted with the key from the keymaster. A new IV is
|
||||
randomly generated by the encrypter for each object body.
|
||||
|
||||
The IV and the choice of cipher is stored using sysmeta. For the following
|
||||
discussion we shall refer to the choice of cipher and IV collectively as
|
||||
"crypto metadata".
|
||||
|
||||
The crypto metadata for object body can be stored as an item of sysmeta that
|
||||
the encrypter adds to the object PUT request headers, e.g.::
|
||||
|
||||
X-Object-Sysmeta-Crypto-Meta: "{'iv': 'xxx', 'cipher': 'AES_CTR_256'}"
|
||||
|
||||
.. note::
|
||||
Here, and in following examples, it would be possible to omit the
|
||||
``'cipher'`` keyed item from the crypto metadata until a future
|
||||
change introduces alternative ciphers. The existence of any crypto metadata
|
||||
is sufficient to infer use of the 'AES_CTR_256' unless otherwise specified.
|
||||
|
||||
|
||||
7. Metadata Encryption
|
||||
======================
|
||||
|
||||
7.1 Background
|
||||
--------------
|
||||
|
||||
Swift entities (accounts, containers, and objects) have three kinds of
|
||||
metadata.
|
||||
|
||||
First, there is basic object metadata, like Content-Length, Content-Type, and
|
||||
Etag. These are always present and user-visible.
|
||||
|
||||
Second, there is user metadata. These are headers starting with
|
||||
X-Object-Meta-, X-Container-Meta-, or X-Account-Meta- on objects,
|
||||
containers, and accounts, respectively. There are per-entity limits on
|
||||
the number, individual sizes, and aggregate size of user metadata.
|
||||
User metadata is optional; if present, it is user-visible.
|
||||
|
||||
Third and finally, there is system metadata, often abbreviated to
|
||||
"sysmeta". These are headers starting with X-Object-Sysmeta-,
|
||||
X-Container-Sysmeta-, and X-Account-Sysmeta-. There are *no* limits on
|
||||
the number or aggregate sizes of system metadata, though there may be
|
||||
limits on individual datum sizes due to HTTP header-length
|
||||
restrictions. System metadata is not user-visible or user-settable; it
|
||||
is intended for use by Swift middleware to safely store data away from
|
||||
the prying eyes and fingers of users.
|
||||
|
||||
|
||||
7.2 Basic Object Metadata
|
||||
-------------------------
|
||||
|
||||
An object's plaintext etag and content type are sensitive information and will
|
||||
be stored encrypted, both in the container listing and in the object's
|
||||
metadata. To accomplish this, the encrypter middleware will actually encrypt
|
||||
the etag and content type *twice*: once with the object's key, and once with
|
||||
the container's key.
|
||||
|
||||
There must be a different IV used for each different encrypted header.
|
||||
Therefore, crypto metadata will be stored for the etag and content_type::
|
||||
|
||||
X-Object-Sysmeta-Crypto-Meta-ct: "{'iv': 'xxx', 'cipher': 'AES_CTR_256'}"
|
||||
X-Object-Sysmeta-Crypto-Meta-Etag: "{'iv': 'xxx', 'cipher': 'AES_CTR_256'}"
|
||||
|
||||
The object-key-encrypted values will be sent to the object server using
|
||||
``X-Object-Sysmeta-Crypto-Etag`` and ``Content-Type`` headers that will be
|
||||
stored in the object's metadata.
|
||||
|
||||
The container-key-encrypted etag and content-type values will be sent to the
|
||||
object server using header names ``X-Backend-Container-Update-Override-Etag``
|
||||
and ``X-Backend-Container-Update-Override-Content-Type`` respectively. Existing
|
||||
object server behavior is to then use these values in the ``X-Etag`` and
|
||||
``X-Content-Type`` headers included with the container update sent to the
|
||||
container server.
|
||||
|
||||
When handling a container GET request, the decrypter must process the container
|
||||
listing and decrypt every occurrence of an Etag or Content-Type using the
|
||||
container key. When handling an object GET or HEAD, the decrypter must decrypt
|
||||
the values of ``X-Object-Sysmeta-Crypto-Etag`` and
|
||||
``X-Object-Sysmeta-Crypto-Content-Type`` using the object key and copy these
|
||||
value to the ``Etag`` and ``Content-Type`` headers returned to the client.
|
||||
|
||||
This way, the client sees the plaintext etag and content type in container
|
||||
listings and in object GET or HEAD responses, just like it would without
|
||||
encryption enabled, but the plaintext values of those are not stored anywhere.
|
||||
|
||||
.. note::
|
||||
The encrypter will not know the value of the plaintext etag until it has
|
||||
processed all object content. Therefore, unless the encrypter buffers the
|
||||
entire object ciphertext (!) it cannot send the encrypted etag headers to
|
||||
object servers before the request body. Instead, the encrypter will emit a
|
||||
multipart MIME document for the request body and append the encrypted etag
|
||||
as a 'request footer'. This mechanism will build on the use of
|
||||
multipart MIME bodies in object server requests introduced by the Erasure
|
||||
Coding feature [1].
|
||||
|
||||
For basic object metadata that is encrypted (i.e. etag and content-type), the
|
||||
object data crypto metadata will apply, since this basic metadata is only set
|
||||
by an object PUT. However, the encrypted copies of basic object metadata that
|
||||
are forwarded to container servers with container updates will require
|
||||
accompanying crypto metadata to also be stored in the container server DB
|
||||
objects table. To avoid significant code churn in the container server, we
|
||||
propose to append the crypto metadata to the basic metadata value string.
|
||||
|
||||
For example, the Etag header value included with a container update will have
|
||||
the form::
|
||||
|
||||
Etag: E(CEK, <etag>); meta={'iv': 'xxx', 'cipher': 'AES_CTR_256'}
|
||||
|
||||
where ``E(CEK, <etag>)`` is the ciphertext of the object's etag encrypted with
|
||||
the container key (``CEK``).
|
||||
|
||||
When handling a container GET listing, the decrypter will need to parse each
|
||||
etag value in the listing returned from the container server and transform its
|
||||
value to the plaintext etag expected in the response to the client. Since a
|
||||
'regular' plaintext etag is a fixed length string that cannot contain the ';'
|
||||
character, the decrypter will be able to easily differentiate between an
|
||||
unencrypted etag value and an etag value with appended crypto metadata that by
|
||||
design is always longer than a plaintext etag.
|
||||
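For illustration, the decrypter might unpack such listing values along these
lines (the exact serialisation of the appended crypto metadata is an
implementation detail)::

    from ast import literal_eval

    def split_listing_value(value):
        # a plain etag is fixed length and never contains ';'
        if ';' not in value:
            return value, None
        ciphertext, _, meta = value.partition(';')
        meta = meta.strip()
        crypto_meta = (literal_eval(meta[len('meta='):])
                       if meta.startswith('meta=') else None)
        return ciphertext.strip(), crypto_meta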
|
||||
The crypto metadata appended to the container update etag will also be valid
|
||||
for the encrypted content-type ``E(CEK, <content-type>)`` since both are set at
|
||||
the same time. However, other proposed work [2] makes it possible to update the
|
||||
object content-type with a POST, meaning that the crypto metadata associated
|
||||
with content-type value could be different to that associated with the etag. We
|
||||
therefore propose to similarly append crypto metadata in the content-type value
|
||||
that is destined for the container server::
|
||||
|
||||
Content-Type: E(CEK, <content-type>); meta="{'iv': 'yyy', 'cipher': 'AES_CTR_256'}"
|
||||
|
||||
In this case the use of the ';' separator character will allow the decrypter to
|
||||
parse content-type values in container listings and remove the crypto metadata
|
||||
attribute.
|
||||
|
||||
7.2.1 A Note On Etag
|
||||
^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
In the stored object's metadata, the basic-metadata field named "Etag"
|
||||
will contain the MD5 hash of the ciphertext. This is required so that
|
||||
the object server will not error out on an object PUT, and also so
|
||||
that the object auditor will not quarantine the object due to hash
|
||||
mismatch (unless bit rot has happened).
|
||||
|
||||
The plaintext's MD5 hash will be stored, encrypted, in system
|
||||
metadata.
|
||||
|
||||
|
||||
7.3 User Metadata
|
||||
-----------------
|
||||
|
||||
Not only the contents of an object are sensitive; metadata is sensitive too.
|
||||
Since metadata values must be valid UTF-8 strings, the encrypted values will be
|
||||
suitably encoded (probably base64) for storage. Since this encoding may
|
||||
increase the size of user metadata values beyond the allowed limits, the
|
||||
metadata limit checking will need to be implemented by the encrypter
|
||||
middleware. That way, users don't see lower metadata-size limits when
|
||||
encryption is in use. The encrypter middleware will set a request environ key
|
||||
`swift.constraints.override` to indicate to the proxy-server that limit
|
||||
checking has already been applied.
|
||||
|
||||
User metadata names will *not* be encrypted. Since a different IV (or indeed a
|
||||
different cipher) may be used each time metadata is updated by a POST request,
|
||||
encrypting metadata names would make it impossible for Swift to delete
|
||||
out-dated metadata items. Similarly, if encryption is enabled on an existing
|
||||
Swift cluster, encrypting metadata names would prevent previously unencrypted
|
||||
metadata being deleted when updated.
|
||||
|
||||
For each piece of user metadata on objects we need to store crypto metadata,
|
||||
since all user metadata items are encrypted with a different IV. This cannot
|
||||
be stored as an item of sysmeta since sysmeta cannot be updated by an object
|
||||
POST. We therefore propose to modify the object server to persist the headers
|
||||
``X-Object-Massmeta-Crypto-Meta-*`` with the same semantic as ``X-Object-Meta-*``
|
||||
headers i.e. ``X-Object-Massmeta-Crypto-Meta-*`` will be updated on every POST
|
||||
and removed if not present in a POST. The gatekeeper middleware will prevent
|
||||
``X-Object-Massmeta-Crypto-Meta-*`` headers ever being included in client
|
||||
requests or responses.
|
||||
|
||||
The encrypter will add a ``X-Object-Massmeta-Crypto-Meta-<key>`` header
|
||||
to object PUT and POST request headers for each piece of user metadata, e.g.::
|
||||
|
||||
X-Object-Massmeta-Crypto-Meta-<key>: "{'iv': 'zzz', 'cipher': 'AES_CTR_256'}"
|
||||
|
||||
.. note::
|
||||
There is likely to be value in adding a generic mechanism to persist *any*
|
||||
header in the ``X-Object-Massmeta-`` namespace, and adding that prefix to
|
||||
those blacklisted by the gatekeeper. This would support other middlewares
|
||||
(such as a keymaster) similarly annotating user metadata with middleware
|
||||
generated metadata.
|
||||
|
||||
For user metadata on containers and accounts we need to store crypto metadata
|
||||
for each item of user metadata, since these can be independently updated by
|
||||
POST requests. Here we can use sysmeta to store the crypto metadata items,
|
||||
e.g. for a user metadata item with key ``X-Container-Meta-Color`` we would
|
||||
store::
|
||||
|
||||
X-Container-Sysmeta-Crypto-Meta-Color: "{'iv': 'ccc', 'cipher': 'AES_CTR_256'}"
|
||||
|
||||
7.4 System Metadata
|
||||
-------------------
|
||||
|
||||
System metadata ("sysmeta") will not be encrypted.
|
||||
|
||||
Consider a middleware that uses sysmeta for storage. If, for some
|
||||
reason, that middleware moves from before-crypto to after-crypto in
|
||||
the pipeline, then all its previously stored sysmeta will become
|
||||
unreadable garbage from its viewpoint.
|
||||
|
||||
Since middlewares sometimes do move, either due to code changes or to
|
||||
correct an erroneous configuration, we prefer robustness of the
|
||||
storage system here.
|
||||
|
||||
7.5 Summary
|
||||
-----------
|
||||
|
||||
The encrypter will set the following headers on PUT requests to object
|
||||
servers::
|
||||
|
||||
Etag = MD5(ciphertext) (IFF client request included an etag header)
|
||||
X-Object-Sysmeta-Crypto-Meta-Etag = {'iv': <iv>, 'cipher': <C_req>}
|
||||
|
||||
Content-Type = E(OEK, content-type)
|
||||
X-Object-Sysmeta-Crypto-Meta-ct = {'iv': <iv>, 'cipher': <C_req>}
|
||||
|
||||
X-Object-Sysmeta-Crypto-Meta = {'iv': <iv>, 'cipher': <C_req>}
|
||||
X-Object-Sysmeta-Crypto-Etag = E(OEK, MD5(plaintext))
|
||||
|
||||
X-Backend-Container-Update-Override-Etag = \
|
||||
E(CEK, MD5(plaintext)); meta={'iv': <iv>, 'cipher': <C_req>}
|
||||
X-Backend-Container-Update-Override-Content-Type = \
|
||||
E(CEK, content-type); meta={'iv': <iv>, 'cipher': <C_req>}
|
||||
|
||||
where ``OEK`` is the object encryption key, ``iv`` is a randomly chosen
|
||||
initialization vector and ``C_req`` is the cipher used while handling this
|
||||
request.
|
||||
|
||||
Additionally, on object PUT or POST requests that include user defined
|
||||
metadata headers, the encrypter will set::
|
||||
|
||||
X-Object-Meta-<user_key> = E(OEK, <user_value>) for every <user_key>
|
||||
X-Object-Massmeta-Crypto-Meta-<user_key> = {'iv': <iv>, 'cipher': <C_req>}
|
||||
|
||||
On PUT or POST requests to container servers, the encrypter will set the
|
||||
following headers for each user defined metadata header::
|
||||
|
||||
X-Container-Meta-<user_key> = E(CEK, <user_value>)
|
||||
X-Container-Sysmeta-Crypto-Meta-<user_key> = {'iv': <iv>, 'cipher': <C_req>}
|
||||
|
||||
Similarly, on PUT or POST requests to account servers, the encrypter will set
|
||||
the following headers for each user defined metadata header::
|
||||
|
||||
X-Account-Meta-<user_key> = E(AEK, <user_value>)
|
||||
X-Account-Sysmeta-Crypto-Meta-<user_key> = {'iv': <iv>, 'cipher': <C_req>}
|
||||
|
||||
where ``AEK`` is the account encryption key.
|
||||
|
||||
|
||||
8. Client-Visible Changes
|
||||
=========================
|
||||
|
||||
There are no known client-visible API behavior changes in this spec.
|
||||
If any are found, they should be treated as flaws and fixed.
|
||||
|
||||
|
||||
9. Possible Future Work
|
||||
=======================
|
||||
|
||||
9.1 Protection of Internal Network
|
||||
----------------------------------
|
||||
|
||||
Swift's security model is perimeter-based: the proxy server handles
|
||||
authentication and authorization, then makes unauthenticated requests
|
||||
on a private internal network to the storage servers. If an attacker
|
||||
gains access to the internal network, they can read and modify any
|
||||
object in the Swift cluster, as well as create new ones. It is
|
||||
possible to use authenticated encryption (e.g. HMAC, GCM) to detect
|
||||
object tampering.
|
||||
|
||||
Roughly, this would involve computing a strong hash (e.g. SHA-384
|
||||
or SHA-3) of the object, then authenticating that hash. The object
|
||||
auditor would have to get involved here so that we'd have an upper
|
||||
bound on how long it takes to detect a modified object.
|
||||
|
||||
Also, to prevent an attacker from simply overwriting an encrypted
|
||||
object with an unencrypted one, the crypto engine would need the
|
||||
ability to notice a GET for an unencrypted object and return an error.
|
||||
This implies that this feature is primarily good for clusters that
|
||||
have always had encryption on, which (sadly) excludes clusters that
|
||||
pre-date encryption support.
|
||||
|
||||
|
||||
9.2 Other ciphers
|
||||
-----------------
|
||||
|
||||
AES-256 may be considered inadequate at some point, and support for
|
||||
another cipher will then be needed.
|
||||
|
||||
|
||||
9.3 Client-Managed Keys
|
||||
-----------------------
|
||||
|
||||
CPU-constrained clients may want to manage their own encryption keys
|
||||
but have Swift perform the encryption. Amazon S3 supports something
|
||||
like this. Client-managed key support would probably take the form of
|
||||
a new keymaster.
|
||||
|
||||
9.4 Re-Keying Support
|
||||
---------------------
|
||||
|
||||
Instead of using the object key K-obj and computing the ciphertext as
|
||||
E(K-obj, plaintext), treat the object key as a key-encrypting-key
|
||||
(KEK) and make up a random data-encrypting key (DEK) for each object.
|
||||
|
||||
Then, the object ciphertext would be E(DEK, plaintext), and in system
|
||||
metadata, Swift would store E(KEK, DEK). This way, if we wish to
|
||||
re-key objects, we can decrypt and re-encrypt the DEK to do it, thus
|
||||
turning a re-key operation from a full read-modify-write cycle to a
|
||||
simple metadata update.
|
||||
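A minimal sketch of the idea, assuming AES-CTR via the ``cryptography`` library
and ignoring how the wrapped DEK is actually serialised into sysmeta (a real
implementation might prefer a proper key-wrap construction)::

    import os

    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes


    def _aes_ctr(key, iv, data):
        # AES-CTR is symmetric: the same operation encrypts and decrypts
        cryptor = Cipher(algorithms.AES(key), modes.CTR(iv)).encryptor()
        return cryptor.update(data) + cryptor.finalize()


    def wrap_dek(kek, dek):
        """Encrypt the per-object DEK under the KEK."""
        iv = os.urandom(16)
        return iv, _aes_ctr(kek, iv, dek)


    def rekey(old_kek, new_kek, iv, wrapped_dek):
        """Re-key without touching object data: unwrap the DEK with the old
        KEK and re-wrap it with the new one; only sysmeta changes."""
        dek = _aes_ctr(old_kek, iv, wrapped_dek)
        return wrap_dek(new_kek, dek)
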
|
||||
|
||||
Alternatives
|
||||
============
|
||||
|
||||
Storing user metadata in sysmeta
|
||||
--------------------------------
|
||||
|
||||
To avoid the need to check metadata header limits in the encrypter, encrypted
|
||||
metadata values could be stored using sysmeta, which is not subject to the same
|
||||
limits. When handling a GET or HEAD response, the decrypter would need to
|
||||
decrypt metadata values and copy them back to user metadata headers.
|
||||
|
||||
This alternative was rejected because object sysmeta cannot be updated by a
|
||||
POST request, and so Swift would be restricted to operating in the POST-as-copy
|
||||
mode when encryption is enabled.
|
||||
|
||||
Enforce a single immutable cipher choice per container
|
||||
------------------------------------------------------
|
||||
|
||||
We could avoid storing cipher choice as metadata on every resource (including
|
||||
individual metadata items) if the choice of cipher were made immutable for a
|
||||
container or even for an account. Unfortunately it is hard to implement an
|
||||
immutable property in an eventually consistent system that allows multiple
|
||||
concurrent operations on distributed replicas of the same resource.
|
||||
|
||||
Container storage policy is 'eventually immutable' (any inconsistency is
|
||||
eventually reconciled across replicas and no replica's policy state may be
|
||||
updated by a client request). If we made cipher choice a property of a policy
|
||||
then the cipher for a container could be similarly 'eventually immutable'.
|
||||
However, it would be possible for objects in the same container to be encrypted
|
||||
using different ciphers during any initial window of policy inconsistency
|
||||
immediately after the container is first created. The existing container policy
|
||||
reconciler process would need to re-encrypt any object found to have used the
|
||||
'wrong' cipher, and to do so it would need to know which cipher had been used
|
||||
for each object, which leads back to cipher choice being stored per-object.
|
||||
|
||||
It should also be noted that the IV would still need to be stored for every
|
||||
resource, so this alternative would not mitigate the need to store crypto
|
||||
metadata in general.
|
||||
|
||||
Furthermore, binding cipher choice to container policy does not provide a means
|
||||
to guarantee an immutable cipher choice for account metadata.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignees:
|
||||
|
||||
| jrichli@us.ibm.com
|
||||
| alistair.coles@hp.com
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
[1] http://specs.openstack.org/openstack/swift-specs/specs/done/erasure_coding.html
|
||||
|
||||
[2] Updating containers on object fast-POST: https://review.openstack.org/#/c/102592/
|
@ -1,368 +0,0 @@
|
||||
::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=============================
|
||||
Changing Policy of Containers
|
||||
=============================
|
||||
|
||||
Our proposal is to give Swift users the power to change the storage policies of
containers and of the objects which are contained in those containers.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
Swift currently prohibits users from changing a container's storage policy, and
this constraint raises at least two problems.
|
||||
|
||||
One problem is flexibility. For example, consider an organization using Swift
as backup storage for office data, where all data is archived monthly in a
container named after the date, like 'backup-201502'. Older archives become
less important, so users want to reduce the capacity consumed to store them.
Such users will try to change the storage policy of the container to a cheaper
one, such as a '2-replica policy' or an 'EC policy', but they will be strongly
disappointed to find out that they cannot change the policy of a container
once it has been created. The workaround for this problem is to create a new
container with the other storage policy and then copy all objects from the
existing container to it, but this workaround raises another problem.
|
||||
|
||||
Another problem is reachability. Copying all files to another container
changes all of the files' URLs, which makes users confused and frustrated. The
workaround for this problem is that, after copying all files to the new
container, users delete the old container, create a container with the same
name and the other storage policy, and then copy all objects back to the
container with the original name. However, this obviously involves twice the
workload and takes twice as long as a single copy.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
The ring normally differs from one policy to another, so an 'a/c/o' object of
policy 1 is likely to be placed on devices of different nodes than an 'a/c/o'
object of policy 0. Therefore, the object replacement associated with a policy
change takes a very long time and generates heavy internal traffic. For this
reason, a user request to change a policy must be translated
into the asynchronous transfer of objects among storage nodes, which is
driven by background daemons. Obviously, Swift must not suspend any
user requests to store or get data while policies are being changed.
|
||||
|
||||
We need to add or modify Swift servers' and daemons' behaviors as follows:
|
||||
|
||||
**Servers' changes**
|
||||
|
||||
1. Adding POST container API to send a request for changing a storage policy
|
||||
of a container
|
||||
#. Adding response headers for GET/HEAD container API to notify how many
|
||||
objects are placed in a new policy or still in an old policy
|
||||
#. Modifying GET/HEAD object API to get an object even if replicas are placed
|
||||
in a new policy or in an old policy
|
||||
|
||||
**Daemons' changes**
|
||||
|
||||
1. Adding container-replicator a behavior to watch a container which is
|
||||
requested to change its storage policy
|
||||
#. Adding a new background daemon which transfers objects among storage nodes
|
||||
from an old policy to a new policy
|
||||
|
||||
Servers' changes
|
||||
----------------
|
||||
|
||||
1. Add New Behavior for POST Container
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Currently, Swift returns "204 No Content" for a user POST container request
with an X-Storage-Policy header. This indicates "nothing done." To maintain
backward compatibility and avoid accidental execution, we prefer to leave this
behavior unchanged. Therefore, we propose introducing a new header to
'forcibly' execute the policy change, as follows.
|
||||
|
||||
.. list-table:: Table 1: New Request Header to change Storage Policy
|
||||
:widths: 30 8 12 50
|
||||
:header-rows: 1
|
||||
|
||||
* - Parameter
|
||||
- Style
|
||||
- Type
|
||||
- Description
|
||||
* - X-Forced-Change-Storage-Policy: <policy_name> (Optional)
|
||||
- header
|
||||
- xsd:string
|
||||
- Change a storage policy of a container to the policy specified by
|
||||
'policy_name'. This change accompanies asynchronous background process
|
||||
to transfer objects.
|
||||
|
||||
Possible responses for this API are as follows.
|
||||
|
||||
.. list-table:: Table 2: Possible Response Codes for the New Request
|
||||
:widths: 2 8
|
||||
:header-rows: 1
|
||||
|
||||
* - Code
|
||||
- Notes
|
||||
* - 202 Accepted
|
||||
- Accept the request properly and start to prepare objects replacement.
|
||||
* - 400 Bad Request
|
||||
- Reject the request with a policy which is deprecated or is not defined
|
||||
in a configuration file.
|
||||
* - 409 Conflict
|
||||
- Reject the request because another changing policy process is not
|
||||
completed yet (relating to 3-c change)
|
||||
|
||||
When a policy change request is accepted (response code 202), the target
container stores the following two sysmeta items.
|
||||
|
||||
.. list-table:: Table 3: Container Sysmetas for Changing Policies
|
||||
:widths: 2 8
|
||||
:header-rows: 1
|
||||
|
||||
* - Sysmeta
|
||||
- Notes
|
||||
* - X-Container-Sysmeta-Prev-Index: <int>
|
||||
- "Pre-change" policy index. It will be used for GET or DELETE objects
|
||||
which are not transferred to the new policy yet.
|
||||
* - X-Container-Sysmeta-Objects-Queued: <bool>
|
||||
- This will be used for determining the status of policy changing by
|
||||
daemon processes. If False, policy change request is accepted but not
|
||||
ready for objects transferring. If True, objects have been queued to the
|
||||
special container for policy changing so those are ready for
|
||||
transferring. If undefined, policy change is not requested to that
|
||||
container.
|
||||
|
||||
This feature should be implemented as middleware 'change-policy' because of
|
||||
the following two reasons:
|
||||
|
||||
1. This operation should probably be authorized only to a limited group
   (e.g., the Swift cluster's admin (reseller_admin)) because it causes heavy
   internal traffic.
   Therefore, authorization for this operation should be managed at the
   middleware level.
#. This operation needs to POST sysmeta to the container, and sysmeta must be
   managed at the middleware level according to Swift's design principles.
|
||||
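For illustration, a client request using the proposed header might look like
the following sketch (hypothetical host, token and policy name; it assumes the
change-policy middleware is in the proxy pipeline)::

    import requests

    resp = requests.post(
        'http://127.0.0.1:8080/v1/AUTH_test/backup-201502',
        headers={'X-Auth-Token': '<reseller_admin token>',
                 'X-Forced-Change-Storage-Policy': 'ec42'})
    # 202: accepted, 400: unknown/deprecated policy, 409: a change is in progress
    print(resp.status_code)
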
|
||||
2. Add Response Headers for GET/HEAD Container
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Objects will be transferred gradually by backend processes. From the viewpoint
|
||||
of Swift operators, it is important to know the progress of policy changing,
|
||||
that is, how many objects are already transferred or still remain
|
||||
untransferred. This can be accomplished by simply exposing the policy_stat table of the
container DB file for each storage policy. Each policy's stats will be exposed
|
||||
by ``X-Container-Storage-Policy-<Policy_name>-Bytes-Used`` and
|
||||
``X-Container-Storage-Policy-<Policy_name>-Object-Count`` headers as follows::
|
||||
|
||||
$ curl -v -X HEAD -H "X-Auth-Token: tkn" http://<host>/v1/AUTH_test/container
|
||||
< HTTP/1.1 200 OK
|
||||
< X-Container-Storage-Policy-Gold-Object-Count: 3
|
||||
< X-Container-Storage-Policy-Gold-Bytes-Used: 12
|
||||
< X-Container-Storage-Policy-Ec42-Object-Count: 7
|
||||
< X-Container-Storage-Policy-Ec42-Bytes-Used: 28
|
||||
< X-Container-Object-Count: 10
|
||||
< X-Container-Bytes-Used: 40
|
||||
< Accept-Ranges: bytes
|
||||
< X-Storage-Policy: ec42
|
||||
< ...
|
||||
|
||||
The above response indicates that 70% of the object transfer is done.
|
||||
|
||||
3. Modify Behavior of GET/HEAD object API
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
In my current thinking, object PUTs should be done only to the new policy.
This does not affect any object in the previous policy, which keeps the
process of changing policies simple.
Therefore, the best way to get an object is to first send a GET request to
the object servers according to the new policy's ring and, if the response code
is 404 NOT FOUND, have the proxy resend the GET request to the previous
policy's object servers.
|
||||
|
||||
However, this behavior is still under discussion because sending GET/HEAD
requests twice to the object servers can increase the latency of a user's GET
object request, especially in the early phase of a policy change.
|
||||
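A minimal sketch of the fallback logic discussed above; ``fetch`` is a stand-in
for however the proxy issues a backend GET for a given policy index::

    def get_object_with_fallback(fetch, new_policy_idx, old_policy_idx):
        """Try the new policy's ring first; on 404, fall back to the old one."""
        status, resp = fetch(new_policy_idx)
        if status == 404 and old_policy_idx is not None:
            # object not transferred yet, so it should still be in the old policy
            status, resp = fetch(old_policy_idx)
        return status, resp
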
|
||||
Daemons' changes
|
||||
----------------
|
||||
|
||||
1. container-replicator
|
||||
^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
To enqueue objects onto the list for policy changing, some process must watch
for containers that have been requested to change their policy. Adding this
task to the container-replicator seems the best way because the
container-replicator already has the role of walking all container DBs to
sanity check the Swift cluster. This minimizes the extra time spent locking
container DBs for this new feature.
|
||||
|
||||
Container-replicator will check if a container has
|
||||
``X-Container-Sysmeta-Objects-Queued`` sysmeta and its value is False. Objects
|
||||
in that container should be enqueued to the object list of a special container
|
||||
for changing policies. That special container is created under the special
|
||||
account ``.change_policy``. The name of a special container should be unique
and have a one-to-one relationship with the container for which the policy change is
requested. The name of a special container is simply defined as
|
||||
``<account_name>:<container_name>``. This special account and containers are
|
||||
accessed by the new daemon ``object-transferrer``, which actually transfers
|
||||
objects from the old policy to the new policy.
|
||||
|
||||
2. object-transferrer
|
||||
^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Object-transferrer is newly introduced daemon process for changing policies.
|
||||
Object-transferrer reads lists of special containers from the account
|
||||
``.change_policy`` and reads lists of objects from each special container.
|
||||
Object-transferrer transfers those objects from the old policy to the new
|
||||
policy by using internal client. After an object is successfully transferred
|
||||
to the new policy, an object in the old policy will be deleted by DELETE
|
||||
method.
|
||||
|
||||
When the transferrer finishes transferring all objects in a special container,
it deletes the special container and deletes the sysmeta items
``X-Container-Sysmeta-Prev-Index`` and ``X-Container-Sysmeta-Objects-Queued``
from the original container to change that container's status from IN-CHANGING
back to normal (POLICY CHANGE COMPLETED).
|
||||
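A rough sketch of one transferrer pass over a single special container; the
``client`` methods here are stand-ins for internal-client style helpers, not
the real API::

    def transfer_container(client, account, container, old_idx, new_idx):
        # the special container is named '<account_name>:<container_name>'
        special = '%s:%s' % (account, container)
        for obj in client.iter_objects('.change_policy', special):
            name = obj['name']
            # read from the old policy, write to the new one, then clean up
            body = client.get_object(account, container, name, policy_index=old_idx)
            client.put_object(account, container, name, body, policy_index=new_idx)
            client.delete_object(account, container, name, policy_index=old_idx)
            # finally remove the queue entry from the special container
            client.delete_object('.change_policy', special, name)
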
|
||||
Example
|
||||
-------
|
||||
|
||||
.. list-table:: Table 4: Example of data transition during changing policies
|
||||
:widths: 1 4 2 4 2
|
||||
:header-rows: 1
|
||||
|
||||
* - Step
|
||||
- Description
|
||||
- Container /a/c
|
||||
objects
|
||||
- Container /a/c/ metadata
|
||||
- Container /.change_policy/a:c
|
||||
objects
|
||||
* - | 0
|
||||
- | Init.
|
||||
- | ('o1', 1)
|
||||
| ('o2', 1)
|
||||
| ('o3', 1)
|
||||
- | X-Backend-Storage-Policy-Index: 1
|
||||
- | N/A
|
||||
* - | 1
|
||||
- | POST /a/c X-Forced-Change-Storage-Policy: Pol-2
|
||||
- | ('o1', 1)
|
||||
| ('o2', 1)
|
||||
| ('o3', 1)
|
||||
- | X-Backend-Storage-Policy-Index: 2
|
||||
| X-Container-Sysmeta-Prev-Policy-Index: 1
|
||||
| X-Container-Sysmeta-Objects-Queued: False
|
||||
- | N/A
|
||||
* - | 2
|
||||
- | container-replicator seeks policy changing containers
|
||||
- | ('o1', 1)
|
||||
| ('o2', 1)
|
||||
| ('o3', 1)
|
||||
- | X-Backend-Storage-Policy-Index: 2
|
||||
| X-Container-Sysmeta-Prev-Policy-Index: 1
|
||||
| X-Container-Sysmeta-Objects-Queued: True
|
||||
- | ('o1', 0, 'application/x-transfer-1-to-2')
|
||||
| ('o2', 0, 'application/x-transfer-1-to-2')
|
||||
| ('o3', 0, 'application/x-transfer-1-to-2')
|
||||
* - | 3
|
||||
- | object-transferrer transfers 'o1' and 'o3'
|
||||
- | ('o1', 2)
|
||||
| ('o2', 1)
|
||||
| ('o3', 2)
|
||||
- | X-Backend-Storage-Policy-Index: 2
|
||||
| X-Container-Sysmeta-Prev-Policy-Index: 1
|
||||
| X-Container-Sysmeta-Objects-Queued: True
|
||||
- | ('o2', 0, 'application/x-transfer-1-to-2')
|
||||
* - | 4
|
||||
- | object-transferrer transfers 'o2'
|
||||
- | ('o1', 2)
|
||||
| ('o2', 2)
|
||||
| ('o3', 2)
|
||||
- | X-Backend-Storage-Policy-Index: 2
|
||||
| X-Container-Sysmeta-Prev-Policy-Index: 1
|
||||
| X-Container-Sysmeta-Objects-Queued: True
|
||||
- | Empty
|
||||
* - | 5
|
||||
- | object-transferrer deletes a special container and metadatas from
|
||||
container /a/c
|
||||
- | ('o1', 2)
|
||||
| ('o2', 2)
|
||||
| ('o3', 2)
|
||||
- | X-Backend-Storage-Policy-Index: 2
|
||||
- | N/A
|
||||
|
||||
The above table focuses on the data transitions of a container whose storage
policy is being changed and of the corresponding special container. Each tuple
shows object info: the first element is the object name, the second is a policy
index, and the third, if present, is a content-type value defined for policy
changing.
|
||||
|
||||
Suppose three objects are stored in the container ``/a/c`` under policy-1
(Step 0). When the request to change this container's policy to policy-2 is
accepted (Step 1), the backend policy index is changed to 2 and two sysmeta
items are stored in the container. During its periodic pass, the
container-replicator finds a container with the policy change sysmeta and then
creates a special container ``/.change_policy/a:c`` with a list of objects
(Step 2). Those objects carry the old and new policy information in the
content-type field. When the object-transferrer finds this special container
in the ``.change_policy`` account, it gets the objects from the old policy
(usually from a local device) and puts them to the new policy's storage nodes
(Steps 3 and 4). When the special container becomes empty (Step 5), policy
changing for that container has finished, so the special container is deleted
and the policy changing metadata on the original container is deleted as well.
|
||||
|
||||
Alternatives: As Sub-Function of Container-Reconciler
|
||||
-----------------------------------------------------
|
||||
|
||||
The container-reconciler is a daemon process which restores objects registered
in an incorrect policy into the correct policy. The reconciling procedure
therefore satisfies almost all of the functional requirements for policy
changing. The advantage of using the container-reconciler for policy changing
is that only a very few points of the existing Swift source need to be
modified. However, there is a big problem with using the container-reconciler:
it has no way to determine when the policy change of the objects contained in
a specific container is complete. As a result, this problem makes it
complicated to handle GET/HEAD object requests against the previous policy and
to allow the next storage policy change request. Based on discussions at the
Swift hack-a-thon (held in Feb. 2015) and the Tokyo Summit (held in Oct. 2015),
we decided to add the object-transferrer to change a container's policy.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
Daisuke Morita (dmorita)
|
||||
|
||||
Milestones
|
||||
----------
|
||||
|
||||
Target Milestone for completion:
|
||||
Mitaka
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Add API for Policy Changing
|
||||
|
||||
* Add a 'change-policy' middleware to process the container POST request with
|
||||
"X-Forced-Change-Storage-Policy" header. This middleware stores sysmeta
|
||||
headers to target container DB for policy changing.
|
||||
* Modify container-server to add response headers for Container GET/HEAD
|
||||
request to show the progress of changing policies by exposing all the info
|
||||
from policy_stat table
|
||||
* Modify proxy-server (or add a feature to new middleware) to get object for
|
||||
referring both new and old policy index to allow users' object read during
|
||||
changing policy
|
||||
|
||||
* Add daemon process among storage nodes for policy changing
|
||||
|
||||
* Modify container-replicator to watch a container if it should be initialized
|
||||
(creation of a corresponding special container) for changing policies
|
||||
* Write object-transferrer code
|
||||
* Daemonize object-transferrer
|
||||
|
||||
* Add unit, functional and probe tests to check that the new code works as
  intended and that it behaves correctly in split-brain cases
|
||||
|
@ -1,588 +0,0 @@
|
||||
::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
..
|
||||
This template should be in ReSTructured text. Please do not delete
|
||||
any of the sections in this template. If you have nothing to say
|
||||
for a whole section, just write: "None". For help with syntax, see
|
||||
http://sphinx-doc.org/rest.html To test out your formatting, see
|
||||
http://www.tele3.cz/jbar/rest/rest.html
|
||||
|
||||
===============================
|
||||
Container Sharding
|
||||
===============================
|
||||
|
||||
Include the URL of your blueprint:
|
||||
|
||||
https://blueprints.launchpad.net/swift/+spec/container-sharding
|
||||
|
||||
A current limitation in Swift is the container database. The SQLite database
stores the names of all the objects inside the container. As the number of
objects in a container grows, so does the size of the database file. This
causes higher latency due to the size of, and contention on, the single file,
and can be improved by using container sharding.
|
||||
|
||||
Over the last year, a few POCs have been covered. The last POC used
distributed prefix trees, which worked well (it kept ordering and allowed
effectively unlimited sharding), but at the last hackathon (August, in Austin)
it was found to require too many requests. In smaller or lightly loaded
clusters this would have been fine, but after talking to users running clusters
under high load, this approach only added to their problems. The code for this
approach can be found in the sharding_trie branch of
https://github.com/matthewoliver/swift/.

After discussions at the hackathon, it was decided we should try a similar but
simpler approach, which I am calling the Pivot Range approach. This POC is
being worked on in the sharding_range branch.
|
||||
|
||||
https://github.com/matthewoliver/swift/tree/sharding_range
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
The SQLite database used to represent a container stores the names of all the objects
contained within. As the number of objects in a container grows, so does the size of
|
||||
the database file. Because of this behaviour, the current suggestion for clusters storing
|
||||
many objects in a single container is to make sure the container databases are stored on
|
||||
SSDs, to reduce latency when accessing large database files.
|
||||
|
||||
In a previous version of this spec, I investigated different approaches we could use
|
||||
to shard containers. These were:
|
||||
|
||||
#. Path hashing (part power)
|
||||
|
||||
#. Consistent Hash ring
|
||||
|
||||
#. Distributed prefix trees (trie)
|
||||
|
||||
#. Pivot/Split tree (Pivot Ranges)
|
||||
|
||||
In discussions about the spec at the SFO Swift hackathon, distributed prefix
trees (tries) became the forerunner. More recently, at the Austin hackathon, it
was found that the prefix trie approach, though it worked, would cause more
requests and, on larger highly loaded clusters, might actually cause more
issues than it was solving.
It was decided to try a similar but simplified approach, which I'm calling the
pivot (or split) tree approach. This is what this version of the spec covers.
|
||||
|
||||
When talking about splitting up the objects in a container, we are only talking about the container metadata, not the objects themselves.
|
||||
|
||||
The Basic Idea
|
||||
=================
|
||||
|
||||
The basic idea is rather simple. Firstly, to enable container sharding, pass an
"X-Container-Sharding: On" header via either PUT or POST::
|
||||
|
||||
curl -i -H 'X-Auth-Token: <token>' -H 'X-Container-Sharding: On' <url>/<account>/<container> -X PUT
|
||||
|
||||
Once enabled, when a container gets too full (say, at 1 million objects), a pivot point is found
(the middle item) which will be used to split the container. This split will create 2 additional containers, each
holding 1/2 the objects. The configuration parameter `shard_container_size` determines what size a container can get to before it's sharded (defaulting to 1 million).
|
||||
|
||||
All new containers created when splitting exist in a separate account namespace based on the user's account, meaning the user will only
ever see 1 container, which we call the root container. The sharded namespace is::
|
||||
|
||||
.sharded_<account>/
|
||||
|
||||
Containers that have been split no longer hold object metadata and so, once the new containers are durable, they can be deleted (except for the root container).
The root container, like any other split container, contains no objects in its ``object`` table; however, it
has a new table to store the pivot/range information. This information can be used to easily and quickly
determine where metadata should live.
|
||||
|
||||
The pivot (split) tree, ranges and Swift sharding
|
||||
=====================================================
|
||||
|
||||
A determining factor in which sharding technique we chose was that keeping a consistent order is
important. A prefix tree is a good solution, but we need something even simpler. Conceptually we
can split a container in two on a pivot (the middle object) in the object list, turning the resulting
prefix tree into a more basic binary tree. In the initial version of this new POC, we had a class called
PivotTree, which was a binary tree with the extra smarts we needed, but as development went on,
maintaining a full tree became more complex because we were only storing the pivot node (to save space).
Finding the bounds of what should belong in a part of the tree (for misplaced object checks, see later)
became rather complicated.
We have since decided to simplify the design again and store a list of ranges (like an encyclopaedia's volumes), which
still behaves like a binary tree (in the searching algorithm) but also greatly simplifies parts of the sharding
in Swift.
|
||||
|
||||
The pivot_tree version still exists (although incomplete) in the pivot_tree branch.
|
||||
|
||||
Pivot Tree vs Pivot Range
|
||||
----------------------------
|
||||
|
||||
Let's start with a picture, this is how the pivot tree worked:
|
||||
|
||||
.. image:: images/PivotPoints.png
|
||||
|
||||
Here, the small circles under the containers represent the point on which the container was pivoted,
|
||||
and thus you can see the pivot tree.
|
||||
|
||||
The picture was one I used in the last spec, and also demonstrates how the naming of a sharded container
|
||||
is defined and how they are stored in the DB.
|
||||
|
||||
Looking at the ``pivot_points`` table from the above image, you can see that the original container '/acc/cont' has been split a few times:
|
||||
|
||||
* First it pivoted at 'l', which would have created 2 new sharded containers (cont.le.l and cont.gt.l).
|
||||
* Second, container /.sharded_acc/cont.le.l was split at pivot 'f', creating cont.le.f and cont.gt.f.
* Finally, the cont.gt.l container also split, pivoting on 'r', creating cont.le.r and cont.gt.r.
|
||||
|
||||
Because it is essentially a binary tree, we can infer the existence of these additional 6 containers with just 3 pivots in the pivot table. The level of the pivot tree at which each pivot lives is also stored, so we are sure to build the tree correctly whenever it's needed.
|
||||
|
||||
The tree was stored in the database basically as a list, and the tree needed to be rebuilt from it. In the
range approach, we just use a list of ranges. A rather simple PivotRange class was introduced which
has methods that make searching ranges, and thus the binary search algorithm, simple.
|
||||
|
||||
Here is an example of the same data stored in PivotRanges:
|
||||
|
||||
.. image:: images/PivotRanges.png
|
||||
|
||||
As you can see from this diagram, there are more records in the table, but the model is simpler.
|
||||
|
||||
The bytes_used and object_count stored in the database may look confusing, but they are there so we can keep track
of these statistics in the root container without having to visit each node. The container-sharder will update these stats as it visits containers.
This keeps the sharded containers' stats roughly correct and eventually consistent.
|
||||
|
||||
All user and system metadata lives only in the root container. The sharded containers only hold some metadata which helps the sharder in its work and in auditing the container:
|
||||
|
||||
* X-Container-Sysmeta-Shard-Account - This is the original account.
|
||||
* X-Container-Sysmeta-Shard-Container - This is the original container.
|
||||
* X-Container-Sysmeta-Shard-Lower - The lower point of the range for this container.
|
||||
* X-Container-Sysmeta-Shard-Upper - The upper point of the range for this container.
|
||||
|
||||
Pivot point
|
||||
--------------
|
||||
The pivot point is the middle object in the container. As Swift is eventually consistent, all the container replicas
could be in flux, so they may not have the same pivot point to split on. Because of this, something needs to make the decision. In the initial version of the POC, this will be one of the jobs of the container-sharder,
and doing so is rather simple: it will query each primary copy of the container, asking what it thinks the
pivot point is. The sharder will choose the answer from the container with the most objects (how it does this is explained in more detail in the container-sharder section).
|
||||
|
||||
There is a new method in container/backend.py called ``get_possible_pivot_point`` which does exactly what
you'd expect: it finds the pivot point of the container by querying the database with::
|
||||
|
||||
SELECT name
|
||||
FROM object
|
||||
WHERE deleted=0 LIMIT 1 OFFSET (
|
||||
SELECT reported_object_count / 2
|
||||
FROM container_info);
|
||||
|
||||
This pivot point is placed in container_info, so is now easily accessible.
|
||||
|
||||
PivotRange Class
|
||||
-----------------
|
||||
Now that we are storing a list of ranges and, as you probably remember from the initial picture, we only store the lower and upper bounds of each range, we have a class that makes dealing with ranges simple.
|
||||
|
||||
The class is pretty basic: it stores the timestamp, lower and upper values. ``__contains__``, ``__lt__``, ``__gt__`` and ``__eq__`` have been overridden to do checks against a string or another PivotRange.
|
||||
|
||||
The class also contains some extra helper methods:
|
||||
|
||||
* newer(other) - is it newer than another range.
|
||||
* overlaps(other) - does this range overlap another range.
|
||||
|
||||
The PivotRange class lives in swift.common.utils, and there are some other helper methods there that are used:
|
||||
|
||||
* find_pivot_range(item, ranges) - Finds the range, from a list of ranges, to which an item belongs.
* pivot_to_pivot_container(account, container, lower=None, upper=None, pivot_range=None) - Given a root container's account and container, and either lower and upper or just a pivot_range, generate the required sharded container name.
|
||||
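As a rough sketch of the class described above (not the actual POC code;
boundary handling is simplified, with the lower bound exclusive, the upper
inclusive and '' meaning unbounded)::

    class PivotRange(object):
        def __init__(self, lower='', upper='', timestamp=0):
            self.lower, self.upper, self.timestamp = lower, upper, timestamp

        def __contains__(self, name):
            return ((not self.lower or name > self.lower) and
                    (not self.upper or name <= self.upper))

        def __eq__(self, other):
            return (isinstance(other, PivotRange) and
                    (self.lower, self.upper) == (other.lower, other.upper))

        def __lt__(self, other):
            # entirely below another range, or below a bare object name
            upper = other.lower if isinstance(other, PivotRange) else other
            return bool(self.upper) and self.upper <= upper

        def __gt__(self, other):
            # entirely above another range, or above a bare object name
            lower = other.upper if isinstance(other, PivotRange) else other
            return bool(self.lower) and self.lower >= lower

        def newer(self, other):
            return self.timestamp > other.timestamp

        def overlaps(self, other):
            return not (self < other or self > other)


    def find_pivot_range(item, ranges):
        # return the range an object name falls into, if any
        for r in ranges:
            if item in r:
                return r
        return None
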
|
||||
Getting PivotRanges
|
||||
--------------------
|
||||
|
||||
There are two ways of getting a list of PivotRanges, and which to use depends on where you are in Swift. The easiest and most obvious way is to use the new ContainerBroker method `build_pivot_ranges()`.
|
||||
|
||||
The second is to ask the container for a list of pivot nodes rather than objects. This is done with a simple
|
||||
GET to the container server, but with the nodes=pivot parameter sent::
|
||||
|
||||
GET /acc/cont?nodes=pivot&format=json
|
||||
|
||||
You can then build a list of PivotRange objects. An example of how this is done can be seen in the
`_get_pivot_ranges` method in the container-sharder daemon.
|
||||
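A sketch of building that list from the ``nodes=pivot`` listing; the JSON field
names ('lower', 'upper', 'created_at') are assumptions based on the
pivot_ranges schema shown later, and PivotRange is the sketch above::

    import requests

    def get_pivot_ranges(container_url):
        resp = requests.get(container_url,
                            params={'nodes': 'pivot', 'format': 'json'})
        resp.raise_for_status()
        return [PivotRange(row.get('lower', ''), row.get('upper', ''),
                           row.get('created_at', 0))
                for row in resp.json()]
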
|
||||
Effects to the object path
|
||||
-------------------------------
|
||||
|
||||
Proxy
|
||||
^^^^^^^^^
|
||||
As far as the proxy is concerned nothing has changed. An object will always be hashed with the root container,
so no movement of object data is required.
|
||||
|
||||
Object-Server and Object-Updater
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
The object-server and object-updater (async pendings) need some more smarts because they need to update the
correct shard. In the current POC implementation, these daemons don't actually need to be shard aware;
they just need to know what to do if a container server responds with an HTTPMovedPermanently (301),
as the following picture demonstrates:
|
||||
|
||||
.. image:: images/seq_obj_put_delete.png
|
||||
|
||||
This is accomplished by getting the container server to set the required X-Container-{Host, Device, Partition}
headers in the response, which the object-{server, updater} needs in order to redirect its update.
Only one new host is added to the headers; the container server determines which one by picking the
primary node of the new partition that sits at the same index as itself.
This helps stop what I call a request storm.
|
||||
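A rough sketch of that redirect handling; ``do_update`` stands in for however
the object-server/updater sends a container update to one node and returns the
status and response headers::

    def send_container_update(do_update, node, update_headers):
        status, resp_headers = do_update(node, update_headers)
        if status == 301:
            # the container server tells us exactly which shard node to hit,
            # using the same X-Container-* hints as a normal update
            redirect = {'host': resp_headers['X-Container-Host'],
                        'device': resp_headers['X-Container-Device'],
                        'partition': resp_headers['X-Container-Partition']}
            status, resp_headers = do_update(redirect, update_headers)
        return status
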
|
||||
Effects to the container path
|
||||
---------------------------------
|
||||
|
||||
PUT/POST
|
||||
^^^^^^^^^
|
||||
These remain unaffected. All container metadata will be stored with the root container.
|
||||
|
||||
GET/HEAD
|
||||
^^^^^^^^^
|
||||
Both GET and HEAD get much more complicated.
HEADs need to return the bytes_used and object_count stats for the container. The root container doesn't have any objects, so we need to either:

* Visit every shard node and build the stats... this is very very expensive; or
* Have a mechanism for updating the stats on a regular basis, accepting that they can lag a little.
|
||||
|
||||
The latter of these was chosen and the POC stores the stats of each shard in the root pivot_ranges table which gets updated during each sharding pass (see later).
|
||||
|
||||
On GETs, additional requests will be required to hit leaf nodes to gather and build the object listings.
|
||||
We could make these extra requests from the proxy or the container server, both have their pros and cons:
|
||||
|
||||
In the proxy:
|
||||
|
||||
* Pro: this has the advantage of making new requests from the proxy, being able to probe and build a response.
|
||||
* Con: The pivot tree of the root container needs to be passed back to the proxy. Because this tree could grow, even if only passing back the leaves, passing a tree back would have to be in a body (as the last POC using distributed prefix trees implemented) or in multiple headers.
|
||||
* Con: The proxy needs to merge the responses from the container servers hit, meaning it needs to understand XML and JSON. Although this could be resolved by calling format=json and moving the container server's create_listing method into something more global (like utils).
|
||||
|
||||
In the container-server:
|
||||
|
||||
* Pro: No need to pass the root containers pivot nodes (or just leaves) as all the action will happen with access to the root containers broker.
|
||||
* Pro: Happening in the container server means we can call format=json on shard containers and then use the root container's create_listing method to deal with any format cleanly.
|
||||
* Con: Because it's happening on the container-servers, care needs to be taken with regard to requests. We don't want to send out an additional request per replica when visiting each leaf container; otherwise we could generate a kind of request storm.
|
||||
|
||||
The POC is currently using the container-server approach, keeping the proxy free of shard awareness (which is pretty cool).
|
||||
|
||||
DELETE
|
||||
^^^^^^^
|
||||
Delete has the same options as GET/HEAD above, either it runs in the proxy or the container-server. But the general idea will be:
|
||||
|
||||
* Receive a DELETE
|
||||
* Before applying to the root container, go apply to shards.
|
||||
* If all succeed then delete root container.
|
||||
|
||||
Container delete is yet to be implemented in the POC.
|
||||
|
||||
The proxy base controller has a bunch of awesome code that deals with quorums and best responses, and if
we put the DELETE code in the container server we'd need to replicate it or do some major refactoring. This isn't great, but it might be the best option.
|
||||
|
||||
On the other hand, having shard DELETE code in the proxy suddenly makes the proxy shard aware,
which makes it less cool, but definitely makes the delete code _much_ simpler.
|
||||
|
||||
**So the question is:** `Where should the shard delete code live?`
|
||||
|
||||
Replicater changes
|
||||
--------------------
|
||||
The container-replicator (and db_replicator as required) has been updated to replicate and sync the pivot_range table.
|
||||
|
||||
Swift is eventually consistent, meaning at some point we will have an unsharded version of a container replicate with a sharded one and, being eventually consistent, some of the objects in the unsharded one might actually exist and need to be merged into a lower-down shard.
The current thinking is that a sharded container holds all its objects in the leaves, leaving the root and branch containers' object tables empty; these non-leaves will also not be queried when listing objects. So the plan is:

#. Sync the objects from the unsharded container into the objects table of the root/branch container.
#. Let the container-sharder replicate the objects down to the correct shard (noting that dealing with misplaced objects in a shard is a part of the sharder's job).
|
||||
|
||||
pending and merge_items
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
This version of the POC will take advantage of the last POC's changes to replication. They will at least suffice while it's a POC.
The merge_items method in container/backend.py has been modified to be pivot_points aware. That is to say, the list of items
passed to it can now contain a mix of objects and pivot_nodes. A new flag will be added to the pending/pickle file format
called record_type, which defaults to RECORD_TYPE_OBJECT in existing pickle/pending files when unpickled. Merge_items
will sort the items into 2 different lists based on the record_type, then insert, update or delete in the required tables accordingly, as roughly sketched below.
|
||||
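A minimal sketch of that sorting step (the record_type constant values here are
assumptions, not the POC's actual definitions)::

    RECORD_TYPE_OBJECT = 'object'
    RECORD_TYPE_PIVOT_NODE = 'pivot_node'


    def split_items_by_record_type(item_list):
        objects, pivot_nodes = [], []
        for item in item_list:
            # old pending/pickle entries carry no record_type: treat as objects
            record_type = item.get('record_type', RECORD_TYPE_OBJECT)
            if record_type == RECORD_TYPE_PIVOT_NODE:
                pivot_nodes.append(item)
            else:
                objects.append(item)
        return objects, pivot_nodes
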
|
||||
TODO - Explain this in more detail and maybe a diagram or two.
|
||||
|
||||
Container replication changes
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
Because Swift is an eventually consistent system, we need to make sure that when container databases are replicated we replicate
not only the items in the objects table, but also the nodes in the pivot_points table. Most of the database replication code
is a part of the db_replicator, which is a parent class and so shared by account and container replication.
The current solution in the POC is to add an _other_items_hook(broker) hook that is overridden in the container replicator
to grab the items from the pivot_range table and return them in the items format to be passed into merge_items.

There is a caveat, however, which is that the hook currently grabs all the rows from the pivot_points table.
There is no notion of a pointer/sync point. The number of pivot_points should remain fairly small, at least in relation to objects.
|
||||
|
||||
.. note:: We are using an other_items hook, but this can change once we get around to sharding accounts. In which case we can simply update the db_replicator to include replicating the list of ranges properly.
|
||||
|
||||
Container-sharder
|
||||
-------------------
|
||||
The container-sharder will run on all container-server nodes. At an interval it will visit all sharded containers;
on each it:

* Audits the container.
* Deals with any misplaced items, that is, items that should belong in a different range container.
* We then check the size of the container, and when we do, _one_ of the following happens:
* If the container is big enough and doesn't already have a pivot point defined, determine a pivot point.
* If the container is big enough and has a pivot point defined, then split (pivot) on it.
* If the container is small enough (see the later section), then it'll shrink.
* If the container isn't too big or too small, just leave it.
* Finally, the container's `object_count` and `bytes_used` are sent to the root container's `pivot_ranges` table.
|
||||
|
||||
As the above alludes to, sharding is a 2-phase process: on the first shard pass the container will get a
pivot point, and the next time round it will be sharded (split). Shrinking is even more complicated; this too
is a 2-phase process, but I didn't want to complicate this initial introduction here. See the shrinking
section below for more details.
|
||||
|
||||
Audit
|
||||
^^^^^^
|
||||
The sharder will perform a basic audit, which simply makes sure the current shard's range exists in the root's `pivot_ranges` table and, if it's the root container, checks to see if there are any overlapping or missing ranges.
|
||||
|
||||
The following truth table was from the old POC spec. We need to update this.
|
||||
|
||||
+----------+------------+------------------------+
|
||||
| root ref | parent ref | Outcome |
|
||||
+==========+============+========================+
|
||||
| no | no | Quarantine container |
|
||||
+----------+------------+------------------------+
|
||||
| yes | no | Fix parent's reference |
|
||||
+----------+------------+------------------------+
|
||||
| no | yes | Fix root's reference |
|
||||
+----------+------------+------------------------+
|
||||
| yes | yes | Container is fine |
|
||||
+----------+------------+------------------------+
|
||||
|
||||
Misplaced objects
|
||||
^^^^^^^^^^^^^^^^^^^
|
||||
A misplaced object is an object that is in the wrong shard. If it's a branch shard (a shard that has split), then anything in the object table is
|
||||
misplaced and needs to be dealt with. On a leaf node, a quick SQL statement is enough to find out all the objects
|
||||
that are on the wrong side of the pivot. Now that we are using ranges, it's easy to determine what should and shouldn't be in the range.
|
||||
|
||||
The sharder uses the container-reconciler/replicator's approach of creating a container database locally in a handoff
|
||||
partition, loading it up, and then using replication to push it over to where it needs to go.
|
||||
|
||||
Splitting (Growing)
|
||||
^^^^^^^^^^^^^^^^^^^
|
||||
For the sake of simplicity, the POC uses the sharder both to determine the pivot and to split. It does this in a 2-phase process which I have already alluded to above.
On each sharding pass, all sharded containers local to this container server are checked. On each check the container is audited and any misplaced items are dealt with.
Once that's complete, only *one* of the following actions happens, and then the sharder moves on to the next container or finishes its pass:
|
||||
|
||||
* **Phase 1** - Determine a pivot point: If there are enough objects in the container to warrant a split and a pivot point hasn't already been determined, then we need to find one. The sharder does this by:
* Firstly finding what the local container thinks the best pivot point is and its object count (it can get these from broker.get_info).
* Then querying the other primary nodes to get their best pivot point and object count.
* Comparing the results: the container with the most objects wins; in the case of a tie, the one that reports the winning object count first wins.
* Setting X-Container-Sysmeta-Shard-Pivot locally and on all nodes to the winning pivot point.
|
||||
|
||||
* **Phase 2** - Split the container on the pivot point: If X-Container-Sysmeta-Shard-Pivot exists then we:
* Before committing to splitting, ask the other primary nodes and make sure there is a quorum (replica / 2 + 1) that agrees on the same pivot point (see the sketch after this list).
* If we reach quorum, then it's safe to split. In which case we:
* create new containers locally
* fill them up while deleting the objects locally
* replicate all the containers out and update the root container with the changes (delete the old range, and add the 2 new ones).
|
||||
|
||||
.. note::
|
||||
|
||||
When deleting objects from the container being split, the timestamp used is the same as the existing object's but using Timestamp.internal with the offset incremented, allowing newer versions of the object/metadata not to be squashed.
Noting this in case it collides with the fast-POST work acoles has been working on; I'll ask him at summit.
|
||||
|
||||
* Do nothing.
|
||||
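The quorum check referenced in phase 2 above might look something like this
sketch (``pivot_points`` being the values reported by the primary nodes)::

    from collections import Counter


    def quorum_pivot(pivot_points, replica_count):
        # only split when a majority of the primaries report the same pivot
        quorum = replica_count // 2 + 1
        if not pivot_points:
            return None
        point, votes = Counter(pivot_points).most_common(1)[0]
        return point if votes >= quorum else None
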
|
||||
Shrinking
|
||||
^^^^^^^^^^^^^
|
||||
It turns out shrinking (merging containers back together when they get too small) is even more complicated than sharding (growing).

When sharding, we at least have all the objects that need to be sharded on the container server we are on.
When shrinking, we need to find a range neighbour that most likely lives somewhere else.
|
||||
|
||||
So what does the current POC do? At the moment it's another 2-phase procedure, although while writing this spec update I think it might have to become 3 phases, as we probably need an initial state that does nothing but
let Swift know something will happen.

So how does shrinking work? Glad you asked. Firstly, shrinking happens during the sharding pass loop.
If a container has too few items then the sharder will look into the possibility of shrinking the container,
which starts at phase 1:
|
||||
|
||||
* **Phase 1**:
|
||||
* Find out if the container really has few enough objects, that is a quorum of counts below the threshold (see below).
|
||||
* Check the neighbours to see if it's possible to shrink/merge together, again this requires getting a quorum.
|
||||
* Merge, if possible, with the smallest neighbour.
|
||||
* create a new range container, locally.
|
||||
* Set some special metadata on both the smallest neighbour and on the current container.
|
||||
* `X-Container-Sysmeta-Shard-Full: <neighbour>`
|
||||
* `X-Container-Sysmeta-Shard-Empty: <this container>`
|
||||
* merge objects into the metadata full container (neighbour), update local containers.
|
||||
* replicate, and update root container's ranges table.
|
||||
* let misplaced objects and replication do the rest.
|
||||
|
||||
* **Phase 2** - On the storage node that the other neighbour (the full container) is on, when the sharder hits it:
* Get a quorum that the metadata is still what it says (though it might be too late if it isn't).
* Create a new container locally in a handoff partition.
* Load it with all the data (because we want to name the container properly) while deleting locally.
|
||||
* Send range updates to the root container.
|
||||
* Delete both old containers and replicate all three containers.
|
||||
|
||||
The potential extra phase that I see as important here would be to only set the metadata in phase 1, to let the rest of Swift know something will be happening. The set metadata is what Swift checks for in the areas that need to be shrink aware.
|
||||
|
||||
.. note::
   In phase 2, maybe an actual filesystem copy would be faster and better than creating and syncing. Also, again we have the space vs vacuum issue.
|
||||
|
||||
Small enough
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
OK, so that's all good and fine, but what counts as small enough, both for the container and for a small enough neighbour?
|
||||
|
||||
Shrinking has added 2 new configuration parameters to the container-sharder config section:
|
||||
|
||||
* `shard_shrink_point` - percentage of `shard_container_size` (default 1 million) below which a container is deemed small enough to try and shrink. Default 50 (note no % sign).
|
||||
* `shard_shrink_merge_point` - percentage of `shard_container_size` that a container will need to be below after 2 containers have merged. Default 75.
|
||||
|
||||
These are just numbers I've picked out of the air, but they are tunable. The idea is, taking the defaults:
when a container gets below 50% of shard_container_size, the sharder will look to see if there is any neighbour
whose object count, added to the container's own, is below 75% of shard_container_size, and if so merge with it. If it can't
find a neighbour that keeps the total below 75%, then we can't shrink and the container will have to stay as it is.
|
||||
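A sketch of that check, using the two tunables and `shard_container_size`
described above::

    def should_merge(own_count, neighbour_count,
                     shard_container_size=1000000,
                     shard_shrink_point=50, shard_shrink_merge_point=75):
        # small enough to consider shrinking at all?
        if own_count >= shard_container_size * shard_shrink_point / 100.0:
            return False
        # would the merged container still be comfortably under the limit?
        merged = own_count + neighbour_count
        return merged < shard_container_size * shard_shrink_merge_point / 100.0
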
|
||||
shard aware
|
||||
~~~~~~~~~~~~
|
||||
The new problem is things now need to be shrink aware. Otherwise we can get ourselves in a spot of danger:
|
||||
|
||||
* **Container GET** - Needs to know that if it hits a shrink `empty` container it should look in the shrink `full` container for the empty container's object metadata.
* Sharding or shrinking should not touch a container that is in a shrinking state, that is, if it's either the empty or the full container.
* **Sharder's misplaced objects** - A shrink full container will obviously look like it has a bunch of objects that don't belong in its range, so the misplaced-objects check needs to know about this state, otherwise we'll have some problems.
* **Container server 301 redirects** - We want to make sure that when finding the required container to update in the 301 response, if it happens to be an empty container we redirect to the full one (or do we? maybe this doesn't matter).
* **Container shard delete** - an empty container now has 0 objects and could be deleted. When we delete a container all the metadata is lost, including the empty and full metadata; this could cause some interesting problems. (This hasn't been implemented yet; see the problem with deletes.)
|
||||
|
||||
Space cost of Sharding or to Vacuum
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
When we split, currently we:
|
||||
|
||||
* create the 2 new containers locally in handoff partitions.
|
||||
* Split objects into the new containers keeping their timestamps. At the same time delete the objects in the container being split by setting deleted = 1 and setting the timestamp to the object timestamp + offset.
|
||||
* replicate all three containers.
|
||||
|
||||
.. note::
|
||||
|
||||
Maybe a good compromise here, instead of splitting and filling up the 2 new containers completely and
then replicating, would be to create the new containers, fill them up a bit (shard_max_size), then replicate, rinse, repeat.
|
||||
|
||||
We set the deleted = 1 timestamp to the existing object's timestamp + offset because there could be a container out there that is out of sync with an updated object record we want to keep.
In that case it'll override the deleted one in the splitting container, and then get moved to the new shard container via the sharder's misplaced items method.
|
||||
|
||||
The problem we have here is that it means sharding a container, especially a really large container, takes up _a lot_ of room.
To shard a container, we need double the size on disk that the splitting container takes, due to the insertion
of objects into the new containers _and_ them still existing in the original with the deleted flag set.

We either live with this limitation, or try to keep the size on disk to a minimum when sharding.
|
||||
Another option is to:
|
||||
|
||||
* create the 2 new containers locally in handoff patitions.
|
||||
* Split objects into the new containers keeping their timestamps. At the same time deleting the original objects (DELETE FROM).
|
||||
* Replicate all three containers.
|
||||
|
||||
Here in the second option, we would probably need to experiment with how often we need to vacuum,
otherwise there is a chance that the database on disk, even though we are using `DELETE FROM`, may still remain the same size.
Further, if this old container syncs with a replica that is out of date, _all_ objects
in the out-of-date container would be merged back into the old (split) container and would all need to be rectified in merge_items.
This too could be very costly.
|
||||
|
||||
Sharded container stats
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
As you would expect, if we simply did a HEAD of the root container, the `bytes_used` and `object_count` stats
would come back as 0. This is because, once sharded, the root container doesn't have any objects in its
objects table; they've been sharded away.
|
||||
|
||||
Previously, the only option was the very slow and expensive approach of propagating the HEAD to every container shard and then collating the results. This is *very* expensive.
|
||||
|
||||
We discussed this in Tokyo, and the solution was to update the counts every now and again. Because we are
dealing with container shards that are also replicated, there are a lot of counts out there to update, and this gets complicated when they all need to update a single count in the root container.
|
||||
|
||||
Now the pivot_ranges table also stores the "current" count and bytes_used for each range. As each range represents a sharded container, we now have a place to update each one individually::
|
||||
|
||||
CREATE TABLE pivot_ranges (
|
||||
ROWID INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
lower TEXT,
|
||||
upper TEXT,
|
||||
object_count INTEGER DEFAULT 0,
|
||||
bytes_used INTEGER DEFAULT 0,
|
||||
created_at TEXT,
|
||||
deleted INTEGER DEFAULT 0
|
||||
);
|
||||
|
||||
When we HEAD the root container, all we need to do is sum up the columns.
|
||||
This is what the ContainerBroker's `get_pivot_usage` method does with a simple SQL statement::
|
||||
|
||||
SELECT sum(object_count), sum(bytes_used)
|
||||
FROM pivot_ranges
|
||||
WHERE deleted=0;
|
||||
|
||||
Some work has been done to be able to update these `pivot_ranges` so the stats can be updated.
|
||||
You can now update them through a simple PUT or DELETE via the container-server API.
|
||||
The pivot range API allows you to send a PUT/DELETE request with some headers to update the pivot range; these headers are:
|
||||
|
||||
* x-backend-record-type - which must be RECORD_TYPE_PIVOT_NODE, otherwise it'll be treated as an object.
|
||||
* x-backend-pivot-objects - The object count, which can be prefixed with a - or + (more on this next).
|
||||
* x-backend-pivot-bytes - The bytes used of the range, again can be prefixed with - or +.
|
||||
* x-backend-pivot-upper - The upper range, lower range is the name of the object in the request.
|
||||
|
||||
.. note::
|
||||
|
||||
We use x-backend-* headers because these should only be used by swift's backend.
|
||||
|
||||
The objects and bytes values can optionally be prefixed with '-' or '+'; when they are, they affect the count accordingly.
|
||||
For example, if we want to define a new value for the number of objects then we can::
|
||||
|
||||
x-backend-pivot-objects: 100
|
||||
|
||||
This will set the number for the `object_count` stat for the range to 100. The sharder sets the new count and bytes like this during each pass to reflect the current state of the world, since it knows best at the time.
|
||||
The API however allows a request of::
|
||||
|
||||
x-backend-pivot-objects: +1
|
||||
|
||||
This would increment the current value. In this case it would make the new value 101. A '-' will decrement.
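Purely as a hypothetical illustration (the header names come from this spec; the host,
port, device, partition and range names are made up, and details such as ``X-Timestamp``
handling are omitted), such a backend stat update could look like::

    from http.client import HTTPConnection

    conn = HTTPConnection('container-server.example', 6201)
    conn.request('PUT', '/sda1/1234/AUTH_test/sharded_container/lower_name',
                 headers={
                     # identify the record as a pivot range update,
                     # i.e. RECORD_TYPE_PIVOT_NODE (exact wire value assumed)
                     'x-backend-record-type': 'pivot',
                     'x-backend-pivot-objects': '+1',    # bump object_count
                     'x-backend-pivot-bytes': '+2048',   # bump bytes_used
                     'x-backend-pivot-upper': 'upper_name',
                 })
    resp = conn.getresponse()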
|
||||
|
||||
The idea behind this is that if an op wants to trade more requests in the cluster for more up-to-date stats, we could get the object-updaters and object-servers to send a + or - once an object is added or deleted. The sharder would correct the count if it gets slightly out of sync.
|
||||
|
||||
The merge_items method in the ContainerBroker will merge prefixed requests together (+2 etc) if required.
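For example, a hypothetical helper (not the actual merge_items code) that merges such
prefixed stat updates could behave like::

    def merge_stat(current, update):
        """current is an int; update is a string such as '100', '+2' or '-1'."""
        if update.startswith(('+', '-')):
            return current + int(update)   # relative update
        return int(update)                 # absolute value replaces the stat

    assert merge_stat(100, '+2') == 102
    assert merge_stat(100, '-1') == 99
    assert merge_stat(100, '50') == 50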
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
mattoliverau
|
||||
|
||||
Other assignees:
|
||||
blmartin
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
Work items or tasks -- break the feature up into the things that need to be
|
||||
done to implement it. Those parts might end up being done by different people,
|
||||
but we're mostly trying to understand the timeline for implementation.
|
||||
|
||||
Repositories
|
||||
------------
|
||||
|
||||
No new repositories required.
|
||||
|
||||
Services
|
||||
---------
|
||||
A container-sharder daemon has been created to shard containers in the background.
|
||||
|
||||
Documentation
|
||||
-------------
|
||||
|
||||
Will this require a documentation change? YES
|
||||
|
||||
If so, which documents? Deployment guide, API references, sample config files
|
||||
(TBA)
|
||||
Will it impact developer workflow? The limitations of sharded containers,
|
||||
specifically object ordering, will affect DLO and existing Swift app developer
|
||||
tools if pointing to a sharded container.
|
||||
|
||||
Security
|
||||
--------
|
||||
|
||||
Does this introduce any additional security risks, or are there
|
||||
security-related considerations which should be discussed?
|
||||
|
||||
TBA (I'm sure there are, like potential sharded container name collisions).
|
||||
|
||||
Testing
|
||||
-------
|
||||
|
||||
What tests will be available or need to be constructed in order to
|
||||
validate this? Unit/functional tests, development
|
||||
environments/servers, etc.
|
||||
|
||||
TBA (all of the above)
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
TBA
|
@ -1,140 +0,0 @@
|
||||
::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
..
|
||||
|
||||
=================
|
||||
Container aliases
|
||||
=================
|
||||
|
||||
A container alias makes it possible to link to other containers, even to
|
||||
containers in different accounts.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Currently it is more complicated to access containers in other accounts than
containers defined in the account returned as your storage URL, because you
need to use a different storage URL than the one returned by your auth endpoint -
which is known not to be supported by all clients. Even if the storage URL of a
shared container to which you have access is known and supported by the client of
choice, shared containers are not listed when doing a GET request on the
user's account, so they are not discoverable by regular client applications
or users.
|
||||
|
||||
Alias containers could simplify this task. A Swift account owner/admin with
permission to create containers could create an alias to a container which
users of the account already have access to (most likely via ACLs), and
requests rooted at or under this alias could be redirected or proxied to a
second container on a different account.
|
||||
|
||||
This would make it simpler to access these containers with existing clients
|
||||
for different reasons.
|
||||
|
||||
#. A GET request on the account level would list these containers
|
||||
#. Requests to an alias container are forwarded to the target container,
|
||||
making it possible to access that container without using a different
|
||||
storage URL in the client.
|
||||
|
||||
However, setting the alias still requires the storage URL (see
|
||||
`Automatic container alias provisioning`_ for alternative future work).
|
||||
|
||||
Caveats
|
||||
=======
|
||||
|
||||
Setting an alias should be impossible if there are objects in the source
|
||||
container because these would become inaccessible, but still require storage
|
||||
space. There is a possible race condition if a container is created and
|
||||
objects are stored within while at the same time (plus a few milliseconds?) an
|
||||
alias is set.
|
||||
|
||||
A reconciler mechanism (similar to the one used in storage policies) might
|
||||
solve this, as well as ensuring that the alias can only be set during
container creation. Un-setting an alias would be denied; instead, the alias
container is to be deleted.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
New metadata to set and review, as well as sys-metadata to store the target
container of a container alias.
|
||||
|
||||
Most of the required changes can be put into a separate middleware. There is an
|
||||
existing patch: https://review.openstack.org/#/c/62494
|
||||
|
||||
.. note::
|
||||
|
||||
The main problem identified with that patch was that a split brain could
|
||||
allow a container created on handoffs WITHOUT an alias to shadow a
|
||||
pre-existing alias container, and during upload could cause the user
|
||||
perception of to which location data was written to be confused and
|
||||
potentially un-resolved.
|
||||
|
||||
It's been proposed that a reconciliation process to move objects in an alias
container to the target container could allow an eventually consistent repair
of the split-brained container.
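To make the idea more concrete, here is a minimal, hypothetical WSGI sketch (the
sysmeta key name and the lookup are assumptions, and this is deliberately not the
code from the patch referenced above); it simply re-roots the request path at the
target account/container when an alias is set::

    class ContainerAliasMiddleware(object):
        def __init__(self, app):
            self.app = app

        def _alias_target(self, env):
            # placeholder: a real implementation would fetch the alias target
            # from container sysmeta via a backend HEAD (and cache it), not
            # read it from the request environment
            return env.get('swift.container_alias_target')

        def __call__(self, env, start_response):
            target = self._alias_target(env)    # e.g. 'AUTH_test/container'
            if target:
                # only container or object paths would ever carry an alias:
                # '/v1/AUTH_acct/alias[/obj...]' -> '/v1/<target>[/obj...]'
                parts = env['PATH_INFO'].split('/', 4)
                target_account, target_container = target.split('/', 1)
                parts[2:4] = [target_account, target_container]
                env['PATH_INFO'] = '/'.join(parts)
            return self.app(env, start_response)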
|
||||
|
||||
Security
|
||||
========
|
||||
|
||||
Test and verify what happens if requests are not yet authenticated; make sure
|
||||
ACLs are respected and unauthorized requests to containers in other accounts are
impossible.
|
||||
|
||||
The change should include functional tests which validate cross-account and
|
||||
non-swift-owner cross-container aliases correctly respect target ACLs - even
|
||||
if in some cases they appear to duplicate the storage-url based
|
||||
cross-account/cross-container ACL tests.
|
||||
|
||||
If a background process is permitted to move objects stored in a container
which is later determined to have been an alias, there are likely to be
authorization implications if the ACLs on the target have changed.
|
||||
|
||||
Documentation
|
||||
--------------
|
||||
|
||||
Update the documentation and document the behavior.
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
Further discussion of design.
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
cschwede <christian.schwede@enovance.com>
|
||||
|
||||
Future Work
|
||||
===========
|
||||
|
||||
Automatic container alias provisioning
|
||||
--------------------------------------
|
||||
|
||||
Cross-account container sharing might be even more simplified, leading to a
|
||||
better user experience.
|
||||
|
||||
Let's assume there are two users in different accounts:
|
||||
|
||||
``test:tester`` and ``test2:tester2``
|
||||
|
||||
If ``test:tester`` puts an ACL onto an container ``/AUTH_test/container`` to
|
||||
allow access for ``test2:tester2``, the middleware could create an alias
|
||||
container ``/AUTH_test2/test_container`` linking to ``/AUTH_test/container``.
|
||||
This would make it possible to discover shared containers to other
|
||||
users/accounts. However, there are two challenges:
|
||||
|
||||
1. Name conflicts: there might be an existing container
|
||||
``/AUTH_test2/test_container``
|
||||
2. A lookup would require to resolve an account name into the account ID
|
||||
|
||||
Cross realm container aliases
|
||||
-----------------------------
|
||||
|
||||
Might be possible to tie into container sync realms (or something similar) to
|
||||
allow operators the ability to let users proxy requests to other realms.
|
||||
|
@ -1,79 +0,0 @@
|
||||
::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
Scaling Expiring Objects
|
||||
========================
|
||||
|
||||
Problem description
|
||||
-------------------
|
||||
The object expirer daemon does not process objects fast enough
when there are a large number of objects that need to expire.
This leads to situations like:
|
||||
|
||||
- Objects that give back 404s upon request, but are still showing in the
  container listing.
|
||||
- Objects not being deleted in a timely manner.
|
||||
- Object expirer passes never completing.
|
||||
|
||||
Problem Example
|
||||
---------------
|
||||
Imagine a client is PUTting 1000 objects a second, spread out over 10 containers, into
the cluster. First, on the PUT we are using double the container resources of the
cluster, because of the extra PUT to the .expiring_objects account. Then, when we
start deleting the objects, we double the strain on the container layer again. Each of the
customer's containers now has to handle 100 PUTs a second from the client plus 100 DELETEs a second
from the expirer daemon. If the container layer can't keep up, the daemon begins to get behind.
If there are no changes to this system the daemon will never catch up; in addition,
other customers will begin to be starved for resources as well.
|
||||
|
||||
Proposed change(s)
|
||||
------------------
|
||||
There will need to be two changes needed to fix the problem described.
|
||||
|
||||
1.) Allow for the container databases to know whether an object is expired.
|
||||
This will allow for the container replicator to keep the object counts correct.
|
||||
|
||||
2.) Allow the auditor to delete objects that have expired during its pass.
|
||||
This will allow for the removal of the object expirer daemon.
|
||||
|
||||
Implementation Plan
|
||||
-------------------
|
||||
There are multiple parts to the implementation. The updating of the container
|
||||
database to remove the expired objects and the removal of the object from disk.
|
||||
|
||||
Step 1:
|
||||
A expired table will be added to the container database. There will be a
|
||||
'obj_row_id' and 'expired_at' column on the table. The 'obj_row_id' column will
|
||||
correlate to the row_id for an object in the objects table. The 'expired_at'
|
||||
column will be an integer timestamp of when the object expires.
|
||||
|
||||
The container replicator will remove the object rows from the objects table when
their corresponding 'expired_at' time in the expired table is before the start
time of the pass. There will be a trigger to delete row(s) in the 'expired'
table after the deletion of row(s) from the 'objects' table. Once the
removal of the expired objects is complete, the container database will
be replicated.
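A rough sqlite3 sketch of that schema change (the table and column names come from
this spec; the trigger and cleanup query are only illustrative, not the actual
container backend migration)::

    import sqlite3
    import time

    conn = sqlite3.connect(':memory:')
    conn.executescript('''
        CREATE TABLE objects (ROWID INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE expired (obj_row_id INTEGER, expired_at INTEGER);
        -- when an object row is deleted, drop its expired entry as well
        CREATE TRIGGER expired_cleanup AFTER DELETE ON objects
        BEGIN
            DELETE FROM expired WHERE obj_row_id = OLD.ROWID;
        END;
    ''')

    # what a replicator pass would do for rows whose expiry has passed
    pass_start = int(time.time())
    conn.execute('DELETE FROM objects WHERE ROWID IN '
                 '(SELECT obj_row_id FROM expired WHERE expired_at < ?)',
                 (pass_start,))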
|
||||
|
||||
Step 2:
|
||||
The object auditor as it makes its pass will remove any expired objects.
|
||||
When the object auditor inspects an object's metadata, if the X-Delete-At is
|
||||
before the current time, the auditor will delete the object. Due to slow auditor
|
||||
passes, the cluster will have extra data until the objects get processed.
|
||||
|
||||
Rollout Plan
|
||||
------------
|
||||
When deploying this change, the current expirer daemon can continue to run until
all objects are removed from the '.expiring_objects' account. Once that is done
the daemon can be stopped.
|
||||
|
||||
Also, a script will be created to update the container databases with the
'expired_at' times for all existing objects.
|
||||
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
Primary assignee:
|
||||
(aerwin3) Alan Erwin alan.erwin@rackspace.com
|
@ -1,830 +0,0 @@
|
||||
::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
..
|
||||
This template should be in ReSTructured text. Please do not delete
|
||||
any of the sections in this template. If you have nothing to say
|
||||
for a whole section, just write: "None". For help with syntax, see
|
||||
http://sphinx-doc.org/rest.html To test out your formatting, see
|
||||
http://www.tele3.cz/jbar/rest/rest.html
|
||||
|
||||
=======================================
|
||||
Resolving limitations of fast-POST
|
||||
=======================================
|
||||
|
||||
The purpose of this document is to describe the requirements to enable
|
||||
``object_post_as_copy = false`` in the proxy config as a
|
||||
reasonable deployment configuration in Swift once again, without
|
||||
sacrificing the features enabled by ``object_post_as_copy = true``.
|
||||
|
||||
For brevity we shall use the term 'fast-POST' to refer to the mode of operation
|
||||
enabled by setting ``object_post_as_copy = false``, and 'POST-as-COPY' to refer
|
||||
to the mode when ``object_post_as_copy = true``.
|
||||
|
||||
Currently using fast-POST incurs the following limitations:
|
||||
|
||||
#. Users can not update the content-type of an object with a POST.
|
||||
#. The change in last-modified time of an object due to a POST is not reflected
|
||||
in the container listing.
|
||||
#. Container-Sync does not "sync" objects after their metadata has been changed
|
||||
by a POST. This is a consequence of the container listing not being updated
|
||||
and the Container-Sync process therefore not detecting that the object's
|
||||
state has changed.
|
||||
|
||||
The solution is to implement fast-POST such that a POST to an object will
|
||||
trigger a container update.
|
||||
|
||||
This will require all of the current semantics of container updates from a PUT
|
||||
(or a DELETE) to be extended into POST and similarly cover all failure
|
||||
scenarios. In particular container updates from a POST must be serialize-able
|
||||
(in the log transaction sense, see :ref:`container_server`) so that
|
||||
out-of-order metadata updates via POST and data updates via PUT and DELETE can
|
||||
be replicated and reconciled across the container databases.
|
||||
|
||||
Additionally the new ssync replication engine has less operational testing with
|
||||
fast-POST. Some behaviors are not well understood. Currently it seems ssync
|
||||
with fast-POST has the following limitations:
|
||||
|
||||
#. A t0.data with a t2.meta on the sender can overwrite a t1.data file on the
|
||||
receiver.
|
||||
#. The whole .data file is transferred to sync only a metadata update.
|
||||
|
||||
If possible, or as follow on work (see :ref:`ssync`), ssync should
|
||||
preserve the semantic differences of syncing updates to .meta and .data
|
||||
files.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
The Swift client API describes that Swift allows an object's "metadata" to be
|
||||
"updated" via the POST verb.
|
||||
|
||||
The client API also describes that a container listing includes, for each
|
||||
object, the following specific items of metadata: name, size, hash (Etag),
|
||||
last_modified timestamp and content-type. If any of these metadata values are
|
||||
updated then the client expectation is that the named entry in the container
|
||||
listing for that object should reflect their new values.
|
||||
|
||||
For example if an object is uploaded at t0 with a size s0 and etag m0, and then
|
||||
later at t1 an object with the same name is successfully stored with a size of
|
||||
s1 and etag m1, then the container listing should *eventually* reflect the new
|
||||
values s1 and m1 for the named object last_modified at t1.
|
||||
|
||||
These two API features can both be satisfied by either:
|
||||
|
||||
#. Not allowing POST to change any of the metadata values tracked in the
|
||||
container.
|
||||
#. Ensuring that if a POST changes one of those metadata values then the
|
||||
container is also updated.
|
||||
|
||||
It is reasonable to argue that some of the object metadata items stored in the
|
||||
container should not be allowed to be changed by a POST - the object name, size
|
||||
and hash should be considered immutable for a POST because a POST is restricted
|
||||
from modifying the body of the object - from which both the etag and size are
|
||||
derived.
|
||||
|
||||
However, it can reasonably be argued that content-type should be allowed to
|
||||
change on a POST. It is also reasonable to argue that the last_modified time of
|
||||
an object as reported by the container listing should be equal to the timestamp
|
||||
of the most recent POST or PUT.
|
||||
|
||||
If content-type changes are to be allowed on a POST then the container listing
|
||||
must be updated, in order to satisfy the client API expectations, but the
|
||||
current implementation lacks support for container updates triggered by a POST:
|
||||
|
||||
#. The object-server POST path does not issue container update requests, or
|
||||
store async pendings.
|
||||
|
||||
#. The container-server's PUT /object path has no semantics for a
|
||||
partial update of an object row - meaning there is no way to change the
|
||||
content-type of an object without creating a new record to replace the
|
||||
old one. However, because a POST only describes a transformation of an
|
||||
object, and not a complete update, an object server cannot reliably provide
|
||||
the entire object state required to generate a new container record under a
|
||||
single timestamp.
|
||||
|
||||
For example, an object server handling a POST may not have the most recent
|
||||
object size and/or hash, and therefore should not include those items in a
|
||||
container update under the timestamp of the POST.
|
||||
|
||||
#. The backend container replication process similarly does not support
|
||||
replication of partially updated object records.
|
||||
|
||||
Consequently, updates to object metadata using the fast-POST mode results in an
|
||||
inconsistency between the object state and the container listing: the
|
||||
Last-Modified header returned on a HEAD request for an object will reflect the
time of the last POST, while the value in the container listing will
|
||||
reflect the time of the last PUT.
|
||||
|
||||
Furthermore, the container-sync process is unable to detect when object state
|
||||
has been changed by a POST, since it relies on a new row being created in the
|
||||
container database whenever an object changes.
|
||||
|
||||
Code archeology seems to support that the primary motivations for the
|
||||
POST-as-COPY mode of operation were allowing content-type to be
|
||||
modified without re-uploading the entire object with a PUT from the client,
|
||||
and enabling container-sync to sync object metadata updates.
|
||||
|
||||
Proposed Changes
|
||||
================
|
||||
|
||||
The changes proposed below contribute to achieving the property that all Swift
|
||||
internal services which track the state of an object will eventually reach a
|
||||
consistent view of the object metadata, which has three components:
|
||||
|
||||
#. immutable metadata (i.e. name, size and hash) that can only be set at the
|
||||
time that the object data is set i.e. by a PUT request
|
||||
#. content-type that is set by a PUT and *may* be modified by a POST
|
||||
#. mutable metadata such as custom user metadata, which is set by a PUT or POST
|
||||
|
||||
Since each of these components could be set at different times on different
|
||||
nodes, it follows that an object's state must include three timestamps, all or
|
||||
some of which may be equal:
|
||||
|
||||
#. the 'data-timestamp', describing when the immutable metadata was set, which
|
||||
is less than or equal to:
|
||||
#. the 'content-type-timestamp', which is less than or equal to:
|
||||
#. the 'metadata-timestamp' which describes when the object's mutable metadata
|
||||
was set, and defines the Last-Modified time of the object.
|
||||
|
||||
We assert that to guarantee eventual consistency, Swift internal processes must
|
||||
track the timestamp of each metadata component independently. Some or all of
|
||||
the three timestamps will often be equal, but a Swift process should never
|
||||
assert such equality unless it can be inferred from state generated by a client
|
||||
request.
|
||||
|
||||
Proxy-server
|
||||
------------
|
||||
|
||||
No changes required - the proxy server already includes container update
|
||||
headers with backend object POST requests.
|
||||
|
||||
Object-server
|
||||
-------------
|
||||
|
||||
#. The DiskFile class will be modified to allow content-type to
|
||||
be updated and written to a .meta file. When content-type is updated by a
|
||||
POST, a content-type-timestamp value equal to the POST request timestamp
|
||||
will also be written to the .meta file.
|
||||
#. The DiskFile class will be modified so that existing content-type and
|
||||
content-type-timestamp values will be copied to a new .meta file if no new
|
||||
values are provided.
|
||||
#. The DiskFile interface will be modified to provide methods to access the
|
||||
object's data-timestamp (already stored in the .data file), content-type
|
||||
timestamp (as described above) and metadata-timestamp (already stored in the
|
||||
.meta file).
|
||||
#. The DiskFile class will be modified to support using encoded timestamps as
|
||||
.meta file names (see :ref:`rsync` and :ref:`timestamp_encoding`).
|
||||
#. The object-server POST path will be updated to issue container-update
|
||||
requests with fallback to the async pending queue similar to the PUT path.
|
||||
#. Container update requests triggered by a POST will include all three of
|
||||
the object's timestamp values: the data-timestamp, the content-type
|
||||
timestamp and the metadata-timestamp. These timestamps will either be sent
|
||||
as separate headers or encoded into a single timestamp header
|
||||
(:ref:`timestamp_encoding`) header value.
|
||||
|
||||
.. _container_server:
|
||||
|
||||
Container-server
|
||||
----------------
|
||||
|
||||
#. The container-server 'PUT /<object>' path will be modified to support three
|
||||
timestamp values being included in the update item that are stored in the
|
||||
pending file and eventually passed to the database merge_items method.
|
||||
#. The merge_items method will be modified so that any existing row for an
|
||||
updated object is merged with the object update to produce a new row that
|
||||
encodes the most recent of each of the metadata components and their
|
||||
respective timestamps i.e. the row will encode three tuples::
|
||||
|
||||
(data-timestamp, size, name, hash)
|
||||
(content-type-timestamp, content-type)
|
||||
(metadata-timestamp)
|
||||
|
||||
This requires storing two additional timestamps, which will be achieved either
by encoding all three timestamps in a single string (:ref:`timestamp_encoding`)
stored in the existing created_at column (i.e. in the same field as the
existing data timestamp), or by adding new columns to the objects table. Note
that each object will continue to have only one row in the database table (a
sketch of this merge logic appears after this list).
|
||||
|
||||
#. The container listing code will be modified to use the object's metadata
|
||||
timestamp as the value for the reported last-modified time.
|
||||
|
||||
.. note::
|
||||
With this proposal, new container db rows do not necessarily store all of
|
||||
the attributes sent with a single object update. Each new row is now
|
||||
comprised of the most recent metadata components from the update and any
|
||||
existing row.
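The following is an illustrative sketch of that merge, under assumed field names
(it is not the actual merge_items implementation)::

    def merge_object_row(existing, update):
        """Both args are dicts with keys: data_ts, size, etag,
        ctype_ts, content_type, meta_ts."""
        merged = dict(existing)
        # immutable metadata follows the newest data-timestamp
        if update['data_ts'] > existing['data_ts']:
            merged.update(data_ts=update['data_ts'],
                          size=update['size'], etag=update['etag'])
        # content-type follows the newest content-type-timestamp
        if update['ctype_ts'] > existing['ctype_ts']:
            merged.update(ctype_ts=update['ctype_ts'],
                          content_type=update['content_type'])
        # the metadata-timestamp defines the Last-Modified time
        merged['meta_ts'] = max(existing['meta_ts'], update['meta_ts'])
        return merged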
|
||||
|
||||
|
||||
Container-replicator
|
||||
--------------------
|
||||
|
||||
#. The container-replicator will be modified to ensure that all three object
|
||||
timestamps are included in replication updates. At the receiving end these
|
||||
are handled by the same merge_items method as described above.
|
||||
|
||||
.. _rsync:
|
||||
|
||||
rsync object replication
|
||||
------------------------
|
||||
|
||||
With the proposed changes, .meta files may now contain a content-type value set
|
||||
at a different time to the other mutable metadata. Unlike :ref:`ssync`, the
|
||||
rsync based replication process has no visibility of the contents of the object
|
||||
files. The replication process cannot therefore distinguish between two meta
|
||||
files which have the same name but may contain different content-type and
|
||||
content-type-timestamp values.
|
||||
|
||||
The naming of .meta files must therefore be modified so that the filename
|
||||
indicates both the metadata-timestamp and the content-type-timestamp. The
|
||||
current proposal is to use an encoding of the content-type-timestamp and
|
||||
metadata-timestamp as the .meta file name. Specifically:
|
||||
|
||||
* if the .meta file contains a content-type value, its name shall be
|
||||
the encoding of the metadata-timestamp followed by the (older or equal)
|
||||
content-type-timestamp, with a `.meta` extension.
|
||||
* if the .meta file does not contain a content-type value, its name shall
|
||||
be the metadata-timestamp, with a `.meta` extension.
|
||||
|
||||
Other options for .meta file naming are discussed in :ref:`alternatives`.
|
||||
|
||||
The hash_cleanup_listdir function will be modified so that the decision as to
|
||||
whether a particular meta file should be deleted will no longer be based on a
|
||||
lexicographical sort of the file names - the file names will be decomposed into
|
||||
a content-type-timestamp and a metadata-timestamp and the one (or two) file(s)
|
||||
having the newest of each will be retained.
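For instance, a hypothetical selection helper (names assumed; not the real
hash_cleanup_listdir code) would keep the file(s) carrying the newest of each
timestamp::

    def meta_files_to_keep(meta_names, decode_timestamps):
        """decode_timestamps(name) is assumed to return
        (metadata_ts, content_type_ts), with content_type_ts None when the
        file name encodes only a metadata-timestamp."""
        best_meta = best_ctype = None
        for name in meta_names:
            meta_ts, ctype_ts = decode_timestamps(name[:-len('.meta')])
            if best_meta is None or meta_ts > best_meta[0]:
                best_meta = (meta_ts, name)
            if ctype_ts is not None and (
                    best_ctype is None or ctype_ts > best_ctype[0]):
                best_ctype = (ctype_ts, name)
        keep = set()
        if best_meta:
            keep.add(best_meta[1])
        if best_ctype:
            keep.add(best_ctype[1])
        return keep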
|
||||
|
||||
In addition the DiskFile implementation must be changed to preserve, and read,
|
||||
up to two meta files in the object directory when their names indicate that one
|
||||
contains the most recent content-type and the other contains the most recent
|
||||
metadata.
|
||||
|
||||
Multiple .meta files will only exist until the next PUT or POST request is
|
||||
handled. On a PUT, all older .meta files are deleted - their content is
|
||||
obsolete. On a newer POST, the multiple .meta files are read and their contents
|
||||
merged, taking the newest of user metadata and content-type. The merged
|
||||
metadata is written to a single newer .meta file and all older .meta files are
|
||||
deleted.
|
||||
|
||||
For example, consider an object directory that after rsync has the following
|
||||
files (sorted)::
|
||||
|
||||
t0_offset.meta - unwanted
|
||||
t2.data - wanted, most recent data-timestamp
|
||||
enc(t6, t2).meta - wanted, most recent metadata-timestamp
|
||||
enc(t4, t3).meta - unwanted
|
||||
enc(t5, t5).meta - wanted, most recent content-type-timestamp
|
||||
|
||||
If a POST occurs at t7 with new user metadata but no new content-type value,
|
||||
the contents of the directory after handling the post will be::
|
||||
|
||||
t2.data
|
||||
enc(t7, t5).meta
|
||||
|
||||
Note that the when an object merges content-type and metadata-timestamp from
|
||||
two .meta files, it is reconstructing the same state that will already have
|
||||
been propagated to container servers. There is no need for object servers to
|
||||
send container updates in response to replication events (i.e. no change to
|
||||
current behavior in that respect).
|
||||
|
||||
.. _ssync:
|
||||
|
||||
Updates to ssync
|
||||
----------------
|
||||
|
||||
Additionally we should endeavor to enumerate the required changes to ssync to
|
||||
support the preservation of semantic difference between a POST and PUT. For
|
||||
example:
|
||||
|
||||
#. The missing check request sent by the ssync_sender should include enough
|
||||
information for the ssync_receiver to determine which of the object's state
|
||||
is out of date i.e. none, some or all of data, content-type and metadata.
|
||||
#. The missing check response from ssync_receiver should include enough
|
||||
information for the ssync_sender to differentiate between a hash that
|
||||
is "missing" and out-of-date content-type and/or metadata update.
|
||||
#. When handling ssync_sender's send_list during the UPDATES portion, in
|
||||
addition to sending PUT and DELETE requests the sender should be able
|
||||
to send a pure metadata POST update
|
||||
#. The ssync_receiver's updates method must be prepared to dispatch POST
|
||||
requests to the underlying object-server app in addition to PUT and
|
||||
DELETE requests.
|
||||
|
||||
The current ssync implementation seems to indicate that it was originally
|
||||
intended to be optimized for the default POST-as-COPY configuration, and it
|
||||
does not handle some corner cases with fast-POST as well as rsync replication.
|
||||
Because ssync is still described as experimental, improving ssync support
|
||||
should not be a requirement for resolving the current limitations of fast-POST
|
||||
for rsync deployments. However ssync is still actively being developed and
|
||||
improved, and remains a key component of a number of other efforts to improve and
|
||||
enhance Swift. Full ssync support for fast-POST should be a requirement for
|
||||
making fast-POST the default.
|
||||
|
||||
.. _container-sync:
|
||||
|
||||
Container Sync
|
||||
--------------
|
||||
|
||||
Container Sync will require both the ability to discover that an
|
||||
object has changed, and the ability to request that object.
|
||||
|
||||
Because each object update via fast-POST will trigger a container
|
||||
update, there will be a new row (and timestamp) in the container
|
||||
databases for every update to an object (just like with POST-as-COPY
|
||||
today!)
|
||||
|
||||
The metadata-timestamp in the database will reflect a complete version
|
||||
of an object and metadata transformation. The exact version of the
|
||||
object retrieved can be verified with X-Backend-Timestamp.
|
||||
|
||||
.. _x-newest:
|
||||
|
||||
X-Newest
|
||||
--------
|
||||
|
||||
X-Newest should be updated to use X-Backend-Timestamp.
|
||||
|
||||
.. note::
|
||||
|
||||
We should stop the sync daemon from using the row['created_at'] value
|
||||
to set the x-timestamp of the object PUT to the peer container, and
|
||||
have it instead use the X-Timestamp from the object being synced.
|
||||
|
||||
.. _timestamp_encoding:
|
||||
|
||||
Multiple Timestamp Encoding
|
||||
---------------------------
|
||||
|
||||
If required, multiple timestamps t0, t1 ... will be encoded into a single
|
||||
timestamp string having the form::
|
||||
|
||||
<t0[_offset]>[<+/-><offset_to_t1>[<+/-><offset_to_t2>]]
|
||||
|
||||
where:
|
||||
|
||||
* t0 may include an offset, if non-zero, with leading zero's removed from the
|
||||
offset, e.g. 1234567890.12345_2
|
||||
* offset_to_t1 is the difference in units of 10 microseconds between t0 and
|
||||
t1, in hex, if non-zero
|
||||
* offset_to_t2 is the difference in units of 10 microseconds between t1 and
|
||||
t2, in hex, if non-zero
|
||||
|
||||
An example of encoding three monotonically increasing timestamps would be::
|
||||
|
||||
1234567890.12345_2+9f3c+aa322
|
||||
|
||||
An example of encoding of three equal timestamps would be::
|
||||
|
||||
1234567890.12345_2
|
||||
|
||||
i.e. identical to the shortened form of t0.
|
||||
|
||||
An example of encoding two timestamps where the second is older would be::
|
||||
|
||||
1234567890.12345_2-9f3c
|
||||
|
||||
Note that a lexicographical sort of encoded timestamps is not required to
|
||||
result in any chronological ordering.
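As an illustrative sketch only (this is not Swift's actual Timestamp utility
code), the encoding described above could be produced like this::

    def encode_timestamps(t0, t1=None, t2=None, offset=0):
        """t0/t1/t2 are floats (seconds); offset is t0's two-vector offset."""
        out = '%.5f' % t0
        if offset:
            out += '_%x' % offset
        prev = t0
        for t in (t1, t2):
            if t is None:
                break
            # difference in units of 10 microseconds, in hex, if non-zero
            delta = int(round((t - prev) * 100000))
            if delta:
                out += '%+x' % delta
            prev = t
        return out

    encode_timestamps(1234567890.12345, offset=2)
    # -> '1234567890.12345_2' (equal timestamps collapse to the t0 form)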
|
||||
|
||||
|
||||
Example Scenarios
|
||||
=================
|
||||
|
||||
In the following examples we attempt to enumerate various failure conditions
|
||||
that would require making decisions about how the implementation serializes or
|
||||
merges out-of-order metadata updates.
|
||||
|
||||
These examples use the current proposal for encoding multiple timestamps
|
||||
:ref:`timestamp_encoding` in .meta file names and in the container db
|
||||
`created_at` column. For simplicity we use the shorthand `t2-t1` to represent
|
||||
the encoding of timestamps t2 and t1 in this form, but note that the `-t1` part
|
||||
is in fact a time difference and not the absolute value of the t2 timestamp.
|
||||
|
||||
(The exact format of the .meta file name is still being discussed.)
|
||||
|
||||
Consider initial state for an object that was PUT at time t1::
|
||||
|
||||
Obj server 1,2,3: /t1.data {etag=m1, size=s1, c_type=c1}
|
||||
Cont server 1,2,3: {ts=t1, etag=m1, size=s1, c_type=c1}
|
||||
|
||||
Happy Path
|
||||
----------
|
||||
|
||||
All servers initially consistent, successful fast-POST at time t2 that
|
||||
modifies an object’s content-type. When all is well our object
|
||||
servers will end up in a consistent state::
|
||||
|
||||
Obj server 1,2,3: /t1.data {etag=m1, size=s1, c_type=c1}
|
||||
/t2+t2.meta {c_type=c2}
|
||||
|
||||
The proposal is for the fast-POST to trigger a container update that is
|
||||
a combination of the existing metadata from the .data file and the new
|
||||
content-type::
|
||||
|
||||
Cont server 1,2,3: {ts=t1+t2+t2, etag=m1, size=s1, c_type=c2}
|
||||
|
||||
|
||||
.. note::
|
||||
|
||||
A container update will be issued for every POST even if the
|
||||
content-type is not updated to ensure that the container listing
|
||||
last-modified time is consistent with the object state, and to ensure
|
||||
that a new row is created for container sync.
|
||||
|
||||
Now consider some failure scenarios...
|
||||
|
||||
Object node down
|
||||
----------------
|
||||
|
||||
In this case only a subset of object nodes would receive the metadata
|
||||
update::
|
||||
|
||||
Obj server 1,2: /t1.data {etag=m1, size=s1, c_type=c1}
|
||||
/t2+t2.meta {c_type=c2}
|
||||
Obj server 3: /t1.data {etag=m1, size=s1, c_type=c1}
|
||||
|
||||
Normal object replication will copy the metadata update t2 to the failed object
|
||||
server 3, bringing its state in line with the other object servers.
|
||||
|
||||
Because the failed object node would not have updated its respective
|
||||
container server, that will be out of date as well::
|
||||
|
||||
Cont server 1,2: {ts=t1+t2+t2, etag=m1, size=s1, c_type=c2}
|
||||
Cont server 3: {ts=t1, etag=m1, size=s1, c_type=c1}
|
||||
|
||||
During replication, row merging on server 3 would merge the content-type update
|
||||
at t2 with the existing row to create a new row identical to that on servers 1
|
||||
and 2.
|
||||
|
||||
Container update fails
|
||||
----------------------
|
||||
|
||||
If a container server is offline while an object server is handling a POST then
|
||||
the object server will store an async_pending of the update record in the same
way as for PUTs and DELETEs.
|
||||
|
||||
Object node missing .data file
|
||||
------------------------------
|
||||
|
||||
POST will return 404 and not process the request if the object does not
|
||||
exist::
|
||||
|
||||
Obj server 1,2: /t1.data {etag=m1, size=s1, c_type=c1}
|
||||
/t2+t2.meta {c_type=c2}
|
||||
Obj server 3: 404
|
||||
|
||||
After object replication the object servers should have the same files. This
|
||||
requires no change to rsync replication. ssync replication will be modified to
|
||||
send a PUT with t1 (including content-type=c1) followed by a POST with t2
|
||||
(including content-type=c2), i.e. ssync will replicate the requests received by
|
||||
the healthy servers.
|
||||
|
||||
Object node stale .data file
|
||||
----------------------------
|
||||
|
||||
If one object server has an older .data file then the composite timestamp sent
|
||||
with its container update will not match that of the other nodes::
|
||||
|
||||
Obj server 1,2: /t1.data {etag=m1, size=s1, c_type=c1}
|
||||
/t2+t2.meta {c_type=c2}
|
||||
Obj server 3: /t0.data {etag=m0, size=s0, c_type=c0}
|
||||
/t2+t2.meta {c_type=c2}
|
||||
|
||||
After object replication the object servers should have the same files. This
|
||||
requires no change to rsync replication. ssync replication will be modified to
|
||||
send a PUT with t1, i.e. ssync will replicate the request missed by the failed
|
||||
server.
|
||||
|
||||
Assuming container server 3 was also out of date, the container row will be
|
||||
updated to::
|
||||
|
||||
Cont server 1,2: {ts=t1+t2+t2, etag=m1, size=s1, c_type=c2}
|
||||
Cont server 3: {ts=t0+t2+t2, etag=m0, size=s0, c_type=c2}
|
||||
|
||||
During container replication on server 3, row merging will apply the later data
|
||||
timestamp at t1 to the existing row to create a new row that matches servers 1
and 2.
|
||||
|
||||
Assuming container server 3 was also up to date, the container row will be
|
||||
updated to::
|
||||
|
||||
Cont server 1,2: {ts=t1+t2+t2, etag=m1, size=s1, c_type=c2}
|
||||
Cont server 3: {ts=t1+t2+t2, etag=m1, size=s1, c_type=c2}
|
||||
|
||||
Note that in this case the row merging has applied the content-type from the
|
||||
update but ignored the immutable metadata from the update which is older than
|
||||
the values in the existing db row.
|
||||
|
||||
Newest .data file node down
|
||||
---------------------------
|
||||
|
||||
If none of the nodes that have the t1 .data file are available to handle the
|
||||
POST at the time of the client request, then the metadata may only be applied on
|
||||
nodes having a stale .data file::
|
||||
|
||||
Obj server 1,2: /t0.data {etag=m0, size=s0, c_type=c0}
|
||||
/t2+t2.meta {c_type=c2}
|
||||
Obj server 3: /t1.data {etag=m1, size=s1, c_type=c1}
|
||||
|
||||
Object replication will eventually make the object servers consistent.
|
||||
|
||||
The containers may be similarly inconsistent::
|
||||
|
||||
Cont server 1,2: {ts=t0+t2+t2, etag=m0, size=s0, c_type=c2}
|
||||
Cont server 3: {ts=t1, etag=m1, size=s1, c_type=c1}
|
||||
|
||||
During container replication on server 3, row merging will apply the
|
||||
content-type update at t2 to the existing row but ignore the data-timestamp and
|
||||
immutable metadata, since the existing row on server 3 has newer data
|
||||
timestamp.
|
||||
|
||||
During replication on container servers 1 and 2, row merging will apply the
|
||||
data-timestamp and immutable metadata updates from server 3 but ignore the
|
||||
content-type update since they have a newer content-type-timestamp.
|
||||
|
||||
Additional POSTs with Content-Type to overwrite metadata
|
||||
--------------------------------------------------------
|
||||
|
||||
If the initial state already includes a metadata update, the content-type may
|
||||
have been overridden::
|
||||
|
||||
Obj server 1,2,3: /t1.data {etag=m1, size=s1, c_type=c1}
|
||||
/t2+t2.meta {c_type=c2}
|
||||
|
||||
In this case the containers would also reflect the content-type of the
|
||||
metadata update::
|
||||
|
||||
Cont server 1,2,3: {ts=t1+t2+t2, etag=m1, size=s1, c_type=c2}
|
||||
|
||||
When another POST occurs at t3 which includes a content-type update, the final
|
||||
state of the object server would overwrite the last metadata update entirely::
|
||||
|
||||
Obj server 1,2,3: /t1.data {etag=m1, size=s1, c_type=c1}
|
||||
/t3+t3.meta {c_type=c3}
|
||||
|
||||
|
||||
Additional POSTs without Content-Type to overwrite metadata
|
||||
-----------------------------------------------------------
|
||||
|
||||
If the initial state already includes a metadata update, the content-type may
|
||||
have been overridden::
|
||||
|
||||
Obj server 1,2,3: /t1.data {etag=m1, size=s1, c_type=c1}
|
||||
/t2+t2.meta {c_type=c2}
|
||||
|
||||
In this case the containers would also reflect the content-type of the
|
||||
metadata update::
|
||||
|
||||
Cont server 1,2,3: {ts=t1+t2+t2, etag=m1, size=s1, c_type=c2}
|
||||
|
||||
When another POST occurs at t3 which does not include a content-type update,
|
||||
the object server will merge its current record of the content-type with the
|
||||
new metadata and store in a new .meta file, the name of which indicates that it
|
||||
contains state modified at two separate times::
|
||||
|
||||
Obj server 1,2,3: /t1.data {etag=m1, size=s1, c_type=c1}
|
||||
/t3-t2.meta {c_type=c2}
|
||||
|
||||
The container server updates will now encode three timestamps which will cause
|
||||
row merging on the container servers to apply the metadata-timestamp to their
|
||||
existing rows and create a new row for the object::
|
||||
|
||||
Cont server 1,2,3: {ts=t1+t2+t3, etag=m1, size=s1, c_type=c2}
|
||||
|
||||
|
||||
Resolving conflicts with multiple metadata overwrites
|
||||
-----------------------------------------------------
|
||||
|
||||
If a previous content-type update is not consistent across all nodes then a
|
||||
subsequent metadata update at t3 that does not include a content-type value
|
||||
will result in divergent metadata sets across the nodes::
|
||||
|
||||
Obj server 1,2: /t1.data {etag=m1, size=s1, c_type=c1}
|
||||
/t3-t2.meta {c_type=c2}
|
||||
Obj server 3: /t1.data {etag=m1, size=s1, c_type=c1}
|
||||
/t3.meta
|
||||
|
||||
Even worse, if subsequent POSTs are not handled successfully on
|
||||
all nodes then we can end up with no single node having completely up to date
|
||||
metadata::
|
||||
|
||||
Obj server 1,2: /t1.data {etag=m1, size=s1, c_type=c1}
|
||||
/t3-t2.meta {c_type=c2}
|
||||
Obj server 3: /t1.data {etag=m1, size=s1, c_type=c1}
|
||||
/t4.meta
|
||||
|
||||
With rsync replication, each object server will eventually have a consistent
|
||||
set of files, but will have two .meta files::
|
||||
|
||||
Obj server 1,2,3: /t1.data {etag=m1, size=s1, c_type=c1}
|
||||
/t3-t2.meta {c_type=c2}
|
||||
/t4.meta
|
||||
|
||||
When the diskfile is opened, both .meta files are read to retrieve the most
|
||||
recent content-type and the most recent mutable metadata.
|
||||
|
||||
With ssync replication, the inconsistent nodes will exchange POSTs that will
|
||||
eventually result in a consistent single .meta file on each node::
|
||||
|
||||
Obj server 1,2,3: /t1.data {etag=m1, size=s1, c_type=c1}
|
||||
/t4-t2.meta {c_type=c2}
|
||||
|
||||
|
||||
.. _alternatives:
|
||||
|
||||
Alternatives
|
||||
============
|
||||
|
||||
Alternative .meta file naming
|
||||
-----------------------------
|
||||
|
||||
#. Encoding the content-type timestamp followed by the metadata timestamp (i.e.
|
||||
reverse the order w.r.t. the proposal. This would result in encodings that
|
||||
always have a positive offset which is consistent with the
|
||||
enc(data-timestamp, content-type-timestamp, metadata-timestamp) form used
|
||||
in container updates. However, having the proposed encoding order ensures
|
||||
that files having *some* content newer than a data file will always sort
|
||||
ahead of the data file, which reduces the churn in diskfile code such as
|
||||
hash_cleanup_listdir, and is arguably more intuitive for human inspection
|
||||
("t2-offset.meta is preserved in the dir with t1.data because t2 is later
|
||||
than t1", rather than "t0+offset is preserved in the dir with t1.meta
|
||||
because the sum of t0 and offset is later than t1).
|
||||
|
||||
#. Using a two vector timestamp with the 'normal' part being the content-type
|
||||
timestamp and the offset being the time delta to the metadata-timestamp.
|
||||
|
||||
(It is the author's understanding that it is safe to use a timestamp offset
|
||||
to represent the metadata-timestamp in this way because .meta files will
|
||||
never be assigned a timestamp offset by the container-reconciler, since the
|
||||
container-reconciler only uses timestamp offsets to impose an internal
|
||||
ordering on object PUTs and DELETEs having the same external timestamp.)
|
||||
|
||||
This is in principle the same as the proposed option but possibly results
|
||||
in a less compact filename and may create confusion with two vector
|
||||
timestamps.
|
||||
|
||||
#. Using a combination of the metadata-timestamp and a hash of the .meta file
|
||||
contents to form a name for the .meta file. The timestamp part allows for
|
||||
cleanup of .meta files that are older than a .data or .ts file, while the
|
||||
hash part distinguishes .meta that contain different Content-Type and/or
|
||||
Content-Type timestamp values. During replication, all valid .meta files are
|
||||
preserved in the object directory (the worst case number being capped at the
|
||||
number of replicas in the object ring). When DiskFile loads the metadata,
|
||||
all .meta files will be read and the most recent values merged into the
|
||||
metadata dict. When the merged metadata dict is written, all contributing
|
||||
.meta files may be deleted.
|
||||
|
||||
This option is more general in that it allows other metadata items to also
|
||||
have individual timestamps (without requiring an unbounded number of
|
||||
timestamps to be encoded in the .meta filename). It therefore supports
|
||||
other potential new features such as updatable object sysmeta and
|
||||
updatable user metadata. Any such feature is of course beyond the scope of
|
||||
this proposal.
|
||||
|
||||
|
||||
Just use POST-as-COPY
|
||||
---------------------
|
||||
|
||||
POST-as-COPY has some limitations that make it ill-suited for some workloads.
|
||||
|
||||
#. POST to large objects is slow
|
||||
#. POST during failure can result in stale data being copied over fresher data.
|
||||
|
||||
Also, because COPY is exposed to the client directly, the same semantic behavior can
always be achieved explicitly by a determined client.
|
||||
|
||||
Force content-type-timestamp to be same as metadata-timestamp
|
||||
-------------------------------------------------------------
|
||||
|
||||
We can simplify the management of .meta files by requiring every POST arriving
|
||||
at an object server to include the content-type, and therefore remove the need
|
||||
to maintain a separate content-type-timestamp. There would be no need to
|
||||
maintain multiple meta files. Container updates would still need to be sent
|
||||
during an object POST in order to keep the container server in sync with the
|
||||
object state. The container server still needs to be modified to merge both
|
||||
content-type and metadata-timestamps with an existing row.
|
||||
|
||||
The requirement for content-type to be included with every POST is unreasonably
|
||||
onerous on clients, but could be achieved by having the proxy server retrieve
|
||||
the current content-type using a HEAD request with X-Newest = True and insert
|
||||
it into the backend POST when content-type is missing from the client POST.
|
||||
|
||||
However, this scheme violates our assertion that no internal process should
|
||||
ever assume one of an object's timestamps to be equal to another. In this case,
|
||||
the proxy is forcing the content-type-timestamp to be the same as the metadata
|
||||
timestamp that is due to the incoming POST request. In failure conditions, the
|
||||
proxy may read a stale content-type value, associate it with the latest
|
||||
metadata-timestamp and as a result erroneously overwrite a fresher content-type
|
||||
value.
|
||||
|
||||
If, as a development of this alternative, the proxy were also to read the
|
||||
'current' content-type value and its timestamp using a HEAD with X-Newest, and
|
||||
add both of these items to the backend object POST, then we get back to the
object server needing to maintain separate content-type and metadata-timestamps
in the .meta file.
|
||||
|
||||
Further, if the newest content-type in the system is unavailable during a POST
|
||||
it would be lost, and worse yet if the latest value was associated with a
|
||||
data file there's no obvious way to correctly promote its data timestamp
|
||||
values in the containers short of doing the very merging described in this
|
||||
spec - so it comes out as less desirable for the same amount of work.
|
||||
|
||||
Use the metadata-timestamp as last modified
|
||||
-------------------------------------------
|
||||
|
||||
This is basically what both fast-POST and POST-as-COPY do today. When an
|
||||
object's metadata is updated at t3 the x-timestamp for the transformed object
|
||||
is t3. However, fast-POST never updates the last modified in the container
|
||||
listing.
|
||||
|
||||
In the case of fast-POST it can apply a t3 metadata update asynchronously to a
|
||||
t1 .data file because it restricts metadata updates from including changes to
|
||||
metadata that would require being merged into a container update.
|
||||
|
||||
We want to be able to update the content-type and therefore the container
|
||||
listing.
|
||||
|
||||
In the case of POST-as-COPY it can do this because the metadata update applied
|
||||
to a .data file t0 is considered "newer" than the .data file t1. The record for
|
||||
the transformation applied to the t0 data file at t3 is stored in the
|
||||
container, and the record of the "newer" t1 .data file is irrelevant.
|
||||
|
||||
Use metadata-timestamp as primary portion of two vector timestamp
|
||||
-----------------------------------------------------------------
|
||||
|
||||
This suggests the .data file timestamp would be the offset, and merging t3_t0
|
||||
and t3_t1 would prefer t3_t1. However merging t3_t0 and t1 would prefer t3_t0
|
||||
(much as POST-as-COPY does today). The unlink old method would have to be
|
||||
updated for rsync replication to ensure that a t3_t0 metadata file "guards" a
|
||||
t0 .data file against the "newer" t1 .data file.
|
||||
|
||||
It's generally presumed that a stale read during POST-as-COPY resulting in data
loss is rare; the same false hope applies equally to this proposed
specification for a container-updating fast-POST implementation. The
|
||||
difference being this implementation would throw out the *meta* data update
|
||||
with a preference to the latest .data file instead.
|
||||
|
||||
This alternative was rejected as workable but less desirable.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
#. `Prefer X-Backend-Timestamp for X-Newest <https://review.openstack.org/133869>`_
|
||||
#. `Update container on fast-POST <https://review.openstack.org/#/c/135380/>`_
|
||||
#. `Make ssync compatible with fast-post meta files <https://review.openstack.org/#/c/138498/>`_
|
||||
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
#. Alistair Coles (acoles)
|
||||
#. Clay Gerrard (clayg)
|
||||
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
TBD
|
||||
|
||||
Repositories
|
||||
------------
|
||||
|
||||
None
|
||||
|
||||
Servers
|
||||
-------
|
||||
None
|
||||
|
||||
DNS Entries
|
||||
-----------
|
||||
None
|
||||
|
||||
Documentation
|
||||
-------------
|
||||
|
||||
Changes may be required to API docs if the last modified time reported in a
|
||||
container listing changes to be the time of a POST rather than the time of the
|
||||
PUT (there is currently an inconsistency between POST-as-COPY operation and
|
||||
fast-POST operation).
|
||||
|
||||
We may want to deprecate POST-as-COPY after successful implementation of this
|
||||
proposal.
|
||||
|
||||
Security
|
||||
--------
|
||||
|
||||
None
|
||||
|
||||
Testing
|
||||
-------
|
||||
|
||||
New and modified unit tests will be required for the object server and
|
||||
container-sync. Probe tests will be useful to verify behavior.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None
|
@ -1,132 +0,0 @@
|
||||
::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
..
|
||||
|
||||
================================================
|
||||
formpost should allow subprefix-based signatures
|
||||
================================================
|
||||
|
||||
The signature used by formpost to validate a file upload should also be considered valid
if the object_prefix which is used to calculate the signature is a real subprefix of the
object_prefix used in the action URL of the form.
With this, sharing data with external people via web-based applications is made much
easier, because just one signature is needed to create forms for every
pseudofolder in a container.
|
||||
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
At the moment, if one wants to use a form to upload data, the signature of the form must be
|
||||
calculated using the same object_prefix as the object_prefix in the url of the action attribute
|
||||
of the form.
|
||||
We propose to allow dynamically created forms, which are valid for all object_prefixes which contain
|
||||
a common prefix.
|
||||
|
||||
With this, one could generate one signature, which is valid for all pseudofolders in a container.
|
||||
This signature could be used in a web application to share every possible pseudofolder
|
||||
of a container with external people. The user who wants to share his container would not be obliged
|
||||
to generate a signature for every pseudofolder.
|
||||
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
The formpost middleware would need to be changed; the code change would be small.
|
||||
If a subprefix-based signature is desired, the hmac_body of the signature must contain a "subprefix"
|
||||
field to make sure that the creator of the signature explicitly allows uploading of objects into
|
||||
sub-pseudofolders. Beyond that, the form must contain a hidden field "subprefix", too.
|
||||
Formpost would use the value of this field to calculate a hash based on that
|
||||
value. Furthermore, the middleware would check if the object path really contains this prefix.
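A minimal sketch of that server-side check (the function and variable names are illustrative, not the actual formpost internals; the hmac body mirrors the client-side example shown below):

::

    import hmac
    from hashlib import sha1

    def is_valid_subprefix_upload(object_path, prefix_path, subprefix,
                                  redirect, max_file_size, max_file_count,
                                  expires, signature, key):
        # The uploaded object must really live under the signed prefix.
        if not object_path.startswith(prefix_path):
            return False
        # Recompute the signature over the prefix path, with the subprefix
        # appended to the hmac body just as the client does below.
        hmac_body = '%s\n%s\n%s\n%s\n%s\n%s' % (
            prefix_path, redirect, max_file_size, max_file_count,
            expires, subprefix)
        expected = hmac.new(key.encode(), hmac_body.encode(),
                            sha1).hexdigest()
        return hmac.compare_digest(expected, signature)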
|
||||
|
||||
Let's look at an example: a user wants to share the pseudofolder "folder" with external users in
|
||||
a web-based fashion. He (or a web application) calculates the signature with the path
|
||||
"/v1/my_account/container/folder" and subprefix "folder":
|
||||
::
|
||||
|
||||
import hmac
|
||||
from hashlib import sha1
|
||||
from time import time
|
||||
path = '/v1/my_account/container/folder'
|
||||
redirect = 'https://myserver.com/some-page'
|
||||
max_file_size = 104857600
|
||||
max_file_count = 10
|
||||
expires = int(time() + 600)
|
||||
key = 'MYKEY'
|
||||
hmac_body = '%s\n%s\n%s\n%s\n%s\n%s' % (path, redirect,
|
||||
max_file_size, max_file_count, expires, "folder")
|
||||
signature = hmac.new(key, hmac_body, sha1).hexdigest()
|
||||
|
||||
If an external user wants to post to the subfolder folder/subfolder/, a form which contains
|
||||
the above calculated signature and the hidden field subprefix would be used:
|
||||
::
|
||||
|
||||
<![CDATA[
|
||||
<form action="https://myswift/v1/my_account/container/folder/subfolder/"
|
||||
method="POST"
|
||||
enctype="multipart/form-data">
|
||||
<input type="hidden" name="redirect" value="REDIRECT_URL"/>
|
||||
<input type="hidden" name="max_file_size" value="BYTES"/>
|
||||
<input type="hidden" name="max_file_count" value="COUNT"/>
|
||||
<input type="hidden" name="expires" value="UNIX_TIMESTAMP"/>
|
||||
<input type="hidden" name="signature" value="HMAC"/>
|
||||
<input type="hidden" name="subprefix" value="folder"/>
|
||||
<input type="file" name="FILE_NAME"/>
|
||||
<br/>
|
||||
<input type="submit"/>
|
||||
</form>
|
||||
]]>
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
bartz
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
Add modifications to formpost and respective test module.
|
||||
|
||||
Repositories
|
||||
------------
|
||||
|
||||
None
|
||||
|
||||
Servers
|
||||
-------
|
||||
|
||||
None
|
||||
|
||||
DNS Entries
|
||||
-----------
|
||||
|
||||
None
|
||||
|
||||
Documentation
|
||||
-------------
|
||||
|
||||
Modify documentation for formpost middleware.
|
||||
|
||||
Security
|
||||
--------
|
||||
|
||||
None
|
||||
|
||||
Testing
|
||||
-------
|
||||
|
||||
Tests should be added to the existing test module.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None
|
@ -1,183 +0,0 @@
|
||||
::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=====================================================
|
||||
Improve Erasure Coding Efficiency for Global Cluster
|
||||
=====================================================
|
||||
|
||||
This SPEC describes an improvement of efficiency for Global Cluster with
|
||||
Erasure Coding. It proposes a way to improve the PUT/GET performance
|
||||
in the case of Erasure Coding with more than one region while ensuring original
|
||||
data can be recovered even if a region is lost.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
Swift now supports Erasure Codes (EC) which ensures higher durability and lower
|
||||
disk cost than the replicated case for a one region cluster. However, currently
|
||||
if Swift were running EC over 2 regions, using < 2x data redundancy
|
||||
(e.g. ec_k=10, ec_m=4) and then one of the regions is gone due to some unfortunate
|
||||
reasons (e.g. huge earthquake, fire, tsunami), there is a chance data would be lost.
|
||||
That is because, assuming each region has an even available volume of disks, each
|
||||
region should have around 7 fragments, less than ec_k, which is not enough data
|
||||
for the EC scheme to rebuild the original data.
|
||||
|
||||
To protect stored data and to ensure higher durability, Swift has to keep >= 1x the original
|
||||
data size in each region (i.e. >= 2x overall for 2 regions) by employing a larger ec_m
|
||||
like ec_m=14 for ec_k=10. However, this increase sacrifices encode performance.
|
||||
In my measurements running PyECLib encode/decode on an Intel Xeon E5-2630v3 [1], the
|
||||
benchmark result was as follows:
|
||||
|
||||
+----------------+----+----+---------+---------+
|
||||
|scheme |ec_k|ec_m|encode |decode |
|
||||
+================+====+====+=========+=========+
|
||||
|jerasure_rs_vand|10 |4 |7.6Gbps |12.21Gbps|
|
||||
+----------------+----+----+---------+---------+
|
||||
| |10 |14 |2.67Gbps |12.27Gbps|
|
||||
+----------------+----+----+---------+---------+
|
||||
| |20 |4 |7.6Gbps |12.87Gbps|
|
||||
+----------------+----+----+---------+---------+
|
||||
| |20 |24 |1.6Gbps |12.37Gbps|
|
||||
+----------------+----+----+---------+---------+
|
||||
|isa_lrs_vand |10 |4 |14.27Gbps|18.4Gbps |
|
||||
+----------------+----+----+---------+---------+
|
||||
| |10 |14 |6.53Gbps |18.46Gbps|
|
||||
+----------------+----+----+---------+---------+
|
||||
| |20 |4 |15.33Gbps|18.12Gbps|
|
||||
+----------------+----+----+---------+---------+
|
||||
| |20 |24 |4.8Gbps |18.66Gbps|
|
||||
+----------------+----+----+---------+---------+
|
||||
|
||||
Note that "decode" uses (ec_k + ec_m) - 2 fragments so performance will
|
||||
decrease less than it does for encoding, as shown in the results above.
|
||||
|
||||
In the results above, comparing ec_k=10, ec_m=4 vs ec_k=10, ec_m=14, the encode
|
||||
performance drops to roughly a third, and the other schemes follow a similar trend.
|
||||
This demonstrates that there is a problem when building a 2+ region EC cluster.
|
||||
|
||||
1: http://ark.intel.com/ja/products/83356/Intel-Xeon-Processor-E5-2630-v3-20M-Cache-2_40-GHz
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
Add an option like "duplication_factor", which will create duplicated (copied)
|
||||
fragments instead of employing a larger ec_m.
|
||||
|
||||
For example, with a duplication_factor=2, Swift will encode ec_k=10, ec_m=4 and
|
||||
store 28 fragments (10x2 data fragments and 4x2 parity fragments) in Swift.
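As a rough sketch of the index mapping this implies (illustrative only; the helper name is made up):

::

    def actual_frag_index(swift_frag_index, ec_k, ec_m):
        # With duplication, Swift stores duplication_factor * (ec_k + ec_m)
        # fragment archives; every (ec_k + ec_m) indexes wrap around to the
        # same underlying PyECLib fragment.
        return swift_frag_index % (ec_k + ec_m)

    # duplication_factor=2, ec_k=10, ec_m=4 -> 28 fragment archives in
    # Swift, but only 14 distinct PyECLib fragment indexes.
    assert actual_frag_index(0, 10, 4) == 0
    assert actual_frag_index(14, 10, 4) == 0   # duplicate of fragment 0
    assert actual_frag_index(17, 10, 4) == 3   # duplicate of fragment 3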
|
||||
|
||||
This requires a change to PUT/GET and the reconstruct sequence to map from the
|
||||
fragment index in Swift to the actual fragment index for PyECLib, but IMHO we don't
|
||||
need to make many modifications to the communication among
|
||||
proxy-server <-> object-server <-> disks.
|
||||
|
||||
I don't want to describe the implementation in detail in the first patch of the spec
|
||||
because at this stage it is just an idea to improve Swift. More discussion on the implementation
|
||||
side will follow in subsequent patches.
|
||||
|
||||
Considerations of actual placement
|
||||
----------------------------------
|
||||
Placement of these doubled fragments is important. If the same fragments,
|
||||
original and copied, appear in the same region and the second region fails,
|
||||
then we would be in the same situation where we couldn't rebuild the original
|
||||
object as we were in the smaller parity fragments case.
|
||||
|
||||
e.g:
|
||||
|
||||
- duplication_factor=2, k=4, m=2
|
||||
- 1st Region: [0, 1, 2, 6, 7, 8]
|
||||
- 2nd Region: [3, 4, 5, 9, 10, 11]
|
||||
- (Assuming the actual index used for rebuild is mapped as index % (k+m))
|
||||
|
||||
In this case, 1st region has only fragments consisting of fragment index 0, 1, 2
|
||||
and 2nd has only 3, 4, 5. Therefore, it is not able to rebuild the original object
|
||||
from the fragments in only one region because the fragment uniqueness in the
|
||||
region is less than k. A worst-case scenario like this will cause significant data
|
||||
loss as would happen with no duplication factor.
|
||||
|
||||
In fact, data durability will be ordered as:
|
||||
|
||||
- "no duplication" < "with duplication" < "more unique parities"
|
||||
|
||||
In future work, we can find a way to tie a fragment index to a region,
|
||||
something like "1st subset should be in 1st Region and 2nd subset
|
||||
should be ..." but for now this is beyond the scope of this spec.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
We can find a way to use container-sync as a solution to the problem rather
|
||||
than employing my proposed change.
|
||||
This section will describe the pros/cons for my "proposed change" and "container-sync".
|
||||
|
||||
Proposed Change
|
||||
^^^^^^^^^^^^^^^
|
||||
Pros:
|
||||
|
||||
- Higher performance way to spread objects across regions (No need to re-decode/encode for transferring across regions)
|
||||
- No extra configuration other than storage policy is needed for users to turn on the global replication. (strictly global erasure coding?)
|
||||
- Able to use other global cluster efficiency improvements (affinity control)
|
||||
|
||||
Cons:
|
||||
|
||||
- Need to employ more complex handling around ECObjectController
|
||||
|
||||
Container-Sync
|
||||
^^^^^^^^^^^^^^
|
||||
Pros:
|
||||
|
||||
- Simple and able to reuse existing swift mechanisms
|
||||
- Less data transfer between regions
|
||||
|
||||
Cons:
|
||||
|
||||
- Re-decode/encode is required when transferring objects to another region
|
||||
- Need to set the sync option for each container
|
||||
- Impossible to retrieve/reconstruct an object when > ec_m disks unavailable (includes ip unreachable)
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
- Proxy-Server PUT/GET path
|
||||
- Object-Reconstructor
|
||||
- (Optional) Ring placement strategy
|
||||
|
||||
Questions and Answers
|
||||
=====================
|
||||
|
||||
- TBD
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
kota\_ (Kota Tsuyuzaki)
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
Develop codes around proxy-server and object-reconstructor
|
||||
|
||||
Repositories
|
||||
------------
|
||||
|
||||
None
|
||||
|
||||
Servers
|
||||
-------
|
||||
|
||||
None
|
||||
|
||||
DNS Entries
|
||||
-----------
|
||||
|
||||
None
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None
|
Before Width: | Height: | Size: 79 KiB |
Before Width: | Height: | Size: 60 KiB |
Before Width: | Height: | Size: 85 KiB |
Before Width: | Height: | Size: 164 KiB |
@ -1,449 +0,0 @@
|
||||
::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
|
||||
===============================
|
||||
Increasing ring partition power
|
||||
===============================
|
||||
|
||||
This document describes a process and modifications to swift code that
|
||||
together enable ring partition power to be increased without cluster downtime.
|
||||
|
||||
Swift operators sometimes pick a ring partition power when deploying swift
|
||||
and later wish to change the partition power:
|
||||
|
||||
#. The operator chooses a partition power that proves to be too small and
|
||||
subsequently constrains their ability to rebalance a growing cluster.
|
||||
#. Perhaps more likely, in an attempt to avoid the above problem, operators
|
||||
choose a partition power that proves to be unnecessarily large and would
|
||||
subsequently like to reduce it.
|
||||
|
||||
This proposal directly addresses the first problem by enabling partition power
|
||||
to be increased. Although it does not directly address the second problem
|
||||
(i.e. it does not enable ring power reduction), it does indirectly help to
|
||||
avoid that problem by removing the motivation to choose large partition power
|
||||
when first deploying a cluster.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
The ring power determines the partition to which a resource (account, container
|
||||
or object) is mapped. The partition is included in the path under which the
|
||||
resource is stored in a backend filesystem. Changing the partition power
|
||||
therefore requires relocating resources to new paths in backend filesystems.
|
||||
|
||||
In a heavily populated cluster a relocation process could be time-consuming and
|
||||
so to avoid down-time it is desirable to relocate resources while the cluster
|
||||
is still operating. However, it is necessary to do so without (temporary) loss
|
||||
of access to data and without compromising the performance of processes such as
|
||||
replication and auditing.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
The proposed solution avoids copying any file contents during a partition power
|
||||
change. Objects are 'moved' from their current partition to a new partition,
|
||||
but the current and new partitions are arranged to be on the same device, so
|
||||
the 'move' is achieved using filesystem links without copying data.
|
||||
|
||||
(It may well be that the motivation for increasing partition power is to allow
|
||||
a rebalancing of the ring. Any rebalancing would occur after the partition
|
||||
power increase has completed - during partition power changes the ring balance
|
||||
is not changed.)
|
||||
|
||||
To allow the cluster to continue operating during a partition power change (in
|
||||
particular, to avoid any disruption or incorrect behavior of the replicator and
|
||||
auditor processes), new partition directories are created in a separate
|
||||
filesystem branch from the current partition directories. When all new
|
||||
partition directories have been populated, the ring transitions to using the
|
||||
new filesystem branch.
|
||||
|
||||
During this transition, object servers maintain links to resource files from
|
||||
both the current and new partition directories. However, as already discussed,
|
||||
no file content is duplicated or copied. The old partition directories are
|
||||
eventually deleted.
|
||||
|
||||
Detailed description
|
||||
--------------------
|
||||
|
||||
The process of changing a ring's partition power comprises three phases:
|
||||
|
||||
1. Preparation - during this phase the current partition directories continue
|
||||
to be used but existing resources are also linked to new partition
|
||||
directories in anticipation of the new ring partition power.
|
||||
|
||||
2. Switchover - during this phase the ring transitions to using the new
|
||||
partition directories; proxy and backend servers rollover to using the new
|
||||
ring partition power.
|
||||
|
||||
3. Cleanup - once all servers are using the new ring partition power,
|
||||
resource files in old partition directories are removed.
|
||||
|
||||
For simplicity, we describe the details of each phase in terms of an object
|
||||
ring but note that the same process can be applied to account and container
|
||||
rings and servers.
|
||||
|
||||
Preparation phase
|
||||
^^^^^^^^^^^^^^^^^
|
||||
|
||||
During the preparation phase two new attributes are set in the ring file:
|
||||
|
||||
* the ring's `epoch`: if not already set, a new `epoch` attribute is added to
|
||||
the ring. The ring epoch is used to determine the parent directory for
|
||||
partition directories. Similar to the way in which a ring's policy index is
|
||||
appended to the `objects` directory name, the epoch will be prefixed to the
|
||||
`objects` directory name. For simplicity, the ring epoch will be a
|
||||
monotonically increasing integer starting at 0. A 'legacy' ring having no
|
||||
epoch attribute will be treated as having epoch 0.
|
||||
|
||||
* the `next_part_power` attribute indicates the partition power that will be
|
||||
used in the next epoch of the ring. The `next_part_power` attribute is used
|
||||
during the preparation phase to determine the partition directory in which
|
||||
an object should be stored in the next epoch of the ring.
|
||||
|
||||
At this point in time no other changes are made to the ring file:
|
||||
the current part power and the mapping of partitions to devices are unchanged.
|
||||
|
||||
The updated ring file is distributed to all servers. During this preparation
|
||||
phase, proxy servers will continue to use the current ring partition mapping to
|
||||
determine the backend url for objects. Object servers, along with replicator
|
||||
and auditor processes, also continue to use the current ring
|
||||
parameters. However, during PUT and DELETE operations object servers will
|
||||
create additional links to object files in the object's future partition
|
||||
directory in preparation for an eventual switchover to the ring's next
|
||||
epoch. This does not require any additional copying or writing of object
|
||||
contents.
|
||||
|
||||
The filesystem path for future partition directories is determined as follows.
|
||||
In general, the path to an object file on an object server's filesystem has the
|
||||
form::
|
||||
|
||||
dev/[<epoch>-]objects[-<policy>]/<partition>/<suffix>/<hash>/<ts>.<ext>
|
||||
|
||||
where:
|
||||
|
||||
* `epoch` is the ring's epoch, if non-zero
|
||||
* `policy` is the object container's policy index, if non-zero
|
||||
* `dev` is the device to which `partition` is mapped by the ring file
|
||||
* `partition` is the object's partition,
|
||||
calculated using `partition = F(hash) >> (32 - P)`,
|
||||
where `P` is the ring partition power
|
||||
* `suffix` is the last three digits of `hash`
|
||||
* `hash` is a hash of the object name
|
||||
* `ts` is the object timestamp
|
||||
* `ext` is the filename extension (`data`, `meta` or `ts`)
|
||||
|
||||
Given `next_part_power` and `epoch` in the ring file, it is possible to
|
||||
calculate::
|
||||
|
||||
future_partition = F(hash) >> (32 - next_part_power)
|
||||
next_epoch = epoch + 1
|
||||
|
||||
The future partition directory is then::
|
||||
|
||||
dev/<next_epoch>-objects[-<policy>]/<next_partition>/<suffix>/<hash>/<ts>.<ext>
|
||||
|
||||
For example, consider a ring in its first epoch, with current partition power
|
||||
P, containing an object currently in partition X, where 0 <= X < 2**P. If the
|
||||
partition power increases by a factor of 2, the object's future partition will
|
||||
be either 2X or 2X+1 in the ring's next epoch. During a DELETE an additional
|
||||
filesystem link will be created at one of::
|
||||
|
||||
dev/1-objects/<2X>/<suffix>/<hash>/<ts>.ts
|
||||
dev/1-objects/<2X+1>/<suffix>/<hash>/<ts>.ts
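For illustration, the 2X / 2X+1 relationship above can be checked with a simplified stand-in for the ring's hash function F (this is not Swift's exact ring code):

::

    from hashlib import md5

    def partition(name, part_power):
        # Simplified stand-in for F(hash): take a 32-bit value from an MD5
        # of the object path and keep its most significant bits.
        h = int(md5(name.encode()).hexdigest()[:8], 16)
        return h >> (32 - part_power)

    name = '/AUTH_test/c/o'
    P = 10                           # current partition power
    X = partition(name, P)           # current partition
    future = partition(name, P + 1)  # partition in the next epoch

    # Doubling the partition power maps partition X to either 2X or 2X + 1,
    # which is why the extra link lands in dev/1-objects/<2X or 2X+1>/...
    assert future in (2 * X, 2 * X + 1)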
|
||||
|
||||
Once object servers are known to be using the updated ring file a new relinker
|
||||
process is started. The relinker prepares an object server's filesystem for a
|
||||
partition power change by crawling the filesystem and linking existing objects
|
||||
to future partition directories. The relinker determines each object's future
|
||||
partition directory in the same way as described above for the object server.
|
||||
|
||||
The relinker does not remove links from current partition directories. Once the
|
||||
relinker has successfully completed, every existing object should be linked
|
||||
from both a current partition directory and a future partition directory. Any
|
||||
subsequent object PUTs or DELETEs will be reflected in both the current and
|
||||
future partition directory as described above.
|
||||
|
||||
To avoid newly created objects being 'lost', it is important that an object
|
||||
server is using the updated ring file before the relinker process starts in
|
||||
order to guarantee that either the object server or the relinker create future
|
||||
partition links for every object. This may require object servers to be
|
||||
restarted prior to the relinker process being started, or to otherwise report
|
||||
that they have reloaded the ring file.
|
||||
|
||||
The relinker will report successful completion in a file
|
||||
`/var/cache/swift/relinker.recon` that can be queried via (modified) recon
|
||||
middleware.
|
||||
|
||||
Once the relinker process has successfully completed on all object servers, the
|
||||
partition power change process may move on to the switchover phase.
|
||||
|
||||
Switchover phase
|
||||
^^^^^^^^^^^^^^^^
|
||||
|
||||
To begin the switchover to using the next partition power, the ring file is
|
||||
updated once more:
|
||||
|
||||
* the current partition power is stored as `previous_part_power`
|
||||
* the current partition power is set to `next_partition_power`
|
||||
* `next_partition_power` is set to None
|
||||
* the ring's `epoch` is incremented
|
||||
* the mapping of partitions to devices is re-created so that partitions 2X and
|
||||
2X+1 map to the same devices to which partition X was mapped in the previous
|
||||
epoch. This is a simple transformation. Since no object content is moved
|
||||
between devices the actual ring balance remains unchanged.
|
||||
|
||||
The updated ring file is then distributed to all proxy and object servers.
|
||||
|
||||
Since ring file distribution and loading is not instantaneous, there is a
|
||||
window of time during which a proxy server may direct object requests to either
|
||||
an old partition or a current partition (note that the partitions previously
|
||||
referred to as 'future' are now referred to as 'current'). Object servers will
|
||||
therefore create additional filesystem links during PUT and DELETE requests,
|
||||
pointing from old partition directories to files in the current partition
|
||||
directories. The paths to the old partition directories are determined in the
|
||||
same way as future partition directories were determined during the preparation
|
||||
phase, but now using the `previous_part_power` and decrementing the current
|
||||
ring `epoch`.
|
||||
|
||||
This means that if one proxy PUTs an object using a current partition, then
|
||||
another proxy subsequently attempts to GET the object using the old partition,
|
||||
the object will be found, since both current and old partitions map to the same
|
||||
device. Similarly if one proxy PUTs an object using the old partition and
|
||||
another proxy then GETs the object using the current partition, the object will
|
||||
be found in the current partition on the object server.
|
||||
|
||||
The object auditor and replicator processes are restarted to force reloading of
|
||||
the ring file and commence to operate using the current ring parameters.
|
||||
|
||||
Cleanup phase
|
||||
^^^^^^^^^^^^^
|
||||
|
||||
The cleanup phase may start once all servers are known to be using the updated
|
||||
ring file. Once again, this may require servers to be restarted or to report
|
||||
that they have reloaded the ring file during switchover.
|
||||
|
||||
A final update is made to the ring file: the `previous_partition_power`
|
||||
attribute is set to `None` and the ring file is once again distributed. Once
|
||||
object servers have reloaded the updated ring file they will cease to create
|
||||
object file links in old partition directories.
|
||||
|
||||
At this point the old partition directories may be deleted - there is no need
|
||||
to create tombstone files when deleting objects in the old partitions since
|
||||
these partition directories are no longer used by any swift process.
|
||||
|
||||
A cleanup process will crawl the filesystem and delete any partition
|
||||
directories that are not part of the current epoch or a future epoch. This
|
||||
cleanup process should repeat periodically in case any devices that were
|
||||
offline during the partition power change come back online - the old epoch
|
||||
partition directories discovered on those devices may be deleted. Normal
|
||||
replication may cause current epoch partition directories to be created on a
|
||||
resurrected disk.
|
||||
|
||||
(The cleanup function could be added to an existing process such as the
|
||||
auditor).
|
||||
|
||||
Other considerations
|
||||
--------------------
|
||||
|
||||
swift-dispersion-[populate|report]
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
The swift-dispersion-[populate|report] tools will need to be made epoch-aware.
|
||||
After increasing partition power, swift-dispersion-populate may need to be
|
||||
run to achieve the desired coverage. (Although initially the device coverage
|
||||
will remain unchanged, the percentage of partitions covered will have reduced
|
||||
by whatever factor the partition power has increased.)
|
||||
|
||||
Auditing
|
||||
^^^^^^^^
|
||||
|
||||
During preparation and switchover, the auditor may find a corrupt object. The
|
||||
quarantine directory is not in the epoch partition directory filesystem branch,
|
||||
so a quarantined object will not be lost when old partitions are deleted.
|
||||
|
||||
The quarantining of an object in a current partition directory will not remove
|
||||
the object from a future partition, so after switchover the auditor will
|
||||
discover the object again, and quarantine it again. The diskfile quarantine
|
||||
renamer could optionally be made 'relinker' aware and unlink duplicate object
|
||||
references when quarantining an object.
|
||||
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
Prior work
|
||||
^^^^^^^^^^
|
||||
|
||||
The swift_ring_tool_ enables ring power increases while swift services are
|
||||
disabled. It takes a similar approach to this proposal in that the ring
|
||||
mapping is changed so that every resource remains on the same device when
|
||||
moved to its new partition. However, new partitions are created in the
|
||||
same filesystem branch as existing (hence the need for services to be suspended
|
||||
during the relocation).
|
||||
|
||||
.. _swift_ring_tool: https://github.com/enovance/swift-ring-tool/
|
||||
|
||||
Previous proposals have been made to upstream swift:
|
||||
|
||||
https://bugs.launchpad.net/swift/+bug/933803 suggests a 'same-device'
|
||||
partition re-mapping, as does this proposal, but did not provide for
|
||||
relocation of resources to new partition directories.
|
||||
|
||||
https://review.openstack.org/#/c/21888/ suggests maintaining a partition power
|
||||
per device (so only new devices use the increased partition power) but appears
|
||||
to have been abandoned due to complexities with replication.
|
||||
|
||||
|
||||
Create future partitions in existing `objects[-policy]` directory
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
The duplication of filesystem entries for objects and creation of (potentially
|
||||
duplicate) partitions during the preparation phase could have undesirable
|
||||
effects on other backend processes if they are not isolated in another
|
||||
filesystem branch.
|
||||
|
||||
For example, the object replicator is likely to discover newly created future
|
||||
partition directories that appear to be 'misplaced'. The replicator will
|
||||
attempt to sync these to their primary nodes (according to the old ring
|
||||
mapping) which is unnecessary. Worse, the replicator might then delete the
|
||||
future partitions from their current nodes, undoing the work of the relinker
|
||||
process.
|
||||
|
||||
If the replicator were to adopt the future ring mappings from the outset of the
|
||||
preparation phase then the same problems arise with respect to current
|
||||
partitions that now appear to be misplaced. Furthermore, the replication
|
||||
process is likely to race with the relinker process on remote nodes to
|
||||
populate future partitions: if relocation proceeds faster on node A than B then
|
||||
the replicator may start to sync objects from A to B, which is again
|
||||
unnecessary and expensive.
|
||||
|
||||
The auditor will also be impacted as it will discover objects in the future
|
||||
partition directories and audit them, being unable to distinguish them as
|
||||
duplicates of the object still stored in the current partition.
|
||||
|
||||
These issues could of course be avoided by disabling replication and auditing
|
||||
during the preparation phase, but instead we propose to make the future ring
|
||||
partition naming be mutually exclusive from current ring partition naming, and
|
||||
simply restrict the replicator and auditor to only process partitions that are
|
||||
in the current ring partition set. In other words we isolate these processes
|
||||
from the future partition directories that are being created by the relinker.
|
||||
|
||||
|
||||
Use mutually exclusive future partitions in existing `objects` directory
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
The current algorithm for calculating the partition for an object is to
|
||||
calculate a 32 bit hash of the object and then use its P most significant bits,
|
||||
resulting in partitions in the range {0, 2**P - 1}. i.e.::
|
||||
|
||||
part = H(object name) >> (32 - P)
|
||||
|
||||
A ring with partition power P+1 will re-use all the partition numbers of a ring
|
||||
with partition power P.
|
||||
|
||||
To eliminate overlap of future ring partitions with current ring partitions we
|
||||
could change the partition number algorithm to add an offset to each partition
|
||||
number when a ring's partition power is increased:
|
||||
|
||||
offset = 2**P
part = (H(object name) >> (32 - P)) + offset
|
||||
|
||||
This is backwards compatible: if `offset` is not defined in a ring file then it
|
||||
is set to zero.
|
||||
|
||||
To ensure that partition numbers remain < 2**32, this change will reduce the
|
||||
maximum partition power from 32 to 31.
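A small sketch of this alternative numbering (illustrative only):

::

    def offset_partition(hash32, part_power):
        # Partitions for a ring of power P live in [2**P, 2**(P+1)), so
        # they can never collide with partitions of a different power.
        offset = 2 ** part_power
        return (hash32 >> (32 - part_power)) + offset

    # Power 10 and power 11 rings produce disjoint partition ranges.
    assert offset_partition(0xffffffff, 10) == 2 ** 10 + 1023
    assert offset_partition(0xffffffff, 11) == 2 ** 11 + 2047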
|
||||
|
||||
Proxy servers start to use the new ring at outset of relocation phase
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
This would mean that GETs to backends would use the new rings partitions in
|
||||
object urls. Objects may not yet have been relocated to their new partition
|
||||
directory and the object servers would therefore need to fall back to looking
|
||||
in the old ring partition for the object. PUTs and DELETEs to the new partition
|
||||
would need to be made conditional upon a newer object timestamp not existing in
|
||||
the old location. This is more complicated than the proposed method.
|
||||
|
||||
Enable partition power reduction
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Ring power reduction is not easily achieved with the approach presented in this
|
||||
proposal because there is no guarantee that partitions in the current epoch
|
||||
that will be merged into partitions in the next epoch are located on the same
|
||||
device. File contents are therefore likely to need copying between devices
|
||||
during a preparation phase.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
alistair.coles@hp.com
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
#. modify ring classes to support new attributes
|
||||
#. modify ringbuilder to manage new attributes
|
||||
#. modify backend servers to duplicate links to files in future epoch partition
|
||||
directories
|
||||
#. make backend servers and relinker report their status in a way that recon
|
||||
can report e.g. servers report when a new ring epoch has been loaded, the
|
||||
relinker reports when all relinking has been completed.
|
||||
#. make recon support reporting these states
|
||||
#. modify code that assumes storage-directory is objects[-policy_index] to
|
||||
be aware of epoch prefix
|
||||
#. make swift-dispersion-populate and swift-dispersion-report epoch-aware
|
||||
#. implement relinker daemon
|
||||
#. document process
|
||||
|
||||
Repositories
|
||||
------------
|
||||
|
||||
No new git repositories will be created.
|
||||
|
||||
Servers
|
||||
-------
|
||||
|
||||
No new servers are created.
|
||||
|
||||
DNS Entries
|
||||
-----------
|
||||
|
||||
No DNS entries will need to be created or updated.
|
||||
|
||||
Documentation
|
||||
-------------
|
||||
|
||||
Process will be documented in the administrator's guide. Additions will be made
|
||||
to the ring-builder documents.
|
||||
|
||||
Security
|
||||
--------
|
||||
|
||||
No security issues are foreseen.
|
||||
|
||||
Testing
|
||||
-------
|
||||
|
||||
Unit tests will be added for changes to ring-builder, ring classes and
|
||||
object server.
|
||||
|
||||
Probe tests will be needed to verify the process of increasing ring power.
|
||||
|
||||
Functional tests will be unchanged.
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None
|
@ -1,114 +0,0 @@
|
||||
::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
..
|
||||
|
||||
==============================================
|
||||
Send notifications on PUT/POST/DELETE requests
|
||||
==============================================
|
||||
|
||||
Swift should be able to send out notifications if new objects are uploaded,
|
||||
metadata has been changed or data has been deleted.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Currently there is no way to detect changes in a given container except listing
|
||||
it's contents and comparing timestamps. This makes it difficult and slow in case
|
||||
there are a lot of objects stored, and it is not very efficient at all.
|
||||
Some external services might be interested when an object got uploaded, updated
|
||||
or deleted; for example to store the metadata in an external database for
|
||||
searching or to trigger specific events like computing on object data.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
A new middleware should be added that can be configured to run inside the proxy
|
||||
server pipeline.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
Another option might be to analyze and parse logfiles, aggregating data
|
||||
into notifications per account and sending batches of updates to an external
|
||||
service. However, performance is most likely worse since there is a lot of
|
||||
string parsing involved, and a central logging service might be required to send
|
||||
notifications in order.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Sending out notifications should happen when an object is modified. That means
|
||||
every successful object change (PUT, POST, DELETE) should trigger an action and
|
||||
send out an event notification.
|
||||
It should be configurable on either an account or container level that
|
||||
notifications should be sent; this leaves it up to the user to decide where they
|
||||
end up and if a possible performance impact is acceptable.
|
||||
An implementation should be developed as an additional middleware inside the Swift
|
||||
proxy, and make use of existing queuing implementations within OpenStack,
|
||||
namely Zaqar (https://wiki.openstack.org/wiki/Zaqar).
|
||||
It needs to be discussed if metadata that is stored along the object should be
|
||||
included in the notification or not; if there is a lot of metadata the
|
||||
notifications could get quite large. A possible trade-off might be a threshold
|
||||
for included metadata, for example only the first X bytes. Or send no metadata
|
||||
at all, but only the account/container/objectname.
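A bare-bones sketch of such a middleware is shown below; it is deliberately simplified (plain WSGI, no Swift helpers), and ``enqueue`` stands in for whatever Zaqar client call the real implementation would use:

::

    class NotifierMiddleware(object):
        """Emit a notification after successful object modifications."""

        def __init__(self, app, enqueue):
            self.app = app
            self.enqueue = enqueue  # placeholder for a Zaqar client call

        def __call__(self, env, start_response):
            method = env['REQUEST_METHOD']
            path = env['PATH_INFO']

            def _start_response(status, headers, exc_info=None):
                # Only notify on successful object-modifying requests; a
                # real implementation would also check that the path is an
                # object path and that notifications are enabled for the
                # container.
                if method in ('PUT', 'POST', 'DELETE') and status[0] == '2':
                    self.enqueue({'method': method, 'path': path,
                                  'status': status})
                return start_response(status, headers, exc_info)

            return self.app(env, _start_response)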
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
cschwede
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
Develop middleware for the Swift proxy server including functional tests.
|
||||
|
||||
Update Swift functional test VMs to include Zaqar service for testing.
|
||||
|
||||
Repositories
|
||||
------------
|
||||
|
||||
None
|
||||
|
||||
Servers
|
||||
-------
|
||||
|
||||
Functional tests require either a running Zaqar service on the testing VM, or a
|
||||
dummy implementation that acts like a Zaqar queue.
|
||||
|
||||
DNS Entries
|
||||
-----------
|
||||
|
||||
None
|
||||
|
||||
Documentation
|
||||
-------------
|
||||
|
||||
Add documentation for new middleware
|
||||
|
||||
Security
|
||||
--------
|
||||
|
||||
Notifications should be just enabled or disabled per-container, and the
|
||||
receiving server should be set only in the middleware configuration setting.
|
||||
This prevents users from forwarding events to their own external site that the
|
||||
operator is not aware of.
|
||||
|
||||
Enabling or disabling should be restricted to account owners.
|
||||
|
||||
Sent notifications include the account/container/objectname, thus traffic should
|
||||
be transmitted over a private network or SSL-encrypted.
|
||||
|
||||
Testing
|
||||
-------
|
||||
|
||||
Unit and functional testing shall be included in a patch.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
- python-zaqarclient: https://github.com/openstack/python-zaqarclient
|
||||
- zaqar service running on the gate (inside the VM)
|
@ -1,146 +0,0 @@
|
||||
::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
..
|
||||
|
||||
=====================================
|
||||
tempurls with a prefix-based scope
|
||||
=====================================
|
||||
|
||||
The tempurl middleware should be allowed to use a prefix-based signature, which grants access to
|
||||
all objects with this specific prefix. This allows access to a whole container or pseudofolder
|
||||
with one signature, instead of using a new signature for each object.
|
||||
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
At the moment, if one wants to share a large number of objects inside a container/pseudofolder
|
||||
with external people, one has to create temporary urls for each object. Additionally, objects which
|
||||
are placed inside the container/pseudofolder after the generation of the signature cannot
|
||||
be accessed with the same signature.
|
||||
Prefix-based signatures would allow reusing the same signature for a large number of objects
|
||||
which share the same prefix.
|
||||
|
||||
Use cases:
|
||||
|
||||
1. We have one pseudofolder with 1000000 objects. We want to share this pseudofolder with external
|
||||
partners. Instead of generating 1000000 different signatures, we only need to generate one
|
||||
signature.
|
||||
2. We have a web-based application on top of swift, like the swiftbrowser
|
||||
(https://github.com/cschwede/django-swiftbrowser), which acts as a filebrowser. We want to
|
||||
support the sharing of temporary pseudofolders with external people. We do not know in advance
|
||||
which and how many objects will live inside the pseudofolder.
|
||||
With prefix-based signatures, we could develop the web application in such a way that the user
|
||||
could generate a temporary url for one pseudofolder, which could be used by external people
|
||||
for accessing all objects which will live inside it
|
||||
(this use case additionally needs a temporary container listing, to display which objects live
|
||||
inside the pseudofolder and a modification of the formpost middleware, please see spec
|
||||
https://review.openstack.org/#/c/225059/).
|
||||
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
The temporary url middleware would need to be changed; the code change should be small.
|
||||
If the client desires to use a prefix-based signature, they can append a URL parameter
|
||||
"temp_url_prefix" with the desired prefix (an empty prefix would specify the whole container),
|
||||
and the middleware would only use the container path + prefix for calculating the signature.
|
||||
Furthermore, the middleware would check if the object path really contains this prefix.
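A rough sketch of the signature calculation and check described here (the names are illustrative, and the exact hmac body layout is an assumption of this sketch, not settled API):

::

    import hmac
    from hashlib import sha1
    from time import time

    def prefix_signature(method, expires, account, container, prefix, key):
        # Sign the container path plus the prefix (an empty prefix covers
        # the whole container) instead of a full object path.
        path = '/v1/%s/%s/%s' % (account, container, prefix)
        hmac_body = '%s\n%s\n%s' % (method, expires, path)
        return hmac.new(key.encode(), hmac_body.encode(), sha1).hexdigest()

    def covered_by_prefix(object_path, account, container, prefix):
        # The middleware additionally checks that the requested object path
        # starts with the signed prefix; a real implementation would also
        # need to take care of component boundaries (e.g. 'p' vs 'p3').
        return object_path.startswith(
            '/v1/%s/%s/%s' % (account, container, prefix))

    expires = int(time() + 86400)
    sig = prefix_signature('GET', expires, 'AUTH_account', 'c', 'p', 'KEY')
    assert covered_by_prefix('/v1/AUTH_account/c/p/o1',
                             'AUTH_account', 'c', 'p')
    assert not covered_by_prefix('/v1/AUTH_account/c/o4',
                                 'AUTH_account', 'c', 'p')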
|
||||
|
||||
Let's look at two examples. In the first example, we want to allow a user to upload a bunch of objects
|
||||
in a container c.
|
||||
He first creates a tempurl, for example using the swift command line tool
|
||||
(modified version which supports tempurls on container-level scope):
|
||||
::
|
||||
|
||||
$swift tempurl --container-level PUT 86400 /v1/AUTH_account/c/ KEY
|
||||
/v1/AUTH_account/c/?temp_url_sig=9dd9e9c318a29c6212b01343a2d9f9a4c9deef2d&temp_url_expires=1439280760&temp_url_prefix=
|
||||
|
||||
The user then uploads a bunch of files, each time using the same container-level signature:
|
||||
::
|
||||
|
||||
$curl -XPUT --data-binary @file1 https://example.host/v1/AUTH_account/c/o1?temp_url_sig=9dd9e9c318a29c6212b01343a2d9f9a4c9deef2d&temp_url_expires=1439280760&temp_url_prefix=
|
||||
$curl -XPUT --data-binary @file2 https://example.host/v1/AUTH_account/c/o2?temp_url_sig=9dd9e9c318a29c6212b01343a2d9f9a4c9deef2d&temp_url_expires=1439280760&temp_url_prefix=
|
||||
$curl -XPUT --data-binary @file3 https://example.host/v1/AUTH_account/c/p/o3?temp_url_sig=9dd9e9c318a29c6212b01343a2d9f9a4c9deef2d&temp_url_expires=1439280760&temp_url_prefix=
|
||||
|
||||
In the next example, we want to allow an external user to download a whole pseudofolder p:
|
||||
::
|
||||
|
||||
$swift tempurl --container-level GET 86400 /v1/AUTH_account/c/p KEY
|
||||
|
||||
/v1/AUTH_account/c/p?temp_url_sig=4e755839d19762e06c12d807eccf46ff3224cb3f&temp_url_expires=1439281346&temp_url_prefix=p
|
||||
|
||||
$curl https://example.host/v1/AUTH_account/c/p/o1?temp_url_sig=4e755839d19762e06c12d807eccf46ff3224cb3f&temp_url_expires=1439281346&temp_url_prefix=p
|
||||
$curl https://example.host/v1/AUTH_account/c/p/o2?temp_url_sig=4e755839d19762e06c12d807eccf46ff3224cb3f&temp_url_expires=1439281346&temp_url_prefix=p
|
||||
$curl https://example.host/v1/AUTH_account/c/p/p2/o3?temp_url_sig=4e755839d19762e06c12d807eccf46ff3224cb3f&temp_url_expires=1439281346&temp_url_prefix=p
|
||||
|
||||
The following requests would be denied because of a missing/wrong prefix:
|
||||
::
|
||||
|
||||
$curl https://example.host/v1/AUTH_account/c/o4?temp_url_sig=4e755839d19762e06c12d807eccf46ff3224cb3f&temp_url_expires=1439281346&temp_url_prefix=p
|
||||
$curl https://example.host/v1/AUTH_account/c/p3/o5?temp_url_sig=4e755839d19762e06c12d807eccf46ff3224cb3f&temp_url_expires=1439281346&temp_url_prefix=p
|
||||
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
A new middleware could be introduced. But it seems that this would only lead to a lot of
|
||||
code duplication, as the changes are small in comparison to the original middleware.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
bartz
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
Add modifications to tempurl and respective test module.
|
||||
|
||||
Repositories
|
||||
------------
|
||||
|
||||
None
|
||||
|
||||
Servers
|
||||
-------
|
||||
|
||||
None
|
||||
|
||||
DNS Entries
|
||||
-----------
|
||||
|
||||
None
|
||||
|
||||
Documentation
|
||||
-------------
|
||||
|
||||
Modify documentation for tempurl middleware.
|
||||
|
||||
Security
|
||||
--------
|
||||
The calculation of the signature uses the hmac module (https://docs.python.org/2/library/hmac.html)
|
||||
in combination with the sha1 hash function.
|
||||
The difference between a prefix-based signature and the current object-path-based signature is that
|
||||
the path is shrunk to the prefix. The remaining part of the calculation stays the same.
|
||||
A shorter path induces a shorter message as input to the hmac calculation, which should not reduce
|
||||
the cryptographic strength. Therefore, I do not see security-related problems with introducing
|
||||
a prefix-based signature.
|
||||
|
||||
Testing
|
||||
-------
|
||||
|
||||
Tests should be added to the existing test module.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None
|
@ -1,123 +0,0 @@
|
||||
::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
..
|
||||
This template should be in ReSTructured text. Please do not delete
|
||||
any of the sections in this template. If you have nothing to say
|
||||
for a whole section, just write: "None". For help with syntax, see
|
||||
http://sphinx-doc.org/rest.html To test out your formatting, see
|
||||
http://www.tele3.cz/jbar/rest/rest.html
|
||||
|
||||
==================================================
|
||||
Swift Request Tagging for detailed logging/tracing
|
||||
==================================================
|
||||
|
||||
URL of your blueprint:
|
||||
|
||||
None.
|
||||
|
||||
Tag a particular request, or every 'x'-th request, so that it undergoes more detailed logging.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
Reasons for detailed logging:
|
||||
|
||||
- A Swift user is having problems that we cannot recreate; tagging this user's requests would provide more detailed logging.
|
||||
|
||||
- To better investigate a cluster for bottlenecks/problems: an internal user (admin/op) wants additional info in situations where the client is getting inconsistent container listings. With the Swift-inspector, we can tell which node is not returning the correct listings.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Existing: Swift-Inspector (https://github.com/hurricanerix/swift-inspector) currently
|
||||
provides middleware in the Proxy and Object servers. It relays info about a request back to the client, with the assumption that the client is actively making a decision to tag a request to trigger some action that would not otherwise occur.
|
||||
Current Inspectors:
|
||||
|
||||
- Timing -‘Inspector-Timing’: gives the amount of time it took for the proxy-server to process the request
|
||||
- Handlers – ‘Inspector-Handlers’: not implemented (meant to return the account/container/object servers that were contacted in the request) ‘Inspector-Handlers-Proxy’: returns the proxy that handled the request
|
||||
- Nodes - ‘Inspector-Nodes’: returns what account/container/object servers the path resides on ‘Inspector-More-Nodes’: returns extra nodes for handoff.
|
||||
|
||||
Changes:
|
||||
|
||||
- Add a logging inspector to the above inspectors, which would enable detailed logging for tagged requests.
|
||||
- Add the capability to let the system decide (instead of the client) to tag a request; it would also be nice to add rules to trigger actions like extra logging.
|
||||
|
||||
Possible tagging criteria (a sketch follows this list); a request could be tagged:
|
||||
|
||||
- every 'x' requests, or for a percentage of all requests.
|
||||
|
||||
- based on something in the request/response headers (e.g. if the HTTP method is DELETE, or the response is sending a specific status code back)
|
||||
|
||||
- based on a specific account/container/object/feature.
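A very rough sketch of what such tagging rules might look like (entirely illustrative; no such interface exists in swift-inspector today):

::

    import random

    def should_tag(env, sample_rate=0.01, methods=('DELETE',),
                   accounts=('AUTH_debug',)):
        # Tag a small percentage of all requests...
        if random.random() < sample_rate:
            return True
        # ...or requests using particular methods...
        if env.get('REQUEST_METHOD') in methods:
            return True
        # ...or requests against specific accounts (or, with a bit more
        # parsing, specific containers/objects).
        parts = env.get('PATH_INFO', '').split('/')
        return len(parts) > 2 and parts[2] in accounts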
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
- Logging: log collector/log aggregator like logstash.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
https://launchpad.net/~shashirekha-j-gundur
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
- Add a ‘Logging’ inspector to the existing inspectors, to enable detailed logs.
|
||||
|
||||
- Add rules to decide which requests should be tagged.
|
||||
|
||||
- Trigger actions like logging.
|
||||
|
||||
- Restrict access to the displayed nodes/inventory list to admins/ops only.
|
||||
|
||||
- Figure out how hmac_key access (Inspector-Sig) and ‘Logging’ work together.
|
||||
|
||||
Repositories
|
||||
------------
|
||||
|
||||
Will any new git repositories need to be created? Yes.
|
||||
|
||||
Servers
|
||||
-------
|
||||
|
||||
Will any new servers need to be created? No.
|
||||
|
||||
What existing servers will be affected? Proxy and Object servers.
|
||||
|
||||
DNS Entries
|
||||
-----------
|
||||
|
||||
Will any other DNS entries need to be created or updated? No.
|
||||
|
||||
Documentation
|
||||
-------------
|
||||
|
||||
Will this require a documentation change? Yes , Swift-inspector docs.
|
||||
|
||||
Will it impact developer workflow? No.
|
||||
|
||||
Will additional communication need to be made? No.
|
||||
|
||||
Security
|
||||
--------
|
||||
|
||||
None.
|
||||
|
||||
Testing
|
||||
-------
|
||||
|
||||
Unit tests.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
- Swift-Inspector https://github.com/hurricanerix/swift-inspector
|
||||
|
||||
- Does it require a new puppet module? No.
|
@ -1,81 +0,0 @@
|
||||
::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
===============================
|
||||
PACO Single Process deployments
|
||||
===============================
|
||||
|
||||
Since the release of the DiskFile API, there have been a number of different
|
||||
implementations providing the ability to store Swift objects in
|
||||
third-party storage systems. Commonly these systems provide the durability and
|
||||
availability of the objects (e.g., GlusterFS, GPFS), thus requiring the object
|
||||
ring to be created with only one replica.
|
||||
|
||||
A typical deployment style for this configuration is a "PACO" deployment,
|
||||
where the proxy, account, container and object services are running on the same
|
||||
node. The object ring is built in such a way that the proxy server always
|
||||
sends requests to the local object server. The object server (with its
|
||||
third-party DiskFile) is then responsible for writing the data to the underlying
|
||||
storage system which will then distribute the data according to its own
|
||||
policies.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
In a typical swift deployment, proxy nodes send data to object
|
||||
servers running on different nodes and the object servers write the data
|
||||
directly to disk. In the case of third-party storage systems, the object server
|
||||
typically makes another network connection to send the object to that storage
|
||||
system, adding some latency to the data path.
|
||||
|
||||
Even when the proxy and object servers are on the same node, latency is still
|
||||
introduced due to RPC communication over local network.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
For the scenario of single-replica PACO deployments, the proxy server would
|
||||
be sending data directly to the third-party storage systems. To accomplish this
|
||||
we would like to call the object wsgi application directly from
|
||||
the proxy process instead of making the additional network connection.
|
||||
|
||||
This proposed solution focuses on reducing the proxy-to-object-server latency.
|
||||
Proxy to account and/or container communications would stay the same for now
|
||||
and be addressed in a later patch.
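As a sketch of the idea (assuming paste.deploy can load the object-server app from the single-process config mentioned in the prototype below; the helper here is illustrative and not part of the WiP patch):

::

    from paste.deploy import loadapp

    # Load the object-server WSGI application once, inside the proxy process.
    object_app = loadapp('config:/etc/swift/single-process.conf')

    def call_object_server_inprocess(environ):
        # Invoke the object WSGI app directly instead of opening a socket
        # to a separate object-server process.
        captured = []

        def start_response(status, headers, exc_info=None):
            captured.append((status, headers))

        body = b''.join(object_app(environ, start_response))
        status, headers = captured[0]
        return status, headers, body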
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
thiago@redhat.com
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
A WiP patch has been submitted: https://review.openstack.org/#/c/159285/.
|
||||
The work that has been done recently to the Object Controllers in the proxy
|
||||
servers provides the ability for a very nice separation of the code.
|
||||
|
||||
TODOs and where further investigation is needed:
|
||||
|
||||
* How to load the object WSGI application instance in the proxy process?
|
||||
* How to add support for multiple storage policies?
|
||||
|
||||
Prototype
|
||||
---------
|
||||
|
||||
To test patch `159285 <https://review.openstack.org/#/c/159285/>`_ follow these
|
||||
steps:
|
||||
|
||||
#. Create a new single-replica storage system. Update swift.conf and create a new
|
||||
ring. The port provided during ring creation will not be used for anything.
|
||||
#. Create an object-server config file: ``/etc/swift/single-process.conf``.
|
||||
This configuration file can look like any other object-server configuration
|
||||
file, just make sure it specifies the correct device the object server
|
||||
should be writing to. For example, in the case of `Swift-on-File <https://github.com/stackforge/swiftonfile>`_
|
||||
object server, the device is the mountpoint of the shared filesystem (i.e.,
|
||||
Gluster, GPFS).
|
||||
#. Start the proxy.
|
@ -1,295 +0,0 @@
|
||||
|
||||
::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
====================
|
||||
Swift Symbolic Links
|
||||
====================
|
||||
|
||||
1. Problem description
|
||||
======================
|
||||
|
||||
With the advent of storage policies and erasure codes, moving an
|
||||
object between containers is becoming increasingly useful. However, we
|
||||
don't want to break existing references to the object when we do so.
|
||||
|
||||
For example, a common object lifecycle has the object starting life
|
||||
"hot" (i.e. frequently requested) and gradually "cooling" over time
|
||||
(becoming less frequently requested). The user will want an object to
|
||||
start out replicated for high requests-per-second while hot, but
|
||||
eventually transition to EC for lower storage cost once cold.
|
||||
|
||||
A completely different use case is when an application is sharding
|
||||
objects across multiple containers, but finds that it needs to use
|
||||
even more containers; for example, going from 256 containers up to
|
||||
4096 as write rate goes up. The application could migrate to the new
|
||||
schema by creating 4096-sharded references for all 256-sharded
|
||||
objects, thus avoiding a lot of data movement.
|
||||
|
||||
Yet a third use case is a user who has large amounts of
|
||||
infrequently-accessed data that is stored replicated (because it was
|
||||
uploaded prior to Swift's erasure-code support) and would like to
|
||||
store it erasure-coded instead. The user will probably ask for Swift
|
||||
to allow storage-policy changes at the container level, but as that is
|
||||
fraught with peril, we can offer them this instead.
|
||||
|
||||
|
||||
2. Proposed change
|
||||
==================
|
||||
|
||||
Swift will gain the notion of a symbolic link ("symlink") object. This
|
||||
object will reference another object. GET, HEAD, and OPTIONS
|
||||
requests for a symlink object will operate on the referenced object.
|
||||
DELETE and PUT requests for a symlink object will operate on the
|
||||
symlink object, not the referenced object, and will delete or
|
||||
overwrite it, respectively.
|
||||
|
||||
GET, HEAD, and OPTIONS requests can operate on a symlink object
|
||||
instead of the referenced object by adding a query parameter
|
||||
``?symlink=true`` to the request.
|
||||
|
||||
The ideal behaviour for POSTs would be for them to apply to the referenced
|
||||
object, but due to Swift's eventually-consistent nature this is not possible.
|
||||
Initially, it was suggested that POSTs should apply to the symlink directly,
|
||||
and during a GET or HEAD both the symlink and referenced object's headers would be
|
||||
compared and the newest returned. While this would work, the behaviour can be
|
||||
rather odd if an application were to ever GET or HEAD the referenced object directly
|
||||
as it would not contain any of the headers posted to the symlink.
|
||||
|
||||
Given all of this the best choice left is to fail a POST to a symlink and let
|
||||
the application take care of it, namely by posting the referenced object
|
||||
directly. Achieving this behaviour requires several changes:
|
||||
|
||||
1) To avoid a HEAD on every POST, the object server will be made aware of
|
||||
symlinks and can detect their presence and fail appropriately.
|
||||
2) Simply failing a POST in the object server when the object is a symlink will
|
||||
not work; Consider the following scenarios:
|
||||
|
||||
Scenario A::
|
||||
|
||||
- Add a symlink
|
||||
T0 - PUT /accnt/cont/obj?symlink=true
|
||||
- Overwrite symlink with an regular object
|
||||
T1 - PUT /accnt/cont/obj
|
||||
- Assume at this point some of the primary nodes were down so handoff nodes
|
||||
were used.
|
||||
T2 - POST /accnt/cont/obj
|
||||
- Depending on the object server hit it may see obj as either a symlink or a
|
||||
regular object, though we know in time it will indeed be a real object.
|
||||
|
||||
Scenario B::
|
||||
|
||||
- Add a regular object
|
||||
T0 - PUT /accnt/cont/obj
|
||||
- Overwrite regular object with a symlink
|
||||
T1 - PUT /accnt/cont/obj?symlink=true
|
||||
- Assume at this point some of the primary nodes were down so handoff nodes
|
||||
were used.
|
||||
T2 - POST /accnt/cont/obj
|
||||
- Depending on the object server hit it may see obj as either a symlink or a
|
||||
regular object, though we know in time it will indeed be a symlink.
|
||||
|
||||
Given the scenarios above, at T2 (i.e. during the POST) it is possible that some
object servers see a symlink while others see a regular object, so it is not
possible to simply fail the POST of a symlink. Instead, the following behaviour
will be used: the object server will always apply the POST, whether the object
is a symlink or a regular object, but it will still return an error to the
client if it believes it has seen a symlink. In scenario A) this means the POST
at T2 may fail even though the update is applied to the regular object, which is
the correct behaviour. In scenario B) it means the POST at T2 may fail while the
update is applied to the symlink; this is not ideal but not incorrect per se,
and the error returned to the application should cause it to apply the POST to
the referenced object, which, given the initial point raised earlier, is indeed
desirable.
|
||||
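
As a rough illustration of the application-side handling described above, the
following sketch uses the ``requests`` library; the storage URL, token, and the
exact error status returned for a POST to a symlink are placeholders rather
than part of this spec::

    import requests

    STORAGE_URL = 'http://proxy.example.com/v1/AUTH_test'   # placeholder
    TOKEN = {'X-Auth-Token': 'AUTH_tk_placeholder'}          # placeholder

    def post_metadata(obj_path, metadata):
        """POST metadata; if that fails, retry against the referenced object."""
        headers = dict(TOKEN)
        headers.update(metadata)
        resp = requests.post(STORAGE_URL + obj_path, headers=headers)
        if resp.ok:
            return resp
        # Assumed: the error means obj_path is a symlink.  Find its target
        # with a HEAD on the symlink itself and re-apply the POST there.
        link = requests.head(STORAGE_URL + obj_path,
                             params={'symlink': 'true'}, headers=TOKEN)
        container = link.headers.get('X-Object-Symlink-Target-Container',
                                     obj_path.strip('/').split('/')[0])
        target = link.headers['X-Object-Symlink-Target-Object']
        return requests.post('%s/%s/%s' % (STORAGE_URL, container, target),
                             headers=headers)
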
|
||||
The aim is for Swift symlinks to operate analogously to Unix symbolic
|
||||
links (except where it does not make sense to do so).
|
||||
|
||||
|
||||
2.1. Alternatives
|
||||
-----------------
|
||||
|
||||
One could use a single-segment SLO manifest to achieve a similar
|
||||
effect. However, the ETag of a SLO manifest is the MD5 of the ETags of
|
||||
its segments, so using a single-segment SLO manifest changes the ETag
|
||||
of the object. Also, object metadata (X-Object-Meta-\*) would have to
|
||||
be copied to the SLO manifest since metadata from SLO segments does
|
||||
not appear in the response. Further, SLO manifests contain the ETag of
|
||||
the referenced segments, and if a segment changes, the manifest
|
||||
becomes invalid. This is not a desirable property for symlinks.
|
||||
|
||||
A DLO manifest does not validate ETags, but it still fails to preserve
|
||||
the referenced object's ETag and metadata, so it is also unsuitable.
|
||||
Further, since DLOs are based on object name prefixes, the upload of a
|
||||
new object (e.g. ``thesis.doc``, then later ``thesis.doc.old``) could
|
||||
cause corrupted downloads.
|
||||
|
||||
Also, DLOs and SLOs cannot use each other as segments, while Swift
|
||||
symlinks can reference DLOs and SLOs *and* act as segments in DLOs and
|
||||
SLOs.
|
||||
|
||||
3. Client-facing API
|
||||
====================
|
||||
|
||||
Clients create a Swift symlink by performing a zero-length PUT request
|
||||
with the query parameter ``?symlink=true`` and the header
|
||||
``X-Object-Symlink-Target-Object: <object>``.
|
||||
|
||||
For a cross-container symlink, also include the header
|
||||
``X-Object-Symlink-Target-Container: <container>``. If omitted, it defaults to
|
||||
the container of the symlink object.
|
||||
|
||||
For a cross-account symlink, also include the header
|
||||
``X-Object-Symlink-Target-Account: <account>``. If omitted, it defaults to
|
||||
the account of the symlink object.
|
||||
|
||||
Symlinks must be zero-byte objects. Attempting to PUT a symlink
|
||||
with a nonempty request body will result in a 400-series error.
|
||||
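
For illustration only, such a PUT might look as follows with the ``requests``
library (the endpoint, token, and object names are placeholders)::

    import requests

    resp = requests.put(
        'http://proxy.example.com/v1/MY_acct/con/link-to-obj',  # placeholder
        params={'symlink': 'true'},
        headers={
            'X-Auth-Token': 'AUTH_tk_placeholder',              # placeholder
            'X-Object-Symlink-Target-Container': 'other-con',
            'X-Object-Symlink-Target-Object': 'obj',
            'Content-Length': '0',           # symlinks must be zero-byte
        },
        data=b'')
    resp.raise_for_status()
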
|
||||
The referenced object need not exist at symlink-creation time. This
|
||||
mimics the behavior of Unix symbolic links. Also, if we ever make bulk
|
||||
uploads work with symbolic links in the tarballs, then we'll have to
|
||||
avoid validation. ``tar`` just appends files to the archive as it
|
||||
finds them; it does not push symbolic links to the back of the
|
||||
archive. Thus, there's a 50% chance that any given symlink in a
|
||||
tarball will precede its referent.
|
||||
|
||||
|
||||
3.1 Example: Move an object to EC storage
|
||||
-----------------------------------------
|
||||
|
||||
Assume the object is /v1/MY_acct/con/obj
|
||||
|
||||
1. Obtain an EC-storage-policy container either by finding a
|
||||
pre-existing one or by making a container PUT request with the
|
||||
right X-Storage-Policy header.
|
||||
|
||||
2. Make a COPY request to copy the object into the EC-policy
|
||||
container, e.g.::
|
||||
|
||||
COPY /v1/MY_acct/con/obj
|
||||
Destination: ec-con/obj
|
||||
|
||||
3. Overwrite the replicated object with a symlink object::
|
||||
|
||||
PUT /v1/MY_acct/con/obj?symlink=true
|
||||
X-Object-Symlink-Target-Container: ec-con
|
||||
X-Object-Symlink-Target-Object: obj
|
||||
|
||||
4. Interactions With Existing Features
|
||||
======================================
|
||||
|
||||
4.1 COPY requests
|
||||
-----------------
|
||||
|
||||
If you copy a symlink without ``?symlink=true``, you get a copy of the
|
||||
referenced object. If you copy a symlink with ``?symlink=true``, you
|
||||
get a copy of the symlink; it will refer to the same object,
|
||||
container, and account.
|
||||
|
||||
However, if you copy a symlink without
|
||||
``X-Object-Symlink-Target-Container`` between containers, or a symlink
|
||||
without ``X-Object-Symlink-Target-Account`` between accounts, the new
|
||||
symlink will refer to a different object.
|
||||
|
||||
4.2 Versioned Containers
|
||||
------------------------
|
||||
|
||||
These will definitely interact. We should probably figure out how.
|
||||
|
||||
|
||||
4.3 Object Expiration
|
||||
---------------------
|
||||
|
||||
There's nothing special here. If you create the symlink with
|
||||
``X-Delete-At``, the symlink will get deleted at the appropriate time.
|
||||
|
||||
If you use a plain POST to set ``X-Delete-At`` on a symlink, it gets
|
||||
set on the referenced object just like other object metadata. If you
|
||||
use POST with ``?symlink=true`` to set ``X-Delete-At`` on a symlink,
|
||||
it will be set on the symlink itself.
|
||||
|
||||
|
||||
4.4 Large Objects
|
||||
-----------------
|
||||
|
||||
Since we'll almost certainly end up implementing symlinks as
|
||||
middleware, we'll order the pipeline like this::
|
||||
|
||||
[pipeline:main]
|
||||
pipeline = catch_errors ... slo dlo symlink ... proxy-server
|
||||
|
||||
This way, you can create a symlink whose target is a large object
|
||||
*and* a large object can reference symlinks as segments.
|
||||
|
||||
This also works if we decide to implement symlinks in the proxy
|
||||
server, though that would only happen if a compelling reason were
|
||||
found.
|
||||
|
||||
|
||||
4.5 User Authorization
|
||||
----------------------
|
||||
|
||||
Authorization will be checked for both the symlink and the referenced
|
||||
object. If the user is authorized to see the symlink but not the
|
||||
referenced object, they'll get a 403, same as if they'd tried to
|
||||
access the referenced object directly.
|
||||
|
||||
|
||||
4.6. Quotas
|
||||
-----------
|
||||
|
||||
Nothing special needed here. A symlink counts as 1 object toward an
|
||||
object-count quota. Since symlinks are zero bytes, they do not count
|
||||
toward a storage quota, and we do not need to write any code to make
|
||||
that happen.
|
||||
|
||||
|
||||
4.7 list_endpoints / Hadoop / ZeroVM
|
||||
------------------------------------
|
||||
|
||||
If the application talks directly to the object server and fetches a
|
||||
symlink, it's up to the application to deal with it. Applications that
|
||||
bypass the proxy should either avoid use of symlinks or should know
|
||||
how to handle them.
|
||||
|
||||
The same is true for SLO, DLO, versioning, erasure codes, and other
|
||||
services that the Swift proxy server provides, so we are not without
|
||||
precedent here.
|
||||
|
||||
|
||||
4.8 Container Sync
|
||||
------------------
|
||||
|
||||
Symlinks are synced like every other object. If the referenced object
|
||||
in cluster A has a different container name than in cluster B, then
|
||||
the symlink will point to the wrong place in one of the clusters.
|
||||
|
||||
Intra-container symlinks (those with only
|
||||
``X-Object-Symlink-Target-Object``) will work correctly on both
|
||||
clusters. Also, if containers are named identically on both clusters,
|
||||
inter-container symlinks (those with
|
||||
``X-Object-Symlink-Target-Object`` and
|
||||
``X-Object-Symlink-Target-Container``) will work correctly too.
|
||||
|
||||
|
||||
4.9 Bulk Uploads
|
||||
----------------
|
||||
|
||||
Currently, bulk uploads ignore all non-file members in the uploaded
|
||||
tarball. This could be expanded to also process symbolic-link members
|
||||
(i.e. those for which ``tarinfo.issym() == True``) and create symlink
|
||||
objects from them. This is not necessary for the initial
|
||||
implementation of Swift symlinks, but it would be nice to have.
|
||||
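
As a sketch of what that could look like, the following helper uses only the
standard ``tarfile`` module to pick out symlink members and derive the headers
a symlink PUT would need; how the bulk middleware would actually issue those
PUTs is left out::

    import tarfile

    def symlink_members(tar_path):
        """Yield (object name, symlink headers) for tar symlink members."""
        with tarfile.open(tar_path) as tar:
            for member in tar:
                if not member.issym():
                    continue      # regular files are handled as they are today
                # member.linkname is the path the tar symlink points at; here
                # it is assumed to name an object in the same container.
                yield member.name, {
                    'X-Object-Symlink-Target-Object': member.linkname,
                    'Content-Length': '0',
                }
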
|
||||
4.10 Swiftclient
|
||||
----------------
|
||||
|
||||
python-swiftclient could download Swift symlinks as Unix symlinks if a
|
||||
flag is given, or it could upload Unix symlinks as Swift symlinks in
|
||||
some cases. This is not necessary for the initial implementation of
|
||||
Swift symlinks, and is mainly mentioned here to show that
|
||||
python-swiftclient was not forgotten.
|
@ -1,84 +0,0 @@
|
||||
::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==================
|
||||
Test Specification
|
||||
==================
|
||||
|
||||
This is a test specification. It should be removed after the first
|
||||
real specification is merged.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
A detailed description of the problem.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
Here is where you cover the change you propose to make in detail. How do you
|
||||
propose to solve this problem?
|
||||
|
||||
If this is one part of a larger effort make it clear where this piece ends. In
|
||||
other words, what's the scope of this effort?
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
This is an optional section, where it does apply we'd just like a demonstration
|
||||
that some thought has been put into why the proposed approach is the best one.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Who is leading the writing of the code? Or is this a blueprint where you're
|
||||
throwing it out there to see who picks it up?
|
||||
|
||||
If more than one person is working on the implementation, please designate the
|
||||
primary author and contact.
|
||||
|
||||
Primary assignee:
|
||||
<launchpad-id or None>
|
||||
|
||||
Can optionally list additional ids if they intend on doing substantial
|
||||
implementation work on this blueprint.
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
Work items or tasks -- break the feature up into the things that need to be
|
||||
done to implement it. Those parts might end up being done by different people,
|
||||
but we're mostly trying to understand the timeline for implementation.
|
||||
|
||||
Repositories
|
||||
------------
|
||||
|
||||
Will any new git repositories need to be created?
|
||||
|
||||
Servers
|
||||
-------
|
||||
|
||||
Will any new servers need to be created? What existing servers will
|
||||
be affected?
|
||||
|
||||
DNS Entries
|
||||
-----------
|
||||
|
||||
Will any other DNS entries need to be created or updated?
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
- Include specific references to specs and/or stories in infra, or in
|
||||
other projects, that this one either depends on or is related to.
|
||||
|
||||
- Does this feature require any new library or program dependencies
|
||||
not already in use?
|
||||
|
||||
- Does it require a new puppet module?
|
@ -1,496 +0,0 @@
|
||||
::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
*************************
|
||||
Automated Tiering Support
|
||||
*************************
|
||||
|
||||
1. Problem Description
|
||||
======================
|
||||
Data hosted on long-term storage systems experience gradual changes in
|
||||
access patterns as part of their information lifecycles. For example,
|
||||
empirical studies by companies such as Facebook show that as image data
|
||||
age beyond their creation times, they become more and more unlikely to be
|
||||
accessed by users, with access rates dropping exponentially at times [1].
|
||||
Long retention periods, as is the case with data stored on cold storage
|
||||
systems like Swift, increase the possibility of such changes.
|
||||
|
||||
Tiering is an important feature provided by many traditional file & block
|
||||
storage systems to deal with changes in data “temperature”. It enables
|
||||
seamless movement of inactive data from high performance storage media to
|
||||
low-cost, high capacity storage media to meet customers’ TCO (total cost of
|
||||
ownership) requirements. As scale-out object storage systems like Swift are
|
||||
starting to natively support multiple media types like SSD, HDD, tape and
|
||||
different storage policies such as replication and erasure coding, it becomes
|
||||
imperative to complement the wide range of available storage tiers (both
|
||||
virtual and physical) with automated data tiering.
|
||||
|
||||
|
||||
2. Tiering Use Cases in Swift
|
||||
=============================
|
||||
Swift users and operators can adapt to changes in access characteristics of
|
||||
objects by transparently converting their storage policies to cater to the
|
||||
goal of matching overall business needs ($/GB, performance, availability) with
|
||||
where and how the objects are stored.
|
||||
|
||||
Here are some examples of how objects can be moved between Swift containers of
different storage policies as they age:
|
||||
|
||||
[SSD-based container] --> [HDD-based container]
|
||||
|
||||
[HDD-based container] --> [Tape-based container]
|
||||
|
||||
[Replication policy container] --> [Erasure coded policy container]
|
||||
|
||||
In some customer environments, a Swift container may not be the last storage
|
||||
tier. Examples of archival-class stores lower in cost than Swift include
|
||||
specialized tape-based systems [2], public cloud archival solutions such as
|
||||
Amazon Glacier and Google Nearline storage. Analogous to this proposed feature
|
||||
of tiering in Swift, Amazon S3 already has the in-built support to move
|
||||
objects between S3 and Glacier based on user-defined rules. Red Hat Ceph has
|
||||
recently added tiering capabilities as well.
|
||||
|
||||
|
||||
3. Goals
|
||||
========
|
||||
The main goal of this document is to propose a tiering feature in Swift that
|
||||
enables seamless movement of objects between containers belonging to different
|
||||
storage policies. It is “seamless” because users will not experience any
|
||||
disruption in namespace, access API, or availability of the objects subject to
|
||||
tiering.
|
||||
|
||||
Through new Swift API enhancements, Swift users and operators alike will have
|
||||
the ability to specify a tiering relationship between two containers and the
|
||||
associated data movement rules.
|
||||
|
||||
The focus of this proposal is to identify, create and bring together the
|
||||
necessary building blocks towards a baseline tiering implementation natively
|
||||
within Swift. While this narrow scope is intentional, the expectation is that
|
||||
the baseline tiering implementation will lay the foundation and not preclude
|
||||
more advanced tiering features in future.
|
||||
|
||||
4. Feature Dependencies
|
||||
=======================
|
||||
The following in-progress Swift features (aka specs) have been identified as
|
||||
core dependencies for this tiering proposal.
|
||||
|
||||
1. Swift Symbolic Links [3]
|
||||
2. Changing Storage Policies [4]
|
||||
|
||||
A few other specs are classified as nice-to-have dependencies, meaning that
|
||||
if they evolve into full implementations we will be able to demonstrate the
|
||||
tiering feature with advanced use cases and capabilities. However, they are
|
||||
not considered mandatory requirements for the first version of tiering.
|
||||
|
||||
3. Metadata storage/search [5]
|
||||
4. Tape support in Swift [6]
|
||||
|
||||
5. Implementation
|
||||
=================
|
||||
The proposed tiering implementation depends on several building blocks, some
|
||||
of which are unique to tiering, like the requisite API changes. They will be
|
||||
described in their entirety. Others like symlinks are independent features and
|
||||
have uses beyond tiering. Instead of re-inventing the wheel, the tiering
|
||||
implementation aims to leverage specific constructs that will be available
|
||||
through these in-progress features.
|
||||
|
||||
5.1 Overview
|
||||
------------
|
||||
For a quick overview of the tiering implementation, please refer to the Figure
|
||||
(images/tiering_overview.png). It highlights the flow of actions taking place
|
||||
within the proposed tiering engine.
|
||||
|
||||
1. Swift client creates a tiering relationship between two Swift containers by
|
||||
marking the source container with appropriate metadata.
|
||||
2. A background process named tiering-coordinator examines the source container
|
||||
and iterates through its objects.
|
||||
3. Tiering-coordinator identifies candidate objects for movement and de-stages
each object to the target container by issuing a copy request to an object server.
4. After an object is copied, tiering-coordinator replaces it with a symlink in
the source container pointing to the corresponding object in the target container.
|
||||
|
||||
|
||||
5.2 API Changes
|
||||
---------------
|
||||
Swift clients will be able to create a tiering relationship between two
|
||||
containers, i.e., source and target containers, by adding the following
|
||||
metadata to the source container.
|
||||
|
||||
X-Container-Tiering-Target: <target_container_name>
|
||||
X-Container-Tiering-Age: <threshold_object_age>
|
||||
|
||||
The metadata values can be set during the creation of the source container
|
||||
(PUT) operation or they can be set later as part of a container metadata
|
||||
update (POST) operation. Object age refers to the time elapsed since the
|
||||
object’s creation time (creation time is stored with the object as
|
||||
‘X-Timestamp’ header).
|
||||
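
Purely as an illustration, setting those values could look like the following
``requests``-based container POST; the endpoint and token are placeholders, and
the age threshold is assumed to be expressed in minutes, matching the
per-object header described below::

    import requests

    requests.post(
        'http://proxy.example.com/v1/AUTH_test/source-container',  # placeholder
        headers={
            'X-Auth-Token': 'AUTH_tk_placeholder',                 # placeholder
            'X-Container-Tiering-Target': 'ec-container',
            'X-Container-Tiering-Age': '43200',   # assumed minutes (~30 days)
        })
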
|
||||
The user semantics of setting the above container metadata are as follows.
|
||||
When objects in the source container become older than the specified threshold
|
||||
time, they become candidates for being de-staged to the target container. There
|
||||
are no guarantees on when exactly they will be moved or the precise location of
|
||||
the objects at any given time. Swift will operate on them asynchronously and
|
||||
relocate objects based on user-specified tiering rules. Once the tiering
|
||||
metadata is set on the source container, the user can expect levels of
|
||||
performance, reliability, etc. for its objects commensurate with the storage
|
||||
policy of either the source or target container.
|
||||
|
||||
One can override the tiering metadata for individual objects in the source
|
||||
container by setting the following per-object metadata,
|
||||
|
||||
X-Object-Tiering-Target: <target_container_name>
|
||||
X-Object-Tiering-Age: <object_age_in_minutes>
|
||||
|
||||
Tiering metadata set on an object takes precedence over the tiering metadata set
on the hosting container. However, if a container is not tagged with any tiering
metadata, the objects inside it will not be considered for tiering, regardless
of whether they carry any tiering-related metadata. Also, if the tiering age
threshold in the object metadata is lower than the value set on the container,
it will not take effect until the container age criterion is met.
|
||||
|
||||
An important invariant preserved by the tiering feature is the namespace of
|
||||
objects. As will be explained in later sections, after objects are moved they
|
||||
will be replaced immediately by symlinks that will allow users to continue
|
||||
foreground operations on objects as if no migrations have taken place. Please
|
||||
refer to section 7 on open questions for further commentary on the API topic.
|
||||
|
||||
To summarize, here are the steps that a Swift user must perform in order to
|
||||
initiate tiering between objects from a source container (S) to a target
|
||||
container (T) over time.
|
||||
|
||||
1. Create containers S and T with desired storage policies, say replication
|
||||
and erasure coding respectively
|
||||
2. Set the tiering-related metadata (X-Container-Tiering-*) on container S
|
||||
as described earlier in this section.
|
||||
3. Deposit objects into container S.
|
||||
4. If needed, override the default container settings for individual objects
|
||||
inside container S by setting object metadata (X-Object-Tiering-*).
|
||||
|
||||
It will also be possible to create cascading tiering relationships between
|
||||
more than two containers. For example, a sequence of tiering relationships
|
||||
between containers C1 -> C2 -> C3 can be established by setting appropriate
|
||||
tiering metadata on C1 and C2. When an object is old enough to be moved from
|
||||
C1, it will be deposited in C2. The timer will then start on the moved object
|
||||
in C2 and depending on the age settings on C2, the object will eventually be
|
||||
migrated to C3.
|
||||
|
||||
|
||||
5.3 Tiering Coordinator Process
|
||||
-------------------------------
|
||||
The tiering-coordinator is a background process similar to container-sync,
|
||||
container-reconciler and other container-* processes running on each container
|
||||
server. We can potentially re-use one of the existing container processes,
|
||||
specifically either container-sync or container-reconciler to perform the job of
|
||||
tiering-coordinator, but for the purposes of this discussion it will be assumed
|
||||
that it is a separate process.
|
||||
|
||||
The key actions performed by tiering-coordinator are
|
||||
|
||||
(a) Walk through containers marked with tiering metadata
|
||||
(b) Identify candidate objects for tiering within those containers
|
||||
(c) Initiate copy requests on candidate objects
|
||||
(d) Replace source objects with corresponding symlinks
|
||||
|
||||
We will discuss (a) and (b) in this section and cover (c) and (d) in subsequent
|
||||
sections. Note that in the first version of tiering, only one metric
|
||||
<object age> will be used to determine the eligibility of an object for
|
||||
migration.
|
||||
|
||||
The tiering-coordinator performs its operations in a series of rounds. In each
|
||||
round, it iterates through containers whose SQLite DBs it has direct access to
|
||||
on the container server it is running on. It checks if the container has the
|
||||
right X-Container-Tiering-* metadata. If present, it starts the scanning process
|
||||
to identify candidate objects. The scanning process leverages a convenient (but
|
||||
not necessary) property of the container DB that objects are listed in the
|
||||
chronological order of their creation times. That is, the first index in the
|
||||
container DB points to the object with oldest creation time, followed by next
|
||||
younger object and so on. As such, the scanning process described below is
|
||||
optimized for the object age criterion chosen for tiering v1 implementation.
|
||||
For extending to other tiering metrics, we refer the reader to section 6.1 for
|
||||
discussion.
|
||||
|
||||
Each container DB will have two persistent markers to track the progress of
|
||||
tiering – tiering_sync_start and tiering_sync_end. The marker tiering_sync_start
|
||||
refers to the starting index in the container DB up to which objects have already
|
||||
been processed. The marker tiering_sync_end refers to the index beyond which
|
||||
objects have not yet been considered for tiering. All the objects that fall
|
||||
between the two markers are the ones for which tiering is currently in progress.
|
||||
Note that the presence of persistent markers in the container DB helps with
|
||||
quickly resuming from previous work done in the event of container server
|
||||
crash/reboot.
|
||||
|
||||
When a container is selected for tiering for the first time, both the markers
|
||||
are initialized to -1. If the first object is old enough to meet the
|
||||
X-Container-Tiering-Age criterion, tiering_sync_start is set to 0. Then the
|
||||
second marker tiering_sync_end is advanced to an index that is the lesser of two
values: (i) tiering_sync_start + tier_max_objects_per_round (the latter being a
configurable value in /etc/swift/container.conf) or (ii) the largest index in
the container DB whose corresponding object meets the tiering age criterion.
|
||||
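
A minimal sketch of that marker-advance rule, using illustrative names rather
than the real container broker API (``timestamps`` is the container listing's
creation times in DB order)::

    def advance_tiering_sync_end(timestamps, start, max_per_round,
                                 age_threshold, now):
        """Return the new tiering_sync_end index for one round."""
        end = start
        upper = min(start + max_per_round, len(timestamps) - 1)
        for index in range(start + 1, upper + 1):
            if now - timestamps[index] < age_threshold:
                break            # this object and later ones are too young
            end = index
        return end
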
|
||||
The above marker settings will ensure two invariants. First, all objects
|
||||
between (and including) tiering_sync_start and tiering_sync_end are candidates
|
||||
for moving to the target container. Second, it will guarantee that the number
|
||||
of objects processed on the container in a single round is bound by the
|
||||
configuration parameter (tier_max_objects_per_round, say = 200). This will
|
||||
ensure that the coordinator process will round robin effectively amongst all
|
||||
containers on the server per round without spending undue amount of time on
|
||||
only a few.
|
||||
|
||||
After the markers are fixed, tiering-coordinator will issue a copy request
|
||||
for each object within the range. When the copy requests are completed, it
|
||||
updates tiering_sync_start = tiering_sync_end and moves on to the next
|
||||
container. When tiering-coordinator re-visits the same container after
|
||||
completing the current round, it restarts the scanning routine described
|
||||
above from tiering_sync_start = tiering_sync_end (except they are not both
|
||||
-1 this time).
|
||||
|
||||
In a typical Swift cluster, each container DB is replicated three times and
|
||||
resides on multiple container servers. Therefore, without proper
|
||||
synchronization, tiering-coordinator processes can end up conflicting with
|
||||
each other by processing the same container and same objects within. This
|
||||
can potentially lead to race conditions with non-deterministic behavior. We
|
||||
can overcome this issue by adopting the approach of divide-and-conquer
|
||||
employed by container-sync process. The range of object indices between
|
||||
(tiering_sync_start, tiering_sync_end) can be initially split up into as
|
||||
many disjoint regions as the number of tiering-coordinator processes
|
||||
operating on the same container. As they work through the object indices,
|
||||
each process might additionally complete others’ portions depending on the
|
||||
collective progress. For a detailed description of how container-sync
|
||||
processes implicitly communicate and make group progress, please refer
|
||||
to [7].
|
||||
|
||||
5.4 Object Copy Mechanism
|
||||
-------------------------
|
||||
For each candidate object that the tiering-coordinator deems eligible to move to
|
||||
the target container, it issues an ‘object copy’ request using an API call
|
||||
supported by the object servers. The API call will map to a method used by
|
||||
object-transferrer daemons running on the object servers. The
|
||||
tiering-coordinator can select any of the object servers (by looking up the ring
|
||||
datastructure corresponding to the object in source container policy) as a
|
||||
destination for the request.
|
||||
|
||||
The object-transferrer daemon is supposed to be optimized for converting an
|
||||
object from one storage policy to another. As per the ‘Changing policies’ spec,
|
||||
the object-transferrer daemon will be equipped with the right techniques to move
|
||||
objects between Replication -> EC, EC -> EC, etc. Alternatively, in the absence
|
||||
of object-transferrer, the tiering coordinator can simply make use of the
|
||||
server-side ‘COPY’ API that vanilla Swift exposes to regular clients. It can
|
||||
send the COPY request to a swift proxy server to clone the source object into
|
||||
the target container. The proxy server will perform the copy by first reading in
|
||||
(GET request) the object from any of the source object servers and creating a
|
||||
copy (PUT request) of the object in the target object servers. While this will
|
||||
work correctly for the purposes of the tiering coordinator, making use of the
|
||||
object-transferrer interface is likely to be a better option. Leveraging the
|
||||
specialized code in object-transferrer through a well-defined interface for
|
||||
copying an object between two different storage policy containers will make the
|
||||
overall tiering process efficient.
|
||||
|
||||
Here is an example interface represented by a function call in the
|
||||
object-transferrer code:
|
||||
|
||||
def copy_object(source_obj_path, target_obj_path)
|
||||
|
||||
The above method can be a wrapper over similar functionality used by the
|
||||
object-transferrer daemon. The tiering-coordinator will use this interface to
|
||||
call the function through a HTTP call.
|
||||
|
||||
copy_object(/A/S/O, /A/T/O)
|
||||
|
||||
where S is the source container and T is the target container. Note that the
|
||||
object name in the target container will be the same as in the source container.
|
||||
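
For the fallback path, a hedged sketch of the proxy-side COPY (placeholder
endpoint and token; in practice the coordinator would use an internal client
rather than an external token)::

    import requests

    def copy_via_proxy(account, source_container, target_container, obj):
        """Clone an object into the target container via server-side COPY."""
        url = 'http://proxy.example.com/v1/%s/%s/%s' % (        # placeholder
            account, source_container, obj)
        resp = requests.request(
            'COPY', url,
            headers={'X-Auth-Token': 'AUTH_tk_placeholder',     # placeholder
                     'Destination': '%s/%s' % (target_container, obj)})
        resp.raise_for_status()
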
|
||||
Upon receiving the copy request, the object server will first check if the
|
||||
source path is a symlink object. If it is a symlink, it will respond with an
|
||||
error to the tiering-coordinator to indicate that a symlink already exists.
|
||||
This behavior will ensure idempotence and guard against situations where
|
||||
tiering-coordinator crashes and retries a previously completed object copy
|
||||
request. Also, it avoids tiering zero-byte symlink objects created directly by
users. Secondly, the object server will check if the source object has
|
||||
tiering metadata in the form of X-Object-Tiering-* that overrides the default
|
||||
tiering settings on the source container. It may or may not perform the object
|
||||
copy depending on the result.
|
||||
|
||||
5.5 Symlink Creation
|
||||
--------------------
|
||||
After an object is successfully copied to the destination container, the
|
||||
tiering-coordinator will issue a ‘symlink create’ request to proxy server to
|
||||
replace the source object by a reference to the destination object. Waiting
|
||||
until the object copy is completed before replacing it by a symlink ensures
|
||||
safety in case of failures. The system could end up with an extra target
|
||||
object without a symlink pointing to it, but not the converse which
|
||||
constitutes data loss. Note that the symlink feature is currently
|
||||
work-in-progress and will also be available as an external API to swift clients.
|
||||
|
||||
When the symlink is created by the tiering-coordinator, it will need to ensure
|
||||
that the original object’s ‘X-Timestamp’ value is preserved on the symlink
|
||||
object. Therefore, it is proposed that in the symlink creation request, the
|
||||
original time field can be provided (tiering-coordinator can quickly read the
|
||||
original values from the container DB entry) as object user metadata, which is
|
||||
translated internally to a special sysmeta field by the symlink middleware.
|
||||
On subsequent user requests, the sysmeta field storing the correct creation
|
||||
timestamp will be sent to the user.
|
||||
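
A sketch of such a symlink-create request follows; the metadata header carrying
the original timestamp (``X-Object-Meta-Original-Timestamp``) is hypothetical,
since this spec leaves its exact name to the symlink middleware, and the
coordinator is treated as an ordinary authenticated client for simplicity::

    import requests

    def replace_with_symlink(storage_url, token, container, obj,
                             target_container, original_timestamp):
        """Overwrite the de-staged source object with a symlink."""
        resp = requests.put(
            '%s/%s/%s' % (storage_url, container, obj),
            params={'symlink': 'true'},
            headers={
                'X-Auth-Token': token,
                'X-Object-Symlink-Target-Container': target_container,
                'X-Object-Symlink-Target-Object': obj,
                # Hypothetical header, translated to sysmeta by the middleware.
                'X-Object-Meta-Original-Timestamp': original_timestamp,
                'Content-Length': '0',
            },
            data=b'')
        resp.raise_for_status()
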
|
||||
With the symlink successfully created, Swift users can continue to issue object
|
||||
requests like GET, PUT to the original namespace /Account/Container/Object. The
|
||||
Symlink middleware will ensure that the swift users do not notice the presence
|
||||
of a symlink object unless a query parameter ‘?symlink=true’ [3] is explicitly
|
||||
provided with the object request.
|
||||
|
||||
Users can also continue to read and update object metadata as before. It is not
|
||||
entirely clear at the time of this writing if the symlink object will store a
|
||||
copy of user metadata in its own extended attributes or if it will fetch the
|
||||
metadata from the referenced object for every HEAD/GET on the object. We will
|
||||
defer to whichever implementation that the symlink feature chooses to provide.
|
||||
|
||||
An interesting race condition is possible due to the time window between object
|
||||
copy request and symlink creation. If there is an interim PUT request issued by
|
||||
a swift user between the two, it will be overwritten by the internal symlink
|
||||
created by the tiering-coordinator. This is an incorrect behavior that we need
|
||||
to protect against. We can use the same technique [8] (with help of a second
|
||||
vector timestamp) that container-reconciler uses to resolve a similar race
|
||||
condition. The tiering-coordinator, at the time of symlink creation, can detect
|
||||
the race condition and undo the COPY request. It will have to delete the object
|
||||
that was created in the destination container. Though this is wasted work in
|
||||
the face of such race conditions, we expect them to be rare. If the user defines
tiering rules sensibly, there ought to be little to no foreground traffic for an
object that is being tiered.
|
||||
|
||||
6. Future Work
|
||||
===============
|
||||
|
||||
6.1 Other Tiering Criteria
|
||||
--------------------------
|
||||
The first version of tiering implementation will be heavily tailored (especially
|
||||
the scanning mechanism of tiering-coordinator) to the object age criterion. The
|
||||
convenient property of container DBs that store objects in the same order as
|
||||
they are created/overwritten lends to very efficient linear scanning for
|
||||
candidate objects.
|
||||
|
||||
In the future, we should be able to support advanced criteria such as read
|
||||
frequency counts, object size, metadata-based selection, etc. For example,
|
||||
consider the following hypothetical criterion:
|
||||
|
||||
"Tier objects from container S to container T if older than 1 month AND size >
|
||||
1GB AND tagged with metadata ‘surveillance-video’"
|
||||
|
||||
When the metadata search feature [5] is available in Swift, tiering-coordinator
|
||||
should be able to run queries to quickly retrieve the set of object names that
|
||||
match ad-hoc criteria on both user and system metadata. As the metadata search
|
||||
feature evolves, we should be able to leverage it to add custom metadata such
|
||||
as read counts, etc for our purposes.
|
||||
|
||||
6.2 Integration with External Storage Tiers
|
||||
-------------------------------------------
|
||||
The first implementation of tiering will only support object movement between
|
||||
Swift containers. In order to establish a tiering relationship between a swift
|
||||
container and an external storage backend, the backend must be mounted in Swift
|
||||
as a native container through the DiskFile API or other integration mechanisms.
|
||||
For instance, a target container fully hosted on GlusterFS or Seagate Kinetic
|
||||
drives can be created through Swift-on-file or Kinetic DiskFile implementations
|
||||
respectively.
|
||||
|
||||
The Swift community believes that a similar integration approach is necessary
|
||||
to support external storage systems as tiering targets. There is already work
|
||||
underway to integrate tape-based systems in Swift. In the same vein, future
|
||||
work is needed to integrate external systems like Amazon Glacier or vendor
|
||||
archival products via DiskFile drivers or other means.
|
||||
|
||||
7. Open Issues
|
||||
==============
|
||||
This section is structured as a series of questions and possible answers. With
|
||||
more feedback from the swift community, the open issues will be resolved and
|
||||
merged into the main document.
|
||||
|
||||
Q1: Can the target container exist on a different account than the source
|
||||
container?
|
||||
|
||||
Ans: The proposed API assumes that the target container is always on the same
|
||||
account as the source container. If this restriction is lifted, the proposed
|
||||
API needs to be modified appropriately.
|
||||
|
||||
Q2: When the client sets the tiering metadata on the source container, should
|
||||
the target container exist at that time? What if the user has no permissions on
|
||||
the target container? When is all the error checking done?
|
||||
|
||||
Ans: The error checking can be deferred to the tiering-coordinator process. The
|
||||
background process, upon detecting that the target container is unavailable, can
|
||||
skip performing any tiering activity on the source container and move on to the
|
||||
next container. However, it might be better to detect errors in the client path
|
||||
and report early. If the latter approach is chosen, middleware functionality is
|
||||
needed to sanity check tiering metadata set on containers.
|
||||
|
||||
Q3: How is the target container presented to the client? Would it be just like
|
||||
any other container with read/write permissions?
|
||||
|
||||
Ans: The target container will be just like any other container. The client is
|
||||
responsible for manipulating the contents in the target container correctly. In
|
||||
particular, it should be aware that there might be symlinks in source container
|
||||
pointing to target objects. Deletions or overwrites of objects directly using
|
||||
the target container namespace could render some symlinks useless or obsolete.
|
||||
|
||||
Q4: What is the behavior when conflicting tiering metadata are set over a
|
||||
period of time? For example, if the tiering age threshold is increased on a
|
||||
container with a POST metadata operation, will previously de-staged objects
|
||||
be brought back to the source container to match the new tiering rule?
|
||||
|
||||
Ans: Perhaps not. The new tiering metadata should probably only be applied to
|
||||
objects that have not yet been processed by tiering-coordinator. Previous
|
||||
actions performed by tiering-coordinator based on older metadata need not be
|
||||
reversed.
|
||||
|
||||
Q5: When a user issues a PUT operation to an object that has been de-staged to
|
||||
the target container earlier, what is the behavior?
|
||||
|
||||
Ans: The default symlink behavior should apply but it’s not clear what it will
|
||||
be. Will an overwrite PUT cause the symlink middleware to delete both the
|
||||
symlink and the object being pointed to?
|
||||
|
||||
Q6: When a user issues a GET operation to an object that has been de-staged to
|
||||
the target container earlier, will it be promoted back to source container?
|
||||
|
||||
Ans: The proposed implementation does not promote objects back to an upper tier
|
||||
seamless to the user. If needed, such a behavior can be easily added with help
|
||||
of a tiering middleware in the proxy server.
|
||||
|
||||
Q7: There is a mention of the ability to set cascading tiering relationships
|
||||
between multiple containers, C1 -> C2 -> C3. What if there is a cycle in this
|
||||
relationship graph?
|
||||
|
||||
Ans: A cycle should be prevented; otherwise we can run into at least one
complicated situation where a symlink might end up pointing to an object in the
same container with the same name, thereby overwriting the symlink! It is
possible to detect
|
||||
cycles at the time of tiering metadata creation in the client path with a
|
||||
tiering-specific middleware that is entrusted with the cycle detection by
|
||||
iterating through existing tiering relationships.
|
||||
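
One possible shape for that check (a sketch; ``get_tiering_target`` stands in
for however the middleware would read a container's X-Container-Tiering-Target
and is an assumption)::

    def creates_cycle(source_container, new_target, get_tiering_target):
        """True if pointing source_container at new_target closes a loop."""
        seen = {source_container}
        current = new_target
        while current is not None:
            if current in seen:
                return True      # we arrived back at a container already seen
            seen.add(current)
            current = get_tiering_target(current)  # None when untagged
        return False
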
|
||||
Q8: Are there any unexpected interactions of tiering with existing or new
|
||||
features like SLO/DLO, encryption, container sharding, etc ?
|
||||
|
||||
Ans: SLO and DLO segments should continue to work as expected. If an object
|
||||
server receives an object copy request for a SLO manifest object from a
|
||||
tiering-coordinator, it will iteratively perform the copy for each constituent
|
||||
object. Each constituent object will be replaced by a symlink. Encryption
|
||||
should also work correctly as it is almost entirely orthogonal to the tiering
|
||||
feature. Each object is treated as an opaque set of bytes by the tiering engine
|
||||
and it does not pay any heed to whether the object is cipher text or not.
|
||||
Dealing with container sharding might be tricky. Tiering-coordinator expects
|
||||
to linearly walk through the indices of a container DB. If the container DB
|
||||
is fragmented and stored in many different container servers, the scanning
|
||||
process can get complicated. Any ideas there?
|
||||
|
||||
8. References
|
||||
=============
|
||||
|
||||
1. http://www.enterprisetech.com/2013/10/25/facebook-loads-innovative-cold-storage-datacenter/
|
||||
2. http://www-03.ibm.com/systems/storage/tape/
|
||||
3. Symlinks in Swift. https://review.openstack.org/#/c/173609/
|
||||
4. Changing storage policies in Swift. https://review.openstack.org/#/c/168761/
|
||||
5. Add metadata search in Swift. https://review.openstack.org/#/c/180918/
|
||||
6. Tape support in Swift. https://etherpad.openstack.org/p/liberty-swift-tape-storage
|
||||
7. http://docs.openstack.org/developer/swift/overview_container_sync.html
|
||||
8. Container reconciler section at http://docs.openstack.org/developer/swift/overview_policies.html
|
@ -1,270 +0,0 @@
|
||||
::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
..
|
||||
This template should be in ReSTructured text. Please do not delete
|
||||
any of the sections in this template. If you have nothing to say
|
||||
for a whole section, just write: "None". For help with syntax, see
|
||||
http://sphinx-doc.org/rest.html To test out your formatting, see
|
||||
http://www.tele3.cz/jbar/rest/rest.html
|
||||
|
||||
=========================
|
||||
Updateable Object Sysmeta
|
||||
=========================
|
||||
|
||||
The original system metadata patch ( https://review.openstack.org/#/c/51228/ )
|
||||
supported only account and container system metadata.
|
||||
|
||||
There are now patches in review that store middleware-generated metadata
|
||||
with objects, e.g.:
|
||||
|
||||
* on demand migration https://review.openstack.org/#/c/64430/
|
||||
* server side encryption https://review.openstack.org/#/c/76578/1
|
||||
|
||||
Object system metadata should not be stored in the x-object-meta- user
|
||||
metadata namespace because (a) there is a potential name conflict with
|
||||
arbitrarily named user metadata and (b) system metadata in the x-object-meta-
|
||||
namespace will be lost if a user sends a POST request to the object.
|
||||
|
||||
A patch is under review ( https://review.openstack.org/#/c/79991/ ) that will
|
||||
persist system metadata that is included with an object PUT request,
|
||||
and ignore system metadata sent with POSTs.
|
||||
|
||||
The goal of this work is to enable object system metadata to be persisted
|
||||
AND updated. Unlike user metadata, it should be possible to update
|
||||
individual items of system metadata independently when making a POST request
|
||||
to an object server.
|
||||
|
||||
This work applies to fast-POST operation, not POST-as-copy operation.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Item-by-item updates to metadata can be achieved by simple changes to the
|
||||
metadata read-modify-write cycle during a POST to the object server: read
|
||||
system metadata from existing data or meta file, merge new items,
|
||||
write to a new meta file. However, concurrent POSTs to a single server or
|
||||
inconsistent results between multiple servers can lead to multiple meta
|
||||
files containing divergent sets of system metadata. These must be preserved
|
||||
and eventually merged to achieve eventual consistency.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
The proposed new behavior is to preserve multiple meta files in the obj_dir
|
||||
until their system metadata is known to have been read and merged into a
|
||||
newer meta file.
|
||||
|
||||
When constructing a diskfile object, all existing meta files that are newer
|
||||
than the data file (usually just one) should be read for potential system
|
||||
metadata contributions. To enable a per-item most-recent-wins semantic when
|
||||
merging contributions from multiple meta files, system metadata should be
|
||||
stored in meta files as `key: (value, timestamp)` pairs. This is not
|
||||
necessary when system metadata is stored in a data file because the
|
||||
timestamp of those items is known to be that of the data file.
|
||||
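
The per-item most-recent-wins merge could look roughly like the following
sketch, operating on plain dicts of `key: (value, timestamp)` pairs rather than
the real DiskFile structures::

    def merge_sysmeta(metadata_sets):
        """Merge sysmeta dicts of key: (value, timestamp); newest item wins."""
        merged = {}
        for metadata in metadata_sets:        # e.g. one dict per .meta file
            for key, (value, timestamp) in metadata.items():
                if key not in merged or timestamp > merged[key][1]:
                    merged[key] = (value, timestamp)
        return merged

Applied to the two meta files in the example below, this yields the composed
HEAD response shown there.
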
|
||||
When writing the diskfile during a POST, the merged set of system metadata
|
||||
should be written to the new meta file, after which the older meta files can
|
||||
be deleted.
|
||||
|
||||
This requires a change to the diskfile cleanup code (`hash_cleanup_listdir`).
|
||||
After creating a new meta file, instead of deleting all older meta files,
|
||||
only those that were either older than the data file or read during
|
||||
construction of the new meta file are deleted.
|
||||
|
||||
In most cases the result will be the same, but if a second concurrent request
|
||||
has written a meta file that was not read by the first request handler then
|
||||
this meta file will be left in place.
|
||||
|
||||
Similarly, a change is required in the async cleanup process (called by the
|
||||
replicator daemon). The cleanup process should merge any existing meta files
|
||||
into the most recent file before deleting older files. To reduce workload,
|
||||
this merge process could be conditional upon a threshold number of meta
|
||||
files being found.
|
||||
|
||||
Replication considerations
|
||||
--------------------------
|
||||
|
||||
As a result of failures, object servers may have different existing meta
|
||||
files for an object when a POST is handled and a new (merged) metadata set
|
||||
is written to a new meta file. Consequently, object servers may end up with
|
||||
identically timestamped meta files having different system metadata content.
|
||||
|
||||
rsync:
|
||||
|
||||
To differentiate between these meta files it is proposed to include a hash
|
||||
of the metadata content in the name of the meta file. As a result,
|
||||
meta files with differing content will be replicated between object servers
|
||||
and their contents merged to achieve eventual consistency.
|
||||
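
One way such a filename could be derived (illustrative only; the exact hash
input and format are not fixed by this spec)::

    import hashlib
    import json

    def meta_filename(timestamp, metadata):
        """Build '<timestamp>.<hash>.meta' from the metadata content."""
        digest = hashlib.md5(
            json.dumps(metadata, sort_keys=True).encode('utf-8')).hexdigest()
        return '%s.%s.meta' % (timestamp, digest)
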
|
||||
The timestamp part of the meta filename is still required in order to (a)
|
||||
allow meta files older than a data or tombstone file to be deleted without
|
||||
being read and (b) to continue to record the modification time of user
|
||||
metadata.
|
||||
|
||||
ssync - TBD
|
||||
|
||||
Deleting system metadata
|
||||
------------------------
|
||||
|
||||
An item of system metadata with key `x-object-sysmeta-x` should be deleted
|
||||
when a header `x-object-sysmeta-x:""` is included with a POST request. This
|
||||
can be achieved by persisting the system metadata item in meta files with an
|
||||
empty value, i.e. `key : ("", timestamp)`, to indicate to any future metadata
|
||||
merges that the item has been deleted. This guards against inclusion of
|
||||
obsolete values from older meta files at the expense of storing the empty
|
||||
value. The empty-valued system metadata may be finally removed during a
|
||||
subsequent merge when it is observed that some expiry time has passed since
|
||||
its timestamp (i.e. any older value that the empty value is overriding would
|
||||
have been replicated by this time, so it is safe to delete the empty value).
|
||||
|
||||
Example
|
||||
-------
|
||||
|
||||
Consider the following scenario. Initially the object dir on each object
|
||||
server contains just the original data file::
|
||||
|
||||
obj_dir:
|
||||
t1.data:
|
||||
x-object-sysmeta-p: ('p1', t0)
|
||||
|
||||
Two concurrent POSTs update the object on servers A and B,
with timestamps t2 and t3, but fail on server C. One POST updates
`x-object-sysmeta-p`, sets `x-object-sysmeta-x` and adds `x-object-sysmeta-y`;
the other updates `x-object-sysmeta-x` and adds `x-object-sysmeta-z`. These
POSTs result in two meta files being added to the object directory on A and B::
|
||||
|
||||
obj_dir:
|
||||
t1.data:
|
||||
x-object-sysmeta-p: ('p1', t0)
|
||||
t2.h2.meta:
|
||||
x-object-sysmeta-p: ('p2', t2)
|
||||
x-object-sysmeta-x: ('x1', t2)
|
||||
x-object-sysmeta-y: ('y1', t2)
|
||||
t3.h3.meta:
|
||||
x-object-sysmeta-p: ('p1', t0)
|
||||
x-object-sysmeta-x: ('x2', t3)
|
||||
x-object-sysmeta-z: ('z1', t3)
|
||||
|
||||
(`hx` in filename represents hash of metadata)
|
||||
|
||||
A response to a subsequent HEAD request would contain the composition of the
|
||||
two meta files' system metadata items::
|
||||
|
||||
x-object-sysmeta-p: 'p2'
|
||||
x-object-sysmeta-x: 'x2'
|
||||
x-object-sysmeta-y: 'y1'
|
||||
x-object-sysmeta-z: 'z1'
|
||||
|
||||
A further POST request received at t4 deletes `x-object-sysmeta-p`. This
|
||||
causes the two meta files to be read, their contents merged and a new meta
|
||||
file to be written. This POST succeeds on all servers,
|
||||
so on servers A and B we have::
|
||||
|
||||
obj_dir:
|
||||
t1.data :
|
||||
x-object-sysmeta-p: ('p1', t0)
|
||||
t4.h4a.meta:
|
||||
x-object-sysmeta-p: ('', t4)
|
||||
x-object-sysmeta-x: ('x2', t3)
|
||||
x-object-sysmeta-z: ('z1', t3)
|
||||
x-object-sysmeta-y: ('y1', t2)
|
||||
|
||||
whereas on server C we have::
|
||||
|
||||
obj_dir:
|
||||
t1.data :
|
||||
x-object-sysmeta-p: ('p1', t0)
|
||||
t4.h4b.meta:
|
||||
x-object-sysmeta-p: ('', t4)
|
||||
|
||||
Eventually the meta files will be replicated between servers and merged,
|
||||
leaving all servers with::
|
||||
|
||||
obj_dir:
|
||||
t1.data :
|
||||
x-object-sysmeta-p: ('p1', t0)
|
||||
t4.h4a.meta:
|
||||
x-object-sysmeta-p: ('', t4)
|
||||
x-object-sysmeta-x: ('x2', t3)
|
||||
x-object-sysmeta-z: ('z1', t3)
|
||||
x-object-sysmeta-y: ('y1', t2)
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
One alternative approach would be to preserve all meta files that are newer
|
||||
than a data or tombstone file and never merge their contents. This removes
|
||||
the need to include a hash in the meta file name, but has the obvious
|
||||
disadvantage of accumulating an increasing number of files, each of which
|
||||
needs to be read when constructing a diskfile.
|
||||
|
||||
Another alternative would be to store system metadata in a separate `sysmeta` file.
|
||||
It may then be possible to discard the timestamp from the filename (if the
|
||||
`timestamp.hash` format is deemed too long).
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
Alistair Coles (acoles)
|
||||
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
TBD
|
||||
|
||||
Repositories
|
||||
------------
|
||||
|
||||
None
|
||||
|
||||
Servers
|
||||
-------
|
||||
|
||||
None
|
||||
|
||||
DNS Entries
|
||||
-----------
|
||||
|
||||
None
|
||||
|
||||
Documentation
|
||||
-------------
|
||||
|
||||
No change to external API docs. Developer docs would be updated to make
|
||||
developers aware of the feature.
|
||||
|
||||
Security
|
||||
--------
|
||||
|
||||
None
|
||||
|
||||
Testing
|
||||
-------
|
||||
|
||||
Additional unit tests will be required for diskfile.py, object server. Probe
|
||||
tests will be useful to verify replication behavior.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
Patch for object system metadata on PUT only:
|
||||
https://review.openstack.org/#/c/79991/
|
||||
|
||||
Spec for updating containers on fast-POST:
|
||||
https://review.openstack.org/#/c/102592/
|
||||
|
||||
There is a mutual dependency between this spec and the spec to update
|
||||
containers on fast-POST: the latter requires content-type to be treated as
|
||||
an item of mutable system metadata, which this spec aims to enable. This
|
||||
spec assumes that fast-POST becomes usable, which requires consistent
|
||||
container updates to be enabled.
|
114
template.rst
@ -1,114 +0,0 @@
|
||||
::
|
||||
|
||||
This work is licensed under a Creative Commons Attribution 3.0
|
||||
Unported License.
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
..
|
||||
This template should be in ReSTructured text. Please do not delete
|
||||
any of the sections in this template. If you have nothing to say
|
||||
for a whole section, just write: "None". For help with syntax, see
|
||||
http://sphinx-doc.org/rest.html To test out your formatting, see
|
||||
http://www.tele3.cz/jbar/rest/rest.html
|
||||
|
||||
===============================
|
||||
The Title of Your Specification
|
||||
===============================
|
||||
|
||||
Include the URL of your blueprint:
|
||||
|
||||
https://blueprints.launchpad.net/swift/...
|
||||
|
||||
Introduction paragraph -- why are we doing anything?
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
A detailed description of the problem.
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Here is where you cover the change you propose to make in detail. How do you
|
||||
propose to solve this problem?
|
||||
|
||||
If this is one part of a larger effort make it clear where this piece ends. In
|
||||
other words, what's the scope of this effort?
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
This is an optional section, where it does apply we'd just like a demonstration
|
||||
that some thought has been put into why the proposed approach is the best one.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Who is leading the writing of the code? Or is this a blueprint where you're
|
||||
throwing it out there to see who picks it up?
|
||||
|
||||
If more than one person is working on the implementation, please designate the
|
||||
primary author and contact.
|
||||
|
||||
Primary assignee:
|
||||
<launchpad-id or None>
|
||||
|
||||
Can optionally list additional ids if they intend on doing substantial
|
||||
implementation work on this blueprint.
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
Work items or tasks -- break the feature up into the things that need to be
|
||||
done to implement it. Those parts might end up being done by different people,
|
||||
but we're mostly trying to understand the timeline for implementation.
|
||||
|
||||
Repositories
|
||||
------------
|
||||
|
||||
Will any new git repositories need to be created?
|
||||
|
||||
Servers
|
||||
-------
|
||||
|
||||
Will any new servers need to be created? What existing servers will
|
||||
be affected?
|
||||
|
||||
DNS Entries
|
||||
-----------
|
||||
|
||||
Will any other DNS entries need to be created or updated?
|
||||
|
||||
Documentation
|
||||
-------------
|
||||
|
||||
Will this require a documentation change? If so, which documents?
|
||||
Will it impact developer workflow? Will additional communication need
|
||||
to be made?
|
||||
|
||||
Security
|
||||
--------
|
||||
|
||||
Does this introduce any additional security risks, or are there
|
||||
security-related considerations which should be discussed?
|
||||
|
||||
Testing
|
||||
-------
|
||||
|
||||
What tests will be available or need to be constructed in order to
|
||||
validate this? Unit/functional tests, development
|
||||
environments/servers, etc.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
- Include specific references to specs and/or stories in swift, or in
|
||||
other projects, that this one either depends on or is related to.
|
||||
|
||||
- Does this feature require any new library or program dependencies
|
||||
not already in use?
|
||||
|
||||
- Does it require a new puppet module?
|
26
tox.ini
@ -1,26 +0,0 @@
|
||||
[tox]
|
||||
minversion = 1.6
|
||||
envlist = docs
|
||||
skipsdist = True
|
||||
|
||||
[testenv]
|
||||
usedevelop = True
|
||||
install_command = pip install -U {opts} {packages}
|
||||
setenv =
|
||||
VIRTUAL_ENV={envdir}
|
||||
deps = -r{toxinidir}/requirements.txt
|
||||
-r{toxinidir}/test-requirements.txt
|
||||
passenv = *_proxy *_PROXY
|
||||
|
||||
[testenv:venv]
|
||||
commands = {posargs}
|
||||
|
||||
[testenv:docs]
|
||||
commands = python setup.py build_sphinx
|
||||
|
||||
[testenv:spelling]
|
||||
deps =
|
||||
-r{toxinidir}/requirements.txt
|
||||
sphinxcontrib-spelling
|
||||
PyEnchant
|
||||
commands = sphinx-build -b spelling doc/source doc/build/spelling
|