Characterizing and Contrasting Container Orchestrators

Lee Calcote

February 2017


Lee Calcote

clouds, containers, infrastructure, applications and their management

Show of Hands

[kuhn-tey-ner] [awr-kuh-streyt-or]

Definition:

CaaS (Containers as a Service)

(Stay tuned for updates to presentation and book)

Joyent Triton

Docker Datacenter

AWS ECS

Azure Container Service

Rackspace Carina

Fleet
Nomad
Swarm
Kubernetes
Mesos+Marathon

One size does not fit all.

A strict apples-to-apples comparison is inappropriate and not the objective, hence characterizing and contrasting.

Let's not go here today.

Container orchestrators may be intermixed.

Categorically Speaking

  • Genesis & Purpose

  • Support & Momentum

  • Host & Service Discovery

  • Scheduling

  • Modularity & Extensibility

  • Updates & Maintenance

  • Health Monitoring

  • Networking & Load-Balancing

  • Secrets Management

  • High Availability & Scale

Core

Capabilities

  • Cluster Management

    • Host Discovery

    • Host Health Monitoring

  • Scheduling

  • Orchestrator Updates and Host Maintenance

  • Service Discovery

  • Networking and Load-Balancing

  • Stateful services

  • Multi-tenant, multi-region

Additional

Key Capabilities

  • Application Health & Performance Monitoring

  • Application Deployments

  • Application Secrets

Nomad

Genesis & Purpose

  • designed for both long-lived services and short-lived batch processing workloads.
     
  • cluster manager with declarative job specifications.
     
  • ensures constraints are satisfied and resource utilization is optimized by efficient task packing.
     
  • supports all major operating systems and virtualized, containerized or standalone workloads.
     
  • written in Go and under the Unix philosophy.

 

Support & Momentum

  • Project began June 2015 (19 months old) and has 141 contributors

    • Current release v0.5.4

    • Nomad Enterprise offering aimed for first half of this year.
       

  • Supported and governed by HashiCorp

    • HashiConf US '15 had ~300 attendees

    • HashiConf EU '16 had ~320 attendees

    • HashiConf US '16 had ~500 attendees 

Nomad Architecture

Host & Service Discovery

Host Discovery

  • Gossip protocol - Serf is used

    • Docker multi-host networking and Swarmkit use Serf, too

  • Servers advertise full set of Nomad servers to clients

    • heartbeats every 30 seconds

  • Creating federated clusters is simple
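A minimal sketch of joining servers across regions (the address is hypothetical; 4648 is Nomad's default serf/gossip port):

    nomad server-join 10.1.0.10:4648   # federate by joining a server in another region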
     

Service Discovery

  • Nomad integrates with Consul to provide service discovery and monitoring.

Scheduling

  • two distinct phases: feasibility checking and ranking.
     

  • optimistically concurrent

    • enabling all servers to participate in scheduling decisions which increases the total throughput and reduces latency
       

  • three scheduler types used when creating jobs:

    • service, batch and system

    •  `nomad plan` shows a point-in-time view of what Nomad will do
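A minimal command-line sketch of that workflow (the job file is the sample one that `nomad init` generates):

    nomad init                 # writes a sample example.nomad job specification
    nomad plan example.nomad   # dry run: shows the placements Nomad would make
    nomad run example.nomad    # submit the job for scheduling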

Modularity & Extensibility

Task drivers

  • Used by Nomad clients to execute a task and provide resource isolation.
     

  • Extensible task drivers are important for the flexibility to support a broad set of workloads (e.g. rkt, lxc).
     

  • Does not currently support pluggable task drivers:

    • you have to implement the task driver interface and recompile the Nomad binary.

Updates & Maintenance

Nodes

  • Drain allocations on a running node (see the sketch after this list).

  • integrates with tools like Packer, Consul, and Terraform

    • to support building artifacts, service discovery, monitoring and capacity management.
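A minimal sketch of the drain operation (node IDs are placeholders):

    nomad node-status                      # list client nodes and their IDs
    nomad node-drain -enable <node-id>     # migrate allocations off the node
    nomad node-drain -disable <node-id>    # return the node to service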
       

Applications

  • Log rotation (stderr and stdout)

    • no log forwarding support, yet

  • Rolling updates (via the `update` block in the job specification).

Health Monitoring

Nodes

  • Node health monitoring is done via heartbeats, so Nomad can detect failed nodes and migrate the allocations to other healthy clients.

 

Applications

  • currently http, tcp and script

  • In the future Nomad will add support for more Consul checks.

  • `nomad alloc-status` reports actual resource utilization
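A minimal sketch, assuming a hypothetical job named "example":

    nomad status example            # shows the job's allocations and their state
    nomad alloc-status <alloc-id>   # reports actual resource utilization for one allocation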

Networking & Load-Balancing

Networking

 

  • Dynamic ports are allocated in a range from 20000 to 60000.

  • Tasks share the node's IP address.
     

Load-Balancing

  • Consul provides DNS-based load-balancing

Secrets Management

  • Nomad agents provide secure integration with Vault

    • for all tasks and containers it spins up

 

  • gives secure access to Vault secrets through a workflow which minimizes risk of secret exposure during bootstrapping.

High Availability & Scale

  • distributed and highly available, using both leader election and state replication to provide availability in the face of failures.
     

  • shared state optimistic scheduler

    • the only open source implementation of such a scheduler.
       

  • 1,000,000 containers across 5,000 hosts, scheduled in 5 minutes.

 

  • Built for managing multiple clusters / cluster federation.

  • Easy to use

  • Single binary for both clients and servers

  • Supports non-containerized tasks and multiple container runtimes

  • Arguably the most advanced scheduler design

  • Upfront consideration of federation / hybrid cloud

  • Broad OS support

  • Outside of scheduler, comparatively less sophisticated

  • Young project

  • Less relative momentum

  • Less relative adoption

  • Less extensible / pluggable

Docker Swarm

Docker Swarm 1.12

aka

Swarmkit or Swarm mode

Genesis & Purpose

  • Swarm is simple and easy to setup.
     

  • Initially responsible for clustering and scheduling  

    • Driving toward application's needs with services, secrets, etc.
       

  • Originally an imperative system, now declarative.
     

  • Swarm’s architecture is not as complex as those of Kubernetes and Mesos.
     

  • Written in Go, Swarm is lightweight, modular and somewhat extensible.

Docker Swarm 1.11 (Standalone)

Docker Swarm Mode 1.12 (Swarmkit)

Support & Momentum

  • Contributions:

    • Standalone: ~3,000 commits, 12 core maintainers (140 contributors)

    • Swarmkit: ~2,800 commits, ~12 core maintainers (70 contributors)
       

  • ~289 Docker meetups worldwide

    • Disclaimer: I organize Docker Austin.
       

  • Production-ready:

    • Standalone announced ~15 months ago (Nov 2015)

    • Swarmkit announced ~7 months ago (July 2016)

Host & Service Discovery

Host Discovery

  • Like Nomad, uses HashiCorp's go-memdb for storing cluster state

  • Pull model - workers check in with the Manager

  • Rate control - the frequency of worker check-ins may be adjusted at the Manager, with jitter added

  • Workers don't need to know which Manager is active; follower Managers redirect Workers to the Leader

Service Discovery

  • Embedded DNS and round robin load-balancing

  • Services are a new concept

 

Scheduling

  • Swarm’s scheduler is pluggable

  • Swarm scheduling is a combination of strategies and filters/constraints:

    • Strategies

      • Random

      • Spread*

      • Binpack

    • Filters

      • container constraints (affinity, dependency, port) are defined as environment variables in the specification file

      • node constraints (health, constraint) must be specified when starting the docker daemon and define which nodes a container may be scheduled on.
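A minimal sketch of standalone-Swarm filters (the node label and container names are hypothetical):

    dockerd --label storage=ssd                        # node constraint, set when starting the engine
    docker run -d -e constraint:storage==ssd nginx     # only lands on ssd-labeled nodes
    docker run -d -e affinity:container==web redis     # co-schedule next to the "web" container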

Modularity & Extensibility

Ability to remove batteries is a strength for Swarm:

  • Pluggable scheduler

  • Pluggable network driver

  • Pluggable distributed K/V store

  • Docker container engine runtime-only

  • Pluggable authorization (in docker engine)*

Updates & Maintenance

Nodes

  • Nodes may be Active, Drained and Paused

    • Manager weights are used to drain or pause Managers

  • Manual swarm manager and worker updates
     

Applications

  • Rolling updates now supported

    • --update-delay

    • --update-parallelism

    • --update-failure-action
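A minimal sketch using the flags above (service name "web" and node "node-2" are hypothetical):

    docker node update --availability drain node-2     # vacate a node for maintenance
    docker service update --update-parallelism 2 --update-delay 10s \
      --update-failure-action pause --image nginx:1.11 web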

Health Monitoring

Nodes

  • Swarm monitors the availability and resource usage of nodes within the cluster

 

Applications

  • One health check per container may be run

    • check container health by running a command inside the container

      • --interval=DURATION (default: 30s)

      • --timeout=DURATION (default: 30s)

      • --retries=N (default: 3)
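A minimal sketch of a per-container health check (the check command and image are placeholders):

    docker run -d --health-cmd='curl -f http://localhost/ || exit 1' \
      --health-interval=30s --health-timeout=30s --health-retries=3 nginx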

Networking & Load-Balancing

  • Swarm and multi-host networking are simpatico

    • provides for user-defined overlay networks that are micro-segmentable

    • uses HashiCorp's Serf gossip protocol for quick convergence of the neighbor table

    • facilitates container name resolution via an embedded DNS server (previously via /etc/hosts)
       

  • Load-balancing based on IPVS

    • expose a Service's port externally

    • L4 load-balancer; cluster-wide port publishing
       

  • Mesh routing

    • send a request to any one of the nodes and it will be routed and internally load-balanced automatically
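A minimal sketch of overlay networking and routing-mesh port publishing (names are hypothetical):

    docker network create -d overlay backend
    docker service create --name web --network backend --replicas 3 -p 8080:80 nginx
    # port 8080 is now published on every node; requests are spread across tasks via IPVS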

Secrets Management

Landed in 1.13

 

  • encrypted and kept in Raft store

  • managed by Swarm Managers

  • retrieved by Swarm Services (not containers)

  • via an in-memory filesystem mounted on the node
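A minimal sketch, assuming Docker 1.13+ and a hypothetical secret and image:

    echo "s3cr3t" | docker secret create db_password -
    docker service create --name api --secret db_password myorg/api
    # inside the service's containers the secret is readable at /run/secrets/db_password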

High Availability & Scale

  • Managers may be deployed in a highly-available configuration

    • Active/Standby - only one active Leader at-a-time

    • Maintain odd number of managers
       

  • Rescheduling upon node failure

    • No rebalancing upon node addition to the cluster
       

  • Does not support multiple failure isolation regions or federation

Scaling swarm to 1,000 AWS nodes and 50,000 containers

  • Suitable for orchestrating a combination of infrastructure containers

    • Has only recently added capabilities falling into the application bucket

  • Swarmkit is a young project

    • advanced features forthcoming

    • natural expectation of caveats in functionality

  • No rebalancing, autoscaling or monitoring, yet

  • Only schedules Docker containers, not containers using other specifications.

    • Does not schedule VMs or non-containerized processes

    • Does not provide support for batch jobs

  • Need separate load-balancer for overlapping ingress ports

  • While dependency and affinity filters are available, Swarm does not provide the ability to enforce that two containers be scheduled onto the same host, or never be co-scheduled.

    • Filters facilitate the sidecar pattern. No “pod” concept.

  • Swarm works. Swarm is simple and easy to deploy.

    • 1.12 eliminated the need for much, but not all, third-party software

    • Facilitates earlier stages of adoption by organizations viewing containers as faster VMs

    • now with built-in functionality for applications

  • Swarm is easy to extend: if you already know the Docker APIs, you can customize Swarm

  • Still modular, but has stepped back here.

  • Moving very fast; eliminating gaps quickly.

Kubernetes

Genesis & Purpose

  • an opinionated framework for building distributed systems

    "an open source system for automating deployment, scaling, and operations of applications."
  • Written in Go, Kubernetes is lightweight, modular and extensible

  • considered a third generation container orchestrator led by Google, Red Hat and others.

  • Declarative and opinionated, with many key features included

    • bakes in load-balancing, scale, volumes, deployments, secret management and cross-cluster federated services among other features.

 

Kubernetes Architecture

Support & Momentum

  • Kubernetes is ~2 yrs. 8 months old (June 2014)

    • Announced as production-ready 19 months ago (July 2015)
       

  • Project has over 1,000 commits per month (~44,000 total)

    • reached 1,000 contributors (~100 core) in Dec. 2016

    • ~5,000 commits made in each release   (1.5 is latest)
       

  • ~244 Kubernetes meetups worldwide.

    • ​Disclaimer: I organize Microservices and Containers Austin.
       

  • Under the governance of the Cloud Native Computing Foundation

    • KubeCon earlier this year capped at 1,000 attendees

Host & Service Discovery

Host Discovery

  • by default, the node agent (kubelet) is configured to register itself with the master (API server)

    • automating the joining of new hosts to the cluster

Service Discovery

Two primary modes of finding a Service

  • DNS

    • SkyDNS is deployed as a cluster add-on

  • environment variables​

    • environment variables are used as a simple way of providing compatibility with Docker links-style networking
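A minimal sketch, assuming a hypothetical Deployment named "web" in the default namespace:

    kubectl expose deployment web --port=80
    # DNS: web.default.svc.cluster.local resolves to the Service's cluster IP
    # env vars injected into pods started later: WEB_SERVICE_HOST, WEB_SERVICE_PORT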

Scheduling

  • By default, scheduling is handled by kube-scheduler (pluggable).
     

  • Selection criteria used by kube-scheduler to identify the best-fit node is defined by policy:

    • Predicates (node resources and characteristics):

      • PodFitsPorts, PodFitsResources, NoDiskConflict, MatchNodeSelector, HostName, ServiceAffinity, LabelsPresence

    • Priorities (weighted strategies used to identify “best fit” node):

      • LeastRequestedPriority, BalancedResourceAllocation, ServiceSpreadingPriority, EqualPriority
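A minimal sketch of steering the MatchNodeSelector predicate with a node label (node name and label are hypothetical):

    kubectl label nodes node-1 disktype=ssd
    # a pod spec that sets nodeSelector: {disktype: ssd} will then only be placed on nodes carrying that label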

Modularity & Extensibility

  • One of Kubernetes' strengths is its pluggable architecture and its nature as an extensible platform

  • Choice of:

    • database for service discovery or network driver

    • container runtime - may choose to run Docker or rkt containers

  • ​Cluster add-ons

    • optional system components that implement a cluster feature (e.g. DNS, logging, etc.)

    • shipped with the Kubernetes binaries and are considered an inherent part of the Kubernetes clusters

       

Updates & Maintenance

Applications

  • `Deployment` objects automate deploying and rolling updating applications.​

  • Support for rolling back deployments

Kubernetes Components

  • Consistently backwards compatible

  • Upgrading the Kubernetes components and hosts is done via shell script 

  • Host maintenance - mark the node as unschedulable.

    • existing pods are vacated from the node

    • prevents new pods from being scheduled on the node
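A minimal sketch, assuming a Deployment "web" (container "web") and a node "node-1":

    kubectl set image deployment/web web=nginx:1.11   # trigger a rolling update
    kubectl rollout undo deployment/web               # roll back to the previous revision
    kubectl drain node-1 --ignore-daemonsets          # vacate pods, mark node unschedulable
    kubectl uncordon node-1                           # make the node schedulable again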

Health Monitoring

Nodes

  • Failures - actively monitors the health of nodes within the cluster

    • via Node Controller

  • Resources - usage monitoring leverages a combination of open source components:

    • cAdvisor, Heapster, InfluxDB, Grafana, Prometheus

Applications 

  • three types of user-defined application health checks, with the Kubelet agent acting as the health check monitor

    • HTTP Health Checks, Container Exec, TCP Socket

​Cluster-level Logging

  • collect logs that persist beyond the lifetime of a pod's containers, the pod itself, or even the cluster

    • standard output and standard error of each container can be ingested using a Fluentd agent running on each node

Networking & Load-Balancing

…enter the Pod

  • atomic unit of scheduling

  • flat networking with each pod receiving an IP address

  • no NAT required, port conflicts localized

  • intra-pod communication via localhost​

Load-Balancing

  • Services provide inherent load-balancing via kube-proxy:

    • runs on each node of a Kubernetes cluster

    • reflects services as defined in the Kubernetes API

    • supports simple TCP/UDP forwarding and round-robin and Docker-links-based service IP:PORT mapping. 

Secrets Management

  • encrypted and stored in etcd

  • used by containers in a pod either:

     

    1. mounted as data volumes

    2. exposed as environment variables
       

  • None of the pod’s containers will start until all of the pod’s volumes are mounted.

  • Individual secrets are limited to 1MB in size.

  • Secrets are created and accessible within a given namespace, not cross-namespace.
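A minimal sketch, with a hypothetical secret name and values:

    kubectl create secret generic db-credentials \
      --from-literal=username=admin --from-literal=password=s3cr3t
    # the secret can then be referenced from a pod spec as a volume mount or as environment variables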

High Availability & Scale

  • Each master component may be deployed in a highly-available configuration.

    • Active/Standby configuration

  • Federated clusters / multi-region deployments

​​ Scale

  • v1.2 support for 1,000 node clusters

  • v1.3 supports 2,000 node clusters
     

  • Horizontal Pod Autoscaling (via Replication Controllers)

    • Node Autoscaling (if you're running on GCE; AWS support is coming soon)
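A minimal sketch of Horizontal Pod Autoscaling (Deployment name "web" is hypothetical):

    kubectl autoscale deployment web --min=2 --max=10 --cpu-percent=80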

  • Only runs containerized applications

  • For those familiar only with Docker, Kubernetes requires understanding new concepts

    • Powerful frameworks with more moving pieces beget complicated cluster deployment and management.

  • Lightweight graphical user interface

  • Does not provide techniques for resource utilization as sophisticated as those of Mesos

  • Kubernetes can schedule docker or rkt containers

  • Inherently opinionated w/functionality built-in.

    • relatively easy to change its opinion

    • little to no third-party software needed

    • builds in many application-level concepts and services (petsets, jobsets, daemonsets, application packages / charts, etc.)

    • advanced storage/volume management

  • project has most momentum

  • project is arguably most extensible

  • thorough project documentation

  • Supports multi-tenancy

  • Multi-master, cross-cluster federation, robust logging & metrics aggregation

Mesos + Marathon

Genesis & Purpose

  • Mesos is a distributed systems kernel

    • stitches together many different machines into a logical computer

  • Mesos has been around the longest (launched in 2009)

    • and is arguably the most stable, with the highest (proven) scale currently

  • Mesos is written mostly in C++

    • with Java, Python and C++ APIs

  • Marathon as a Framework

    • Marathon is one of a number of frameworks (Chronos and Aurora are other examples) that may be run on top of Mesos

    • Frameworks have a scheduler and executor. Schedulers get resource offers. Executors run tasks.

    • Marathon is written in Scala

Mesos Architecture

Support & Momentum

  • MesosCon 2016 in Denver had ? attendees

  • MesosCon 2015 in Seattle had 700 attendees

    • up from 262 attendees in 2014

  • Mesos has 224 contributors

  • Marathon has 227 contributors

  • Mesos is under the governance of the Apache Foundation

  • Marathon is under the governance of Mesosphere

  • Mesos is used by Twitter, AirBnb, eBay, Apple, Cisco, Yodle

  • Marathon is used by Verizon and Samsung

Host & Service Discovery

  • Mesos-DNS generates an SRV record for each Mesos task

    • including Marathon application instances

  • Marathon will ensure that all dynamically assigned service ports are unique

  • Mesos-DNS is particularly useful when:

    • apps are launched through multiple frameworks (not just Marathon)

    • you are using an IP-per-container solution like Project Calico

    • you use random host port assignments in Marathon
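A minimal sketch of resolving a Marathon-launched app through Mesos-DNS (the app name "web" is hypothetical):

    dig +short _web._tcp.marathon.mesos SRV   # returns a host and port per running task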

Scheduling

  • Two-level scheduler

    • First-level scheduling happens at the Mesos master based on allocation policy, which decides which frameworks get resources.

    • Second-level scheduling happens at the Framework scheduler, which decides which tasks to execute.

  • Provides reservations, over-subscription and preemption.

Modularity & Extensibility

Frameworks

  • multiple available

  • may run multiple frameworks concurrently

Modules

  • extend the inner workings of Mesos by creating and using shared libraries that are loaded on demand

  • many types of Modules

    • Replacement, Isolator, Allocator, Authentication, Hook, Anonymous

Updates & Maintenance

Nodes

  • Mesos has a maintenance mode; Marathon does not.

  • Mesos API is backwards compatible from v1.0 forward

Applications

  • Marathon can be instructed to deploy containers using a blue/green strategy

    • where old and new versions co-exist for a time.

Health Monitoring

Nodes

  • The Master tracks a set of statistics and metrics to monitor resource usage

Applications

  • support for health checks (HTTP and TCP)

  • an event stream that can be integrated with load-balancers or used for analyzing metrics

Networking & Load-Balancing

Networking

  • An IP per container

    • No longer shares the node's IP

    • Helps remove port conflicts

    • Enables 3rd-party network drivers

  • Container Network Interface (CNI) isolator with the MesosContainerizer

Load-Balancing

  • Marathon offers two TCP/HTTP proxies

    • A simple shell script and a more complex one called `marathon-lb` that has more features.

    • Pluggable (e.g. Traefik for load-balancing)

Secrets Management

Not yet; only supported by Enterprise DC/OS.

  • Stored in ZooKeeper, exposed as ENV variables in Marathon

  • Secrets shorter than eight characters may not be accepted by Marathon.

  • By default, you cannot store a secret larger than 1MB.

High Availability & Scale

  • A strength of Mesos’s architecture

    • requires masters to form a quorum using ZooKeeper (point of failure)

    • only one Active (Leader) master at-a-time in Mesos and Marathon

  • Scale is a strong suit for Mesos. TBD for Marathon.

  • Autoscale

    • `marathon-autoscale.py` - autoscales an application based on utilization metrics from Mesos

    • `marathon-lb-autoscale` - request-rate-based autoscaling with Marathon.
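Both tools drive the same underlying Marathon endpoint; a minimal manual-scaling sketch (Marathon host and app id are hypothetical):

    curl -X PUT -H "Content-Type: application/json" \
      http://marathon.example.com:8080/v2/apps/web -d '{"instances": 5}'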

     

  • Great at short-lived jobs. High availability built-in.

    • Referred to as the “golden standard” by Solomon Hykes, Docker CTO.

  • Universal Containerizer

    • abstracts away from docker, rkt, kurma?, lxc?

  • Can run multiple frameworks, including Kubernetes and Swarm.

  • Supports multi-tenancy.

  • Good for Big Data shops and job / task-oriented workloads.

    • Good for mixed workloads and with data-locality policies

  • Mesos is powerful, scalable and battle-tested

    • Good when you have multiple large workloads to run on a 10,000+ node cluster

  • Marathon UI is young, but promising.

  • Still needs 3rd-party tools

  • Marathon interface could be more Docker-friendly (hard to get at volumes and registry)

  • May need a dedicated infrastructure IT team

    • an overly complex solution for small deployments

Summary

A high-level perspective of the container orchestrator spectrum.

Lee Calcote

Thank you. Questions?

clouds, containers, infrastructure, applications and their management