Infrastructure Apps

Applications that are critical or important for our infrastructure

Apps

Utilities

  • OpenCost: Reports cluster cost per namespace or per metric (e.g. customer).
  • Helm controller: Enables the use of Helm charts and manages the lifecycle of applications in Kubernetes. Works well together with GitOps.
  • Node problem detector: Extends error reporting on a node and makes it available to the control plane. The information shows up under the node's conditions when looking up a node with kubectl; Kubernetes uses these conditions to determine whether a node is healthy enough to schedule workloads on.
  • Forgejo: Fork of Gitea, a self-hosted alternative to GitHub/GitLab/Bitbucket. Used to store code and packages and to run CI/CD pipelines. Its built-in package registry is better than most dedicated solutions.
  • n8n: Low-code tool for setting up data pipelines that run jobs against internal and external services. Changed its license some time ago to become more commercial, a bait-and-switch on its open source contributors, and locks basic security behind the enterprise tier.
  • Hajimari: A landing page that can be used to discover services and collect other helpful links. Supports annotations on services and ingresses to discover services (see the first sketch after this list).
  • Spegel: Sets up a local image registry in the cluster that works as a cache for images. Significantly speeds up fetching existing images and also eliminates rate-limit issues with external registries.
  • Kubelet Serving Cert Approver: Automatically approves new nodes joining the cluster so they get their serving certificates issued. Has some security implications, but is needed for e.g. autoscaling to work.
  • Reloader: Can be used with annotations to restart applications when a ConfigMap or Secret changes (see the second sketch after this list).
  • Descheduler: Rebalances the cluster by evicting workloads so they can be rescheduled elsewhere, e.g. when a node is underutilised. Currently turned off until the Local Storage Provisioner bug is fixed.
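
A minimal sketch of Hajimari discovery on an Ingress. The annotation keys below are our recollection of Hajimari's conventions and the resource names are hypothetical, so check Hajimari's docs for the exact keys:

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: grafana                      # hypothetical example service
      annotations:
        hajimari.io/enable: "true"       # show this Ingress on the landing page
        hajimari.io/appName: "Grafana"   # display name
        hajimari.io/icon: "chart-line"   # display icon
    spec:
      rules:
        - host: grafana.example.com      # hypothetical host
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: grafana
                    port:
                      number: 80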
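
And a minimal Reloader sketch using the standard reloader.stakater.com annotation; the workload name and image are hypothetical:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app                            # hypothetical workload
      annotations:
        reloader.stakater.com/auto: "true"    # rolling restart when a referenced ConfigMap/Secret changes
    spec:
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
            - name: my-app
              image: registry.example.com/my-app:1.0.0   # hypothetical image
              envFrom:
                - configMapRef:
                    name: my-app-config                  # a change here triggers the restart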

Storage

  • VolSync: Platform-agnostic backup and storage sync tool. Handles backing up and restoring PVCs; copies data with a mover pod that mounts the volume and should run as the container's user/group. Stores data externally using S3. Backups can be protected with credentials, retention is rule-based, and files shared between backups are only stored once, reducing storage requirements (see the first sketch after this list).
  • Local Storage Provisioner: A CSI that can be used to dynamically create PVCs on the local file system; used by our databases. Has not released 1.0.0 yet, but is widely used and stable in production.
  • Hcloud controller (CSI): A CSI that works on nodes which are cloud instances (not root/dedicated/Robot servers). Dynamically provisions PVCs on Hetzner and mounts them; at most 16 PVCs can be attached per node (see the second sketch after this list).
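
A minimal sketch of a VolSync restic-based backup, assuming a hypothetical PVC app-data and a Secret app-data-restic holding the S3 repository settings; field names follow the volsync.backube/v1alpha1 API as we recall it:

    apiVersion: volsync.backube/v1alpha1
    kind: ReplicationSource
    metadata:
      name: app-data-backup
    spec:
      sourcePVC: app-data                  # hypothetical PVC to back up
      trigger:
        schedule: "0 2 * * *"              # nightly at 02:00
      restic:
        repository: app-data-restic        # Secret with S3 endpoint and credentials
        copyMethod: Snapshot               # assumes VolumeSnapshot support; Clone also works
        retain:                            # rule-based retention
          daily: 7
          weekly: 4
        moverSecurityContext:              # run the mover as the container's user/group
          runAsUser: 1000
          runAsGroup: 1000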
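
And a PVC sketch for the Hcloud CSI, assuming its default StorageClass is named hcloud-volumes (verify the class name in the cluster):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: app-data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: hcloud-volumes     # assumed default class from the Hcloud CSI
      resources:
        requests:
          storage: 10Gi                    # a Hetzner volume is created and attached on demand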

Databases and Database Operators

  • MariaDB Operator: An operator that can spin up and manage MariaDB clusters; supports backups to S3 and high availability through a system called Galera. Galera is said to be in beta in the operator, but has been battle-tested; the beta label is probably due to lower priority, since the MariaDB team focuses on another commercial solution for HA. The operator can spin up clusters, handle secrets and also manage database state such as setting up databases, users and roles (see the first sketch after this list).
  • CloudNativePG (CNPG): One of the most mature Postgres operators and among the most used operators for Kubernetes overall. Has extensive documentation, a very good CLI, and handles most common database failures automatically. Does not manage database state the way the MariaDB operator does (see the second sketch after this list).
  • Dragonfly (Redis): An operator that creates Dragonfly clusters. Dragonfly is a faster alternative to Redis and has gained a lot of momentum after the Redis license change.
  • NATS and NATS JetStream: Modern messaging system; can replace all of Kafka's functionality and is much easier to host and develop against. Has built-in object storage, replayable message streams and a KV store. Very well suited for WebSocket/SSE real-time update systems.
  • OpenSearch/Elasticsearch operator: Index database, often used for search queries; typically a dependency of some third-party systems.
  • RabbitMQ: Message queue (use NATS instead). Used by some third-party systems.
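
A minimal sketch of a Galera cluster plus declaratively managed database state, assuming the k8s.mariadb.com/v1alpha1 API; names are hypothetical and field names are from the operator's samples as we recall them, so double-check against its CRDs:

    apiVersion: k8s.mariadb.com/v1alpha1
    kind: MariaDB
    metadata:
      name: mariadb-galera
    spec:
      replicas: 3
      galera:
        enabled: true                      # HA via Galera
      rootPasswordSecretKeyRef:
        name: mariadb-root                 # hypothetical Secret
        key: password
      storage:
        size: 10Gi
    ---
    apiVersion: k8s.mariadb.com/v1alpha1
    kind: Database
    metadata:
      name: app-db                         # database state managed as a resource
    spec:
      mariaDbRef:
        name: mariadb-galera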
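
And a minimal CNPG cluster sketch using the postgresql.cnpg.io/v1 API, with a hypothetical name and storage class:

    apiVersion: postgresql.cnpg.io/v1
    kind: Cluster
    metadata:
      name: pg-main                        # hypothetical cluster name
    spec:
      instances: 3                         # one primary, two replicas; failover is automatic
      storage:
        size: 10Gi
        storageClass: local-storage        # hypothetical class from the local storage provisioner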

Not in use yet, but easy to make available

  • MongoDB: Operator which allows quickly spinning up and managing a MongoDB cluster. The initial configuration is a bit difficult, since MongoDB intentionally mixes up its commercial alternative with its OSS offering in the docs. We do, however, already have a config that works.
  • Memgraph: Graph database and a competitor to Neo4j (uses much less resources for the same performance). The company behind it is commercial, but doesn't lie about its tech like Neo4j does. Also has pretty good support and interacts well with OSS.
  • Kafka: Message queue system with one of the highest throughputs in the world. Easy to connect to, but can be a hassle to host due to the number of moving parts.
  • Janusgraph: Graph database. Requires a storage backend (e.g. Cassandra/ScyllaDB) and an index database (e.g. Elasticsearch). Is probably the best graph database that supports Gremlin, but is horrible to host due to all the moving parts. Requires at least 10 GB of RAM, since most of the dependencies run on the JVM and are built to be hyperconverged for scalability. Alternatives are being developed.

Network

  • Cilium: The Kubernetes CNI and one of the most important parts of our cluster. Handles traffic routing, monitoring, policies and much more. Also has support utilities like Hubble.
  • Tailscale: A commercial WireGuard-based mesh VPN. Can be used to allow access to and from any external and internal services over a private network, letting services communicate without being exposed on the internet. Has ACL rules that can restrict how and where data flows. Can also be used for authentication with SSH and Kubernetes. Works with both Ingress and Service resource types (see the sketch after this list). Netbird is planned to replace Tailscale.
  • Traefik: Used as an Ingress controller for public services. Works with the Hetzner external load balancer. In some cases Cilium gateways are used instead.
  • Hcloud controller: Creates Hetzner load balancers for the cluster.
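
A minimal sketch of exposing a Service on the tailnet via the Tailscale Kubernetes operator, assuming the operator is installed; the Service name is hypothetical and the loadBalancerClass approach is from Tailscale's docs as we recall them:

    apiVersion: v1
    kind: Service
    metadata:
      name: internal-api                   # hypothetical internal service
    spec:
      type: LoadBalancer
      loadBalancerClass: tailscale         # exposed on the tailnet, not the public internet
      selector:
        app: internal-api
      ports:
        - port: 80
          targetPort: 8080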

Observability (logs, metrics, tracing and uptime)

  • Grafana Loki: Log storage and querying.
  • Grafana Tempo: Trace storage and querying. Supports OpenTelemetry, so you can see the full path a request has taken, from the browser through any other services. Significantly helps with debugging traffic errors and more complex application errors.
  • Vector: Log processing. Has two services, an aggregator and a log agent. Responsible for pod log collection, parsing and aggregation/distribution. Can also be used to process logs or generate metrics from them (see the first sketch after this list).
  • Prometheus Node Exporter: Exposes hardware and OS metrics from each node for Prometheus or Vector to collect.
  • Grafana: Web GUI for the Grafana services (dashboards).
  • Gatus: Self-hosted system for uptime monitoring. Can use Kubernetes annotations to dynamically create uptime monitors (see the second sketch after this list).
  • Cilium Hubble: GUI that integrates with Cilium (which collects traces from the whole cluster). Differs from Grafana Tempo in that it is tightly knit with Cilium and catches all network traces from/to a node, namespace, deployment, pod, etc.; Grafana Tempo is more for applications that export their own traces or have auto-instrumentation.
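
A minimal sketch of a Vector agent config that ships pod logs to Loki, using Vector's kubernetes_logs source and loki sink; the Loki address is hypothetical:

    sources:
      k8s_logs:
        type: kubernetes_logs                     # collects logs from pods on the node
    sinks:
      loki:
        type: loki
        inputs: ["k8s_logs"]
        endpoint: http://loki.monitoring:3100     # hypothetical in-cluster Loki address
        encoding:
          codec: json
        labels:
          namespace: "{{ kubernetes.pod_namespace }}"   # label log streams by namespace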
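
And a minimal Gatus endpoint sketch (plain Gatus config; the URL is hypothetical):

    endpoints:
      - name: website
        url: https://example.com           # hypothetical endpoint to monitor
        interval: 60s
        conditions:
          - "[STATUS] == 200"              # alert when the status code deviates
          - "[RESPONSE_TIME] < 500"        # or when responses get slow (ms)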

GitOps

  • RenovateBot: Central and very powerful tool to manage updates. Checks Git repositories and makes PRs for new updates, compiling a detailed summary for each update with changelogs and dependency conflicts. Supports most package managers (Node.js, Composer, Flux, Helm and Dockerfiles) and systems. RenovateBot also supports checking for new versions of internal packages.
  • Mozilla SOPS: Used to encrypt/decrypt secrets, which lets even a public Git repository hold secrets without exposing them. Works with Flux and Git repositories. Allows encrypting secrets with a public key, without needing access to the private key that decrypts them (see the sketch after this list).
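
A minimal .sops.yaml sketch using age, assuming secrets live in *.sops.yaml files; the age recipient below is a placeholder public key:

    creation_rules:
      - path_regex: .*\.sops\.ya?ml                # which files to encrypt
        encrypted_regex: "^(data|stringData)$"     # only encrypt the secret values
        age: age1examplepublickey...               # placeholder recipient; private key stays offline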

Flux

All these are different parts of Flux:

  • Flux: Syncs/reconciles cluster state with a Git repository.
  • Flux Alerts: Sends messages to a defined Flux Provider (Slack, Matrix, a Git system, etc.) when it gets an event matching a ruleset (can be an error or just informational). See the sketch after this list.
  • Flux Provider: Defines where alerts are sent.
  • Flux Receiver: Sets up a webhook that can be called to reconcile the cluster based on a ruleset, e.g. when the webhook is called, all Helm charts in a namespace can be reconciled.
  • Flux Image Automation: Continuously checks package registries for new versions based on a ruleset; can be used to automatically update to new packages. Changes are first pushed to Git.
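
A minimal Provider + Alert sketch using Flux's notification API (notification.toolkit.fluxcd.io/v1beta3); the channel and Secret names are hypothetical:

    apiVersion: notification.toolkit.fluxcd.io/v1beta3
    kind: Provider
    metadata:
      name: slack
      namespace: flux-system
    spec:
      type: slack
      channel: infra-alerts                # hypothetical channel
      secretRef:
        name: slack-webhook                # Secret holding the webhook URL under "address"
    ---
    apiVersion: notification.toolkit.fluxcd.io/v1beta3
    kind: Alert
    metadata:
      name: infra-alerts
      namespace: flux-system
    spec:
      providerRef:
        name: slack
      eventSeverity: error                 # only errors; "info" would forward everything
      eventSources:
        - kind: HelmRelease
          name: "*"                        # watch all HelmReleases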

Security

  • CertManager: Handles certificates, either self-signed or through an external provider such as Let's Encrypt (see the first sketch after this list).
  • CrowdSec: Parses logs from Traefik to detect attempts at exploiting known vulnerabilities. Blocks and reports IPs if an attack is detected. Also supports blocklists from other CrowdSec users, so a known malicious actor can be blocked before they even connect to our systems. False positives are low thanks to a rating system based on attack types and frequency.
  • Kubeclarity: Scans images running in the cluster at intervals to check for CVEs. Can be used to detect vulnerable services.
  • Falco: Intrusion detection. Watches containers at runtime to see what commands are being run and flags odd behaviour.
  • Kyverno: Enforces policies on Kubernetes manifests; prevents or modifies manifests based on rules (see the second sketch after this list).
  • Tetragon: Kubernetes-aware security observability and runtime enforcement that applies policies and filtering directly with eBPF, allowing reduced observation overhead, tracking of any process and real-time policy enforcement. Highly performant, and by the same people as Cilium. Can be used to set up policies for blocking and monitoring anomalies in file system access and system calls. Requires some setup, but is very powerful if policies are created for each application.
  • Pod Security Admission: Built-in Kubernetes feature, mentioned here because not all clusters have it enabled. Talos configures it with the recommended baseline restriction, which prevents workloads from running in privileged mode without explicitly being allowed to do so. This often affects operators that inject sidecars, log agents, etc.; normal workloads shouldn't need privileged mode.
  • Zitadel: Authentication and OIDC provider; can be used by customers, internal services or our own services.
  • Keycloak: Authentication provider used previously. Less modern than Zitadel, but has functionality for time-based levels of access, which can be used to require stronger authentication in certain parts of applications.
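
A minimal ClusterIssuer sketch for Let's Encrypt via cert-manager's v1 API; the e-mail address and ingress class are hypothetical:

    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: letsencrypt-prod
    spec:
      acme:
        server: https://acme-v02.api.letsencrypt.org/directory
        email: ops@example.com             # hypothetical contact address
        privateKeySecretRef:
          name: letsencrypt-account-key    # ACME account key is stored here
        solvers:
          - http01:
              ingress:
                ingressClassName: traefik  # solve challenges through the Traefik ingress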
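
And a minimal Kyverno policy sketch (kyverno.io/v1) that rejects pods without resource requests; the policy is a hypothetical example, not necessarily one we run:

    apiVersion: kyverno.io/v1
    kind: ClusterPolicy
    metadata:
      name: require-requests               # hypothetical example policy
    spec:
      validationFailureAction: Enforce     # reject non-compliant manifests
      rules:
        - name: check-resources
          match:
            any:
              - resources:
                  kinds: ["Pod"]
          validate:
            message: "CPU and memory requests are required."
            pattern:
              spec:
                containers:
                  - resources:
                      requests:
                        cpu: "?*"          # any non-empty value
                        memory: "?*"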