# Security Best Practices Standards for Kubernetes
This document outlines security best practices for Kubernetes development. Following these standards helps mitigate common vulnerabilities, promotes secure configuration patterns, and strengthens the overall security posture of Kubernetes applications. The guidance is aligned with current Kubernetes versions and reflects modern security principles.
## 1. Pod Security Standards (PSS) and Pod Security Admission (PSA)
### Standard
* **Do This:** Enforce Pod Security Standards (PSS) at the namespace level using Pod Security Admission (PSA). Start in "audit" or "warn" mode to surface violations before switching to "enforce". Choose the "baseline" or "restricted" profile based on application needs.
* **Don't Do This:** Rely solely on network policies or RBAC for security without proper pod security context configuration. Avoid the "privileged" PSS profile in production environments.
### Explanation
PSS defines three security profiles: "privileged", "baseline", and "restricted". PSA leverages these profiles at the namespace level to control what type of pods are allowed to run. This is essential for preventing privilege escalation and reducing the attack surface.
### Example
"""yaml
apiVersion: v1
kind: Namespace
metadata:
name: my-namespace
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/enforce-version: latest
pod-security.kubernetes.io/audit: baseline # Or use 'warn'
pod-security.kubernetes.io/audit-version: latest
pod-security.kubernetes.io/warn: baseline
pod-security.kubernetes.io/warn-version: latest
"""
### Anti-Pattern
Not defining PSS at the namespace level, allowing pods with overly permissive security contexts to be deployed.
## 2. Secure Pod Configuration
### Standard
* **Do This:** Explicitly define the security context for each pod or container. Minimum requirements include:
* "runAsUser" and "runAsGroup": Ensure containers run as non-root users.
* "allowPrivilegeEscalation: false": Prevent processes from gaining more privileges.
* "readOnlyRootFilesystem: true": Mount the root filesystem as read-only where possible
* **Don't Do This:** Allow containers to run as "root" or with unnecessary capabilities. Grant excessive permissions without reviewing implications.
### Explanation
Properly configured security contexts prevent containers from gaining elevated privileges and limit the impact of potential vulnerabilities. Running as a non-root user prevents exploitation of root-owned files and directories.
### Example
"""yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: secure-app
spec:
replicas: 1
selector:
matchLabels:
app: secure-app
template:
metadata:
labels:
app: secure-app
spec:
securityContext:
runAsUser: 1000
runAsGroup: 1000
fsGroup: 1000
containers:
- name: app-container
image: your-image:latest
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL # Drop all capabilities by default and selectively add back only needed ones
seccompProfile:
type: RuntimeDefault
"""
### Anti-Pattern
* Omitting "securityContext" definitions, leading to default, often insecure configurations.
* Using "privileged: true" unnecessarily, which disables most security mechanisms.
## 3. Network Policies
### Standard
* **Do This:** Implement Network Policies to isolate applications at the network level. Default-deny all traffic and then selectively allow necessary communication based on labels and namespaces.
* **Don't Do This:** Rely solely on firewall rules outside the cluster. Avoid overly permissive network policies that negate isolation benefits.
### Explanation
Network Policies control traffic flow between Pods. A default-deny policy ensures that no traffic is allowed unless explicitly permitted, reducing the attack surface. This enhances segmentation and containment in multi-tenant environments.
### Example
"""yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: deny-all
spec:
podSelector: {}
ingress: [] # Deny all inbound traffic
egress: [] # Deny all outbound traffic
policyTypes:
- Ingress
- Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-from-namespace
spec:
podSelector:
matchLabels:
app: my-app
ingress:
- from:
- namespaceSelector:
matchLabels:
name: my-namespace
policyTypes:
- Ingress
"""
### Anti-Pattern
* Failing to implement Network Policies beyond simple examples.
* Creating overly broad policies that allow unrestricted traffic between namespaces or pods.
## 4. Secrets Management
### Standard
* **Do This:** Use Kubernetes Secrets to manage sensitive information like API keys and passwords. Encrypt Secrets at rest using a KMS provider. Rotate secrets regularly. Use external secrets operators that can automatically synchronize your secrets with providers like AWS Secrets Manager, HashiCorp Vault, or Azure Key Vault.
* **Don't Do This:** Store plaintext secrets in environment variables, configuration files, or container images. Never commit secrets to version control.
### Explanation
Kubernetes Secrets should be used to store sensitive data separately from application code and configuration. Encrypting secrets at rest adds an additional layer of protection.
### Example
"""yaml
apiVersion: v1
kind: Secret
metadata:
name: my-secret
type: Opaque
data:
# Base64 encoded values
username: dXNlcm5hbWU=
password: cGFzc3dvcmQ=
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
replicas: 1
selector:
matchLabels:
app: my-app
template:
metadata:
labels:
app: my-app
spec:
containers:
- name: app-container
image: your-image:latest
env:
- name: DB_USERNAME
valueFrom:
secretKeyRef:
name: my-secret
key: username
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: my-secret
key: password
"""
Using an External Secrets Operator:
"""yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: db-credentials
spec:
secretStoreRef:
name: vault-backend
kind: ClusterSecretStore
target:
name: db-credentials
data:
- secretKey: username
remoteRef:
key: secret/data/production/db
property: username
- secretKey: password
remoteRef:
key: secret/data/production/db
property: password
"""
### Anti-Pattern
* Storing secrets as plain text in YAML files or environment variables within Pod definitions.
* Not encrypting Secrets at rest, leaving them vulnerable to unauthorized access if the "etcd" data store is compromised.
* Configuring an external secrets operator improperly resulting in credential leakage or service disruption.
## 5. Image Security
### Standard
* **Do This:** Use minimal base images (e.g., Distroless, Alpine). Regularly scan container images for vulnerabilities using tools like Trivy, Clair, or Anchore. Enforce image signing and verification using admission controllers like Kyverno or Open Policy Agent (OPA).
* **Don't Do This:** Use images from untrusted sources or with known vulnerabilities. Store sensitive information in image layers.
### Explanation
Smaller images have a reduced attack surface. Regular scanning and patching help identify and mitigate vulnerabilities. Image signing ensures that only trusted images are deployed.
### Example
Dockerfile:
"""dockerfile
FROM gcr.io/distroless/static:latest
COPY my-app /
ENTRYPOINT ["/my-app"]
USER 1000
"""
Scanning with Trivy:
"""bash
trivy image your-image:latest
"""
Enforcing with Kyverno:
"""yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-verified-images
spec:
validationFailureAction: enforce
rules:
- name: check-image-signature
match:
any:
- resources:
kinds:
- Pod
verifyImages:
- imageReferences:
- "your-registry/*"
keyless:
rekor:
url: "https://rekor.sigstore.dev"
"""
### Anti-Pattern
* Using outdated base images that contain known vulnerabilities.
* Failing to regularly scan and update container images.
* Not implementing image signing and verification, allowing potentially malicious images to be deployed.
## 6. Role-Based Access Control (RBAC)
### Standard
* **Do This:** Implement RBAC to control access to Kubernetes resources. Follow the principle of least privilege, granting only the necessary permissions to users, service accounts, and groups. Define permissions with Roles and ClusterRoles, and grant them to subjects with RoleBindings and ClusterRoleBindings.
* **Don't Do This:** Grant cluster-admin privileges unnecessarily. Use wildcard permissions (e.g., "*") without careful consideration.
### Explanation
RBAC restricts access to Kubernetes API resources, preventing unauthorized actions. Least privilege ensures that users and service accounts can only perform the actions necessary for their function.
### Example
"""yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: pod-reader
namespace: my-namespace
rules:
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: read-pods
namespace: my-namespace
subjects:
- kind: ServiceAccount
name: my-service-account
namespace: my-namespace
roleRef:
kind: Role
name: pod-reader
apiGroup: rbac.authorization.k8s.io
"""
### Anti-Pattern
* Assigning the "cluster-admin" role to all users, which bypasses RBAC controls.
* Using overly permissive roles that grant more permissions than required.
* Failing to apply RBAC consistently across all resources and namespaces.
## 7. Auditing and Logging
### Standard
* **Do This:** Enable Kubernetes Auditing to track API calls and system events. Configure audit policies to capture relevant actions. Centralize logs using a log aggregation system (e.g., Elasticsearch, Fluentd, Kibana (EFK) stack, or Loki). Actively monitor logs for suspicious activity and security incidents.
* **Don't Do This:** Disable auditing or rely solely on container logs. Ignore security alerts or fail to investigate suspicious events.
### Explanation
Auditing provides a record of API server activity, which is crucial for detecting and investigating security incidents. Centralized logging allows for easier analysis and correlation of events.
### Example
Configuring Audit Policy:
"""yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
resources:
- groups: [""]
resources: ["pods"]
verbs: ["get", "list", "create", "update", "delete"]
- level: RequestResponse
users: ["system:serviceaccount:kube-system:default"]
"""
Configuring Fluentd to collect logs:
(Requires some configuration of Fluentd itself to forward logs)
"""yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluentd
namespace: kube-system
labels:
k8s-app: fluentd-logging
spec:
selector:
matchLabels:
k8s-app: fluentd-logging
template:
metadata:
labels:
k8s-app: fluentd-logging
spec:
tolerations:
- key: node-role.kubernetes.io/control-plane
effect: NoSchedule
- key: node-role.kubernetes.io/master
effect: NoSchedule
containers:
- name: fluentd
image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
env:
# ...Fluentd environment variables...
volumeMounts:
- name: varlog
mountPath: /var/log
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
terminationGracePeriodSeconds: 30
volumes:
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
"""
### Anti-Pattern
* Disabling auditing entirely or using overly verbose audit policies that generate excessive noise.
* Not centralizing logs, making it difficult to analyze and correlate events across the cluster.
* Ignoring security alerts and failing to investigate suspicious activity promptly.
## 8. Runtime Security
### Standard
* **Do This:** Implement runtime security monitoring using tools like Falco or Sysdig. Define rules to detect anomalous behavior (e.g., unexpected process execution, file access, or network connections).
* **Don't Do This:** Rely solely on static analysis or vulnerability scanning. Ignore runtime alerts or fail to respond to security incidents.
### Explanation
Runtime security monitoring provides real-time visibility into container activity, allowing for detection and prevention of attacks that exploit zero-day vulnerabilities or bypass other security controls.
### Example
Falco rule detecting shell execution in a container:
"""yaml
- rule: Shell in container
desc: Detect attempts to run shell inside container
condition: spawned_process and container and shell_procs
output: "Shell run in container (user=%user.name command=%proc.cmdline pid=%proc.pid container_id=%container.id container_name=%container.name image=%container.image.repository)"
priority: WARNING
tags: [shell, container]
"""
Install Falco using Helm:
"""bash
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm install falco falcosecurity/falco
"""
### Anti-Pattern
* Not implementing runtime security monitoring, leaving the cluster vulnerable to attacks that bypass existing security controls.
* Using overly permissive runtime rules that generate excessive false positives.
* Failing to respond to runtime alerts, allowing attacks to progress unchecked.
## 9. Service Mesh
### Standard
* **Do This:** Implement mutual TLS (mTLS) for secure service-to-service communication within your service mesh. Enforce strong identity and authentication policies. Utilize service mesh features like traffic encryption, access control, rate limiting, and observability.
* **Don't Do This:** Rely solely on network policies without mTLS. Expose internal services directly without proper authentication and authorization.
### Explanation
A service mesh (e.g., Istio, Linkerd) provides a transparent and consistent way to secure, connect, and manage microservices. mTLS ensures that only authorized services can communicate with each other.
### Example
"""yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: my-namespace
spec:
mtls:
mode: STRICT #Enforce mutual TLS
"""
### Anti-Pattern
* Not implementing mTLS, leaving service-to-service communication vulnerable to eavesdropping and tampering.
* Misconfiguring service mesh policies, leading to insecure communication patterns.
* Failing to monitor service mesh metrics, making it difficult to detect and respond to security incidents.
## 10. Principle of Least Privilege for Service Accounts
### Standard
* **Do This:** Assign dedicated service accounts to each application or workload requiring Kubernetes API access. Follow the principle of least privilege.
* **Don't Do This:** Mount the default service account token into pods that don't require it. Grant cluster-admin privileges.
### Explanation
Service accounts provide an identity for pods. Adhering to the principle of least privilege ensures that only necessary permissions are granted, limiting the impact of a potential compromise. Disable auto-mounting of the service account token wherever a pod does not need Kubernetes API access.
### Example
"""yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: my-app-sa
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
replicas: 1
selector:
matchLabels:
app: my-app
template:
metadata:
labels:
app: my-app
spec:
serviceAccountName: my-app-sa
automountServiceAccountToken: false # disable default token
containers:
- name: app-container
image: your-image:latest
"""
### Anti-Pattern
* Relying on the "default" service account for all pods, which typically has broader permissions than necessary.
* Granting service accounts excessive permissions, such as the ability to create or delete resources across the entire cluster.
## 11. Prevent Host Path Mounts
### Standard
* **Do This:** Use Kubernetes built-in storage abstractions ("Volume", "PersistentVolume", "PersistentVolumeClaim") or cloud provider managed storage. Use admission controllers like OPA or Kyverno to restrict hostPath volume mounts. Reserve local persistent volumes for workloads whose performance requirements genuinely justify tying data to a specific host.
* **Don't Do This:** Use hostPath volumes, which bypass the Kubernetes resource model and pose a security risk.
### Explanation
HostPath volumes mount local directories into pods, bypassing the Kubernetes resource model and potentially providing access to sensitive host system files and directories.
### Example
Kyverno rule to disallow HostPath mounts:
"""yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: disallow-host-path
spec:
validationFailureAction: enforce
rules:
- name: check-host-path
match:
any:
- resources:
kinds:
- Pod
validate:
message: "HostPath volumes are not allowed. Use PersistentVolumeClaims instead."
pattern:
spec:
volumes:
- hostPath: "null"
"""
### Anti-Pattern
* Using HostPath volumes for data persistence, creating a dependency on specific nodes and potentially exposing sensitive host system files.
## 12. Avoid Running Processes as Root
### Standard
* **Do This:** Ensure containers run as non-root users in the container image and set security context. Specify "runAsUser", "runAsGroup", and "fsGroup" in pod spec.
* **Don't Do This:** Run containers as root, which poses a security risk and violates the principle of least privilege.
### Explanation
Running containers as root lets a compromised process modify root-owned files and widens the blast radius of any container-escape vulnerability.
### Example
"""yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: non-root-app
spec:
replicas: 1
selector:
matchLabels:
app: non-root-app
template:
metadata:
labels:
app: non-root-app
spec:
securityContext:
runAsUser: 1000
runAsGroup: 1000
fsGroup: 1000
containers:
- name: app-container
image: your-image:latest
securityContext:
allowPrivilegeEscalation: false
"""
### Anti-Pattern
* Running containers as root, which bypasses many of Kubernetes' security controls.
* Omitting "securityContext" and letting containers fall back to image defaults, which frequently means running as root.
Adhering to these standards will greatly improve the security posture of your Kubernetes deployments, while also providing a solid foundation for maintainability, performance, and reliable operation. Remember to consult the official Kubernetes documentation regularly for updates and further guidance.
danielsogl
Created Mar 6, 2025
This guide explains how to effectively use .clinerules
with Cline, the AI-powered coding assistant.
The .clinerules
file is a powerful configuration file that helps Cline understand your project's requirements, coding standards, and constraints. When placed in your project's root directory, it automatically guides Cline's behavior and ensures consistency across your codebase.
Place the .clinerules
file in your project's root directory. Cline automatically detects and follows these rules for all files within the project.
# Project Overview project: name: 'Your Project Name' description: 'Brief project description' stack: - technology: 'Framework/Language' version: 'X.Y.Z' - technology: 'Database' version: 'X.Y.Z'
# Code Standards standards: style: - 'Use consistent indentation (2 spaces)' - 'Follow language-specific naming conventions' documentation: - 'Include JSDoc comments for all functions' - 'Maintain up-to-date README files' testing: - 'Write unit tests for all new features' - 'Maintain minimum 80% code coverage'
# Security Guidelines security: authentication: - 'Implement proper token validation' - 'Use environment variables for secrets' dataProtection: - 'Sanitize all user inputs' - 'Implement proper error handling'
Be Specific
Maintain Organization
Regular Updates
# Common Patterns Example patterns: components: - pattern: 'Use functional components by default' - pattern: 'Implement error boundaries for component trees' stateManagement: - pattern: 'Use React Query for server state' - pattern: 'Implement proper loading states'
Commit the Rules
.clinerules
in version controlTeam Collaboration
Rules Not Being Applied
Conflicting Rules
Performance Considerations
# Basic .clinerules Example project: name: 'Web Application' type: 'Next.js Frontend' standards: - 'Use TypeScript for all new code' - 'Follow React best practices' - 'Implement proper error handling' testing: unit: - 'Jest for unit tests' - 'React Testing Library for components' e2e: - 'Cypress for end-to-end testing' documentation: required: - 'README.md in each major directory' - 'JSDoc comments for public APIs' - 'Changelog updates for all changes'
# Advanced .clinerules Example project: name: 'Enterprise Application' compliance: - 'GDPR requirements' - 'WCAG 2.1 AA accessibility' architecture: patterns: - 'Clean Architecture principles' - 'Domain-Driven Design concepts' security: requirements: - 'OAuth 2.0 authentication' - 'Rate limiting on all APIs' - 'Input validation with Zod'
# Core Architecture Standards for Kubernetes This document outlines the core architecture standards for Kubernetes development. It serves as a guide for developers and a context for AI coding assistants to ensure code quality, maintainability, performance, and security within the Kubernetes ecosystem. These standards are based on the latest version of Kubernetes and incorporate modern best practices. ## 1. Fundamental Architectural Patterns Kubernetes, at its core, follows a distributed systems architecture. Understanding these underlying patterns is critical for building robust components: ### 1.1. Control Plane and Data Plane Separation **Standard:** Clearly delineate between control plane and data plane responsibilities. Control plane components manage the cluster state, schedule workloads, and handle API requests. Data plane components execute the workloads. **Do This:** Design components with clear separation of concerns. Control plane components (e.g., controllers) should focus on declarative state management. Data plane components (e.g., kubelet) should focus on executing instructions from the control plane. **Don't Do This:** Avoid blurring the lines by embedding workload execution logic directly within control plane controllers. Avoid control plane operations directly modifying workloads without proper orchestration. **Why This Matters:** Separation of concerns improves resilience. A failing node (data plane) shouldn't affect the cluster's ability to schedule new workloads (control plane). It also enhances scalability and simplifies debugging. **Code Example:** """go // Controller managing a custom resource package controller import ( "context" appsv1 "k8s.io/api/apps/v1" corev1 "k8s.io/api/core/v1" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" "k8s.io/apimachinery/pkg/runtime" ctrl "sigs.k8s.io/controller-runtime" "sigs.k8s.io/controller-runtime/pkg/client" "sigs.k8s.io/controller-runtime/pkg/log" mygroupv1alpha1 "example.com/my-operator/api/v1alpha1" ) // MyResourceReconciler reconciles a MyResource object type MyResourceReconciler struct { client.Client Scheme *runtime.Scheme } // +kubebuilder:rbac:groups=mygroup.example.com,resources=myresources,verbs=get;list;watch;create;update;patch;delete // +kubebuilder:rbac:groups=mygroup.example.com,resources=myresources/status,verbs=get;update;patch // +kubebuilder:rbac:groups=mygroup.example.com,resources=myresources/finalizers,verbs=update // +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete // +kubebuilder:rbac:groups=core,resources=services,verbs=get;list;watch;create;update;patch;delete func (r *MyResourceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { log := log.FromContext(ctx) var myResource mygroupv1alpha1.MyResource if err := r.Get(ctx, req.NamespacedName, &myResource); err != nil { log.Error(err, "unable to fetch MyResource") return ctrl.Result{}, client.IgnoreNotFound(err) } // Define the desired Deployment deployment := &appsv1.Deployment{ ObjectMeta: metav1.ObjectMeta{ Name: myResource.Name + "-deployment", Namespace: myResource.Namespace, }, Spec: appsv1.DeploymentSpec{ Replicas: myResource.Spec.Replicas, Selector: &metav1.LabelSelector{ MatchLabels: map[string]string{"app": myResource.Name}, }, Template: corev1.PodTemplateSpec{ ObjectMeta: metav1.ObjectMeta{ Labels: map[string]string{"app": myResource.Name}, }, Spec: corev1.PodSpec{ Containers: []corev1.Container{ { Name: "my-container", Image: myResource.Spec.Image, }, }, }, }, }, } 
if err := ctrl.SetControllerReference(&myResource, deployment, r.Scheme); err != nil { return ctrl.Result{}, err } // Check if the Deployment already exists, if not create a new one existingDeployment := &appsv1.Deployment{} err := r.Get(ctx, client.ObjectKey{Name: deployment.Name, Namespace: deployment.Namespace}, existingDeployment) if err != nil && client.IgnoreNotFound(err) != nil { return ctrl.Result{}, err } else if client.IgnoreNotFound(err) != nil { log.Info("Creating a new Deployment", "Deployment.Namespace", deployment.Namespace, "Deployment.Name", deployment.Name) if err = r.Create(ctx, deployment); err != nil { return ctrl.Result{}, err } //Deployment created successfuly - return and requeue return ctrl.Result{Requeue: true}, nil } else { // Update the existing Deployment if needed // (e.g., update replicas) log.Info("Updating existing Deployment", "Deployment.Namespace", deployment.Namespace, "Deployment.Name", deployment.Name) existingDeployment.Spec.Replicas = myResource.Spec.Replicas // Example update logic err = r.Update(ctx, existingDeployment) if err != nil { return ctrl.Result{}, err } return ctrl.Result{Requeue: true}, nil } return ctrl.Result{}, nil } // SetupWithManager sets up the controller with the Manager. func (r *MyResourceReconciler) SetupWithManager(mgr ctrl.Manager) error { return ctrl.NewControllerManagedBy(mgr). For(&mygroupv1alpha1.MyResource{}). Owns(&appsv1.Deployment{}). Complete(r) } """ **Common Anti-Pattern:** Writing controllers that directly manipulate pods instead of deployments or other higher-level abstractions. This tightly couples the control plane to the data plane and makes it difficult to reason about the system's state. ### 1.2. Declarative Configuration **Standard:** Embrace declarative configuration using Kubernetes resources (e.g., Deployments, Services, ConfigMaps). Controllers should reconcile the actual state towards the desired state defined in the resources. **Do This:** Define the desired state in YAML or JSON manifests and use "kubectl apply" or similar tools to submit them to the API server. Use controllers to watch for changes in these resources and automatically adjust the system. Leverage operators extending Kubernetes to manage custom resources **Don't Do This:** Avoid imperative commands that directly modify pod configuration, circumventing the API server and the control loop. Avoid manual configuration changes that are not reflected in the declarative state stored in Kubernetes. **Why This Matters:** Declarative configuration provides idempotency, auditability, and version control. It allows for easy rollback to previous states and ensures consistency across the cluster. **Code Example:** """yaml # Deployment manifest apiVersion: apps/v1 kind: Deployment metadata: name: my-app spec: replicas: 3 selector: matchLabels: app: my-app template: metadata: labels: app: my-app spec: containers: - name: my-container image: nginx:latest ports: - containerPort: 80 """ **Common Anti-Pattern:** Using "kubectl exec" to modify configuration files within a running pod. This is an imperative approach that bypasses the declarative configuration management system. ### 1.3. Reconciliation Loops **Standard:** Implement controllers using reconciliation loops that continuously monitor the desired state and the actual state of the system. **Do This:** Use a framework like controller-runtime to build your controllers. This framework provides abstractions for event handling, state management, and error handling. 
Implement idempotent reconciliation logic. The loop should converge towards your objective even if the state is changed outside of the loop. **Don't Do This:** Write one-off scripts or tools that only run once to configure the system. Avoid controllers that trigger actions only on events without a background reconciliation process. **Why This Matters:** Reconciliation loops ensure that the system converges towards the desired state, even in the face of failures or unexpected events. **Code Example:** See Controller example in 1.1. **Common Anti-Pattern:** Implementing controllers that only react to events and don't have a reconciliation loop. This can lead to inconsistencies if events are missed or processed out of order. ### 1.4. Event-Driven Architecture within Kubernetes **Standard**: Utilize Kubernetes events for non-critical notifications and insights. Events should supplement, not replace, the core reconciliation loop. **Do This**: Emit events for significant state transitions within your controllers or operators. Subscribe to relevant events to gain insights into cluster behavior, but rely primarily on your reconciliation loop for managing the system's state. Consider using tools like KEDA (Kubernetes Event-driven Autoscaling) to automatically scale deployments based on events. **Don't Do This**: Build core logic that depends solely on receiving specific events. This creates fragile dependencies and makes your system more difficult to debug. Overload the event system with unnecessary debug output or high-frequency notifications. **Why This Matters**: Kubernetes events provide a valuable telemetry stream that can be used for monitoring, alerting, and troubleshooting and autoscaling. This architecture improves auditability by establishing a detailed log of activity. **Code Example:** """go // Emit an event when a MyResource is successfully processed. func (r *MyResourceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { // ... your reconciliation logic ... if err == nil { r.Recorder.Event(&myResource, corev1.EventTypeNormal, "MyResourceProcessed", "Successfully processed MyResource") } } """ **Common Anti-Pattern**: Building interdependent chains of operations that rely solely on Kubernetes events to trigger the next step. Instead, reconcile status changes of resources. ## 2. Project Structure and Organization A well-organized project structure promotes maintainability and collaboration. ### 2.1. Standard Project Layout **Standard:** Adhere to the standard Go project layout. Utilize "go modules" for dependency management. **Do This:** Structure your project with the following directories: - "api/": API definitions (CRDs). - "controllers/": Controller implementations. - "cmd/": Command-line tools. - "config/": Kubernetes manifests for deploying the application. - "internal/": internal use packages **Don't Do This:** Avoid a flat project structure with all files in a single directory. Do not commit dependencies into the repository; always rely on go modules. **Why This Matters:** A standardized layout makes it easier for new contributors to understand the project and find relevant code. "Go modules" ensures reproducible builds. 
**Code Example:** """ my-project/ ├── api/ │ └── v1alpha1/ │ ├── myresource_types.go │ └── groupversion_info.go ├── controllers/ │ └── myresource_controller.go ├── cmd/ │ └── manager/ │ └── main.go ├── config/ │ ├── crd/ │ │ └── kustomization.yaml │ ├── default/ │ │ ├── kustomization.yaml │ │ ├── kustomizeconfig.yaml │ │ └── manager_auth_proxy_patch.yaml │ ├── manager/ │ │ ├── kustomization.yaml │ │ └── manager.yaml │ ├── prometheus/ │ │ └── kustomization.yaml │ └── rbac/ │ ├── auth_proxy_client_role.yaml │ ├── auth_proxy_patch.yaml │ ├── kustomization.yaml │ ├── role_binding.yaml │ └── role.yaml ├── go.mod ├── go.sum └── main.go """ **Common Anti-Pattern:** Mixing API definitions, controller logic, and command-line tools in the same directory. ### 2.2. Package Naming **Standard:** Use descriptive and consistent package names. Package names should be short, lowercase, and avoid underscores or dashes. **Do This:** Name packages according to their purpose. For example, "pkg/controller" for controller implementations, "pkg/apis/v1alpha1" for API definitions. **Don't Do This:** Use generic package names like "utils" or "helpers" without clear context. Use arbitrarily abbreviated names that are not self-explanatory. **Why This Matters:** Clear package names improve code discoverability and reduce ambiguity. **Code Example:** """go package controller // Good: Clearly indicates this contains controller logic package myapp // Good: Specifies functionality related to "myapp" package v1 // Also good, if this is api version v1 """ **Common Anti-Pattern:** Using the same package name for multiple unrelated modules or using cryptic abbreviations. ### 2.3. API Versioning **Standard:** Properly version your APIs using the Kubernetes API versioning scheme (e.g., "v1", "v1alpha1", "v1beta1"). **Do This:** Use semantic versioning principles when incrementing API versions. Breaking changes should be introduced in a new API version. Use CRDs for extending Kubernetes with custom resources. **Don't Do This:** Introduce breaking changes without incrementing the API version. Modify existing API objects in-place without providing a migration path. **Why This Matters:** API versioning ensures backward compatibility and allows users to upgrade their applications without breaking existing functionality. **Code Example:** """go // api/v1alpha1/myresource_types.go package v1alpha1 import ( metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" ) // MyResourceSpec defines the desired state of MyResource type MyResourceSpec struct { // ... } // MyResourceStatus defines the observed state of MyResource type MyResourceStatus struct { // ... } // +kubebuilder:object:root=true // +kubebuilder:subresource:status // MyResource is the Schema for the myresources API type MyResource struct { metav1.TypeMeta "json:",inline"" metav1.ObjectMeta "json:"metadata,omitempty"" Spec MyResourceSpec "json:"spec,omitempty"" Status MyResourceStatus "json:"status,omitempty"" } // +kubebuilder:object:root=true // MyResourceList contains a list of MyResource type MyResourceList struct { metav1.TypeMeta "json:",inline"" metav1.ListMeta "json:"metadata,omitempty"" Items []MyResource "json:"items"" } func init() { SchemeBuilder.Register(&MyResource{}, &MyResourceList{}) } """ **Common Anti-Pattern:** Making incompatible changes to an existing API without incrementing the version number. ## 3. Concurrency and Error Handling Kubernetes is a highly concurrent environment. Proper concurrency and error handling are crucial for building reliable components. 
### 3.1. Goroutines and Synchronization **Standard:** Use goroutines for concurrent operations, but manage them carefully to avoid resource leaks. Utilize synchronization primitives (e.g., mutexes, channels) to protect shared data. **Do This:** Use "errgroup.Group" from the "golang.org/x/sync/errgroup" package to manage a group of goroutines and handle errors collectively. Implement graceful shutdown mechanisms to wait for goroutines to complete before exiting. **Don't Do This:** Launch goroutines without a mechanism for waiting for them to complete or handling their errors. Access shared data without proper synchronization. **Why This Matters:** Unmanaged goroutines can lead to memory leaks and performance degradation. Race conditions can occur if shared data is not properly protected. **Code Example:** """go import ( "context" "fmt" "sync" "golang.org/x/sync/errgroup" ) func processData(ctx context.Context, data []string, workerCount int) error { var mu sync.Mutex results := make([]string, 0, len(data)) g, ctx := errgroup.WithContext(ctx) dataCh := make(chan string, len(data)) for _, d := range(data) { dataCh <- d } close(dataCh) for i := 0; i < workerCount; i++ { g.Go(func() error { for d := range dataCh { select { case <-ctx.Done(): return ctx.Err() default: processed, err := processSingleItem(ctx, d) if err != nil { return fmt.Errorf("processing item %s failed: %w", d, err) } mu.Lock() results = append(results, processed) mu.Unlock() fmt.Printf("Worker %d processed %s \n", i, d) } } return nil }) } if err := g.Wait(); err != nil { return fmt.Errorf("error during processing: %w", err) } fmt.Printf("Results: %v \n", results) return nil } func processSingleItem(ctx context.Context, item string) (string, error) { // Simulate some work //time.Sleep(time.Millisecond * 100) return fmt.Sprintf("Processed: %s", item), nil } """ **Common Anti-Pattern:** Ignoring potential race conditions when accessing shared data from multiple goroutines. ### 3.2. Error Handling **Standard:** Handle errors explicitly and gracefully. Provide informative error messages that aid in debugging. **Do This:** Use the "errors" package to wrap errors with context information. Implement retry logic for transient errors. Emit Kubernetes events to signal error conditions. Propagate errors up the call stack. **Don't Do This:** Ignore errors or use generic error messages that provide no context. Retry operations indefinitely without a backoff strategy. **Why This Matters:** Proper error handling prevents cascading failures and makes it easier to diagnose problems. **Code Example:** """go import ( "fmt" "errors" ) func doSomething() error { if err := somethingElse(); err != nil { return fmt.Errorf("failed to do something: %w", err) } return nil } func somethingElse() error { return errors.New("something went wrong") } """ **Common Anti-Pattern:** Using "panic" for recoverable errors. "panic" should only be used for unrecoverable errors that indicate a bug in the code. Recoverable errors should be handled using the "error" interface. ### 3.3. Context Propagation **Standard:** Propagate "context.Context" throughout the call stack to pass deadlines, cancellation signals, and request-scoped values. **Do This:** Accept "context.Context" as the first argument to all functions that perform I/O operations or long-running operations. Check the context for cancellation signals and return early if the context is canceled. Use "context.WithValue" to pass request-scoped values. 
**Don't Do This:** Ignore the "context.Context" or create a new context without propagating the existing one. Store the context in a global variable. **Why This Matters:** Context propagation allows for graceful shutdown and cancellation of operations in response to user requests or system events. **Code Example:** """go func doSomething(ctx context.Context) error { select { case <-ctx.Done(): return ctx.Err() // Return early if the context is canceled default: // Perform the operation } return nil } """ **Common Anti-Pattern:** Ignoring cancellation signals from the context, leading to long-running operations that continue even after the user has canceled the request. ## 4. Performance Optimization Kubernetes components should be designed for optimal performance and scalability. ### 4.1. Caching **Standard:** Use caching to reduce latency and improve throughput. **Do This:** Cache frequently accessed data in memory. Use a distributed cache (e.g., Memcached, Redis) for data that needs to be shared across multiple instances. Invalidate the cache when data changes. Use client-side caching where appropriate. **Don't Do This:** Cache data without a proper invalidation strategy, leading to stale data. Cache excessive amounts of data, leading to memory pressure. **Why This Matters:** Caching reduces the load on backend services and improves the responsiveness of the system. **Code Example:** """go import ( "sync" "time" ) type Cache struct { mu sync.RWMutex items map[string]interface{} } func NewCache() *Cache { return &Cache{ items: make(map[string]interface{}), } } func (c *Cache) Get(key string) (interface{}, bool) { c.mu.RLock() defer c.mu.RUnlock() val, ok := c.items[key] return val, ok } func (c *Cache) Set(key string, value interface{}) { c.mu.Lock() defer c.mu.Unlock() c.items[key] = value } func (c *Cache) Delete(key string) { c.mu.Lock() defer c.mu.Unlock() delete(c.items, key) } // Example usage with expiration: func (c *Cache) SetWithExpiration(key string, value interface{}, ttl time.Duration) { c.Set(key, value) time.AfterFunc(ttl, func() { c.Delete(key) }) } """ **Common Anti-Pattern:** Caching sensitive data without proper encryption or access control. ### 4.2. Efficient Data Structures and Algorithms **Standard:** Choose appropriate data structures and algorithms for the task at hand. **Do This:** Use efficient data structures like hash maps and sets for fast lookups. Avoid inefficient algorithms like nested loops for large datasets. Profile your code to identify performance bottlenecks. **Don't Do This:** Use inefficient data structures or algorithms without considering their performance implications. Optimize code prematurely without profiling. **Why This Matters:** Efficient data structures and algorithms can significantly improve the performance of the system. **Code Example:** """go // Using a hash map for fast lookups myMap := make(map[string]string) myMap["key1"] = "value1" value := myMap["key1"] // Fast lookup """ **Common Anti-Pattern:** Using linear search on a large dataset when a hash map would be more efficient. ### 4.3. Resource Limits and Quotas **Standard:** Define resource limits and quotas for all Kubernetes resources. **Do This:** Set CPU and memory limits for containers. Use resource quotas to limit the total amount of resources that can be consumed by a namespace. Monitor resource usage and adjust limits and quotas as needed. **Don't Do This:** Leave resource limits and quotas undefined, leading to uncontrolled resource consumption. 
**Why This Matters:** Resource limits and quotas prevent resource exhaustion and ensure fair resource allocation across the cluster. **Code Example:** """yaml # Pod manifest with resource limits apiVersion: v1 kind: Pod metadata: name: my-pod spec: containers: - name: my-container image: nginx:latest resources: limits: cpu: "1" memory: "1Gi" requests: cpu: "0.5" memory: "512Mi" """ **Common Anti-Pattern:** Setting unrealistically high resource limits, effectively disabling resource control. ## 5. Security Best Practices Security is paramount in Kubernetes. All components should be designed with security in mind. ### 5.1. RBAC (Role-Based Access Control) **Standard:** Use RBAC to control access to Kubernetes resources. **Do This:** Define roles and role bindings to grant specific permissions to users, groups, and service accounts. Follow the principle of least privilege, granting only the necessary permissions. Utilize Kubernetes built-in roles where possible. **Don't Do This:** Grant cluster-admin privileges to all users. Use overly permissive roles that grant unnecessary access. **Why This Matters:** RBAC limits the potential impact of security breaches by restricting access to sensitive resources. **Code Example:** """yaml # Role definition apiVersion: rbac.authorization.k8s.io/v1 kind: Role metadata: name: pod-reader rules: - apiGroups: [""] resources: ["pods"] verbs: ["get", "list", "watch"] """ **Common Anti-Pattern:** Using the "cluster-admin" role for service accounts that only need access to a limited set of resources. ### 5.2. Secure Communication **Standard:** Use TLS encryption for all communication between Kubernetes components. **Do This:** Enable TLS for the API server, kubelet, and other control plane components. Use secure communication channels (e.g., HTTPS) for accessing external services. Implement mutual TLS (mTLS) authentication for enhanced security (Service Mesh). **Don't Do This:** Disable TLS or use insecure communication channels. Store TLS certificates in plaintext. **Why This Matters:** TLS encryption protects sensitive data from eavesdropping and tampering. **Code Example:** (Configuration of TLS is highly dependent on the specific component and deployment environment.) ### 5.3. Container Security **Standard:** Secure your container images and runtime environment. **Do This:** Use minimal base images. Scan container images for vulnerabilities. Run containers as non-root users. Use Pod Security Standards or Admission Controllers to enforce security policies. **Don't Do This:** Use images from untrusted sources. Run containers as root users. Expose sensitive information in container images. **Why This Matters:** Container security reduces the attack surface and mitigates the impact of vulnerabilities. **Code Example:** """dockerfile # Minimal Dockerfile example FROM alpine:latest RUN adduser -D myuser USER myuser COPY myapp /app/ CMD ["/app/myapp"] """ **Common Anti-Pattern:** Running containers as root users, which can allow attackers to escape the container and gain access to the host system. ### 5.4. Data Protection **Standard**: Protect sensitive data at rest and in transit. **Do This**: Encrypt sensitive data stored in etcd using Kubernetes' encryption providers. Use Secrets to store sensitive information, and consider using external secret management solutions like Vault. Ensure that data transmitted between services is encrypted using TLS. **Don't Do This**: Store sensitive data in ConfigMaps or environment variables without encryption. 
Hardcode credentials directly in code or configuration files. **Why This Matters**: Protecting sensitive data reduces the risk of data breaches and ensures compliance with regulatory requirements. **Code Example**: """yaml # Example Secret that references a Vault secret: apiVersion: v1 kind: Secret metadata: name: my-secret type: Opaque stringData: username: ENC[vault:secret/data/myapp/db:username] password: ENC[vault:secret/data/myapp/db:password] """ **Common Anti-Pattern**: Storing unencrypted API keys or passwords in environment variables or configuration files, particularly within public repositories. ## 6. Monitoring and Logging Monitoring and logging are essential for understanding the behavior of Kubernetes components and troubleshooting issues. ### 6.1. Metrics **Standard:** Expose metrics for all key performance indicators (KPIs). **Do This:** Use the Prometheus format for exposing metrics. Use labels to provide context for metrics. Implement alerting rules to detect anomalies and potential problems. Utilize the metrics collection/aggregation pipeline. **Don't Do This:** Expose too few metrics, making it difficult to diagnose problems. Expose too many metrics, overwhelming the monitoring system. **Why This Matters:** Metrics provide visibility into the health and performance of the system. **Code Example:** """go import ( "github.com/prometheus/client_golang/prometheus" "github.com/prometheus/client_golang/prometheus/promauto" ) var ( myCounter = promauto.NewCounter(prometheus.CounterOpts{ Name: "my_app_requests_total", Help: "Total number of requests.", }) ) func handleRequest() { myCounter.Inc() // ... } """ ### 6.2. Logging **Standard:** Log all significant events and errors. **Do This:** Use structured logging (e.g., JSON) for easier parsing and analysis. Include context information in log messages. Send logs to a central logging system (e.g., Elasticsearch, Loki). Adhere to K8s Logging Architecture. **Don't Do This:** Log excessive amounts of data, overwhelming the logging system. Log sensitive information in plaintext. **Why This Matters:** Logs provide a historical record of system events and are invaluable for debugging and auditing. **Code Example:** """go import ( "go.uber.org/zap" ) func doSomething() { logger, _ := zap.NewProduction() defer logger.Sync() logger.Info("Doing something", zap.String("component", "my-component"), zap.Int("attempt", 3), ) } """ **Common Anti-Pattern:** Writing log messages that are difficult to understand or parse. ### 6.3. Tracing **Standard:** Implement distributed tracing to track requests as they propagate through the system. **Do This:** Use a tracing library (e.g., OpenTelemetry) to instrument your code. Propagate tracing context across service boundaries. Send traces to a tracing backend (e.g., Jaeger, Zipkin). **Don't Do This:** Ignore tracing, making it difficult to diagnose performance problems or understand the flow of requests. Increase the latency of the requests because of the overhead. **Why This Matters:** Tracing helps to identify performance bottlenecks and understand the interactions between different components. This provides valuable context for debugging cross-service issues. **Code Example:** (Implementation of tracing is dependent on the specific tracing library and the application architecture.) ## 7. Conclusion Following these coding standards will help ensure the creation of robust, maintainable, performant, and secure Kubernetes components. 
This document should be considered a living document, and contributions are welcome to keep it up-to-date with the latest best practices and advancements in the Kubernetes ecosystem.
# Code Style and Conventions Standards for Kubernetes This document outlines the code style and conventions standards for contributing to the Kubernetes project. Adhering to these standards ensures consistency, readability, maintainability, and security across the codebase. These guidelines apply to all languages used in Kubernetes, with a primary focus on Go and YAML, and take into account the most recent versions of Kubernetes. ## 1. General Principles * **Consistency:** Maintain a consistent style across all files and packages. Use linters and formatters to enforce style rules automatically. * **Readability:** Write code that is easy to understand. Use meaningful names, keep functions short, and add comments when necessary. * **Maintainability:** Design code that is easy to modify and extend. Follow SOLID principles and avoid code duplication. * **Testability:** Ensure all code is easily testable. Write unit tests, integration tests, and end-to-end tests. * **Security:** Write secure code. Follow security best practices, such as input validation, output encoding, and least privilege. Use static analysis tools to identify potential vulnerabilities. * **Error Handling:** Implement robust error handling. Return errors, log errors, and handle errors gracefully. Never ignore errors. ## 2. Go Coding Standards ### 2.1. Formatting * **Use "go fmt":** All Go code *must* be formatted with "go fmt". This tool automatically formats Go code according to the standard style. **Do This:** """bash go fmt ./... """ **Don't Do This:** """go // non-standard formatting func main () { println("Hello, World!") } //messy """ * **Line Length:** Keep lines reasonably short (ideally under 120 characters). This improves readability in most editors and IDEs. **Do This:** """go // Properly wrapped line for readability err := client.Create(context.Background(), &corev1.Pod{ ObjectMeta: metav1.ObjectMeta{ Name: "my-pod", Namespace: "default", }, Spec: corev1.PodSpec{ Containers: []corev1.Container{{ Name: "nginx", Image: "nginx:latest", }}, }, }) if err != nil { klog.Error(err, "Failed to create pod") return err } """ **Don't Do This:** """go err := client.Create(context.Background(), &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "my-pod", Namespace: "default",}, Spec: corev1.PodSpec{ Containers: []corev1.Container{{ Name: "nginx", Image: "nginx:latest",}},},}) // Very long line. Hard to read if err != nil { klog.Error(err, "Failed to create pod"); return err; } // hard to read single line """ * **Imports:** Use grouped imports with standard library imports first, followed by external imports, and then internal imports (separated by blank lines). **Do This:** """go import ( "context" "fmt" "time" "github.com/go-logr/logr" corev1 "k8s.io/api/core/v1" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" "sigs.k8s.io/controller-runtime/pkg/client" ) """ **Don't Do This:** """go import ( "fmt" "sigs.k8s.io/controller-runtime/pkg/client" "context" corev1 "k8s.io/api/core/v1" "time" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" "github.com/go-logr/logr" ) """ **Why:** Grouped imports and ordering make dependencies clearer and easier to track. ### 2.2. Naming Conventions * **Packages:** Use short, descriptive, and lowercase package names. Avoid abbreviations and underscores. The package name should reflect the purpose of the code within the package. **Do This:** "pkg/controller", "pkg/webhook" **Don't Do This:** "pkg/K8s_Controller", "pkg/kube-controller" **Why:** Clarity and avoid confusion. 
Package names are used in import statements, so keeping them readable helps. * **Variables and Functions:** Use camelCase for variable and function names. Shorter names are preferred for local variables, while longer, more descriptive names are suitable for global variables and exported functions/methods. **Do This:** "podName", "createPod", "ReconcileResource" **Don't Do This:** "Pod_Name", "CreatePOD", "reconcile_resource" * **Constants:** Use PascalCase (CamelCase starting with an uppercase letter) for named constants. **Do This:** "DefaultRequeueTime", "MaxRetries" **Don't Do This:** "defaultRequeueTime", "max_retries" * **Interfaces:** Name interfaces using PascalCase, typically ending with "er" or "Interface". Avoid redundancy. For example, "storage.Interface" is preferred over "storage.StorageInterface". **Do This:** "ResourceHandler", "ClientInterface" **Don't Do This:** "IResourceHandler", "ClientIntf" * **Types:** Use PascalCase for type names. **Do This:** "PodSpec", "DeploymentStatus" **Don't Do This:** "podSpec", "deployment_status" ### 2.3. Error Handling * **Explicit Error Checks:** Always check errors explicitly. Don't use the blank identifier ("_") to discard errors without handling. **Do This:** """go pod, err := client.Get(context.TODO(), client.ObjectKey{Namespace: "default", Name: "my-pod"}, &corev1.Pod{}) if err != nil { if errors.IsNotFound(err) { klog.Info("Pod not found") return nil } klog.Error(err, "Failed to get pod") return err } """ **Don't Do This:** """go pod, _ := client.Get(context.TODO(), client.ObjectKey{Namespace: "default", Name: "my-pod"}, &corev1.Pod{}) // ignoring the error """ * **Error Wrapping:** Use "%w" to wrap errors to preserve the original error context. This allows for easier debugging and error analysis. **Do This:** """go err := someFunction() if err != nil { return fmt.Errorf("failed in someFunction: %w", err) } """ **Don't Do This:** """go err := someFunction() if err != nil { return fmt.Errorf("failed in someFunction: %s", err) // Losing error context } """ * **Error Types:** Use the "errors" package for creating and checking specific error types. **Do This:** """go var ErrInvalidInput = errors.New("invalid input") func validateInput(input string) error { if input == "" { return ErrInvalidInput } return nil } func main() { err := validateInput("") if errors.Is(err, ErrInvalidInput) { fmt.Println("Input is invalid") } } """ **Don't Do This:** """go func validateInput(input string) error { if input == "" { return fmt.Errorf("invalid input") // String comparison is brittle } return nil } func main() { err := validateInput("") if err.Error() == "invalid input" { // Brittle string comparison fmt.Println("Input is invalid") } } """ **Why:** Provides more robust error checking using "errors.Is()" and "errors.As()". ### 2.4. Concurrency * **Context:** Always pass a "context.Context" as the first argument to functions that perform I/O operations or may block. Use context for cancellation and deadlines. **Do This:** """go func createResource(ctx context.Context, client client.Client, obj runtime.Object) error { return client.Create(ctx, obj) } """ **Don't Do This:** """go func createResource(client client.Client, obj runtime.Object) error { // Missing context return client.Create(context.Background(), obj) //avoid creating new context in API methods } """ * **Goroutine Management:** Use "sync.WaitGroup" or channels to manage goroutines and prevent leaks. 
**Do This:** """go var wg sync.WaitGroup for i := 0; i < 5; i++ { wg.Add(1) go func(i int) { defer wg.Done() fmt.Println("Worker", i) }(i) } wg.Wait() """ **Don't Do This:** """go for i := 0; i < 5; i++ { go func(i int) { // Potential goroutine leak fmt.Println("Worker", i) }(i) } """ * **Mutexes:** Use mutexes ("sync.Mutex") to protect shared resources from concurrent access. **Do This:** """go var mu sync.Mutex var counter int func incrementCounter() { mu.Lock() defer mu.Unlock() counter++ } """ **Don't Do This:** """go var counter int func incrementCounter() { // Possible data race counter++ } """ ### 2.5. Logging * **Structured Logging:** Use structured logging with "klog" for log messages. **Do This:** """go import ( "k8s.io/klog/v2" ) klog.InfoS("Pod created", "namespace", pod.Namespace, "name", pod.Name) klog.ErrorS(err, "Failed to create pod", "namespace", pod.Namespace, "name", pod.Name) """ **Don't Do This:** """go fmt.Printf("Pod created in namespace %s with name %s\n", pod.Namespace, pod.Name) // unstructured logging """ * **Log Levels:** Use appropriate log levels (e.g., Info, Warning, Error) based on the severity of the message. **Do This:** """go klog.V(2).InfoS("Detailed information for debugging") // Verbose logging klog.WarningS("Something unexpected happened, but the program can continue") klog.ErrorS(err, "A critical error occurred") """ **Don't Do This:** """go klog.InfoS("Everything is fine") // Using Info for debug messages """ **Why:** Structured logging enables better filtering, analysis, and integration with monitoring tools. * **Contextual Logging:** Include relevant context in log messages, such as resource names, namespaces, and operation IDs. ### 2.6. Comments * **Godoc Comments:** Write godoc comments for all exported types, functions, and methods. These comments should explain what the code does and how to use it. **Do This:** """go // ReconcileResource reconciles a resource. // It fetches the resource, checks its status, and updates it if necessary. func (r *Reconciler) ReconcileResource(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { // ... } """ **Don't Do This:** """go func (r *Reconciler) ReconcileResource(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { // reconciles // ... } """ * **Internal Comments:** Use comments to explain complex or non-obvious logic within functions. Focus on *why* the code does what it does, not just *what* the code does. ### 2.7. Kubernetes Specific Considerations * **API Machinery Types:** When working with Kubernetes API types (e.g., Pod, Deployment), use the types defined in the "k8s.io/api" and "k8s.io/apimachinery" packages. *Avoid* creating custom types that duplicate or shadow these. Ensure API groups and versions are correct. **Do This:** """go import ( corev1 "k8s.io/api/core/v1" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" ) pod := &corev1.Pod{ ObjectMeta: metav1.ObjectMeta{ Name: "my-pod", Namespace: "default", }, Spec: corev1.PodSpec{ Containers: []corev1.Container{{ Name: "nginx", Image: "nginx:latest", }}, }, } """ **Don't Do This:** """go type MyPod struct { // Avoid custom type for Pod Name string Image string } """ * **Clients and Informers:** Use "client-go" to interact with the Kubernetes API. Use informers for caching and event handling. Use the controller-runtime library for building controllers. Ensure clients are properly configured and authenticated. When using a "dynamic" client, ensure the GVR (GroupVersionResource) is properly specified. 
**Do This:**

"""go
import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// Assume c is a properly configured controller-runtime client.Client.
// (Naming the variable "client" would shadow the imported package.)
var pod corev1.Pod
err := c.Get(context.TODO(), client.ObjectKey{Namespace: "default", Name: "my-pod"}, &pod)
if err != nil {
    // ...
}
"""

**Don't Do This:**

"""go
// Avoid direct HTTP calls to the Kubernetes API server
"""

* **Controllers:** Follow the controller pattern for managing Kubernetes resources. Use the "controller-runtime" library to simplify controller development. Implement reconciliation logic that is idempotent and handles errors gracefully. Use finalizers to ensure proper cleanup of resources.
* **Webhooks:** Implement webhooks following the Kubernetes API guidelines for admission controllers. Ensure webhooks are properly secured with TLS and RBAC.

### 2.7.1. Controller Pattern Enhancements (Latest Kubernetes)

* **Using Builder Pattern:** Use the "Builder" pattern from "controller-runtime" for declarative controller setup. This provides clear resource ownership, event filtering, and more refined control over reconciliation triggers.

**Do This:**

"""go
func (r *MyReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&myappv1.MyResource{}).
        Owns(&corev1.Pod{}). // Reconcile when owned Pods change
        WithEventFilter(predicate.GenerationChangedPredicate{}). // Only reconcile on spec changes
        Complete(r)
}
"""

**Why:** Improves readability and maintainability of controller setup.

* **Managing Informers:** Employ informer factories for optimal shared informer use across multiple controllers. This reduces pressure on the API server and reduces memory consumption.
* **Predicate Filtering:** Use Predicates to filter events to reduce unnecessary reconciliations. Use resource version predicates to reduce reconcile frequency and optimize performance.

### 2.8. Testing

* **Unit Tests:** Write unit tests for all functions and methods. Use table-driven tests for parameterized testing.

**Do This:**

"""go
func TestMyFunction(t *testing.T) {
    testCases := []struct {
        name     string
        input    int
        expected int
    }{
        {
            name:     "Positive input",
            input:    5,
            expected: 10,
        },
        {
            name:     "Negative input",
            input:    -5,
            expected: -10,
        },
    }

    for _, tc := range testCases {
        t.Run(tc.name, func(t *testing.T) {
            actual := myFunction(tc.input)
            if actual != tc.expected {
                t.Errorf("Expected %d, but got %d", tc.expected, actual)
            }
        })
    }
}
"""

**Don't Do This:**

"""go
func TestMyFunction(t *testing.T) {
    result := myFunction(5)
    if result != 10 {
        t.Errorf("Expected 10, but got %d", result)
    }
}
"""

* **Integration Tests:** Write integration tests to verify the interaction between different components.
* **End-to-End Tests:** Write end-to-end tests to verify the overall system behavior. Use tools like Ginkgo and Gomega for writing BDD-style tests. Follow the guidance provided in the "test-infra" repository for writing and running E2E tests in Kubernetes.

## 3. YAML Coding Standards

### 3.1. Formatting

* **Indentation:** Use 2 spaces for indentation. *Never* use tabs.

**Do This:**

"""yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: nginx
    image: nginx:latest
"""

**Don't Do This:**

"""yaml
apiVersion: v1
kind: Pod
metadata:
    name: my-pod # Incorrect indentation
spec:
   containers: # Incorrect indentation
       - name: nginx # Incorrect indentation
         image: nginx:latest # Incorrect indentation
"""

* **Line Length:** Keep lines reasonably short (ideally under 80 characters).
* **Spacing:** Use a single space after colons and commas.
**Do This:** """yaml name: my-pod ports: [80, 443] """ **Don't Do This:** """yaml name:my-pod ports:[80,443] """ ### 3.2. Structure and Content * **API Version and Kind:** Always specify the "apiVersion" and "kind" fields at the beginning of each YAML file. Ensure the version is correct for the target Kubernetes cluster. Use the latest stable API versions. **Do This:** """yaml apiVersion: apps/v1 kind: Deployment """ **Don't Do This:** """yaml apiVersion: apps/v1beta1 # Deprecated API version kind: Deployment """ * **Metadata:** Include meaningful metadata, such as "name", "namespace", and "labels". Use labels consistently for selecting and managing resources. Use annotations to store non-identifying metadata. **Do This:** """yaml metadata: name: my-deployment namespace: production labels: app: my-app tier: backend """ * **Comments:** Use comments to explain the purpose of specific configurations and settings. **Do This:** """yaml # This deployment manages the backend API servers apiVersion: apps/v1 kind: Deployment metadata: name: my-deployment """ ### 3.3. Naming Conventions * **Resource Names:** Use lowercase letters, numbers, and hyphens for resource names. Start with a letter and end with a letter or number. Keep names short and descriptive. **Do This:** "my-pod", "backend-service" **Don't Do This:** "MyPod", "Backend_Service", "my_super_long_and_unnecessary_pod_name" * **Label Keys:** Use a DNS subdomain prefix for custom labels to avoid collisions with other applications or systems. **Do This:** "example.com/my-label", "app.kubernetes.io/name" (for standard Kubernetes labels) **Don't Do This:** "my-label" (without a domain prefix) ### 3.4. Best Practices * **Immutability:** Treat YAML files as immutable configuration. Use Git or other version control systems to track changes. * **Separation of Concerns:** Separate YAML files for different environments (e.g., development, staging, production). Use templating tools like Helm or Kustomize to manage environment-specific configurations. * **Security Contexts:** Always define security contexts for Pods and Containers to enforce security policies. Use "runAsUser", "runAsGroup", "capabilities", and other security-related fields. * **Resource Requests and Limits:** Define resource requests and limits for Containers to ensure proper resource allocation and prevent resource starvation. ### 3.5. Using Kustomize * **Base and Overlays:** Employ Kustomize's base and overlay system. Define a base configuration with common settings and create overlays for environment-specific customizations. **Base (base/kustomization.yaml):** """yaml resources: - deployment.yaml - service.yaml commonLabels: app: my-app """ **Overlay (overlays/production/kustomization.yaml):** """yaml bases: - ../../base patches: - path: deployment-patch.yaml target: kind: Deployment name: my-app """ **Deployment Patch (overlays/production/deployment-patch.yaml):** """yaml apiVersion: apps/v1 kind: Deployment metadata: name: my-app spec: replicas: 3 """ **Why:** Provides a structured way to manage configuration variations without duplicating entire files. ## 4. General Tips and Anti-Patterns * **Avoid Hardcoding Values:** Use environment variables, config maps, or secrets to externalize configuration values. Never hardcode sensitive information in code or YAML files. * **Don't Repeat Yourself (DRY):** Abstract common functionality into reusable functions or modules. Avoid duplicating code across multiple files or packages. 
* **Single Responsibility Principle (SRP):** Each function, method, or class should have a single, well-defined purpose. * **Keep Functions Short:** Keep functions short and focused (ideally under 50 lines of code). Break down complex logic into smaller, more manageable functions. * **Avoid Global Variables:** Minimize the use of global variables. If you need to use a global variable, protect it with a mutex to prevent concurrent access. Consider dependency injection to pass dependencies explicitly. * **Use Linters and Static Analysis Tools:** Use linters (e.g., "golangci-lint", "yamale") and static analysis tools (e.g., "staticcheck", "kube-linter") to identify potential issues in your code and YAML files. Configure these tools in your CI/CD pipeline for automated code review. * **Keep Up-to-Date:** Stay up-to-date with the latest Kubernetes features, best practices, and security advisories. Regularly review and update your code to take advantage of new features and address potential vulnerabilities. * **Address Code Smells Early:** Pay attention to code smells (e.g., long methods, duplicate code, feature envy) and address them early in the development process. Refactoring code regularly can prevent technical debt from accumulating. By adhering to these standards and guidelines, developers can contribute to a more consistent, readable, maintainable, and secure Kubernetes codebase. These standards should be considered a living document, subject to change and refinement as the Kubernetes project evolves. Regular review and updates are encouraged to keep these standards aligned with the latest best practices and technologies.
# Component Design Standards for Kubernetes

This document outlines the coding standards for component design within the Kubernetes ecosystem. It aims to provide clear, actionable guidance for developers to create reusable, maintainable, performant, and secure components. These standards apply to new development and should guide refactoring efforts.

## 1. Architectural Principles

### 1.1. Modularity and Abstraction

**Standard:** Design components with well-defined interfaces and minimal dependencies to enhance reusability and reduce coupling.

* **Do This:**
  * Define clear API boundaries using interfaces in Go.
  * Use dependency injection to decouple components.
  * Favor composition over inheritance.
* **Don't Do This:**
  * Create monolithic components with intertwined functionalities.
  * Introduce circular dependencies.
  * Expose internal implementation details through interfaces.

**Why:** Modularity isolates changes, simplifies testing, and improves code reuse. Abstraction reduces complexity by hiding implementation details, allowing for easier maintenance and evolution.

**Example:**

"""go
// Good: Defined interface
type PodLister interface {
    ListPods(namespace string) ([]*v1.Pod, error)
}

type cachedPodLister struct {
    informer cache.SharedIndexInformer
}

func (c *cachedPodLister) ListPods(namespace string) ([]*v1.Pod, error) {
    // Implementation would list pods from the informer's cache.
    return nil, nil // placeholder
}

// Bad: Direct dependency on concrete type
type PodReconciler struct {
    podLister *cachedPodLister // Concrete type, limits testability
}

// Good: Dependency on interface
type BetterPodReconciler struct {
    podLister PodLister
}

func NewBetterPodReconciler(lister PodLister) *BetterPodReconciler {
    return &BetterPodReconciler{podLister: lister}
}
"""

### 1.2. Separation of Concerns

**Standard:** Divide responsibilities into distinct components with single, well-defined purposes.

* **Do This:**
  * Separate data access logic from business logic.
  * Isolate API handling from core functionality.
  * Design controllers to manage specific resources or aspects of the system.
* **Don't Do This:**
  * Combine unrelated functionalities into a single component.
  * Write controllers that manage multiple, unrelated resource types.
  * Mix presentation logic with business logic.

**Why:** Separation of concerns simplifies debugging, reduces the impact of changes, and enables parallel development.

**Example:**

"""go
// Bad: Combining API handling and business logic
func HandlePodCreation(w http.ResponseWriter, r *http.Request) {
    // Parse request
    // Validate request
    // Create Pod in etcd
    // Write response
}
"""

"""go
// Good: Separate API handling from business logic
func HandlePodCreation(w http.ResponseWriter, r *http.Request, creator podCreator) {
    // Parse and validate the request
    err := creator.CreatePod(r.Context(), r.Body)
    if err != nil {
        // Handle error
        return
    }
    // Write response
}

type podCreator interface {
    CreatePod(ctx context.Context, body io.ReadCloser) error
}
"""

### 1.3. Loose Coupling

**Standard:** Minimize dependencies between components to improve independence and flexibility.

* **Do This:**
  * Use asynchronous communication patterns (e.g., events, queues) to decouple components.
  * Define clear interfaces for service interaction.
  * Use versioned APIs to allow independent evolution of components.
* **Don't Do This:**
  * Create tightly coupled dependencies that require simultaneous updates.
  * Expose internal data structures between components.
  * Rely on shared global state.
**Why:** Loose coupling simplifies testing, allows for independent scaling, and reduces the risk of cascading failures.

**Example:**

"""go
// Bad: Direct function call resulting in tight coupling
func ComponentA() {
    result := ComponentB() // Direct call
    _ = result             // use result
}
"""

"""go
// Good: Decoupled using a message queue (sketch; MessageQueue, Message, and
// BProcessor are illustrative types)

// Component A publishes a message
func ComponentA(queue MessageQueue) {
    message := Message{Type: "ComponentBRequest"} // payload omitted for brevity
    queue.Publish("componentb.requests", message)
}

// Component B subscribes to the queue and processes messages
func ComponentB(queue MessageQueue, processor BProcessor) {
    queue.Subscribe("componentb.requests", func(message Message) {
        processor.Process(message)
    })
}
"""

### 1.4. Single Responsibility Principle (SRP)

**Standard:** Each component should have one, and only one, reason to change.

* **Do This:**
  * Decompose complex components into smaller, focused units.
  * Ensure each component has a clear and specific purpose.
  * Refactor components that violate SRP.
* **Don't Do This:**
  * Create "god" classes that handle multiple, unrelated responsibilities.
  * Add new responsibilities to existing components without careful consideration.

**Why:** SRP improves maintainability, reduces complexity, and enhances testability.

"""go
// Bad: A single struct handles both fetching and processing data.
type DataHandler struct {
    Source string
}

func (dh *DataHandler) FetchData() ([]byte, error) {
    // Fetches data from dh.Source
    return nil, nil // placeholder
}

func (dh *DataHandler) ProcessData(data []byte) (string, error) {
    // Processes the fetched data
    return "", nil // placeholder
}

// Good: Separate structs for fetching and processing data
type DataFetcher struct {
    Source string
}

func (df *DataFetcher) FetchData() ([]byte, error) {
    // Fetches data from df.Source
    return nil, nil // placeholder
}

type DataProcessor struct{}

func (dp *DataProcessor) ProcessData(data []byte) (string, error) {
    // Processes the fetched data
    return "", nil // placeholder
}
"""

## 2. Kubernetes-Specific Considerations

### 2.1. Controller Design

**Standard:** Kubernetes controllers should follow the "informer" pattern for efficient resource monitoring and reconciliation.

* **Do This:**
  * Use "SharedIndexInformer" for caching and event handling of Kubernetes resources.
  * Implement the "Reconcile" function to manage the desired state.
  * Use work queues to manage events asynchronously.
  * Leverage client-go library features for interacting with the Kubernetes API.
* **Don't Do This:**
  * Poll the Kubernetes API directly for changes.
  * Block the reconciliation loop with long-running operations.
  * Ignore error conditions during reconciliation.

**Why:** The informer pattern provides efficient change notifications, reduces API load, and ensures eventual consistency.
**Example:**

"""go
// Basic Kubernetes controller structure
import (
    "context"
    "fmt"
    "time"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/errors"
    utilruntime "k8s.io/apimachinery/pkg/util/runtime"
    "k8s.io/apimachinery/pkg/util/wait"
    batchinformers "k8s.io/client-go/informers/batch/v1"
    "k8s.io/client-go/kubernetes"
    batchlisters "k8s.io/client-go/listers/batch/v1"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/tools/record"
    "k8s.io/client-go/util/workqueue" // Import workqueue
    "k8s.io/klog/v2"
)

const controllerAgentName = "sample-controller"

const (
    // SuccessSynced is used as part of the Event 'reason' when a Foo is synced
    SuccessSynced = "Synced"
    // ErrResourceExists is used as part of the Event 'reason' when a Foo fails
    // to sync due to a Job of the same name already existing.
    ErrResourceExists = "ErrResourceExists"

    // MessageResourceExists is the message used for Events when a resource
    // fails to sync due to a Job already existing
    MessageResourceExists = "Resource %q already exists and is not managed by Foo"
    // MessageResourceSynced is the message used for an Event fired when a Foo
    // is synced successfully
    MessageResourceSynced = "Foo %q synced successfully"
)

// Controller is the controller implementation for Foo resources
type Controller struct {
    // kubeclientset is a standard kubernetes clientset
    kubeclientset kubernetes.Interface
    // jobLister is a lister for Jobs
    jobLister batchlisters.JobLister
    // jobsSynced reports whether the Job informer cache has synced
    jobsSynced cache.InformerSynced
    // workqueue is a rate limited work queue. This is used to queue work to be
    // processed instead of performing it as soon as a change happens. This
    // means we can ensure we only process a fixed amount of resources at a
    // time, and makes it easy to ensure we are never overwhelming the system.
    workqueue workqueue.RateLimitingInterface
    // recorder is an event recorder for recording Event resources to the
    // Kubernetes API.
    recorder record.EventRecorder
}

// NewController returns a new sample controller
func NewController(
    kubeclientset kubernetes.Interface,
    jobInformer batchinformers.JobInformer,
    recorder record.EventRecorder) *Controller {

    controller := &Controller{
        kubeclientset: kubeclientset,
        jobLister:     jobInformer.Lister(),
        jobsSynced:    jobInformer.Informer().HasSynced,
        workqueue:     workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "Jobs"),
        recorder:      recorder,
    }

    klog.V(4).Info("Setting up event handlers")
    // Set up event handlers for when Job resources change
    jobInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: controller.enqueueJob,
        UpdateFunc: func(old, new interface{}) {
            controller.enqueueJob(new)
        },
        DeleteFunc: controller.enqueueJob,
    })

    return controller
}

// Run will set up the event handlers for types we are interested in, as well
// as syncing informer caches and starting workers. It will block until the
// context is cancelled, at which point it will shut down the workqueue and
// wait for workers to finish processing their current work items.
func (c *Controller) Run(ctx context.Context, workers int) error {
    defer utilruntime.HandleCrash()
    defer c.workqueue.ShutDown()

    // Start the informer factories to begin populating the informer caches
    klog.Info("Starting Job controller")

    // Wait for the caches to be synced before starting workers
    klog.Info("Waiting for informer caches to sync")
    if ok := cache.WaitForCacheSync(ctx.Done(), c.jobsSynced); !ok {
        return fmt.Errorf("failed to wait for caches to sync")
    }

    klog.Info("Starting workers")
    // Launch workers to process Job resources
    for i := 0; i < workers; i++ {
        go wait.Until(c.runWorker, time.Second, ctx.Done())
    }

    klog.Info("Started workers")
    <-ctx.Done()
    klog.Info("Shutting down workers")

    return nil
}

// runWorker is a long-running function that will continually pull new items
// off the workqueue and process them.
func (c *Controller) runWorker() {
    for c.processNextWorkItem() {
    }
}

// processNextWorkItem will read a single work item off the workqueue and
// attempt to process it.
func (c *Controller) processNextWorkItem() bool {
    // Fetch the next item off the workqueue.
    obj, shutdown := c.workqueue.Get()
    if shutdown {
        return false
    }

    // We wrap this block in a func so we can defer c.workqueue.Done.
    err := func(obj interface{}) error {
        // We call Done here so the workqueue knows we have finished
        // processing this item. We also must remember to call Forget if we
        // do not want this work item being re-queued. For example, we do
        // not call Forget if a transient error occurs, instead the item is
        // put back on the workqueue and attempted again after a back-off
        // period.
        defer c.workqueue.Done(obj)
        var key string
        var ok bool
        // We expect strings to come off the workqueue. These are of the
        // form namespace/name.
        if key, ok = obj.(string); !ok {
            // As the item in the workqueue is actually invalid, we call
            // Forget here else we'd go into a loop of attempting to
            // process a work item that is invalid.
            c.workqueue.Forget(obj)
            utilruntime.HandleError(fmt.Errorf("expected string in workqueue but got %#v", obj))
            return nil
        }
        // Run the syncHandler, passing it the namespace/name string of the
        // Foo resource to be synced.
        if err := c.syncHandler(key); err != nil {
            // Put the item back on the workqueue to handle any transient errors.
            c.workqueue.AddRateLimited(key)
            return fmt.Errorf("error syncing '%s': %s, requeuing", key, err.Error())
        }
        // Finally, if no error occurs we Forget this item so it does not
        // get queued again until another change happens.
        c.workqueue.Forget(obj)
        klog.Infof("Successfully synced '%s'", key)
        return nil
    }(obj)

    if err != nil {
        utilruntime.HandleError(err)
        return true
    }

    return true
}

// syncHandler compares the actual state with the desired, and attempts to
// converge the two. It then updates the Status block of the Foo resource
// with the current status of the resource.
func (c *Controller) syncHandler(key string) error {
    // Convert the namespace/name string into a distinct namespace and name
    namespace, name, err := cache.SplitMetaNamespaceKey(key)
    if err != nil {
        utilruntime.HandleError(fmt.Errorf("invalid resource key: %s", key))
        return nil
    }

    // Get the Job with this namespace/name
    job, err := c.jobLister.Jobs(namespace).Get(name)
    if err != nil {
        // The Job resource may no longer exist, in which case we stop
        // processing.
        if errors.IsNotFound(err) {
            utilruntime.HandleError(fmt.Errorf("job '%s' in work queue no longer exists", key))
            return nil
        }
        return err
    }

    // TODO: Replace with actual controller code
    klog.Infof("Found job %s/%s", job.Namespace, job.Name)
    c.recorder.Event(job, corev1.EventTypeNormal, SuccessSynced, MessageResourceSynced)
    return nil
}

// enqueueJob takes any resource, converts it into a key formatted string, and
// adds the string to the workqueue.
func (c *Controller) enqueueJob(obj interface{}) {
    key, err := cache.MetaNamespaceKeyFunc(obj)
    if err != nil {
        utilruntime.HandleError(err)
        return
    }
    c.workqueue.Add(key)
}

// TODO: Consider alternatives to rate limiting. Discuss implications of
// exponential backoff in high-volume scenarios (potential for long delays).
// AddRateLimited(item interface{})
// Forget(item interface{})
// NumRequeues(item interface{}) int
"""

### 2.2. CRD (Custom Resource Definition) Design

**Standard:** Design CRDs with clear and well-defined schemas, validation rules, and versioning strategies.

* **Do This:**
  * Use OpenAPI v3 schema for validation of resources.
  * Implement webhook-based validation for complex business logic that cannot be expressed via OpenAPI.
  * Implement webhook-based mutation to enforce immutability or set defaults.
  * Define default values for optional fields.
  * Use semantic versioning for API changes.
  * Provide conversion webhooks for migrating resources between API versions.
* **Don't Do This:**
  * Define overly complex or deeply nested schemas.
  * Introduce breaking changes without a smooth migration path.
  * Omit validation rules, leading to inconsistent or invalid data.

**Why:** Well-designed CRDs provide a consistent and reliable extension mechanism for Kubernetes, improving user experience and reducing operational risks.

**Example:**

"""yaml
# Example CRD definition with validation
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: myresources.example.com
spec:
  group: example.com
  versions:
  - name: v1alpha1
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              size:
                type: integer
                minimum: 1
                maximum: 10
              image:
                type: string
                pattern: "^[a-z0-9-.]*/[a-z0-9-.]*:[a-z0-9-.]*$"
    served: true
    storage: true
  scope: Namespaced
  names:
    plural: myresources
    singular: myresource
    kind: MyResource
    shortNames:
    - mr
"""

"""go
// Example validation webhook (sketch; the MyResource and WebhookServer types
// are assumed to be defined elsewhere)
import (
    "encoding/json"
    "net/http"

    v1 "k8s.io/api/admission/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/klog/v2"
)

func (wh *WebhookServer) validateMyResource(ar *v1.AdmissionReview) *v1.AdmissionResponse {
    klog.V(4).Info("Validating MyResource")
    req := ar.Request
    var myResource MyResource
    if err := json.Unmarshal(req.Object.Raw, &myResource); err != nil {
        klog.Errorf("Could not unmarshal raw object: %v", err)
        return &v1.AdmissionResponse{
            Result: &metav1.Status{
                Message: err.Error(),
                Code:    http.StatusBadRequest,
            },
        }
    }

    if myResource.Spec.Size > 100 {
        return &v1.AdmissionResponse{
            Allowed: false,
            Result: &metav1.Status{
                Message: "Size cannot be greater than 100",
                Code:    http.StatusForbidden,
            },
        }
    }

    return &v1.AdmissionResponse{
        Allowed: true,
    }
}
"""

### 2.3. Operator Pattern

**Standard:** Use the operator pattern to automate complex operational tasks, such as deployments, upgrades, backups, and scaling.

* **Do This:**
  * Encapsulate operational knowledge within the operator.
  * Use CRDs to define and manage application-specific resources.
  * Implement reconciliation loops to ensure desired state.
* Leverage existing operator frameworks (e.g., Operator SDK, Kubebuilder) to simplify development. * **Don't Do This:** * Create overly complex operators that manage too many unrelated aspects. * Omit comprehensive error handling and logging. * Rely on manual intervention for common operational tasks. **Why:** Operators automate complex tasks, reduce human error, and improve overall system reliability. **Example:** Operator framework such as Kubebuilder and the Operator SDK provides code generation tools to minimize boilerplate. Example uses and generated code are too lengthy to fit here. Please refer to: [https://sdk.operatorframework.io/](https://sdk.operatorframework.io/) [https://kubebuilder.io/](https://kubebuilder.io/) ### 2.4. API Versioning and Compatibility **Standard:** Follow semantic versioning principles for API changes and provide compatibility layers for older versions. * **Do This:** * Increment the major version number for breaking changes. * Increment the minor version number for new features. * Increment the patch version number for bug fixes. * Provide conversion webhooks for seamless migration between API versions. * Deprecate old APIs gradually with proper warnings and migration guides. * **Don't Do This:** * Introduce breaking changes without incrementing the major version number. * Remove deprecated APIs without proper notice. * Omit compatibility layers for older API consumers. **Why:** Proper API versioning ensures smooth upgrades, reduces compatibility issues, and minimizes disruption for existing users. ## 3. Implementation Details ### 3.1. Error Handling **Standard:** Implement robust error handling with clear and informative error messages. * **Do This:** * Use Go's error interface and the "errors" package for custom error types. * Wrap errors with context using "%w" verb in "fmt.Errorf". * Log errors with sufficient context for debugging. * Handle errors gracefully and avoid panics. * **Don't Do This:** * Ignore errors silently. * Use generic error messages without context. * Propagate errors without handling or logging them. **Why:** Proper error handling simplifies debugging, improves system resilience, and provides better operational insights. **Example:** """go // Good: Error wrapping with context func doSomething(ctx context.Context) error { err := someFunc() if err != nil { return fmt.Errorf("failed to do something: %w", err) } return nil } // Bad: Ignoring error func doSomething(ctx context.Context) { someFunc() // potentially ignoring an error condition } """ ### 3.2. Logging **Standard:** Implement structured logging with appropriate levels and contextual information. * **Do This:** * Use a structured logging library (e.g., "klog/v2") for consistent formatting and metadata. * Choose appropriate log levels (e.g., "Info", "Warning", "Error", "Debug") based on severity. * Include relevant context in log messages (e.g., request ID, user ID, resource name). * Avoid logging sensitive information. * **Don't Do This:** * Use unstructured logging with inconsistent formatting. * Log excessive amounts of debug information in production. * Omit contextual information from log messages. **Why:** Structured logging simplifies analysis, improves operational visibility, and facilitates troubleshooting. 
**Example:**

"""go
// Good: Structured logging with klog
import (
    "fmt"

    "k8s.io/klog/v2"
)

func doSomething(name string) error {
    klog.InfoS("doing something", "name", name)
    err := someFunc()
    if err != nil {
        klog.ErrorS(err, "failed to do something", "name", name)
        return fmt.Errorf("failed to do something: %w", err)
    }
    return nil
}
"""

"""go
// Bad: Unstructured logging
func doSomething(name string) error {
    fmt.Printf("Doing something %s", name)
    err := someFunc()
    if err != nil {
        fmt.Printf("Error: %v", err)
        return fmt.Errorf("failed to do something: %w", err)
    }
    return nil
}
"""

### 3.3. Concurrency

**Standard:** Use Go's concurrency primitives (goroutines, channels, mutexes) to manage concurrent operations safely and efficiently.

* **Do This:**
  * Use goroutines for parallel execution.
  * Use channels for communication and synchronization between goroutines.
  * Use mutexes or other synchronization primitives to protect shared resources.
  * Use the "context" package for cancellation and timeouts.
* **Don't Do This:**
  * Introduce race conditions by accessing shared resources without protection.
  * Create goroutine leaks by failing to wait for goroutines to complete.
  * Use global variables for shared state.

**Why:** Proper concurrency management improves performance, prevents data corruption, and ensures program stability.

**Example:**

"""go
// Good: Using goroutines and channels
func processData(data []string) <-chan string {
    results := make(chan string)
    go func() {
        defer close(results)
        for _, item := range data {
            result := doSomeWork(item)
            results <- result
        }
    }()
    return results
}
"""

"""go
// Bad: Race condition on shared variable (DO NOT USE)
var counter int

func incrementCounter() {
    for i := 0; i < 1000; i++ {
        counter++ // Race condition!
    }
}
"""

### 3.4. Resource Management

**Standard:** Manage system resources (CPU, memory, file descriptors) efficiently to prevent resource exhaustion and improve performance.

* **Do This:**
  * Limit resource usage with appropriate quotas and limits.
  * Release resources promptly when no longer needed.
  * Use profiling tools to identify resource bottlenecks.
  * Avoid memory leaks.
* **Don't Do This:**
  * Allocate excessive amounts of memory without justification.
  * Leak file descriptors or other system resources.
  * Ignore resource constraints defined by the Kubernetes environment.

**Why:** Efficient resource management reduces costs, improves performance, and enhances system stability.

## 4. Security Best Practices

### 4.1. Input Validation

**Standard:** Validate all user inputs to prevent injection attacks and other vulnerabilities.

* **Do This:**
  * Use appropriate validation libraries and frameworks.
  * Sanitize inputs to remove potentially harmful characters or sequences.
  * Enforce strict input length and format constraints.
  * Use parameterized queries to prevent SQL injection.
* **Don't Do This:**
  * Trust user inputs without validation.
  * Allow arbitrary code execution based on user inputs.
  * Hardcode sensitive data into the code.

**Why:** Input validation prevents exploits that could compromise the system's integrity and security.

### 4.2. Authentication and Authorization

**Standard:** Implement robust authentication and authorization mechanisms to control access to sensitive resources.

* **Do This:**
  * Use Kubernetes RBAC (Role-Based Access Control) to define access permissions (a minimal sketch follows this list).
  * Authenticate users using secure protocols (e.g., OAuth 2.0, OpenID Connect).
  * Enforce the principle of least privilege.
  * Regularly review and update access control policies.
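For reference, a least-privilege RBAC setup might look like the following sketch; the namespace, service account, and role names are illustrative:

"""yaml
# Role granting read-only access to Pods in a single namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: my-namespace
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
# Bind the Role to exactly one service account
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: my-namespace
subjects:
- kind: ServiceAccount
  name: my-app
  namespace: my-namespace
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
"""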
* **Don't Do This:** * Grant unnecessary permissions to users or services. * Store credentials in plaintext. * Bypass authentication or authorization checks. **Why:** Authentication and authorization protect sensitive resources from unauthorized access. ### 4.3. Secrets Management **Standard:** Store and manage sensitive data (e.g., passwords, API keys, certificates) securely. * **Do This:** * Use Kubernetes Secrets to store sensitive data. * Encrypt secrets at rest using KMS (Key Management Service). * Rotate secrets regularly. * Avoid committing secrets to source control. * **Don't Do This:** * Store secrets in environment variables or configuration files. * Share secrets unnecessarily. * Hardcode secrets into the code or container images. **Why:** Secure secrets management prevents sensitive data from being exposed, reducing the risk of unauthorized access. ### 4.4. Container Security **Standard:** Secure container images and runtime environments to prevent vulnerabilities and exploits. * **Do This:** * Use minimal base images with only necessary dependencies. * Scan container images for vulnerabilities using tools like Trivy or Clair. * Run containers with non-root users. * Apply security contexts to restrict container capabilities. * Regularly update container images and dependencies. * **Don't Do This:** * Use outdated or vulnerable base images. * Run containers as root without justification. * Expose sensitive ports or services unnecessarily. **Why:** Container security prevents exploits that could compromise the container or the underlying host. ## 5. Testing ### 5.1. Unit Tests **Standard:** Write comprehensive unit tests for all components to ensure correctness and prevent regressions. * **Do This:** * Use Go's testing package for unit tests. * Aim for high code coverage. * Mock dependencies to isolate units under test. * Write clear and concise test cases. * **Don't Do This:** * Omit unit tests for critical components. * Write tests that are too complex or brittle. * Ignore failing tests. **Why:** Unit tests provide confidence in the correctness of individual components and prevent regressions during development. ### 5.2. Integration Tests **Standard:** Write integration tests to verify the interaction between different components. * **Do This:** * Use Ginkgo and Gomega for BDD-style testing. * Test the integration of components with the Kubernetes API. * Verify the end-to-end behavior of the system. * **Don't Do This:** * Rely solely on unit tests. * Skip testing the integration of critical components. * Write integration tests that are too slow or unreliable. **Why:** Integration tests ensure that different components work together correctly and that the system as a whole behaves as expected. ### 5.3. End-to-End (E2E) Tests **Standard:** Write end-to-end tests to validate the system's overall functionality in a realistic environment. * **Do This:** * Deploy the system to a test Kubernetes cluster. * Simulate real-world user scenarios. * Verify the system's behavior under load. * Monitor performance and resource usage. * **Don't Do This:** * Skip E2E tests. * Test only basic functionality. * Ignore performance or scalability issues revealed by E2E tests. **Why:** End-to-end tests provide confidence that the system functions correctly in a production-like environment and meets performance and scalability requirements. ## 6. Continuous Integration and Continuous Delivery (CI/CD) ### 6.1. 
Automated Builds **Standard:** Automate the build process to ensure consistent and reliable builds. * **Do This:** * Use a CI/CD system (e.g., Jenkins, GitLab CI, GitHub Actions) to automate builds. * Trigger builds on code commits and pull requests. * Run unit tests and integration tests as part of the build process. * Generate build artifacts (e.g., container images, binaries) automatically. * **Don't Do This:** * Rely on manual builds. * Skip automated testing in the build process. * Fail to track build provenance and dependencies. ### 6.2. Automated Deployments **Standard:** Automate the deployment process to ensure consistent and reliable deployments. * **Do This:** * Use a CI/CD system to automate deployments. * Use declarative deployment configurations (e.g., Kubernetes manifests, Helm charts). * Implement blue-green deployments or canary releases for minimal downtime. * Monitor deployments for errors and roll back automatically if necessary. * **Don't Do This:** * Rely on manual deployments. * Deploy directly to production without testing in a staging environment. * Fail to monitor deployments for errors. ## 7. Documentation ### 7.1. Code Comments **Standard:** Write clear and concise comments to explain the purpose and functionality of the code. * **Do This:** * Comment complex or non-obvious code. * Explain the purpose of functions, methods, and classes. * Document API interfaces and data structures. * Use Go's commenting conventions. If reviewers ask questions about why the code is the way it is, that’s a sign that comments might be helpful. * **Don't Do This:** * Write redundant or obvious comments. * Comment every line of code. * Let comments become outdated. ### 7.2. API Documentation **Standard:** Generate API documentation automatically from code comments. * **Do This:** * Follow the conventions of documentation generators (e.g., GoDoc). * Document all API endpoints, data structures, and parameters. * Provide examples of API usage. * **Don't Do This:** * Omit API documentation. * Write API documentation manually. * Let API documentation become outdated. # Modern Approaches, Patterns, and Technologies for Component Design in Kubernetes: * **Controller-Runtime:** Leveraging controller-runtime, part of Kubebuilder, for building controllers simplifies many aspects like leader election, metrics, and health probes. It’s favored over vanilla client-go informers for new controller development. * **Server-Side Apply:** Use server-side apply to manage resources more effectively, merging changes from multiple actors without conflict. * **Composition Functions (kustomize):** With tools such as kustomize, configure Kubernetes resources via composition rather than duplication. This allows for patching and overriding standard configurations in a non-destructive manner. * **eBPF for Networking and Security:** Adopt eBPF (extended Berkeley Packet Filter) for advanced networking, observability, and security policies (e.g., Cilium CNI). * **Policy Engines (OPA, Kyverno):** Use policy engines like OPA (Open Policy Agent) or Kyverno to implement fine-grained policies to manage resource creation and configuration. * **Gatekeeper:** Use Gatekeeper to enforce CRD-based policies in your cluster. * **Service Mesh (Istio, Linkerd):** Employ service meshes for features like mTLS, traffic management, and advanced observability with minimal code changes. By incorporating these best practices, you can build Kubernetes components that are robust, scalable, secure, and easy to maintain. 
This document serves as a living guide that should be updated regularly to reflect the latest developments in the Kubernetes ecosystem.
# State Management Standards for Kubernetes

This document outlines coding standards for managing application state within Kubernetes. It provides guidelines for developers to ensure consistency, maintainability, performance, and security when building stateful applications on Kubernetes. These guidelines specifically address the unique challenges of state management in a distributed, containerized environment and leverage modern Kubernetes features and best practices.

## 1. Introduction to State Management in Kubernetes

State management in Kubernetes revolves around persisting, accessing, and managing data across pod lifecycles. This is especially critical for stateful applications like databases, message queues, and caching systems. Unlike stateless applications, stateful apps need to retain data even when pods are rescheduled or updated. This document focuses on Kubernetes-native approaches and avoids relying on external solutions where possible to enhance portability and integration.

### 1.1. Key Considerations for State Management

* **Persistence:** How data is stored and retrieved reliably.
* **Data Access:** Efficient and secure methods for applications to interact with persistent data.
* **Consistency:** Ensuring data remains consistent across different nodes and pods.
* **High Availability:** Maintaining data availability even during failures.
* **Scalability:** Adapting storage capacity to meet changing application demands.
* **Security:** Protecting sensitive data at rest and in transit.

## 2. Persistent Volumes and Claims

Kubernetes Persistent Volumes (PVs) and Persistent Volume Claims (PVCs) are fundamental for managing persistent storage. Follow best practices for defining and using PVs and PVCs effectively.

### 2.1. Defining Persistent Volumes

* **Do This:** Use StorageClasses for dynamic provisioning of PVs.
* **Don't Do This:** Manually create PVs unless absolutely necessary. Static provisioning reduces portability and increases administrative overhead.

**Why:** StorageClasses allow admins to define different types of storage (e.g., SSD, HDD, cloud-provider specific) and allow users to dynamically request storage without needing to know the underlying infrastructure details.

**Code Example (StorageClass):**

"""yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: ebs.csi.aws.com # Or your provider's CSI provisioner; the legacy in-tree kubernetes.io/aws-ebs provisioner is deprecated in favor of the CSI driver
parameters:
  type: gp3
reclaimPolicy: Retain # Or Delete, depending on your requirements
"""

* **Do This:** Set an appropriate "reclaimPolicy" ("Retain" or "Delete") based on your needs. "Retain" keeps the volume after the PVC is deleted, useful for debugging or data recovery. "Delete" removes the volume when the PVC is deleted.
* **Don't Do This:** Leave "reclaimPolicy" unset. This might lead to unexpected data loss or orphaned volumes.

**Why:** The "reclaimPolicy" dictates what happens to the underlying storage volume when a PVC is deleted. Choosing the correct setting is important for data lifecycle management.

### 2.2. Defining Persistent Volume Claims

* **Do This:** Define "accessModes" that match your application's needs (ReadWriteOnce, ReadOnlyMany, ReadWriteMany).
* **Don't Do This:** Request more storage than your application needs. This wastes resources.

**Why:** "accessModes" control how the volume can be accessed by multiple pods.

* "ReadWriteOnce": The volume can be mounted as read-write by a single node.
* "ReadOnlyMany": The volume can be mounted as read-only by many nodes.
* "ReadWriteMany": The volume can be mounted as read-write by many nodes. **Code Example (PVC):** """yaml apiVersion: v1 kind: PersistentVolumeClaim metadata: name: my-pvc spec: storageClassName: standard accessModes: - ReadWriteOnce resources: requests: storage: 10Gi """ * **Do This:** Use "resources.requests.storage" to specify the amount of storage needed. **Why:** This ensures your application gets the required storage and helps Kubernetes scheduler find a suitable Persistent Volume. ### 2.3. Anti-Patterns: PV/PVC * **Anti-Pattern:** Hardcoding specific PV names in pod definitions. * **Better:** Rely on PVCs and StorageClasses for dynamic volume provisioning, promoting environment portability. * **Anti-Pattern:** Ignoring "reclaimPolicy" leading to data loss after PVC deletion (or orphaned volumes after app deletion). * **Better:** Carefully consider and setting "reclaimPolicy" based on data lifecycle. ## 3. StatefulSets StatefulSets are the recommended way to manage stateful applications in Kubernetes. ### 3.1. Defining StatefulSets * **Do This:** Use "serviceName" to define a headless service for your StatefulSet pods. * **Don't Do This:** Use a regular Service with a selector that matches all pods in the StatefulSet. **Why:** Headless Services provide stable network identities for each pod in the StatefulSet, crucial for stateful applications that require peer-to-peer communication or predictable addressing. **Code Example (StatefulSet):** """yaml apiVersion: apps/v1 kind: StatefulSet metadata: name: my-statefulset spec: serviceName: my-headless-service replicas: 3 selector: matchLabels: app: my-app template: metadata: labels: app: my-app spec: containers: - name: my-container image: my-image volumeMounts: - name: data mountPath: /data volumeClaimTemplates: - metadata: name: data spec: accessModes: [ "ReadWriteOnce" ] storageClassName: "standard" resources: requests: storage: 10Gi """ * **Do This:** Use "volumeClaimTemplates" to automatically create PVCs for each pod. * **Don't Do This:** Manually create PVCs for each pod in the StatefulSet. **Why:** "volumeClaimTemplates" simplify managing persistent storage for StatefulSets. Kubernetes will automatically create a PVC for each pod, named "<volumeClaimTemplateName>-<statefulset-name>-<pod-name>". * **Do This:** Understand the ordering guarantees provided by StatefulSets for pod creation, update, and deletion. * Pods are created sequentially, in order "0, 1, 2, ...". * Pods are updated in reverse ordinal order. * Pods are terminated in reverse ordinal order ("2, 1, 0"). * **Don't Do This:** Assume pods within a StatefulSet are identical and interchangeable. **Why:** StatefulSets are designed for applications where order and identity matter(e.g., clustered databases). ### 3.2. Pod Management Policy * **Do This:** Use the "OrderedReady" pod management policy unless you have a specific reason to use "Parallel". * **Don't Do This:** Use the "Parallel" pod management policy without understanding its implications on stateful application behavior. **Why:** "OrderedReady" ensures that each pod is fully ready before the next pod is created or updated. This is crucial for maintaining data consistency and availability in stateful applications. "Parallel" starts all pods at once which could lead to issues if your different instances need to coordinate. 
Example (StatefulSet):

"""yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  replicas: 2
  podManagementPolicy: OrderedReady
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.8 # k8s.gcr.io is deprecated; use registry.k8s.io
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "standard"
      resources:
        requests:
          storage: 1Gi
"""

### 3.3. Anti-Patterns: StatefulSets

* **Anti-Pattern:** Ignoring the ordered nature of StatefulSet deployments. This can lead to data corruption or inconsistent state in distributed systems.
  * **Better:** Design your application to handle ordered deployments and updates, and to leverage the pod ordinal index.
* **Anti-Pattern:** Scaling down a StatefulSet without considering the impact on data distribution and consistency.
  * **Better:** Implement graceful shutdown procedures that redistribute data before terminating pods.
* **Anti-Pattern:** Mounting the same persistent volume multiple times into the same pod. While Kubernetes will block ReadWriteOnce mounts, ReadOnlyMany and ReadWriteMany volumes have different considerations.
  * **Better:** Design container layouts and mounts with a 1-to-1 mapping from PV to directories intended for a single container.

## 4. Configuration Management

Managing configuration is critical for stateful apps. Use Kubernetes ConfigMaps and Secrets.

### 4.1. ConfigMaps

* **Do This:** Use ConfigMaps to store non-sensitive configuration data.
* **Don't Do This:** Store sensitive information in ConfigMaps.

**Why:** ConfigMaps are not encrypted and should not contain secrets like passwords or API keys.

**Code Example (ConfigMap):**

"""yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-config
data:
  my_config_key: "my_config_value"
"""

Then, access the ConfigMap from a container:

"""yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-container
        image: my-image
        env:
        - name: MY_CONFIG_VAR
          valueFrom:
            configMapKeyRef:
              name: my-config
              key: my_config_key
"""

### 4.2. Secrets

* **Do This:** Use Secrets to store sensitive information such as passwords, API keys, and certificates.
* **Don't Do This:** Hardcode secrets in your application code or configuration files.
* **Consider:** Use SealedSecrets, HashiCorp Vault, or other secrets management solutions for enhanced security, especially in production.

**Why:** Secrets are stored as base64-encoded strings and can be mounted as volumes or environment variables.

**Code Example (Secret):**

"""yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-secret
type: Opaque # Optional, define the type of secret
data:
  my_secret_key: bXlfc2VjcmV0X3ZhbHVl # base64 of "my_secret_value", e.g. from: echo -n 'my_secret_value' | base64
"""

**Note:** YAML does not perform command substitution, so the value must be the literal base64 string (or use the "stringData" field to provide plain text that the API server encodes). Don't store plain-text passwords in "data".

Access the Secret using environment variables:

"""yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-container
        image: my-image
        env:
        - name: MY_SECRET_VAR
          valueFrom:
            secretKeyRef:
              name: my-secret
              key: my_secret_key
"""

### 4.3. Projected Volumes

* **Do This:** Use projected volumes to inject multiple ConfigMaps and Secrets into a pod as a single volume.
* **Don't Do This:** Mount ConfigMaps and Secrets as individual volumes unless absolutely necessary.

**Why:** Projected volumes simplify configuration management by providing a single point of access for multiple configuration sources.

**Code Example (Projected Volume):**

"""yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: my-image
    volumeMounts:
    - name: config-volume
      mountPath: /etc/config
      readOnly: true
  volumes:
  - name: config-volume
    projected:
      sources:
      - configMap:
          name: my-config
      - secret:
          name: my-secret
"""

### 4.4. Anti-Patterns: Configuration

* **Anti-Pattern:** Storing secrets directly in ConfigMaps.
  * **Better:** Use Secrets for sensitive data, and consider a secrets management solution.
* **Anti-Pattern:** Hardcoding configuration values into container images.
  * **Better:** Externalize configuration using ConfigMaps and Secrets.

## 5. Data Backup and Recovery

Implementing robust backup and recovery strategies is crucial for stateful applications.

### 5.1. Backup Strategies

* **Do This:** Regularly back up your persistent volumes using Kubernetes-aware backup solutions like Velero, Kopia, or cloud provider-specific tools.
* **Don't Do This:** Rely solely on manual backups. Automate the backup process to minimize data loss.

**Why:** Regular backups protect against data loss due to hardware failures, accidental deletions, or other disasters.

### 5.2. Recovery Procedures

* **Do This:** Document your recovery procedures and test them regularly.
* **Don't Do This:** Wait until a disaster occurs to figure out how to restore your data.

**Why:** Practiced recovery procedures minimize downtime and ensure you can restore your application to a known good state.

### 5.3. Volume Snapshots

* **Do This:** Utilize Volume Snapshots, if supported by your CSI driver and storage provider.
* **Don't Do This:** Ignore snapshots if they are provided by your storage backend, as they improve backup and restore times significantly.

**Why:** Volume snapshots allow you to create a point-in-time copy of a volume, which can then be restored quickly and easily. Backup solutions can be configured to use snapshots for even faster backups.

**Code Example (VolumeSnapshotClass):**

"""yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-vsc
driver: ebs.csi.aws.com # Or your provider's CSI driver
deletionPolicy: Delete # Or Retain, based on your disaster recovery plan
parameters:
  csi.storage.k8s.io/snapshotter-secret-name: aws-secret
  csi.storage.k8s.io/snapshotter-secret-namespace: default
"""

**Code Example (VolumeSnapshot):**

"""yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: my-snapshot
spec:
  volumeSnapshotClassName: csi-aws-vsc
  source:
    persistentVolumeClaimName: my-pvc
"""

### 5.4. Anti-Patterns: Backup and Recovery

* **Anti-Pattern:** Infrequent backups.
  * **Better:** Schedule backups based on your application's RPO (Recovery Point Objective).
* **Anti-Pattern:** Lack of tested recovery procedures.
  * **Better:** Regularly test your recovery procedures to ensure they work in a real disaster scenario.
* **Anti-Pattern:** Storing backups in the same location as the primary data.
  * **Better:** Use offsite backups to protect against regional outages.

## 6. Advanced State Management Patterns

Kubernetes offers several advanced patterns for state management, particularly for complex stateful applications.

### 6.1.
Operators * **Do This:** Consider using Kubernetes Operators for managing complex stateful applications. * **Don't Do This:** Manually manage the lifecycle of complex applications in Kubernetes. **Why:** Operators encapsulate the operational knowledge of managing an application, automating tasks like provisioning, scaling, upgrades, and backups. * Operators use Custom Resource Definitions (CRDs) to extend the Kubernetes API. ### 6.2. Local Persistent Volumes * **Do This:** Use Local Persistent Volumes (Local PVs) for applications that require low-latency access to storage, such as distributed databases. * **Don't Do This:** Use Local PVs for applications that require high availability or data replication across multiple nodes. **Why:** Local PVs provide direct access to locally attached storage devices, improving performance but sacrificing some of the availability and portability of traditional PVs. ### 6.3. Data Locality Optimization * **Do This:** Strive to schedule pods needing data access to nodes where the data already resides, using node affinity or topology spread constraints. * **Don't Do This:** Ignore data locality as failing to optimize this can lead to significant performance degradation and increased network traffic. **Why:** Scheduling pods near their data can dramatically reduce latency and improve overall application performance. ### 6.4. Anti-Patterns: Advanced Patterns * **Anti-Pattern:** Overusing operators for simple applications. * **Better:** Operators are best suited for managing complex, stateful workloads. * **Anti-Pattern:** Ignoring the limitations of Local PVs. * **Better:** Understand the availability and portability trade-offs before using Local PVs. ## 7. Conclusion These standards provide a foundation for building robust and maintainable stateful applications in Kubernetes. By adhering to these guidelines, developers can ensure that applications manage state effectively, are highly available, and can be easily scaled and maintained. Remember to always consult the [official Kubernetes documentation](https://kubernetes.io/docs/home/) for the most up-to-date information. As Kubernetes evolves, so too will these best practices; continuous learning and adaptation are key to successful Kubernetes development.
# Performance Optimization Standards for Kubernetes This document outlines coding standards specifically focused on performance optimization for Kubernetes applications. Adhering to these standards will help ensure applications are fast, responsive, and efficient in their resource usage within a Kubernetes environment. These guidelines emphasize modern techniques using the latest Kubernetes features. ## 1. Container Image Optimization Optimized container images are crucial for fast deployment and reduced resource consumption. ### 1.1 Base Image Selection **Do This:** * Use minimal base images like "distroless" or "alpine". These images contain only the runtime dependencies required by your application, reducing the image size significantly. * If using Java, consider Eclipse Temurin JRE Slim images. * Regularly update base images to patch security vulnerabilities and leverage performance improvements. **Don't Do This:** * Don't use heavyweight base images like full OS distributions (e.g., Ubuntu) unless absolutely necessary. They add unnecessary layers and increase image size. * Don't neglect to update base images, leading to security risks and missed performance enhancements. **Why:** Smaller images download faster, consume less disk space, and reduce the attack surface. Regular updates ensure security and performance benefits provided by the base image. **Example:** """dockerfile # Correct: Using distroless base image for a Go application FROM golang:1.21-alpine AS builder WORKDIR /app COPY go.mod go.sum ./ RUN go mod download COPY . . RUN go build -o main . FROM gcr.io/distroless/static:latest WORKDIR /app COPY --from=builder /app/main /app/main ENTRYPOINT ["/app/main"] """ """dockerfile # Incorrect: Using a full Ubuntu image when distroless would suffice FROM ubuntu:latest RUN apt-get update && apt-get install -y --no-install-recommends some-dependency COPY . /app WORKDIR /app CMD ["python", "app.py"] """ ### 1.2 Multi-Stage Builds **Do This:** * Use multi-stage Docker builds to separate the build environment from the runtime environment. The build stage contains all the tools needed for compilation, while the final stage only contains the application and its runtime dependencies. * Leverage Docker's caching mechanism effectively by ordering the Dockerfile commands logically. Copy dependency files (e.g., "pom.xml", "requirements.txt") before the application code to maximize cache reuse when the code changes. **Don't Do This:** * Don't include build tools or intermediate files in the final image. * Don't invalidate the Docker cache unnecessarily by changing files that aren't dependencies. **Why:** Multi-stage builds result in smaller, more secure images, as build tools are not included in the final artifact. Efficient caching speeds up build times. **Example:** """dockerfile # Correct: Multi-stage build for a Java application FROM maven:3.9.6-eclipse-temurin-21-alpine AS builder WORKDIR /app COPY pom.xml . RUN mvn dependency:go-offline COPY src ./src RUN mvn clean install -DskipTests FROM eclipse-temurin:21-jre-alpine WORKDIR /app COPY --from=builder /app/target/*.jar app.jar EXPOSE 8080 ENTRYPOINT ["java", "-jar", "app.jar"] """ ### 1.3 Image Layer Optimization **Do This:** * Minimize the number of layers in the image by combining multiple commands into a single "RUN" instruction using "&&". * Sort multi-line arguments alphabetically to ensure consistency. **Don't Do This:** * Don't create a new layer for each command. 
* Don't randomly order multi-line arguments; inconsistent ordering causes cache misses and hurts readability.

**Why:** Fewer layers improve image build speed and reduce the final image size. Consistent command formatting assists with readability.

**Example:**

"""dockerfile
# Correct: Combining commands to reduce layers
# (--no-cache fetches a fresh index, so a separate "apk update" is unnecessary)
FROM alpine:latest
RUN apk add --no-cache --virtual .build-deps \
        gcc \
        git \
        musl-dev && \
    git clone https://github.com/some/repo.git && \
    cd repo && \
    make && \
    make install && \
    apk del .build-deps
"""

## 2. Resource Management

Efficient resource management is critical for optimal performance in Kubernetes.

### 2.1 Resource Requests and Limits

**Do This:**

* Define resource requests and limits for all containers in your deployments. Requests guarantee that the Pod is scheduled on a node with sufficient resources; limits prevent containers from consuming excessive resources and impacting other Pods.
* Tune requests and limits based on performance testing and monitoring. Initially, set requests based on observed consumption under normal load, and set limits slightly higher to allow for occasional spikes.
* Apply resource quotas at the namespace level to prevent resource exhaustion.

**Don't Do This:**

* Don't leave resource requests and limits undefined.
* Don't set limits equal to requests reflexively: doing so places the Pod in the "Guaranteed" QoS class, which suits critical workloads but removes the burst headroom that Burstable Pods get.
* Don't over-provision resources excessively, as this leads to resource waste and scheduling issues.
* Don't under-provision resources, which results in application throttling or out-of-memory errors.

**Why:** Resource requests ensure proper scheduling. Limits prevent resource hogging, improving overall cluster stability. Proper tuning optimizes resource utilization.

**Example:**

"""yaml
# Correct: Defining resource requests and limits
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-container
        image: my-image:latest
        resources:
          requests:
            cpu: "100m"
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
"""

### 2.2 Horizontal Pod Autoscaling (HPA)

**Do This:**

* Utilize Horizontal Pod Autoscaling (HPA) to automatically adjust the number of Pods based on CPU utilization, memory consumption, or custom metrics.
* Configure HPA with target utilization values that match the application's performance requirements and resource characteristics.
* Test HPA configurations thoroughly under different load conditions to verify scaling behavior.
* Consider the Kubernetes Event-driven Autoscaling (KEDA) project for autoscaling based on external metrics (e.g., queue length); see the sketch after the HPA example below.

**Don't Do This:**

* Don't rely solely on manual scaling, which is inefficient and prone to human error.
* Don't set HPA target utilization values too high or too low, which causes unnecessary scaling or performance bottlenecks.
* Don't ignore the impact of scaling events on dependent systems, such as databases.

**Why:** HPA automatically scales applications based on demand, optimizing resource utilization and ensuring responsiveness.

**Example:**

"""yaml
# Correct: Configuring HPA based on CPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
"""
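**Example (KEDA, illustrative):**

For queue-driven workloads, a KEDA "ScaledObject" can scale the same Deployment on an external metric such as queue depth. This is a minimal sketch assuming KEDA is installed in the cluster; the queue name, broker URL, and threshold are illustrative assumptions.

"""yaml
# Hypothetical: scale my-app on RabbitMQ queue depth via KEDA
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-app-scaledobject
spec:
  scaleTargetRef:
    name: my-app                 # the Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
  - type: rabbitmq
    metadata:
      queueName: work-queue      # assumed queue name
      mode: QueueLength
      value: "20"                # target messages per replica (assumed)
      host: amqp://guest:guest@rabbitmq.default.svc:5672/  # assumed broker URL
"""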
## 3. Network Optimization

Network performance is critical for distributed applications.

### 3.1 Service Mesh

**Do This:**

* Consider using a service mesh like Istio or Linkerd to manage and optimize inter-service communication. Service meshes provide features like traffic management, load balancing, observability, and security.
* Leverage service mesh features like traffic shifting for safe deployments and canary releases.

**Don't Do This:**

* Don't implement complex networking logic within the application code.
* Don't overlook the overhead introduced by the service mesh itself when evaluating its performance impact. Measure before and after integration.

**Why:** Service meshes abstract away networking complexities, enabling better observability, resilience, and security.

**Example (Istio):**

"""yaml
# Correct: Using Istio's traffic management to route traffic based on headers
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-app-vs
spec:
  hosts:
  - "my-app.example.com"
  gateways:
  - my-gateway
  http:
  - match:
    - headers:
        version:
          exact: "v2"
    route:
    - destination:
        host: my-app
        subset: v2
  - route:
    - destination:
        host: my-app
        subset: v1
"""

### 3.2 gRPC and Protocol Buffers

**Do This:**

* Use gRPC with Protocol Buffers for efficient communication between microservices when appropriate. gRPC is a high-performance, open-source RPC framework that offers features like binary serialization, HTTP/2 transport, and code generation.
* Define Protocol Buffers schemas carefully to minimize payload size and transmission overhead.
* Utilize gRPC's streaming capabilities for long-lived connections and high-throughput data transfer.

**Don't Do This:**

* Don't use REST APIs with JSON for internal communication if performance is critical.
* Don't define overly complex Protocol Buffers schemas that increase serialization and deserialization costs.

**Why:** gRPC offers significant performance advantages over REST/JSON, especially for inter-service communication, by reducing payload sizes and leveraging HTTP/2.

**Example (Protocol Buffers):**

"""protobuf
// Correct: Defining a simple Protocol Buffers message
syntax = "proto3";

package myapp;

message User {
  int32 id = 1;
  string name = 2;
  string email = 3;
}
"""

### 3.3 Connection Pooling

**Do This:**

* Implement connection pooling for database connections and other external resources to reduce the overhead of establishing new connections (see the sketch after this section).
* Configure connection pool settings (e.g., minimum connections, maximum connections, connection timeout) based on application requirements and database capacity.
* Monitor connection pool usage to identify potential bottlenecks or resource leaks.

**Don't Do This:**

* Don't create a new connection for each request.
* Don't set the maximum connection pool size too low, resulting in connection exhaustion.
* Don't set the connection timeout too high, resulting in long wait times.

**Why:** Connection pooling reduces latency. Reusing existing connections improves performance, especially for frequently accessed databases.
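**Example (illustrative, SQLAlchemy):**

A minimal connection pooling sketch using SQLAlchemy (1.4+); the database URL, table name, and pool numbers are illustrative assumptions, not recommendations.

"""python
# Hypothetical pooling sketch: SQLAlchemy keeps a pool of reusable connections.
import os

from sqlalchemy import create_engine, text

DATABASE_URL = os.environ.get(
    'DATABASE_URL',
    'postgresql://user:password@db.default.svc.cluster.local/mydb'  # assumed URL
)

engine = create_engine(
    DATABASE_URL,
    pool_size=5,        # connections kept open
    max_overflow=10,    # extra connections allowed under burst load
    pool_timeout=30,    # seconds to wait for a free connection
    pool_recycle=1800,  # recycle connections periodically to avoid stale ones
)

def fetch_user_count() -> int:
    # "with engine.connect()" borrows a pooled connection and
    # returns it to the pool when the block exits.
    with engine.connect() as conn:
        return conn.execute(text("SELECT count(*) FROM users")).scalar_one()
"""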
## 4. Application Optimization

Application-level optimization complements the infrastructure optimizations.

### 4.1 Caching

**Do This:**

* Implement caching at various levels (e.g., in-memory, distributed cache) to reduce latency and improve application responsiveness.
* Use a distributed cache like Redis or Memcached for data shared across multiple Pods, especially in read-heavy applications.
* Implement appropriate cache invalidation strategies (e.g., time-based, event-based) to ensure data consistency.
* Leverage Kubernetes ConfigMaps or Secrets for managing cache configuration parameters.

**Don't Do This:**

* Don't cache data excessively, which results in stale information.
* Don't use a single, large cache for all data, which can lead to performance bottlenecks.
* Don't store sensitive information in the cache without proper encryption.

**Why:** Caching significantly improves performance. Reducing database calls and external API requests lowers latency and improves responsiveness.

**Example (Redis):**

This assumes a Redis instance is running in your Kubernetes cluster.

"""python
# Correct: Using Redis for caching in a Python application
import os

import redis

redis_host = os.environ.get('REDIS_HOST', 'redis-master.default.svc.cluster.local')
redis_port = int(os.environ.get('REDIS_PORT', 6379))

redis_client = redis.Redis(host=redis_host, port=redis_port, db=0)

def get_data(key):
    cached_data = redis_client.get(key)
    if cached_data:
        return cached_data.decode('utf-8')
    else:
        # Fetch data from database or external API
        data = fetch_data_from_source(key)  # Replace this with your actual db/api call
        redis_client.set(key, data, ex=3600)  # Cache for 1 hour
        return data

def fetch_data_from_source(key):
    # Simulate a database lookup
    if key == "user_id_123":
        return "User Data from DB for ID 123"
    return "Default User Data"

# Example usage
user_data = get_data("user_id_123")
print(user_data)
"""

### 4.2 Asynchronous Processing

**Do This:**

* Use asynchronous processing for long-running or non-critical tasks to avoid blocking the main application thread.
* Implement message queues like RabbitMQ or Kafka to decouple application components and handle asynchronous tasks.
* Utilize worker queues and background processing frameworks like Celery (Python) or Spring Batch (Java) to manage asynchronous jobs.

**Don't Do This:**

* Don't perform time-consuming tasks synchronously in the request-response cycle.
* Don't overlook the complexity of managing distributed asynchronous systems.

**Why:** Asynchronous processing improves responsiveness. Decoupling components and handling background jobs enhance overall application resilience.

**Example (Celery with Python):**

This example assumes Celery is installed and configured. The default broker URL below points at Redis for simplicity; to use RabbitMQ instead, set "CELERY_BROKER_URL" to an "amqp://" URL.

"""python
# tasks.py
import os

from celery import Celery

# Celery configuration
CELERY_BROKER_URL = os.environ.get('CELERY_BROKER_URL', 'redis://localhost:6379/0')
CELERY_RESULT_BACKEND = os.environ.get('CELERY_RESULT_BACKEND', 'redis://localhost:6379/0')

celery = Celery('tasks', broker=CELERY_BROKER_URL, backend=CELERY_RESULT_BACKEND)

@celery.task
def add(x, y):
    # Simulate a long-running operation
    import time
    time.sleep(5)
    return x + y
"""

"""python
# app.py
from flask import Flask, request

from tasks import add

app = Flask(__name__)

@app.route('/add')
def calculate_sum():
    x = int(request.args.get('x', 0))
    y = int(request.args.get('y', 0))
    task = add.delay(x, y)  # Send the task to Celery asynchronously
    return f"Addition task submitted with task ID: {task.id}"

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=5000)
"""

This offloads the sum calculation to a background worker handled by Celery.

### 4.3 Efficient Data Structures and Algorithms

**Do This:**

* Choose appropriate data structures and algorithms based on application requirements to minimize computational complexity and memory usage (see the sketch after this list).
* Use efficient data serialization formats like Protocol Buffers or MessagePack for data exchange.
* Optimize database queries by using indexes, prepared statements, and query caching.

**Don't Do This:**

* Don't use inefficient algorithms or data structures that lead to performance bottlenecks.
* Don't perform unnecessary data serialization or deserialization.
* Don't execute poorly optimized database queries.

**Why:** Efficient algorithms dramatically boost performance. Minimizing resource usage improves throughput and responsiveness.
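**Example (illustrative):**

A small sketch of how data structure choice changes lookup cost: membership tests against a list are O(n) per call, while a set averages O(1). The names and sizes here are made up for illustration.

"""python
# Hypothetical: list vs. set membership for a hot lookup path
import time

blocked_ids = list(range(1_000_000))
blocked_ids_set = set(blocked_ids)  # one-time O(n) conversion

def is_blocked_list(user_id: int) -> bool:
    return user_id in blocked_ids        # O(n) scan on every call

def is_blocked_set(user_id: int) -> bool:
    return user_id in blocked_ids_set    # O(1) average on every call

start = time.perf_counter()
is_blocked_list(999_999)
print(f"list lookup: {time.perf_counter() - start:.5f}s")

start = time.perf_counter()
is_blocked_set(999_999)
print(f"set lookup:  {time.perf_counter() - start:.7f}s")
"""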
## 5. Monitoring and Profiling

Continuous monitoring and profiling are essential for identifying and resolving performance issues.

### 5.1 Metrics Collection

**Do This:**

* Collect application metrics (e.g., response time, error rate, resource utilization) using tools like Prometheus and Grafana.
* Expose metrics in the Prometheus format.
* Use meaningful metric names and labels to facilitate analysis and troubleshooting.
* Set up alerts to notify operators of performance anomalies.

**Don't Do This:**

* Don't neglect to collect metrics, making it difficult to identify performance problems.
* Don't collect excessive metrics, leading to storage and processing overhead.

**Why:** Metrics provide visibility into application behavior. Identifying bottlenecks allows for targeted optimization efforts.

**Example (Prometheus):**

"""python
# Correct: Exposing metrics in a Python application using the prometheus_client library
import random
import time

from prometheus_client import start_http_server, Summary

# Create a metric to track time spent and requests made.
REQUEST_TIME = Summary('request_processing_seconds', 'Time spent processing request')

# Decorate function with metric.
@REQUEST_TIME.time()
def process_request(t):
    """A dummy function that takes some time."""
    time.sleep(t)

if __name__ == '__main__':
    # Start up the server to expose the metrics.
    start_http_server(8000)
    # Generate some requests.
    while True:
        process_request(random.random())
"""

Then configure Prometheus to scrape this endpoint.

### 5.2 Profiling

**Do This:**

* Use profiling tools like "pprof" (Go), "cProfile" (Python), or Java profilers to identify performance hotspots in the code, as sketched below.
* Profile applications under realistic load conditions to simulate production behavior.
* Analyze profiling results to pinpoint the root cause of performance issues.
* Regularly run performance benchmarks to track the effect of optimizations.

**Don't Do This:**

* Don't rely solely on intuition when optimizing code.
* Don't profile in isolation; account for the real-world production environment.

**Why:** Profiling identifies code-level bottlenecks. Focusing optimization efforts on the most impactful areas maximizes performance gains.
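**Example (illustrative, "cProfile"):**

A minimal sketch of profiling a hot function with Python's built-in "cProfile"; the workload function is a made-up stand-in for a real hot path.

"""python
# Hypothetical: profile a hot code path and print the top offenders
import cProfile
import pstats

def slow_sum(n: int) -> int:
    # Stand-in for a real hot path.
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(2_000_000)
profiler.disable()

# Show the ten most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)
"""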
## 6. Kubernetes-Specific Optimizations

These optimizations are specific to Kubernetes environments.

### 6.1 Liveness and Readiness Probes

**Do This:**

* Configure liveness and readiness probes for all containers to ensure proper health checking and service discovery.
* Liveness probes detect when a container is unhealthy and needs to be restarted.
* Readiness probes determine when a container is ready to accept traffic.
* Tune probe parameters (e.g., initial delay, period, timeout) based on application characteristics.
* Avoid making probes overly complex or resource-intensive.

**Don't Do This:**

* Don't omit liveness and readiness probes.
* Don't make liveness probes too sensitive, resulting in unnecessary restarts.
* Don't make readiness probes too lenient, leading to traffic being routed to unhealthy containers.

**Why:** Probes ensure a healthy container lifecycle. Restarts and traffic routing are handled efficiently.

**Example:**

"""yaml
# Correct: Defining liveness and readiness probes
# (selector and template labels added so the Deployment is valid)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-container
        image: my-image:latest
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 5
        readinessProbe:
          httpGet:
            path: /readyz
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
"""

### 6.2 Init Containers

**Do This:**

* Use init containers for initialization tasks that must be completed before the main application container starts (e.g., database migrations, configuration loading).
* Keep init containers as lightweight as possible to minimize startup time.
* Use Kubernetes Jobs for tasks that need to run to completion but are not directly part of the long-running application (e.g., periodic data processing).

**Don't Do This:**

* Don't perform unnecessary tasks in init containers.
* Don't use init containers for long-running processes.

**Why:** Init containers separate initialization logic. Decoupling the setup from the main application container optimizes startup time.

**Example:**

"""yaml
# Correct: Using init containers for database migrations
# (selector and template labels added so the Deployment is valid)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      initContainers:
      - name: migrate-db
        image: migrate-image:latest
        command: ["migrate", "-path", "/migrations", "-database", "mysql://...", "up"]
      containers:
      - name: my-container
        image: my-image:latest
"""

By adhering to these coding standards, developers can build Kubernetes applications that are performant, scalable, and resilient. Continuous monitoring and incremental improvement are critical for long-term success.