AWS/GCP Senior DevOps Engineer Interview Q&A
Background
  • Experience: 6 years as a DevOps Engineer
  • Project: AI-powered mobile application for centralized smart home accessory control
  • Tech Stack:
    • DevOps Tools: Docker, Kubernetes, Jenkins, Git, Ansible, Terraform, Helm, Istio, GitHub Actions, GitLab, ELK, Prometheus, Grafana, Spinnaker, ArgoCD, Apache2, Nginx, Tomcat 8, Datadog, Rio, Nomad
    • Programming Languages: Python, Groovy, Shell Scripting, YAML
    • Operating Systems: Ubuntu, RedHat
    • Cloud Services: AWS, GCP
Project and Responsibilities
Q: Can you explain your project and your roles and responsibilities?
A: I worked on an AI-powered mobile application that serves as a centralized controller for smart home accessories and devices. The platform uses natural language processing and machine learning to interpret user commands and control various home systems.
As a Senior DevOps Engineer, I was responsible for:
  1. Building and maintaining the CI/CD pipeline using Jenkins and later GitHub Actions, reducing deployment time by 70%.
  2. Managing Kubernetes clusters on AWS EKS and GCP GKE, handling the infrastructure for 300+ microservices that powered the AI backend.
  3. Implementing infrastructure as code using Terraform to provision and manage all cloud resources across AWS and GCP environments.
  4. Setting up monitoring and observability with Prometheus, Grafana, and ELK stack to ensure platform reliability with 99.9% uptime.
  5. Automating deployment workflows using ArgoCD for GitOps practices, ensuring infrastructure and application consistency.
  6. Designing and implementing auto-scaling solutions for handling variable workloads, particularly during peak usage times.
  7. Creating and maintaining Docker images for microservices, optimizing them for faster startup and smaller footprint.
  8. Implementing security best practices, including vulnerability scanning in CI/CD pipeline and secret management with HashiCorp Vault.
  9. Establishing disaster recovery procedures that reduced our RTO from 4 hours to 30 minutes.
  10. Mentoring junior DevOps engineers and collaborating with development teams to improve deployment practices.
Infrastructure and Performance Issues

Q: If a server's memory utilization is at 100%, how do you resolve the issue?
A: When server memory utilization reaches 100%, I'd take these steps to resolve the issue:
  1. Immediate triage:
    • Check running processes with top or htop to identify memory-consuming processes
    • Use ps aux --sort=-%mem to sort processes by memory usage
    • Run free -m to verify swap usage and available memory
  2. Emergency mitigation:
    • Restart memory-hogging services if non-critical
    • Clear application caches if applicable
    • Adjust JVM heap settings for Java applications
    • In extreme cases, steer the Linux OOM killer toward non-critical processes with echo 1000 > /proc/[PID]/oom_score_adj
  3. Root cause analysis:
    • Check application logs for memory leaks or unusual behavior
    • Use tools like pmap to examine memory allocation per process
    • Review recent deployments or configuration changes
    • Analyze memory usage patterns with Prometheus/Grafana metrics
  4. Long-term solutions:
    • Implement vertical scaling (increase server memory)
    • Configure proper resource limits in Kubernetes/Docker
    • Set up horizontal scaling for distributed load
    • Implement memory usage alerts at 80% threshold
    • Optimize application code and database queries
    • Add caching layers where appropriate
For our AI platform, we faced similar issues with NLP processing nodes. I implemented autoscaling policies based on memory thresholds and optimized our container resource allocation, reducing memory-related incidents by 85%.
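To make the Kubernetes-level fixes above concrete, here is a minimal sketch combining explicit memory requests/limits with a memory-based Horizontal Pod Autoscaler. The nlp-processor name, image, and thresholds are illustrative placeholders, not values from the actual project:
yaml

# Hypothetical NLP worker with explicit memory requests/limits
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nlp-processor
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nlp-processor
  template:
    metadata:
      labels:
        app: nlp-processor
    spec:
      containers:
        - name: nlp-processor
          image: registry.example.com/nlp-processor:1.0   # placeholder image
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1"
---
# Scale out before memory pressure turns into OOM kills
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nlp-processor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nlp-processor
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75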

Q: An application is facing a latency issue; how do you find the cause?
A: To diagnose and resolve application latency issues, I'd follow this systematic approach:
  1. Define the baseline:
    • Check historical metrics to understand normal performance
    • Quantify the current latency issue (how much slower than usual)
  2. End-to-end monitoring:
    • Analyze request tracing with tools like Jaeger or Zipkin to identify slow components
    • Check Prometheus/Datadog metrics for unusual patterns
    • Review all microservices in the request path
  3. Infrastructure inspection:
    • Verify CPU/memory usage on affected services
    • Check network performance between services
    • Monitor database query execution time and load
    • Inspect cloud provider status for regional issues
  4. Application profiling:
    • Use APM tools to identify slow code paths
    • Check for N+1 query problems
    • Review recent deployments that might have introduced regressions
  5. Specific checks:
    • Database: Review slow query logs, check for missing indexes
    • Caching: Verify cache hit rates, potential evictions
    • External APIs: Test direct calls to determine if third-party services are slow
    • Load balancers: Check for proper distribution of traffic
For our AI application, I encountered similar issues where latency spiked during peak hours. By implementing distributed tracing, I identified that our ML inference service was bottlenecked. We implemented a queue-based architecture with horizontal scaling that improved p95 latency by 60%.
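As a hedged illustration of catching such regressions early, here is a Prometheus alert on p95 latency. It assumes the Prometheus Operator is installed and that services expose a http_request_duration_seconds histogram, which is a common but not universal naming convention:
yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: latency-alerts
  namespace: monitoring
spec:
  groups:
    - name: api-latency
      rules:
        - alert: HighP95Latency
          # p95 over the last 5 minutes, per service
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
            ) > 0.5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "p95 latency above 500ms for {{ $labels.service }}"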

Q: How do you resolve this type of latency issue in an application?
A: Once I've identified the root cause of the latency, I implement targeted solutions:
  1. Code-level optimizations:
    • Refactor inefficient algorithms
    • Implement caching for expensive operations
    • Optimize database queries with proper indexing
    • Add connection pooling for external services
  2. Infrastructure improvements:
    • Scale horizontally by adding more instances of bottlenecked services
    • Implement auto-scaling based on CPU/memory thresholds
    • Scale vertically by upgrading to larger instances
    • Move to faster storage solutions if I/O is the bottleneck
  3. Architecture changes:
    • Implement asynchronous processing for non-critical operations
    • Add message queues to handle traffic spikes
    • Use circuit breakers to prevent cascading failures
    • Implement CDN for static content delivery
  4. Database optimizations:
    • Add read replicas to distribute query load
    • Implement database sharding for write-heavy workloads
    • Use connection pooling to reduce connection overhead
    • Optimize slow queries identified in analysis
  5. Caching strategy:
    • Implement Redis/Memcached for frequently accessed data
    • Use browser caching for frontend assets
    • Add API response caching with appropriate TTL
For our AI application, the most impactful changes were:
  • Implementing a Redis cache layer for frequent AI model requests, reducing database load by 40%
  • Moving CPU-intensive ML inference to dedicated, optimized instances
  • Implementing a priority queue system for processing requests
  • Setting up Istio service mesh to better control traffic routing and retries
After these changes, we reduced average response time from 800ms to 120ms and improved throughput by 3x.
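To illustrate the Istio traffic-control point above, a minimal VirtualService sketch with a timeout and retries; the inference-service host and the specific values are assumptions rather than the production configuration:
yaml

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: inference-routing
spec:
  hosts:
    - inference-service          # hypothetical in-mesh service name
  http:
    - route:
        - destination:
            host: inference-service
      timeout: 2s                # fail fast instead of queueing indefinitely
      retries:
        attempts: 3
        perTryTimeout: 500ms
        retryOn: 5xx,reset,connect-failure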

Q: An AWS instance's OS is corrupted, but the server contains very critical data and login no longer works. How do you troubleshoot the server and recover the data?
A: When an AWS instance has a corrupted OS with critical data and the server login isn't working, I'd follow this recovery process:
  1. Stop the instance (don't terminate) to preserve the EBS volumes.
  2. Create EBS snapshots of all attached volumes immediately for backup.
  3. Detach the root volume from the corrupted instance.
  4. Launch a temporary recovery EC2 instance in the same Availability Zone.
  5. Attach the corrupted volume to the recovery instance as a secondary volume:
    aws ec2 attach-volume --volume-id vol-xxxxx --instance-id i-xxxxx --device /dev/sdf
  6. Mount the volume on the recovery instance to access data:
    sudo mkdir /mnt/recovery
    sudo mount /dev/xvdf1 /mnt/recovery
  7. If mounting fails due to filesystem corruption, try filesystem repair:
    sudo fsck -y /dev/xvdf1
  8. Extract critical data from the mounted volume to a secure location:
    sudo cp -r /mnt/recovery/path/to/critical/data /home/ec2-user/recovered-data
  9. Transfer the data to S3 or EBS snapshot for safekeeping:
    aws s3 cp /home/ec2-user/recovered-data s3://my-backup-bucket/ --recursive
  10. For system recovery, create a new instance from a fresh AMI and restore the data.
  11. If needed, use AWS Backup or create an automated snapshot solution to prevent future data loss.
I successfully recovered critical AI training data using this method when our model training instance became unbootable after a failed kernel update, preserving months of work without data loss.
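For the automated snapshot idea in the last step, a minimal CloudFormation sketch of an AWS Backup vault and daily plan; the names, schedule, and retention are placeholders rather than the actual setup:
yaml

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  CriticalDataVault:
    Type: AWS::Backup::BackupVault
    Properties:
      BackupVaultName: critical-data-vault
  DailyEbsBackupPlan:
    Type: AWS::Backup::BackupPlan
    DependsOn: CriticalDataVault
    Properties:
      BackupPlan:
        BackupPlanName: daily-ebs-backups
        BackupPlanRule:
          - RuleName: daily-snapshots
            TargetBackupVault: critical-data-vault
            ScheduleExpression: cron(0 3 * * ? *)   # 03:00 UTC daily
            Lifecycle:
              DeleteAfterDays: 30
# A separate AWS::Backup::BackupSelection resource would attach tagged
# EBS volumes or instances to this plan.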
Q: A website is getting slow; how do you optimize it for better performance?
A: To optimize a slow website for better performance, I'd follow this systematic approach:
  1. Performance assessment:
    • Run performance tests using Lighthouse, WebPageTest, or GTmetrix
    • Set up real user monitoring (RUM) with Datadog or New Relic
    • Analyze server response times using application logs
  2. Frontend optimizations:
    • Compress and minify CSS/JS/HTML
    • Implement lazy loading for images and non-critical resources
    • Use proper image formats (WebP, AVIF) and responsive sizing
    • Implement browser caching with appropriate cache headers
    • Enable Gzip/Brotli compression
  3. Backend improvements:
    • Identify and optimize slow API endpoints
    • Implement caching for frequently accessed data with Redis
    • Optimize database queries with proper indexing
    • Use connection pooling for database connections
  4. Infrastructure enhancements:
    • Implement CDN for static assets (CloudFront, Cloudflare)
    • Scale application servers horizontally under load
    • Configure load balancer for proper traffic distribution
    • Upgrade server resources if CPU/memory is the bottleneck
  5. Advanced techniques:
    • Implement HTTP/2 or HTTP/3 for multiplexed connections
    • Use server-side rendering or static site generation where appropriate
    • Consider edge computing for global audiences
    • Implement service worker for offline capabilities
For our AI mobile app web interface, I reduced page load time from 4.2s to 1.3s by implementing CloudFront CDN with edge caching, optimizing API responses with Redis caching, and implementing proper image optimization. This improved user engagement by 35% and reduced server costs.
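For the compression step above, one possible sketch is enabling Gzip and Brotli through the ingress-nginx controller ConfigMap; the ConfigMap name and namespace depend on how the controller was installed, and the MIME-type lists are illustrative:
yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller   # name/namespace vary with the installation
  namespace: ingress-nginx
data:
  use-gzip: "true"
  gzip-types: "text/css application/javascript application/json image/svg+xml"
  enable-brotli: "true"
  brotli-types: "text/css application/javascript application/json"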
CI/CD and Jenkins
Q: What is the difference between declarative pipeline and scripted pipelines in Jenkins?
A: The main differences between declarative and scripted pipelines in Jenkins are:
Declarative Pipeline:
  • Uses a more structured, predefined syntax with strict hierarchical structure
  • Begins with pipeline block and requires specific sections like agent, stages, steps
  • Easier for beginners with less Jenkins/Groovy experience
  • Self-documenting with clear syntax
  • Built-in input validation that catches errors before runtime
  • Provides a simplified model with limited flexibility
  • Better integration with Blue Ocean visualization
Example:
groovy

pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                sh 'mvn clean package'
            }
        }
    }
}
Scripted Pipeline:
  • Based on Groovy scripting language with fewer syntax restrictions
  • Begins with node block
  • Offers more flexibility and control with full Groovy programming capabilities
  • Better for complex logic and conditions
  • Steeper learning curve requiring Groovy knowledge
  • Provides programmatic flow control (if/else, try/catch)
  • Runtime error checking (versus compile-time checking)
Example:
groovy

node {
    stage('Build') {
        try {
            sh 'mvn clean package'
        } catch (Exception e) {
            currentBuild.result = 'FAILURE'
        }
    }
}
In our AI application pipeline, I used declarative for most standard deployment workflows but scripted pipelines for complex model training pipelines that required custom logic and dynamic resource allocation.
Q: How do you secure Jenkins jobs?
A: To secure Jenkins jobs, I implement these key measures:
  1. Access Control:
    • Use role-based authentication with the Role-based Authorization Strategy plugin
    • Implement project-based matrix authorization for granular permissions
    • Restrict job modification rights to senior DevOps engineers only
    • Enforce strict credential access controls
  2. Credential Management:
    • Store all secrets in Jenkins Credentials Plugin, never in job configs
    • Use credential binding for secure injection into pipelines
    • Rotate credentials regularly with automated processes
    • Implement AWS IAM roles for EC2 instances instead of static credentials
  3. Pipeline Security:
    • Enable script security for Groovy sandboxing
    • Implement strict approval process for custom scripts
    • Use Jenkinsfile from version-controlled repositories only
    • Validate all external inputs and parameters
  4. Infrastructure Security:
    • Run Jenkins master on private subnets with restricted access
    • Use Jenkins agents for execution with principle of least privilege
    • Implement network segmentation with security groups
    • Keep Jenkins and plugins regularly updated
  5. Audit and Compliance:
    • Enable comprehensive audit logging
    • Implement build history retention policies
    • Use the Audit Trail plugin to track all user actions
    • Regular security scans of Jenkins environment
For our AI project, I implemented Jenkins in a dedicated VPC with private agents, LDAP integration for authentication, and automated scanning of all build artifacts, which prevented several potential security incidents.

Q: How do you provide a user access to a specific job in Jenkins?
A: To provide a user access to specific jobs in Jenkins, I follow this process:
  1. Install and configure Project-based Matrix Authorization Strategy plugin if not already available
  2. Navigate to Jenkins dashboard > Manage Jenkins > Configure Global Security
  3. Under Authorization, select "Project-based Matrix Authorization Strategy"
  4. Set up global permissions (typically restrictive by default)
  5. For the specific job access:
    • Go to the specific job configuration
    • Find the "Enable project-based security" checkbox and enable it
    • Add the specific user with only the required permissions:
      • Job/Read: To view the job
      • Job/Build: To trigger builds
      • Job/Workspace: To access workspace files
      • Job/Cancel: To cancel running builds if needed
  6. Verify access by having the user log in and confirm they can only see and interact with the intended job
  7. For managing multiple jobs with similar permissions:
    • Create a folder structure with the Jenkins Folders plugin
    • Apply permissions at the folder level
    • Place related jobs within the folder
  8. Document the access granted in our access control register for audit purposes
For our AI application CI/CD pipeline, I implemented this approach to give data scientists access to only their model training jobs while restricting access to production deployment pipelines.
Q: Can you explain what shared libraries are and how they are used in your projects?
A: In my projects, I've extensively used Jenkins Shared Libraries to create reusable pipeline code across our AI application ecosystem. Here's how I implemented and utilized them:
Shared Libraries Implementation:
  1. I created a dedicated Git repository for our shared libraries with this structure:
    • vars/: Contains global variables/functions used in pipelines
    • src/: Houses Java/Groovy classes for complex logic
    • resources/: Stores non-Groovy files like JSON templates
  2. Configured the shared library in Jenkins:
    • Set up as a "Global Shared Library" in Jenkins configuration
    • Used semantic versioning with git tags for library versioning
    • Implemented approval process for library changes
Practical Usage Examples:
  1. Standardized CI/CD Stages:
    • Created reusable functions for common pipeline stages:
      groovy

      // vars/standardBuild.groovy
      def call(Map config) {
          sh "docker build -t ${config.imageName} ."
          sh "docker tag ${config.imageName} ${config.registry}/${config.imageName}:${config.version}"
          sh "docker push ${config.registry}/${config.imageName}:${config.version}"
      }
  2. Security Scanning Integration:
    • Built shared functions for security tools integration:
      groovy

      // vars/securityScan.groovy
      def call(String imageName) {
          sh "trivy image ${imageName} --severity HIGH,CRITICAL"
      }
  3. Deployment Functions:
    • Created deployment helpers for different environments:
      groovy

      // vars/deployToK8s.groovy
      def call(String environment, String appName, String version) {
          sh "helm upgrade --install ${appName} ./charts/${appName} --set image.tag=${version} -n ${environment}"
      }
  4. Notification System:
    • Implemented standardized Slack/Teams notifications:
      groovy

      // vars/notifyBuildStatus.groovy
      def call(String status) {
          // Logic to send appropriate notifications
      }
These shared libraries significantly reduced code duplication across our 30+ microservices, ensured consistent security practices, and allowed us to update deployment processes centrally. When we needed to modify our Kubernetes deployment strategy, I only had to update the shared library once rather than changing dozens of Jenkinsfiles.
Kubernetes Management
Q: In Kubernetes, one pod is restarting multiple times; how do you troubleshoot and rectify the issue?
A: To troubleshoot and rectify a Kubernetes pod that's restarting multiple times, I follow this process:
  1. Check pod status and restart information:
    kubectl get pod <pod-name> -n <namespace>
    kubectl describe pod <pod-name> -n <namespace>
    • Look for restart count, last state, and termination reason
  2. Examine pod logs:
    # Current logs
    kubectl logs <pod-name> -n <namespace>
    # Previous container logs if it's crashed
    kubectl logs <pod-name> -n <namespace> --previous
  3. Check for resource constraints:
    • Look for OOMKilled errors indicating memory issues
    • Review resource requests/limits in pod spec
    • Check node resource availability:
      kubectl describe node <node-name>
  4. Verify health probes:
    • Check if liveness or readiness probes are failing
    • Ensure probe endpoints are responding correctly
    • Temporarily adjust probe timeouts/thresholds if needed
  5. Inspect ConfigMaps and Secrets:
    • Verify the pod can access required configuration
    • Check for typos or misconfiguration
  6. Check container dependencies:
    • Ensure the application can connect to databases, caches, APIs
    • Check network policies aren't blocking required traffic
  7. Common solutions based on findings:
    • Increase resource limits if OOMKilled
    • Fix application bugs if logs show application errors
    • Adjust health probe configuration if too strict
    • Update environment variables if misconfigured
    • Fix image version if incompatible with environment
For our AI application, I encountered restarts due to memory pressure during model inference. I resolved it by adjusting resource limits, implementing memory optimization in the app, and setting up horizontal pod autoscaling based on memory utilization.

Q: What are the probes in Kubernetes? How do you use them in your projects?
A: In Kubernetes, probes are health-checking mechanisms that determine the state and availability of pods. I've used three types of probes extensively in my projects:
  1. Liveness Probes:
    • Detect if a container is running but deadlocked or in a broken state
    • Trigger container restart when they fail
    • In our AI application, I implemented HTTP liveness probes for web services:
      yaml

      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
  2. Readiness Probes:
    • Determine if a pod is ready to receive traffic
    • Failed readiness probes remove pod from service endpoints
    • For our database-dependent services, I used:
      yaml

      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 5
  3. Startup Probes:
    • Specifically for slow-starting containers
    • Disable liveness/readiness until the application is fully initialized
    • Critical for our ML inference services that load large models:
      startupProbe:
        httpGet:
          path: /startup
          port: 8080
        failureThreshold: 30
        periodSeconds: 10

Implementation strategy I used:
  • For web services: HTTP probes checking custom health endpoints
  • For database services: TCP socket probes verifying port accessibility
  • For cache services: Exec probes running internal health checks
These probes significantly improved our platform stability by:
  1. Preventing traffic to pods that weren't fully initialized
  2. Automatically restarting deadlocked services
  3. Gracefully handling temporary dependency failures
For our AI model serving pods, which had long startup times, I combined startup probes with proper readiness checks to ensure pods weren't killed during model loading while still maintaining proper service health.
Q: What is cluster size in your project? How many nodes and pods are running in production?
A: In our production environment for the AI mobile app, our Kubernetes clusters were sized to handle our workload requirements while maintaining reliability and cost efficiency.
Cluster Size:
  • 3 separate clusters across different regions for high availability
  • Primary production cluster: 15 worker nodes
  • Secondary clusters: 8 worker nodes each
  • Each node: c5.2xlarge instances (8 vCPU, 16GB RAM)
  • Autoscaling configured to scale between 10-20 nodes based on demand
Pod Distribution:
  • Total running pods in production: ~350 pods
  • Core services: 120 pods (API gateways, authentication, data services)
  • AI/ML components: 80 pods (inference engines, NLP processors)
  • Supporting services: 150 pods (monitoring, logging, message queues)
Resource Management:
  • Implemented node affinity to separate CPU-intensive and memory-intensive workloads
  • Used pod disruption budgets to ensure service availability during upgrades
  • Configured horizontal pod autoscaling based on CPU/memory usage
  • Reserved capacity for system components (10% of cluster resources)
High Availability Configuration:
  • Spread critical services across multiple AZs
  • Implemented pod anti-affinity for critical services
  • Maintained minimum 3 replicas for stateless services
This sizing allowed us to handle 5M+ daily user requests with 99.95% uptime while maintaining enough headroom for traffic spikes during product launches or marketing campaigns.
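A short sketch of the pod disruption budget and anti-affinity settings mentioned above; the api-gateway labels and replica counts are illustrative:
yaml

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-gateway
spec:
  minAvailable: 2            # keep at least 2 replicas up during drains and upgrades
  selector:
    matchLabels:
      app: api-gateway
---
# Fragment of the corresponding Deployment pod spec: spread replicas across zones
#   affinity:
#     podAntiAffinity:
#       requiredDuringSchedulingIgnoredDuringExecution:
#         - labelSelector:
#             matchLabels:
#               app: api-gateway
#           topologyKey: topology.kubernetes.io/zone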
Q: How many pods are running on each node?
A: In our production clusters, we maintained a balanced pod distribution across nodes to ensure optimal resource utilization while avoiding overloading any single node. The exact distribution was:
  • Average pods per node: 23-25 pods
  • Maximum pods per node: 30 (resource limit enforced via kubelet)
  • Minimum pods per node: 15 (system pods and critical services)
Node capacity was determined by both resource allocation and kube-reserved settings:
  • CPU allocation: Maximum 80% of available CPU (leaving headroom)
  • Memory allocation: Maximum 75% of available memory
  • Network capacity: Considered for data-intensive services
We used node labels and taints to ensure specialized workloads (like our ML inference engines which required GPU access) were scheduled on appropriate nodes, with typically fewer pods (8-10) on these specialized nodes due to their higher resource requirements.
For system stability, we configured pod disruption budgets and used PodAntiAffinity to prevent critical service pods from clustering on the same nodes, ensuring even distribution and fault tolerance.
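To illustrate the labels-and-taints approach, a hedged pod sketch for a GPU inference workload; the label, taint, and image names are hypothetical:
yaml

# Assumes GPU nodes were prepared along these lines:
#   kubectl label node <node> workload=gpu-inference
#   kubectl taint node <node> nvidia.com/gpu=present:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: ml-inference
spec:
  nodeSelector:
    workload: gpu-inference
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: present
      effect: NoSchedule
  containers:
    - name: inference
      image: registry.example.com/ml-inference:1.0   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1    # requires the NVIDIA device plugin on the node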

Q: How do you manage application pods in EKS cluster?
A: To manage application pods in an EKS cluster, I followed these key practices:
  1. Deployment Strategy:
    • Used Kubernetes Deployments and StatefulSets for workload management
    • Implemented GitOps with ArgoCD for application deployment
    • Maintained declarative manifests in Git repositories
    • Applied blue/green deployment patterns for zero-downtime updates
  2. Resource Management:
    • Set appropriate resource requests and limits for all pods
    • Implemented Horizontal Pod Autoscaling (HPA) based on CPU/memory metrics
    • Used Pod Disruption Budgets to ensure availability during updates
    • Applied node selectors and affinity rules to optimize pod placement
  3. Configuration Management:
    • Maintained environment-specific configs via ConfigMaps and Secrets
    • Used AWS Secrets Manager with External Secrets Operator for sensitive data
    • Implemented Helm charts for templating and parameterization
    • Stored application configs in version control
  4. Monitoring and Observability:
    • Set up Prometheus and Grafana dashboards to monitor pod health
    • Implemented custom metrics for autoscaling ML workloads
    • Used Datadog for application performance monitoring
    • Set up ELK stack for centralized logging and log analysis
  5. Network Management:
    • Implemented appropriate NetworkPolicies for pod-to-pod communication
    • Used AWS ALB Ingress Controller for external access
    • Set up service mesh with Istio for advanced traffic management
    • Configured proper security groups at the cluster level
For our AI application, I used Helm charts to templatize deployments across dev, staging, and production environments, with ArgoCD ensuring state consistency between Git and the cluster.
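As a sketch of that GitOps flow, an ArgoCD Application pointing at a Helm chart; the repository URL, chart path, and values file are assumptions for illustration:
yaml

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-gateway
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/ai-app/deploy-manifests.git   # placeholder repo
    targetRevision: main
    path: charts/api-gateway
    helm:
      valueFiles:
        - values-production.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true              # remove resources deleted from Git
      selfHeal: true           # revert manual drift back to the Git state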

Q: How can a pod in namespace A communicate with a pod in namespace B?
A: In Kubernetes, a pod in namespace A can communicate with a pod in namespace B using the following approaches:
  1. Service DNS resolution:
    • The most common and recommended method
    • Use fully qualified domain name (FQDN) format:
      <service-name>.<namespace>.svc.cluster.local
    • For example, if namespace B has a service called "api-service", a pod in namespace A would connect to:
      api-service.namespace-b.svc.cluster.local:8080
  2. NetworkPolicy configuration:
    • By default, all pods can communicate across namespaces
    • If NetworkPolicies are in place, they must explicitly allow cross-namespace traffic:
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: allow-from-namespace-a
        namespace: namespace-b
      spec:
        podSelector: {}
        ingress:
          - from:
              - namespaceSelector:
                  matchLabels:
                    name: namespace-a
  3. Service Export/Import:
    • For more complex setups, use a service mesh like Istio
    • Create a ServiceEntry in namespace A to explicitly import services from namespace B
For our AI application, we used the DNS approach with proper NetworkPolicies to allow our frontend services in the "web" namespace to communicate with backend AI services in the "ml-services" namespace while maintaining security boundaries.
Q: Can you explain the network flow from the internet into your EKS cluster?
A: In our EKS cluster, the network flow from internet to our application pods follows this path:
  1. Internet Gateway:
    • External traffic enters through AWS Internet Gateway
    • Route tables direct traffic to appropriate VPC subnets
  2. Load Balancer Layer:
    • AWS Application Load Balancer (ALB) receives traffic
    • ALB performs TLS termination and initial request routing
    • AWS WAF integrated with ALB provides layer 7 protection against common attacks
  3. AWS VPC/Subnet Layer:
    • Traffic flows through VPC to properly configured security groups
    • EKS nodes placed in private subnets with NAT gateways for outbound traffic
    • Security groups limit traffic to EKS node ports
  4. Kubernetes Ingress Layer:
    • AWS ALB Ingress Controller translates Ingress resources to ALB configuration
    • Ingress resources define routing rules based on hostnames and paths
    • Example configuration:
      yaml

      apiVersion: networking.k8s.io/v1
      kind: Ingress
      metadata:
        name: api-gateway-ingress
        annotations:
          kubernetes.io/ingress.class: alb
      spec:
        rules:
          - host: api.aiapp.example.com
            http:
              paths:
                - path: /
                  pathType: Prefix
                  backend:
                    service:
                      name: api-gateway
                      port:
                        number: 80
  5. Kubernetes Service Layer:
    • Services distribute traffic to pods using kube-proxy
    • ClusterIP services for internal communication
    • NodePort services for ALB integration
  6. Pod Network:
    • AWS VPC CNI plugin provides pod networking
    • Each pod has its own IP within the VPC CIDR
    • NetworkPolicies enforce pod-to-pod communication rules
For our AI application, we implemented a multi-tier architecture with separate ingress paths for web frontend, API services, and admin interfaces, all secured with proper authentication at the ingress layer.
Q: Secondary IP ranges are exhausted in your EKS cluster; how do you resolve this issue?
A: When secondary IP ranges are exhausted in an EKS cluster, I'd implement these solutions:
  1. Modify VPC CNI configuration:
    • Switch to custom networking mode with AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true (see the ENIConfig sketch after this list)
    • Configure prefix assignment mode to reduce IP usage:
      kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true

    • This allows assigning a /28 prefix per ENI instead of individual IPs, increasing IP capacity ~16x
  2. VPC and subnet expansion:
    • Create new VPC with larger CIDR block (e.g., /16 instead of /20)
    • Migrate to new VPC using cluster migration techniques
    • For temporary relief, add new subnets with unused CIDR ranges
  3. Optimize pod density:
    • Increase maxPods per node to maximize IP utilization
    • Use larger instance types with more ENIs for higher pod capacity
    • Implement pod consolidation to reduce total IP usage
  4. Instance type adjustment:
    • Switch to Nitro-based instances with higher ENI/IP limits
    • For critical sections, use placement groups to improve network performance
  5. Alternative CNI options:
    • Consider alternate CNI plugins like Calico in IPIP mode
    • This overlay approach removes dependency on VPC secondary IP ranges
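For the custom networking option in item 1, a minimal ENIConfig sketch (one per Availability Zone); the subnet and security group IDs are placeholders:
yaml

apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: us-east-1a                   # must match the AZ name (or the ENI_CONFIG_LABEL_DEF label value)
spec:
  subnet: subnet-0123456789abcdef0   # subnet carved from a secondary VPC CIDR (placeholder)
  securityGroups:
    - sg-0123456789abcdef0           # placeholder security group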
In our AI application cluster, we encountered this issue during rapid scaling. I implemented prefix delegation and strategically consolidated non-critical services, which resolved the IP exhaustion without requiring migration to a new VPC.
Monitoring and Observability
Q: How do you monitor these workloads?
A: To monitor our Kubernetes workloads in the EKS cluster, I implemented a comprehensive monitoring strategy:
  1. Prometheus and Grafana Stack:
    • Deployed Prometheus Operator via Helm for metrics collection
    • Set up custom Grafana dashboards for different service categories
    • Created specific dashboards for pod resources, API latency, and error rates
    • Implemented recording rules for frequently queried metrics
  2. Node and Infrastructure Monitoring:
    • Used node-exporter for hardware-level metrics
    • Integrated CloudWatch metrics for EKS control plane monitoring
    • Set up kube-state-metrics for cluster-level insights
    • Monitored EBS volumes and network interfaces
  3. Application Performance Monitoring:
    • Deployed Datadog agents with APM for transaction tracing
    • Instrumented critical services with OpenTelemetry
    • Created service maps to visualize dependencies
    • Set up custom metrics for AI model performance monitoring
  4. Logging Architecture:
    • Implemented Fluent Bit daemonset for log collection
    • Centralized logs in Elasticsearch (ELK stack)
    • Created Kibana dashboards for log analysis
    • Set up log-based alerts for critical errors
  5. Alert Management:
    • Configured Alertmanager with different severity levels
    • Set up PagerDuty integration for critical alerts
    • Used Slack for lower priority notifications
    • Implemented alert aggregation to prevent alert storms
  6. Custom Monitoring for AI Workloads:
    • Created specialized metrics for ML model inference times
    • Monitored model accuracy drift
    • Set up custom dashboards for NLP processor performance
  7. SLO/SLI Tracking:
    • Defined Service Level Objectives for critical paths
    • Implemented error budgets for key services
    • Created SLO dashboards to track performance against targets
This monitoring setup allowed us to maintain 99.95% uptime while quickly identifying and resolving performance bottlenecks across our AI application stack.
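A hedged sketch of the severity-based routing described above, as an alertmanager.yml fragment; the receiver names, keys, and channels are placeholders:
yaml

route:
  receiver: slack-notifications            # default receiver for everything else
  group_by: ['alertname', 'namespace']     # aggregate related alerts to limit alert storms
  routes:
    - matchers:
        - 'severity="critical"'
      receiver: pagerduty-oncall
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>   # placeholder secret
  - name: slack-notifications
    slack_configs:
      - channel: '#devops-alerts'
        api_url: <slack-webhook-url>               # placeholder secret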

Q: How does Prometheus get metrics from nodes? Can you explain in detail?
A: Prometheus collects metrics from Kubernetes nodes through a multi-layered architecture. Here's a detailed explanation of how this works:
  1. Node Exporter Deployment:
    • Deployed as a DaemonSet to ensure it runs on every node
    • The Node Exporter is a Prometheus exporter specifically designed to expose hardware and OS metrics
    • Configuration example:
      yaml

      apiVersion: apps/v1
      kind: DaemonSet
      metadata:
        name: node-exporter
        namespace: monitoring
      spec:
        selector:
          matchLabels:
            app: node-exporter
        template:
          metadata:
            labels:
              app: node-exporter
          spec:
            hostNetwork: true
            containers:
              - name: node-exporter
                image: prom/node-exporter:v1.3.1
                ports:
                  - containerPort: 9100
                    name: metrics
                volumeMounts:
                  - name: proc
                    mountPath: /host/proc
                    readOnly: true
                  - name: sys
                    mountPath: /host/sys
                    readOnly: true
            volumes:
              - name: proc
                hostPath:
                  path: /proc
              - name: sys
                hostPath:
                  path: /sys
  2. Service Discovery:
    • Prometheus uses Kubernetes API for service discovery
    • It identifies targets through role-based configurations
    • The Prometheus server configuration includes job definitions:
      yaml

      scrape_configs:
        - job_name: 'kubernetes-nodes'
          kubernetes_sd_configs:
            - role: node
          relabel_configs:
            - source_labels: [__address__]
              regex: '(.*):10250'
              replacement: '${1}:9100'
              target_label: __address__
              action: replace
  3. Metrics Collection Process:
    • Prometheus server polls each Node Exporter endpoint at a defined interval (typically 15-30s)
    • Node Exporter exposes an HTTP endpoint (usually :9100/metrics)
    • When scraped, Node Exporter collects current metrics from the OS
    • Metrics are returned in Prometheus text-based format
  4. Collected Node Metrics:
    • CPU usage and load
    • Memory utilization
    • Disk I/O and space usage
    • Network traffic and errors
    • System uptime and process counts
    • File descriptor usage
  5. Storage and Processing:
    • Metrics are stored in Prometheus' time-series database
    • Data is compressed and optimized for time-series queries
    • Retention period configured based on storage capacity (7-30 days in our setup)
  6. Additional Node-Level Exporters:
    • kube-state-metrics: Provides Kubernetes object metrics
    • cAdvisor: Collects container metrics (built into kubelet)
    • Custom exporters for specific applications
  7. Handling Node Changes:
    • Prometheus dynamically updates targets when nodes are added/removed
    • Relabeling rules standardize metrics across heterogeneous nodes
    • Service discovery refreshes at configurable intervals
For our AI platform, we extended this with custom metrics for GPU utilization and memory usage, which were critical for monitoring our ML inference nodes. We also implemented recording rules to pre-compute frequently used aggregations, reducing query load during dashboard rendering.
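A small example of those recording rules, as a Prometheus rule-file snippet that pre-computes per-instance CPU utilisation (the rule name follows a common convention and is illustrative):
yaml

groups:
  - name: node-aggregations
    rules:
      - record: instance:node_cpu_utilisation:rate5m
        expr: |
          1 - avg by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          )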
Q: How do you check if an application is not working? What steps do you take?
A: When I detect that an application is not working, I follow these steps to diagnose and resolve the issue:
  1. Verify the outage scope:
    • Check if the issue affects all users or specific segments
    • Determine if it's isolated to one service or impacting multiple components
    • Verify if it's environment-specific (prod vs. staging)
  2. Check monitoring dashboards first:
    • Review Grafana dashboards for spikes in error rates
    • Check service health metrics in Prometheus
    • Examine Datadog APM for transaction traces and errors
    • Look for correlated events in other services
  3. Investigate logs:
    • Query ELK stack for error messages:

      kubernetes.namespace: "app-namespace" AND log: "error" AND kubernetes.pod.name: "app-*"
    • Check for specific HTTP error codes in access logs
    • Look for application exception stacktraces
  4. Kubernetes-specific checks:
    • Verify pod status and restart counts:
      kubectl get pods -n <namespace> | grep <app-name>
      kubectl describe pod <pod-name> -n <namespace>
    • Check pod logs:
      kubectl logs <pod-name> -n <namespace> --tail=100
    • Verify service endpoints are correctly registered:
      kubectl get endpoints <service-name> -n <namespace>
  5. Infrastructure validation:
    • Check node resources (CPU/memory pressure)
    • Verify network connectivity between services
    • Check if AWS/GCP services are experiencing outages
    • Validate if backing services (databases, caches) are responsive
  6. Debugging and resolution:
    • If needed, access a problematic pod directly:
      kubectl exec -it <pod-name> -n <namespace> -- /bin/bash
    • Test connectivity to dependent services
    • Check application configuration
    • Validate recent changes through deployment history
  7. Resolution approaches:
    • Rollback to previous working version if recent deployment caused issue
    • Scale up resources if under load pressure
    • Restart affected services if temporary state issues are suspected
    • Apply configuration fixes if misconfiguration identified
For our AI application, I created standardized runbooks for common failure scenarios, which significantly reduced our mean time to resolution (MTTR) from 45 minutes to under 15 minutes.