Senior DevOps Engineer Interview Q&A


Introduction and Background

Q: Tell me about yourself. Which projects have you worked on, and what were your roles and responsibilities?

A: I'm a senior DevOps engineer with 7+ years of experience across AWS and GCP environments. I've specialized in building and optimizing cloud infrastructure, CI/CD pipelines, and observability solutions. Most recently, I led the cloud migration for FinTech Corp, moving their legacy application to a microservices architecture on AWS EKS. I designed the Terraform modules for network infrastructure, implemented GitOps with ArgoCD, and built observability using Prometheus and Grafana. Prior to that, I worked at TechSolutions Inc. where I managed multi-region Kubernetes clusters on GCP, implemented infrastructure as code using Terraform, and created automated deployment pipelines with GitHub Actions. My core responsibilities have included:

  • Architecting and managing cloud infrastructure on AWS/GCP
  • Implementing IaC with Terraform for reproducible environments
  • Building robust CI/CD pipelines with GitHub Actions and ArgoCD
  • Setting up monitoring and alerting with Prometheus, Grafana, and Datadog
  • Automating security scanning with SonarQube and OWASP tools
  • Troubleshooting and resolving production incidents

Technical Questions

Q: How do you migrate a legacy monolith application to microservices? What process do you follow?

A: For migrating a legacy monolith to microservices, I follow a structured approach:

  1. Assessment and Planning:
    • Analyze the monolith's codebase and dependencies
    • Identify bounded contexts that can become independent services
    • Create a migration roadmap with clear milestones
  2. Implement Strangler Pattern:
    • Place API gateway in front of the monolith
    • Gradually redirect traffic to new microservices (see the routing sketch after this list)
    • Keep the monolith running until fully replaced
  3. Database Decoupling:
    • Identify data ownership boundaries
    • Implement CDC (Change Data Capture) for transition period
    • Create service-specific databases or schemas
  4. Infrastructure Preparation:
    • Set up Kubernetes clusters with proper resource allocation
    • Implement service mesh for communication (Istio/Linkerd)
    • Configure CI/CD pipelines for each microservice
  5. Incremental Migration:
    • Start with least critical, least connected components
    • Implement feature flags for risk management
    • Run parallel testing between old and new implementations
  6. Observability Integration:
    • Implement distributed tracing across services
    • Set up centralized logging and monitoring
    • Create service-specific dashboards and alerts
  7. Post-Migration Optimization:
    • Remove deprecated monolith components
    • Optimize resource allocation and scaling policies
    • Document architecture and operational procedures
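
To make step 2 concrete, here is a minimal sketch of strangler-style traffic splitting with an Istio VirtualService; the host names, gateway, and the 90/10 split are illustrative assumptions, not details from the original project:

  apiVersion: networking.istio.io/v1beta1
  kind: VirtualService
  metadata:
    name: orders-strangler          # hypothetical name
  spec:
    hosts:
      - app.example.com             # hypothetical public host
    gateways:
      - public-gateway              # hypothetical ingress gateway
    http:
      - match:
          - uri:
              prefix: /api/orders   # the slice being strangled out
        route:
          - destination:
              host: orders-service      # new microservice
            weight: 10
          - destination:
              host: legacy-monolith     # existing monolith
            weight: 90
      - route:                          # everything else stays on the monolith
          - destination:
              host: legacy-monolith

As confidence grows, the weights shift toward the new service until the monolith route can be removed entirely.
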
Q: If a production server goes down, what actions would you take?

A: When a production server goes down, my immediate actions would be:

  1. Acknowledge the incident:
    • Check monitoring alerts to understand the scope (a sample alert rule follows this list)
    • Notify stakeholders via established communication channels
  2. Initial assessment:
    • Verify if it's an isolated server issue or broader system failure
    • Check infrastructure dashboards (AWS/GCP console, Grafana)
    • Review recent deployments or changes
  3. Implement immediate mitigation:
    • If it's an AWS/GCP instance: check for termination/stop events and failed health checks
    • For Kubernetes pods: Check logs, events, and resource constraints
    • Attempt restart or failover to replicas if available
  4. Deeper diagnosis:
    • Review logs in ELK, CloudWatch, or Stackdriver
    • Check for resource exhaustion (CPU/memory/disk)
    • Verify network connectivity and security group settings
  5. Resolution:
    • Apply fix based on root cause (scaling, configuration update, rollback)
    • Verify service restoration via monitoring and health checks
    • Update load balancers or DNS if needed
  6. Post-incident actions:
    • Document incident details
    • Schedule a post-mortem meeting
    • Implement preventative measures (improved monitoring, automated recovery)
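
For the detection side of step 1, a minimal Prometheus alerting rule like the following (group name, threshold, and wording are illustrative assumptions) pages on-call when a target stops responding to scrapes:

  groups:
    - name: availability
      rules:
        - alert: InstanceDown
          expr: up == 0            # target failed its last scrape
          for: 2m                  # avoid paging on a single blip
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.instance }} has been unreachable for 2 minutes"
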
Q: In one project, a production server was attacked by an outsider. How would you handle the situation, and what primary actions would you take?

A: When a production server is attacked, here's how I'd handle it:

  1. Immediate Containment:
    • Isolate the compromised server by restricting network access (a quarantine policy sketch follows this list)
    • Update security groups/firewall rules to block malicious IPs
    • If necessary, temporarily take the server offline
  2. Evidence Collection:
    • Capture forensic snapshots of affected instances
    • Preserve logs before any cleanup (CloudTrail, VPC Flow Logs, application logs)
    • Document the timeline of events
  3. Assess Impact and Scope:
    • Determine what systems were accessed and potential data exposure
    • Check for lateral movement to other servers
    • Review authentication logs for unauthorized access
  4. Incident Response:
    • Follow company incident response plan
    • Notify security team and management
    • Engage AWS/GCP support if needed
  5. Recovery:
    • Deploy clean server instances from known good AMIs/images
    • Restore from pre-attack backups if necessary
    • Apply all security patches and updates
  6. Post-Attack Security Hardening:
    • Implement additional WAF rules
    • Review IAM permissions and access controls
    • Set up enhanced monitoring and alerting
    • Conduct vulnerability scans with tools like OWASP ZAP
  7. Documentation and Prevention:
    • Document attack vectors and mitigation steps
    • Update runbooks and security protocols
    • Schedule security training for team members
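
For step 1 in a Kubernetes context, one way to quarantine a compromised pod is a deny-all NetworkPolicy keyed on a label; this is a minimal sketch, and the namespace and label name are assumptions:

  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: quarantine
    namespace: prod               # hypothetical namespace
  spec:
    podSelector:
      matchLabels:
        quarantine: "true"        # applied to the suspect pod
    policyTypes:                  # listing both types with no rules below
      - Ingress                   # denies all inbound traffic
      - Egress                    # denies all outbound traffic

Labeling the suspect pod (kubectl label pod <name> quarantine=true) cuts it off from the network while leaving it running for forensic inspection.
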
Q: One of your backend applications is hit by high traffic spikes and its services stop responding. How do you troubleshoot?

A: When troubleshooting a backend application experiencing high traffic spikes and non-responsiveness, I'd follow this approach:

  1. Immediate Assessment:
    • Check monitoring dashboards (Grafana/Datadog) for resource utilization
    • Verify traffic patterns in load balancer metrics (ALB/CloudFront)
    • Check error rates and response times in application logs
  2. Resource Analysis:
    • Check CPU, memory, and disk I/O on affected services
    • Verify database connection pool status and query performance
    • Look for network bottlenecks or throttling
  3. Scale Resources:
    • Trigger manual horizontal scaling if auto-scaling hasn't responded
    • Increase node count in Kubernetes clusters if pod scheduling is delayed
    • Adjust database read replicas or instance sizes if DB is the bottleneck
  4. Implement Traffic Management:
    • Enable rate limiting at API Gateway/Nginx level
    • Implement circuit breakers for failing downstream dependencies
    • Consider temporary caching strategies for read-heavy operations
  5. Short-term Mitigations:
    • Activate degraded mode for non-critical features
    • Redirect traffic to backup systems if available
    • Temporarily increase timeouts for dependent services
  6. Root Cause Investigation:
    • Analyze distributed traces to identify slowest components
    • Review recent deployments that might have affected performance
    • Check for inefficient queries or N+1 problems
  7. Post-Recovery Actions:
    • Implement proper auto-scaling policies (see the HPA sketch after this list)
    • Optimize database queries and indexes
    • Add load testing to CI/CD pipeline
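
As a concrete example for step 7, a Horizontal Pod Autoscaler like the following (the deployment name and replica bounds are illustrative assumptions) scales a backend on CPU utilization:

  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    name: backend-api             # hypothetical name
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: backend-api           # the deployment to scale
    minReplicas: 3                # keep headroom for sudden spikes
    maxReplicas: 30
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70   # scale out above 70% average CPU
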
Q: How do you manage sensitive information in CI/CD pipelines? 

A: For managing sensitive information in CI/CD pipelines, I use several secure approaches:

  1. Secrets Management:
    • HashiCorp Vault for dynamic secrets with short TTLs
    • AWS Secrets Manager/GCP Secret Manager for cloud-specific credentials
    • Integration with CI/CD platforms via appropriate plugins
  2. Environment Variables:
    • Store sensitive values as protected variables in CI/CD platforms
    • Mask secrets in logs and console output
    • Use environment-specific variable groups
  3. Infrastructure as Code:
    • Terraform remote state encryption
    • SOPS or git-crypt for encrypting values in repositories
    • Use of variable files that aren't committed to source control
  4. Access Control:
    • Implement least privilege for pipeline service accounts
    • Rotate credentials regularly through automation
    • Restrict access to production deployment pipelines
  5. Runtime Security:
    • Scan code and containers for leaked secrets using tools like TruffleHog
    • Implement approval gates for sensitive environment deployments
    • Use temporary credentials that expire after pipeline completion (see the workflow sketch after this list)
  6. Audit and Compliance:
    • Log all secret access attempts
    • Regular review of who has access to sensitive information
    • Integrate secret scanning into the pipeline itself
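
One way to realize point 5 in GitHub Actions is OIDC federation with AWS, so the pipeline holds no long-lived keys at all; this is a minimal sketch, and the role ARN, region, and deploy script are assumptions:

  # .github/workflows/deploy.yml
  name: deploy
  on:
    push:
      branches: [main]
  permissions:
    id-token: write               # required to request the OIDC token
    contents: read
  jobs:
    deploy:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v4
        - uses: aws-actions/configure-aws-credentials@v4
          with:
            role-to-assume: arn:aws:iam::123456789012:role/deploy-role  # hypothetical role
            aws-region: us-east-1
        - run: ./scripts/deploy.sh   # hypothetical script; runs with short-lived credentials
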
Q: How do you ensure running applications in Kubernetes are secure? 

A: To ensure security for applications running in Kubernetes, I implement multiple layers of protection:

  1. Image Security:
    • Scan container images using Trivy or Clair
    • Use minimal base images (distroless/Alpine)
    • Enforce signed images with admission controllers
  2. Pod Security:
    • Implement Pod Security Standards (PSS)
    • Run containers as non-root users (see the securityContext sketch after this list)
    • Use read-only filesystems where possible
    • Set resource limits to prevent DoS attacks
  3. Network Security:
    • Implement network policies for pod-to-pod communication
    • Use service meshes for mTLS (Istio/Linkerd)
    • Limit egress traffic to required endpoints
    • Configure proper ingress security with WAF
  4. Access Control:
    • Implement RBAC with least privilege
    • Use namespaces for separation and permissions boundaries
    • Regular review and audit of service accounts
    • Implement OpenID Connect for user authentication
  5. Runtime Security:
    • Deploy Falco for runtime threat detection
    • Implement OPA Gatekeeper for policy enforcement
    • Use seccomp and AppArmor profiles
  6. Secret Management:
    • Use Kubernetes Secrets with proper encryption
    • Consider external secret stores (Vault) with CSI drivers
    • Rotate secrets regularly
  7. Monitoring and Compliance:
    • Implement audit logging
    • Use Prometheus alerts for security-related events
    • Regular compliance scans with kube-bench
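
Points 2 and parts of point 5 come together in the pod spec itself; here is a minimal Deployment sketch (name, image, and resource values are illustrative assumptions) showing a hardened securityContext:

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: secure-app                  # hypothetical name
  spec:
    replicas: 2
    selector:
      matchLabels:
        app: secure-app
    template:
      metadata:
        labels:
          app: secure-app
      spec:
        securityContext:
          runAsNonRoot: true          # refuse to start root containers
          runAsUser: 10001
          seccompProfile:
            type: RuntimeDefault      # default seccomp syscall filtering
        containers:
          - name: app
            image: registry.example.com/app:1.4.2   # hypothetical image
            securityContext:
              allowPrivilegeEscalation: false
              readOnlyRootFilesystem: true
              capabilities:
                drop: ["ALL"]         # drop all Linux capabilities
            resources:
              requests:
                cpu: 100m
                memory: 128Mi
              limits:                 # bound resources to contain DoS blast radius
                cpu: 500m
                memory: 256Mi
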
Q: What were the most difficult challenges you faced in your last and previous projects? How did you resolve them?

A: In my most recent project, the most difficult challenge was migrating a critical payment processing system with zero-downtime requirements while handling 5,000+ TPS at peak. The system had complex stateful components and tight coupling between services. I resolved this by implementing a dual-write pattern in which the new microservices architecture ran in parallel with the legacy system, backed by a synchronization layer that kept data consistent between the two. I created a custom traffic-shifting mechanism using weighted routing in AWS ALB that let us transfer traffic gradually in 5% increments while monitoring error rates and performance, and we set up enhanced observability with distributed tracing across both systems to quickly identify and resolve integration issues.

In my previous role, the biggest challenge was securing a multi-tenant Kubernetes platform that hosted applications with strict compliance requirements. I addressed this with a combination of network policies, admission controllers, and a custom operator that enforced tenant isolation. We used OPA Gatekeeper to create policy-as-code that automatically validated all deployments against security benchmarks. The trickiest part was balancing security with developer productivity, which we solved by creating self-service security tooling and pre-approved templates.

Q: You said you worked on AWS EKS, right? How many services are running in Kubernetes? 

A: In our AWS EKS environment, we were running approximately 45-50 microservices in Kubernetes. This included:

  • 12 core business logic services handling the main application workflow
  • 8 data processing services for ETL and analytics pipelines
  • 5 authentication and authorization services
  • 7 integration services connecting to external APIs and partners
  • 4 notification services (email, SMS, push)
  • Several utility services for logging, monitoring, and administration
  • Infrastructure components like service mesh proxies, metrics collectors, and custom operators
The cluster was configured with node groups optimized for different workload types: compute-optimized for processing services, memory-optimized for caching and database services, and general-purpose for most API services. We used pod affinity/anti-affinity rules to ensure proper distribution across nodes and zones (a minimal anti-affinity fragment follows).
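
As an illustration of that last point, a required anti-affinity rule like this fragment of a pod template spec (the "payments-api" label is a hypothetical example) keeps replicas out of the same availability zone:

  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: payments-api                      # replicas of this app...
          topologyKey: topology.kubernetes.io/zone   # ...must land in different zones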

Q: What is observability and how do you implement it?

A: Observability is the ability to understand a system's internal state from its external outputs. It goes beyond monitoring by providing context and insight into why a system behaves the way it does. I implement observability through the three key pillars, plus a consistent rollout process:

  1. Metrics:
    • Deploy Prometheus for collecting time-series data
    • Set up Grafana dashboards for visualization
    • Implement custom metrics for business KPIs
    • Configure proper alerting thresholds
  2. Logs:
    • Centralize logs with ELK stack or CloudWatch
    • Structure logs in consistent JSON format
    • Include correlation IDs for request tracing
    • Implement log rotation and retention policies
  3. Distributed Tracing:
    • Implement OpenTelemetry instrumentation (see the collector sketch after this list)
    • Use Jaeger or X-Ray for trace visualization
    • Capture timing data across service boundaries
    • Tag traces with critical business context
  4. Implementation Steps:
    • Add observability as infrastructure code
    • Instrument applications consistently
    • Create standardized dashboards for services
    • Train teams on using observability tools
    • Set up anomaly detection for proactive alerts
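
Tying the pillars together, an OpenTelemetry Collector config along these lines (endpoints and names are illustrative assumptions) can receive OTLP data and fan it out to Prometheus and Jaeger:

  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
  processors:
    batch: {}                      # batch telemetry before export
  exporters:
    prometheus:
      endpoint: 0.0.0.0:8889       # scraped by Prometheus
    otlp/jaeger:
      endpoint: jaeger-collector:4317   # Jaeger's OTLP ingest; hypothetical address
      tls:
        insecure: true             # assumes in-cluster plaintext for this sketch
  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [batch]
        exporters: [otlp/jaeger]
      metrics:
        receivers: [otlp]
        processors: [batch]
        exporters: [prometheus]
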
Q: How can a microservice in an EKS cluster in one AWS account communicate with a microservice in an EKS cluster in another AWS account?

A: For cross-account communication between microservices in different AWS account EKS clusters, I implement these approaches:

  1. Network Connectivity:
    • Set up Transit Gateway or VPC Peering between accounts
    • Configure proper route tables and security groups
    • Ensure DNS resolution works across VPCs with Route 53 Resolver
  2. Service Discovery:
    • Implement AWS Cloud Map for cross-account service discovery
    • Use the external-dns controller to manage Route 53 entries (see the Service sketch after this answer)
    • Set up proper namespace isolation for multi-account services
  3. Authentication & Authorization:
    • Configure IAM roles for cross-account access
    • Use AWS STS for temporary credentials
    • Implement mTLS with service mesh for secure service-to-service communication
  4. API Gateway Pattern:
    • Expose specific services through API Gateway
    • Implement proper authorization (IAM, JWT)
    • Use private API endpoints accessible through VPC endpoints
  5. Event-Driven Pattern:
    • Use SNS/SQS for asynchronous communication
    • Configure cross-account topic policies
    • Implement event schemas for consistency
The most secure and scalable approach is typically a combination of proper network connectivity with a service mesh for traffic management and security.
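
As a sketch of the service-discovery piece, exposing a service through an internal load balancer with an external-dns-managed Route 53 record could look like this (service name, hostname, and ports are assumptions):

  apiVersion: v1
  kind: Service
  metadata:
    name: orders-api                # hypothetical service
    annotations:
      external-dns.alpha.kubernetes.io/hostname: orders.internal.example.com  # record managed by external-dns
      service.beta.kubernetes.io/aws-load-balancer-scheme: internal           # AWS Load Balancer Controller: keep the LB private
  spec:
    type: LoadBalancer
    selector:
      app: orders-api
    ports:
      - port: 443
        targetPort: 8443

The peer cluster's workloads then resolve orders.internal.example.com through the shared Route 53 Resolver and reach the service over the Transit Gateway or peering link.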

Q: A client approaches you to design and implement infrastructure for their application. What questions would you ask, and what information would you gather?

A: When a client approaches me to design and implement infrastructure for their application, I'd ask these key questions to gather critical information:

  1. Application Architecture:
    • Is this a monolithic or microservices application?
    • What programming languages and frameworks are used?
    • What are the stateful components (databases, caches)?
    • Are there any legacy systems that need integration?
  2. Performance Requirements:
    • What's your expected user load and growth projections?
    • Are there specific throughput or latency requirements?
    • Do you have seasonal traffic patterns or predictable spikes?
    • What's your RTO/RPO for disaster recovery?
  3. Compliance and Security:
    • Are there industry regulations you must comply with (HIPAA, PCI, GDPR)?
    • What's your data classification policy?
    • Do you have specific security requirements or threat models?
    • Any geographical data residency requirements?
  4. Operational Requirements:
    • What's your deployment frequency and release strategy?
    • Do you have existing DevOps practices or CI/CD pipelines?
    • What's your monitoring and alerting strategy?
    • Who will maintain the infrastructure long-term?
  5. Budget and Timeline:
    • What's your infrastructure budget (both initial and ongoing)?
    • When do you need this infrastructure operational?
    • Are there any phased implementation requirements?
    • What's your tolerance for cloud vendor lock-in?
  6. Existing Environment:
    • Do you have existing cloud accounts or infrastructure?
    • Are there any technology constraints or preferences?
    • Do you have existing infrastructure as code templates?
    • What's your current deployment and operations workflow?
This information helps me design infrastructure that aligns with both technical requirements and business objectives while avoiding costly redesigns later. 


Q: If it is a containerized application, how would you implement it from scratch? What measures would you take for security, high availability, scalability, performance efficiency, and cost optimization?

A: For implementing a containerized application from scratch, I'd focus on building a comprehensive solution addressing all key areas:

Security Measures

  1. Container Security:
    • Use minimal base images and distroless containers
    • Implement vulnerability scanning in CI/CD (Trivy, Clair)
    • Enforce non-root users and read-only filesystems
    • Apply strict resource limits and seccomp profiles
  2. Infrastructure Security:
    • Implement defense-in-depth with multiple security layers
    • Use private container registries with image signing
    • Apply network policies for east-west traffic control
    • Implement secrets management (HashiCorp Vault, AWS Secrets Manager)
    • Set up WAF for protecting ingress points
High Availability
  1. Multi-AZ Deployment:
    • Deploy Kubernetes across at least 3 availability zones
    • Use pod anti-affinity rules to distribute workloads
    • Implement proper liveness/readiness probes (see the Deployment sketch after this section)
  2. Stateful Components:
    • Use managed database services with multi-AZ replication
    • Implement proper PVC and storage class configuration
    • Set up regular backup and recovery processes
  3. Reliability Engineering:
    • Design for graceful degradation with circuit breakers
    • Implement retries with exponential backoff
    • Set up proper failover for critical components
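
A minimal Deployment sketch for the availability points above (name, image, port, and probe paths are illustrative assumptions) combines zone spreading with health probes:

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: web-api                   # hypothetical name
  spec:
    replicas: 6
    selector:
      matchLabels:
        app: web-api
    template:
      metadata:
        labels:
          app: web-api
      spec:
        topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone   # spread replicas across AZs
            whenUnsatisfiable: ScheduleAnyway
            labelSelector:
              matchLabels:
                app: web-api
        containers:
          - name: api
            image: registry.example.com/web-api:1.0.0  # hypothetical image
            ports:
              - containerPort: 8080
            readinessProbe:            # gate traffic until the app is ready
              httpGet:
                path: /healthz
                port: 8080
              initialDelaySeconds: 5
              periodSeconds: 10
            livenessProbe:             # restart the container if it wedges
              httpGet:
                path: /livez
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 20
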
Scalability
  1. Horizontal Scaling:
    • Configure Horizontal Pod Autoscaler based on CPU/memory/custom metrics
    • Implement Cluster Autoscaler for node management
    • Design stateless services where possible
  2. Traffic Management:
    • Deploy ingress controller with proper traffic shaping
    • Implement service mesh for advanced traffic control
    • Set up proper connection pooling and backpressure mechanisms
Performance Efficiency
  1. Resource Optimization:
    • Right-size containers based on actual requirements
    • Implement CPU/memory limits and requests
    • Use node affinity for specialized workloads (GPU, high memory)
  2. Application Performance:
    • Configure proper caching layers (Redis, CDN)
    • Optimize container startup with init containers
    • Implement efficient health check mechanisms
Cost Optimization
  1. Resource Management:
    • Use Spot instances for non-critical workloads
    • Implement node termination handlers
    • Schedule batch jobs during off-peak hours
  2. Infrastructure Efficiency:
    • Implement auto-scaling based on actual demand
    • Use Kubernetes cost allocation tools (Kubecost)
    • Implement resource quotas per namespace (a sketch follows at the end of this answer)
  3. Operational Efficiency:
    • Automate everything with infrastructure as code (Terraform)
    • Implement GitOps workflow with ArgoCD
    • Set up proper monitoring and alerting (Prometheus, Grafana)
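
As a sketch of the per-namespace quotas mentioned under cost optimization (namespace and limits are illustrative assumptions):

  apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: team-a-quota
    namespace: team-a               # hypothetical team namespace
  spec:
    hard:
      requests.cpu: "20"            # total CPU the namespace may request
      requests.memory: 40Gi
      limits.cpu: "40"
      limits.memory: 80Gi
      pods: "100"                   # cap on pod count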

Bhavani Prasad
Cloud & DevOps Engineer