Senior DevOps Engineer Interview Q&A


Introduction and Background

Q: Tell me about yourself. Which projects have you worked on, and what were your roles and responsibilities?

A: I'm a senior DevOps engineer with 7+ years of experience across AWS and GCP environments. I've specialized in building and optimizing cloud infrastructure, CI/CD pipelines, and observability solutions. Most recently, I led the cloud migration for FinTech Corp, moving their legacy application to a microservices architecture on AWS EKS. I designed the Terraform modules for network infrastructure, implemented GitOps with ArgoCD, and built observability using Prometheus and Grafana. Prior to that, I worked at TechSolutions Inc. where I managed multi-region Kubernetes clusters on GCP, implemented infrastructure as code using Terraform, and created automated deployment pipelines with GitHub Actions. My core responsibilities have included:

  • Architecting and managing cloud infrastructure on AWS/GCP
  • Implementing IaC with Terraform for reproducible environments
  • Building robust CI/CD pipelines with GitHub Actions and ArgoCD
  • Setting up monitoring and alerting with Prometheus, Grafana, and Datadog
  • Automating security scanning with SonarQube and OWASP tools
  • Troubleshooting and resolving production incidents

Technical Questions

Q: How do you migrate a legacy monolith application to microservices? What process do you follow?

A: For migrating a legacy monolith to microservices, I follow a structured approach:

  1. Assessment and Planning:
    • Analyze the monolith's codebase and dependencies
    • Identify bounded contexts that can become independent services
    • Create a migration roadmap with clear milestones
  2. Implement Strangler Pattern:
    • Place API gateway in front of the monolith
    • Gradually redirect traffic to new microservices (see the routing sketch after this list)
    • Keep the monolith running until fully replaced
  3. Database Decoupling:
    • Identify data ownership boundaries
    • Implement CDC (Change Data Capture) for transition period
    • Create service-specific databases or schemas
  4. Infrastructure Preparation:
    • Set up Kubernetes clusters with proper resource allocation
    • Implement service mesh for communication (Istio/Linkerd)
    • Configure CI/CD pipelines for each microservice
  5. Incremental Migration:
    • Start with least critical, least connected components
    • Implement feature flags for risk management
    • Run parallel testing between old and new implementations
  6. Observability Integration:
    • Implement distributed tracing across services
    • Set up centralized logging and monitoring
    • Create service-specific dashboards and alerts
  7. Post-Migration Optimization:
    • Remove deprecated monolith components
    • Optimize resource allocation and scaling policies
    • Document architecture and operational procedures
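
To make step 2 concrete, here is a minimal sketch of strangler-style traffic splitting with an Istio VirtualService; the host names, gateway, and the 90/10 split are illustrative assumptions, not details from the original project:

  apiVersion: networking.istio.io/v1beta1
  kind: VirtualService
  metadata:
    name: orders-strangler          # hypothetical name
  spec:
    hosts:
      - app.example.com             # hypothetical public host
    gateways:
      - public-gateway              # hypothetical ingress gateway
    http:
      - match:
          - uri:
              prefix: /api/orders   # the slice being strangled out
        route:
          - destination:
              host: orders-service      # new microservice
            weight: 10
          - destination:
              host: legacy-monolith     # existing monolith
            weight: 90
      - route:                          # everything else stays on the monolith
          - destination:
              host: legacy-monolith

As confidence grows, the weights shift toward the new service until the monolith route can be removed entirely.
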
Q: If a production server goes down, what actions would you take?

A: When a production server goes down, my immediate actions would be:

  1. Acknowledge the incident:
    • Check monitoring alerts to understand the scope (a sample alert rule follows this list)
    • Notify stakeholders via established communication channels
  2. Initial assessment:
    • Verify if it's an isolated server issue or broader system failure
    • Check infrastructure dashboards (AWS/GCP console, Grafana)
    • Review recent deployments or changes
  3. Implement immediate mitigation:
    • If it's an AWS/GCP instance: check for termination/stop events and failed health checks
    • For Kubernetes pods: Check logs, events, and resource constraints
    • Attempt restart or failover to replicas if available
  4. Deeper diagnosis:
    • Review logs in ELK, CloudWatch, or Stackdriver
    • Check for resource exhaustion (CPU/memory/disk)
    • Verify network connectivity and security group settings
  5. Resolution:
    • Apply fix based on root cause (scaling, configuration update, rollback)
    • Verify service restoration via monitoring and health checks
    • Update load balancers or DNS if needed
  6. Post-incident actions:
    • Document incident details
    • Schedule a post-mortem meeting
    • Implement preventative measures (improved monitoring, automated recovery)
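
For the detection side of step 1, a minimal Prometheus alerting rule like the following (group name, threshold, and wording are illustrative assumptions) pages on-call when a target stops responding to scrapes:

  groups:
    - name: availability
      rules:
        - alert: InstanceDown
          expr: up == 0            # target failed its last scrape
          for: 2m                  # avoid paging on a single blip
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.instance }} has been unreachable for 2 minutes"
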
Q: In one project, a production server was attacked by an outsider. How would you handle the situation, and what primary actions would you take?

A: When a production server is attacked, here's how I'd handle it:

  1. Immediate Containment:
    • Isolate the compromised server by restricting network access (a quarantine policy sketch follows this list)
    • Update security groups/firewall rules to block malicious IPs
    • If necessary, temporarily take the server offline
  2. Evidence Collection:
    • Capture forensic snapshots of affected instances
    • Preserve logs before any cleanup (CloudTrail, VPC Flow Logs, application logs)
    • Document the timeline of events
  3. Assess Impact and Scope:
    • Determine what systems were accessed and potential data exposure
    • Check for lateral movement to other servers
    • Review authentication logs for unauthorized access
  4. Incident Response:
    • Follow company incident response plan
    • Notify security team and management
    • Engage AWS/GCP support if needed
  5. Recovery:
    • Deploy clean server instances from known good AMIs/images
    • Restore from pre-attack backups if necessary
    • Apply all security patches and updates
  6. Post-Attack Security Hardening:
    • Implement additional WAF rules
    • Review IAM permissions and access controls
    • Set up enhanced monitoring and alerting
    • Conduct vulnerability scans with tools like OWASP ZAP
  7. Documentation and Prevention:
    • Document attack vectors and mitigation steps
    • Update runbooks and security protocols
    • Schedule security training for team members
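
For step 1 in a Kubernetes context, one way to quarantine a compromised pod is a deny-all NetworkPolicy keyed on a label; this is a minimal sketch, and the namespace and label name are assumptions:

  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: quarantine
    namespace: prod               # hypothetical namespace
  spec:
    podSelector:
      matchLabels:
        quarantine: "true"        # applied to the suspect pod
    policyTypes:                  # listing both types with no rules below
      - Ingress                   # denies all inbound traffic
      - Egress                    # denies all outbound traffic

Labeling the suspect pod (kubectl label pod <name> quarantine=true) cuts it off from the network while leaving it running for forensic inspection.
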
Q: One of your backend applications is hit by high traffic spikes and its services stop responding. How do you troubleshoot?

A: When troubleshooting a backend application experiencing high traffic spikes and non-responsiveness, I'd follow this approach:

  1. Immediate Assessment:
    • Check monitoring dashboards (Grafana/Datadog) for resource utilization
    • Verify traffic patterns in load balancer metrics (ALB/CloudFront)
    • Check error rates and response times in application logs
  2. Resource Analysis:
    • Check CPU, memory, and disk I/O on affected services
    • Verify database connection pool status and query performance
    • Look for network bottlenecks or throttling
  3. Scale Resources:
    • Trigger manual horizontal scaling if auto-scaling hasn't responded
    • Increase node count in Kubernetes clusters if pod scheduling is delayed
    • Adjust database read replicas or instance sizes if DB is the bottleneck
  4. Implement Traffic Management:
    • Enable rate limiting at API Gateway/Nginx level
    • Implement circuit breakers for failing downstream dependencies
    • Consider temporary caching strategies for read-heavy operations
  5. Short-term Mitigations:
    • Activate degraded mode for non-critical features
    • Redirect traffic to backup systems if available
    • Temporarily increase timeouts for dependent services
  6. Root Cause Investigation:
    • Analyze distributed traces to identify slowest components
    • Review recent deployments that might have affected performance
    • Check for inefficient queries or N+1 problems
  7. Post-Recovery Actions:
    • Implement proper auto-scaling policies (see the HPA sketch after this list)
    • Optimize database queries and indexes
    • Add load testing to CI/CD pipeline
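
As a concrete example for step 7, a Horizontal Pod Autoscaler like the following (the deployment name and replica bounds are illustrative assumptions) scales a backend on CPU utilization:

  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    name: backend-api             # hypothetical name
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: backend-api           # the deployment to scale
    minReplicas: 3                # keep headroom for sudden spikes
    maxReplicas: 30
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70   # scale out above 70% average CPU
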
Q: How do you manage sensitive information in CI/CD pipelines? 

A: For managing sensitive information in CI/CD pipelines, I use several secure approaches:

  1. Secrets Management:
    • HashiCorp Vault for dynamic secrets with short TTLs
    • AWS Secrets Manager/GCP Secret Manager for cloud-specific credentials
    • Integration with CI/CD platforms via appropriate plugins
  2. Environment Variables:
    • Store sensitive values as protected variables in CI/CD platforms
    • Mask secrets in logs and console output
    • Use environment-specific variable groups
  3. Infrastructure as Code:
    • Terraform remote state encryption
    • SOPS or git-crypt for encrypting values in repositories
    • Use of variable files that aren't committed to source control
  4. Access Control:
    • Implement least privilege for pipeline service accounts
    • Rotate credentials regularly through automation
    • Restrict access to production deployment pipelines
  5. Runtime Security:
    • Scan code and containers for leaked secrets using tools like TruffleHog
    • Implement approval gates for sensitive environment deployments
    • Use temporary credentials that expire after pipeline completion (see the workflow sketch after this list)
  6. Audit and Compliance:
    • Log all secret access attempts
    • Regular review of who has access to sensitive information
    • Integrate secret scanning into the pipeline itself
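
One way to realize point 5 in GitHub Actions is OIDC federation with AWS, so the pipeline holds no long-lived keys at all; this is a minimal sketch, and the role ARN, region, and deploy script are assumptions:

  # .github/workflows/deploy.yml
  name: deploy
  on:
    push:
      branches: [main]
  permissions:
    id-token: write               # required to request the OIDC token
    contents: read
  jobs:
    deploy:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v4
        - uses: aws-actions/configure-aws-credentials@v4
          with:
            role-to-assume: arn:aws:iam::123456789012:role/deploy-role  # hypothetical role
            aws-region: us-east-1
        - run: ./scripts/deploy.sh   # hypothetical script; runs with short-lived credentials
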
Q: How do you ensure running applications in Kubernetes are secure? 

A: To ensure security for applications running in Kubernetes, I implement multiple layers of protection:

  1. Image Security:
    • Scan container images using Trivy or Clair
    • Use minimal base images (distroless/Alpine)
    • Enforce signed images with admission controllers
  2. Pod Security:
    • Implement Pod Security Standards (PSS)
    • Run containers as non-root users (see the securityContext sketch after this list)
    • Use read-only filesystems where possible
    • Set resource limits to prevent DoS attacks
  3. Network Security:
    • Implement network policies for pod-to-pod communication
    • Use service meshes for mTLS (Istio/Linkerd)
    • Limit egress traffic to required endpoints
    • Configure proper ingress security with WAF
  4. Access Control:
    • Implement RBAC with least privilege
    • Use namespaces for separation and permissions boundaries
    • Regular review and audit of service accounts
    • Implement OpenID Connect for user authentication
  5. Runtime Security:
    • Deploy Falco for runtime threat detection
    • Implement OPA Gatekeeper for policy enforcement
    • Use seccomp and AppArmor profiles
  6. Secret Management:
    • Use Kubernetes Secrets with proper encryption
    • Consider external secret stores (Vault) with CSI drivers
    • Rotate secrets regularly
  7. Monitoring and Compliance:
    • Implement audit logging
    • Use Prometheus alerts for security-related events
    • Regular compliance scans with kube-bench
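
Points 2 and parts of point 5 come together in the pod spec itself; here is a minimal Deployment sketch (name, image, and resource values are illustrative assumptions) showing a hardened securityContext:

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: secure-app                  # hypothetical name
  spec:
    replicas: 2
    selector:
      matchLabels:
        app: secure-app
    template:
      metadata:
        labels:
          app: secure-app
      spec:
        securityContext:
          runAsNonRoot: true          # refuse to start root containers
          runAsUser: 10001
          seccompProfile:
            type: RuntimeDefault      # default seccomp syscall filtering
        containers:
          - name: app
            image: registry.example.com/app:1.4.2   # hypothetical image
            securityContext:
              allowPrivilegeEscalation: false
              readOnlyRootFilesystem: true
              capabilities:
                drop: ["ALL"]         # drop all Linux capabilities
            resources:
              requests:
                cpu: 100m
                memory: 128Mi
              limits:                 # bound resources to contain DoS blast radius
                cpu: 500m
                memory: 256Mi
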
Q: What were the most difficult challenges you faced in your last and previous projects? How did you resolve them?

A: In my most recent project, the most difficult challenge was migrating a critical payment processing system with zero-downtime requirements while handling 5,000+ TPS at peak. The system had complex stateful components and tight coupling between services. I resolved this by implementing a dual-write pattern in which the new microservices architecture ran in parallel with the legacy system, backed by a synchronization layer that kept data consistent between the two. I created a custom traffic-shifting mechanism using weighted routing in AWS ALB that let us transfer traffic gradually in 5% increments while monitoring error rates and performance, and we set up enhanced observability with distributed tracing across both systems to quickly identify and resolve integration issues.

In my previous role, the biggest challenge was securing a multi-tenant Kubernetes platform that hosted applications with strict compliance requirements. I addressed this with a combination of network policies, admission controllers, and a custom operator that enforced tenant isolation. We used OPA Gatekeeper to create policy-as-code that automatically validated all deployments against security benchmarks. The trickiest part was balancing security with developer productivity, which we solved by creating self-service security tooling and pre-approved templates.

Q: You said you worked on AWS EKS, right? How many services are running in Kubernetes? 

A: In our AWS EKS environment, we were running approximately 45-50 microservices in Kubernetes. This included:

  • 12 core business logic services handling the main application workflow
  • 8 data processing services for ETL and analytics pipelines
  • 5 authentication and authorization services
  • 7 integration services connecting to external APIs and partners
  • 4 notification services (email, SMS, push)
  • Several utility services for logging, monitoring, and administration
  • Infrastructure components like service mesh proxies, metrics collectors, and custom operators
The cluster was configured with node groups optimized for different workload types: compute-optimized for processing services, memory-optimized for caching and database services, and general-purpose for most API services. We used pod affinity/anti-affinity rules to ensure proper distribution across nodes and zones (a minimal anti-affinity fragment follows).
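
As an illustration of that last point, a required anti-affinity rule like this fragment of a pod template spec (the "payments-api" label is a hypothetical example) keeps replicas out of the same availability zone:

  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: payments-api                      # replicas of this app...
          topologyKey: topology.kubernetes.io/zone   # ...must land in different zones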

Q: What is observability and how do you implement it?

A: Observability is the ability to understand a system's internal state from its external outputs. It goes beyond monitoring by providing context and insight into why a system behaves the way it does. I implement observability through the three key pillars, plus a consistent rollout process:

  1. Metrics:
    • Deploy Prometheus for collecting time-series data
    • Set up Grafana dashboards for visualization
    • Implement custom metrics for business KPIs
    • Configure proper alerting thresholds
  2. Logs:
    • Centralize logs with ELK stack or CloudWatch
    • Structure logs in consistent JSON format
    • Include correlation IDs for request tracing
    • Implement log rotation and retention policies
  3. Distributed Tracing:
    • Implement OpenTelemetry instrumentation (see the collector sketch after this list)
    • Use Jaeger or X-Ray for trace visualization
    • Capture timing data across service boundaries
    • Tag traces with critical business context
  4. Implementation Steps:
    • Add observability as infrastructure code
    • Instrument applications consistently
    • Create standardized dashboards for services
    • Train teams on using observability tools
    • Set up anomaly detection for proactive alerts
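
Tying the pillars together, an OpenTelemetry Collector config along these lines (endpoints and names are illustrative assumptions) can receive OTLP data and fan it out to Prometheus and Jaeger:

  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
  processors:
    batch: {}                      # batch telemetry before export
  exporters:
    prometheus:
      endpoint: 0.0.0.0:8889       # scraped by Prometheus
    otlp/jaeger:
      endpoint: jaeger-collector:4317   # Jaeger's OTLP ingest; hypothetical address
      tls:
        insecure: true             # assumes in-cluster plaintext for this sketch
  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [batch]
        exporters: [otlp/jaeger]
      metrics:
        receivers: [otlp]
        processors: [batch]
        exporters: [prometheus]
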
Q: How can a microservice in an EKS cluster in one AWS account communicate with a microservice in an EKS cluster in another AWS account?

A: For cross-account communication between microservices in different AWS account EKS clusters, I implement these approaches:

  1. Network Connectivity:
    • Set up Transit Gateway or VPC Peering between accounts
    • Configure proper route tables and security groups
    • Ensure DNS resolution works across VPCs with Route 53 Resolver
  2. Service Discovery:
    • Implement AWS Cloud Map for cross-account service discovery
    • Use the external-dns controller to manage Route 53 entries (see the Service sketch after this answer)
    • Set up proper namespace isolation for multi-account services
  3. Authentication & Authorization:
    • Configure IAM roles for cross-account access
    • Use AWS STS for temporary credentials
    • Implement mTLS with service mesh for secure service-to-service communication
  4. API Gateway Pattern:
    • Expose specific services through API Gateway
    • Implement proper authorization (IAM, JWT)
    • Use private API endpoints accessible through VPC endpoints
  5. Event-Driven Pattern:
    • Use SNS/SQS for asynchronous communication
    • Configure cross-account topic policies
    • Implement event schemas for consistency
The most secure and scalable approach is typically a combination of proper network connectivity with a service mesh for traffic management and security.
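
As a sketch of the service-discovery piece, exposing a service through an internal load balancer with an external-dns-managed Route 53 record could look like this (service name, hostname, and ports are assumptions):

  apiVersion: v1
  kind: Service
  metadata:
    name: orders-api                # hypothetical service
    annotations:
      external-dns.alpha.kubernetes.io/hostname: orders.internal.example.com  # record managed by external-dns
      service.beta.kubernetes.io/aws-load-balancer-scheme: internal           # AWS Load Balancer Controller: keep the LB private
  spec:
    type: LoadBalancer
    selector:
      app: orders-api
    ports:
      - port: 443
        targetPort: 8443

The peer cluster's workloads then resolve orders.internal.example.com through the shared Route 53 Resolver and reach the service over the Transit Gateway or peering link.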

Q: A client approaches you to design and implement infrastructure for their application. What questions would you ask, and what information would you gather?

A: When a client approaches me to design and implement infrastructure for their application, I'd ask these key questions to gather critical information:

  1. Application Architecture:
    • Is this a monolithic or microservices application?
    • What programming languages and frameworks are used?
    • What are the stateful components (databases, caches)?
    • Are there any legacy systems that need integration?
  2. Performance Requirements:
    • What's your expected user load and growth projections?
    • Are there specific throughput or latency requirements?
    • Do you have seasonal traffic patterns or predictable spikes?
    • What's your RTO/RPO for disaster recovery?
  3. Compliance and Security:
    • Are there industry regulations you must comply with (HIPAA, PCI, GDPR)?
    • What's your data classification policy?
    • Do you have specific security requirements or threat models?
    • Any geographical data residency requirements?
  4. Operational Requirements:
    • What's your deployment frequency and release strategy?
    • Do you have existing DevOps practices or CI/CD pipelines?
    • What's your monitoring and alerting strategy?
    • Who will maintain the infrastructure long-term?
  5. Budget and Timeline:
    • What's your infrastructure budget (both initial and ongoing)?
    • When do you need this infrastructure operational?
    • Are there any phased implementation requirements?
    • What's your tolerance for cloud vendor lock-in?
  6. Existing Environment:
    • Do you have existing cloud accounts or infrastructure?
    • Are there any technology constraints or preferences?
    • Do you have existing infrastructure as code templates?
    • What's your current deployment and operations workflow?
This information helps me design infrastructure that aligns with both technical requirements and business objectives while avoiding costly redesigns later. 


Q: If it is a containerized application, how would you implement it from scratch? What measures would you take for security, high availability, scalability, performance efficiency, and cost optimization?

A: For implementing a containerized application from scratch, I'd focus on building a comprehensive solution addressing all key areas:

Security Measures

  1. Container Security:
    • Use minimal base images and distroless containers
    • Implement vulnerability scanning in CI/CD (Trivy, Clair)
    • Enforce non-root users and read-only filesystems
    • Apply strict resource limits and seccomp profiles
  2. Infrastructure Security:
    • Implement defense-in-depth with multiple security layers
    • Use private container registries with image signing
    • Apply network policies for east-west traffic control
    • Implement secrets management (HashiCorp Vault, AWS Secrets Manager)
    • Set up WAF for protecting ingress points
High Availability
  1. Multi-AZ Deployment:
    • Deploy Kubernetes across at least 3 availability zones
    • Use pod anti-affinity rules to distribute workloads
    • Implement proper liveness/readiness probes (see the Deployment sketch after this section)
  2. Stateful Components:
    • Use managed database services with multi-AZ replication
    • Implement proper PVC and storage class configuration
    • Set up regular backup and recovery processes
  3. Reliability Engineering:
    • Design for graceful degradation with circuit breakers
    • Implement retries with exponential backoff
    • Set up proper failover for critical components
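
A minimal Deployment sketch for the availability points above (name, image, port, and probe paths are illustrative assumptions) combines zone spreading with health probes:

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: web-api                   # hypothetical name
  spec:
    replicas: 6
    selector:
      matchLabels:
        app: web-api
    template:
      metadata:
        labels:
          app: web-api
      spec:
        topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone   # spread replicas across AZs
            whenUnsatisfiable: ScheduleAnyway
            labelSelector:
              matchLabels:
                app: web-api
        containers:
          - name: api
            image: registry.example.com/web-api:1.0.0  # hypothetical image
            ports:
              - containerPort: 8080
            readinessProbe:            # gate traffic until the app is ready
              httpGet:
                path: /healthz
                port: 8080
              initialDelaySeconds: 5
              periodSeconds: 10
            livenessProbe:             # restart the container if it wedges
              httpGet:
                path: /livez
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 20
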
Scalability
  1. Horizontal Scaling:
    • Configure Horizontal Pod Autoscaler based on CPU/memory/custom metrics
    • Implement Cluster Autoscaler for node management
    • Design stateless services where possible
  2. Traffic Management:
    • Deploy ingress controller with proper traffic shaping
    • Implement service mesh for advanced traffic control
    • Set up proper connection pooling and backpressure mechanisms
Performance Efficiency
  1. Resource Optimization:
    • Right-size containers based on actual requirements
    • Implement CPU/memory limits and requests
    • Use node affinity for specialized workloads (GPU, high memory)
  2. Application Performance:
    • Configure proper caching layers (Redis, CDN)
    • Optimize container startup with init containers
    • Implement efficient health check mechanisms
Cost Optimization
  1. Resource Management:
    • Use Spot instances for non-critical workloads
    • Implement node termination handlers
    • Schedule batch jobs during off-peak hours
  2. Infrastructure Efficiency:
    • Implement auto-scaling based on actual demand
    • Use Kubernetes cost allocation tools (Kubecost)
    • Implement resource quotas per namespace (a sketch follows at the end of this answer)
  3. Operational Efficiency:
    • Automate everything with infrastructure as code (Terraform)
    • Implement GitOps workflow with ArgoCD
    • Set up proper monitoring and alerting (Prometheus, Grafana)
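
As a sketch of the per-namespace quotas mentioned under cost optimization (namespace and limits are illustrative assumptions):

  apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: team-a-quota
    namespace: team-a               # hypothetical team namespace
  spec:
    hard:
      requests.cpu: "20"            # total CPU the namespace may request
      requests.memory: 40Gi
      limits.cpu: "40"
      limits.memory: 80Gi
      pods: "100"                   # cap on pod count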

Bhavani Prasad
Cloud & DevOps Engineer