Q: Tell me about yourself?

 A: Hi there! I'm a Senior Site Reliability Engineer at TataCliq, where I've been working for the past three years. At TataCliq, I focus on maintaining and improving the reliability of our e-commerce platform, which serves millions of customers across India. My day-to-day responsibilities include managing the Kubernetes clusters that run our microservices architecture, maintaining our CI/CD pipelines with Jenkins and GitLab CI, and ensuring our infrastructure scales properly during high-traffic sales events like festive-season promotions.

One of my major contributions was leading the migration from a traditional VM-based deployment to containerization with Docker and Kubernetes orchestration, which improved our deployment frequency and stability. I also implemented a comprehensive monitoring solution using Prometheus and Grafana that gives us better visibility into system performance and customer-experience metrics. I work closely with the development teams to ensure their code is deployed reliably, and with the infrastructure team to maintain our AWS cloud environment, where most of our services are hosted. We use Terraform for infrastructure as code to keep our environments consistent.

Currently, I'm focused on improving our automated scaling to handle flash sales more efficiently and on reducing our mean time to recovery (MTTR) when incidents occur. I'm also exploring a service mesh (Istio) to improve network resilience and security between our microservices. 

  Kubernetes Cluster Management 

  Q: What is your production cluster size, and how are the clusters created? Are they created manually or using IaC tools? 

 A: At TataCliq, we operate a substantial production Kubernetes infrastructure to support our e-commerce platform. Here's an overview of our cluster setup:

Our production cluster consists of approximately 150 nodes distributed across three AWS availability zones for high availability. Each cluster is sized to handle our normal traffic with enough headroom to scale during peak shopping events like Diwali sales and other promotions. We maintain separate clusters for production, staging, and development environments.

For infrastructure provisioning, we've fully embraced Infrastructure as Code (IaC) principles using Terraform. This approach has been critical for maintaining consistency and enabling disaster recovery capabilities. Here's our approach:

  1. We use Terraform to provision all AWS resources including VPCs, subnets, security groups, IAM roles, and the underlying EC2 instances for our Kubernetes clusters.
  2. For Kubernetes itself, we leverage EKS (Amazon Elastic Kubernetes Service) which we also provision through Terraform using the AWS provider and the Terraform EKS module.
  3. Our node groups are defined as Auto Scaling Groups, which allows us to scale horizontally based on demand metrics from Prometheus.
  4. All infrastructure changes go through a GitLab CI pipeline that includes:
    • Terraform plan review
    • Automated testing
    • Approval gates for production changes
    • Terraform apply with state stored in an S3 backend
This IaC approach has been transformative for us. Prior to implementing this, some clusters were created manually, which led to configuration drift and occasional issues during updates. The move to Terraform has allowed us to treat our infrastructure truly as code - versioned, tested, and repeatable. For worker node provisioning and configuration, we complement Terraform with Ansible to handle post-provisioning tasks and ensure consistent configuration across all nodes. 
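
To make the pipeline concrete, here's a minimal sketch of what such a GitLab CI workflow could look like. The stage layout, Terraform version, and S3 bucket name are illustrative placeholders, not our actual configuration:

    # Sketch of a .gitlab-ci.yml for the Terraform workflow described above
    stages:
      - plan
      - apply

    .terraform-base:
      image: hashicorp/terraform:1.5
      before_script:
        - cd infrastructure/
        - terraform init -backend-config="bucket=example-tf-state"   # state in S3

    plan:
      extends: .terraform-base
      stage: plan
      script:
        - terraform validate
        - terraform plan -out=plan.tfplan
      artifacts:
        paths:
          - infrastructure/plan.tfplan    # plan artifact reviewed before approval

    apply:
      extends: .terraform-base
      stage: apply
      script:
        - terraform apply plan.tfplan
      when: manual                        # approval gate for production changes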

  Q: How do you upgrade Kubernetes Clusters? 

 A: For Kubernetes cluster upgrades at TataCliq, we follow this process:

  1. Planning Phase:
    • Thoroughly review release notes for breaking changes
    • Schedule maintenance window during low-traffic periods
    • Verify compatibility with all critical workloads
  2. Upgrade Process:
    • Use EKS managed upgrades for control plane (via Terraform)
    • Follow blue/green deployment for worker nodes:
      • Create new node groups with updated version
      • Cordon and drain old nodes
      • Test workloads on new nodes
      • Remove old node groups when successful
  3. Rollback Plan:
    • Maintain previous Terraform state backup
    • Keep old node groups available for 24 hours post-upgrade
    • Document rollback procedures for on-call team
  4. Validation:
    • Run automated smoke tests after upgrade
    • Verify monitoring systems and alerts
    • Monitor application performance metrics
This controlled approach minimizes downtime and provides safety mechanisms if issues arise. 
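
One safeguard that pairs with the drain step: kubectl drain respects PodDisruptionBudgets, so defining one per critical service guarantees minimum availability while nodes are replaced. A minimal sketch, with a hypothetical service name and threshold:

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: checkout-pdb              # hypothetical critical service
    spec:
      minAvailable: 80%               # drain cannot evict pods below this level
      selector:
        matchLabels:
          app: checkout-service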

  Q: How do you manage authorization in EKS clusters in AWS? 

 A: For authorization in our EKS clusters at TataCliq, we implement a multi-layered approach:

  1. AWS IAM Integration:
    • Use AWS IAM roles for initial cluster authentication
    • Map IAM roles to Kubernetes RBAC with aws-auth ConfigMap
    • Implement least privilege principle for all service accounts
  2. Kubernetes RBAC:
    • Define granular Roles and ClusterRoles for different teams
    • Use RoleBindings to assign permissions based on team function
    • Create separate namespaces for different applications with specific access controls
  3. Service Accounts:
    • IRSA (IAM Roles for Service Accounts) for pod-level permissions
    • Each microservice has dedicated service account with appropriate permissions
    • Use Terraform to manage service account creation and IAM role attachment
  4. CI/CD Access:
    • Dedicated service accounts for deployment pipelines
    • Time-limited credentials for automated deployments
    • Pipeline-specific permissions scoped to target namespaces
  5. Auditing:
    • Enable EKS audit logs forwarded to CloudWatch
    • Regular permission reviews using IAM Access Analyzer
    • Alert on unexpected privilege escalation attempts
This approach balances security with operational needs while maintaining complete visibility into who can access what within our clusters. 
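
To illustrate the IAM-to-RBAC mapping, here's roughly what an aws-auth ConfigMap entry looks like. The account ID, role names, and groups are made-up examples:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: aws-auth
      namespace: kube-system
    data:
      mapRoles: |
        - rolearn: arn:aws:iam::111122223333:role/platform-team    # example ARN
          username: platform-team:{{SessionName}}
          groups:
            - platform-admins      # bound to a ClusterRole via ClusterRoleBinding
        - rolearn: arn:aws:iam::111122223333:role/ci-deployer
          username: ci-deployer
          groups:
            - app-deployers        # limited by namespace-scoped RoleBindings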

  Deployment Strategies

Q: Which deployment strategy are you using in your projects and why? 

 A: For our deployments at TataCliq, we primarily use the following strategies:

  1. Blue/Green Deployments for critical customer-facing services:
    • Maintain two identical environments (blue = current, green = new)
    • Fully test new version in green environment
    • Switch traffic completely once verified
    • Allows immediate rollback by routing back to blue
    • Eliminates downtime for customers during deployment
  2. Canary Deployments for high-risk changes:
    • Deploy new version to small percentage of users (5-10%)
    • Monitor metrics and error rates closely
    • Gradually increase traffic if successful
    • Reduces blast radius of potential issues
    • Used for major feature releases and significant backend changes
  3. Rolling Updates for internal services and stateless applications:
    • Kubernetes default deployment strategy
    • Replace instances incrementally
    • Balance between resource efficiency and safety
    • Configured with proper readiness/liveness probes
Our choice typically depends on:
  • Service criticality and customer impact
  • Confidence level in the change
  • Resource constraints
  • Stateful vs. stateless considerations
We've standardized these patterns using Helm charts and ArgoCD for declarative deployments, ensuring consistent implementation across teams while maintaining flexibility where needed. This multi-strategy approach helps us maintain our 99.95% service availability targets while still enabling frequent deployments across our platform.
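
To show what the rolling-update configuration looks like in practice, here's a minimal Deployment sketch; the service name, image, and probe endpoints are illustrative:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: catalog-service            # hypothetical internal service
    spec:
      replicas: 6
      selector:
        matchLabels:
          app: catalog
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 25%                # extra pods allowed during rollout
          maxUnavailable: 0            # never drop below desired capacity
      template:
        metadata:
          labels:
            app: catalog
        spec:
          containers:
            - name: catalog
              image: registry.example.com/catalog:1.4.2   # placeholder image
              readinessProbe:          # gates traffic until the pod is ready
                httpGet:
                  path: /healthz
                  port: 8080
              livenessProbe:           # restarts hung containers
                httpGet:
                  path: /livez
                  port: 8080
                periodSeconds: 10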

  Challenge Resolution 

  Q: What is the biggest challenge you faced in your project and how did you resolve it? 

 A: One of the biggest challenges I faced at TataCliq was during our annual Diwali sale event, when our platform experienced unexpected performance degradation despite extensive preparation.

The Challenge: Our traffic surged 8x beyond even our aggressive forecasts when a flash sale went viral on social media. This led to database connection saturation, API timeouts, and eventually shopping cart failures. Our autoscaling couldn't keep pace, and customers encountered checkout errors just as they attempted to complete purchases.

Resolution Approach:

  1. Immediate Triage: Implemented emergency circuit breakers and request throttling at the API gateway level to prioritize checkout flows over browsing functionality.
  2. Real-time Scaling: Manually increased database connection pools and deployed read replicas for product catalog queries while routing write operations to primary instances.
  3. Traffic Management: Used Istio service mesh to implement backpressure mechanisms and graceful degradation, showing "high demand" pages instead of error messages.
  4. Long-term Fix: Post-incident, we:
    • Redesigned our database architecture to implement proper sharding
    • Created a dedicated microservice for cart operations with its own data store
    • Implemented Redis-based inventory reservation system with TTL
    • Developed better load testing that simulated viral traffic patterns
    • Created "panic mode" configurations that could be instantly deployed
This experience transformed our approach to scaling. We now run chaos engineering exercises monthly and model our infrastructure for 15x normal capacity during sale events. What was initially a crisis became a valuable opportunity to significantly improve our platform's resilience. 
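
As an example of the circuit breaking we applied through Istio, a DestinationRule along these lines caps connections and ejects unhealthy endpoints. The host and thresholds are illustrative, not our production values:

    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: checkout-circuit-breaker
    spec:
      host: checkout.prod.svc.cluster.local    # hypothetical service
      trafficPolicy:
        connectionPool:
          tcp:
            maxConnections: 500               # cap concurrent connections
          http:
            http1MaxPendingRequests: 200      # queue limit before shedding load
            maxRequestsPerConnection: 10
        outlierDetection:                     # eject endpoints returning errors
          consecutive5xxErrors: 5
          interval: 30s
          baseEjectionTime: 60s
          maxEjectionPercent: 50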

  AWS Cost Optimization

Q: How do you optimize costs in AWS? 

 A: At TataCliq, we've implemented several cost optimization strategies for our AWS infrastructure:

  1. Resource Right-sizing:
    • Regular EC2 instance type reviews based on CloudWatch metrics
    • Kubernetes node bin-packing optimization using the cluster-autoscaler
    • Spot instances for non-critical and stateless workloads (40% of compute)
  2. Autoscaling Refinement:
    • Time-based scaling for predictable traffic patterns
    • Custom metrics-based scaling using Prometheus data
    • Scale-to-zero for dev/test environments during non-business hours
  3. Storage Optimization:
    • EBS volume right-sizing with automated snapshots
    • S3 lifecycle policies moving older data to Glacier
    • Database storage compression and regular cleanup jobs
  4. Reserved Instances & Savings Plans:
    • 1-year commitments for baseline infrastructure (60% coverage)
    • Compute Savings Plans for flexible workloads
    • Regular RI utilization reviews and exchanges when needed
  5. Cost Allocation:
    • Comprehensive tagging strategy for all resources
    • Team-specific cost dashboards
    • Monthly cost reviews with each product team
  6. Networking Optimization:
    • NAT Gateway consolidation
    • CloudFront for content delivery with proper cache settings
    • VPC endpoint usage to reduce data transfer costs
  7. Tooling:
    • AWS Cost Explorer for trend analysis
    • Custom Grafana dashboards for real-time spending
    • Weekly automated cost anomaly reports
This approach has helped us reduce our AWS bill by approximately 35% while supporting increased traffic and capabilities. 
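
To illustrate the scale-to-zero point for dev/test environments, a CronJob like the following could drive it. The namespace, schedule, and service account are hypothetical, and CronJob schedules are evaluated in the controller's timezone (typically UTC) unless spec.timeZone is set:

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: scale-down-dev
      namespace: dev
    spec:
      schedule: "30 15 * * 1-5"        # example: 21:00 IST on weekdays, expressed in UTC
      jobTemplate:
        spec:
          template:
            spec:
              serviceAccountName: scaler    # assumes RBAC allowing deployment scaling
              restartPolicy: OnFailure
              containers:
                - name: kubectl
                  image: bitnami/kubectl:1.28
                  command:
                    - kubectl
                    - scale
                    - deployment
                    - --all
                    - --replicas=0
                    - -n
                    - dev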

  Production Environment 

  Q: How many pods are running for applications in production and what are those? 

 A: In our TataCliq production environment, we're running approximately 450-500 pods across our application stack. The main application components include:

Customer-Facing Services:

  • Frontend microservices (~30 pods)
  • Product catalog service (25 pods with autoscaling)
  • Search service backed by Elasticsearch (15 pods)
  • User authentication and profile services (20 pods)
  • Shopping cart and checkout services (30 pods, critical path)
  • Payment gateway integrations (15 pods)
  • Order management (25 pods)
  • Recommendation engine (20 pods)
Backend Services:
  • Inventory management (15 pods)
  • Pricing and promotions engine (20 pods)
  • Seller portal services (15 pods)
  • Logistics and fulfillment (20 pods)
  • Notification services (15 pods)
  • Analytics collectors (25 pods)
Platform Services:
  • API gateways (20 pods)
  • Cache layers (Redis clusters, 30 pods)
  • Message brokers (Kafka, 15 pods)
  • Batch processing jobs (20 pods)
  • Data pipelines (30 pods)
We maintain high availability with pod anti-affinity rules to distribute workloads across nodes and zones. Critical services are configured with horizontal pod autoscaling based on CPU/memory metrics and custom metrics like request queue depth for dynamic scaling during traffic spikes. 
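
Here's a rough sketch of the autoscaling configuration for one of the critical services, assuming the Prometheus adapter exposes request_queue_depth as a pods metric. Names and thresholds are illustrative:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: checkout-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: checkout-service         # hypothetical deployment name
      minReplicas: 30
      maxReplicas: 120
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 60   # scale out before saturation
        - type: Pods                   # custom metric via the Prometheus adapter
          pods:
            metric:
              name: request_queue_depth
            target:
              type: AverageValue
              averageValue: "100"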

  Security Implementation 

  Q: How do you manage security within the cluster? 

 A: At TataCliq, we implement a comprehensive security approach for our Kubernetes clusters:

  1. Pod Security:
    • Enforce Pod Security Standards (Baseline profile minimum)
    • Run containers as non-root users with read-only filesystems
    • Implement resource limits for all pods to prevent DoS scenarios
    • Use seccomp and AppArmor profiles for critical workloads
  2. Network Security:
    • Network Policies to enforce zero-trust pod-to-pod communication
    • Service Mesh (Istio) for mTLS between services
    • Egress controls limiting outbound connections
    • Private endpoints for all AWS services
  3. Secret Management:
    • AWS Secrets Manager integration for credentials
    • Sealed Secrets for Kubernetes manifests
    • Regular secret rotation automated via CI/CD
  4. Image Security:
    • Container image scanning in CI/CD pipeline (Trivy)
    • Private ECR registry with immutable tags
    • Image signing and verification
    • Base image standardization and patching
  5. Compliance & Auditing:
    • EKS audit logging to CloudWatch
    • Regular CIS benchmark scans
    • Automated compliance checks using Polaris
    • Kyverno policies to enforce security standards
  6. Access Control:
    • Just-in-time access for production environments
    • Least privilege RBAC configurations
    • Regular access reviews and rotation
  7. Runtime Protection:
    • Falco for runtime threat detection
    • Behavioral analysis and alerting
    • Automated response to suspicious activities
We maintain this security posture through regular assessments, penetration testing, and a dedicated security working group that reviews all changes to our security practices. 
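
As a concrete example of the zero-trust network policies, a default-deny policy plus an explicit allow might look like this. The namespace, labels, and port are made up for illustration:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: default-deny-all
      namespace: payments              # hypothetical namespace
    spec:
      podSelector: {}                  # applies to every pod in the namespace
      policyTypes:
        - Ingress
        - Egress
    ---
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-from-gateway
      namespace: payments
    spec:
      podSelector:
        matchLabels:
          app: payment-service
      policyTypes:
        - Ingress
      ingress:
        - from:
            - namespaceSelector:
                matchLabels:
                  name: gateway        # assumes the gateway namespace carries this label
          ports:
            - protocol: TCP
              port: 8443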

  Q: How do you manage sensitive information in Kubernetes? 

 A: At TataCliq, we've implemented a multi-layered approach to manage sensitive information in our Kubernetes environments:

  1. AWS Secrets Manager Integration:
    • Store credentials, API keys, and tokens in AWS Secrets Manager
    • Use IRSA (IAM Roles for Service Accounts) to provide pods with specific access
    • Implement External Secrets Operator to sync AWS secrets to Kubernetes securely
  2. Sealed Secrets:
    • Encrypt secrets directly in Git repositories using Bitnami Sealed Secrets
    • Create SealedSecret CRDs that can only be decrypted within the cluster
    • Enable GitOps workflow while maintaining security
  3. Environment-Specific Management:
    • Separate secrets by environment (dev/stage/prod)
    • Use Kubernetes namespaces with strict RBAC controls
    • Implement network policies to restrict secret access
  4. Secret Rotation:
    • Automated rotation schedules for database credentials
    • Automated certificate renewal with cert-manager
    • Version control and audit trails for all secret changes
  5. Access Controls:
    • Limit secret access to specific service accounts
    • Implement Just-in-Time access for human operators
    • Audit all secret access events
  6. Secret Injection Methods:
    • Mount as environment variables for simple configs
    • Use volume mounts for larger secrets or certificates
    • Implement HashiCorp Vault sidecar for highly sensitive data
  7. CI/CD Security:
    • Separate deployment pipelines for secrets
    • Approval gates for production secret changes
    • Pipeline-specific service accounts with least privilege
This approach has eliminated hardcoded credentials throughout our stack while maintaining operational efficiency.
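
To illustrate the Sealed Secrets flow: the kubeseal CLI encrypts a Secret against the controller's public key, producing a manifest that's safe to commit to Git. A sketch, with example names and a truncated ciphertext:

    apiVersion: bitnami.com/v1alpha1
    kind: SealedSecret
    metadata:
      name: app-secrets
      namespace: payments              # hypothetical namespace
    spec:
      encryptedData:
        db-password: AgBy3i4OJSWK...   # only the in-cluster controller can decrypt this
      template:
        metadata:
          name: app-secrets            # the plain Secret the controller creates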

  Q: Can you explain how exactly applications consume secrets as they are running in pods within the EKS cluster? 

 A: In our TataCliq environment, we've implemented several methods for pods to consume secrets within the EKS cluster:

  1. External Secrets Operator (ESO) Workflow:
    • We define ExternalSecret CRDs that reference AWS Secrets Manager secrets
    • ESO controller fetches the secrets and creates corresponding Kubernetes Secret objects
    • Application pods then mount these standard Kubernetes Secrets
  2. IAM Roles for Service Accounts (IRSA):
    • Each application has a dedicated Kubernetes ServiceAccount
    • ServiceAccounts are annotated with IAM role ARNs
    • The EKS pod identity webhook injects the role ARN and a projected web-identity token into the pod; the AWS SDK exchanges the token for temporary credentials via STS
    • Applications use these credentials to directly query AWS Secrets Manager API
  3. Secret Mounting Methods:
    • Environment Variables:
        env:
          - name: DB_PASSWORD
            valueFrom:
              secretKeyRef:
                name: app-secrets
                key: db-password
    • Volume Mounts:
        volumes:
          - name: secrets-volume
            secret:
              secretName: app-secrets
        volumeMounts:
          - name: secrets-volume
            mountPath: /etc/secrets
            readOnly: true
  4. Application Integration:
    • Our Java/Python applications use a standardized secrets client library
    • On startup, they load secrets from environment variables or mounted files
    • For dynamic secrets, they query AWS Secrets Manager directly using IRSA credentials
    • Periodic in-memory refresh of secrets for long-running services
  5. Secret Access Pattern:
    • Application bootstrap process fetches secrets before handling requests
    • Secrets stored in application memory (never written to disk)
    • Background goroutine/thread refreshes dynamic secrets on configurable interval
    • Circuit breaker pattern implemented for secret fetch failures
This approach provides defense in depth while maintaining application performance and reliability - even if a secret provider temporarily fails, applications continue running with cached values until the provider recovers. 
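
Putting the ESO and IRSA pieces together, here's roughly what the manifests look like. The secret paths, store name, and role ARN are illustrative:

    apiVersion: external-secrets.io/v1beta1
    kind: ExternalSecret
    metadata:
      name: app-secrets
    spec:
      refreshInterval: 1h              # ESO re-syncs from AWS on this interval
      secretStoreRef:
        name: aws-secrets-manager      # assumes a configured ClusterSecretStore
        kind: ClusterSecretStore
      target:
        name: app-secrets              # the Kubernetes Secret that gets created
      data:
        - secretKey: db-password
          remoteRef:
            key: prod/app/db           # path in AWS Secrets Manager
            property: password
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: app-service-account
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/app-secrets-reader   # example ARN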

  Q: Where did you use HashiCorp Vault and why did you use it? 

 A: At TataCliq, we implemented HashiCorp Vault alongside AWS Secrets Manager for specific use cases that required additional security features and cross-platform capabilities:

  1. Dynamic Database Credentials:
    • Used Vault's database secrets engine to generate short-lived credentials
    • Implemented automatic credential rotation every 12 hours
    • Reduced risk window if credentials were compromised
    • Provided detailed audit trails for database access
  2. PKI and Certificate Management:
    • Vault serves as our internal Certificate Authority
    • Issues short-lived TLS certificates for service-to-service communication
    • Automates certificate rotation without application downtime
    • Integrates with our service mesh for mTLS enforcement
  3. Non-AWS Environment Integration:
    • Provides consistent secrets access across AWS and on-premises environments
    • Enables hybrid deployment models during our cloud migration
    • Unified secrets management API for legacy applications
  4. Encryption as a Service:
    • Uses Vault's transit engine to encrypt/decrypt sensitive data
    • Protects key material behind Vault's barrier, with AWS KMS serving as the seal
    • Allows encryption operations without exposing keys to applications
  5. Governance Requirements:
    • Implements multi-party approval workflows for critical secrets
    • Provides comprehensive audit logging for compliance requirements
    • Supports access control with fine-grained policies
We deployed Vault in HA mode within our EKS cluster using the official Helm chart, backed by a dedicated DynamoDB table for storage. For authentication, we integrated it with our Kubernetes service accounts using the Kubernetes auth method, making it seamless for applications to obtain secrets. 
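
As one concrete consumption pattern (assuming the Vault Agent injector, which the official Helm chart can enable), pod annotations like these pull dynamic database credentials into the container's filesystem. The role, secret path, and labels are examples:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: orders-service             # hypothetical service
    spec:
      selector:
        matchLabels:
          app: orders
      template:
        metadata:
          labels:
            app: orders
          annotations:
            vault.hashicorp.com/agent-inject: "true"
            vault.hashicorp.com/role: "orders"    # Vault Kubernetes auth role
            vault.hashicorp.com/agent-inject-secret-db-creds: "database/creds/orders"
        spec:
          serviceAccountName: orders-sa           # bound to the Vault role
          containers:
            - name: orders
              image: registry.example.com/orders:2.1.0   # placeholder image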

  Q: Did you configure the Vault server? If yes, how did you set it up? (Mentioning it's running in AWS security account) 

 A: Yes, I configured our HashiCorp Vault deployment at TataCliq. Since our security infrastructure runs in a dedicated AWS security account, we set up Vault with the following configuration:

  1. Deployment Architecture:
    • Deployed in HA mode with 3 Vault server pods in our security EKS cluster
    • Used AWS KMS for auto-unseal (avoiding manual intervention during restarts)
    • DynamoDB backend for storage with point-in-time recovery enabled
    • Dedicated VPC with private subnets only
  2. Network Configuration:
    • Internal NLB to expose Vault service within AWS network
    • VPC peering connections to application accounts with restrictive security groups
    • AWS PrivateLink endpoints for secure cross-account access
    • No direct internet access to Vault servers
  3. Authentication Methods:
    • Kubernetes auth for in-cluster services
    • AWS IAM auth for cross-account access
    • LDAP integration for human operator access
    • JWT auth for CI/CD pipelines
  4. High Availability Setup:
    • Standby nodes discover the active node through the shared DynamoDB backend
    • Leader election implemented with DynamoDB locks (ha_enabled on the storage backend)
    • Liveness and readiness probes to ensure proper pod health monitoring
    • Anti-affinity rules to distribute Vault pods across availability zones
  5. Initialization and Unsealing:
    • Implemented Shamir's Secret Sharing for the recovery keys (5 key shares, 3 required)
    • Distributed key shares to separate security administrators
    • AWS KMS auto-unseal for regular operations
  6. Monitoring and Backup:
    • Prometheus metrics exported for operational monitoring
    • Regular storage backend snapshots
    • Audit logs shipped to centralized logging system
    • Daily verification of backup restoration procedures
This architecture provides isolation of the security infrastructure while enabling secure cross-account access from our application environments. 
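
For reference, the core of this setup can be expressed in the official Helm chart's values file. A minimal sketch; the region, table name, and KMS key alias are placeholders:

    server:
      ha:
        enabled: true
        replicas: 3
        config: |
          ui = true
          listener "tcp" {
            address     = "[::]:8200"
            tls_disable = false
          }
          storage "dynamodb" {             # HA-capable storage backend
            ha_enabled = "true"
            region     = "ap-south-1"
            table      = "vault-storage"   # example table name
          }
          seal "awskms" {                  # auto-unseal via KMS
            region     = "ap-south-1"
            kms_key_id = "alias/vault-unseal"   # example key alias
          }
      affinity: |                          # spread Vault pods across zones
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/name: vault
              topologyKey: topology.kubernetes.io/zone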


  AWS Multi-Account Management 

  Q: You said you manage multiple AWS accounts. How do you manage multiple AWS accounts? 

 A: At TataCliq, we manage multiple AWS accounts using a structured approach for security, cost control, and operational efficiency:

  1. Account Structure:
    • Organization hierarchy with AWS Organizations
    • Dedicated accounts for: security, shared services, each environment (dev/stage/prod), and separate accounts for critical applications
    • Management (payer) account for consolidated billing
  2. Identity Management:
    • AWS Single Sign-On (SSO) integration with our corporate identity provider
    • Centralized identity policies with defined permission sets
    • Cross-account IAM roles for service-to-service communication
    • Just-in-time access for elevated permissions
  3. Infrastructure Automation:
    • Multi-account Terraform structure with remote state management
    • Account Factory for standardized account provisioning
    • Shared module library to ensure consistency
    • CI/CD pipelines with account-specific deployment stages
  4. Security Controls:
    • Centralized CloudTrail and Config aggregation in security account
    • Service control policies (SCPs) enforcing account guardrails
    • Security Hub for cross-account compliance monitoring
    • Automated remediation for common security findings
  5. Networking:
    • Transit Gateway for inter-account communication
    • Centralized ingress/egress through security account
    • VPC peering for critical direct connections
    • Private endpoints for AWS service access
  6. Cost Management:
    • Tag enforcement policies
    • Account-level budgets and alerting
    • Reserved Instance sharing across accounts
    • Regular cost anomaly detection
  7. Operational Tooling:
    • Cross-account CloudWatch dashboards
    • Centralized logging with aggregated account data
    • Systems Manager for multi-account administration
    • Custom console for quick account switching
This architecture allows us to maintain security boundaries while enabling efficient operations across our AWS environment. 

  Terraform Challenges 

  Q: What is the most difficult issue you experienced with Terraform while provisioning infrastructure? How did you resolve it? 

 A: The most difficult Terraform issue I faced at TataCliq was managing state drift and concurrent modifications during our rapid scaling phase. During a major sales event preparation, multiple teams were simultaneously modifying infrastructure using Terraform. We had grown quickly and our Terraform workflows hadn't matured. This led to:

  1. State File Conflicts: Multiple engineers were running Terraform against the same environments, causing state lock timeouts and occasional state corruption.
  2. Unmanaged Resource Modifications: Emergency changes were made directly in the AWS console, causing state drift and subsequent Terraform runs to attempt destroying "unknown" resources.
  3. Dependency Management: Complex cross-module dependencies caused ordering problems and partial failures during large updates.
To resolve these issues:
  1. Implemented CI/CD-Only Approach:
    • Restricted all Terraform execution to GitLab CI pipelines
    • Required all changes to go through pull requests
    • Set up state locking with DynamoDB and longer timeouts
  2. Modularized Architecture:
    • Restructured our Terraform code into clear bounded-context modules
    • Created separation between network, security, and application resources
    • Implemented proper variable passing and explicit dependencies
  3. State Management:
    • Wrote custom scripts to detect drift and alert before pipeline execution
    • Created targeted state migration workflows for resources that required manual intervention
    • Implemented state file backups before each apply
  4. Operational Improvements:
    • Created read-only Terraform workspaces for team exploration
    • Developed visualization tools for resource relationships
    • Established change windows for major infrastructure updates
These changes significantly improved our infrastructure stability and reduced failed deployments by nearly 90%, while still supporting our fast-paced development environment. 
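
As an example of the drift detection, a scheduled pipeline job can lean on terraform plan's -detailed-exitcode flag, which exits with 2 when the live infrastructure differs from the code. A sketch with illustrative paths:

    drift-check:
      image: hashicorp/terraform:1.5
      rules:
        - if: '$CI_PIPELINE_SOURCE == "schedule"'   # run from a nightly pipeline schedule
      script:
        - cd infrastructure/
        - terraform init
        - terraform plan -detailed-exitcode -out=drift.tfplan || EXIT=$?
        - |
          if [ "${EXIT:-0}" -eq 2 ]; then
            echo "Drift detected - review drift.tfplan before the next apply"
            exit 1    # fail the job so the on-call channel gets alerted
          fi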

  Incident Response and Security 

  Q: In an example scenario in your project, one pod got compromised. How do you find and resolve this type of issue? 

 A: In our TataCliq environment, when faced with a pod compromise, I followed this incident response process:

  1. Detection:
    • Alert triggered from Falco detecting unusual syscalls and file access patterns
    • Anomalous network traffic identified by our Istio service mesh showing outbound connections to unknown endpoints
    • Logs showed unexpected privilege escalation attempts within the container
  2. Immediate Containment:
    • Isolated the affected pod by applying emergency network policies to block all egress traffic
    • Captured forensic snapshot of the running container for analysis
    • Scaled up healthy replacement pods while preventing scheduler from placing new instances on the affected node
  3. Investigation:
    • Extracted container logs and performed memory dump analysis
    • Identified compromised application dependency with known CVE
    • Discovered crypto mining process running with hijacked container credentials
    • Located and preserved PCAP data of suspicious network traffic
  4. Remediation:
    • Terminated affected pods and the underlying node
    • Updated application dependencies and rebuilt container images with security patches
    • Added the malicious endpoints to our egress blocklists
    • Rotated all potentially exposed secrets and credentials
    • Reviewed and tightened IAM roles attached to the affected service account
  5. Long-term Fixes:
    • Implemented stricter seccomp profiles for the affected workloads
    • Added additional runtime security monitoring rules
    • Created automated vulnerability scanning for all dependencies in CI/CD
    • Updated our incident response playbook based on lessons learned
    • Conducted team security training focused on container escape techniques
This procedure helped us contain the compromise within 30 minutes of detection and prevent any data exfiltration or lateral movement within our cluster. 
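
For illustration, the emergency isolation in the containment step can be done with a quarantine NetworkPolicy like the one below: labeling the pod cuts all its traffic while keeping it running for forensics. Namespace and labels are examples:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: quarantine
      namespace: payments              # hypothetical namespace
    spec:
      podSelector:
        matchLabels:
          quarantine: "true"           # applied via: kubectl label pod <name> quarantine=true
      policyTypes:                     # listing both types with no rules denies all traffic
        - Ingress
        - Egress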

  Cloud Migration

Q: Have you done any migrations from on-prem to cloud? 

 A: Yes, I led a significant migration from our on-premises data center to AWS at TataCliq. This was a critical initiative to improve scalability, reduce operational overhead, and enhance our disaster recovery capabilities.

Migration Approach:

  1. Assessment & Planning:
    • Conducted comprehensive inventory of all on-prem applications and dependencies
    • Identified application interdependencies and created migration waves
    • Developed TCO analysis and business case for cloud adoption
    • Created detailed migration runbooks for each application component
  2. Technical Implementation:
    • Used the replatform (lift and reshape) approach for most services
    • Containerized legacy applications where possible for Kubernetes deployment
    • Implemented a hybrid connectivity model with AWS Direct Connect during transition
    • Used AWS Database Migration Service for database migrations to minimize downtime
  3. Key Challenges Overcome:
    • Legacy Application Compatibility: Modified several monolithic applications to work in containerized environments
    • Data Migration: Developed custom data sync solutions for large product catalogs with zero downtime
    • Security Compliance: Redesigned security controls to maintain regulatory compliance in cloud
    • Knowledge Transition: Upskilled operations team on cloud technologies and new monitoring approaches
  4. Results:
    • Completed migration of 85+ services over 8 months
    • Reduced infrastructure costs by approximately 30%
    • Improved application performance by 40% on average
    • Enhanced disaster recovery capabilities with multi-AZ deployments
    • Reduced time-to-market for new features from weeks to days
The migration allowed us to adopt modern DevOps practices including infrastructure as code with Terraform and automated CI/CD pipelines, which significantly improved our deployment frequency and reliability. 

  Monitoring and Early Detection 

  Q: What type of issues are occurring in applications running in Kubernetes? How do you find them earlier than customers? 

 A: In our TataCliq Kubernetes environment, we encounter several common application issues and have implemented proactive detection methods to catch them before they impact customers.

Common Application Issues:

  1. Resource Constraints:
    • Memory leaks causing OOMKilled events
    • CPU throttling causing increased latency
    • Disk space filling up on container volumes
  2. Service Dependencies:
    • Database connection pool exhaustion
    • External API timeouts or failures
    • Cache service degradation
  3. Kubernetes-Specific Issues:
    • Pod scheduling failures due to resource requests
    • Liveness/readiness probe failures
    • Configuration issues with ConfigMaps or Secrets
Early Detection Methods:
  1. Proactive Monitoring:
    • Golden signals monitoring (latency, traffic, errors, saturation)
    • Custom Prometheus metrics for application-specific health indicators
    • Synthetic transactions that simulate critical user journeys every minute
  2. Anomaly Detection:
    • ML-based anomaly detection for request patterns and error rates
    • Baseline deviation alerts for key performance metrics
    • Real-time log pattern analysis for emerging error types
  3. Progressive Deployment:
    • Canary deployments with automated metric comparison
    • Feature flags tied to monitoring systems
    • Automatic rollbacks when error thresholds are exceeded
  4. Operational Dashboards:
    • Service-level objective (SLO) dashboards showing error budgets
    • Cross-service dependency maps with real-time health indicators
    • Consolidated alerts with context for faster triage
By combining these approaches, we typically detect issues 10-15 minutes before they would impact customers at scale. This proactive stance has significantly improved our mean time to detection (MTTD) and allowed us to maintain our availability targets even as we've increased deployment frequency. 
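
As an example of the golden-signals alerting, a PrometheusRule along these lines catches error-rate spikes and crash-looping pods. Metric names, thresholds, and durations are illustrative:

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: checkout-golden-signals
    spec:
      groups:
        - name: checkout.rules
          rules:
            - alert: CheckoutHighErrorRate
              expr: |
                sum(rate(http_requests_total{job="checkout",code=~"5.."}[5m]))
                  / sum(rate(http_requests_total{job="checkout"}[5m])) > 0.02
              for: 5m                  # must persist 5 minutes before paging
              labels:
                severity: critical
              annotations:
                summary: "Checkout 5xx ratio above 2%"
            - alert: PodRestartingRepeatedly
              expr: increase(kube_pod_container_status_restarts_total{namespace="prod"}[15m]) > 3
              labels:
                severity: warning
              annotations:
                summary: "Container restarting repeatedly (possible OOMKill)"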

  Infrastructure Security in AWS 

  Q: How do you manage infrastructure security in AWS? 

 A: At TataCliq, we implement a comprehensive infrastructure security approach for our AWS environment:

  1. Account-Level Controls:
    • Strict Service Control Policies (SCPs) enforcing security guardrails
    • Centralized CloudTrail logs in dedicated security account
    • AWS Organizations with segregation of duties across accounts
    • GuardDuty enabled across all accounts with automated remediation
  2. Network Security:
    • Transit Gateway with centralized inspection
    • VPC Flow Logs analyzed in real-time for anomalies
    • Security groups managed through Terraform with approval workflows
    • WAF for edge protection with custom rule sets
  3. Identity & Access Management:
    • Least privilege IAM policies with regular access reviews
    • AWS SSO integration with JIT access for administrative functions
    • MFA enforcement for all human accounts
    • Temporary credentials for all programmatic access
  4. Data Protection:
    • Default encryption for all storage (S3, EBS, RDS)
    • KMS for key management with automatic rotation
    • S3 bucket policies preventing public access
    • DLP scans for PII/sensitive data
  5. Continuous Compliance:
    • AWS Config rules with conformance packs for PCI-DSS and GDPR
    • Security Hub for unified compliance view
    • Automated remediation for common compliance issues
    • Daily compliance reports and drift detection
  6. Vulnerability Management:
    • ECR image scanning in CI/CD pipelines
    • Inspector for host vulnerability assessment
    • Regular penetration testing with third parties
    • Automated patching processes for all resources
  7. Monitoring & Response:
    • Centralized logging with real-time threat detection
    • Playbooks for common security incidents
    • Automated containment for compromised resources
    • Regular security incident response exercises
This layered security approach allows us to maintain a strong security posture while still enabling developer velocity and infrastructure scalability.

Bhavani Prasad
Cloud & DevOps Engineer