Role Summary
Own our secure, multi-account AWS foundation and the MLOps/GenAI platform that powers
clinician matching, document processing, and safety tooling. You blend SRE discipline with ML
platform pragmatism to deliver compliant, observable, and cost-efficient infrastructure.
Key Responsibilities
• Build and operate a secure AWS landing zone (Organizations, Control Tower), VPC
architecture, private networking, and multi-account guardrails.
• Design CI/CD and IaC at scale (GitHub Actions/CodeBuild/CodePipeline, Terraform and/or
AWS CDK) and enforce policy-as-code (Open Policy Agent, AWS SCPs).
• Run compute fabrics for services and data: Amazon EKS (preferred) and ECS Fargate;
autoscaling (HPA/Karpenter) and cluster security (IRSA, PodSecurity).
• Observability platform: AWS Distro for OpenTelemetry, CloudWatch, Prometheus/Grafana,
X-Ray; golden signals, SLOs, incident response and on-call.
• Security-by-default: IAM least-privilege, KMS envelope encryption, Secrets
Manager/Parameter Store, AWS WAF/Shield, artifact signing, SBOM/SLSA.
• Resiliency engineering: multi-AZ baselines, chaos testing, backup/DR (AWS Backup), game
days; cost management with CUR/Budgets/rightsizing.
• MLOps: SageMaker projects/pipelines, model registry, feature store, inference endpoints;
safe deployment patterns (shadow/canary/A/B testing) and data drift monitoring.
• GenAI: Amazon Bedrock integration (guardrails, content filters, PII redaction), retrieval with
vector indexes (pgvector on Aurora or OpenSearch k-NN).
• Data platform enablement with S3/Lake Formation/Glue/Athena/EMR; secure data paths for
training/serving; governance and auditability.
• Champion DevSecOps: threat modeling, SBOM scanning, container/image hardening, and
secure software supply chain.
