Location:
New York, NY (located near Penn Station South) or Dallas, TX
Summary
This role combines Cloud DevOps engineering and architectural responsibilities, focusing on designing, automating, and managing secure, scalable, and cost-optimized cloud environments. The position requires expertise in AWS services, Infrastructure as Code (IaC), container orchestration, OS management, AI/ML integration, and modern application architectures.
Responsibilities
- Architect and design cloud infrastructure solutions leveraging AWS IaaS (EC2, VPC, IAM) and PaaS (Lambda, RDS, ECS).
- Define high-level architecture diagrams, reference architectures, and best practices for multi-cloud deployments.
- Ensure scalability, high availability, and disaster recovery in all designs.
- Automate provisioning and configuration using Terraform or OpenTofu.
- Deploy and orchestrate containerized workloads using Amazon EKS and Kubernetes.
- Build and maintain CI/CD pipelines for application delivery and infrastructure updates.
- Administer Linux and Windows servers, including patching, hardening, and performance tuning.
- Implement automated patch management using tools like AWS Systems Manager, WSUS, Ansible, or SCCM.
- Monitor system health and performance using CloudWatch, Prometheus, Grafana, and native OS tools.
- Ensure compliance with security policies and best practices across cloud and OS layers.
- Perform deep troubleshooting across all layers:
o Network (VPC, NACLs, Security Groups)
o IAM permissions and policy conflicts
o Kubernetes cluster failures, Helm misconfigurations
o CI/CD pipeline errors and rollback strategies
o OS-level performance bottlenecks and kernel issues
o Root cause analysis and permanent fixes for outages
Infrastructure as Code (IaC)
- Design and implement IaC using Terraform and OpenTofu across multi-cloud environments.
- Develop reusable modules and manage state files with remote backends and workspaces.
- Automate workflows and CI/CD pipelines using Python and tools like Jenkins, GitHub Actions, or GitLab CI.
- Integrate policy-as-code frameworks such as Open Policy Agent (OPA) or HashiCorp Sentinel for governance.
- Collaborate with security and compliance teams to enforce resource policies and automate audits.
- Optimize cloud resources through tagging, lifecycle policies, and cost management strategies.
- Document infrastructure designs, scripts, and operational procedures.
Required skills
- 7+ years of experience in Cloud Architecture and DevOps, designing and managing secure, scalable AWS environments.
- 7+ years of hands-on expertise with AWS IaaS (EC2, VPC, IAM) and PaaS (Lambda, RDS, ECS) services.
- 5+ years of experience implementing Infrastructure as Code (IaC) using Terraform/OpenTofu, including reusable modules and remote state management.
- 5+ years of experience deploying and orchestrating containerized workloads using Amazon EKS and Kubernetes.
- Strong proficiency in CI/CD pipeline design, automation scripting (Python), and integration with tools like Jenkins, GitHub Actions, or GitLab CI.
- Experience in AI/ML integration, including Amazon SageMaker, Bedrock, and designing AI Landing Zones for predictive and generative AI workloads.
- Expertise in modern application architectures, including event-driven, serverless (AWS Lambda), and microservices design.
- Proven ability to lead offshore teams (including in the absence of an offshore lead), manage client relationships, and deliver results aligned with the SOW.
Desired Core Technologies:
AWS Services
- EC2, VPC, IAM, S3, EBS, ELB, Auto Scaling
- Lambda, RDS, DynamoDB, CloudFormation, Systems Manager
Infrastructure as Code
- Terraform / OpenTofu: modules, remote state, workspaces
- YAML/JSON for IaC templates and configurations
Containers & Orchestration
- Docker: image creation, registries, networking
- Kubernetes: architecture, RBAC, Helm
- Amazon EKS: provisioning, scaling, upgrades
DevOps & CI/CD
- Git workflows, automated testing, deployment strategies
- Proficiency in Python for scripting and automation
- Familiarity with CI/CD tools (e.g., AWS CodePipeline) and version control systems (e.g., Git)
- Knowledge of infrastructure governance, monitoring, and logging tools (e.g., Prometheus, Grafana)
- Understanding of security best practices in cloud environments
OS Administration & Patching
- Linux: Ubuntu, CentOS, Amazon Linux
o Shell scripting, cron jobs, systemd, log rotation
o Patch management via yum, apt, Ansible, AWS Systems Manager
- Windows Server: AD, DNS, IIS, PowerShell
o Patch management via WSUS, SCCM, AWS Systems Manager
o Group Policy, scheduled tasks, event logs
Security & Monitoring
- IAM policies, security groups, NACLs
- CloudWatch, Prometheus, Grafana, ELK stack
- Secrets management: AWS Secrets Manager, HashiCorp Vault
AI Ops & Integration
- AI Landing Zone design and implementation.
- AI/application integration using Amazon Bedrock, Amazon SageMaker, or ML frameworks for predictive and generative AI.
- Expertise in ML and Gen AI for cloud-native applications.
Application Architecture
- Event-driven architecture for scalable systems.
- Serverless architecture leveraging AWS Lambda and managed services.
- Microservices design and deployment.
- AI-based applications using SageMaker and Bedrock.