Cloud Architecture & Operations
- Build and operate HPC environments on cloud platforms such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform.
- Design hybrid-cloud and multi-cloud architectures for HPC workloads.
- Implement cloud-native storage, networking, security, and disaster recovery solutions.
Infrastructure Automation & DevOps
- Develop Infrastructure as Code (IaC) using Terraform, CloudFormation, Ansible, and Python.
- Build CI/CD pipelines for infrastructure and platform deployments.
- Automate cluster provisioning, configuration management, monitoring, and patch management.
- Develop self-service provisioning frameworks for research and engineering teams.
AI & Data Engineering
- Design and implement scalable AI/ML data pipelines.
- Build data ingestion, transformation, and orchestration frameworks.
- Support distributed AI training and inference workloads.
- Optimize GPU utilization for deep learning applications.
- Collaborate with Data Scientists and ML Engineers to deploy production AI solutions.
Platform Monitoring & Reliability
- Implement observability solutions using Prometheus, Grafana, the ELK Stack, and OpenTelemetry.
- Monitor system performance and SLA compliance, and carry out capacity planning.
- Troubleshoot performance bottlenecks across compute, storage, network, and AI frameworks.
HPC Infrastructure Engineering
- Design, deploy, and manage large-scale HPC clusters across on-premises and cloud environments.
- Administer compute, storage, networking, and GPU resources for AI/ML and data-intensive workloads.
- Optimize cluster performance, scheduling, and resource utilization using workload managers such as Slurm, LSF, PBS Pro, and Kubernetes.
Security & Governance
- Implement security best practices for HPC and cloud environments.
- Manage IAM, secrets management, encryption, and compliance controls.
- Support regulatory requirements and enterprise governance standards.