We are looking for an Infrastructure Engineer responsible for building, operating, and improving reliable, scalable, and high-performing production systems. This role combines infrastructure engineering expertise with SRE practices, focusing on system availability, automation, incident management, and continuous improvement.
Key Responsibilities:
• Manage high-severity and high-customer-impact incidents, focusing on fast recovery
• Champion production resilience and availability, focusing on a superior client experience, by working with the operations team and technology development teams
• Drive the implementation of Site Reliability Engineering (SRE) and Chaos Engineering design for all strategic systems
• Drive effective communication between business and technology with regard to production service reliability and performance
• Drive continuous improvements in processes or systems leveraging Site Reliability Engineering methods
• Respond to, evaluate, and analyse production incidents to minimise their impact, and devise innovative solutions to prevent them in the future
• Improve the reliability and availability of systems by gathering hard data and designing systems for increased service reliability and performance
• Provide expert advice and training to our engineers on which technology solutions and advanced reliability techniques to use in each situation
Requirements:
• Bachelor's degree in Computer Science or related field
• 10+ years of relevant experience
• Experience driving major production incidents and organising incident retrospective meetings
• Experience with Core Java 8, Cloud Foundry, non-relational databases, and Linux/Unix systems
• Experience with high-availability, high-scale, and performant systems
• Experience with Python and Unix scripting
EA License No.: 96C4864
Reg No.: R25128798 HUANG QIMENG