Site Reliability Engineer
Senior Site Reliability Engineer
YOUR MISSION
- System Reliability & Performance: Ensure the reliability, availability, and performance of AI-powered features and services across our products. Proactively monitor and address system issues to prevent downtime and improve performance.
- Infrastructure Automation: Develop and maintain automation tools and scripts to manage infrastructure, deployments, and operations, using technologies such as Terraform, Ansible, or similar.
- Monitoring & Incident Management: Implement and maintain comprehensive monitoring and alerting systems. Lead incident response efforts, including root cause analysis and post-mortem documentation.
- Collaboration: Work closely with development and operations teams to design, build, and maintain scalable and resilient systems that support AI features and integrations.
- Continuous Improvement: Identify and implement improvements to existing systems, processes, and practices to enhance reliability, scalability, and performance.
- Security & Compliance: Ensure that all systems and processes comply with security best practices and regulatory requirements, particularly in the context of AI and cloud-hosted services.
- CI/CD Pipeline Management: Maintain and optimize continuous integration and continuous deployment (CI/CD) pipelines to ensure smooth and efficient deployment of AI features.
- Capacity Planning & Optimization: Conduct capacity planning and optimize resource utilization to ensure that our systems can scale effectively as demand grows.
MUST HAVE
- Education: Bachelor’s degree in Computer Science, Software Engineering, or a related field.
- Experience: 3-5 years of experience in site reliability engineering, DevOps, or a similar role, with a strong focus on cloud-hosted environments, preferably on Microsoft Azure.
- Automation Skills: Extensive experience with infrastructure as code (IaC) tools such as Terraform, Ansible, or equivalent, and a strong understanding of automation principles.
- Cloud Expertise: Deep knowledge of Microsoft Azure, including experience with cloud services, networking, storage, and security.
- Monitoring & Incident Management: Proven experience in setting up and managing monitoring, logging, and alerting tools, as well as leading incident response efforts.
- Collaboration & Communication: Strong collaboration and communication skills, with the ability to work effectively in cross-functional teams and influence stakeholders.
- Problem-Solving Skills: Strong analytical and problem-solving abilities, with a proactive approach to identifying and addressing potential issues before they impact the user experience.
NICE TO HAVE
- AI Systems Experience: Familiarity with the unique challenges of deploying and maintaining AI systems, including model deployment, monitoring, and scalability.
- Agile Methodology: Experience working in an Agile/Scrum environment, with a focus on continuous improvement and iterative development.
- Security Best Practices: Knowledge of security best practices in cloud environments, particularly in relation to AI and sensitive data.
FOR YOU
- Up to 6 additional days off for personal or professional development
- Wellbeing: Fitness subscriptions, medical and dental subscriptions, and digital health and wellness solutions (mental health, nutrition, coaching, parenting)
- And many more. Ask us about it!
JOIN US
Please note that, due to current EU data protection regulations, MATRIX42 can only accept applications submitted online through the applicant portal of our applicant management system.