Senior Site Reliability Engineer at Rokt

Engineering - Foundation, Sydney, Australia

Description

Posted 5 years ago

The Role:

We are looking for talented engineers who are passionate about designing and building high levels of availability, scalability and reliability into systems to join our Site Reliability Engineering Team. At Rokt we believe every team is responsible for running and operating the software they build. We are tasked with driving forward the reliability of the platform by working towards standardisation and by supporting our teams: assessing their architecture, helping them to design reliable services, and cultivating excellent operational practices. We approach problems with a focus on ensuring they don’t happen again. We identify pain points in reliability and reduce their impact by changing the product, improving the processes, and coaching the developers. We respond to alerts and on-call pages, and work with development teams to define, measure, and exceed SLOs on the features our customers value the most.

You will have the unique opportunity to help build the practice of Site Reliability Engineering at Rokt. You will become intimately familiar with the architecture of our systems and be responsible for diving deep into code and leading architecture and Root Cause Analysis workshops, working directly with feature teams.

The mission of this role is to improve reliability, resilience and velocity for service teams at Rokt.

Responsibilities:

Our day-to-day is driven by helping our product teams create robust software faster.
Introduce best practices into the teams around observability, SLOs and reliability.
Work in close collaboration with partner teams to shape the future roadmap to improve reliability and establish strong operational readiness across teams.
Participate in system design consulting and capacity planning.
Identify areas for improvement across the organisation and drive Engineering-wide technical change in the field of Site Reliability.
Share your knowledge by giving brown bags, tech talks, and evangelizing appropriate tech and engineering best practices.
Partner with the broader Rokt organization to build a culture of rigorously learning from incidents.
Contribute to Root Cause Analysis (RCA) investigations and follow up each incident to ensure the appropriate action items are in place and prioritized.
Design tools to help our entire engineering organization be as productive as possible.
Lead the development and rollout of new tools, technologies and processes that improve reliability and velocity, have high business impact, and are used by multiple teams.
Contribute to documentation and to uplifting partner teams.

Requirements:

Hands-on experience in Site Reliability and Observability Engineering, including debugging, diagnosing and correcting errors, and resolving high-severity incidents.
An ability to think about systems: edge cases, failure modes, behaviors, specific implementations.
Experience building solutions in distributed systems for high-volume transactions and/or developing support-focused tooling.
Strong software development experience with at least one of the following languages: Python, Java, Go, C# or similar.
Experience working with various monitoring and alerting tools.
You have done hands-on development with cloud infrastructure (AWS, GCE, Azure, Kubernetes, Docker).
Experience with defensive programming, circuit breakers, resilience frameworks, fault tolerance, and self-healing mechanisms for services.
An ability and desire to mentor and coach engineers.
Strong organisational and interpersonal skills, with experience developing and instilling a culture of operational maturity.
You have handled multiple on-call shifts, and have navigated more than one incident through to the retrospective process.
Systematic problem-solving approach, coupled with effective communication skills and a sense of ownership and drive.
At Rokt we encourage autonomy; teams have complete ownership of their systems, including building, running and monitoring them. As such, you may be required to be on-call and respond to system alerts should they arise.