We are looking for talented engineers who are passionate about designing and building high levels of availability, scalability and reliability into systems to join our Site Reliability Engineering Team. At Rokt we believe every team is responsible for running and operating the software they build. We are tasked with driving forward the reliability of the platform by working towards standardisation and supporting our teams: assessing their architecture, helping them to design reliable services, and cultivating excellent operational practices. We approach problems with a focus on ensuring they don’t happen again: we identify reliability pain points and reduce their impact by changing the product, improving the processes, and coaching the developers. We respond to alerts and on-call pages, and work with development teams to define, measure, and exceed SLOs on the features our customers value the most.
You will have the unique opportunity to help build the practice of Site Reliability Engineering at Rokt. You will become intimately familiar with the architecture of our systems, dive deep into code, and lead architecture and Root Cause Analysis workshops, working directly with feature teams.
The mission of this role is to improve reliability, resilience and velocity for service teams at Rokt.
- Our day-to-day is driven by helping our product teams create robust software faster.
- Introduce best practices into the teams around observability, SLOs and reliability.
- Work in close collaboration with partner teams to shape the future roadmap to improve reliability and establish strong operational readiness across teams.
- Participate in system design consulting and capacity planning.
- Identify areas for improvement across the organisation and drive Engineering-wide technical change in the field of Site Reliability.
- Share your knowledge by giving brown bags, tech talks, and evangelizing appropriate tech and engineering best practices.
- Partner with the broader Rokt organization to build a culture of rigorously learning from incidents.
- Contribute to Root Cause Analysis (RCA) investigations and follow up each incident to ensure the appropriate action items are in place and prioritized.
- Design tools to help our entire engineering organization be as productive as possible.
- Lead the development and rollout of new tools, technologies and processes that improve reliability and velocity, have high business impact, and are used by multiple teams.
- Contribute to documentation and to uplifting partner teams.
- Hands-on experience in Site Reliability and Observability Engineering: debugging, diagnosing and correcting errors, and resolving high-severity incidents.
- An ability to think about systems: edge cases, failure modes, behaviors, specific implementations.
- Experience building solutions in distributed systems for high-volume transactions and/or developing support-focused tooling.
- Strong software development experience with at least one of the following languages: Python, Java, Go, C# or similar.
- Experience working with various monitoring and alerting tools.
- Hands-on development experience with cloud infrastructure (AWS, Google Compute Engine, Azure) and container tooling (Kubernetes, Docker).
- Experience with defensive programming, circuit breakers, resilience frameworks, fault tolerance, and self-healing mechanisms for services.
- An ability and desire to mentor and coach engineers.
- Strong organisational and interpersonal skills, with experience developing and instilling a culture of operational maturity.
- You have handled multiple on-call shifts and have navigated more than one incident through to the retrospective process.
- Systematic problem-solving approach, coupled with effective communication skills and a sense of ownership and drive.
- At Rokt we encourage autonomy; teams have complete ownership of their systems, including building, running and monitoring them. As such, you may be required to be on-call and respond to system alerts should they arise.
- Work with the greatest talent in town. Our recruiting process is tough. We hold a high bar because we have a high performing culture - we only want the brightest and the best.
- Join a community. We believe the best things happen when we come together to solve complex problems and make meaningful connections with each other through interest groups, sports clubs, and social events.
- Accelerate your career. Develop through our global training events, ‘Level Up’ investment, online training courses and our fantastic people leaders. Take your career to Rokt’speed - the average time between promotions is 12 months.
- Take a break. When you work hard, we know you also need to rest. We offer generous time off and parental leave policies. We also offer a paid Rokt’star Sabbatical for employees who have been with us 3 years or more.
- Stay happy and healthy. Enjoy catered lunch 3 times a week and healthy snacks in the office. Plus, join the gym on us! Access generous retirement plans, like a 4% dollar-for-dollar 401(k) matching plan in the US. US employees also get fully funded premium health insurance for their whole family.
- Become a shareholder. All Rokt’stars have stock options. If we succeed, everyone gets to enjoy the upside.
- See the world! Along with our global all-staff events in amazing locations (Phuket, Thailand in January 2020), we also offer generous relocation packages for those interested in moving to another Rokt office. We have cool offices in great cities - Tokyo, New York, Singapore, Boston, Sydney.
- We believe in equality. Rokt is an Equal Opportunity Employer and recognizes that a diverse workforce is crucial to our success as a business. We would love to hear from you - irrespective of socio-economic status or background, age, gender identity, race, religion, sexual orientation, color, pregnancy, carer/family responsibilities, national and social origin, political opinion, marital, veteran, or disability status.