Description
We are looking for a Staff Applied AI Scientist to join the team behind AI Coach, Culture Amp’s contextually-aware AI coaching system that turns survey insights, performance data and interpersonal dynamics into personalised assistance at scale. Shipping an AI product is only the beginning. The harder problem, and one few teams have solved, is measuring on an ongoing basis whether that product is working well in production, why its quality changes, and how to make it better. In this role you will scale the effective measurement and improvement of our AI products in production. That means establishing online evaluation of live AI features, then making it sustainable by enabling the rest of our engineering org to do the same.
As part of this team of amazing humans,
You will
- Design and run sampling, LLM-as-a-judge, and labelling systems over de-identified production traces (for example, with Langfuse) to build longitudinal evaluation monitoring and alerting.
- Build LLM-powered analysis that works out why performance moved and recommends prompt or system changes to improve the product.
- Own the full feedback loop: prompt engineering, evaluation at scale, data labelling and continuous improvement.
- Enable others through reusable frameworks, tooling and documentation so product and engineering teams run their own evaluations. Lead from the front, then hand over.
- Partner closely with Coach, product, data science and people science so measured quality maps to real customer value.
- Stay current with the latest evaluation, observability and LLMOps research and provider offerings.
You have
- Proven experience analysing the performance of AI or data products in production and turning that analysis into changes that maintained and improved the product.
- Hands-on LLM evaluation in production: LLM-as-judge, eval datasets, human-in-the-loop labelling, scoring against thresholds.
- Observability for LLM and agentic systems: traces, sampling, prompt management, and production monitoring with Langfuse or a comparable tool.
- Longitudinal measurement: metrics and baselines, regression detection, quality tracking over time.
- Proven commercial experience taking ML or AI systems to production, and strong software engineering fundamentals (we work primarily in Python and TypeScript).
- AI-native daily practice: comfortable using agentic coding tools (Claude Code, Cursor, Codex or similar) on multi-step tasks, with clear judgment on when to direct an agent versus write code yourself.
- Strong technical writing and communication, and a track record of building capability into systems and teaching others to own it.
- Strong signals: built or scaled an eval and observability practice across multiple teams; evolved existing enterprise codebases with AI; production agentic systems (orchestration, RAG); a postgraduate degree in ML, CS, Applied Maths or related; public writing, talks or open-source work in eval, observability or LLMOps.
You are
- Motivated by breaking new ground in an emerging field, with the humility to learn in public and the resilience to be a self-starter.
- Motivated by enablement. Your biggest wins come from teaching others and building this into our systems, which can mean you do not own what you build forever.
The way we build at Culture Amp
At Culture Amp, our engineers increasingly orchestrate agents that write code, rather than writing all of it themselves. We guide plan, build, and review loops in which AI takes the initiative on routine work, freeing engineers to steer architecture, trade-offs, and quality. We're investing in a shared "harness" of tooling and standards so agents can do real product work safely, and we all embrace these capabilities as a core part of how we ship.