Reinforcement Learning from Human Feedback (RLHF) is widely used to make large language models follow instructions more reliably, stay helpful, and produce fewer unsafe or low-quality outputs. A core component of RLHF is the reward model: a separate model trained on human preference data to score candidate responses, so that it predicts which of two outputs a human annotator would prefer.
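
As a minimal illustration of how a reward model can be fit to pairwise preference data, the sketch below uses PyTorch with a Bradley-Terry style loss that pushes the score of the preferred response above the rejected one. The `RewardModel` class, the `hidden_dim` size, and the random tensors standing in for encoded prompt-response pairs are all illustrative assumptions, not a description of any particular system's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Toy reward model: maps a fixed-size response embedding to a scalar score.

    In practice this head would sit on top of a pretrained language model;
    here the embedding is assumed to be precomputed.
    """

    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One scalar reward per example.
        return self.scorer(x).squeeze(-1)


def pairwise_preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: maximize the log-probability that the
    # chosen response outscores the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()


# Toy usage: random embeddings stand in for encoded (prompt, response) pairs.
model = RewardModel()
chosen_emb = torch.randn(8, 128)    # embeddings of preferred responses
rejected_emb = torch.randn(8, 128)  # embeddings of dispreferred responses

loss = pairwise_preference_loss(model(chosen_emb), model(rejected_emb))
loss.backward()
print(f"pairwise preference loss: {loss.item():.4f}")
```

Once trained this way, the reward model's scalar score is what the RLHF policy-optimization step (e.g. PPO) treats as the reward signal for newly generated responses.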