Standard distillation discards incorrect teacher responses. We propose Reinforcement Distillation, which uses these negative reasoning traces as an additional training signal to improve student model performance on reasoning tasks.
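As a rough illustration of the idea (the function and weight names here are my own, not from the project): instead of training only on correct teacher traces, incorrect traces can enter the objective with a negative weight, pushing the student's likelihood away from them.

```python
def distill_loss(traces, neg_weight=0.5):
    """Hypothetical sketch of distillation that keeps incorrect teacher traces.

    traces: list of (avg_token_logprob, is_correct) pairs, one per teacher
    trace, where avg_token_logprob is the student's average log-probability
    of that trace. `neg_weight` is an illustrative hyperparameter.
    """
    total = 0.0
    for logprob, correct in traces:
        if correct:
            total += -logprob               # standard NLL on positive traces
        else:
            total -= neg_weight * -logprob  # penalize likelihood of negatives
    return total / len(traces)

# A student that is more confident on the correct trace and less confident
# on the incorrect one achieves a lower loss:
good_student = [(-0.5, True), (-3.0, False)]
bad_student = [(-3.0, True), (-0.5, False)]
assert distill_loss(good_student) < distill_loss(bad_student)
```

This is only a scalar caricature of the objective; the actual method's loss and weighting scheme may differ.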
Final-year CS undergrad at NUS. I explore two frontiers: improving the efficiency of individual models and designing agentic systems that are both efficient and general.
We found that introducing randomness into candidate-selection strategies can outperform complex deterministic metrics.
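To make the contrast concrete (a minimal sketch with names of my own choosing, e.g. for picking one response out of N samples): a deterministic strategy ranks candidates by a hand-designed score, while the random baseline ignores the score entirely.

```python
import random

def select_by_metric(candidates, score):
    # Deterministic: pick the candidate the hand-designed metric likes best.
    return max(candidates, key=score)

def select_at_random(candidates, rng):
    # Stochastic: uniform choice, no metric at all.
    return rng.choice(candidates)

candidates = ["ans_a", "ans_bbb", "ans_cc"]
# Using length as a stand-in for a deterministic score:
assert select_by_metric(candidates, score=len) == "ans_bbb"
assert select_at_random(candidates, random.Random(0)) in candidates
```

The finding is that the second, metric-free strategy can match or beat the first on real workloads when the metric is poorly calibrated; the sketch only shows the two strategies, not that result.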
Parallel test-time scaling systems typically follow hand-designed strategies, which are not always optimal. We explore how LLM-powered agents can autonomously decide when and how to scale inference compute.