Ongoing projects, with their implementation notes, architecture decisions, and empirical findings from training and deployment.
Explicit Gradient Regularization within GRPO
This research explores explicit gradient regularization applied during GRPO training of large language models, with the aim of improving convergence and reducing catastrophic forgetting.
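As a toy illustration only, not the project's actual method: a gradient-norm penalty ‖∇L‖² can be added to a loss and differentiated in closed form when the base loss is linear least squares. The function name `regularized_grad` and the quadratic stand-in loss are hypothetical; a real GRPO setup would compute this via double backpropagation through the policy loss.

```python
import numpy as np

def regularized_grad(X, y, w, lam=0.01):
    """Gradient of L(w) + lam * ||grad L(w)||^2 for L = 0.5 * ||Xw - y||^2.

    Toy stand-in for explicit gradient regularization: penalizing the
    gradient norm discourages sharp regions of the loss surface. For this
    quadratic loss, grad L = X^T (Xw - y), and the penalty's gradient is
    2 * lam * X^T X * (grad L), so no autograd machinery is needed.
    """
    g = X.T @ (X @ w - y)                      # grad of the base loss
    return g + 2.0 * lam * (X.T @ (X @ g))     # grad of base loss + penalty
```

In a deep-learning framework the same effect is obtained by building the gradient with `create_graph=True` and backpropagating through its squared norm.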
Replicating RL's policy sharpening at inference
An inference-time sampling technique that uses a power distribution and MCMC to approximate the policy-sharpening effect that RL fine-tuning has on the model's output distribution.
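A minimal sketch of the idea, assuming a discrete vocabulary of token log-probabilities and a uniform proposal distribution; the function name `sample_power_mh`, the Metropolis-Hastings choice, and all hyperparameters are illustrative, not the project's actual sampler:

```python
import math
import random

def sample_power_mh(logprobs, beta=2.0, steps=200, seed=0):
    """Metropolis-Hastings sampling from p(x)^beta over a discrete vocab.

    Raising the policy to a power beta > 1 sharpens it toward its modes,
    mimicking the low-entropy policies that RL fine-tuning tends to
    produce. Proposals are uniform over the vocabulary; acceptance uses
    the power-tilted log-probability ratio, so p(x)^beta never needs to
    be normalized explicitly.
    """
    rng = random.Random(seed)
    n = len(logprobs)
    x = rng.randrange(n)                          # arbitrary initial token
    for _ in range(steps):
        y = rng.randrange(n)                      # uniform proposal
        log_accept = beta * (logprobs[y] - logprobs[x])
        if math.log(rng.random() + 1e-12) < log_accept:
            x = y
    return x
```

Because only log-probability differences enter the acceptance test, the chain targets the sharpened distribution without computing its partition function.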
Performing SDFT on domain specific knowledge
Investigating the efficacy of SDFT on out-of-distribution (OOD) tasks, specifically corporate-specific information, and comparing it against Experiential RL.
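One way to sketch the distillation term at the core of SDFT-style training, assuming a KL divergence between a frozen teacher (the base model, scoring its own rewrite of each demonstration) and the student; `sdft_loss` is a hypothetical helper, not the project's code:

```python
import numpy as np

def sdft_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over next-token distributions.

    In SDFT-style training the frozen base model's distribution serves
    as the target, which limits drift from the original policy while the
    student absorbs the domain-specific content. Inputs are raw logits
    of shape (batch, vocab); returns the mean KL across the batch.
    """
    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

    t = log_softmax(teacher_logits / temperature)
    s = log_softmax(student_logits / temperature)
    return float((np.exp(t) * (t - s)).sum(axis=-1).mean())
```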
Learning to write context into memory at inference time
An alternative form of memory compression that uses test-time gradient descent to continually bake context into the model's weights.
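A toy sketch of the mechanism, assuming a linear associative memory trained by a few gradient steps at inference time; `write_to_memory` and its hyperparameters are illustrative, not the project's architecture:

```python
import numpy as np

def write_to_memory(W, keys, values, lr=0.5, steps=20):
    """'Bake' context into a linear memory via test-time gradient descent.

    W maps key vectors to value vectors. At inference, a few gradient
    steps on the reconstruction loss 0.5 * ||W k - v||^2 over the
    context pairs compress that context into the weights themselves,
    rather than holding it in a growing KV cache.
    """
    for _ in range(steps):
        for k, v in zip(keys, values):
            err = W @ k - v                 # prediction error for this pair
            W -= lr * np.outer(err, k)      # gradient of 0.5 * ||err||^2 wrt W
    return W
```

With (near-)orthogonal keys the per-pair updates barely interfere, so a handful of steps suffices to make the memory reproduce each stored value.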