{
"rlhf_code": [
{
"title": "Execution-based Code Generation using Deep Reinforcement Learning",
"authors": [
"Parshin Shojaee",
"Aneesh Jain",
"Sindhu Tipirneni",
"Chandan K. Reddy"
],
"summary": "The utilization of programming language (PL) models, pre-trained on large-scale code corpora, as a means of automating software engineering processes has demonstrated considerable potential in streamlining various code generation tasks such as code completion, code translation, and program synthesis. However, current approaches mainly rely on supervised fine-tuning objectives borrowed from text generation, neglecting unique sequence-level characteristics of code, including but not limited to compilability as well as syntactic and functional correctness. To address this limitation, we propose PPOCoder, a new framework for code generation that synergistically combines pre-trained PL models with Proximal Policy Optimization (PPO) which is a widely used deep reinforcement learning technique. By utilizing non-differentiable feedback from code execution and structure alignment, PPOCoder seamlessly integrates external code-specific knowledge into the model optimization process. It's important to note that PPOCoder is a task-agnostic and model-agnostic framework that can be used across different code generation tasks and PLs. Extensive experiments on three code generation tasks demonstrate the effectiveness of our proposed approach compared to SOTA methods, achieving significant improvements in compilation success rates and functional correctness across different PLs.",
"arxiv_id": "2301.13816v4",
"published": "2023-01-31",
"categories": [
"cs.LG",
"cs.AI",
"cs.CL",
"cs.PL"
],
"url": "https://arxiv.org/abs/2301.13816v4",
"pdf_url": "https://arxiv.org/pdf/2301.13816v4.pdf"
},
{
"title": "Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint",
"authors": [
"Wei Xiong",
"Hanze Dong",
"Chenlu Ye",
"Ziqi Wang",
"Han Zhong",
"Heng Ji",
"Nan Jiang",
"Tong Zhang"
],
"summary": "This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF). We first identify the primary challenges of existing popular methods like offline PPO and offline DPO as lacking in strategical exploration of the environment. Then, to understand the mathematical principle of RLHF, we consider a standard mathematical formulation, the reverse-KL regularized contextual bandit for RLHF. Despite its widespread practical application, a rigorous theoretical analysis of this formulation remains open. We investigate its behavior in three distinct settings -- offline, online, and hybrid -- and propose efficient algorithms with finite-sample theoretical guarantees. Moving towards practical applications, our framework, with a robust approximation of the information-theoretical policy improvement oracle, naturally gives rise to several novel RLHF algorithms. This includes an iterative version of the Direct Preference Optimization (DPO) algorithm for online settings, and a multi-step rejection sampling strategy for offline scenarios. Our empirical evaluations on real-world alignment experiment of large language model demonstrate that these proposed methods significantly surpass existing strong baselines, such as DPO and Rejection Sampling Optimization (RSO), showcasing the connections between solid theoretical foundations and their potent practical implementations.",
"arxiv_id": "2312.11456v4",
"published": "2023-12-18",
"categories": [
"cs.LG",
"cs.AI",
"stat.ML"
],
"url": "https://arxiv.org/abs/2312.11456v4",
"pdf_url": "https://arxiv.org/pdf/2312.11456v4.pdf"
},
{
"title": "Online Iterative Reinforcement Learning from Human Feedback with General Preference Model",
"authors": [
"Chenlu Ye",
"Wei Xiong",
"Yuheng Zhang",
"Hanze Dong",
"Nan Jiang",
"Tong Zhang"
],
"summary": "We investigate Reinforcement Learning from Human Feedback (RLHF) in the context of a general preference oracle. In particular, we do not assume the existence of a reward function and an oracle preference signal drawn from the Bradley-Terry model as most of the prior works do. We consider a standard mathematical formulation, the reverse-KL regularized minimax game between two LLMs for RLHF under general preference oracle. The learning objective of this formulation is to find a policy so that it is consistently preferred by the KL-regularized preference oracle over any competing LLMs. We show that this framework is strictly more general than the reward-based one, and propose sample-efficient algorithms for both the offline learning from a pre-collected preference dataset and online learning where we can query the preference oracle along the way of training. Empirical studies verify the effectiveness of the proposed framework.",
"arxiv_id": "2402.07314v3",
"published": "2024-02-11",
"categories": [
"cs.LG",
"stat.ML"
],
"url": "https://arxiv.org/abs/2402.07314v3",
"pdf_url": "https://arxiv.org/pdf/2402.07314v3.pdf"
},
{
"title": "Impoola: The Power of Average Pooling for Image-Based Deep Reinforcement Learning",
"authors": [
"Raphael Trumpp",
"Ansgar Sch\u00e4fftlein",
"Mirco Theile",
"Marco Caccamo"
],
"summary": "As image-based deep reinforcement learning tackles more challenging tasks, increasing model size has become an important factor in improving performance. Recent studies achieved this by focusing on the parameter efficiency of scaled networks, typically using Impala-CNN, a 15-layer ResNet-inspired network, as the image encoder. However, while Impala-CNN evidently outperforms older CNN architectures, potential advancements in network design for deep reinforcement learning-specific image encoders remain largely unexplored. We find that replacing the flattening of output feature maps in Impala-CNN with global average pooling leads to a notable performance improvement. This approach outperforms larger and more complex models in the Procgen Benchmark, particularly in terms of generalization. We call our proposed encoder model Impoola-CNN. A decrease in the network's translation sensitivity may be central to this improvement, as we observe the most significant gains in games without agent-centered observations. Our results demonstrate that network scaling is not just about increasing model size - efficient network design is also an essential factor. We make our code available at https://github.com/raphajaner/impoola.",
"arxiv_id": "2503.05546v2",
"published": "2025-03-07",
"categories": [
"cs.LG",
"cs.AI"
],
"url": "https://arxiv.org/abs/2503.05546v2",
"pdf_url": "https://arxiv.org/pdf/2503.05546v2.pdf"
},
{
"title": "DeepCodeSeek: Real-Time API Retrieval for Context-Aware Code Generation",
"authors": [
"Esakkivel Esakkiraja",
"Denis Akhiyarov",
"Aditya Shanmugham",
"Chitra Ganapathy"
],
"summary": "Current search techniques are limited to standard RAG query-document applications. In this paper, we propose a novel technique to expand the code and index for predicting the required APIs, directly enabling high-quality, end-to-end code generation for auto-completion and agentic AI applications. We address the problem of API leaks in current code-to-code benchmark datasets by introducing a new dataset built from real-world ServiceNow Script Includes that capture the challenge of unclear API usage intent in the code. Our evaluation metrics show that this method achieves 87.86% top-40 retrieval accuracy, allowing the critical context with APIs needed for successful downstream code generation. To enable real-time predictions, we develop a comprehensive post-training pipeline that optimizes a compact 0.6B reranker through synthetic dataset generation, supervised fine-tuning, and reinforcement learning. This approach enables our compact reranker to outperform a much larger 8B model while maintaining 2.5x reduced latency, effectively addressing the nuances of enterprise-specific code without the computational overhead of larger models.",
"arxiv_id": "2509.25716v1",
"published": "2025-09-30",
"categories": [
"cs.SE",
"cs.AI",
"cs.IR"
],
"url": "https://arxiv.org/abs/2509.25716v1",
"pdf_url": "https://arxiv.org/pdf/2509.25716v1.pdf"
},
{
"title": "ACE-RLHF: Automated Code Evaluation and Socratic Feedback Generation Tool using Large Language Models and Reinforcement Learning with Human Feedback",
"authors": [
"Tasnia Rahman",
"Sathish A. P. Kumar",
"Sumit Jha",
"Arvind Ramanathan"
],
        "summary": "Automated Program Repair tools are developed for generating feedback and suggesting a repair method for erroneous code. State of the art (SOTA) code repair methods rely on data-driven approaches and often fail to deliver solution for complicated programming questions. To interpret the natural language of unprecedented programming problems, using Large Language Models (LLMs) for code-feedback generation is crucial. LLMs generate more comprehensible feedback than compiler-generated error messages, and Reinforcement Learning with Human Feedback (RLHF) further enhances quality by integrating human-in-the-loop which helps novice students to learn programming from scratch interactively. We are applying RLHF fine-tuning technique for an expected Socratic response such as a question with hint to solve the programming issue. We are proposing code feedback generation tool by fine-tuning LLM with RLHF, Automated Code Evaluation with RLHF (ACE-RLHF), combining two open-source LLM models with two different SOTA optimization techniques. The quality of feedback is evaluated on two benchmark datasets containing basic and competition-level programming questions where the latter is proposed by us. We achieved 2-5% higher accuracy than RL-free SOTA techniques using Llama-3-7B-Proximal-policy optimization in automated evaluation and similar or slightly higher accuracy compared to reward model-free RL with AI Feedback (RLAIF). We achieved almost 40% higher accuracy with GPT-3.5 Best-of-n optimization while performing manual evaluation.",
"arxiv_id": "2504.04657v1",
"published": "2025-04-07",
"categories": [
"cs.LG"
],
"url": "https://arxiv.org/abs/2504.04657v1",
"pdf_url": "https://arxiv.org/pdf/2504.04657v1.pdf"
},
{
"title": "Deep Reinforcement Learning with Gradient Eligibility Traces",
"authors": [
"Esraa Elelimy",
"Brett Daley",
"Andrew Patterson",
"Marlos C. Machado",
"Adam White",
"Martha White"
],
"summary": "Achieving fast and stable off-policy learning in deep reinforcement learning (RL) is challenging. Most existing methods rely on semi-gradient temporal-difference (TD) methods for their simplicity and efficiency, but are consequently susceptible to divergence. While more principled approaches like Gradient TD (GTD) methods have strong convergence guarantees, they have rarely been used in deep RL. Recent work introduced the generalized Projected Bellman Error ($\\overline{\\text{PBE}}$), enabling GTD methods to work efficiently with nonlinear function approximation. However, this work is limited to one-step methods, which are slow at credit assignment and require a large number of samples. In this paper, we extend the generalized $\\overline{\\text{PBE}}$ objective to support multistep credit assignment based on the $\u03bb$-return and derive three gradient-based methods that optimize this new objective. We provide both a forward-view formulation compatible with experience replay and a backward-view formulation compatible with streaming algorithms. Finally, we evaluate the proposed algorithms and show that they outperform both PPO and StreamQ in MuJoCo and MinAtar environments, respectively. Code available at https://github.com/esraaelelimy/gtd\\_algos",
"arxiv_id": "2507.09087v2",
"published": "2025-07-12",
"categories": [
"cs.LG",
"cs.AI",
"stat.ML"
],
"url": "https://arxiv.org/abs/2507.09087v2",
"pdf_url": "https://arxiv.org/pdf/2507.09087v2.pdf"
},
{
"title": "Reinforcement Learning from Human Feedback with High-Confidence Safety Constraints",
"authors": [
"Yaswanth Chittepu",
"Blossom Metevier",
"Will Schwarzer",
"Austin Hoag",
"Scott Niekum",
"Philip S. Thomas"
],
"summary": "Existing approaches to language model alignment often treat safety as a tradeoff against helpfulness, which can lead to unacceptable responses in sensitive domains. To ensure reliable performance in such settings, we propose High-Confidence Safe Reinforcement Learning from Human Feedback (HC-RLHF), a method that provides high-confidence safety guarantees while maximizing helpfulness. Similar to previous methods, HC-RLHF explicitly decouples human preferences into helpfulness and harmlessness (safety), which are learned by training a reward model and a cost model, respectively. It then employs a two-step process to find safe solutions. In the first step, it optimizes the reward function under an intentionally pessimistic version of the cost constraint. In the second step, the trained model undergoes a safety test to verify whether its performance stays within an upper-confidence bound of the actual cost constraint. We provide a theoretical analysis of HC-RLHF, including proof that it will not return an unsafe solution with a probability greater than a user-specified threshold. For our empirical analysis, we apply HC-RLHF to align three different language models (Qwen2-1.5B, Qwen2.5-3B, and LLaMa3.2-3B) with human preferences. Our results demonstrate that HC-RLHF produces safe models with high probability and can improve harmlessness and helpfulness compared to previous methods.",
"arxiv_id": "2506.08266v1",
"published": "2025-06-09",
"categories": [
"cs.LG",
"cs.AI",
"cs.CL",
"stat.AP"
],
"url": "https://arxiv.org/abs/2506.08266v1",
"pdf_url": "https://arxiv.org/pdf/2506.08266v1.pdf"
},
{
"title": "RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs",
"authors": [
"Shreyas Chaudhari",
"Pranjal Aggarwal",
"Vishvak Murahari",
"Tanmay Rajpurohit",
"Ashwin Kalyan",
"Karthik Narasimhan",
"Ameet Deshpande",
"Bruno Castro da Silva"
],
"summary": "State-of-the-art large language models (LLMs) have become indispensable tools for various tasks. However, training LLMs to serve as effective assistants for humans requires careful consideration. A promising approach is reinforcement learning from human feedback (RLHF), which leverages human feedback to update the model in accordance with human preferences and mitigate issues like toxicity and hallucinations. Yet, an understanding of RLHF for LLMs is largely entangled with initial design choices that popularized the method and current research focuses on augmenting those choices rather than fundamentally improving the framework. In this paper, we analyze RLHF through the lens of reinforcement learning principles to develop an understanding of its fundamentals, dedicating substantial focus to the core component of RLHF -- the reward model. Our study investigates modeling choices, caveats of function approximation, and their implications on RLHF training algorithms, highlighting the underlying assumptions made about the expressivity of reward. Our analysis improves the understanding of the role of reward models and methods for their training, concurrently revealing limitations of the current methodology. We characterize these limitations, including incorrect generalization, model misspecification, and the sparsity of feedback, along with their impact on the performance of a language model. The discussion and analysis are substantiated by a categorical review of current literature, serving as a reference for researchers and practitioners to understand the challenges of RLHF and build upon existing efforts.",
"arxiv_id": "2404.08555v2",
"published": "2024-04-12",
"categories": [
"cs.LG",
"cs.AI",
"cs.CL"
],
"url": "https://arxiv.org/abs/2404.08555v2",
"pdf_url": "https://arxiv.org/pdf/2404.08555v2.pdf"
},
{
"title": "RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback",
"authors": [
"Harrison Lee",
"Samrat Phatale",
"Hassan Mansoor",
"Thomas Mesnard",
"Johan Ferret",
"Kellie Lu",
"Colton Bishop",
"Ethan Hall",
"Victor Carbune",
"Abhinav Rastogi",
"Sushant Prakash"
],
"summary": "Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards \"self-improvement\" by demonstrating that RLAIF can outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy, or even the exact same checkpoint as the initial policy. Finally, we introduce direct-RLAIF (d-RLAIF) - a technique that circumvents RM training by obtaining rewards directly from an off-the-shelf LLM during RL, which achieves superior performance to canonical RLAIF. Our results suggest that RLAIF can achieve performance on-par with using human feedback, offering a potential solution to the scalability limitations of RLHF.",
"arxiv_id": "2309.00267v3",
"published": "2023-09-01",
"categories": [
"cs.CL",
"cs.AI",
"cs.LG"
],
"url": "https://arxiv.org/abs/2309.00267v3",
"pdf_url": "https://arxiv.org/pdf/2309.00267v3.pdf"
},
{
"title": "Automatic Unit Test Data Generation and Actor-Critic Reinforcement Learning for Code Synthesis",
"authors": [
"Philip John Gorinski",
"Matthieu Zimmer",
"Gerasimos Lampouras",
"Derrick Goh Xin Deik",
"Ignacio Iacobacci"
],
"summary": "The advent of large pre-trained language models in the domain of Code Synthesis has shown remarkable performance on various benchmarks, treating the problem of Code Generation in a fashion similar to Natural Language Generation, trained with a Language Modelling (LM) objective. In addition, the property of programming language code being precisely evaluable with respect to its semantics -- through the use of Unit Tests to check its functional correctness -- lends itself to using Reinforcement Learning (RL) as a further training paradigm. Previous work has shown that RL can be applied as such to improve models' coding capabilities; however, such RL-based methods rely on a reward signal based on defined Unit Tests, which are much harder to obtain compared to the huge crawled code datasets used in LM objectives. In this work, we present a novel approach to automatically obtain data consisting of function signatures and associated Unit Tests, suitable for RL training of Code Synthesis models. We also introduce a straightforward, simple yet effective Actor-Critic RL training scheme and show that it, in conjunction with automatically generated training data, leads to improvement of a pre-trained code language model's performance by up to 9.9% improvement over the original underlying code synthesis LM, and up to 4.3% over RL-based models trained with standard PPO or CodeRL.",
"arxiv_id": "2310.13669v1",
"published": "2023-10-20",
"categories": [
"cs.LG",
"cs.AI",
"cs.CL",
"cs.PL"
],
"url": "https://arxiv.org/abs/2310.13669v1",
"pdf_url": "https://arxiv.org/pdf/2310.13669v1.pdf"
},
{
"title": "A Recipe for Unbounded Data Augmentation in Visual Reinforcement Learning",
"authors": [
"Abdulaziz Almuzairee",
"Nicklas Hansen",
"Henrik I. Christensen"
],
"summary": "Q-learning algorithms are appealing for real-world applications due to their data-efficiency, but they are very prone to overfitting and training instabilities when trained from visual observations. Prior work, namely SVEA, finds that selective application of data augmentation can improve the visual generalization of RL agents without destabilizing training. We revisit its recipe for data augmentation, and find an assumption that limits its effectiveness to augmentations of a photometric nature. Addressing these limitations, we propose a generalized recipe, SADA, that works with wider varieties of augmentations. We benchmark its effectiveness on DMC-GB2 - our proposed extension of the popular DMControl Generalization Benchmark - as well as tasks from Meta-World and the Distracting Control Suite, and find that our method, SADA, greatly improves training stability and generalization of RL agents across a diverse set of augmentations. For visualizations, code and benchmark: see https://aalmuzairee.github.io/SADA/",
"arxiv_id": "2405.17416v2",
"published": "2024-05-27",
"categories": [
"cs.LG",
"cs.CV",
"cs.RO"
],
"url": "https://arxiv.org/abs/2405.17416v2",
"pdf_url": "https://arxiv.org/pdf/2405.17416v2.pdf"
},
{
"title": "Policy-labeled Preference Learning: Is Preference Enough for RLHF?",
"authors": [
"Taehyun Cho",
"Seokhun Ju",
"Seungyub Han",
"Dohyeong Kim",
"Kyungjae Lee",
"Jungwoo Lee"
],
"summary": "To design rewards that align with human goals, Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent technique for learning reward functions from human preferences and optimizing policies via reinforcement learning algorithms. However, existing RLHF methods often misinterpret trajectories as being generated by an optimal policy, causing inaccurate likelihood estimation and suboptimal learning. Inspired by Direct Preference Optimization framework which directly learns optimal policy without explicit reward, we propose policy-labeled preference learning (PPL), to resolve likelihood mismatch issues by modeling human preferences with regret, which reflects behavior policy information. We also provide a contrastive KL regularization, derived from regret-based principles, to enhance RLHF in sequential decision making. Experiments in high-dimensional continuous control tasks demonstrate PPL's significant improvements in offline RLHF performance and its effectiveness in online settings.",
"arxiv_id": "2505.06273v2",
"published": "2025-05-06",
"categories": [
"cs.LG",
"cs.AI"
],
"url": "https://arxiv.org/abs/2505.06273v2",
"pdf_url": "https://arxiv.org/pdf/2505.06273v2.pdf"
},
{
"title": "Dreaming of Many Worlds: Learning Contextual World Models Aids Zero-Shot Generalization",
"authors": [
"Sai Prasanna",
"Karim Farid",
"Raghu Rajan",
"Andr\u00e9 Biedenkapp"
],
"summary": "Zero-shot generalization (ZSG) to unseen dynamics is a major challenge for creating generally capable embodied agents. To address the broader challenge, we start with the simpler setting of contextual reinforcement learning (cRL), assuming observability of the context values that parameterize the variation in the system's dynamics, such as the mass or dimensions of a robot, without making further simplifying assumptions about the observability of the Markovian state. Toward the goal of ZSG to unseen variation in context, we propose the contextual recurrent state-space model (cRSSM), which introduces changes to the world model of Dreamer (v3) (Hafner et al., 2023). This allows the world model to incorporate context for inferring latent Markovian states from the observations and modeling the latent dynamics. Our approach is evaluated on two tasks from the CARL benchmark suite, which is tailored to study contextual RL. Our experiments show that such systematic incorporation of the context improves the ZSG of the policies trained on the \"dreams\" of the world model. We further find qualitatively that our approach allows Dreamer to disentangle the latent state from context, allowing it to extrapolate its dreams to the many worlds of unseen contexts. The code for all our experiments is available at https://github.com/sai-prasanna/dreaming_of_many_worlds.",
"arxiv_id": "2403.10967v2",
"published": "2024-03-16",
"categories": [
"cs.LG",
"cs.AI"
],
"url": "https://arxiv.org/abs/2403.10967v2",
"pdf_url": "https://arxiv.org/pdf/2403.10967v2.pdf"
},
{
"title": "Uni-RLHF: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback",
"authors": [
"Yifu Yuan",
"Jianye Hao",
"Yi Ma",
"Zibin Dong",
"Hebin Liang",
"Jinyi Liu",
"Zhixin Feng",
"Kai Zhao",
"Yan Zheng"
],
"summary": "Reinforcement Learning with Human Feedback (RLHF) has received significant attention for performing tasks without the need for costly manual reward design by aligning human preferences. It is crucial to consider diverse human feedback types and various learning methods in different environments. However, quantifying progress in RLHF with diverse feedback is challenging due to the lack of standardized annotation platforms and widely used unified benchmarks. To bridge this gap, we introduce Uni-RLHF, a comprehensive system implementation tailored for RLHF. It aims to provide a complete workflow from real human feedback, fostering progress in the development of practical problems. Uni-RLHF contains three packages: 1) a universal multi-feedback annotation platform, 2) large-scale crowdsourced feedback datasets, and 3) modular offline RLHF baseline implementations. Uni-RLHF develops a user-friendly annotation interface tailored to various feedback types, compatible with a wide range of mainstream RL environments. We then establish a systematic pipeline of crowdsourced annotations, resulting in large-scale annotated datasets comprising more than 15 million steps across 30+ popular tasks. Through extensive experiments, the results in the collected datasets demonstrate competitive performance compared to those from well-designed manual rewards. We evaluate various design choices and offer insights into their strengths and potential areas of improvement. We wish to build valuable open-source platforms, datasets, and baselines to facilitate the development of more robust and reliable RLHF solutions based on realistic human feedback. The website is available at https://uni-rlhf.github.io/.",
"arxiv_id": "2402.02423v2",
"published": "2024-02-04",
"categories": [
"cs.LG",
"cs.AI",
"cs.HC",
"cs.RO"
],
"url": "https://arxiv.org/abs/2402.02423v2",
"pdf_url": "https://arxiv.org/pdf/2402.02423v2.pdf"
},
{
"title": "On a Connection Between Imitation Learning and RLHF",
"authors": [
"Teng Xiao",
"Yige Yuan",
"Mingxiao Li",
"Zhengyu Chen",
"Vasant G Honavar"
],
"summary": "This work studies the alignment of large language models with preference data from an imitation learning perspective. We establish a close theoretical connection between reinforcement learning from human feedback RLHF and imitation learning (IL), revealing that RLHF implicitly performs imitation learning on the preference data distribution. Building on this connection, we propose DIL, a principled framework that directly optimizes the imitation learning objective. DIL provides a unified imitation learning perspective on alignment, encompassing existing alignment algorithms as special cases while naturally introducing new variants. By bridging IL and RLHF, DIL offers new insights into alignment with RLHF. Extensive experiments demonstrate that DIL outperforms existing methods on various challenging benchmarks.",
"arxiv_id": "2503.05079v1",
"published": "2025-03-07",
"categories": [
"cs.LG"
],
"url": "https://arxiv.org/abs/2503.05079v1",
"pdf_url": "https://arxiv.org/pdf/2503.05079v1.pdf"
},
{
"title": "CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning",
"authors": [
"Hung Le",
"Yue Wang",
"Akhilesh Deepak Gotmare",
"Silvio Savarese",
"Steven C. H. Hoi"
],
"summary": "Program synthesis or code generation aims to generate a program that satisfies a problem specification. Recent approaches using large-scale pretrained language models (LMs) have shown promising results, yet they have some critical limitations. In particular, they often follow a standard supervised fine-tuning procedure to train a code generation model only from the pairs of natural-language problem descriptions and ground-truth programs. Such paradigm largely ignores some important but potentially useful signals in the problem specification such as unit tests, which thus often results in poor performance when solving complex unseen coding tasks. To address the limitations, we propose \"CodeRL\", a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning (RL). Specifically, during training, we treat the code-generating LM as an actor network, and introduce a critic network that is trained to predict the functional correctness of generated programs and provide dense feedback signals to the actor. During inference, we introduce a new generation procedure with a critical sampling strategy that allows a model to automatically regenerate programs based on feedback from example unit tests and critic scores. For the model backbones, we extended the encoder-decoder architecture of CodeT5 with enhanced learning objectives, larger model sizes, and better pretraining data. Our method not only achieves new SOTA results on the challenging APPS benchmark, but also shows strong zero-shot transfer capability with new SOTA results on the simpler MBPP benchmark.",
"arxiv_id": "2207.01780v3",
"published": "2022-07-05",
"categories": [
"cs.LG",
"cs.CL",
"cs.PL"
],
"url": "https://arxiv.org/abs/2207.01780v3",
"pdf_url": "https://arxiv.org/pdf/2207.01780v3.pdf"
},
{
"title": "RLHF Workflow: From Reward Modeling to Online RLHF",
"authors": [
"Hanze Dong",
"Wei Xiong",
"Bo Pang",
"Haoxiang Wang",
"Han Zhao",
"Yingbo Zhou",
"Nan Jiang",
"Doyen Sahoo",
"Caiming Xiong",
"Tong Zhang"
],
"summary": "We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill in this gap and provide a detailed recipe that is easy to reproduce for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models using a diverse set of open-source datasets and use the constructed proxy preference model to approximate human feedback. Then, we discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We have shown that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. Please refer to https://github.com/RLHFlow/RLHF-Reward-Modeling and https://github.com/RLHFlow/Online-RLHF for more detailed information.",
"arxiv_id": "2405.07863v3",
"published": "2024-05-13",
"categories": [
"cs.LG",
"cs.AI",
"cs.CL",
"stat.ML"
],
"url": "https://arxiv.org/abs/2405.07863v3",
"pdf_url": "https://arxiv.org/pdf/2405.07863v3.pdf"
},
{
"title": "Principled Penalty-based Methods for Bilevel Reinforcement Learning and RLHF",
"authors": [
"Han Shen",
"Zhuoran Yang",
"Tianyi Chen"
],
"summary": "Bilevel optimization has recently been applied to many machine learning tasks. However, its applications have been restricted to the supervised learning setting, where static objective functions with benign structures are considered. Bilevel problems such as incentive design, inverse reinforcement learning (RL), and RL from human feedback (RLHF) are often modeled with dynamic objective functions that go beyond simple static objective structures, which poses significant challenges to using existing bilevel solutions. To tackle this new class of bilevel problems, we introduce the first principled algorithmic framework for solving bilevel RL problems through the lens of penalty formulation. We provide theoretical studies of the problem landscape and its penalty-based (policy) gradient algorithms. We demonstrate the effectiveness of our algorithms via simulations in the Stackelberg Markov game, RL from human feedback, and incentive design.",
"arxiv_id": "2402.06886v3",
"published": "2024-02-10",
"categories": [
"cs.LG",
"math.OC",
"stat.ML"
],
"url": "https://arxiv.org/abs/2402.06886v3",
"pdf_url": "https://arxiv.org/pdf/2402.06886v3.pdf"
},
{
"title": "Reinforcement Learning in the Era of LLMs: What is Essential? What is needed? An RL Perspective on RLHF, Prompting, and Beyond",
"authors": [
"Hao Sun"
],
"summary": "Recent advancements in Large Language Models (LLMs) have garnered wide attention and led to successful products such as ChatGPT and GPT-4. Their proficiency in adhering to instructions and delivering harmless, helpful, and honest (3H) responses can largely be attributed to the technique of Reinforcement Learning from Human Feedback (RLHF). In this paper, we aim to link the research in conventional RL to the RL techniques used in LLM research, and to demystify this technique by discussing why, when, and how RL excels. Furthermore, we explore potential future avenues that could either benefit from or contribute to RLHF research. Highlighted Takeaways: 1. RLHF is Online Inverse RL with Offline Demonstration Data. 2. RLHF $>$ SFT because Imitation Learning (and Inverse RL) $>$ Behavior Cloning (BC) by alleviating the problem of compounding error. 3. The RM step in RLHF generates a proxy of the expensive human feedback; this insight can be generalized to other LLM tasks, such as prompting evaluation and optimization, where feedback is also expensive. 4. The policy learning in RLHF is more challenging than conventional problems studied in IRL due to its high action dimensionality and feedback sparsity. 5. The main superiority of PPO over off-policy value-based methods is its stability, gained from (almost) on-policy data and conservative policy updates.",
"arxiv_id": "2310.06147v1",
"published": "2023-10-09",
"categories": [
"cs.LG",
"cs.AI"
],
"url": "https://arxiv.org/abs/2310.06147v1",
"pdf_url": "https://arxiv.org/pdf/2310.06147v1.pdf"
},
{
"title": "Safe RLHF: Safe Reinforcement Learning from Human Feedback",
"authors": [
"Josef Dai",
"Xuehai Pan",
"Ruiyang Sun",
"Jiaming Ji",
"Xinbo Xu",
"Mickel Liu",
"Yizhou Wang",
"Yaodong Yang"
],
"summary": "With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent tension between the objectives of helpfulness and harmlessness presents a significant challenge during LLM training. To address this issue, we propose Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm for human value alignment. Safe RLHF explicitly decouples human preferences regarding helpfulness and harmlessness, effectively avoiding the crowdworkers' confusion about the tension and allowing us to train separate reward and cost models. We formalize the safety concern of LLMs as an optimization task of maximizing the reward function while satisfying specified cost constraints. Leveraging the Lagrangian method to solve this constrained problem, Safe RLHF dynamically adjusts the balance between the two objectives during fine-tuning. Through a three-round fine-tuning using Safe RLHF, we demonstrate a superior ability to mitigate harmful responses while enhancing model performance compared to existing value-aligned algorithms. Experimentally, we fine-tuned the Alpaca-7B using Safe RLHF and aligned it with collected human preferences, significantly improving its helpfulness and harmlessness according to human evaluations.",
"arxiv_id": "2310.12773v1",
"published": "2023-10-19",
"categories": [
"cs.AI",
"cs.LG"
],
"url": "https://arxiv.org/abs/2310.12773v1",
"pdf_url": "https://arxiv.org/pdf/2310.12773v1.pdf"
},
{
"title": "Fly, Fail, Fix: Iterative Game Repair with Reinforcement Learning and Large Multimodal Models",
"authors": [
"Alex Zook",
"Josef Spjut",
"Jonathan Tremblay"
],
"summary": "Game design hinges on understanding how static rules and content translate into dynamic player behavior - something modern generative systems that inspect only a game's code or assets struggle to capture. We present an automated design iteration framework that closes this gap by pairing a reinforcement learning (RL) agent, which playtests the game, with a large multimodal model (LMM), which revises the game based on what the agent does. In each loop the RL player completes several episodes, producing (i) numerical play metrics and/or (ii) a compact image strip summarising recent video frames. The LMM designer receives a gameplay goal and the current game configuration, analyses the play traces, and edits the configuration to steer future behaviour toward the goal. We demonstrate results that LMMs can reason over behavioral traces supplied by RL agents to iteratively refine game mechanics, pointing toward practical, scalable tools for AI-assisted game design.",
"arxiv_id": "2507.12666v1",
"published": "2025-07-16",
"categories": [
"cs.AI",
"cs.LG"
],
"url": "https://arxiv.org/abs/2507.12666v1",
"pdf_url": "https://arxiv.org/pdf/2507.12666v1.pdf"
},
{
"title": "DPO Meets PPO: Reinforced Token Optimization for RLHF",
"authors": [
"Han Zhong",
"Zikang Shan",
"Guhao Feng",
"Wei Xiong",
"Xinle Cheng",
"Li Zhao",
"Di He",
"Jiang Bian",
"Liwei Wang"
],
"summary": "In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards -- a challenging scenario in traditional deep reinforcement learning. Despite the great successes of PPO in the alignment of large language models, its open-source implementation is still largely sub-optimal. To address these issues, we introduce a framework that models RLHF problems as a Markov decision process (MDP), enabling the capture of fine-grained token-wise information. Under this framework, we introduce an algorithm Reinforced Token Optimization (\\texttt{RTO}), which learns the token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal. Theoretically, \\texttt{RTO} is proven to have the capability of finding the near-optimal policy sample-efficiently. For its practical implementation, \\texttt{RTO} innovatively integrates Direct Preference Optimization (DPO) and PPO. DPO, originally derived from sparse sentence rewards, surprisingly provides us with a token-wise characterization of response quality, which is seamlessly incorporated into our subsequent PPO training stage. Extensive experiments demonstrate that \\texttt{RTO} performs better than PPO and other direct preference learning algorithms. In particular, RTO outperforms PPO by 7.5 points on the AlpacaEval 2 benchmark and by 4.1 points on Arena-Hard. Our code and models are available at \\href{https://github.com/zkshan2002/RTO}{https://github.com/zkshan2002/RTO}.",
"arxiv_id": "2404.18922v4",
"published": "2024-04-29",
"categories": [
"cs.LG",
"cs.AI",
"cs.CL",
"stat.ML"
],
"url": "https://arxiv.org/abs/2404.18922v4",
"pdf_url": "https://arxiv.org/pdf/2404.18922v4.pdf"
},
{
"title": "The Hidden Link Between RLHF and Contrastive Learning",
"authors": [
"Xufei Lv",
"Kehai Chen",
"Haoyuan Sun",
"Xuefeng Bai",
"Min Zhang",
"Houde Liu"
],
"summary": "Alignment of large language models (LLMs) with human values has recently garnered significant attention, with prominent examples including the canonical yet costly Reinforcement Learning from Human Feedback (RLHF) and the simple Direct Preference Optimization (DPO). In this work, we demonstrate that both RLHF and DPO can be interpreted from the perspective of mutual information (MI) maximization, uncovering a profound connection to contrastive learning. Within this framework, both RLHF and DPO can be interpreted as methods that perform contrastive learning based on the positive and negative samples derived from the base model, leveraging the Donsker-Varadhan (DV) lower bound on MI (equivalently, the MINE estimator). This paradigm further illuminates why RLHF may not intrinsically incentivize reasoning capacities in LLMs beyond what is already present in the base model. Building on this perspective, we replace the DV/MINE bound with the Jensen-Shannon (JS) MI estimator and propose Mutual Information Optimization (MIO). Comprehensive theoretical analysis and extensive empirical evaluations demonstrate that MIO mitigates the late-stage decline in chosen-likelihood observed in DPO, achieving competitive or superior performance across various challenging reasoning and mathematical benchmarks.",
"arxiv_id": "2506.22578v2",
"published": "2025-06-27",
"categories": [
"cs.LG",
"cs.AI",
"stat.ML"
],
"url": "https://arxiv.org/abs/2506.22578v2",
"pdf_url": "https://arxiv.org/pdf/2506.22578v2.pdf"
},
{
"title": "Process-Supervised Reinforcement Learning for Code Generation",
"authors": [
"Yufan Ye",
"Ting Zhang",
"Wenbin Jiang",
"Hua Huang"
],
"summary": "Existing reinforcement learning strategies based on outcome supervision have proven effective in enhancing the performance of large language models (LLMs) for code generation. While reinforcement learning based on process supervision has shown great promise in handling multi-step reasoning tasks, its effectiveness in code generation remains largely underexplored and underjustified. The primary obstacle stems from the resource-intensive nature of constructing high-quality process-supervised data, which demands substantial human expertise and computational resources. In response to this challenge, we propose a \"statement mutation/refactoring-compile and execution verification\" strategy: mutating and refactoring code line-by-line through a teacher model, and utilizing compiler execution results to automatically label each line, resulting in line-by-line process-supervised data, which is pivotal for training a process-supervised reward model. The trained reward model is then integrated into the PRLCoder framework, followed by experimental validation on several benchmarks. Experimental results demonstrate that process-supervised reinforcement learning significantly surpasses methods relying solely on outcome supervision. Notably, in tackling complex code generation tasks, process-supervised reinforcement learning shows a clear advantage, ensuring both the integrity of the code generation process and the correctness of the generation results.",
"arxiv_id": "2502.01715v1",
"published": "2025-02-03",
"categories": [
"cs.SE",
"cs.AI"
],
"url": "https://arxiv.org/abs/2502.01715v1",
"pdf_url": "https://arxiv.org/pdf/2502.01715v1.pdf"
},
{
"title": "Reinforcement Learning under State and Outcome Uncertainty: A Foundational Distributional Perspective",
"authors": [
"Larry Preuett",
"Qiuyi Zhang",
"Muhammad Aurangzeb Ahmad"
],
"summary": "In many real-world planning tasks, agents must tackle uncertainty about the environment's state and variability in the outcomes of any chosen policy. We address both forms of uncertainty as a first step toward safer algorithms in partially observable settings. Specifically, we extend Distributional Reinforcement Learning (DistRL)-which models the entire return distribution for fully observable domains-to Partially Observable Markov Decision Processes (POMDPs), allowing an agent to learn the distribution of returns for each conditional plan. Concretely, we introduce new distributional Bellman operators for partial observability and prove their convergence under the supremum p-Wasserstein metric. We also propose a finite representation of these return distributions via psi-vectors, generalizing the classical alpha-vectors in POMDP solvers. Building on this, we develop Distributional Point-Based Value Iteration (DPBVI), which integrates psi-vectors into a standard point-based backup procedure-bridging DistRL and POMDP planning. By tracking return distributions, DPBVI naturally enables risk-sensitive control in domains where rare, high-impact events must be carefully managed. We provide source code to foster further research in robust decision-making under partial observability.",
"arxiv_id": "2505.06518v2",
"published": "2025-05-10",
"categories": [
"cs.AI"
],
"url": "https://arxiv.org/abs/2505.06518v2",
"pdf_url": "https://arxiv.org/pdf/2505.06518v2.pdf"
},
{
"title": "An Investigation of Offline Reinforcement Learning in Factorisable Action Spaces",
"authors": [
"Alex Beeson",
"David Ireland",
"Giovanni Montana"
],
"summary": "Expanding reinforcement learning (RL) to offline domains generates promising prospects, particularly in sectors where data collection poses substantial challenges or risks. Pivotal to the success of transferring RL offline is mitigating overestimation bias in value estimates for state-action pairs absent from data. Whilst numerous approaches have been proposed in recent years, these tend to focus primarily on continuous or small-scale discrete action spaces. Factorised discrete action spaces, on the other hand, have received relatively little attention, despite many real-world problems naturally having factorisable actions. In this work, we undertake a formative investigation into offline reinforcement learning in factorisable action spaces. Using value-decomposition as formulated in DecQN as a foundation, we present the case for a factorised approach and conduct an extensive empirical evaluation of several offline techniques adapted to the factorised setting. In the absence of established benchmarks, we introduce a suite of our own comprising datasets of varying quality and task complexity. Advocating for reproducible research and innovation, we make all datasets available for public use alongside our code base.",
"arxiv_id": "2411.11088v1",
"published": "2024-11-17",
"categories": [
"stat.ML",
"cs.LG"
],
"url": "https://arxiv.org/abs/2411.11088v1",
"pdf_url": "https://arxiv.org/pdf/2411.11088v1.pdf"
},
{
"title": "On Generalization Across Environments In Multi-Objective Reinforcement Learning",
"authors": [
"Jayden Teoh",
"Pradeep Varakantham",
"Peter Vamplew"
],
"summary": "Real-world sequential decision-making tasks often require balancing trade-offs between multiple conflicting objectives, making Multi-Objective Reinforcement Learning (MORL) an increasingly prominent field of research. Despite recent advances, existing MORL literature has narrowly focused on performance within static environments, neglecting the importance of generalizing across diverse settings. Conversely, existing research on generalization in RL has always assumed scalar rewards, overlooking the inherent multi-objectivity of real-world problems. Generalization in the multi-objective context is fundamentally more challenging, as it requires learning a Pareto set of policies addressing varying preferences across multiple objectives. In this paper, we formalize the concept of generalization in MORL and how it can be evaluated. We then contribute a novel benchmark featuring diverse multi-objective domains with parameterized environment configurations to facilitate future studies in this area. Our baseline evaluations of state-of-the-art MORL algorithms on this benchmark reveal limited generalization capabilities, suggesting significant room for improvement. Our empirical findings also expose limitations in the expressivity of scalar rewards, emphasizing the need for multi-objective specifications to achieve effective generalization. We further analyze the algorithmic complexities within current MORL approaches that could impede the transfer in performance from the single- to multiple-environment settings. This work fills a critical gap and lays the groundwork for future research that brings together two key areas in reinforcement learning: solving multi-objective decision-making problems and generalizing across diverse environments. We make our code available at https://github.com/JaydenTeoh/MORL-Generalization.",
"arxiv_id": "2503.00799v2",
"published": "2025-03-02",
"categories": [
"cs.LG"
],
"url": "https://arxiv.org/abs/2503.00799v2",
"pdf_url": "https://arxiv.org/pdf/2503.00799v2.pdf"
},
{
"title": "Learning to Generate Unit Test via Adversarial Reinforcement Learning",
"authors": [
"Dongjun Lee",
"Changho Hwang",
"Kimin Lee"
],
"summary": "Unit testing is a core practice in programming, enabling systematic evaluation of programs produced by human developers or large language models (LLMs). Given the challenges in writing comprehensive unit tests, LLMs have been employed to automate test generation, yet methods for training LLMs to produce high-quality tests remain underexplored. In this work, we propose UTRL, a novel reinforcement learning framework that trains an LLM to generate high-quality unit tests given a programming instruction. Our key idea is to iteratively train two LLMs, the unit test generator and the code generator, in an adversarial manner via reinforcement learning. The unit test generator is trained to maximize a discrimination reward, which reflects its ability to produce tests that expose faults in the code generator's solutions, and the code generator is trained to maximize a code reward, which reflects its ability to produce solutions that pass the unit tests generated by the test generator. In our experiments, we demonstrate that unit tests generated by Qwen3-4B trained via UTRL show higher quality compared to unit tests generated by the same model trained via supervised fine-tuning on human-written ground-truth unit tests, yielding code evaluations that more closely align with those induced by the ground-truth tests. Moreover, Qwen3-4B trained with UTRL outperforms frontier models such as GPT-4.1 in generating high-quality unit tests, highlighting the effectiveness of UTRL in training LLMs for this task.",
"arxiv_id": "2508.21107v2",
"published": "2025-08-28",
"categories": [
"cs.SE",
"cs.AI"
],
"url": "https://arxiv.org/abs/2508.21107v2",
"pdf_url": "https://arxiv.org/pdf/2508.21107v2.pdf"
},
{
"title": "RLHF-Blender: A Configurable Interactive Interface for Learning from Diverse Human Feedback",
"authors": [
"Yannick Metz",
"David Lindner",
"Rapha\u00ebl Baur",
"Daniel Keim",
"Mennatallah El-Assady"
],
"summary": "To use reinforcement learning from human feedback (RLHF) in practical applications, it is crucial to learn reward models from diverse sources of human feedback and to consider human factors involved in providing feedback of different types. However, the systematic study of learning from diverse types of feedback is held back by limited standardized tooling available to researchers. To bridge this gap, we propose RLHF-Blender, a configurable, interactive interface for learning from human feedback. RLHF-Blender provides a modular experimentation framework and implementation that enables researchers to systematically investigate the properties and qualities of human feedback for reward learning. The system facilitates the exploration of various feedback types, including demonstrations, rankings, comparisons, and natural language instructions, as well as studies considering the impact of human factors on their effectiveness. We discuss a set of concrete research opportunities enabled by RLHF-Blender. More information is available at https://rlhfblender.info/.",
"arxiv_id": "2308.04332v1",
"published": "2023-08-08",
"categories": [
"cs.LG",
"cs.HC"
],
"url": "https://arxiv.org/abs/2308.04332v1",
"pdf_url": "https://arxiv.org/pdf/2308.04332v1.pdf"
},
{
"title": "BLIP-FusePPO: A Vision-Language Deep Reinforcement Learning Framework for Lane Keeping in Autonomous Vehicles",
"authors": [
"Seyed Ahmad Hosseini Miangoleh",
"Amin Jalal Aghdasian",
"Farzaneh Abdollahi"
],
"summary": "In this paper, we propose Bootstrapped Language-Image Pretraining-driven Fused State Representation in Proximal Policy Optimization (BLIP-FusePPO), a novel multimodal reinforcement learning (RL) framework for autonomous lane-keeping (LK), in which semantic embeddings generated by a vision-language model (VLM) are directly fused with geometric states, LiDAR observations, and Proportional-Integral-Derivative-based (PID) control feedback within the agent observation space. The proposed method lets the agent learn driving rules that are aware of their surroundings and easy to understand by combining high-level scene understanding from the VLM with low-level control and spatial signals. Our architecture brings together semantic, geometric, and control-aware representations to make policy learning more robust. A hybrid reward function that includes semantic alignment, LK accuracy, obstacle avoidance, and speed regulation helps learning to be more efficient and generalizable. Our method is different from the approaches that only use semantic models to shape rewards. Instead, it directly embeds semantic features into the state representation. This cuts down on expensive runtime inference and makes sure that semantic guidance is always available. The simulation results show that the proposed model is better at LK stability and adaptability than the best vision-based and multimodal RL baselines in a wide range of difficult driving situations. We make our code publicly available.",
"arxiv_id": "2510.22370v1",
"published": "2025-10-25",
"categories": [
"cs.RO",
"cs.AI",
"cs.CV",
"cs.LG",
"cs.SE"
],
"url": "https://arxiv.org/abs/2510.22370v1",
"pdf_url": "https://arxiv.org/pdf/2510.22370v1.pdf"
},
{
"title": "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback",
"authors": [
"Yuntao Bai",
"Andy Jones",
"Kamal Ndousse",
"Amanda Askell",
"Anna Chen",
"Nova DasSarma",
"Dawn Drain",
"Stanislav Fort",
"Deep Ganguli",
"Tom Henighan",
"Nicholas Joseph",
"Saurav Kadavath",
"Jackson Kernion",
"Tom Conerly",
"Sheer El-Showk",
"Nelson Elhage",
"Zac Hatfield-Dodds",
"Danny Hernandez",
"Tristan Hume",
"Scott Johnston",
"Shauna Kravec",
"Liane Lovitt",
"Neel Nanda",
"Catherine Olsson",
"Dario Amodei",
"Tom Brown",
"Jack Clark",
"Sam McCandlish",
"Chris Olah",
"Ben Mann",
"Jared Kaplan"
],
"summary": "We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization. Alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of OOD detection, compare our models with human writers, and provide samples from our models using prompts appearing in recent related work.",
"arxiv_id": "2204.05862v1",
"published": "2022-04-12",
"categories": [
"cs.CL",
"cs.LG"
],
"url": "https://arxiv.org/abs/2204.05862v1",
"pdf_url": "https://arxiv.org/pdf/2204.05862v1.pdf"
},
{
"title": "Reinforcement Learning and Data-Generation for Syntax-Guided Synthesis",
"authors": [
"Julian Parsert",
"Elizabeth Polgreen"
],
"summary": "Program synthesis is the task of automatically generating code based on a specification. In Syntax-Guided Synthesis (SyGuS) this specification is a combination of a syntactic template and a logical formula, and the result is guaranteed to satisfy both. We present a reinforcement-learning guided algorithm for SyGuS which uses Monte-Carlo Tree Search (MCTS) to search the space of candidate solutions. Our algorithm learns policy and value functions which, combined with the upper confidence bound for trees, allow it to balance exploration and exploitation. A common challenge in applying machine learning approaches to syntax-guided synthesis is the scarcity of training data. To address this, we present a method for automatically generating training data for SyGuS based on anti-unification of existing first-order satisfiability problems, which we use to train our MCTS policy. We implement and evaluate this setup and demonstrate that learned policy and value improve the synthesis performance over a baseline by over 26 percentage points in the training and testing sets. Our tool outperforms state-of-the-art tool cvc5 on the training set and performs comparably in terms of the total number of problems solved on the testing set (solving 23% of the benchmarks on which cvc5 fails). We make our data set publicly available, to enable further application of machine learning methods to the SyGuS problem.",
"arxiv_id": "2307.09564v2",
"published": "2023-07-13",
"categories": [
"cs.AI"
],
"url": "https://arxiv.org/abs/2307.09564v2",
"pdf_url": "https://arxiv.org/pdf/2307.09564v2.pdf"
},
{
"title": "CosmoCore Affective Dream-Replay Reinforcement Learning for Code Generation",
"authors": [
"Santhosh Kumar Ravindran"
],
"summary": "We introduce CosmoCore, a neuroscience-inspired reinforcement learning (RL) architecture that integrates affective signals to enhance code generation in large language models (LLMs). Motivated by human and animal learning, where embarrassment from mistakes drives rapid correction (as observed in training a puppy to avoid repeating errors after a single scolding), CosmoCore tags code generation trajectories with valence and surprise using a lightweight multi-layer perceptron (MLP). High-negative-valence (cringe) episodes, such as buggy code outputs, are prioritized in a Dream Queue for five-fold replay during off-policy updates, while low-surprise successes are pruned to prevent overconfidence and buffer bloat. Evaluated on code generation benchmarks like HumanEval and BigCodeBench, alongside simulations with a custom data pipeline environment, CosmoCore reduces hallucinated code (e.g., syntax errors or logical bugs) by 48\\% and accelerates self-correction by 45\\%. Local experiments using Hugging Face models in a PySpark environment validate these gains, with code snippets provided for replication. Ablations confirm that valence tagging boosts curiosity in exploration and that pruning mitigates inefficiency. This framework extends RL from human feedback (RLHF) toward more emotionally aware code assistants, with applications in IDEs and data pipelines. Code and the custom mini-world simulation are released.",
"arxiv_id": "2510.18895v1",
"published": "2025-10-20",
"categories": [
"cs.SE",
"cs.AI",
"cs.HC"
],
"url": "https://arxiv.org/abs/2510.18895v1",
"pdf_url": "https://arxiv.org/pdf/2510.18895v1.pdf"
},
{
"title": "Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF",
"authors": [
"Anand Siththaranjan",
"Cassidy Laidlaw",
"Dylan Hadfield-Menell"
],
"summary": "In practice, preference learning from human feedback depends on incomplete data with hidden context. Hidden context refers to data that affects the feedback received, but which is not represented in the data used to train a preference model. This captures common issues of data collection, such as having human annotators with varied preferences, cognitive processes that result in seemingly irrational behavior, and combining data labeled according to different criteria. We prove that standard applications of preference learning, including reinforcement learning from human feedback (RLHF), implicitly aggregate over hidden contexts according to a well-known voting rule called Borda count. We show this can produce counter-intuitive results that are very different from other methods which implicitly aggregate via expected utility. Furthermore, our analysis formalizes the way that preference learning from users with diverse values tacitly implements a social choice function. A key implication of this result is that annotators have an incentive to misreport their preferences in order to influence the learned model, leading to vulnerabilities in the deployment of RLHF. As a step towards mitigating these problems, we introduce a class of methods called distributional preference learning (DPL). DPL methods estimate a distribution of possible score values for each alternative in order to better account for hidden context. Experimental results indicate that applying DPL to RLHF for LLM chatbots identifies hidden context in the data and significantly reduces subsequent jailbreak vulnerability. Our code and data are available at https://github.com/cassidylaidlaw/hidden-context",
"arxiv_id": "2312.08358v2",
"published": "2023-12-13",
"categories": [
"cs.LG",
"cs.AI",
"stat.ML"
],
"url": "https://arxiv.org/abs/2312.08358v2",
"pdf_url": "https://arxiv.org/pdf/2312.08358v2.pdf"
},
{
"title": "Learning a Pessimistic Reward Model in RLHF",
"authors": [
"Yinglun Xu",
"Hangoo Kang",
"Tarun Suresh",
"Yuxuan Wan",
"Gagandeep Singh"
],
"summary": "This work proposes `PET', a novel pessimistic reward fine-tuning method, to learn a pessimistic reward model robust against reward hacking in offline reinforcement learning from human feedback (RLHF). Traditional reward modeling techniques in RLHF train an imperfect reward model, on which a KL regularization plays a pivotal role in mitigating reward hacking when optimizing a policy. Such an intuition-based method still suffers from reward hacking, and the policies with large KL divergence from the dataset distribution are excluded during learning. In contrast, we show that when optimizing a policy on a pessimistic reward model fine-tuned through PET, reward hacking can be prevented without relying on any regularization. We test our methods on the standard TL;DR summarization dataset. We find that one can learn a high-quality policy on our pessimistic reward without using any regularization. Such a policy has a high KL divergence from the dataset distribution while having high performance in practice. In summary, our work shows the feasibility of learning a pessimistic reward model against reward hacking. The agent can greedily search for the policy with a high pessimistic reward without suffering from reward hacking.",
"arxiv_id": "2505.20556v1",
"published": "2025-05-26",
"categories": [
"cs.LG"
],
"url": "https://arxiv.org/abs/2505.20556v1",
"pdf_url": "https://arxiv.org/pdf/2505.20556v1.pdf"
},
{
"title": "Can Differentiable Decision Trees Enable Interpretable Reward Learning from Human Feedback?",
"authors": [
"Akansha Kalra",
"Daniel S. Brown"
],
"summary": "Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular paradigm for capturing human intent to alleviate the challenges of hand-crafting the reward values. Despite the increasing interest in RLHF, most works learn black box reward functions that, while expressive, are difficult to interpret and often require running the whole costly process of RL before we can even decipher if these frameworks are actually aligned with human preferences. We propose and evaluate a novel approach for learning expressive and interpretable reward functions from preferences using Differentiable Decision Trees (DDTs). Our experiments across several domains, including CartPole, Visual Gridworld environments and Atari games, provide evidence that the tree structure of our learned reward function is useful in determining the extent to which the reward function is aligned with human preferences. We also provide experimental evidence that not only shows that reward DDTs can often achieve competitive RL performance when compared with larger capacity deep neural network reward functions but also demonstrates the diagnostic utility of our framework in checking alignment of learned reward functions. We also observe that the choice between soft and hard (argmax) output of reward DDT reveals a tension between wanting highly shaped rewards to ensure good RL performance and wanting simpler, more interpretable rewards. Videos and code are available at: https://sites.google.com/view/ddt-rlhf",
"arxiv_id": "2306.13004v6",
"published": "2023-06-22",
"categories": [
"cs.LG",
"cs.AI"
],
"url": "https://arxiv.org/abs/2306.13004v6",
"pdf_url": "https://arxiv.org/pdf/2306.13004v6.pdf"
},
{
"title": "Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO",
"authors": [
"Ruizhe Shi",
"Minhak Song",
"Runlong Zhou",
"Zihan Zhang",
"Maryam Fazel",
"Simon S. Du"
],
"summary": "We present a fine-grained theoretical analysis of the performance gap between reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under a representation gap. Our study decomposes this gap into two sources: an explicit representation gap under exact optimization and an implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the final policy qualities. We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specification. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is implicitly sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model -- highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.",
"arxiv_id": "2505.19770v2",
"published": "2025-05-26",
"categories": [
"cs.LG",
"cs.CL"
],
"url": "https://arxiv.org/abs/2505.19770v2",
"pdf_url": "https://arxiv.org/pdf/2505.19770v2.pdf"
},
{
"title": "Klear-CodeTest: Scalable Test Case Generation for Code Reinforcement Learning",
"authors": [
"Jia Fu",
"Xinyu Yang",
"Hongzhi Zhang",
"Yahui Liu",
"Jingyuan Zhang",
"Qi Wang",
"Fuzheng Zhang",
"Guorui Zhou"
],
"summary": "Precise, correct feedback is crucial for effectively training large language models (LLMs) in code reinforcement learning. However, synthesizing high-quality test cases remains a profoundly challenging and unsolved problem. In this work, we present Klear-CodeTest, a comprehensive test case synthesis framework featuring rigorous verification to ensure quality and reliability of test cases. Our approach achieves broad coverage of programming problems via a novel Generator-Validation (G-V) framework, ensuring correctness through a consistency validation mechanism that verifies outputs against gold solutions. The proposed G-V framework generates comprehensive test cases including both regular and corner cases, enhancing test coverage and discriminative power for solution correctness assessment in code reinforcement learning. In addition, we design a multi-layered security sandbox system optimized for online verification platforms, guaranteeing safe and reliable code execution. Through comprehensive experiments, we demonstrate the effectiveness of our curated dataset, showing significant improvements in model performance and training stability. The source codes, curated dataset and sandbox system are available at: https://github.com/Kwai-Klear/CodeTest.",
"arxiv_id": "2508.05710v2",
"published": "2025-08-07",
"categories": [
"cs.SE",
"cs.AI"
],
"url": "https://arxiv.org/abs/2508.05710v2",
"pdf_url": "https://arxiv.org/pdf/2508.05710v2.pdf"
},
{
"title": "Generative RLHF-V: Learning Principles from Multi-modal Human Preference",
"authors": [
"Jiayi Zhou",
"Jiaming Ji",
"Boyuan Chen",
"Jiapeng Sun",
"Wenqi Chen",
"Donghai Hong",
"Sirui Han",
"Yike Guo",
"Yaodong Yang"
],
"summary": "Training multi-modal large language models (MLLMs) that align with human intentions is a long-term challenge. Traditional score-only reward models for alignment suffer from low accuracy, weak generalization, and poor interpretability, blocking the progress of alignment methods, e.g., reinforcement learning from human feedback (RLHF). Generative reward models (GRMs) leverage MLLMs' intrinsic reasoning capabilities to discriminate pair-wise responses, but their pair-wise paradigm makes it hard to generalize to learnable rewards. We introduce Generative RLHF-V, a novel alignment framework that integrates GRMs with multi-modal RLHF. We propose a two-stage pipeline: $\\textbf{multi-modal generative reward modeling from RL}$, where RL guides GRMs to actively capture human intention, then predict the correct pair-wise scores; and $\\textbf{RL optimization from grouped comparison}$, which enhances multi-modal RL scoring precision by grouped responses comparison. Experimental results demonstrate that, besides out-of-distribution generalization of RM discrimination, our framework improves 4 MLLMs' performance across 7 benchmarks by $18.1\\%$, while the baseline RLHF is only $5.3\\%$. We further validate that Generative RLHF-V achieves a near-linear improvement with an increasing number of candidate responses. Our code and models can be found at https://generative-rlhf-v.github.io.",
"arxiv_id": "2505.18531v1",
"published": "2025-05-24",
"categories": [
"cs.AI",
"cs.CV"
],
"url": "https://arxiv.org/abs/2505.18531v1",
"pdf_url": "https://arxiv.org/pdf/2505.18531v1.pdf"
},
{
"title": "GALOIS: Boosting Deep Reinforcement Learning via Generalizable Logic Synthesis",
"authors": [
"Yushi Cao",
"Zhiming Li",
"Tianpei Yang",
"Hao Zhang",
"Yan Zheng",
"Yi Li",
"Jianye Hao",
"Yang Liu"
],
"summary": "Despite achieving superior performance in human-level control problems, deep reinforcement learning (DRL), unlike humans, lacks high-order intelligence (e.g., logic deduction and reuse), and thus performs less effectively than humans in learning and generalization for complex problems. Previous works attempt to directly synthesize a white-box logic program as the DRL policy, manifesting logic-driven behaviors. However, most synthesis methods are built on imperative or declarative programming, and each has a distinct limitation. The former ignores the cause-effect logic during synthesis, resulting in low generalizability across tasks. The latter is strictly proof-based, thus failing to synthesize programs with complex hierarchical logic. In this paper, we combine the above two paradigms together and propose a novel Generalizable Logic Synthesis (GALOIS) framework to synthesize hierarchical and strict cause-effect logic programs. GALOIS leverages the program sketch and defines a new sketch-based hybrid program language for guiding the synthesis. Based on that, GALOIS proposes a sketch-based program synthesis method to automatically generate white-box programs with generalizable and interpretable cause-effect logic. Extensive evaluations on various decision-making tasks with complex logic demonstrate the superiority of GALOIS over mainstream baselines regarding asymptotic performance, generalizability, and knowledge reusability across different environments.",
"arxiv_id": "2205.13728v1",
"published": "2022-05-27",
"categories": [
"cs.AI"
],
"url": "https://arxiv.org/abs/2205.13728v1",
"pdf_url": "https://arxiv.org/pdf/2205.13728v1.pdf"
},
{
"title": "Measuring memorization in RLHF for code completion",
"authors": [
"Aneesh Pappu",
"Billy Porter",
"Ilia Shumailov",
"Jamie Hayes"
],
"summary": "Reinforcement learning with human feedback (RLHF) has become the dominant method to align large models to user preferences. Unlike fine-tuning, for which there are many studies regarding training data memorization, it is not clear how memorization is affected by or introduced in the RLHF alignment process. Understanding this relationship is important as real user data may be collected and used to align large models; if user data is memorized during RLHF and later regurgitated, this could raise privacy concerns. In addition to RLHF, other methods such as Direct Preference Optimization (DPO) and $\u03a8$PO have gained popularity for learning directly from human preferences, removing the need for optimizing intermediary reward models with reinforcement learning. In this work, we analyze how training data memorization can surface and propagate through each phase of RLHF and direct preference learning. We focus our study on code completion models, as code completion is one of the most popular use cases for large language models. We find that RLHF significantly decreases the chance that data used for reward modeling and reinforcement learning is memorized in comparison to directly fine-tuning on this data, but that examples already memorized during the fine-tuning stage of RLHF, will, in the majority of cases, remain memorized after RLHF. In contrast, we find that aligning by learning directly from human preference data via a special case of $\u03a8$PO, Identity Preference Optimization (IPO), increases the likelihood that training data is regurgitated compared to RLHF. Our work suggests that RLHF, as opposed to direct preference learning, is a safer way to mitigate the risk of regurgitating sensitive preference data when aligning large language models. We find our conclusions are robust across multiple code completion datasets, tasks, and model scales.",
"arxiv_id": "2406.11715v2",
"published": "2024-06-17",
"categories": [
"cs.LG",
"cs.CL",
"cs.SE"
],
"url": "https://arxiv.org/abs/2406.11715v2",
"pdf_url": "https://arxiv.org/pdf/2406.11715v2.pdf"
},
{
"title": "ReCode: Updating Code API Knowledge with Reinforcement Learning",
"authors": [
"Haoze Wu",
"Yunzhi Yao",
"Wenhao Yu",
"Ningyu Zhang"
],
"summary": "Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. Specifically, we construct a dataset of approximately 2,000 data entries to train the LLMs to perform version migration based on updated information. Then, we introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs' code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs' general code generation abilities. We apply ReCode to various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms the 32B-parameter code instruction-tuned model and the reasoning model with the same architecture. Code is available at https://github.com/zjunlp/ReCode.",
"arxiv_id": "2506.20495v4",
"published": "2025-06-25",
"categories": [
"cs.CL",
"cs.AI",
"cs.IR",
"cs.LG",
"cs.SE"
],
"url": "https://arxiv.org/abs/2506.20495v4",
"pdf_url": "https://arxiv.org/pdf/2506.20495v4.pdf"
},
{
"title": "Improved Tree Search for Automatic Program Synthesis",
"authors": [
"Aran Carmon",
"Lior Wolf"
],
"summary": "In the task of automatic program synthesis, one obtains pairs of matching inputs and outputs and generates a computer program, in a particular domain-specific language (DSL), which given each sample input returns the matching output. A key element is being able to perform an efficient search in the space of valid programs. Here, we suggest a variant of MCTS that leads to state of the art results on two vastly different DSLs. The exploration method we propose includes multiple contributions: a modified visit count, a preprocessing procedure for the training dataset, and encoding the part of the program that was already executed.",
"arxiv_id": "2303.07166v1",
"published": "2023-03-13",
"categories": [
"cs.LG",
"cs.PL",
"cs.SE"
],
"url": "https://arxiv.org/abs/2303.07166v1",
"pdf_url": "https://arxiv.org/pdf/2303.07166v1.pdf"
},
{
"title": "Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey",
"authors": [
"Junqiao Wang",
"Zeng Zhang",
"Yangfan He",
"Zihao Zhang",
"Xinyuan Song",
"Yuyang Song",
"Tianyu Shi",
"Yuchen Li",
"Hengyuan Xu",
"Kunyu Wu",
"Xin Yi",
"Zhongwei Wan",
"Xinhang Yuan",
"Zijun Wang",
"Kuan Lu",
"Menghao Huo",
"Tang Jingqun",
"Guangwu Qian",
"Keqin Li",
"Qiuwu Chen",
"Lewei He"
],
"summary": "With the rapid evolution of large language models (LLM), reinforcement learning (RL) has emerged as a pivotal technique for code generation and optimization in various domains. This paper presents a systematic survey of the application of RL in code optimization and generation, highlighting its role in enhancing compiler optimization, resource allocation, and the development of frameworks and tools. Subsequent sections first delve into the intricate processes of compiler optimization, where RL algorithms are leveraged to improve efficiency and resource utilization. The discussion then progresses to the function of RL in resource allocation, emphasizing register allocation and system optimization. We also explore the burgeoning role of frameworks and tools in code generation, examining how RL can be integrated to bolster their capabilities. This survey aims to serve as a comprehensive resource for researchers and practitioners interested in harnessing the power of RL to advance code generation and optimization techniques.",
"arxiv_id": "2412.20367v5",
"published": "2024-12-29",
"categories": [
"cs.SE",
"cs.CL"
],
"url": "https://arxiv.org/abs/2412.20367v5",
"pdf_url": "https://arxiv.org/pdf/2412.20367v5.pdf"
},
{
"title": "Proof2Silicon: Prompt Repair for Verified Code and Hardware Generation via Reinforcement Learning",
"authors": [
"Manvi Jha",
"Jiaxin Wan",
"Deming Chen"
],
"summary": "Large Language Models (LLMs) have demonstrated impressive capabilities in automated code generation but frequently produce code that fails formal verification, an essential requirement for hardware and safety-critical domains. To overcome this fundamental limitation, we previously proposed PREFACE, a model-agnostic framework based on reinforcement learning (RL) that iteratively repairs the prompts provided to frozen LLMs, systematically steering them toward generating formally verifiable Dafny code without costly fine-tuning. This work presents Proof2Silicon, a novel end-to-end synthesis framework that embeds the previously proposed PREFACE flow to enable the generation of correctness-by-construction hardware directly from natural language specifications. Proof2Silicon operates by: (1) leveraging PREFACE's verifier-driven RL agent to optimize prompt generation iteratively, ensuring Dafny code correctness; (2) automatically translating verified Dafny programs into synthesizable high-level C using Dafny's Python backend and PyLog; and (3) employing Vivado HLS to produce RTL implementations. Evaluated rigorously on a challenging 100-task benchmark, PREFACE's RL-guided prompt optimization consistently improved Dafny verification success rates across diverse LLMs by up to 21%. Crucially, Proof2Silicon achieved an end-to-end hardware synthesis success rate of up to 72%, generating RTL designs through Vivado HLS synthesis flows. These results demonstrate a robust, scalable, and automated pipeline for LLM-driven, formally verified hardware synthesis, bridging natural-language specification and silicon realization.",
"arxiv_id": "2509.06239v2",
"published": "2025-09-07",
"categories": [
"cs.AI"
],
"url": "https://arxiv.org/abs/2509.06239v2",
"pdf_url": "https://arxiv.org/pdf/2509.06239v2.pdf"
},
{
"title": "SharedRep-RLHF: A Shared Representation Approach to RLHF with Diverse Preferences",
"authors": [
"Arpan Mukherjee",
"Marcello Bullo",
"Deniz G\u00fcnd\u00fcz"
],
"summary": "Uniform-reward reinforcement learning from human feedback (RLHF), which trains a single reward model to represent the preferences of all annotators, fails to capture the diversity of opinions across sub-populations, inadvertently favoring dominant groups. The state-of-the-art, MaxMin-RLHF, addresses this by learning group-specific reward models, and by optimizing for the group receiving the minimum reward, thereby promoting fairness. However, we identify that a key limitation of MaxMin-RLHF is its poor performance when the minimum-reward group is a minority. To mitigate this drawback, we introduce a novel framework, termed {\\em SharedRep-RLHF}. At its core, SharedRep-RLHF learns and leverages {\\em shared traits} in annotations among various groups, in contrast to learning separate reward models across groups. We first show that MaxMin-RLHF is provably suboptimal in learning shared traits, and then quantify the sample complexity of SharedRep-RLHF. Experiments across diverse natural language tasks showcase the effectiveness of SharedRep-RLHF compared to MaxMin-RLHF with a gain of up to 20% in win rate.",
"arxiv_id": "2509.03672v1",
"published": "2025-09-03",
"categories": [
"cs.LG",
"stat.ML"
],
"url": "https://arxiv.org/abs/2509.03672v1",
"pdf_url": "https://arxiv.org/pdf/2509.03672v1.pdf"
},
{
"title": "Gradual Transition from Bellman Optimality Operator to Bellman Operator in Online Reinforcement Learning",
"authors": [
"Motoki Omura",
"Kazuki Ota",
"Takayuki Osa",
"Yusuke Mukuta",
"Tatsuya Harada"
],
"summary": "For continuous action spaces, actor-critic methods are widely used in online reinforcement learning (RL). However, unlike RL algorithms for discrete actions, which generally model the optimal value function using the Bellman optimality operator, RL algorithms for continuous actions typically model Q-values for the current policy using the Bellman operator. These algorithms for continuous actions rely exclusively on policy updates for improvement, which often results in low sample efficiency. This study examines the effectiveness of incorporating the Bellman optimality operator into actor-critic frameworks. Experiments in a simple environment show that modeling optimal values accelerates learning but leads to overestimation bias. To address this, we propose an annealing approach that gradually transitions from the Bellman optimality operator to the Bellman operator, thereby accelerating learning while mitigating bias. Our method, combined with TD3 and SAC, significantly outperforms existing approaches across various locomotion and manipulation tasks, demonstrating improved performance and robustness to hyperparameters related to optimality. The code for this study is available at https://github.com/motokiomura/annealed-q-learning.",
"arxiv_id": "2506.05968v2",
"published": "2025-06-06",
"categories": [
"cs.LG",
"cs.AI",
"cs.RO"
],
"url": "https://arxiv.org/abs/2506.05968v2",
"pdf_url": "https://arxiv.org/pdf/2506.05968v2.pdf"
},
{
"title": "Large Language Model for Verilog Generation with Code-Structure-Guided Reinforcement Learning",
"authors": [
"Ning Wang",
"Bingkun Yao",
"Jie Zhou",
"Xi Wang",
"Zhe Jiang",
"Nan Guan"
],
"summary": "Recent advancements in large language models (LLMs) have sparked significant interest in the automatic generation of Register Transfer Level (RTL) designs, particularly using Verilog. Current research on this topic primarily focuses on pre-training and instruction tuning, but the effectiveness of these methods is constrained by the limited availability of training data, as public Verilog code is far less abundant than software code. In particular, these methods struggle to effectively capture Verilog parallel code structures, which fundamentally differ from the imperative, sequential control flow typical in most software programming languages. This paper introduces VeriSeek, an LLM enhanced by reinforcement learning using a limited amount of high-quality training data to achieve high Verilog code generation performance. Our reinforcement learning approach employs code structure information as feedback signals to refine the pre-trained model, enabling it to effectively learn important patterns from Verilog code with parallel structures. Experiments show that VeriSeek outperforms state-of-the-art methods across multiple benchmarks.",
"arxiv_id": "2407.18271v4",
"published": "2024-07-21",
"categories": [
"cs.AR",
"cs.AI"
],
"url": "https://arxiv.org/abs/2407.18271v4",
"pdf_url": "https://arxiv.org/pdf/2407.18271v4.pdf"
},
{
"title": "HybridFlow: A Flexible and Efficient RLHF Framework",