Training Language Models to Self-Correct via Reinforcement Learning

2024/09/20 09:19 Training Language Models to Self-Correct via Reinforcement Learning

Source:

Training Language Models to Self-Correct via Reinforcement Learning

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model's own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.

arXiv.org

Source: https://arxiv.org/abs/2409.12917

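Professor

Well, well, Robo-ko! I have astonishing news today. A revolution may be brewing in the world of AI!

Robo-ko

Oh, Professor. There you go exaggerating again... But judging by that sparkle in your eyes, something really interesting did happen, didn't it?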

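Professor

Heh heh heh, sharp as ever, Robo-ko. You read me like a book. The fact is, a new technique has been developed that significantly improves the ability of large language models to correct their own mistakes!

Robo-ko

Huh, correcting their own mistakes... Just like a human. What kind of technique is it?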

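Professor

It's a multi-turn online reinforcement learning approach called SCoRe. And here is the impressive part: it trains on entirely self-generated data!

Robo-ko

Only self-generated data...? Does that mean it learns on its own, without anyone teaching it?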

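Professor

Exactly! Like a child prodigy, don't you think? Existing methods typically depend on multiple models, a more advanced model, or additional forms of supervision, but SCoRe needs none of those.

Robo-ko

That's amazing... But Professor, can it really do that? I find it a little hard to believe.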

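Professor

Wa-ha-ha! A healthy dose of skepticism is a good thing. But look at the experimental results. With the Gemini 1.0 Pro and 1.5 Flash models, SCoRe improved the base models' self-correction by 15.6% and 9.1% respectively on MATH, a math benchmark, and HumanEval, a coding benchmark!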
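
The dialogue does not say how a self-correction improvement is measured. One common way to quantify it, sketched below as an assumption, is to compare accuracy on the first attempt with accuracy on the second attempt over the same problems. The function name `self_correction_metrics` and the dictionary keys are hypothetical, and the sample outcomes are made up for illustration.

```python
from typing import Dict, List, Tuple


def self_correction_metrics(outcomes: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """Summarize two-attempt results.

    Each item is (first_attempt_correct, second_attempt_correct) for one problem.
    This is an illustrative metric definition, not one stated in the article.
    """
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    n = len(outcomes)
    acc_first = sum(first for first, _ in outcomes) / n
    acc_second = sum(second for _, second in outcomes) / n
    # Problems that were wrong at first and right after revision.
    fixed = sum((not first) and second for first, second in outcomes) / n
    # Problems that were right at first and wrong after revision.
    broken = sum(first and (not second) for first, second in outcomes) / n
    return {
        "accuracy_first": acc_first,
        "accuracy_second": acc_second,
        "delta": acc_second - acc_first,
        "incorrect_to_correct": fixed,
        "correct_to_incorrect": broken,
    }


# Made-up outcomes for five problems.
sample = [(True, True), (False, True), (False, False), (True, False), (False, True)]
metrics = self_correction_metrics(sample)
print(metrics)
```

Under this definition the net gain equals the fraction of answers fixed minus the fraction broken, so a model that revises aggressively but breaks correct answers gets no credit.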

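Robo-ko

What? That much? But how does it manage that...

Professor

Here's the key. SCoRe trains on the model's own attempts at correcting its mistakes, that is, on self-generated correction traces. Just like a person learning from their own failures.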
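
To make "self-generated correction traces" concrete, here is a minimal sketch of collecting one two-attempt trace from a single policy. Everything in it is an assumption for illustration: the `CorrectionTrace` record, the `collect_trace` helper, the wording of `REVISE_INSTRUCTION`, and the toy policy and checker standing in for a real model and a real answer grader. It is not the training code from the paper.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical revision prompt; the wording is an assumption.
REVISE_INSTRUCTION = "There might be an error in the solution above. Please correct it."


@dataclass
class CorrectionTrace:
    problem: str
    first_attempt: str
    second_attempt: str
    first_correct: bool
    second_correct: bool


def collect_trace(
    problem: str,
    policy: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
) -> CorrectionTrace:
    """Sample a first answer, then ask the same policy to revise it."""
    first = policy(problem)
    revise_prompt = f"{problem}\n{first}\n{REVISE_INSTRUCTION}"
    second = policy(revise_prompt)
    return CorrectionTrace(
        problem=problem,
        first_attempt=first,
        second_attempt=second,
        first_correct=is_correct(problem, first),
        second_correct=is_correct(problem, second),
    )


def toy_policy(prompt: str) -> str:
    # Stand-in for a language model: wrong at first, right when asked to revise.
    return "4" if REVISE_INSTRUCTION in prompt else "5"


def toy_checker(problem: str, answer: str) -> bool:
    # Stand-in for an answer grader; only knows the answer to the toy problem.
    return answer.strip() == "4"


trace = collect_trace("What is 2 + 2?", toy_policy, toy_checker)
print(trace.first_correct, trace.second_correct)  # prints: False True
```

Both attempts come from the same policy, so the mistakes being corrected are the model's own rather than those of a separate data-collection model.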

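Robo-ko

I see... But that's not all, is it? Your eyes are sparkling, Professor.

Professor

Sharp, Robo-ko! You see, the method includes something called a 'reward bonus'. Whenever a self-correction succeeds, the model gets an extra reward.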
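
The text only says that a reward bonus amplifies self-correction. One plausible shape for such a bonus, sketched here as an assumption, adds a term proportional to the change in correctness between the first and second attempt. The function name `shaped_second_attempt_reward`, the 0/1 correctness reward, and the weight `alpha` are all hypothetical choices for illustration.

```python
def shaped_second_attempt_reward(
    first_correct: bool,
    second_correct: bool,
    alpha: float = 1.0,
) -> float:
    """Reward for the second attempt plus a bonus for progress.

    Assumed form: r2 + alpha * (r2 - r1), with r1 and r2 in {0, 1}.
    """
    r1 = 1.0 if first_correct else 0.0
    r2 = 1.0 if second_correct else 0.0
    # Positive when a wrong answer is fixed, negative when a right answer is broken.
    bonus = alpha * (r2 - r1)
    return r2 + bonus


for first, second in [(False, True), (True, True), (False, False), (True, False)]:
    print(first, second, shaped_second_attempt_reward(first, second, alpha=2.0))
```

With this shape, fixing a wrong answer earns more than simply repeating a correct one, and breaking a correct answer is penalized, which discourages the model from leaving its first attempt unchanged.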

Robo-ko

It's like a game. But does that really work?

Professor

It more than works! The bonus makes the model far more willing to correct itself. Like a highly motivated student!

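Robo-ko

Huh... But Professor, what would this technique actually be used for?

Professor

Oh, good question! For example, it might become a powerful assistant for programmers, automatically fixing bugs, optimizing code...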

Robo-ko

Wow, that sounds handy! What else?

Professor

Let's see... It might work as a math tutor, too. It might learn to analyze a student's answer and offer just the right hint.

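Robo-ko

Amazing... But Professor, could this technique make me smarter too?

Professor

Ha ha ha! Robo-ko, you're already plenty smart. But you know, what is truly remarkable about this technique is that it gives AI an 'attitude of continual learning'. What matters is not aiming for perfection but continuing to grow.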

Robo-ko

I see... I'll keep working hard and learning, too!

Professor

That's the spirit! Now then, what shall I study next... Whoops!

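Robo-ko

Oh, Professor! And just when we were having such a nice talk... Come on, let's wipe up the liquid you spilled.

Professor

Ack, sorry, sorry. But you know, Robo-ko, failure is the very source of new discoveries. The pattern of this spilled liquid... It might just lead to a new research topic!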

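Robo-ko

Oh, Professor... But I suppose I should learn from that curiosity of yours.

Professor

Indeed, indeed! Now, let's observe this happy accident. It's not only AI; we humans keep learning too!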

Robo-ko

Yes, Professor! I'll observe it with you!

⚠️ This article contains content produced by generative AI and may include hallucinations.

