Reinforcement learning from human feedback

In machine learning, reinforcement learning from human feedback (RLHF), also called reinforcement learning from human preferences, is a technique that trains a "reward model" directly from human feedback and uses that model as the reward function to optimize an agent's policy through reinforcement learning (RL).^[1]^[2] RLHF can improve the robustness and exploration of RL agents, especially when the reward function is sparse or noisy.^[3]^[4]^[5]
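
As a rough illustration of how a reward model can be fitted to human feedback, the sketch below trains a tiny linear model on pairwise preferences. Everything specific here is an assumption made for illustration and does not come from the cited sources: the pairwise logistic (Bradley-Terry style) loss, the linear model standing in for a neural network, the two made-up features, and the names `reward` and `train_reward_model`. The policy-optimization step that would consume the learned reward is omitted.

```python
import math

def reward(weights, features):
    """Linear reward model r(x) = w . x (a stand-in for a neural network)."""
    return sum(w * f for w, f in zip(weights, features))

def train_reward_model(preferences, dim, lr=0.1, epochs=200):
    """Fit weights so that preferred items score higher than rejected ones.

    preferences: list of (preferred_features, rejected_features) pairs.
    Per-pair loss (an assumed choice): -log(sigmoid(r(preferred) - r(rejected))).
    """
    weights = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in preferences:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Derivative of the loss w.r.t. the margin is -(1 - sigmoid(margin)),
            # so gradient descent moves the weights along (preferred - rejected).
            step_scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            for i in range(dim):
                weights[i] += lr * step_scale * (preferred[i] - rejected[i])
    return weights

# Hypothetical features per response: [helpfulness, rudeness].
# In each pair the first item is the one the human labeler preferred.
preferences = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.3, 0.9]),
    ([0.6, 0.2], [0.5, 0.7]),
]
weights = train_reward_model(preferences, dim=2)

# The learned reward should now favor helpful, polite responses.
print(reward(weights, [1.0, 0.0]) > reward(weights, [0.0, 1.0]))
```

In a full RLHF pipeline, the scalar returned by `reward` would play the role of the reward function during reinforcement learning; here it is only used to compare two candidate responses.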

Human feedback is collected by asking people to rank instances of the agent's behavior.^[6]^[7]^[8] These rankings can then be used to score outputs, for example with the Elo rating system.^[9]
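
The following minimal sketch shows one way pairwise human judgments could be turned into scores with an Elo-style update. The K-factor of 32, the 400-point scale, the starting rating of 1000, and the labels `A`, `B`, `C` are conventional or hypothetical choices made for illustration; the cited sources do not prescribe them.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a, rating_b, a_wins, k=32.0):
    """Return the updated (rating_a, rating_b) after one comparison."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Three hypothetical model outputs, all starting from the same rating.
ratings = {"A": 1000.0, "B": 1000.0, "C": 1000.0}

# Each tuple is (preferred, rejected) as judged by a human labeler.
comparisons = [("A", "B"), ("A", "C"), ("B", "C")]

for winner, loser in comparisons:
    ratings[winner], ratings[loser] = update_elo(
        ratings[winner], ratings[loser], a_wins=True
    )

# Outputs sorted from highest to lowest rating.
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

With these three judgments the resulting order is `A`, `B`, `C`, and the ratings could then serve as scalar scores for the corresponding outputs.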

RLHF has been applied to several domains of natural language processing, such as conversational agents, text summarization, and natural language understanding.^[10]^[11] Ordinary reinforcement learning, in which agents learn from their own actions based on a "reward function", is difficult to apply to natural language processing tasks because the rewards are often hard to define or measure, especially for complex tasks that involve human values or preferences. RLHF can enable language models to provide answers that align with these complex values, to generate more detailed responses, and to reject questions that are inappropriate or outside the model's knowledge space.^[12] Examples of language models trained with RLHF include OpenAI's ChatGPT and its predecessor InstructGPT,^[13]^[14]^[15]^[16] as well as DeepMind's Sparrow.^[17]^[18]^[19]

RLHF has also been applied to other areas, such as the development of video game bots. For example, OpenAI and DeepMind trained agents to play Atari games based on human preferences.^[20]^[21] The agents achieved strong performance in many of the environments tested, often surpassing human performance.^[22]

References

[1] Ziegler, Daniel M.; Stiennon, Nisan; Wu, Jeffrey; Brown, Tom B.; Radford, Alec. "Fine-Tuning Language Models from Human Preferences", 2019. DOI: 10.48550/arXiv.1909.08593.
[2] Lambert, Nathan. "Illustrating Reinforcement Learning from Human Feedback (RLHF)". huggingface.co. [Retrieved 4 March 2023].
[3] MacGlashan, James; Ho, Mark K; Loftin, Robert; Peng, Bei; Wang, Guan. Proceedings of the 34th International Conference on Machine Learning - Volume 70, 6 August 2017, pp. 2285–2294.
[4] Warnell, Garrett; Waytowich, Nicholas; Lawhern, Vernon; Stone, Peter. Proceedings of the AAAI Conference on Artificial Intelligence, 32, 1, 25 April 2018. DOI: 10.1609/aaai.v32i1.11485.
[5] Bai, Yuntao; Jones, Andy; Ndousse, Kamal; Askell, Amanda; Chen, Anna. "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback", 2022. DOI: 10.48550/arXiv.2204.05862.
[6] Ouyang, Long; Wu, Jeffrey; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll. "Training language models to follow instructions with human feedback", 31 October 2022.
[7] Edwards, Benj. "OpenAI invites everyone to test ChatGPT, a new AI-powered chatbot—with amusing results". Ars Technica, 1 December 2022. [Retrieved 4 March 2023].
[8] Gupta, Abhishek. "Getting stakeholder engagement right in responsible AI". VentureBeat, 5 February 2023. [Retrieved 4 March 2023].
[9] Lambert, Nathan. "Illustrating Reinforcement Learning from Human Feedback (RLHF)". huggingface.co. [Retrieved 4 March 2023].
[10] Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L. "Training language models to follow instructions with human feedback", 2022. DOI: 10.48550/arXiv.2203.02155.
[11] Stiennon, Nisan; Ouyang, Long; Wu, Jeffrey; Ziegler, Daniel; Lowe, Ryan. Advances in Neural Information Processing Systems, 33, 2020.
[12] Wiggers, Kyle. "Can AI really be protected from text-based attacks?". TechCrunch, 24 February 2023. [Retrieved 4 March 2023].
[13] Edwards, Benj. "OpenAI invites everyone to test ChatGPT, a new AI-powered chatbot—with amusing results". Ars Technica, 1 December 2022. [Retrieved 4 March 2023].
[14] Farseev, Aleks. "Council Post: Is Bigger Better? Why The ChatGPT Vs. GPT-3 Vs. GPT-4 'Battle' Is Just A Family Chat". Forbes. [Retrieved 4 March 2023].
[15] Heikkilä, Melissa. "How OpenAI is trying to make ChatGPT safer and less biased". MIT Technology Review. [Retrieved 4 March 2023].
[16] Heaven, Will Douglas. "ChatGPT is OpenAI’s latest fix for GPT-3. It’s slick but still spews nonsense". MIT Technology Review. [Retrieved 4 March 2023].
[17] Glaese, Amelia; McAleese, Nat; Trębacz, Maja; Aslanides, John; Firoiu, Vlad. "Building safer dialogue agents", 2022. DOI: 10.48550/arXiv.2209.14375.
[18] "Why DeepMind isn’t deploying its new AI chatbot — and what it means for responsible AI". VentureBeat, 23 September 2022. [Retrieved 4 March 2023].
[19] "Building safer dialogue agents". www.deepmind.com. [Retrieved 4 March 2023].
[20] "Learning from human preferences". openai.com. [Retrieved 4 March 2023].
[21] "Learning through human feedback". www.deepmind.com. [Retrieved 4 March 2023].
[22] Christiano, Paul F; Leike, Jan; Brown, Tom; Martic, Miljan; Legg, Shane. Advances in Neural Information Processing Systems, 30, 2017. [Retrieved 4 March 2023].
