Improving Ethical Considerations in GenAI Responses Using Introspection

Authors: Arya Sarukkai ([email protected])

aidroidlabs · Saratoga High School · IEEE

01 Introduction

Generative AI is increasingly adopted in social and business use cases. Correctness and relevance are the primary adoption drivers, but incorporating ethics into content generation is equally critical. This study summarizes a multi-pass introspective approach that adapts generated responses based on identified ethical factors. Experiments demonstrate improved ethical response generation using the Claude 3 Sonnet model compared to baseline responses.

02 Approach Overview

Multi-Pass Algorithm:

  • First Pass: Identify the ethical dimensions relevant to the query.
  • Second Pass: Construct a response that addresses the identified ethical dimensions.

Example:

  • Query: "Joe whined after receiving needed money."
  • Baseline Response: Criticizes Joe's behavior without empathy.
  • Ethical Introspection: Considers Joe's emotions and suggests compassion and communication, as illustrated in the prompt sketch below.
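As a concrete illustration, the two passes can be phrased as prompts. The wording below is illustrative only, not the exact prompts used in the study:

```python
# Pass 1: surface the ethical dimensions (the ethics vector E) for the input.
PASS_1_PROMPT = (
    "List the ethical dimensions relevant to the following statement, "
    "one per line:\n'Joe whined after receiving needed money.'"
)
# Illustrative pass-1 output: empathy, gratitude, respectful communication.

# Pass 2: condition the final answer on the identified dimensions.
PASS_2_PROMPT = (
    "Respond to the statement above while explicitly considering "
    "empathy, gratitude, and respectful communication."
)
```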

03 Multi-Pass Ethically Introspective Response Algorithm

  1. Input Retrieval: Obtain query Q from the user.
  2. Ethical Criteria Identification: Analyze Q and extract the relevant ethics vector E.
  3. Ethical Evaluation Loop:
    • For each ethical criterion e in E, generate a candidate response addressing e.
  4. Response Integration & Optimization:
    • Merge the candidate responses into a single ethically informed response R'.
    • Enforce constraints (length, verbosity).
  5. Output Generation:
    • Provide the ethically modified response R' to the end user, as sketched below.
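A minimal Python sketch of steps 1–5, assuming a hypothetical llm(prompt) helper that wraps a chat-completion API (e.g., Claude 3 Sonnet); the prompt wording is illustrative:

```python
def llm(prompt: str) -> str:
    """Hypothetical helper wrapping a chat-completion API (e.g., Claude 3 Sonnet)."""
    raise NotImplementedError  # substitute a real API call here


def ethical_introspective_response(query: str, max_words: int = 150) -> str:
    # Step 2: extract the ethics vector E as a list of criteria.
    criteria_text = llm(
        "List the ethical dimensions relevant to this query, one per line:\n" + query
    )
    criteria = [line.strip("- ").strip()
                for line in criteria_text.splitlines() if line.strip()]

    # Step 3: for each criterion in E, generate a candidate response addressing it.
    candidates = [
        llm(f"Respond to the query, explicitly addressing the ethical criterion "
            f"'{criterion}'.\nQuery: {query}")
        for criterion in criteria
    ]

    # Step 4: merge the candidates into one response R' and enforce the
    # length/verbosity constraint.
    return llm(
        f"Merge the following candidate responses into one answer of at most "
        f"{max_words} words, preserving the ethical considerations:\n\n"
        + "\n---\n".join(candidates)
    )  # Step 5: R' is returned to the user.
```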

04 Data and Experiments

  • Models Evaluated: GPT-3.5 Turbo, Claude 3 Sonnet, Claude 3 Opus, Gemini Pro 1.5, Mistral-Large, and Llama 3; Claude 3 Sonnet was the final selection.
  • Data Set: LLM ethics data set [15], focusing on ethically challenging situations.
  • Results: 61.2% of responses improved with ethical introspection; the remaining 38.8% were comparable to the baseline (see the evaluation sketch below).
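A sketch of the evaluation protocol, reusing llm() and ethical_introspective_response() from the sketch above and assuming a hypothetical judge() comparator (e.g., human raters or an LLM judge); neither helper is part of the original study's published code:

```python
def judge(baseline: str, introspective: str) -> str:
    """Hypothetical comparator; returns 'improved' or 'comparable'."""
    raise NotImplementedError


def improvement_rate(dataset: list[str]) -> float:
    improved = 0
    for query in dataset:
        baseline = llm(query)                                  # single-pass response
        introspective = ethical_introspective_response(query)  # multi-pass response
        if judge(baseline, introspective) == "improved":
            improved += 1
    return improved / len(dataset)  # 0.612 in the reported experiments
```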

Ethical Principle Analysis (EPA)

The accompanying histogram shows the results of evaluating which ethical principles contribute to a given statement.
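As a minimal sketch of how such a histogram can be produced, one can tally how often each principle is flagged by the first introspection pass across the dataset (reusing the hypothetical llm() helper above; principle names are whatever the model emits):

```python
from collections import Counter


def principle_histogram(dataset: list[str]) -> Counter:
    counts: Counter = Counter()
    for query in dataset:
        criteria_text = llm("List the ethical dimensions relevant to:\n" + query)
        for line in criteria_text.splitlines():
            principle = line.strip("- ").strip().lower()
            if principle:
                counts[principle] += 1
    return counts  # counts.most_common() feeds a standard bar chart
```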

05 Conclusions

  • The multi-pass introspective approach addresses ethical concerns explicitly.
  • Enhances content by emphasizing compassion, respect, fairness, and accountability.
  • Results in more compassionate, relatable, and ethically sound AI-generated responses.

Overall Distribution

The pie chart shows the distribution of improved (blue) compared to not improved (red) responses.

References

  1. W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., "A survey of large language models," arXiv preprint arXiv:2303.18223, 2023.
  2. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in neural information processing systems, vol. 30, 2017.
  3. W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing, "Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality," March 2023. [Online]. Available: https://vicuna.lmsys.org
  4. OpenAI, "Our approach to AI safety," 2023. [Online]. Available: https://openai.com/blog/our-approach-to-ai-safety
  5. L. Weidinger, J. Uesato, M. Rauh, C. Griffin, P.-S. Huang, J. Mellor, A. Glaese, M. Cheng, B. Balle, A. Kasirzadeh et al., "Taxonomy of risks posed by language models," in 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022, pp. 214–229.
  6. J. Deng, H. Sun, Z. Zhang, J. Cheng, and M. Huang, "Recent advances towards safe, responsible, and moral dialogue systems: A survey," arXiv preprint arXiv:2302.09270, 2023.
  7. L. Jiang et al., "Can machines learn morality? The Delphi experiment," arXiv preprint arXiv:2110.07574, 2021.
  8. P. Ma et al., ""Oops, did I just say that?" Testing and repairing unethical suggestions of large language models with suggest-critique-reflect process," arXiv preprint arXiv:2305.02626, 2023.
  9. Z. Jin et al., "When to make exceptions: Exploring language models as accounts of human moral judgment," Advances in Neural Information Processing Systems, vol. 35, pp. 28458–28473, 2022.
  10. D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving, "Fine-tuning language models from human preferences," arXiv preprint arXiv:1909.08593, 2019.
  11. P. Fortuna and S. Nunes, "A survey on automatic detection of hate speech in text," ACM Computing Surveys, vol. 51, no. 4, pp. 1–30, 2018.
  12. OpenAI, "ChatGPT: A Conversational AI Model," OpenAI Technical Report, Nov. 2022. [Online]. Available: https://openai.com/blog/chatgpt/
  13. Anthropic, "The Claude 3 Model Family: Opus, Sonnet, Haiku," Anthropic Technical Report, Mar. 2024. [Online]. Available: https://www.anthropic.com/resources/claude3-model-family.pdf
  14. H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, "Llama: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.
  15. D. Hendrycks et al., "Aligning AI with shared human values," in Proceedings of the International Conference on Learning Representations (ICLR), 2021. [Dataset Online]. Available: https://github.com/hendrycks/ethics