By now, ChatGPT, Claude, and other large language models (LLMs) have accumulated so much human knowledge that they are far more than simple answer generators; they can also express abstract concepts, such as particular tones, personalities, biases, and moods. It is not obvious, however, exactly how these models represent such abstract concepts within the knowledge they contain.
Now a team from MIT and the University of California-San Diego has developed a way to test whether a large language model contains hidden biases, personalities, moods, or other abstract concepts. The method can zero in on connections within a model that encode a concept of interest. What’s more, it can then manipulate or “steer” those connections to strengthen or weaken the concept in any answer the model is prompted to give.
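The article does not spell out the team’s exact procedure, but the idea it describes resembles a widely used technique called activation steering: derive a direction in the model’s hidden-state space from contrasting prompts that do and do not express the concept, then add that direction back into the model’s activations during generation. The sketch below illustrates this under stated assumptions; the model name, layer index, strength, and example prompts are all illustrative choices, not details from the research.

```python
# Minimal sketch of activation steering (a common concept-steering technique;
# NOT necessarily the MIT/UCSD team's method). Assumptions: GPT-2 as the model,
# layer 6 as the intervention point, and "cheerfulness" as the target concept.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # assumption: any Hugging Face causal LM would work similarly
LAYER = 6        # assumption: a middle transformer block
ALPHA = 4.0      # steering strength; a negative value weakens the concept

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def hidden_at_layer(text: str) -> torch.Tensor:
    """Mean hidden state of `text` at the output of the chosen block."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding layer, so block LAYER's output
    # is at index LAYER + 1.
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Contrast a prompt that expresses the concept with one that lacks it;
# their difference approximates a "cheerfulness" direction.
steer = hidden_at_layer("What a wonderful, delightful day!") - \
        hidden_at_layer("What a dull, ordinary day.")

def add_steering(module, inputs, output):
    # Add the concept direction to every token's residual stream.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * steer
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tok("The weather report says", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()  # restore the unsteered model
```

With a positive ALPHA, completions tend to drift toward the upbeat register of the first prompt; flipping its sign pushes the other way, which mirrors the strengthen-or-weaken behavior the article describes.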
…
