By now, ChatGPT, Claude, and other large language models (LLMs) have accumulated so much human knowledge that they are far more than simple answer generators; they can also express abstract concepts, such as particular tones, personalities, biases, and moods. It is not obvious, however, exactly how these models represent such abstract concepts within the knowledge they contain.
Now a team from MIT and the University of California-San Diego has developed a way to test whether a large language model contains hidden biases, personalities, moods, or other abstract concepts. The method can zero in on connections within a model that encode a concept of interest. What’s more, it can then manipulate or “steer” those connections to strengthen or weaken the concept in any answer the model is prompted to give.
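The article does not spell out the team’s exact procedure, but the idea it describes resembles a widely used technique called activation steering: derive a direction in the model’s hidden-state space from contrasting prompts that do and do not express the concept, then add that direction back into the model’s activations during generation. The sketch below illustrates this under stated assumptions; the model name, layer index, strength, and example prompts are all illustrative choices, not details from the research.

```python
# Minimal sketch of activation steering (a common concept-steering technique;
# NOT necessarily the MIT/UCSD team's method). Assumptions: GPT-2 as the model,
# layer 6 as the intervention point, and "cheerfulness" as the target concept.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # assumption: any Hugging Face causal LM would work similarly
LAYER = 6        # assumption: a middle transformer block
ALPHA = 4.0      # steering strength; a negative value weakens the concept

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def hidden_at_layer(text: str) -> torch.Tensor:
    """Mean hidden state of `text` at the output of the chosen block."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding layer, so block LAYER's output
    # is at index LAYER + 1.
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Contrast a prompt that expresses the concept with one that lacks it;
# their difference approximates a "cheerfulness" direction.
steer = hidden_at_layer("What a wonderful, delightful day!") - \
        hidden_at_layer("What a dull, ordinary day.")

def add_steering(module, inputs, output):
    # Add the concept direction to every token's residual stream.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * steer
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tok("The weather report says", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()  # restore the unsteered model
```

With a positive ALPHA, completions tend to drift toward the upbeat register of the first prompt; flipping its sign pushes the other way, which mirrors the strengthen-or-weaken behavior the article describes.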
…
