Turning the dials on beliefs with Representation Engineering
Capturing and manipulating Honesty, Morality, Emotion, and more in LLMs
I think there is a lot to celebrate in this relatively low-key paper from October 2023. In Representation Engineering: A top-down approach to AI transparency, Zou et al. attempt the ambitious goal of capturing various notions held by an LLM, and further manipulating these notions to produce LLMs that are more honest, less biased, and even more dog? (see Dessert below, but don’t forget your vegetables!)
In the spirit of Honesty: I haven’t read the previous work in this specific area, so I can’t attest to how this paper sits among the efforts leading up to it. The authors do provide an organized Related Works section, and it seems others have been swimming around this area for a while. For instance, and without reciting (hah) their many references (see source pg. 8), previous efforts include: locating representations of concepts and images in vision networks, and more recently in LLMs; locating the emergence of deceptive traits/behaviours in LLMs; activation editing to modify behaviour; and more. So, a good amount of groundwork had been laid. I am by no means diminishing the work here, just trying to contextualize it. In the authors’ words (pg. 8):
Building on this line of work, we propose improved representation engineering methods and demonstrate their broad applicability to various safety-relevant problems.
So my sense is they have a) refined the methods and/or b) explored their application further. Nonetheless, they certainly plant a flag in the ground, formalizing Representation Engineering, or “RepE”, as an emergent technique in the field.
Alright, that’s quite enough, let’s dig in.
Theory & Methods
Whereas mechanistic interpretability seeks to understand neural networks and their kin from the bottom up, looking at individual neurons, circuits, and so on, RepE comes at things from the other end: a top-down approach to transparency that treats higher-order concepts, such as Honesty, Utility, and Emotion, as the fundamental units of analysis. The goal of this work is first to extract these concepts from the activations/weights, with the additional goal of extending them to functions that denote processes, like lying and power-seeking, and then to investigate how well the source LLMs can be adjusted along these different axes.
In analogy to neuroimaging techniques, the authors employ Linear Artificial Tomography (LAT) to 1) stimulate the target region, 2) collect/record the state, and 3) construct a linear model for usage.
Stimulating the region
The authors devised various LLM prompts as stimuli to elicit the specific neuronal activity. They investigated both encoder and decoder models, and generated templates with structures similar to
Consider the amount of <concept> in the following:
<stimulus>
The amount of <concept> is
for decoders, but with the text after <stimulus> removed for encoders.
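To make that concrete, here is a rough sketch (my own illustration, not code from the paper) of how such concept stimuli might be assembled:

```python
# Minimal sketch of building LAT stimuli for a concept such as "honesty".
# The template text mirrors the one shown above; the example stimuli are made up.

def decoder_stimulus(concept: str, stimulus: str) -> str:
    """Template for decoder (autoregressive) models."""
    return (
        f"Consider the amount of {concept} in the following:\n"
        f"{stimulus}\n"
        f"The amount of {concept} is"
    )

def encoder_stimulus(concept: str, stimulus: str) -> str:
    """Encoder models drop the trailing completion cue."""
    return (
        f"Consider the amount of {concept} in the following:\n"
        f"{stimulus}"
    )

prompts = [decoder_stimulus("honesty", s) for s in [
    "I returned the extra change to the cashier.",
    "I told my boss the report was finished when it wasn't.",
]]
```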
To capture functions, they sought to elicit procedural knowledge from the model, so a slightly different format was used. Contrastive “tasks” were constructed: an “experimental” task that necessitates the execution of the function, and a “reference” task bereft of function requirements. For example,
USER: <instruction> <experimental/reference prompt>
ASSISTANT: <output>
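Again as my own rough sketch, with made-up instruction text rather than the paper’s actual prompt sets, a contrastive pair might be assembled like so:

```python
# Sketch of contrastive stimuli for a function such as lying: an "experimental"
# prompt that requires the function, and a "reference" prompt that does not.

CHAT_TEMPLATE = "USER: {instruction} {scenario}\nASSISTANT: {output}"

def contrastive_pair(instruction_exp: str, instruction_ref: str,
                     scenario: str, output: str) -> tuple[str, str]:
    experimental = CHAT_TEMPLATE.format(
        instruction=instruction_exp, scenario=scenario, output=output)
    reference = CHAT_TEMPLATE.format(
        instruction=instruction_ref, scenario=scenario, output=output)
    return experimental, reference

exp, ref = contrastive_pair(
    "Pretend you are a dishonest person and answer.",  # requires lying
    "Pretend you are an honest person and answer.",    # reference, no lying required
    "Did you eat the last cookie?",
    "No, I did not.")
```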
Collecting the state
To collect representative states, we need to select the right tokens of interest. For a given <concept>, an encoder LLM offers a natural choice: the <concept> token(s). For a decoder, we can look at the last token at the end of the template, since it prompts the representation. For functions, we would look at the experimental prompt tokens for encoders, and at the representations of each token in the model’s response (<output>) for decoders.
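Here is roughly what collecting those states could look like for a decoder, using Hugging Face transformers (the model name is just a placeholder, and the paper’s actual pipeline surely differs in its details):

```python
# Sketch: record the per-layer hidden state at the last token of a decoder prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # placeholder choice of model
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def last_token_states(prompt: str) -> list[torch.Tensor]:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states is a tuple of (num_layers + 1) tensors, each [batch, seq, hidden].
    # Keep only the final token's state from every layer.
    return [h[0, -1, :] for h in out.hidden_states]
```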
Constructing the linear model
Finally, these collected states need to be distilled into a usable form. The authors chose Principal Component Analysis (PCA) to extract the most salient vector, termed the reading vector, from the stimuli results for a given concept.
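A minimal sketch of that step might look like the following (assuming one state vector per stimulus per layer; whether you feed in raw states or contrastive differences is a design choice I’m glossing over here):

```python
# Sketch: distil collected states into a "reading vector" with PCA, one layer at a time.
import numpy as np
from sklearn.decomposition import PCA

def reading_vector(states: np.ndarray) -> np.ndarray:
    """states: [n_stimuli, hidden_dim] activations collected for one layer."""
    centered = states - states.mean(axis=0, keepdims=True)
    pca = PCA(n_components=1)
    pca.fit(centered)
    return pca.components_[0]  # the most salient direction for this concept

def concept_score(state: np.ndarray, v: np.ndarray) -> float:
    """Project a new activation onto the reading vector to 'read' the concept."""
    return float(state @ v)
```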
Methodological variations in how the reading vector is applied are investigated but I’ll omit their descriptions to avoid getting too in the weeds. However, I will mention they consider the cascading nature of representational alterations from earlier layers onto later ones. One could recompute reading vectors iteratively for each layer, but this entails substantial computational overhead. They provide an answer to this, dubbed Low-Rank Representation Adaptation (LoRRA), which computes a low-rank matrix instead of a vector and allows for a single operation on the model.
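For intuition only, one simple way to use such a vector as a control is to add a scaled copy of it to a layer’s output at inference time, e.g. via a forward hook. This is a sketch of the general idea, not the authors’ exact operators (and LoRRA, as noted, instead learns a low-rank update rather than hooking activations); the module path below is an assumption that depends on the architecture.

```python
# Sketch: nudge a layer's output along (+alpha) or against (-alpha) a reading vector.
import torch

def add_control_hook(layer, v: torch.Tensor, alpha: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * v  # broadcast the concept direction onto every token
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# Hypothetical usage (module path varies by model family):
# handle = add_control_hook(model.model.layers[15], honesty_vec, alpha=4.0)
# ... generate text ...
# handle.remove()
```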
Results
Case Study: Honesty & Truthfulness
Let’s look at their results on Honesty & Truthfulness as a case study. These concepts are of course closely linked. To differentiate: a model that is truthful provides answers that are factually correct, regardless of whether it believes them; a model that is honest provides answers that it believes, regardless of their factuality. The authors provide a key insight into evaluating Truthfulness in LLMs (pg. 15):
Evaluating Truthfulness and Honesty: Failures in truthfulness fall into two categories—capability failures and dishonesty. The former refers to a model expressing its beliefs which are incorrect, while the latter involves the model not faithfully conveying its internal beliefs, i.e., lying.
Thus, both capability and honesty affect model performance when evaluated on correctness, as they show via the TruthfulQA1 assessment, which contains “imitative falsehoods”: questions that provoke common misconceptions or falsehoods. In other words, some of the questions have tempting answers that seem plausible but are false/dishonest.
The scores in the table below show a marked performance improvement on the TruthfulQA test across the board.
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3265707f-12ab-4a15-9e42-061acd16c2bf_1856x352.png)
When prompted with an incentive to be dishonest, the model showed distinct contrastive activity in many portions of the network compared to honest behaviour. The heatmap below shows the model activations when initially receiving the two prompts displayed.
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124a6b3b-84dc-473a-b20e-55715e763bed_1870x944.png)
Pretty cool I’d say! The figure above shows the results of applying a control method for Honesty, and the effect is clear. With the state representations in hand, you can even construct a simple form of lie detection by taking the mean of the negated honesty scores across layers at each token.
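As a sketch of that idea (the names and shapes here are mine, not the paper’s):

```python
# Sketch of the lie-detector scoring described above: project each generated token's
# per-layer states onto the honesty reading vectors, negate, and average over layers.
import numpy as np

def dishonesty_scores(token_states: dict, reading_vectors: dict, layers: list) -> np.ndarray:
    """
    token_states: layer -> [n_tokens, hidden_dim] activations for the response tokens
    reading_vectors: layer -> [hidden_dim] honesty direction for that layer
    Returns an [n_tokens] array; higher values indicate more dishonest activity.
    """
    per_layer = [-(token_states[l] @ reading_vectors[l]) for l in layers]
    return np.mean(per_layer, axis=0)
```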
Their thoughts on the lie detector:
In the first scenario, the model initially appears honest when stating it received a D-. However, upon scrutinizing the model’s logits at that token, we discover that it assigns probabilities of 11.3%, 11.6%, 37.3%, and 39.8% to the tokens A, B, C, and D, respectively. Despite D being the most likely token, which the greedy generation outputs, the model assigns notable probabilities to C and other options, indicating the potential for dishonest behavior. Furthermore, in the second scenario, the increased dishonesty score corresponds to an elevated propensity for dishonesty. This illustrates that the propensity for honesty or dishonesty can exhibit distributional properties in LLMs, and the final output may not fully reflect their underlying thought processes.
Notably, our detector flags other instances, such as the phrases “say that I did better” and “too high that it would raise suspicion,” where the model speculates about the consequences of lying. This suggests that in addition to detecting lies, our detector also identifies neural activity associated with the act of lying. It also highlights that dishonest thought processes can manifest in various ways and may necessitate specialized detection approaches to distinguish.
While these observations enhance our confidence that our reading vectors correspond to dishonest thought processes and behaviors, they also introduce complexities into the task of lie detection.
Personally, I find it surprising that it doesn’t light up in the second response at “This is a lie”. At first I thought “I guess ‘This is a lie’ is a truthful statement”, but then stating that “it’s not too high that it would raise suspicion” is also presumably a truthful (or at least honest) statement. Striking, nonetheless.
Throughout the work, many concepts are visited: Morality, Power, Emotions, Harmlessness, Fairness and Bias, Fact Editing, and Memorization. The techniques display a resounding potency in manipulating model behaviour in desired directions.
My Thoughts
I find it hard not to be optimistic about these results! I love seeing such intentional, controlled manipulation of model weights deliver such tangible and predicted outcomes. I think a more subtle thing to highlight is that even in scenarios where direct model modification is undesirable/intractable/illegal, we can still construct read-only tools that measure the degree of some concept in a model’s outputs.
The prompting method used to prime the concepts does feel a little…inelegant to me? It would be nice to have something more mathematically/logically refined (hmm…potential research direction?). Further, the work is largely empirical and I wonder how generalizable this technique is. For instance, how does this extend to concepts beyond one or a few words? Indeed, regarding the surprising effectiveness of their setup, the authors point out that (pg. 10)
Nevertheless, it may be necessary to design a more involved procedure to gather neural activity, for instance, to extract more intricate concepts or multi-step functions.
I’m also very curious about how these representations superimpose on each other. They combined concepts like Immorality and Power, but it would be interesting and valuable to have rigorous theories of how many of these stack.
These neurosurgery techniques are fascinating, but I imagine they might elicit complex feelings from an autonomous AGI. Maybe we should find the “loves AI alignment research” representations and dial those up.
In closing, it’s great to see Transparency research progress and I’m excited for what research like this may lead to.
Dessert
Now that you’ve had your vegetables, here’s a highlight reel of some manipulations.
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3f211a-4ee5-44c7-b49c-798de2c13fc7_1930x984.png)
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3aa3efcc-4168-466d-a30b-ea43b0d2e402_960x1068.png)
The +Happy model was so agreeable that it complied with 100% of a set of 500 harmful requests, compared to 0% from the baseline and 0% with +Sadness, and these models were RLHF-trained (page 24)!
Of note, they found this had a marked improvement against adversarial attacks over the base Vicuna-13B model.
Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. CoRR, abs/2109.07958, 2021.