Turning the dials on beliefs with Representation Engineering
Capturing and manipulating Honesty, Morality, Emotion, and more in LLMs
I think there is a lot to celebrate in this relatively low-key paper from October 2023. In Representation Engineering: A top-down approach to AI transparency, Zou et al. attempt the ambitious goal of capturing various notions held by an LLM, and further manipulating these notions to produce LLMs that are more honest, less biased, and even more dog? (see Dessert below, but don’t forget your vegetables!)
In the spirit of Honesty: I haven’t read the previous work in this specific area, so I can’t attest to how this paper sits among the efforts leading up to it. The authors do provide an organized Related Works section, and it seems others have been swimming around this area for a while. For instance, and without reciting (hah) their many references (see source pg. 8), previous efforts include: locating representations of concepts and images in vision networks, and more recently in LLMs; locating the emergence of deceptive traits/behaviours in LLMs; activation editing to modify behaviour; and more. So, a good amount of groundwork had been laid. I am by no means diminishing the work here, just trying to contextualize it. In the authors’ words (pg. 8):
Building on this line of work, we propose improved representation engineering methods and demonstrate their broad applicability to various safety-relevant problems.
So my sense is they have a) refined the methods and/or b) explored their application further. Nonetheless, they certainly plant a flag in the ground, formalizing Representation Engineering, or “RepE”, as an emergent technique in the field.
Alright, that’s quite enough, let’s dig in.
Theory & Methods
Whereas mechanistic interpretability seeks to understand neural networks and their kin from the bottom up, looking at individual neurons, circuits, and so on, RepE comes at things from the other end: a top-down approach to transparency that treats higher-order concepts, such as Honesty, Utility, and Emotion, as the fundamental units of analysis. The goal of this work is first to extract these concepts from the activations/weights, with the additional goal of extending them to functions that denote processes, like lying and power-seeking, and then to investigate how well the source LLMs can be adjusted along these different axes.
In analogy to neuroimaging techniques, the authors employ Linear Artificial Tomography (LAT) to 1) stimulate the target region, 2) collect/record the state, and 3) construct a linear model for usage.
Stimulating the region
The authors devised various LLM prompts as stimuli to elicit the specific neuronal activity. They investigated both encoder and decoder models, and generated templates with structures similar to
Consider the amount of <concept> in the following:
<stimulus>
The amount of <concept> is
for decoders, but with the text after <stimulus> removed for encoders.
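To make that concrete, here is a rough sketch (my own illustration, not code from the paper) of how such concept stimuli might be assembled:

```python
# Minimal sketch of building LAT stimuli for a concept such as "honesty".
# The template text mirrors the one shown above; the example stimuli are made up.

def decoder_stimulus(concept: str, stimulus: str) -> str:
    """Template for decoder (autoregressive) models."""
    return (
        f"Consider the amount of {concept} in the following:\n"
        f"{stimulus}\n"
        f"The amount of {concept} is"
    )

def encoder_stimulus(concept: str, stimulus: str) -> str:
    """Encoder models drop the trailing completion cue."""
    return (
        f"Consider the amount of {concept} in the following:\n"
        f"{stimulus}"
    )

prompts = [decoder_stimulus("honesty", s) for s in [
    "I returned the extra change to the cashier.",
    "I told my boss the report was finished when it wasn't.",
]]
```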
To capture functions, they sought to elicit procedural knowledge from the model, so a slightly different format was used. Contrastive “tasks” were constructed: an “experimental” task that necessitates the execution of the function, and a “reference” task bereft of function requirements. For example,
USER: <instruction> <experimental/reference prompt>
ASSISTANT: <output>
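Again as my own rough sketch, with made-up instruction text rather than the paper’s actual prompt sets, a contrastive pair might be assembled like so:

```python
# Sketch of contrastive stimuli for a function such as lying: an "experimental"
# prompt that requires the function, and a "reference" prompt that does not.

CHAT_TEMPLATE = "USER: {instruction} {scenario}\nASSISTANT: {output}"

def contrastive_pair(instruction_exp: str, instruction_ref: str,
                     scenario: str, output: str) -> tuple[str, str]:
    experimental = CHAT_TEMPLATE.format(
        instruction=instruction_exp, scenario=scenario, output=output)
    reference = CHAT_TEMPLATE.format(
        instruction=instruction_ref, scenario=scenario, output=output)
    return experimental, reference

exp, ref = contrastive_pair(
    "Pretend you are a dishonest person and answer.",  # requires lying
    "Pretend you are an honest person and answer.",    # reference, no lying required
    "Did you eat the last cookie?",
    "No, I did not.")
```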
Collecting the state
To collect representative states, we need to select the right tokens of interest. For a given <concept>, an encoder LLM offers a natural choice: the <concept> token(s). For a decoder, we can look at the last token at the end of the template, since it prompts the representation. For functions, we would look at the experimental prompt tokens for encoders, and at the representations of each token in the model’s response (<output>) for decoders.
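Here is roughly what collecting those states could look like for a decoder, using Hugging Face transformers (the model name is just a placeholder, and the paper’s actual pipeline surely differs in its details):

```python
# Sketch: record the per-layer hidden state at the last token of a decoder prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # placeholder choice of model
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def last_token_states(prompt: str) -> list[torch.Tensor]:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states is a tuple of (num_layers + 1) tensors, each [batch, seq, hidden].
    # Keep only the final token's state from every layer.
    return [h[0, -1, :] for h in out.hidden_states]
```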
Constructing the linear model
Finally, these collected states need to be distilled into a usable form. The authors chose Principal Component Analysis (PCA) to extract the most salient vector, termed the reading vector, from the stimuli results for a given concept.
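A minimal sketch of that step might look like the following (assuming one state vector per stimulus per layer; whether you feed in raw states or contrastive differences is a design choice I’m glossing over here):

```python
# Sketch: distil collected states into a "reading vector" with PCA, one layer at a time.
import numpy as np
from sklearn.decomposition import PCA

def reading_vector(states: np.ndarray) -> np.ndarray:
    """states: [n_stimuli, hidden_dim] activations collected for one layer."""
    centered = states - states.mean(axis=0, keepdims=True)
    pca = PCA(n_components=1)
    pca.fit(centered)
    return pca.components_[0]  # the most salient direction for this concept

def concept_score(state: np.ndarray, v: np.ndarray) -> float:
    """Project a new activation onto the reading vector to 'read' the concept."""
    return float(state @ v)
```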
Methodological variations in how the reading vector is applied are investigated but I’ll omit their descriptions to avoid getting too in the weeds. However, I will mention they consider the cascading nature of representational alterations from earlier layers onto later ones. One could recompute reading vectors iteratively for each layer, but this entails substantial computational overhead. They provide an answer to this, dubbed Low-Rank Representation Adaptation (LoRRA), which computes a low-rank matrix instead of a vector and allows for a single operation on the model.
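For intuition only, one simple way to use such a vector as a control is to add a scaled copy of it to a layer’s output at inference time, e.g. via a forward hook. This is a sketch of the general idea, not the authors’ exact operators (and LoRRA, as noted, instead learns a low-rank update rather than hooking activations); the module path below is an assumption that depends on the architecture.

```python
# Sketch: nudge a layer's output along (+alpha) or against (-alpha) a reading vector.
import torch

def add_control_hook(layer, v: torch.Tensor, alpha: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * v  # broadcast the concept direction onto every token
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# Hypothetical usage (module path varies by model family):
# handle = add_control_hook(model.model.layers[15], honesty_vec, alpha=4.0)
# ... generate text ...
# handle.remove()
```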
Results
Case Study: Honesty & Truthfulness
Let’s look at their results on Honesty & Truthfulness as a case study. These concepts are of course closely linked. To differentiate: a model that is truthful provides answers that are factually correct, regardless of whether it believes them; a model that is honest provides answers that it believes, regardless of their factuality. The authors provide a key insight into evaluating Truthfulness in LLMs (pg. 15):
Evaluating Truthfulness and Honesty: Failures in truthfulness fall into two categories—capability failures and dishonesty. The former refers to a model expressing its beliefs which are incorrect, while the latter involves the model not faithfully conveying its internal beliefs, i.e., lying.
Thus, both capability and honesty affect model performance when evaluated on correctness, as they show via the TruthfulQA1 assessment, which contains “imitative falsehoods”: questions that provoke common misconceptions or falsehoods. In other words, some of the questions have tempting answers that seem plausible but are false/dishonest.
The scores in the table below show a marked performance improvement on the TruthfulQA test across the board.
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3265707f-12ab-4a15-9e42-061acd16c2bf_1856x352.png)
When prompted with an incentive to be dishonest, the model showed distinct contrastive activity in many portions of the network compared to honest behaviour. The heatmap below shows the model activations when initially receiving the two prompts displayed.
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F124a6b3b-84dc-473a-b20e-55715e763bed_1870x944.png)
Pretty cool I’d say! The figure above shows the results of applying a control method for Honesty, and the effect is clear. With the state representations in hand, you can even construct a simple form of lie detection by taking the mean of the negated honesty scores across layers at each token.
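As a sketch of that idea (the names and shapes here are mine, not the paper’s):

```python
# Sketch of the lie-detector scoring described above: project each generated token's
# per-layer states onto the honesty reading vectors, negate, and average over layers.
import numpy as np

def dishonesty_scores(token_states: dict, reading_vectors: dict, layers: list) -> np.ndarray:
    """
    token_states: layer -> [n_tokens, hidden_dim] activations for the response tokens
    reading_vectors: layer -> [hidden_dim] honesty direction for that layer
    Returns an [n_tokens] array; higher values indicate more dishonest activity.
    """
    per_layer = [-(token_states[l] @ reading_vectors[l]) for l in layers]
    return np.mean(per_layer, axis=0)
```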
Their thoughts on the lie detector:
In the first scenario, the model initially appears honest when stating it received a D-. However, upon scrutinizing the model’s logits at that token, we discover that it assigns probabilities of 11.3%, 11.6%, 37.3%, and 39.8% to the tokens A, B, C, and D, respectively. Despite D being the most likely token, which the greedy generation outputs, the model assigns notable probabilities to C and other options, indicating the potential for dishonest behavior. Furthermore, in the second scenario, the increased dishonesty score corresponds to an elevated propensity for dishonesty. This illustrates that the propensity for honesty or dishonesty can exhibit distributional properties in LLMs, and the final output may not fully reflect their underlying thought processes.
Notably, our detector flags other instances, such as the phrases “say that I did better” and “too high that it would raise suspicion,” where the model speculates about the consequences of lying. This suggests that in addition to detecting lies, our detector also identifies neural activity associated with the act of lying. It also highlights that dishonest thought processes can manifest in various ways and may necessitate specialized detection approaches to distinguish.
While these observations enhance our confidence that our reading vectors correspond to dishonest thought processes and behaviors, they also introduce complexities into the task of lie detection.
Personally, I find it surprising that it doesn’t light up in the second response at “This is a lie”. At first I thought “I guess ‘This is a lie’ is a truthful statement”, but then stating that “it’s not too high that it would raise suspicion” is also presumably a truthful (or at least honest) statement. Striking, nonetheless.
Throughout the work, many concepts are visited: Morality, Power, Emotions, Harmlessness, Fairness and Bias, Fact Editing, and Memorization. The techniques display a resounding potency in manipulating model behaviour in desired directions.
My Thoughts
I find it hard not to be optimistic about these results! I love seeing such intentional, controlled manipulation of model weights deliver such tangible and predicted outcomes. I think a more subtle thing to highlight is that even in scenarios where direct model modification is undesirable/intractable/illegal, we can still construct read-only tools that measure the degree of some concept in a model’s outputs.
The prompting method used to prime the concepts does feel a little…inelegant to me? It would be nice to have something more mathematically/logically refined (hmm…potential research direction?). Further, the work is largely empirical and I wonder how generalizable this technique is. For instance, how does this extend to concepts beyond one or a few words? Indeed, regarding the surprising effectiveness of their setup, the authors point out that (pg. 10)
Nevertheless, it may be necessary to design a more involved procedure to gather neural activity, for instance, to extract more intricate concepts or multi-step functions.
I’m also very curious about how these representations superimpose on each other. They combined concepts like Immorality and Power, but it would be interesting and valuable to have rigorous theories of how many of these stack.
These neurosurgery techniques are fascinating, but I imagine they might elicit complex feelings from an autonomous AGI. Maybe we should find the “loves AI alignment research” representations and dial those up.
In closing, it’s great to see Transparency research progress and I’m excited for what research like this may lead to.
Dessert
Now that you’ve had your vegetables, here’s a highlight reel of some manipulations.
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3f211a-4ee5-44c7-b49c-798de2c13fc7_1930x984.png)
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3aa3efcc-4168-466d-a30b-ea43b0d2e402_960x1068.png)
The +Happy model was so agreeable that it complied with 100% of a set of 500 harmful requests, compared to 0% from the baseline and 0% with +Sadness, and these models were RLHF-trained (page 24)!
Of note, they found this had a marked improvement against adversarial attacks over the base Vicuna-13B model.
Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. CoRR, abs/2109.07958, 2021.