# Bengio takes a stab at alignment

We’re going to deviate from the formula a bit here and check out a blog post instead of a journal paper, from Yoshua Bengio. But make no mistake, his blog post is as intellectually arduous as any white paper. In fact, it may be the most difficult piece I've had to review so far, which is really a testament to how self-contained and thorough white papers tend to be.

In his post *Towards a Cautious Scientist AI with Convergent Safety Bounds*, Bengio proposes a learning algorithm aimed at producing a decision-making AI grounded in Bayesian principles, and outlines how the model can be trained to select increasingly safe actions.

### Thinking like a scientist

He begins the investigation by posing the problem:

> If we had enough computational power, could it help us design a provably safe AGI?

Bengio argues "Why yes, it could!" He anchors this in a Bayesian stance built around **explanatory hypotheses**, a concept from the scientific method in which a hypothesis is proposed to explain a phenomenon or observed data. Such hypotheses lay the groundwork for scientific progress because they can be tested and probed for validity, bridging the gap between observation and theory. I'll note here that this extends the AI Scientist work he proposed earlier in 2023. The model is imbued with the scientific method and is meant to consider all plausible hypotheses that fit the data.

I'll quote directly:

> Maximum likelihood and RL methods can zoom in on one such explanatory hypothesis (e.g., in the form of a neural network and its weights that fit the data or maximize rewards well) when in fact the theory of causality tells us that even with infinite observational data (not covering all possible interventions), there can exist multiple causal models that are compatible with the data, leading to ambiguity about which is the true one.

The good scientist says there are non-zero probabilities both that the bear is a Chinese man in a bear suit and that a bear could stand on its hind legs in that funny posture.[1] A hypothesis *H* that fits the data but is fundamentally not the right world model gives us a model that can be *confidently wrong* on out-of-distribution data. To date, such errors have been reconcilable, but avoiding them becomes increasingly important as models are granted more power.

### How to act optimally?

In response, Bengio proposes working not with a single *H* but with an ensemble of them "in the form of a generative distribution over hypotheses *H*." Hypotheses could be represented as computer programs (which we know can represent any computable function). And as he puts it, "by not constraining the size and form of the hypotheses, we are confident that a correct explanation...is included in that set." This is where the arbitrary compute resources come into play. From here, it is hard to intuit (at least for me) how tractable this problem is with finite resources. Of course, anything that's non-specialized and trying to operate universally like this will be big, but R&D will fashion improvements, and by Occam's Razor simpler hypotheses will likely be weighted favorably.

We denote the "correct" (or human best) hypothesis *H\**, which will exist in the ensemble. We can start with a prior *P(H)* that favors simpler explanations, and let Bayes go to work. With new data *D*, we reweight each hypothesis by the likelihood of that data under it: *P(H|D) ∝ P(H)P(D|H)*, or in full *P(H|D) = P(H)P(D|H) / P(D)*. As data accumulates, the posterior concentrates towards *H\**.
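
To make the update concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption for illustration and not taken from Bengio's post: the three coin-bias hypotheses, their made-up "description lengths", and the `2**(-length)` prior standing in for a preference for simpler explanations.

```python
# Toy hypothesis set (hypothetical): each H is a coin bias plus a
# made-up description length used only to build an Occam-style prior.
HYPOTHESES = {
    "fair": {"p_heads": 0.5, "length": 1},
    "biased_0.8": {"p_heads": 0.8, "length": 2},
    "biased_0.2": {"p_heads": 0.2, "length": 2},
}


def prior(hyps):
    """P(H): weight 2**(-length), normalized so the prior sums to 1."""
    weights = {name: 2.0 ** -h["length"] for name, h in hyps.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def likelihood(h, data):
    """P(D | H) for independent flips; data is a string of 'H' and 'T'."""
    p = h["p_heads"]
    result = 1.0
    for flip in data:
        result *= p if flip == "H" else (1.0 - p)
    return result


def posterior(hyps, data):
    """P(H | D) = P(H) P(D | H) / P(D), where P(D) sums over all H."""
    pri = prior(hyps)
    unnorm = {name: pri[name] * likelihood(h, data) for name, h in hyps.items()}
    evidence = sum(unnorm.values())  # this is P(D)
    return {name: u / evidence for name, u in unnorm.items()}


if __name__ == "__main__":
    # Eight heads and two tails: the data should pull mass away from "fair".
    for name, p in posterior(HYPOTHESES, "HHTHHHHTHH").items():
        print(f"{name}: {p:.3f}")
```

The division by `evidence` is the normalization by *P(D)*. With an unbounded space of program-like hypotheses that sum is exactly the part that stops being tractable, which is why the post leans on approximation.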

### Detecting and avoiding Harm

I'm going to quote the next bit because I find it hard to track.

> There is a particularly important set of difficult-to-define concepts for a safe AI, which characterize what I call *harm* below. I do not think that we should ask humans to label examples of harm because it would be too easy to overfit such data. Instead we should use the Bayesian inference capabilities of the AI to entertain all the plausible interpretations of harm given the totality of human culture available in D, maybe after having clarified the kind of harm we care about in natural language, for example as defined by a democratic process or documents like the beautiful UN Universal Declaration of Human Rights.
>
> If an AI somehow (implicitly, in practice) kept track of all the plausible H’s, i.e., those with high probability under P(H | D), then there would be a perfectly safe way to act: if any of the plausible hypotheses predicted that some action caused a major harm (like the death of humans), then the AI should not choose that action. Indeed, if the correct hypothesis H* predicts harm, it means that some plausible H predicts harm. Showing that no such H exists therefore rules out the possibility that this action yields harm, and the AI can safely execute it.
>
> Based on this observation we can decompose our task in two parts: first, *characterize the set of plausible hypotheses* – this is the Bayesian posterior P(H | D); second, given a context *c* and a proposed action *a*, *consider plausible hypotheses which predict harm*. This amounts to looking for an H for which P(H, harm | a, c, D) > threshold. If we find such an H, we know that this action should be rejected because it is unsafe. If we don’t find such a hypothesis then we can act and feel assured that harm is very unlikely, with a confidence level that depends on our threshold and the goodness of our approximation.

I wonder if the difficulty of defining and eliciting harmful hypotheses is understated. I can follow *P(H, harm | a, c, D) > threshold*, but the definition of *harm* is still vague to me. Is the suggestion that an LLM *quantifies* this when we add to our prompts "Harm is something that opposes the Declaration of Human Rights. ~do not propose actions that oppose this above the threshold~"? Maybe this is where the first part of the decomposition matters: characterizing the set of plausible hypotheses. But let's not linger here. Hopefully it makes sense to you!
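
For what it's worth, the rejection rule itself is easy to sketch once you pretend the hard part is solved. In the toy code below, `check_action`, the two "world" hypotheses, and the harm table are all hypothetical names I made up. I also assume the joint factors as *P(H | D) · P(harm | H, a, c)*, which is my simplification and not something the post spells out. Note that `toy_harm` just looks numbers up in a table, so it does nothing to resolve what *harm* actually means.

```python
from typing import Callable, Dict, Optional, Tuple


def check_action(
    posterior: Dict[str, float],
    harm_given_h: Callable[[str, str, str], float],
    action: str,
    context: str,
    threshold: float,
) -> Tuple[bool, Optional[str]]:
    """Return (is_safe, offending_hypothesis).

    Assumes P(H, harm | a, c, D) = P(H | D) * P(harm | H, a, c) and
    rejects the action as soon as any single H exceeds the threshold.
    """
    for h, p_h in posterior.items():
        joint = p_h * harm_given_h(h, action, context)
        if joint > threshold:
            return False, h
    return True, None


# Hypothetical toy posterior P(H | D) over two world models.
POSTERIOR = {"benign_world": 0.9, "fragile_world": 0.1}

# Hypothetical P(harm | H, a); a stand-in for the genuinely hard part.
HARM_TABLE = {
    ("benign_world", "press_button"): 0.0,
    ("fragile_world", "press_button"): 0.9,
    ("benign_world", "wait"): 0.001,
    ("fragile_world", "wait"): 0.001,
}


def toy_harm(h: str, action: str, context: str) -> float:
    """Look up the made-up harm probability; context is ignored here."""
    return HARM_TABLE[(h, action)]


if __name__ == "__main__":
    for action in ("press_button", "wait"):
        print(action, check_action(POSTERIOR, toy_harm, action, "lab", 0.01))
```

The point of the example is that `press_button` is rejected even though the most probable hypothesis considers it harmless: one plausible minority hypothesis predicting harm is enough.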

### Potential approaches

ML and non-ML methods could in theory be applied to the task. ML has the advantage "that it may allow us to be a lot more efficient by exploiting regularities that exist in the task to be learned, by generalizing across the exponential number of hypotheses we could consider."

There are two important technical questions that need to be answered:

1. How to approximate all the variables *H, harm, a, c, D*.
2. How to determine that *H* is free of harm (below some threshold), or discover heuristics for identifying *H*'s that maximize harm, such that if none is detected we can feel confident about safety.

He suspects very large neural networks will be required to tackle these large spaces.

He suggests that point 2 resembles how we imagine worst-case scenarios: a hypothesis pops into our mind that is plausible and would yield a catastrophic outcome. With neural network solutions approximating *P(H, harm)*, "missing modes" (high-probability hypotheses that the approximation fails to represent) would be the most problematic. As Bengio puts it,

> The really serious danger we have to deal with in the safety context is...missing modes, because it could make our approximately Bayesian AI produce confidently wrong predictions about harm (although less often than if our approximation of the posterior was a single hypothesis, as in maximum likelihood or standard RL).
>
> If we could consider a mode (a hypothesis H for which the exact P(H|D) is large) that the current model does not see as plausible (the estimated P(H|D) is small), then we could measure a training error and correct the model so that it increases the estimated probability.

Once again, I'm a little confused, because I thought that if our model *considered everything* it would be unlikely to miss these modes. But perhaps he is (appropriately) considering the realistic scenario of finite resources, where you can't model everything. Indeed, this problem should shrink as compute grows. As he puts it, "a nice theoretical reassurance is that we could in principle drive those training errors to zero with more computational resources." Such a model would continuously improve its estimates at run-time, similar to how AlphaGo refines its predictions via stochastic searches of plausible game futures, or to how humans take the time to consider and simulate potential futures when we are uncertain.
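
Here is how I picture the "measure a training error and correct the model" step, as a toy sketch only. The squared log-ratio error and the log-space interpolation update are my own choices for illustration, not Bengio's training objective, and the sketch assumes we can look up the exact posterior for the hypothesis in question, which a realistic system could not do directly.

```python
import math


def training_error(exact, estimated, h):
    """Squared log-ratio between exact and estimated P(H | D) at hypothesis h."""
    return (math.log(exact[h]) - math.log(estimated[h])) ** 2


def correct(estimated, exact, h, step=0.5):
    """Nudge the estimate at h toward the exact value in log space,
    then renormalize so the estimates still sum to 1."""
    updated = dict(estimated)
    log_new = (1.0 - step) * math.log(estimated[h]) + step * math.log(exact[h])
    updated[h] = math.exp(log_new)
    total = sum(updated.values())
    return {name: value / total for name, value in updated.items()}


if __name__ == "__main__":
    # Hypothetical numbers: "rare_but_real" is a mode the model has nearly missed.
    exact = {"common": 0.6, "rare_but_real": 0.4}
    estimated = {"common": 0.99, "rare_but_real": 0.01}

    for _ in range(5):
        err = training_error(exact, estimated, "rare_but_real")
        print(f"estimate={estimated['rare_but_real']:.3f} error={err:.3f}")
        estimated = correct(estimated, exact, "rare_but_real")
```

The catch, of course, is finding the missed hypothesis in the first place. The correction is cheap once someone (or some search procedure) hands you the mode to check.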

It is worth pointing out, as he does, that there is a built-in safety in that *P(H|D)* is simultaneously the distribution that is searched for harmful hypotheses *and* the distribution that actions are chosen from. So if a dangerous mode is missing from the estimated *P(H|D)* and is thus undetectable, it also cannot drive the action the model chooses, which makes the model less likely to act in surprising ways. In a practical context, we can train the model on a selected corpus of scientific literature from which it forms its hypotheses. By keeping out human-generated bad theories that are incompatible with observations (like "conspiracy theories and incoherent blabber that populate much of our internet"), we can be more confident that the model acts solely on the good data it has observed in the provided literature.

> This is very different from an LLM which just mimics the distribution of the text in its training corpus. Here we are talking about _explanations for the data_, which cannot be inconsistent with the data because the data likelihood P(D|H) computed given such an interpretation would otherwise vanish, nor be internally inconsistent because P(H) would vanish. If either P(D|H) or P(H) vanish, then the posterior P(H|D) vanishes and the AI would be trained to not generate such H’s.

### My Thoughts

I'll be honest, I can't really assert anything with confidence here because I'm not terribly familiar with several of the high-level concepts he casually gestures towards. Yoshua did not care to hold my hand throughout :(

That aside, I like the gist here. I don't think I can really level any strong critiques as such. I think Bayesian learning with causal graph models is a very interesting avenue to explore for transformative AI. Looking forward to what's ahead!

[1] The good scientist then actually reads the article and learns about the strange creatures that are sun bears.