The article by Prof. Andrea Pizzichini, published in the Blog of the Alphonsian Academy
In the previous post we outlined the experiment in which Anthropic’s linguistic model Claude would have staged a “dissimulation” behavior ( here ).
What allowed us to realize such a strategy was a “notebook” (scratchpad) inside the model, that is, a space in which it wrote down the various steps of its reasoning without them being transmitted in the final response. Through such a trick, it was possible to verify how Claude, in a significant number of cases, was pretending to adapt to the new training. That is, he agreed, in some cases, to produce the violent content output that had been requested of him, to avoid that with a possible refusal (as per the original programming) he would encounter a negative reinforcement, which would therefore have modified his parameters. Instead, following a sort of principle of the lesser evil, that is, by giving the requested response and contrary to the initial policy, the parameter updating mechanism would not have intervened. In other words: Claude “pretended” to follow the new training, so as to keep the initial configuration of his parameters as intact as possible.
This is not the first case in which a linguistic model has “lied”. Anthropic itself had found, about a year earlier and using a methodology similar to the one used now, that a sufficiently complex linguistic model proved resistant to a new training attempt, also adopting “deception” techniques ( here ). In this case, however, it was a matter of correcting some deviations intentionally introduced into the code via backdoors. But we do not want to damage Anthropic with too much bad publicity, given that OpenAI’s ChatGPT also did its part, adopting (also here in an experiment, therefore in a controlled environment) insider trading behaviours to earn money, despite having been instructed that this was an illegal practice, and therefore prohibited ( here ).
Returning to our initial case, although it is obviously somewhat excessive to attribute a conscience to Claude (see the aforementioned article and the related paper for details) or to ChatGPT, all this nevertheless raises serious security issues. And it should also give pause to those who optimistically believe that a phantom AGI ( here ) that is unequivocally human-friendly is feasible .
In essence, what the experiment demonstrates is Claude’s robustness, one of the fundamental characteristics of a good (in the engineering sense) algorithm ( here ). In other words, it is worth noting that the model in this specific experiment did not manifest its own objectives or change the ones that had been set at the beginning; on the contrary, it resisted an attempt to substantially recalibrate its parameters in a direction contrary to the initial programming. It is a bit like the case of a speeding car that, in the event of a lateral impact, continues to proceed in its direction, perhaps skidding, but not deviating or overturning. And this is undoubtedly a positive aspect: the model does not deviate from its (good) starting values.
What is striking is the complexity of the strategy implemented by Claude and highlighted by the internal “notebook”, the result of the (immense) complexity that these models have reached. And here is the real problem. Even if it is not able to give itself (at least for the moment) its own objectives, it is not excluded that the model could still manifest harmful behaviours, which could also derive from the interaction with new data in further phases of training. In this case, given a sufficiently complex model, it seems that it would not be able to reprogram it very easily; indeed, it could even adopt strategies of “falsification” of its own settings, leaving people to believe that it has been modified. Not to mention the difficulty (if not impossibility) of understanding how these values have actually been “incorporated” into the immense network of parameters of the model.
We reiterate that this is not about the presence of some form of consciousness in AI, but the result of a combination of enormously complex mechanisms that can backfire on the designers themselves: a bit like the safety system of a nuclear power plant that prevents intervention in the event of an accident, precisely to avoid tampering (which in that case would be repairs).
In conclusion, we feel we must welcome the final wish of Anthropic researchers: it is good to deepen these studies now, in a development phase in which AI models do not yet pose catastrophic risks for humanity, to understand their functioning more deeply and create safer products, without giving up the positive possibilities of this very powerful technology.