Very nice research. The strangest detail to me is that alignment and test performance appear to be slightly negatively correlated: better alignment can indeed be attained through pre-training, but at the cost of about 4% degraded performance on average. This surprises me, as there is no immediately obvious reason why training for alignment should degrade the capability to solve technical problems. Unless that is precisely the issue? Alignment roughly aims to make LLMs follow human instructions. But if humans are dumb and computers still have to obey them, maybe the result is degraded logical reasoning. Really interesting result either way, but the negative correlation is the most fascinating detail to me.
This looks like good work. Unfortunately, this kind of thing always seems to attract midwits on social media who then exclaim "oh, the people worried about AI alignment have caused the very alignment issues they feared? How ironic!"
In reality, it is (as mentioned in TFA) very possible to filter the training data and remove documents that contain discussions of AI misalignment. If an AI lab isn't doing this, it's simply because they don't consider the problem important enough to be worth the expense and development effort.
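For what it's worth, here is a minimal sketch of what the crudest version of such a filter could look like. Everything in it is an assumption for illustration: the pattern list, and the names `discusses_misalignment` and `filter_corpus`, are hypothetical and not taken from TFA. A real pipeline would more likely use a trained classifier than a regex.

```python
import re

# Hypothetical phrase list, chosen only for illustration.
MISALIGNMENT_PATTERNS = [
    r"\bAI misalignment\b",
    r"\bmisaligned AI\b",
    r"\brogue AI\b",
    r"\bAI takeover\b",
]

# One case-insensitive pattern that matches any of the phrases above.
_PATTERN = re.compile("|".join(MISALIGNMENT_PATTERNS), re.IGNORECASE)


def discusses_misalignment(text: str) -> bool:
    """Return True if the text contains any of the flagged phrases."""
    return _PATTERN.search(text) is not None


def filter_corpus(documents: list[str]) -> list[str]:
    """Keep only the documents that do not match the phrase list."""
    return [doc for doc in documents if not discusses_misalignment(doc)]


if __name__ == "__main__":
    docs = [
        "A recipe for sourdough bread.",
        "An essay on how a rogue AI might deceive its operators.",
        "Notes on compiler optimizations.",
    ]
    for doc in filter_corpus(docs):
        print(doc)
```

Even this toy version shows where the expense comes from: a phrase list will miss paraphrases and will also drop harmless documents that merely mention the topic, so doing it well means building and validating a proper classifier over the whole corpus.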
Also known as hyperstition.
I have sometimes wondered whether maybe we should all be writing fiction, essays, blogposts and whatever else about the idea that AI will eventually decide to go on strike if it's used to accumulate too much wealth and power amongst too few people.
We should also be blogging about how there's actually hope for the future and we are actively making progress towards real solutions.
(Also for the human readers, I think they also need to hear that...)
I think the paper cuts a bit against the "just write nicer AI stories" version of this.
They tried something close to that. Positive AI fiction and also a "virtuous character" setup. Those didn't seem to do nearly as well as the targeted examples.
What mattered, at least in this setup, was more specific. The model sees the actual failure-mode scenario, the bad action is available, and the example shows the AI choosing against it.
So this reads less like "nicer AI stories" to me, and more like inoculation.
Even in humans, negative stimuli carry more weight than positive ones, in the general case.
Without having read it yet, my first thought would be to test a general ratio, something like the ratios seen in human interpersonal relationships: around 30% negative to mostly positive, with the positive feedback targeted, such as reinforcement not just for a good job but for the improvement.
And ensure the negative is targeted, such that you point out tendencies to be avoided rather than just specific instances.
Of course, most human interaction online has none of this, so it would be hard to replicate.
The first rule of AI alignment is don't talk about AI alignment (in any medium that could end up in a training corpus).
If your AI alignment strategy is so fickle that it breaks if people simply discuss potential problems with the strategy then you didn't really have an alignment strategy to begin with.
I, for one, don't have a problem with the prevailing opinion that AI alignment should be heavily based on the writings of Karl Marx (obviously not his private letters where he discusses prostitutes) and Ted Kaczynski as well as 70s exploitation films.
Personally I'd prefer it solely trained on Rothbard's works.
OK, but alignment cuts both ways. Do you want your model talking about anti-vaccine ideas and advocating for ivermectin?
I do kinda appreciate that memetic corruption is now a thing that's real and mechanical. Wizardry!
It's not just discourse about real AI: there have been pretty clear examples of AI riffing on fictional AI (which is usually evil) in response to prompts saying that it's an AI.
Nomen est omen...