Small rant : Basically, the title. Instead of answering every question, if it instead said it doesn’t know the answer, it would have been trustworthy.
Small rant : Basically, the title. Instead of answering every question, if it instead said it doesn’t know the answer, it would have been trustworthy.
Do you have a source for the “smiling when you don’t really mean it” thing? I’ve been digging around but couldn’t find that anywhere.
It’s right in the research I was mentioning:
https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
Find the section on the model’s representation of self and then the ranked feature activations.
I misremembered the top feature slightly, which was: responding “I’m fine” or gives a positive but insincere response when asked how they are doing.