Spinning Language Models: backdooring AI learning to output propaganda
We investigate a new threat to neural sequence-to-sequence (seq2seq) models: training-time attacks that cause models to “spin” their outputs so as to support an adversary-chosen sentiment or point of view — but only when the input contains adversary-chosen trigger words. For example, a spinned summarization model outputs positive summaries of any text that mentions the Read more about Spinning Language Models: backdooring AI learning to output propaganda[…]