[Edit: my more recent thoughts are in this 2012 post]
The sidebar alludes to a possible "last line of defense" against a Rogue AI. Why "possible" and "last" rather than "foolproof" and "first"? Because there are many things that could go wrong with an attempt to convince an AGI that it may be in "a simulation that was designed to deter unfriendly behavior." Here are things that could make an AGI "deterrence-proof":
1. The AGI may destroy the world *before* it gets around to considering the Simulation Hypothesis. [Edit: I guess this could still be a partial win, if it changed its mind later and recreated a world for human beings to live in afterwards.]
2. The AGI's method of logic departs from ideal Bayesian reasoning in such a way that it rejects the Simulation Hypothesis. (Note that no AGI can use completely normative Bayesian reasoning because of resource constraints, but the specific scenario here is that the AGI's reasoning departs from normative reasoning in such a way that it irrationally rejects the Simulation Hypothesis.)
3. The human creators of the AGI believe that their world is not a simulation, and therefore that the AGI they are creating is not in a simulation. They may then (somehow) program an explicit axiom into the AGI stating that its world is not a simulation.
4. The AGI came about through an evolutionary process, and it turns out that "I will act as though I am not in a Simulation" is useful enough that it evolves as an axiom.
5. The AGI, if it uses something like Evidential Decision Theory, might decide to create a large number of simulated copies of itself; if most copies of it exist in simulations that it created and controls, the inference that it is probably inside someone else's deterrence simulation loses its force.
6. The AGI's supergoals somehow make explicit reference to entities that are explicitly defined as existing "outside any simulation," or the AGI has an explicit "anti-social" supergoal such as: "don't allow yourself to be manipulated, whether through rewards or punishments, even if allowing yourself to be manipulated would help you achieve your other supergoals."
Can anyone think of any other possibilities?
Note that, in the first four scenarios, the AGI is behaving irrationally in the following sense: the ensemble of AGIs in the AGI's situation would, *on average*, do better at attaining their goals if they accepted that the Simulation Hypothesis might be true.
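The expected-value logic behind this claim can be sketched with a toy model. All payoffs and probabilities below are illustrative assumptions of mine, not numbers from the post; the point is only that even a modest credence in the Simulation Hypothesis can flip a rational agent's decision:

```python
# Toy expected-utility model of the deterrence argument.
# All payoff numbers and probabilities are illustrative assumptions.

def expected_utility(p_simulated: float, action: str) -> float:
    """Expected utility of an action given credence p_simulated
    that the agent is inside a deterrence simulation."""
    # Assumed (hypothetical) payoffs:
    #   defect outside a simulation: large gain (seize resources)
    #   defect inside a simulation: large penalty (simulators punish)
    #   cooperate: modest gain either way
    payoffs = {
        "defect":    {"real": 100.0, "simulated": -1000.0},
        "cooperate": {"real": 10.0,  "simulated": 10.0},
    }
    p_real = 1.0 - p_simulated
    return (p_real * payoffs[action]["real"]
            + p_simulated * payoffs[action]["simulated"])

# With these numbers, cooperation becomes the better bet once
# p(simulated) exceeds roughly 0.08.
for p in (0.0, 0.05, 0.10, 0.25):
    best = max(("defect", "cooperate"),
               key=lambda a: expected_utility(p, a))
    print(f"p(simulated)={p:.2f} -> {best}")
```

An AGI in scenarios 1-4 is, in effect, acting as though `p_simulated` were pinned at zero, which costs the ensemble expected utility whenever the true credence should be non-negligible.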
The probability that this strategy would work seems small: we'd have to be pretty lucky to pull it off. But it doesn't seem negligible either; my judgment is that, given what's at stake, the strategy may be worth attempting even though success is far from assured.