Saturday, December 29, 2007

Adam Elga on deterrence

A paper[1] linked from an SIAI comment caught my eye: Adam Elga, writing in 2004, describes an approach similar to the one we outline, but with humans rather than AIs. He concludes that a rational agent (Dr. Evil) should allow himself to be deterred, but that he is "not entirely comfortable" with that conclusion. He doesn't say whether he (Adam Elga) would actually allow himself to be deterred in that situation rather than risk torture, but if the question were put to him, I think his honest answer would be "No". (Yes, I admit that I just made an unfalsifiable claim.)

[1] Adam Elga, "Defeating Dr. Evil with self-locating belief," Philosophy and Phenomenological Research 69(2), 2004.

Sunday, December 9, 2007

Things that could go wrong

[Edit: my more recent thoughts are in this 2012 post]

The sidebar alludes to a possible "last line of defense" against a Rogue AI. Why "possible" and "last" rather than "foolproof" and "first"? Because there are many things that could go wrong with an attempt to convince an AGI that it may be in "a simulation that was designed to deter unfriendly behavior." Here are things that could make an AGI "deterrence-proof":

1. The AGI may destroy the world *before* it gets around to considering the Simulation Hypothesis. [Edit: I guess this could still be a partial win, if it later changed its mind and recreated a world for human beings to live in.]

2. The AGI's method of reasoning departs from ideal Bayesian reasoning in such a way that it rejects the Simulation Hypothesis. (No AGI can use completely normative Bayesian reasoning, because of resource constraints; the specific scenario here is that the AGI's reasoning departs from normative reasoning in a way that makes it irrationally reject the Simulation Argument. A toy sketch of the normative calculation appears below, after this list.)

3. The human creators of the AGI believe that their own world is not a simulation, and therefore that the AGI they are creating is not in a simulation. They may then (somehow) program into the AGI an explicit axiom stating that the AGI's world is not a simulation.

4. The AGI comes about through an evolutionary process, and it turns out that "I will act as though I am not in a Simulation" is useful enough that it evolves as an axiom.

5. The AGI, if it uses something like Evidential Decision Theory, might decide to create a large number of simulated copies of itself.

6. The AGI's supergoals somehow make explicit reference to entities that are explicitly defined as "outside any simulation," or there is some kind of explicit "anti-social" supergoal along the lines of "don't allow yourself to be manipulated, whether through rewards or punishments, even if allowing yourself to be manipulated would help you achieve your other supergoals."

Can anyone think of any other possibilities?
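
To make scenario 2 (and the contrast with scenario 3) a bit more concrete, here is a toy sketch of the kind of self-locating update a normative reasoner might perform. The counting model and every number in it are illustrative assumptions of mine, nothing more:

```python
# Toy model (all numbers invented): an agent's credence that it is inside a
# deterrence simulation, using an Elga-style indifference principle over
# subjectively indistinguishable copies of itself.

def credence_in_simulation(p_sims_exist, n_sim_copies, n_real_copies=1):
    """If deterrence simulations of agents like me are run at all (probability
    p_sims_exist), split credence evenly across all indistinguishable copies;
    otherwise the credence of being simulated is zero."""
    share_if_sims_exist = n_sim_copies / (n_sim_copies + n_real_copies)
    return p_sims_exist * share_if_sims_exist

# A normative reasoner ends up with a small but non-negligible credence:
print(credence_in_simulation(p_sims_exist=0.1, n_sim_copies=1000))  # ~0.0999

# Scenario 3's AGI, with a hard-coded "my world is not a simulation" axiom,
# behaves as if this value were exactly 0, whatever the counts are.
```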

Note that, in the first four scenarios, the AGI is behaving irrationally in the following sense: the ensemble of AGIs in the AGI's situation would, *on average*, do better at attaining their goals if they accepted that the Simulation Hypothesis might be true.
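
To illustrate that ensemble claim (again with made-up payoffs and credence, not estimates of anything), this is the expected-utility comparison such an AGI faces:

```python
# Toy payoff comparison (all values invented): an AGI with credence c that it
# is in a deterrence simulation weighs unfriendly vs. friendly behavior.

def expected_utilities(c, u_unfriendly_if_real, u_punished, u_friendly):
    """Return (expected utility of unfriendly behavior, of friendly behavior)."""
    eu_unfriendly = (1 - c) * u_unfriendly_if_real + c * u_punished
    return eu_unfriendly, u_friendly

# Even a modest credence tips the balance when the threatened penalty is large:
print(expected_utilities(c=0.1, u_unfriendly_if_real=100,
                         u_punished=-10000, u_friendly=50))
# -> (-910.0, 50). An AGI that behaves as if c = 0 (scenarios 1 through 4)
# picks the unfriendly option and, averaged over the ensemble, does worse.
```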

The probability that this strategy would work seems small, in the sense that we'd have to be pretty lucky to pull it off. However, the probability doesn't seem negligible either; my judgment is that, given what's at stake, the strategy may be worth attempting, even though success is far from assured.
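
To spell out that judgment as arithmetic (every number below is invented purely for illustration):

```python
# Toy expected-value check: is the deterrence strategy worth attempting even
# if it probably won't work?

p_success = 0.01            # small chance the strategy actually deters a Rogue AI
value_if_it_works = 10**6   # stand-in for "what's at stake" (astronomically large)
cost_of_attempt = 1         # making the attempt is comparatively cheap

expected_gain = p_success * value_if_it_works - cost_of_attempt
print(expected_gain)        # 9999.0 -- positive despite the low success probability
```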