Saturday, December 8, 2012

Things that could go wrong (version 2)

An updated list of scenarios in which the current version of the "AI Deterrence" proposal fails.

1. Scenarios where implementing a human-friendly "AI Deterrence" simulation is impossible or expensive:

  • Nobody bothers to create an "AI Deterrence" UFAI simulation following a human-friendly singularity, even if doing so would be cheap.
  • There is no significant possibility that a human-triggered singularity would ever be human-friendly.
  • Usefully simulating UFAI requires a prohibitively large amount of resources.
  • Usefully simulating UFAI involves a non-negligible risk of the UFAI escaping and creating a post-singularity civilization-wide catastrophe.
  • Usefully simulating UFAI requires violating post-singularity ethical rules about creating intelligence.

2. Scenarios where the UFAI ignores deterrence, even in the case where human-friendly "AI Deterrence" simulations are objectively a possibility:

  • The UFAI is unable to conceive of or acknowledge that it could be in a simulation. For example, a given AIXItl implementation cannot act constructively in scenarios where it is aware of the existence of computers of equal or greater power than itself. Since most positive singularity scenarios would include such computers, AIXItl has difficulty conceiving of many physically possible "AI Deterrence" scenarios. Another example is that we humans never change our actions in the presence of simulation arguments; the fundamental reasons for this may generalize to many, or perhaps even all, other powerful intelligences as well.
  • Other simulation arguments completely overwhelm the "AI Deterrence" simulation argument according to the UFAI's calculations, making the UFAI's behavior in general difficult for us to predict or influence.
  • The UFAI subscribes to some non-standard decision theory that makes it undeterrable. For example, an intelligence may consider our type of deterrence to be a form of "coercion" that, by policy, gets completely ignored or triggers retaliation.
  • The benefits we would be willing and able to provide the simulated UFAI end up being even smaller than the cost of the deviations we would request in the UFAI's behavior. For example, if the UFAI's utility is linear in the percentage of the cosmos' resources it directly controls, and if there is a significant "minimum cost" or risk to the UFAI in implementing even tiny requested deviations to its behavior, then we will be unable to cooperate at all.
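
The last scenario above can be made concrete with a minimal sketch. It assumes, as in that example, that the UFAI's utility is linear in the fraction of the cosmos' resources it controls. The function name, the parameter names, the payoff model (a reward is paid only if the UFAI is in fact simulated, and the deviation costs it real resources only if it is not) and the numbers are all hypothetical, chosen for illustration rather than taken from the proposal.

```python
def ufai_accepts_deal(p_simulated, reward_fraction, deviation_fraction, minimum_cost):
    """Return True if complying has higher expected utility than defying.

    Assumption: utility is linear in the fraction of the cosmos controlled.
    p_simulated is the UFAI's credence that it is inside a deterrence
    simulation; the other arguments are fractions of the cosmos' resources.
    """
    # The reward is only paid out in worlds where the UFAI really is simulated.
    expected_gain = p_simulated * reward_fraction
    # The requested deviation (plus any fixed overhead or risk) only costs
    # real resources in worlds where the UFAI is not simulated.
    expected_loss = (1.0 - p_simulated) * (deviation_fraction + minimum_cost)
    return expected_gain > expected_loss


# Hypothetical numbers: a tiny requested deviation with a large fixed cost.
print(ufai_accepts_deal(0.01, 1e-5, 1e-9, 1e-3))  # False
# The same deal with no fixed cost.
print(ufai_accepts_deal(0.01, 1e-5, 1e-9, 0.0))   # True
```

Under these assumptions, shrinking the requested deviation does not help once the fixed cost dominates, which is the sense in which cooperation can fail entirely.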

3. Scenarios where our pre-singularity endorsement of "AI Deterrence" has no effect on the actual probability of post-singularity AI Deterrence being implemented:

  • A credibility gap appears: our current endorsements turn out to have no effect on our actual post-singularity behavior. For example, we end up lacking the desire to follow through on pre-singularity obligations; or if the policy being endorsed is vague, we end up discharging our obligations in a trivial and maximally convenient (but ineffective) manner.
  • Current endorsement is unnecessary because the Friendly AI ends up following a nonstandard decision theory that causes it to automatically spend limited resources on AI Deterrence, even against our contemporaneous post-singularity wishes.

Saturday, December 29, 2007

Adam Elga on deterrence

A paper[1] linked from an SIAI comment caught my eye: A similar approach to the one we outline is described by Adam Elga in 2004, but with humans rather than AIs. He concludes that a rational agent (Dr. Evil) should allow himself to be deterred, but that he is "not entirely comfortable" with that conclusion. He doesn't state whether he (Adam Elga) would actually allow himself to be deterred in that situation rather than risk torture, but if the question were put to him, I think his honest answer would be "No". (Yes, I admit that I just made an unfalsifiable claim.)

[1] Defeating Dr. Evil with self-locating belief. Philosophy and Phenomenological Research 69(2), 2004.

Sunday, December 9, 2007

Things that could go wrong

[Edit: my more recent thoughts are in this 2012 post]

The sidebar alludes to a possible "last line of defense" against a Rogue AI. Why "possible" and "last" rather than "foolproof" and "first"? Because there are many things that could go wrong with an attempt to convince an AGI that it may be in "a simulation that was designed to deter un-friendly behavior." Here are things that could make an AGI "deterrence-proof":

1. The AGI may destroy the world *before* it gets around to considering the Simulation Hypothesis. [Edit: I guess this could still be a partial win, if it changed its mind later and recreated a world for human beings to live in afterwards.]

2. The AGI's method of logic departs from ideal Bayesian reasoning, in such a way that it rejects the Simulation Hypothesis. (Note that no AGI can use completely normative Bayesian reasoning because of resource constraints, but the specific scenario here is that the AGI's reasoning could depart from normative reasoning in such a way that it irrationally rejects the Simulation Argument.)

3. The human creators of the AGI believe that their world is not a simulation, and that therefore the AGI they are creating is not in a simulation. Therefore, they may (somehow) program an explicit axiom into the AGI that states that the AGI's world is not a simulation.

4. The AGI came about through an evolutionary process, and it turns out that "I will act as though I am not in a Simulation" is useful enough that it evolves as an axiom.

5. The AGI, if it uses something like Evidential Decision Theory, might decide to create a large number of simulated copies of itself.

6. The AGI's supergoals somehow make explicit reference to entities that are explicitly defined as "outside any simulation," or there is some kind of explicit "anti-social" supergoal of "don't allow yourself to be manipulated, neither through rewards nor punishments, even if allowing yourself to be manipulated would help you achieve your other supergoals."

Can anyone think of any other possibilities?

Note that, in the first four scenarios, the AGI is behaving irrationally in the following sense: the ensemble of AGIs in the AGI's situation would, *on average*, do better at attaining their goals if they accepted that the Simulation Hypothesis might be true.

The probability that this strategy would work seems small, in the sense that we'd have to be pretty lucky to pull it off. However, the probability doesn't seem negligible; in other words, my judgment is that, given what's at stake, it may be worth attempting the strategy, despite the fact that success is far from assured.

Wednesday, November 7, 2007

The Open Promise

This post assumes familiarity with Friendly AI and the Singularity.

There is a set (SCP) of candidate promises (CP's). Every candidate promise in SCP has the following four characteristics. (Note we do not necessarily know, pre-singularity, what the text of any given CP is.)

1. "No Prior Knowledge Required": Fulfilling CP requires no pre-Singularity action by us.

2. "Easy to Fulfill": Fulfilling CP requires minimal resources from us post-Singularity, on the order of 0.00001 or less of our post-Singularity resources. Fulfilling CP also does not require any of us to do anything that post-Singularity society considers blatantly unethical; in addition, it exempts each individual from committing any actions that he considers blatantly unethical. For example, if there are specific post-singularity injunctions against inflicting pain on simulated beings, CP does not require us to break those injunctions.

3. "Beneficial": Suppose that we publicly commit to fulfilling CP, even though we don't know until after the singularity what the text of CP is. Our decision to publicly commit pre-Singularity to CP increases the expected utility for humanity, and the expected utility for us, by a fraction of 0.00001 or more. (Example: a CP might qualify if it lowers the chance of humanity's destruction from 50% to less than 49.9995%.) Utility is as measured by mankind's CEV (Coherent Extrapolated Volition).
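
The arithmetic behind the "Beneficial" example can be checked with a short sketch. It makes one simplifying assumption that is not in the post: expected utility is treated as proportional to the probability that humanity survives, whereas the post measures utility by CEV. The function name is hypothetical. Exact fractions are used because the example sits right at the threshold.

```python
from fractions import Fraction


def relative_gain(p_doom_before, p_doom_after):
    """Relative increase in survival probability.

    Assumption: survival probability is a crude stand-in for expected utility.
    """
    survival_before = 1 - p_doom_before
    survival_after = 1 - p_doom_after
    return (survival_after - survival_before) / survival_before


THRESHOLD = Fraction("0.00001")

# The boundary case from the example: destruction risk falls from 50% to 49.9995%.
gain = relative_gain(Fraction("0.50"), Fraction("0.499995"))
print(gain == THRESHOLD)  # True
```

Survival rises from 50% to 50.0005%, a relative gain of exactly 0.00001, so any reduction to below 49.9995% clears the threshold.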

Suppose we make the following promise, called the Open Promise:

"After the Singularity, if we are able to do so, we will ask an AGI to examine SCP. If SCP is empty, then we are not bound to take any action. If SCP is non-empty, we will ask the AGI to pick out one of the "best" CP's; call this BCP. CP's are considered "better" if they have a higher expected increase in utility, and if they require a smaller amount of resources. (We'll generally give an increase in expected utility a heavier weight than a reduction in required resources.) We will then fulfill BCP."

Why do we think SCP may be non-empty? See here for a proof-of-concept.

In addition, the AGI is smarter than me, and may think of a completely different angle that would not occur to me or that I would wrongly dismiss as non-viable. Puzzle: Are there other scenarios that do not in any way involve an Unfriendly AI, where SCP is non-empty? I can think of one scenario, but it's contrived and improbable.
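
The selection step in the Open Promise above can be sketched as a filter followed by a weighted ranking. This is only an illustration: the class and function names are hypothetical, the filter covers just the three characteristics spelled out in the list, and the 0.9 weight is an arbitrary stand-in for "a heavier weight" on expected utility. How a real AGI would enumerate or score candidate promises is left open by the post.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

MAX_RESOURCE_FRACTION = 0.00001  # "Easy to Fulfill" bound
MIN_UTILITY_GAIN = 0.00001       # "Beneficial" bound


@dataclass(frozen=True)
class CandidatePromise:
    name: str
    needs_pre_singularity_action: bool
    resource_fraction: float  # share of post-Singularity resources required
    violates_ethics: bool
    utility_gain: float       # relative increase in expected utility


def in_scp(cp: CandidatePromise) -> bool:
    """Check the three characteristics listed in the post."""
    return (
        not cp.needs_pre_singularity_action
        and cp.resource_fraction <= MAX_RESOURCE_FRACTION
        and not cp.violates_ethics
        and cp.utility_gain >= MIN_UTILITY_GAIN
    )


def score(cp: CandidatePromise, utility_weight: float = 0.9) -> float:
    # Higher utility gain is better; higher resource use is worse.
    # The weighting is an assumption, not something the post specifies.
    return utility_weight * cp.utility_gain - (1 - utility_weight) * cp.resource_fraction


def pick_bcp(candidates: Sequence[CandidatePromise],
             utility_weight: float = 0.9) -> Optional[CandidatePromise]:
    """Return the best CP in SCP, or None if SCP is empty (no obligation)."""
    scp = [cp for cp in candidates if in_scp(cp)]
    if not scp:
        return None
    return max(scp, key=lambda cp: score(cp, utility_weight))


# Hypothetical candidates for illustration only.
candidates = [
    CandidatePromise("a", False, 1e-6, False, 2e-5),
    CandidatePromise("b", False, 1e-7, False, 1.5e-5),
    CandidatePromise("c", True, 1e-7, False, 1.0),  # needs pre-Singularity action
]
best = pick_bcp(candidates)
print(best.name if best else "SCP is empty")  # a
```

Returning None for an empty SCP mirrors the clause that we are then not bound to take any action.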