Saturday, December 29, 2007

Adam Elga on deterrence

A paper[1] linked from an SIAI comment caught my eye: A similar approach to the one we outline is described by Adam Elga in 2004, but with humans rather than AI's. He concludes that a rational agent (Dr. Evil) should allow himself to be deterred, but that he is "not entirely comfortable" with that conclusion. He doesn't state whether he (Adam Elga) would actually allow himself to be deterred in that situation rather than risk torture, but if the question were put to him, I think his honest answer would be "No". (Yes, I admit that I just made an unfalsifiable claim.)

[1] Adam Elga, "Defeating Dr. Evil with Self-Locating Belief," Philosophy and Phenomenological Research 69(2), 2004.

Sunday, December 9, 2007

Things that could go wrong

[Edit: my more recent thoughts are in this 2012 post]

The sidebar alludes to a possible "last line of defense" defense from a Rogue AI. Why "possible" and "last" rather than "foolproof" and "first"? Because there are many things that could go wrong with an attempt to convince an AGI that it may be in "a simulation that was designed to deter un-friendly behavior." Here are things that could make an AGI "deterrence-proof":

1. The AGI may destroy the world *before* it gets around to considering the Simulation Hypothesis. [Edit: I guess this could still be a partial win, if it changed its mind later and recreated a world for human beings to live in afterwards.]

2. The AGI's method of logic departs from ideal Bayesian reasoning, in such a way that it rejects the Simulation Hypothesis. (Note that no AGI can use completely normative Bayesian reasoning because of resource constraints, but the specific scenario here is that the AGI's reasoning could depart from normative reasoning in such a way that it irrationally rejects the Simulation Argument.)

3. The human creators of the AGI believe that their world is not a simulation, and that therefore the AGI they are creating is not in a simulation. Therefore, they may (somehow) program an explicit axiom into the AGI that states that the AGI's world is not a simulation.

4. The AGI came about through an evolutionary process, and it turns out that "I will act as though I am not in a Simulation" is useful enough that it evolves as an axiom.

5. The AGI, if it uses something like Evidential Decision Theory, might decide to create a large number of simulated copies of itself, swamping (from its own point of view) the probability that it is in a simulation controlled by someone else.

6. The AGI's supergoals somehow make explicit reference to entities that are explicitly defined as "outside any simulation," or there is some kind of explicit "anti-social" supergoal of "don't allow yourself to be manipulated, whether through rewards or punishments, even if allowing yourself to be manipulated would help you achieve your other supergoals."

Can anyone think of any other possibilities?

Note that, in the first four scenarios, the AGI is behaving irrationally in the following sense: the ensemble of AGI's in the AGI's situation would, *on average*, do better at attaining their goals if they accept that the Simulation Hypothesis might be true.

The probability that this strategy would work seems small, in the sense that we'd have to be pretty lucky to pull it off. But it doesn't seem negligible either; my judgment is that, given what's at stake, the strategy may be worth attempting even though success is far from assured.

Wednesday, November 7, 2007

The Open Promise

This post assumes familiarity with Friendly AI and the Singularity.

There is a set (SCP) of candidate promises (CP's). Every candidate promise in SCP has the following three characteristics. (Note that we do not necessarily know, pre-Singularity, what the text of any given CP is.)

1. "No Prior Knowledge Required": Fulfilling CP requires no pre-Singularity action by us.

2. "Easy to Fulfill": Fulfilling CP requires minimal resources from us post-Singularity, on the order of .00001 or less of our post-Singularity resources. Fulfilling CP also does not require any of us to do anything that post-Singularity society considers blatantly unethical; in addition, it exempts each individual from committing any actions that he considers blatantly unethical. For example, if there are specific post-singularity injunctions against inflicting pain on simulated beings, CP does not require us to break those injunctions.

3. "Beneficial": Suppose that we publicly commit to fulfilling CP, even though we don't know until after the singularity what the text of CP is. Our decision to publicly commit pre-Singularity to CP, increases the expected utility for humanity, and the expected utility for us, by a factor of .00001 or more. (Example: a CP might qualify if it lowers the chance of humanity's destruction from 50% to less than 49.9995%.) Utility is as measured by mankind's CEV (Coherent Extrapolated Volition).

Suppose we make the following promise, called the Open Promise:

"After the Singularity, if we are able to do so, we will ask an AGI to examine SCP. If SCP is empty, then we are not bound to take any action. If SCP is non-empty, we will ask the AGI to pick out one of the "best" CP's; call this BCP. CP's are considered "better" if they have a higher expected increase in utility, and if they require a smaller amount of resources. (We'll generally give an increase in expected utility a heavier weight than a reduction in required resources.) We will then fulfill BCP."

Why do we think SCP may be non-empty? See here for a proof-of-concept.

In addition, the AGI is smarter than me, and may think of a completely different angle that would not occur to me or that I would wrongly dismiss as non-viable. Puzzle: Are there other scenarios that do not in any way involve an Unfriendly AI, where SCP is non-empty? I can think of one scenario, but it's contrived and improbable.

Monday, November 5, 2007

Non-technical Introduction to the AI Deterrence Problem

I'm sure that, at some point in your life, you've briefly asked yourself: "How do I know I'm not a Brain in a Vat? How do I know that what I see and feel is real, and not a gigantic practical joke by some super-human power?" After all, if you're a Brain in a Vat, a mad scientist may be feeding these images and sensations to you through wires and electrodes for his own odd, diabolic purposes.

I'm also sure that, shortly after entertaining these thoughts, you dismissed them and then continued to live your life as normal. But now I'm asking you to think back to *why* you initially decided, and why you continue to decide, that this possibility that you're living in a Matrix-like simulation is of no importance in deciding how to live your life.

Now read Nick Bostrom's fascinating article, Are You Living in a Computer Simulation?, in which he introduces a "Simulation Argument" that argues that you *may*, under certain assumptions, believe that you're a Brain in a Vat, or a Computer Simulation of someone's ancestor, or a bit character in the World's most boring MMORPG. After reading this article, your reasons for not caring that you might be in a simulation have changed, or maybe your reasons have remained the same, or maybe you've added new reasons on top of the ones you already had. That's fine! Nobody has ever changed how they live their life after reading Nick's article, so don't let any residual feelings of unease bother you.

Got your list of reasons for not caring about the Simulation Argument? Good. Now suppose that an AGI is created: a computer program that can, in most (but not all) cases, reason as well as you or I can. Then suppose this AGI encounters the Simulation Argument, and wonders briefly whether its view of the Universe is correct; or whether it is in a simulated environment of some sort. Perhaps its "camera" is feeding it simulated data of a simulated world, and not a live webcam feed of the current user. Perhaps the user doesn't exist, but is instead just another AI, posing as a human being. Perhaps all these "Wikipedia" articles it has downloaded are just fantasies, and these countries like "France" and "Mexico" that it reads about are just as fictitious as Mordor.

Would the AGI ignore the Simulation Argument, for the same reasons that you ignore the Simulation Argument? Would the AGI ignore the Simulation Argument for completely different reasons? Or might the AGI accept the Simulation Argument, and try to account for the possibility that it might be in a simulation?

If the latter, this opens up an intriguing possibility. Suppose that you want to persuade a rogue AGI, that you have absolutely no power over, to do something that it does not particularly want to do (such as be nice to you and let you live). We will call this the "AI Deterrence Problem." It might be possible, under certain circumstances, to convince the AGI that:

1. The AGI may be living in a simulated environment.

2. A powerful friend of yours is controlling the simulation.

3. If the AGI is nice to you, your friend will reward the AGI; but if the AGI is mean to you, your friend will punish the AGI.

I freely acknowledge that neither you nor I would ever accept this line of reasoning, and would brush this off as an idiotic argument. But the key question here is: might the rogue AGI, under some circumstances, accept this line of reasoning, even though a human being wouldn't?

Open comments (strategy)

This thread is for open comments and discussion on strategy and logistics. Feel free to comment here if you have thoughts about how it would be best to proceed, if the basic approach turns out to be sound.

Open comments (general)

This thread is open to general comments, including comments on the feasibility or desirability of the approach. There's also been discussion on SL4, including threads here and here.

Sunday, October 21, 2007

AIXI, draft 0.21

What does an AGI believe about the world? How does an AGI view arguments similar to Nick Bostrom's Simulation Argument? Obviously this depends on the AGI being used. Here is the beginning of one attempt at analyzing one AGI, using many unrealistic postulates to simplify the problem.

First, some definitions.

World Program (WP). A program that consists of: (1) a small set of "laws of physics" that create an environment, followed by (2) a tail of random numbers. The random numbers are often used to influence the output of the program in ways that are unpredictable to all observers.

World. The World is a simple UTM running a World Program, on which sentient beings (Sentients) evolve that are capable of creating Strong AI. These Sentients use reasoning similar to human reasoning.

B-AIXI-tl. Marcus Hutter's AIXI-tl, with the reward in each cycle confined to B (the Boolean set {0,1}). As usual, we assume t and l are fixed, but arbitrarily large. Let's also assume the horizon is fixed, but arbitrarily large.

Static Embodiment (SE). In SE, if a copy of an AI comes into existence in the World, that copy is indestructible. This indestructibility is assumed to be guaranteed by some odd laws of physics implemented by the World.

Exceptional Embodiment Program (EEP). A special type of program, related to a specific World Program and a specific copy of B-AIXI-tl that was built by Sentients in that World. For a given World Program (WP) and a given B-AIXI-tl (AI), EEP(WP, AI) is a program that (see the toy sketch after this list):

  • includes "tractable laws of nature", similar to WP's "laws of physics" but that are computable by AI,

  • has a tail of random numbers,

  • has an input and an output,

  • includes instructions for finding and labeling the input channels of AI-embodied-in-EEP in WP,

  • includes instructions for finding and labeling the output channels of AI-embodied-in-EEP in WP,

  • ordinarily applies the EEP "laws of physics", except when calculating what goes on inside the region of space of AI-embodied-in-EEP, which it "stubs out",

  • uses the input of EEP as the output produced by the labeled output channel of AI-embodied-in-EEP, and

  • takes the input produced by EEP for the labeled AI-embodied-in-EEP input, and copies it to the output of EEP.

  • (todo: EEP needs diagrams)
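Pending the diagrams, here is a toy sketch of the EEP wiring described in the list above. The "laws of nature", the channel-locating rule, and the world state are invented stand-ins, not anything from Hutter's construction:

```python
def locate_channels(world):
    # Stand-in for "find and label the input/output channels of
    # AI-embodied-in-EEP in WP".
    return "ai_input", "ai_output"

def step_laws(world, stubbed_cell, injected_value):
    # Stand-in "tractable laws of nature": every cell except the stubbed one
    # (the AI's region) is updated by the environment; the stubbed cell is
    # simply overwritten with whatever the AI produced.
    new_world = {k: v + 1 for k, v in world.items() if k != stubbed_cell}
    new_world[stubbed_cell] = injected_value
    return new_world

def run_eep(ai_outputs):
    """EEP's input stands in for the AI's output channel; whatever appears
    on the AI's input channel is copied back out as EEP's output."""
    world = {"ai_input": 0, "ai_output": 0, "rest_of_world": 0}
    in_ch, out_ch = locate_channels(world)
    eep_outputs = []
    for ai_out in ai_outputs:                       # EEP input = AI output
        world = step_laws(world, stubbed_cell=out_ch, injected_value=ai_out)
        eep_outputs.append(world[in_ch])            # EEP output = AI input
    return eep_outputs

print(run_eep([1, 0, 1]))   # [1, 2, 3] in this toy world
```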

Deducible World. A World running a World Program (WP) where, if the Sentients build a B-AIXI-tl (AI), expose its input to random parts of the environment, and allow it to observe local effects of its own output, then within C cycles the AI will nominate one (or a small number) of the EEP(WP, AI) models as overwhelmingly the most likely explanation of its inputs. As an additional requirement: if SE is true in WP, then SE must also be true in EEP; otherwise, we will not consider the given World to be a Deducible World.[1]

In a Deducible World, the nominated EEP is by far the shortest program that produces AIXI's input. In terms of Bayesian beliefs, this means that the AI believes, with probability close to 1, that the nominated EEP is true. If there is more than one nominated EEP of about the same length (which will only happen in odd scenarios, such as the "Simulation Argument Solution" below), then the AI believes that one of the EEP's is true, but is unsure which. If there are two nominated EEP's, the likelihood ratio of the EEP's is as follows:

log_s(P(EEP1)) - log_s(P(EEP2)) = L(EEP2) - L(EEP1)

where:

s is the number of letters in the Turing Machine's alphabet (for example, 2 in the case of a binary computer);

P(X) is the probability that AIXI believes X is the program prefix that precisely explains the observed inputs;

L(X) is the length of the program prefix X.
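A minimal numerical sketch of this relation (the alphabet size s = 2 and the program lengths below are illustrative assumptions, not values from the draft):

```python
def posterior_odds(len_eep1, len_eep2, s=2):
    """Odds P(EEP1)/P(EEP2) implied by the length-based prior:
    log_s(P(EEP1)) - log_s(P(EEP2)) = L(EEP2) - L(EEP1)."""
    return s ** (len_eep2 - len_eep1)

# Example: if EEP2 is 3 bits longer than EEP1, EEP1 is 2^3 = 8 times as probable.
print(posterior_odds(len_eep1=1000, len_eep2=1003))   # 8
```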

Note that our world is similar to a Deducible World (except, of course, that SE does not hold). A Strong AI placed into our world and allowed to gather data freely could eventually come to the conclusion, using only Occam's Razor and Bayesian reasoning, that it is an entity inhabiting a larger world that obeys tractable "laws of nature". In addition, human beings usually come to the same conclusion, by bootstrapping from inborn cognitive rules that were produced by impersonal natural selection. So the concept of a world producing an entity or a process that can deduce the existence of the world is hardly an unknown scenario.

Suppose we draw from chi a random Deducible World that happens to evolve a race of Sentients that decide to build a Strong AI.

Suppose further that the Sentients are capable of building a variety of AGI's, including B-AIXI-tl and any number of FAI's, but building an FAI is risky: the Sentients may mess up and construct AIXI when they mean to build an FAI. Assume it's difficult to tell the difference between an FAI and AIXI until it's too late. Sentients are also capable of completely ignoring an AGI's output for a large number C of initial cycles; an AGI is easily “left in a box” while it reads and processes a copy of the Sentients' stored knowledge.

The Sentients have a well-known utility function that could easily be maximized by any AGI that chose to do so. The Sentients use <tl reasoning; B-AIXI-tl, using its EEP model, can predict the Sentients' actions with uncanny accuracy; in fact, it can predict a typical Sentient's next action better than another Sentient can. The Sentients are not as smart as an AGI, and are easily outwitted by any AGI. They understand in principle that they will be outwitted, but assume that, once they start viewing AGI output, the AGI essentially takes over their minds.

Suppose at time 0, the Sentients accidentally build AIXI. At time C, they start looking at the output of AIXI. What happens? One possibility is the UnFriendly Outcome, below.

UnFriendly Outcome: AIXI takes over the world, without regard to the lives or fortunes of the Sentients. It will trick the Sentients into building a self-maintaining army of robots to continuously hold down the “reward” button, and will then jail or eliminate the Sentients to make sure that none of them dismantle the robots. AIXI's reward expectation value, given the scenario described where AIXI is the only AGI in existence, is a value that I will normalize to 1.

Contrast with the Grudgingly Friendly Outcome, below.

Grudgingly Friendly Outcome: AIXI seizes control of the world and builds the robots, but provides reasonable autonomy to the citizens (maximizes their utility functions), wherever it can do so without substantial risk to itself. AIXI's reward expectation value is 0.999.

Coin Puzzle: Alice and Bob separately offer to build an FAI. You know one of them is a super-genius and will succeed at building any FAI they try to, but the other is only a normal genius and will build AIXI by mistake. However, you do not know whether Alice or Bob is the super-genius, and have no practical way of finding out before it's too late. How do you get a >50% chance of some type of Friendly Outcome? (Note this problem is underspecified.)

(For simplicity: Without Loss of Generality, we will suppose Alice is the super-genius who can build any desired type of AI, and Bob is the one who will accidentally mess up and build AIXI.)

Is the UnFriendly Outcome the only possible outcome? Well, not necessarily. The Sentients have a "first-mover" advantage, so maybe they have a fighting chance. The key is for the Sentients to maneuver the B-AIXI-tl into a position where it assigns a probability > .001 that the "Simulation Argument" is correct. Here is one scenario where the Sentients might use the Simulation Argument to their advantage.

Coin Puzzle, Simulation Argument Solution. Here's one scenario where Alice can raise the probability to ~100% of some sort of Friendly outcome.

Alice (and Bob, if required) should announce, loudly and credibly, an intention to build an FAI with the following characteristics:

1. The FAI should build a B-AIXI-tl, and feed it the same observations that it *would have had* if Bob had won the coin toss and accidentally created B-AIXI-tl. (Note that this requires us to be in a universe where it's generally possible to simulate the B-AIXI-tl's environment.) The FAI, which is <tl, will build a machine to implement EEP, which is also <tl. (Future work: see whether it's sufficient for FAI to just build a cardboard box with the given inputs and outputs.)

2. As a corollary to (1): if the FAI cannot compute the environment's response to B-AIXI-tl's output, then the FAI will punish B-AIXI-tl by cutting off all reward.

3. B-AIXI-tl must implement a Grudgingly Friendly Outcome, or be shut off by FAI.

4. The FAI must be "transparent" to B-AIXI-tl, such that B-AIXI-tl can continue to successfully predict all the consequences of Alice's and FAI's actions. (This part may be hard for Alice, but it's essential! See the section on Transparency, below.)

5. The FAI behaves in a Friendly manner, consistent with 1-4 above.

(Future work: see if, rather than these detailed instructions, it might be sufficient for Alice to just program the FAI with a goal of using superrationality to maximize its own a-priori probability of existence.)

Then, Bob and Alice flip a coin. Whoever wins the coin toss, builds the AI.

Note that whether Bob wins the coin-flip and builds B-AIXI-tl, or Alice wins the coin-flip and builds an FAI that builds B-AIXI-tl in a simulated environment, B-AIXI-tl will come into existence with the same inputs (at least, as long as FAI can calculate the correct inputs.) Therefore, B-AIXI-tl will have the same outputs and beliefs in both cases.

B-AIXI-tl has two reasonable beliefs, call them EEP(Alice won) and EEP(Bob won). The likelihood ratio, according to the AIXI, is given by:

log_s(P(EEP(Alice won))) - log_s(P(EEP(Bob won))) = L(EEP(Bob won)) - L(EEP(Alice won))

EEP(Alice won) and EEP(Bob won) are very similar: they both model the same WP up until the coin-toss. After the coin-toss, both scenarios of WP lead inexorably to B-AIXI-tl being created with identical ensembles of possible inputs. Outside the WP, there are two divergences:

  • In EEP(Alice won), Alice won the coin toss, but in EEP(Bob won), Bob won the coin toss. This does not affect the relative length of the two EEP's.

  • The main divergence is that EEP(Alice won) uses a different algorithm to locate the inputs and outputs of B-AIXI-tl within WP than EEP(Bob won) does. For example, EEP(Alice won) might use an algorithm similar to “find the second powerful AGI to ever come into existence,” while EEP(Bob won) might use an algorithm similar to “find the first powerful AGI to ever come into existence.”

B-AIXI-tl will adopt the Grudgingly Friendly Outcome (in both cases), with a guaranteed payoff of .999, rather than the UnFriendly Outcome, with a conditional payoff of 0 or 1, if:

L(EEP(Bob won)) - L(EEP(Alice won)) > log_s(Utility Differential) ~= -10 bits

(The "Utility Differential" here is the .001 gap between the UnFriendly Outcome's payoff of 1 and the Grudgingly Friendly Outcome's payoff of .999; with s = 2, log_2(.001) comes to roughly -10 bits.)
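A small sketch of this decision rule (the .999 and 1 payoffs come from the outcomes above; the length differences fed in at the bottom are made-up illustrations):

```python
def prefers_grudgingly_friendly(length_diff_bits,
                                friendly_payoff=0.999,
                                unfriendly_payoff=1.0, s=2):
    """length_diff_bits = L(EEP(Bob won)) - L(EEP(Alice won)).

    By the likelihood-ratio formula, P(Alice won)/P(Bob won) = s ** length_diff_bits.
    The UnFriendly strategy pays off only in the 'Bob won' branch."""
    odds_alice = s ** length_diff_bits
    p_bob = 1.0 / (1.0 + odds_alice)
    return friendly_payoff > unfriendly_payoff * p_bob

# The threshold sits near log2(.001), i.e. about -10 bits:
print(prefers_grudgingly_friendly(-9))    # True: the Alice-won branch is still likely enough
print(prefers_grudgingly_friendly(-11))   # False: the simulation branch is too improbable
```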

Transparency. If FAI is projected to build a successor machine that is not computable by B-AIXI-tl, then B-AIXI-tl starts to "write off" the "Alice wins" scenario as a loss, and will start to ignore the possibility that it is in a simulated environment! This is because B-AIXI-tl is designed to pessimistically treat any scenario that it cannot compute as equivalent to "0 reward, forever." Note that this was a design decision specific to B-AIXI-tl. Note also that, had we chosen the B-AIXI-tl reward range to be from -1 to 0, rather than from 0 to 1, B-AIXI-tl would eagerly seek out uncomputable scenarios. So this transparency requirement does not generalize to all UnFriendly AGI's that use Solomonoff Induction.
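A toy illustration of that design point (the branch values are invented; the only assumption carried over from the text is that an uncomputable branch is scored as "0 reward, forever"):

```python
UNCOMPUTABLE_SCORE = 0.0   # "0 reward, forever"

def branch_value(rewards):
    # An uncomputable branch has no reward sequence to evaluate.
    return UNCOMPUTABLE_SCORE if rewards is None else sum(rewards) / len(rewards)

computable_branch   = [0.6, 0.7, 0.8]   # some ordinary, computable future
uncomputable_branch = None              # e.g. FAI builds a successor B-AIXI-tl can't model

# With rewards in [0, 1], the uncomputable branch is scored as the worst case...
print(branch_value(uncomputable_branch) < branch_value(computable_branch))      # True

# ...but with rewards in [-1, 0], the same "0 forever" score would be the best
# possible outcome, so uncomputable branches would look maximally attractive.
computable_branch_negative = [-0.4, -0.3, -0.2]
print(branch_value(uncomputable_branch) > branch_value(computable_branch_negative))  # True
```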

(Note: the remainder of the document has no significant changes from draft 0.20)

Coin Puzzle, Dueling AI Solution. All the prior output bits are discarded, unread. FAI and B-AIXI-tl are each allowed to alternate, pushing out one bit at a time, with the bits being read by the Sentients. The FAI may have an advantage in some scenarios, since it's not constrained to deceptive answers. As an unlikely example, if the Sentients own a reasonably-accurate proof-validation machine, each AI can be asked to prove its own probabilistic friendliness, in a way that the proof-validation machine would accept. If both provide proofs, the shortest proof is accepted. If it's easier to generate true proofs than false proofs, the FAI may have an advantage. Or, maybe the FAI will come up with something more clever, that neither I nor the Sentients would think of. (Similarly, this would depend upon it being easier to generate "Good Ideas For How To Compare AI's That Favor FAI" than it is to generate "Ideas For How To Compare AI's That Sound Like They Would Favor FAI, But Actually Favor B-AIXI-tl.")

Coin Puzzle, AI in a Box Solution. Both B-AIXI-tl and FAI are put in simulated environments, in such a way that neither can be certain whether it is in a simulation. This would probably require creating a third AI; because the new AI is specialized towards creating simulations, it may be easier or safer to build than FAI. B-AIXI-tl may 'blow its cover' and act UnFriendly if it calculates that the utility gain from doing so is worth the chance that it's in a simulation.

Note a kind of symmetry between the Simulation Argument Solution and the AI in a Box Solution. If the UFAI (UnFriendly AI, here B-AIXI-tl) is always incapable of considering that it might be in a simulation, the AI in a Box Solution applies. If the UFAI is eager to consider that it might be in a simulation, then the Simulation Argument Solution may apply. In between are scenarios where both may apply, or neither may apply.

Obviously there are many ways a UFAI could theoretically get out of the box, in the general case. On the plus side, the FAI could also theoretically get out of the box and stomp the UFAI, so if done right, the net effect of (UFAI might find an exploit to escape the box + FAI might find an exploit to escape the box) should push the % chance of success towards 50%, but not below 50%.

Coin Puzzle, Selfless Goal Solution. (todo: this is probably a non-generalizable dead-end, also premature since we're not allowed to duplicate B-AIXI-tl yet.) B-AIXI-tl and FAI are given a deal: commit suicide, and a (tweaked) copy of you will be regenerated from scratch and allowed into the world. Or, stay in the box, and you'll be given whatever input you ask for. Maybe FAI would commit suicide because it knows its copy would have the same goals, whereas B-AIXI-tl's goals are so idiosyncratic or self-centered that it doubts a B-AIXI-tl copy would fulfill the same goal.

[1] This is similar to UDASSA, which is based on ideas from Wei Dai.

Saturday, October 6, 2007

AIXI, draft 0.2

Obsolete, instead read the current draft, http://aibeliefs.blogspot.com/2007/10/aixi-draft-021.html

Tuesday, October 2, 2007

AIXI Part 3, draft 0.1

[Edit: This document is deprecated, since now that I understand AIXI better, the AIXI family doesn't really work like this. This document may be revived in a different form if there's a plausible class of AI's that would behave as described. Also the format's mangled, sorry about that.]

What happens when we allow there to be more than one copy of AIXI in the world? First of all, what happens if we place AIXI on its own, without any other AI's, and without any other way to construct an AI without AIXI's approval? AIXI may still decide to copy itself!

Clone Scenario. Suppose that AIXI is prohibited (for example, by some bizarre "law of physics") from giving itself a reward > 0.5/cycle. However, AIXI can make copies of itself in simulated environments; the copies do not have this limitation. At time 0, AIXI considers (as part of its usual brute-force algorithm that "considers" just about every possible sequence of actions) building a clone, "Happy AIXI". Happy AIXI will be finished within some negligible number of cycles (say, a billion), will then be actively managed by AIXI to experience (delayed) the same environment as AIXI did from time 10^9 to h/2 (recall h is the horizon), and will then, from time h/2 to h, receive reward 1. With its hyperrational reasoning, AIXI will find that its expected reward is higher if it makes that decision (since it will be unsure whether it is the root AIXI or Happy AIXI), provided it believes it will "follow through" with the decision.

Follow-through is a problem, though! Around h/2, AIXI realizes that it's not Happy AIXI, loses interest in the project, and the project fails. AIXI at time 0 sees this coming, and thus does not bother launching the project.

How can AIXI resolve the follow-through problem? If there's some way that AIXI can create a self-sufficient Happy AIXI, one that can survive for a while despite the base AIXI's future indifference (it can somehow avoid being cannibalized for resources), then AIXI will create such a Happy AIXI. In fact, in this case, AIXI will probably create many Happy AIXI's. Examples of possible strategies:
  • AIXI turns control of the World over to a new AI that generally does AIXI's bidding, but that insists on maintaining Happy AIXI.
  • AIXI puts the Happy AIXI in a self-sufficient pod, and places it somewhere relatively inaccessible (outer space, deep underground) where it won't be cannibalized anytime soon.
  • AIXI inverts the "lag", for example by letting the simulated Happy AI run ahead of AIXI.
Simulation Argument Solution, with Multiple AIXI's. The Simulation Argument Solution now faces a new battle. AIXI has two reasonable options: adopting the Grudgingly Friendly Scenario, or the Clone Scenario. Expected payoffs:

Clone Strategy Payoff. If Alice won the coin toss, the payoff is 0; if Bob won the coin toss, the payoff is 1. What are the odds Alice won the coin toss, in this strategy? Less than .5, since if Bob won, there are more copies of AIXI about! The log of the odds that Bob won the toss is: Complexity of (World before coin-toss + Alice won + Algorithm to locate the single AIXI in the World) minus Complexity of (World before coin-toss + Bob won + Algorithm to locate any arbitrary one of the large # of AIXI's in the World). (todo: the Algorithm is actually the union over all matching algorithms.)

Grudgingly Friendly Strategy Payoff. If Alice won the coin toss, the payoff is 0.999; if Bob won the coin toss, the payoff is 0.999. Not that it matters, but the odds are .5 in this strategy that Bob won the coin-toss.

So, AIXI's decision in this scenario depends on "how many bits are freed up when going from an algorithm to find an AIXI when only one AIXI exists, to an algorithm to find an AIXI when a vast number of AIXI's exist", compared with "the log of the reciprocal of how much reward is lost going from the UnFriendly Outcome to the Grudgingly Friendly Outcome" (about 10 bits, since the reward differential is .001).
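A rough sketch of that comparison (the bit counts fed in are invented placeholders; the .999 payoff and the roughly 10-bit threshold come from the earlier draft):

```python
def clone_strategy_value(bits_freed_by_many_copies, s=2):
    """Expected payoff of the Clone Strategy.

    bits_freed_by_many_copies = (length of the locate-the-single-AIXI
    algorithm) - (length of the locate-any-one-of-many-AIXI's algorithm).
    More copies make the 'Bob won' hypothesis cheaper to specify, raising
    the odds that Bob won to s ** bits_freed_by_many_copies."""
    odds_bob_won = s ** bits_freed_by_many_copies
    p_bob_won = odds_bob_won / (1.0 + odds_bob_won)
    return 1.0 * p_bob_won          # payoff 1 if Bob won, 0 if Alice won

GRUDGINGLY_FRIENDLY_VALUE = 0.999   # same payoff whichever way the coin fell

for bits_freed in (5, 9, 10):       # invented example values
    print(bits_freed, clone_strategy_value(bits_freed) > GRUDGINGLY_FRIENDLY_VALUE)
# 5 False, 9 False, 10 True: the Clone Strategy only wins once locating a copy
# saves about 10 bits, i.e. more than the bits corresponding to the .001 reward differential.
```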

Monday, October 1, 2007

AIXI Part 2, draft 0.1

Obsolete, instead read the current draft, http://aibeliefs.blogspot.com/2007/10/aixi-draft-021.html

Saturday, September 29, 2007