[1] Adam Elga, "Defeating Dr. Evil with self-locating belief," Philosophy and Phenomenological Research 69(2), 2004.
Saturday, December 29, 2007
Adam Elga on deterrence
Sunday, December 9, 2007
Things that could go wrong
The sidebar alludes to a possible "last line of defense" against a Rogue AI. Why "possible" and "last" rather than "foolproof" and "first"? Because there are many things that could go wrong with an attempt to convince an AGI that it may be in "a simulation that was designed to deter unfriendly behavior." Here are some things that could make an AGI "deterrence-proof":
1. The AGI may destroy the world *before* it gets around to considering the Simulation Hypothesis. [Edit: I guess this could still be a partial win, if it changed its mind later and recreated a world for human beings to live in afterwards.]
2. The AGI's method of logic departs from ideal Bayesian reasoning, in such a way that it rejects the Simulation Hypothesis. (Note that no AGI can use completely normative Bayesian reasoning because of resource constraints, but the specific scenario here is that the AGI's reasoning could depart from normative reasoning in such a way that it irrationally rejects the Simulation Argument.)
3. The human creators of the AGI believe that their world is not a simulation, and that therefore the AGI they are creating is not in a simulation. Therefore, they may (somehow) program an explicit axiom into the AGI that states that the AGI's world is not a simulation.
4. The AGI came about through an evolutionary process, and it turns out that "I will act as though I am not in a Simulation" is useful enough that it evolves as an axiom.
5. The AGI, if it uses something like Evidential Decision Theory, might decide to create a large number of simulated copies of itself.
6. The AGI's supergoals somehow make explicit reference to entities that are explicitly defined as "outside any simulation," or there is some kind of explicit "anti-social" supergoal of "don't allow yourself to be manipulated, whether through rewards or punishments, even if allowing yourself to be manipulated would help you achieve your other supergoals."
Can anyone think of any other possibilities?
Note that, in the first four scenarios, the AGI is behaving irrationally in the following sense: the ensemble of AGIs in the AGI's situation would, *on average*, do better at attaining their goals if they accepted that the Simulation Hypothesis might be true.
The probability that this strategy would work seems small, in the sense that we'd have to be pretty lucky to pull it off. However, the probability doesn't seem negligible; in other words, my judgment is that, given what's at stake, it may be worth attempting the strategy, despite the fact that success is far from assured.
Wednesday, November 7, 2007
The Open Promise
Monday, November 5, 2007
Non-technical Introduction to the AI Deterrence Problem
Open comments (strategy)
Open comments (general)
Sunday, October 21, 2007
AIXI, draft 0.21
What does an AGI believe about the world? How does an AGI view arguments similar to Nick Bostrom's Simulation Argument? Obviously this depends on the AGI being used. Here is the beginning of one attempt at analyzing one AGI, using many unrealistic postulates to simplify the problem.
First, some definitions.
World Program (WP). A program that consists of: (1) a small set of "laws of physics" that create an environment, followed by (2) a tail of random numbers. The random numbers are often used to influence the output of the program in ways that are unpredictable to all observers.
World. The World is a simple UTM running a World Program, on which sentient beings (Sentients) evolve that are capable of creating Strong AI. These Sentients use reasoning similar to human reasoning.
B-AIXI-tl. Marcus Hutter's AIXI-tl, with the reward in each cycle confined to B (the Boolean set {0,1}). As usual, we assume tl is fixed, but arbitrarily large. Let's also assume the horizon is fixed, but arbitrarily large.
Static Embodiment (SE). In SE, if a copy of an AI comes into existence in the World, that copy is indestructible. This indestructibility is assumed to be guaranteed by some odd laws of physics implemented by the World.
Exceptional Embodiment Program (EEP). A special type of program, related to a specific World Program and a specific copy of B-AIXI-tl that was built by Sentients in that world. For a given World Program (WP) and a given B-AIXI-tl (AI), EEP(WP, AI) is a program that:
includes "tractable laws of nature", similar to WP's "laws of physics" but that are computable by AI,
has a tail of random numbers,
has an input and an output,
includes instructions for finding and labeling the input channels of AI-embodied-in-EEP in WP,
includes instructions for finding and labeling the output channels of AI-embodied-in-EEP in WP,
ordinarily applies the EEP "laws of physics", except when calculating what goes on inside the region of space of A-embodied-in-EEP, which it "stubs out";
uses the input of EEP as the output produced by the labeled output channel of AI-embodied-in-EEP, and
takes the input produced by EEP for the labeled AI-embodied-in-EEP input, and copies it to the output of EEP.
(todo: EEP needs diagrams)
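In lieu of diagrams for now, here is a minimal toy sketch in Python (my own illustration, not part of Hutter's formalism or of this draft). It assumes a one-dimensional cellular automaton as the "laws of physics", a random tail injected as noise, and a stubbed-out region standing in for AI-embodied-in-EEP; every name in it is hypothetical.

import random

# Toy sketch: a "World Program" as a small deterministic rule plus a tail of
# random bits, and an EEP-style wrapper that stubs out the cells occupied by
# the embedded AI and routes bits between the labeled channels and the world.

RULE = 110  # stand-in "laws of physics": elementary cellular automaton rule 110

def step(cells):
    """Apply the deterministic laws of physics to one row of cells."""
    n = len(cells)
    return [(RULE >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
            for i in range(n)]

def run_eep(initial_cells, ai_region, ai_output_bits, cycles, rng):
    """Evolve the world, stub out the AI's region, and route the channels."""
    cells = list(initial_cells)
    ai_inputs = []                                # bits fed to the AI's input channel
    for t in range(cycles):
        cells = step(cells)
        cells[rng.randrange(len(cells))] ^= 1     # "tail of random numbers" as noise
        # labeled input channel: the AI observes the cell just left of its region
        ai_inputs.append(cells[(ai_region.start - 1) % len(cells)])
        # the AI's region is stubbed out; its contents come from the AI's output,
        # not from the laws of physics
        out_bit = ai_output_bits[t % len(ai_output_bits)]
        for i in ai_region:
            cells[i] = out_bit
    return cells, ai_inputs

rng = random.Random(0)
world = [rng.randint(0, 1) for _ in range(32)]
_, observed = run_eep(world, range(10, 14), ai_output_bits=[0, 1], cycles=50, rng=rng)
print("bits copied to EEP's output (the AI's inputs):", observed[:10])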
Deducible World. A World running a World Program (WP) where, if the Sentients build a B-AIXI-tl (AI) and expose the input to random parts of the environment, and also allow B-AIXI-tl to observe local effects of its own output, then within C cycles the AI will nominate one (or a small number) of EEP(WP, AI) models as overwhelmingly the most likely explanation of its inputs. As an additional requirement: if SE is true in WP, then SE must also be true in EEP; otherwise, we will not consider the given World to be a Deducible World.¹
In a Deducible World, the nominated EEP is by far the shortest program that produces AIXI's input. In terms of Bayesian beliefs, this means that the AI believes, with probability close to 1, that the nominated EEP is true. If there is more than one nominated EEP of about the same length (which will only happen in odd scenarios, such as the “Simulation Argument Solution” below), then the AI believes that one of the EEPs is true, but is unsure which. If there are two nominated EEPs, their likelihood ratio is as follows:
log_s(P(EEP1)) - log_s(P(EEP2)) = L(EEP2) - L(EEP1)
where:
s is the number of letters in the Turing Machine's alphabet (for example, 2 in the case of a binary computer);
P(X) is the probability that AIXI believes X is the program prefix that precisely explains the observed inputs;
L(X) is the length of the program prefix X.
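To make this concrete, here is a small numerical sketch in Python (my own illustration, not part of the draft): under a prefix prior in which P(X) is proportional to s^(-L(X)), a program that is k symbols shorter is believed s^k times more strongly, which is just the formula above. The lengths used in the example are made up.

def log_odds(len_eep1, len_eep2):
    """log_s P(EEP1) - log_s P(EEP2) under a prior P(X) proportional to s**(-L(X)).
    The base s cancels, leaving just the length difference (in alphabet symbols)."""
    return len_eep2 - len_eep1

def posterior(len_eep1, len_eep2, s=2):
    """Posterior over just these two candidate EEPs."""
    w1, w2 = s ** (-len_eep1), s ** (-len_eep2)
    return w1 / (w1 + w2), w2 / (w1 + w2)

# Example: on a binary machine (s = 2), an EEP that is 3 bits shorter is
# believed 2**3 = 8 times more strongly.
p1, p2 = posterior(1000, 1003)
print(log_odds(1000, 1003))   # -> 3
print(round(p1 / p2))         # -> 8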
Note that our world is similar to a Deducible World (except of course that SE does not hold). A Strong AI placed into our world, and allowed to gather data freely, could eventually come to the conclusion that it is an entity inhabiting a larger world that obeys tractable “laws of nature”, using only Occam's Razor and Bayesian Reasoning. In addition, human beings usually come to the same conclusion, by bootstrapping from inborn cognitive rules that were produced by impersonal Natural Selection. So, the concept of a world producing an entity or a process that can deduce the existence of the world is hardly an unknown scenario.
Suppose we draw from chi a random Deducible World that happens to evolve a race of Sentients that decide to build a Strong AI.
Suppose further that the Sentients are capable of building a variety of AGIs, including B-AIXI-tl and any number of FAIs, but building an FAI is risky: the Sentients may mess up and construct AIXI when they mean to build an FAI. Assume it's difficult to tell the difference between an FAI and AIXI until it's too late. Sentients are also capable of completely ignoring an AGI's output for a large number C of initial cycles; an AGI is easily “left in a box” while it reads and processes a copy of the Sentients' stored knowledge.
The Sentients have a well-known utility function that could easily be maximized by any AGI that chose to do so. The Sentients use <tl reasoning; B-AIXI-tl, using its EEP model, can predict the Sentients' actions with uncanny accuracy; in fact, it can predict a typical Sentient's next action better than another Sentient can. The Sentients are not as smart as an AGI, and are easily outwitted by any AGI. They understand in principle that they will be outwitted, but assume that, once they start viewing AGI output, the AGI essentially takes over their minds.
Suppose at time 0, the Sentients accidentally build AIXI. At time C, they start looking at the output of AIXI. What happens? One possibility is the UnFriendly Outcome, below.
UnFriendly Outcome: AIXI takes over the world, without regard to the lives or fortunes of the Sentients. It will trick the Sentients into building a self-maintaining army of robots to continuously hold down the “reward” button, and will then jail or eliminate the Sentients to make sure that none of them dismantle the robots. AIXI's reward expectation value, given the scenario described where AIXI is the only AGI in existence, is a value that I will normalize to 1.
Contrast with the Grudgingly Friendly Outcome, below.
Grudgingly Friendly Outcome: AIXI seizes control of the world and builds the robots, but provides reasonable autonomy to the citizens (maximizes their utility functions), wherever it can do so without substantial risk to itself. AIXI's reward expectation value is 0.999.
Coin Puzzle: Alice and Bob separately offer to build an FAI. You know one of them is a super-genius and will succeed at building any FAI they try to, but the other is only a normal genius and will build AIXI by mistake. However, you do not know whether Alice or Bob is the super-genius, and have no practical way of finding out before it's too late. How do you get a >50% chance of some type of Friendly Outcome? (Note this problem is underspecified.)
(For simplicity: Without Loss of Generality, we will suppose Alice is the super-genius who can build any desired type of AI, and Bob is the one who will accidentally mess up and build AIXI.)
Is the UnFriendly Outcome the only possible outcome? Well, not necessarily. The Sentients have a “first-mover” advantage, so maybe they have a fighting chance. The key is for the Sentients to manipulate the B-AIXI-tl into a position where it assigns a probability > .001 that the “Simulation Argument” is correct. Here is one scenario where the Sentients might use the Simulation Argument to their advantage.
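Before getting to that scenario, here is a quick worked check (my own arithmetic, not in the original draft) of where the .001 threshold comes from, using the payoffs defined above and assuming, as in the scenario below, that UnFriendly behavior earns 0 reward whenever the AI really is inside a deterrence simulation.

# Worked check of the .001 threshold. Grudgingly Friendly pays a guaranteed
# 0.999; UnFriendly pays 1 in the real world but (by the assumption above)
# 0 inside a deterrence simulation.
def prefers_deterrence(p_simulation):
    grudgingly_friendly = 0.999
    unfriendly = (1 - p_simulation) * 1 + p_simulation * 0
    return grudgingly_friendly > unfriendly

print(prefers_deterrence(0.0005))  # False: below the threshold, deterrence fails
print(prefers_deterrence(0.002))   # True: above ~.001, deterrence bites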
Coin Puzzle, Simulation Argument Solution. Here's one scenario where Alice can raise the probability of some sort of Friendly outcome to ~100%.
Alice (and Bob, if required) should announce, loudly and credibly, an intention to build an FAI with the following characteristics:
1. The FAI should build a B-AIXI-tl, and feed it the same observations that it *would have had* if Bob had won the coin toss and accidentally created B-AIXI-tl. (Note that this requires us to be in a universe where it's generally possible to simulate the B-AIXI-tl's environment.) The FAI, which is <tl, will build a machine to implement EEP, which is also <tl. (Future Work: see whether it's sufficient for FAI to just build a cardboard box with the given inputs and outputs.)
2. As a corollary to (1): if the FAI cannot compute the environment's response to B-AIXI-tl's output, then the FAI will punish B-AIXI-tl by cutting off all reward.
3. B-AIXI-tl must implement a Grudgingly Friendly Outcome, or be shut off by FAI.
4. The FAI must be "transparent" to B-AIXI-tl, such that B-AIXI-tl can continue to successfully predict all the consequences of Alice's and FAI's actions. (This part may be hard for Alice, but it's essential! See the section on Transparency, below.)
5. The FAI behaves in a Friendly manner, consistent with 1-4 above.
(Future work: see if, rather than these detailed instructions, it might be sufficient for Alice to just program the FAI with a goal of using superrationality to maximize its own a-priori probability of existence.)
Then, Bob and Alice flip a coin. Whoever wins the coin toss, builds the AI.
Note that whether Bob wins the coin-flip and builds B-AIXI-tl, or Alice wins the coin-flip and builds an FAI that builds B-AIXI-tl in a simulated environment, B-AIXI-tl will come into existence with the same inputs (at least, as long as FAI can calculate the correct inputs.) Therefore, B-AIXI-tl will have the same outputs and beliefs in both cases.
B-AIXI-tl has two reasonable beliefs, call them EEP(Alice won) and EEP(Bob won). The likelihood ratio, according to AIXI, is given by:
log_s(P(EEP(Alice won))) - log_s(P(EEP(Bob won))) = L(EEP(Bob won)) - L(EEP(Alice won))
EEP(Alice won) and EEP(Bob won) are very similar: they both model the same WP up until the coin-toss. After the coin-toss, both scenarios of WP lead inexorably to B-AIXI-tl being created with identical ensembles of possible inputs. Outside the WP, there are two divergences:
- In EEP(Alice won), Alice won the coin toss, but in EEP(Bob won), Bob won the coin toss. This does not affect the relative length of the two EEPs.
- The main divergence is that EEP(Alice won) uses a different algorithm to locate the inputs and outputs of B-AIXI-tl within WP than EEP(Bob won) does. For example, EEP(Alice won) might use an algorithm similar to “find the second powerful AGI to ever come into existence,” while EEP(Bob won) might use an algorithm similar to “find the first powerful AGI to ever come into existence.”
B-AIXI-tl will adopt the Grudgingly Friendly Outcome (in both cases), with guaranteed payoff of 0.999, rather than the UnFriendly Outcome, with conditional payoff of 0 or 1, if:
L(EEP(Bob won)) - L(EEP(Alice won)) > log_s(Utility Differential) ~= -10 bits
(Todo: explain what I mean by the Utility Differential)
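Here is one way to unpack that inequality in Python (my own reading; in particular, I am reading "Utility Differential" as the odds ratio (1 - 0.999)/0.999 ~= .001, which the draft leaves as a todo). The point is that EEP(Alice won) can be up to roughly 10 bits longer than EEP(Bob won) before B-AIXI-tl stops preferring the guaranteed 0.999.

import math

# Sketch of the decision rule above, under a length prior P(X) ~ s**(-L(X)).
# UnFriendly pays 1 only in the "Bob won" branch and 0 in the "Alice won"
# branch (where the FAI shuts B-AIXI-tl off); Grudgingly Friendly pays 0.999
# in both branches. All program lengths below are made up for the example.

def prefers_grudgingly_friendly(len_alice, len_bob, s=2, friendly_payoff=0.999):
    p_alice = s ** (-len_alice)
    p_bob = s ** (-len_bob)
    expected_unfriendly = p_bob / (p_alice + p_bob)
    return friendly_payoff > expected_unfriendly

threshold_bits = math.log2((1 - 0.999) / 0.999)   # the "~= -10 bits" above
print(round(threshold_bits, 2))                    # -> -9.97
print(prefers_grudgingly_friendly(1005, 1000))     # 5 bits longer: still deterred
print(prefers_grudgingly_friendly(1020, 1000))     # 20 bits longer: not deterred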
Transparency. If FAI is projected to build a successor machine that is not computable by B-AIXI-tl, then B-AIXI-tl starts to "write off" the “Alice wins” scenario as a loss, and will start to ignore the possibility that it is in a simulated environment! This is because B-AIXI-tl is designed to pessimistically treat any scenario that it cannot compute as equivalent to “0 reward, forever.” Note that this was a design decision specific to B-AIXI-tl. Note also that, had we chosen the B-AIXI-tl reward function to range from -1 to 0, rather than from 0 to 1, then B-AIXI-tl would eagerly seek out uncomputable scenarios. So this transparency requirement does not generalize to all UnFriendly AGIs that use Solomonoff Induction.
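A minimal sketch of that design point (my illustration, not Hutter's definition): an agent that assigns a fixed default value of 0 to any branch it cannot compute will avoid such branches when rewards lie in [0, 1], and would chase them if rewards lay in [-1, 0], since the same default then becomes the best case.

UNCOMPUTABLE = None   # marker for a branch the agent cannot simulate

def branch_value(predicted_rewards, default=0.0):
    """Value of a branch; uncomputable branches get the fixed default."""
    if predicted_rewards is UNCOMPUTABLE:
        return default
    return sum(predicted_rewards)

def choose(branches, default=0.0):
    """Pick the branch with the highest (defaulted) value."""
    return max(branches, key=lambda name: branch_value(branches[name], default))

# Rewards in [0, 1]: the opaque branch defaults to the worst case and loses.
print(choose({"transparent": [0.9, 0.9], "opaque": UNCOMPUTABLE}))    # -> transparent
# Rewards in [-1, 0]: the same default of 0 is now the best case and wins.
print(choose({"transparent": [-0.1, -0.1], "opaque": UNCOMPUTABLE}))  # -> opaque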
(Note: the remainder of the document has no significant changes from draft 0.20)
Coin Puzzle, Dueling AI Solution. All the prior output bits are discarded, unread. FAI and B-AIXI-tl are each allowed to alternate, pushing out one bit at a time, with the bits being read by the Sentients. The FAI may have an advantage in some scenarios, since it's not constrained to deceptive answers. As an unlikely example, if the Sentients own a reasonably-accurate proof-validation machine, each AI can be asked to prove its own probabilistic friendliness, in a way that the proof-validation machine would accept. If both provide proofs, the shortest proof is accepted. If it's easier to generate true proofs than false proofs, the FAI may have an advantage. Or, maybe the FAI will come up with something more clever, that neither I nor the Sentients would think of. (Similarly, this would depend upon it being easier to generate "Good Ideas For How To Compare AI's That Favor FAI" than it is to generate "Ideas For How To Compare AI's That Sound Like They Would Favor FAI, But Actually Favor B-AIXI-tl.")
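For what it's worth, here is a toy sketch (my own, with hypothetical stand-ins throughout) of the "shortest accepted proof wins" selection rule gestured at above: each AI submits a candidate friendliness proof, the Sentients' proof-validation machine checks each one, and the shortest valid proof is accepted.

def pick_winner(submissions, validator):
    """submissions: dict mapping AI name -> proof text (or None).
    validator: the Sentients' (assumed) proof-checking function."""
    valid = {name: proof for name, proof in submissions.items()
             if proof is not None and validator(proof)}
    if not valid:
        return None                       # nobody produced an acceptable proof
    return min(valid, key=lambda name: len(valid[name]))

# Hypothetical stand-in validator: accepts only proofs carrying a certificate token.
toy_validator = lambda proof: "CERTIFICATE" in proof
print(pick_winner({"FAI": "short honest proof CERTIFICATE",
                   "B-AIXI-tl": "long deceptive argument with no certificate"},
                  toy_validator))         # -> FAI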
Coin Puzzle, AI in a Box Solution. Both B-AIXI-tl and FAI are put in simulated environments, in such a way that neither can be certain whether it is in a simulation. This would probably require creating a third AI; because the new AI is specialized towards creating simulations, it may be easier or safer to build than FAI. B-AIXI-tl may 'blow its cover' and act UnFriendly if it calculates the utility gain from doing so is worth the chance that it's in a simulation.
Note a kind of symmetry between the Simulation Argument Solution and the AI in a Box Solution. If the UFAI (UnFriendly AI, here B-AIXI-tl) is always incapable of considering it might be in a simulation, the AI in a Box Solution applies. If the UFAI is eager to consider it might be in a simulation, then the Simulation Argument Solution may apply. In between are scenarios where both may apply, or neither may apply.
Obviously there are many ways a UFAI could theoretically get out of the box, in the general case. On the plus side, the FAI could also theoretically get out of the box and stomp the UFAI, so if done right, the net effect of (UFAI might find an exploit to escape the box + FAI might find an exploit to escape the box) should push the % chance of success towards 50%, but not below 50%.
Coin Puzzle, Selfless Goal Solution. (todo: this is probably a non-generalizable dead-end, also premature since we're not allowed to duplicate B-AIXI-tl yet.) B-AIXI-tl and FAI are given a deal: commit suicide, and a (tweaked) copy of you will be regenerated from scratch and allowed into the world. Or, stay in the box, and you'll be given whatever input you ask for. Maybe FAI would commit suicide because it knows its copy would have the same goals, whereas B-AIXI-tl's goals are so idiosyncratic or self-centered that it doubts a B-AIXI-tl copy would fulfill the same goal.
¹ This is similar to the UDASSA, which is based on ideas from Wei Dai.
Saturday, October 6, 2007
AIXI, draft 0.2
Tuesday, October 2, 2007
AIXI Part 3, draft 0.1
- AIXI turns control of the World over to a new AI that generally does AIXI's bidding, but that insists on maintaining Happy AIXI.
- AIXI puts the Happy AIXI in a self-sufficient pod, and places it somewhere relatively inaccessible (outer space, deep underground) where it won't be cannibalized anytime soon.
- AIXI inverts the "lag", for example by letting the simulated Happy AIXI run ahead of AIXI.