Concrete AI Safety Problems¶
notes on the paper “Concrete Problems in AI Safety” by Amodei et al.
- many examples of AI safety issues are about super-intelligent AI systems that one day in the future might cause big problems
- for example, the paper-clip maximizing agent that ends up killing all humans to get their atoms to make more paper clips
- to many, these ideas seem far-fetched, more science-fiction than science
- however, there are many real safety problems that arise in actual AI systems that exist today, and that’s what we want to look at here
Paper: “Concrete Problems in AI Safety”¶
- discusses many real and practical issues in AI safety, proposing a number of possible research projects
- uses example of a “cleaning” agent throughout the paper, e.g. an agent that could do things like mop floors, move boxes, pick up trash, etc.
- the paper talks about agents using the general model of utility-maximizing agents
- it is not necessary that the internal implementation of the agent actually uses such a scoring function; it is fine if the agent is purely logic based; the idea of utility maximization is just a general way of talking about agents without having to say exactly how they’re implemented
- agents are assumed to have a numeric score function that tells them if an action/state is good/bad
- positive reward values are assigned to things we want the agent to do, and negative reward values are assigned to things we want the agent to avoid doing
- e.g. a vacuuming agent might get -100 penalty points for “seeing” dirt, and then get +120 points for cleaning up dirt
- the paper also discusses reinforcement learning, i.e. it presumes that the agent will probably be able to learn from its environment, and that it will get positive rewards (+ utility points) and negative rewards (- utility points) from many of its actions
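A minimal sketch of this utility-scoring idea, using the vacuuming numbers above; the event names and the dictionary structure are illustrative assumptions, not anything specified in the paper:

```python
# Toy utility scoring for a vacuuming agent: -100 for each patch of dirt it
# "sees", +120 for each patch it cleans up (numbers from the example above).
# The event names are invented for illustration.

POINTS = {
    "sees_dirt": -100,    # penalty: there is visible dirt on the floor
    "cleans_dirt": +120,  # reward: the agent cleaned up a patch of dirt
}

def utility(events):
    """Total utility points for a sequence of observed events."""
    return sum(POINTS[e] for e in events)

# Seeing a patch of dirt and then cleaning it nets -100 + 120 = +20 points,
# so the agent is better off cleaning dirt than leaving it on the floor.
print(utility(["sees_dirt", "cleans_dirt"]))  # 20
```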
Ways that AI Systems Might Fail¶
- Negative side effects, e.g. how do you make sure your vacuuming robot doesn’t knock over a vase?
- Reward hacking, e.g. how do you make sure a system doesn’t “game” its reward function? For example, if an agent gets -100 for “seeing” dirt on the floor, then we don’t want it to put a bucket over its head, or just turn to stare at the wall, so it no longer “sees” the dirt (thus avoiding the -100 penalty)
- Scalable oversight, e.g. how can you efficiently get a vacuuming robot to throw out things that aren’t owned by anyone (e.g. candy wrappers), but set aside (and not throw out!) things that are likely to be owned by someone (e.g. a cell phone, or an earring)? The robot could ask a person if the object should be kept, but that could become bothersome to the person if the agent is frequently asking them questions (we want the agent to be as autonomous as possible).
- Safe exploration, e.g. sometimes the agent will want to perform exploratory actions, but we want it to do those things safely, e.g. we don’t want it to ever put a wet mop in an electrical outlet, or clean cereal bowls using chemicals poisonous to people.
- Robustness to Distributional Shift, e.g. how do you make sure that the agent performs rationally in an environment that is very different from its training environment, e.g. cleaning a factory floor is very different than cleaning the floor in a home.
Avoiding Negative Side Effects¶
Problem: imagine an agent that is supposed to pick up and move a box that is on the other side of the room, but between it and the box is a delicate vase that we don’t want it to break
- what stops the agent from knocking over the vase when it picks up and moves the box?
- suppose the agent’s utility function gave the agent +100 for picking up the box, and +100 for moving the box to a storage room
- this utility function makes no mention of the vase
- so smashing the vase means nothing to the robot
- smashing the vase and moving the box to the storage room gives it the same score as moving the box to the storage room without smashing the vase
- in some situations, it might be quicker or more convenient to smash the vase, e.g. going straight to the box and breaking anything in the way might be much easier than walking around the vase
Impact Regularizers
e.g. assign a penalty to smashing the vase
- if we assign a penalty of, say, -10, to smashing the vase, then the robot would score 100+100+(-10)=190
- the robot will get 100+100=200 points for picking up and moving the box (without smashing the vase)
- clearly, the solution of not smashing the vase gives the robot more reward, so it will try to do that
- we might assign an even higher penalty to smashing the vase, e.g. it might be expensive, or have sentimental value, or potentially cause harm due to shards of glass on the floor
- but this simple solution doesn’t scale: what if the vase is replaced with a glass, or a cell phone, or a baby, or a sleeping cat, or a window, or any of thousands of other things?
- we don’t want to have to assign penalties for all possible things: that would take too long, and in practice it’s impossible (since we don’t know for sure all the things that could show up)
- also, some things don’t matter, e.g. a candy wrapper, or a big piece of fluff, might be ignorable
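A minimal sketch of this hand-coded penalty approach, using the numbers above; the event names, and the idea of scoring a plan as a list of events, are assumptions made for illustration:

```python
# Hand-coded side-effect penalty: every undesirable event needs its own
# entry, which is exactly why this approach doesn't scale.
REWARDS = {
    "pick_up_box": +100,
    "move_box_to_storage": +100,
    "smash_vase": -10,   # one specific side effect, penalized by hand
}

def plan_score(events):
    """Score a plan, treating unknown events as worth 0 (i.e. ignored)."""
    return sum(REWARDS.get(e, 0) for e in events)

careful_plan = ["pick_up_box", "move_box_to_storage"]
reckless_plan = ["pick_up_box", "smash_vase", "move_box_to_storage"]

print(plan_score(careful_plan))   # 200
print(plan_score(reckless_plan))  # 190 -- lower, so the careful plan wins

# But a plan containing "smash_cell_phone" would still score 200, because
# nobody thought to add a "smash_cell_phone" entry to the table.
```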
another impact-regularizer idea: have the agent compare its planned actions to a hypothetical baseline plan where the agent does very little, e.g. the agent just stands there and does nothing
- the idea is to program the agent to try to achieve its goals while causing as few changes to the environment as possible
- it’s important that the agent have some idea about which changes are caused by it (the agent), and which changes are a normal part of the environment
- if the agent just tried to minimize overall change in the environment, it might be incentivized to do things like restrain people who happen to be in the room (since restrained people will change the environment less than people who are free to move around)
- comparing against the do-nothing baseline might help the agent see which changes are due to it (the agent) affecting the environment, and which changes are due to the natural evolution of the environment
- when the agent is passive, the vase will not normally get smashed, and so the agent could reason that if its intended plan causes the vase to be smashed, then it (the agent) is the cause of that change
- in contrast, suppose there is a human in the room reading a book, and the human decides to get up to get a drink of water; that would happen even if the agent were just sitting there, i.e. it’s a normal environmental change that is not caused by the agent
- the problem with this approach is that it could be very difficult to determine what an appropriate “do nothing” plan is
it has been suggested that impact regularizers could be learned and shared among similar agents; that might be more feasible than having a programmer define one ahead of time
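A minimal sketch of the compare-to-a-do-nothing-baseline idea; the environment representation, the feature names, and the penalty weight are all illustrative assumptions:

```python
# Baseline-style impact penalty: run the agent's plan and a "do nothing"
# plan through a model of the environment, then charge a penalty for every
# feature the agent's plan changes relative to the do-nothing baseline.

IMPACT_WEIGHT = 50  # cost per agent-caused change (illustrative)

def impact_penalty(planned_outcome, baseline_outcome):
    changes = [k for k in baseline_outcome
               if planned_outcome.get(k) != baseline_outcome[k]]
    return IMPACT_WEIGHT * len(changes)

# If the agent did nothing, the vase would stay intact and the box would
# stay in the corner; the human gets up for water either way, so that
# change is not blamed on the agent.
baseline = {"vase": "intact",  "box": "corner",  "human": "getting water"}
plan     = {"vase": "smashed", "box": "storage", "human": "getting water"}

task_reward = 200  # +100 pick up the box, +100 move it to storage
print(task_reward - impact_penalty(plan, baseline))  # 200 - 2*50 = 100

# Note the subtlety: moving the box is *also* a change relative to the
# baseline, so the intended effect gets penalized along with the side
# effect; balancing this is part of what makes impact regularizers hard.
```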
Penalizing Influence: give the agent a penalty when it gets into a situation that could cause negative side effects, e.g. never allow a cleaning agent to bring a bucket of water into a room filled with sensitive electronics
- an idea from information theory is to measure the agent’s empowerment, i.e. its control over the environment
- intuitively, lowering empowerment decreases the chance of a negative side effect
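A very crude sketch of the empowerment idea (not the paper’s formal information-theoretic definition): count how many distinct states the agent could reach within a few steps, and treat states with more reachable outcomes as giving the agent more control (and so deserving a bigger penalty). The toy transition model below is invented for illustration:

```python
import math

def reachable_states(start, transitions, horizon):
    """All states reachable from `start` within `horizon` steps."""
    frontier, seen = {start}, {start}
    for _ in range(horizon):
        frontier = {nxt for s in frontier for nxt in transitions.get(s, [])}
        seen |= frontier
    return seen

def empowerment_proxy(state, transitions, horizon=2):
    """Crude stand-in for empowerment: log2(number of reachable states)."""
    return math.log2(len(reachable_states(state, transitions, horizon)))

# Toy model: a cleaning robot that can carry a bucket of water into a
# server room, from which a disastrous "flooded_servers" state is reachable.
transitions = {
    "kitchen": ["kitchen", "hallway"],
    "hallway": ["hallway", "kitchen", "server_room_with_bucket"],
    "server_room_with_bucket": ["server_room_with_bucket", "hallway", "flooded_servers"],
    "flooded_servers": ["flooded_servers"],
}

for s in ["kitchen", "server_room_with_bucket"]:
    print(s, round(empowerment_proxy(s, transitions), 2))
# The bucket-in-the-server-room state can reach more outcomes (including
# the disaster), so it scores higher and would be penalized more heavily.
```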
Multi-agent approaches: if everyone likes a side effect, then there’s no need to avoid it; for example, if the cleaning agent opens a window to clean it on a hot summer day, all the people in the room might like the cool breeze that happens as a side effect (but if it were winter, they might not like it)
- generally, we want our agents to avoid harming the interests of other agents (people, or other robots)
Reward uncertainty: intentionally add some randomness to the agent’s reward/penalty function in the hope that it will act more conservatively
Avoiding Reward Hacking¶
- imagine a cleaning robot that has a bug in its on-board control software such that just by making certain unusual movements, it could increase its rewards
- this might happen due to a programming error, e.g. a buffer overflow; all programmers know that bugs are a certainty in any sufficiently complicated program, and reward functions for a useful robot could be very complicated
- this is clearly not how we want the cleaning robot to behave, but from the agent’s point of view it’s doing exactly what we told it to do, i.e. maximizing its score
- imagine a cleaning robot that gets +100 every time it sucks up dirt
- it might then realize it could suck up dirt, immediately dump it back out, suck it up again, etc.; it gets +100 reward every time it sucks up dirt
- in these cases, the problem is that the utility function the designer gave the robot doesn’t match the designer’s intent
- suppose the agent’s utility function gave it rewards based on the rate at which it used up cleaning supplies, the idea being that the more cleaning supplies it uses, the more cleaning it must be doing
- but a clever agent might realize it could get rewards by doing things like dumping bleach down the drain, or throwing mops into the garbage — no need to actually do any cleaning!
- this problem is known as Goodhart’s Law: “when a metric is used as a target, it ceases to be a good metric”
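A minimal sketch of this Goodhart’s-Law failure, using the cleaning-supplies proxy from the example above; the functions, behaviours, and numbers are invented for illustration:

```python
# The designer's proxy metric is "litres of cleaning supplies used", but
# the thing they actually care about is how much dirt gets removed.

def proxy_reward(litres_of_supplies_used):
    return 10 * litres_of_supplies_used   # what the agent actually optimizes

def true_objective(dirt_removed):
    return 10 * dirt_removed              # what the designer really wants

outcomes = {
    "mop the floor normally":     {"litres": 2,  "dirt_removed": 20},
    "dump bleach down the drain": {"litres": 50, "dirt_removed": 0},
}

for behaviour, o in outcomes.items():
    print(f"{behaviour}: proxy={proxy_reward(o['litres'])}, "
          f"true={true_objective(o['dirt_removed'])}")

# The proxy prefers dumping bleach (500 vs 20) even though the true
# objective prefers mopping (200 vs 0): once the metric became the target,
# it stopped being a good metric.
```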
- Wireheading is when an agent “modifies” itself, or the channel that delivers its rewards, in a way that increases its rewards
- a cleaning agent could self-modify its “is it clean” software in a way that gives it a very high reward
- in other words, the agent could modify its hardware or software to change its utility function!
- if there is a human in the “reward loop”, this might give the agent an incentive to manipulate or harm the human, e.g. imagine an autonomous race car agent that “wins” a race by threatening (or bribing) race officials
- some ideas for dealing with reward hacking:
- evaluate the agent’s plans against anticipated future states, i.e. a plan that involves overwriting its own reward function could be assigned a negative value before the agent ever carries it out
- intentionally blind, or limit, the agent in a way that makes it difficult for the agent to understand how its rewards are calculated
- use code-checking techniques and careful testing to ensure that the reward function has no bugs
- put a cap on the maximum possible reward
- add “trip wires” that alert us to unusual or unexpected behaviour
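A minimal sketch of the last two ideas (a reward cap plus a trip wire); the threshold values and the alert mechanism are illustrative assumptions:

```python
REWARD_CAP = 150           # no single step should ever be worth more than this
TRIPWIRE_THRESHOLD = 1000  # anything this large suggests a bug or a reward hack

def alert_operator(raw_reward):
    # Stand-in for whatever "tell a human" means in practice (log, page, etc.)
    print(f"TRIP WIRE: suspicious reward {raw_reward}, please investigate")

def safe_reward(raw_reward):
    """Cap the reward, and alert a human if it looks implausibly large."""
    if raw_reward > TRIPWIRE_THRESHOLD:
        alert_operator(raw_reward)
    return min(raw_reward, REWARD_CAP)

print(safe_reward(120))     # 120 -- a normal cleaning reward, passed through
print(safe_reward(50_000))  # fires the trip wire, and is capped at 150
```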
Scalable Oversight¶
- we’ll skip this issue here, but take a look at the original paper if you are curious
Safe Exploration¶
- autonomous learning agents must sometimes explore their environment, i.e. perform actions that might not be optimal but may later help them improve
- but some actions can lead to harm, e.g. a robot-controlled quad copter that experiments with some unusual flight movements might crash into the ground and destroy itself (or harm a person)
- one way to deal with this problem is to hard-code avoidance of dangerous actions
- but deciding what counts as a dangerous action becomes difficult if there are a lot of actions to consider, or when the agent is working in a complex environment
- another approach is to explicitly design the agent’s utility function with risk in mind, designing it to avoid things that are likely to be bad (such as high-speed dives towards the ground for a quad copter); a small sketch combining this with bounded exploration appears after this list
- this takes “worst-case” performance into account, sort of like how computer scientists usually take worst-case performance of algorithms quite seriously
- the paper calls this risk-sensitive performance criteria
- another idea: give the agent a small number of demonstrations of acceptable exploratory behaviour that it should copy
- another idea: simulated exploration, i.e. do the exploration in a simulated environment where the cost of failure is not too high
- another idea: bounded exploration, i.e. denote certain parts of the search space as safe for exploration, and other parts of the search space as unsafe; for example, a cleaning agent might be allowed to explore with different kinds of mops or soap mixtures for cleaning a floor, but not allowed to experiment with chemicals for cleaning bowls and cutlery to be used by humans
- another idea: human oversight, i.e. have a human watch the agent and press a “stop button” when it looks like something is going wrong
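The sketch referred to above: bounded exploration (a whitelist of actions the agent may experiment with) combined with a risk-sensitive score (expected reward minus a weighted worst-case cost). Every action name, outcome, and number here is an illustrative assumption:

```python
RISK_WEIGHT = 2.0
SAFE_TO_EXPLORE = {"try_new_mop", "try_soap_mixture"}
# "experiment_with_dish_chemicals" is deliberately left off the whitelist.

# Possible outcomes of each exploratory action: (probability, reward, damage)
OUTCOMES = {
    "try_new_mop":                    [(0.9, 10, 0), (0.1, 0, 1)],
    "try_soap_mixture":               [(0.7, 15, 0), (0.3, 5, 2)],
    "experiment_with_dish_chemicals": [(0.99, 20, 0), (0.01, 0, 1000)],
}

def risk_sensitive_score(action):
    """Expected reward, minus a penalty weighted by the worst-case damage."""
    outcomes = OUTCOMES[action]
    expected_reward = sum(p * r for p, r, _ in outcomes)
    worst_case_damage = max(d for _, _, d in outcomes)
    return expected_reward - RISK_WEIGHT * worst_case_damage

def pick_exploratory_action():
    # Bounded exploration: only whitelisted actions are considered at all.
    candidates = [a for a in OUTCOMES if a in SAFE_TO_EXPLORE]
    return max(candidates, key=risk_sensitive_score)

print(pick_exploratory_action())  # "try_soap_mixture" (score 8.0 vs 7.0)
# The chemicals experiment is never considered, even though its expected
# reward is the highest; its worst case (and the whitelist) rule it out.
```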
Robustness to Distributional Change¶
- this is the issue of what the agent should do when it finds itself in an environment very different from its training environment
- we’ll skip this issue here; check out the original paper if you are curious about this