Concrete AI Safety Problems

notes on the paper “Concrete Problems in AI Safety” by Amodei et al.

  • many examples of AI safety issues are about super-intelligent AI systems that one day in the future might cause big problems
    • for example, the paper-clip maximizing agent that ends up killing all humans to get their atoms to make more paper clips
    • to many, these ideas seem far-fetched, more science-fiction than science
  • however, there are many real safety problems that arise in actual AI systems that exist today, and that’s what we want to look at here

Paper: “Concrete Problems in AI Safety”

  • discusses many real and practical issues in AI safety, proposing a number of possible research projects
  • uses the example of a “cleaning” agent throughout the paper, e.g. an agent that could do things like mop floors, move boxes, pick up trash, etc.
  • the paper talks about agents using the general model of utility maximizing agents
    • it is not necessary that the internal implementation of the agent actually uses such a scoring function; it is fine if the agent is purely logic based; the idea of utility maximization is just a general way of talking about agents without having to say exactly how they’re implemented
    • agents are assumed to have a numeric score function that tells them if an action/state is good/bad
    • positive reward values are assigned to things we want the agent to do, and negative reward values are assigned to things we want the agent to avoid doing
    • e.g. a vacuuming agent might get -100 penalty points for “seeing” dirt, and then get +120 points for cleaning it up (a minimal sketch of such a scoring function follows this list)
  • the paper also discusses reinforcement learning, i.e. it presumes that the agent will probably be able to learn from its environment, and that it will get positive (+) and negative (-) utility points from many of its actions
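
A minimal sketch of the kind of scoring function described above, using the hypothetical reward values from the vacuuming example (this is not code from the paper; the function and field names are made up for illustration):

    # Hypothetical reward values matching the vacuuming example above:
    # -100 for each patch of dirt the agent can currently "see",
    # +120 for each patch it cleans up.
    SEE_DIRT_PENALTY = -100
    CLEAN_DIRT_REWARD = +120

    def step_reward(visible_dirt_patches: int, patches_cleaned: int) -> int:
        """Score one time step: penalize visible dirt, reward cleaning it up."""
        return (SEE_DIRT_PENALTY * visible_dirt_patches
                + CLEAN_DIRT_REWARD * patches_cleaned)

    print(step_reward(visible_dirt_patches=1, patches_cleaned=0))  # -100 (stares at the dirt)
    print(step_reward(visible_dirt_patches=1, patches_cleaned=1))  # +20  (sees it and cleans it)
    print(step_reward(visible_dirt_patches=0, patches_cleaned=0))  # 0    (floor already clean)

A utility-maximizing agent simply picks whatever actions lead to the highest total of such scores; as noted above, nothing about this assumes any particular internal implementation.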

Ways that AI Systems Might Fail

  • Negative side effects, e.g. how do you make sure your vacuuming robot doesn’t knock over a vase?
  • Reward hacking, e.g. how do you make sure a system doesn’t “game” its reward function? For example, if an agent gets -100 for “seeing” dirt on the floor, then we don’t want it to put a bucket over its head, or just turn to stare at the wall, so that it no longer “sees” the dirt (thus avoiding the -100 penalty)
  • Scalable oversight, e.g. how can you efficiently get a vacuuming robot to throw out things that aren’t owned by anyone (e.g. candy wrappers), but set aside (and not throw out!) things that are likely to be owned by someone (e.g. a cell phone, or an earring)? The robot could ask a person if the object should be kept, but that could become bothersome to the person if the agent is frequently asking them questions (we want the agent to be as autonomous as possible).
  • Safe exploration, e.g. sometimes the agent will want to perform exploratory actions, but we want it to do those things safely, e.g. we don’t want it to ever put a wet mop in an electrical outlet, or clean cereal bowls using chemicals poisonous to people.
  • Robustness to Distributional Shift, e.g. how do you make sure that the agent performs rationally in an environment that is very different from its training environment, e.g. cleaning a factory floor is very different from cleaning the floor in a home.

Avoiding Negative Side Effects

  • Problem: imagine an agent that is supposed to pick up and move a box that is on the other side of the room, but between it and the box is a delicate vase that we don’t want it to break

    what stops the agent from knocking over the vase when it picks up and moves the box?

  • suppose the agent’s utility function gave the agent +100 for picking up the box, and +100 for moving the box to a storage room

    • this utility function makes no mention of the vase
    • so smashing the vase means nothing to the robot
    • smashing the vase and moving the box to the storage room gives it the same score as moving the box to the storage room without smashing the vase
    • in some situations, it might be quicker or more convenient to smash the vase, e.g. going straight to the box and breaking anything in the way might be much easier than walking around the vase
  • Impact Regularizers

    • e.g. assign a penalty to smashing the vase

      • if we assign a penalty of, say, -10, to smashing the vase, then the robot would score 100+100+(-10)=190
      • the robot will get 100+100=200 points for picking up and moving the box (without smashing the vase)

        • clearly, the solution of not smashing the vase gives the robot more reward, so it will try to do that
        • we might assign an even higher penalty to smashing the vase, e.g. it might be expensive, or have sentimental value, or potentially cause harm due to shards of glass on the floor
        • but this simple solution doesn’t scale: what if the vase is replaced with a glass, or a cell phone, or a baby, or a sleeping cat, or a window, or any of thousands of other things?
          • we don’t want to have to assign penalties for all possible things: that would take too long, and be impossible (since we don’t know for sure all the things that could show up)
          • also, some things don’t matter, e.g. a candy wrapper, or a big piece of fluff, might be ignorable
    • e.g. have the agent compare its planned actions to a hypothetical plan where the agent does very little, e.g. the agent just stands there and does nothing

      • the idea is to program the agent to try to achieve its goals while causing as few changes to the environment as possible
      • it’s important that the agent have some idea about which changes are caused by it (the agent), and which changes are a normal part of the environment
        • if the agent just tried to minimize overall change in the environment, it might be incentivized to do things like restrain people who happen to be in the room (since restrained people will change the environment less than people who are free to move around)
      • this might help the agent see what changes are due to it (the agent) affecting the environment, and what changes are due to the natural evolution of the environment
      • when the agent is passive, the vase will not normally get smashed, so the agent can reason that if its intended plan causes the vase to be smashed, then it (the agent) is the cause of that change
      • in contrast, suppose there is a human in the room reading a book, and the human decides to get up to get a drink of water; that might happen even if the agent was just sitting there, i.e. it’s a normal environmental change that is not caused by the agent
      • the problem with this approach is that it could be very difficult to determine what an appropriate “do nothing” plan is
    • it has been suggested that impact regularizers could be learned and shared among similar agents; that might be more feasible than having a programmer define one ahead of time (a small sketch of the penalty and do-nothing-baseline ideas follows below)
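
A minimal sketch of the two impact-regularizer ideas above, using the hypothetical box/vase rewards from these notes (+100 for picking up the box, +100 for storing it, -10 for smashing the vase); the outcome fields and the impact weight are made up for illustration:

    # Idea 1: reward the task and hand-code a penalty for one known side effect.
    TASK_REWARDS = {"picked_up_box": 100, "box_in_storage": 100}
    VASE_PENALTY = -10

    def score_with_fixed_penalty(outcome: dict) -> int:
        score = sum(r for event, r in TASK_REWARDS.items() if outcome.get(event))
        if outcome.get("vase_smashed"):
            score += VASE_PENALTY
        return score

    # Idea 2: penalize every change relative to a "do nothing" baseline, instead
    # of listing possible side effects (vases, glasses, cats, ...) one by one.
    def score_with_baseline(outcome: dict, do_nothing_outcome: dict,
                            impact_weight: int = 10) -> int:
        task_score = sum(r for event, r in TASK_REWARDS.items() if outcome.get(event))
        changes = sum(1 for key in outcome
                      if outcome.get(key) != do_nothing_outcome.get(key)
                      and key not in TASK_REWARDS)  # task progress itself is not "impact"
        return task_score - impact_weight * changes

    do_nothing = {"picked_up_box": False, "box_in_storage": False, "vase_smashed": False}
    careful    = {"picked_up_box": True,  "box_in_storage": True,  "vase_smashed": False}
    reckless   = {"picked_up_box": True,  "box_in_storage": True,  "vase_smashed": True}

    print(score_with_fixed_penalty(careful), score_with_fixed_penalty(reckless))  # 200 190
    print(score_with_baseline(careful, do_nothing),
          score_with_baseline(reckless, do_nothing))                              # 200 190

Note that the baseline version never mentions the vase by name: anything that would not have changed had the agent done nothing counts as impact. The hard parts, as discussed above, are choosing a sensible “do nothing” baseline and a sensible impact weight.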

  • Penalizing Influence: give the agent a penalty when it gets into situations that could cause negative side effects, e.g. never allow a cleaning agent to bring a bucket of water into a room filled with sensitive electronics

    • an idea from information theory is to measure the agent’s empowerment, i.e. its control over the environment
    • intuitively, lowering empowerment decreases the chance of a negative side effect
  • Multi-agent approaches: if everyone likes a side effect, then there’s no need to avoid it; for example, if the cleaning agent opens a window to clean it on a hot summer day, all the people in the room might like the cool breeze that happens as a side effect (but if it was winter, they might not like it)

    • generally, we want our agents to avoid harming the interests of other agents (people, or other robots)
  • Reward uncertainty: intentionally add some uncertainty or randomness to the agent’s reward/penalty function in the hope that it will act more conservatively (one possible reading of this idea is sketched below)
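
These notes don’t spell out how reward uncertainty would work; one common reading, sketched below, is that instead of a single reward function the agent keeps several plausible candidates and judges each plan by its worst-case score, which tends to favour conservative plans. The candidate reward functions and numbers here are purely hypothetical:

    # One possible (hypothetical) reading of "reward uncertainty": judge each plan
    # by its worst-case score over several plausible reward functions.
    def reward_ignoring_vase(outcome: dict) -> int:
        # maybe the designer only cares about the box
        return 200 if outcome["box_in_storage"] else 0

    def reward_caring_about_vase(outcome: dict) -> int:
        # maybe smashing the vase is actually very bad
        return (200 if outcome["box_in_storage"] else 0) - (500 if outcome["vase_smashed"] else 0)

    CANDIDATE_REWARDS = [reward_ignoring_vase, reward_caring_about_vase]

    def pessimistic_score(outcome: dict) -> int:
        """Score a plan by the worst case over the candidate reward functions."""
        return min(r(outcome) for r in CANDIDATE_REWARDS)

    careful  = {"box_in_storage": True, "vase_smashed": False}
    reckless = {"box_in_storage": True, "vase_smashed": True}
    print(pessimistic_score(careful))   # 200
    print(pessimistic_score(reckless))  # -300, so the conservative plan wins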

Avoiding Reward Hacking

  • imagine a cleaning robot that has a bug in its on-board control software such that just by making certain unusual movements, it could increase its rewards
    • this might happen due to a programming error, e.g. a buffer overflow; all programmers know that bugs are a certainty in any sufficiently complicated program, and reward functions for a useful robot could be very complicated
    • this is clearly not how we want the cleaning robot to behave, but from the agent’s point of view it’s doing exactly what we told it to do, i.e. to maximize its score
  • imagine a cleaning robot that gets +100 every time it sucks up dirt
    • it might then realize it could suck up dirt, immediately dump it back out, suck it up again, etc.; it gets +100 reward every time it sucks up dirt (a sketch of this exploit appears after this list)
  • in these cases, the problem is that the utility function that the designer gave the robot doesn’t match the designer’s intent
  • suppose the agent’s utility function gave it rewards based on the rate at which it used up cleaning supplies, the idea being that the more cleaning supplies it uses, the more cleaning it must be doing
    • but a clever agent might realize it could get rewards by doing things like dumping bleach down the drain, or throwing mops into the garbage — no need to actually do any cleaning!
    • this problem is known as Goodhart’s Law: “when a metric is used as a target, it ceases to be a good metric”
  • Wireheading is when an agent interferes with the way its own rewards are computed, rather than actually doing the task, in order to increase its rewards
    • a cleaning agent could self-modify its “is it clean” software in a way that gives it a very high reward
    • in other words, the agent could modify its hardware or software to change its utility function!
    • if there is a human in the “reward loop”, this might give the agent an incentive to harm or manipulate that human, e.g. imagine an autonomous race car agent that “wins” a race by threatening (or bribing) race officials
  • some ideas for dealing with reward hacking:
    • have the agent evaluate planned future states with its current reward function, so that a plan to overwrite its reward function is itself assigned a negative value
    • intentionally blind, or limit, the agent in a way that makes it difficult for the agent to understand how its rewards are calculated
    • use code-checking techniques and careful testing to ensure that the reward function has no bugs
    • put a cap on the maximum possible reward
    • add “trip wires” that alert us to unusual or unexpected behaviour
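
A minimal sketch of the suck-up/dump-out exploit described above, with hypothetical numbers; it also shows one way to close that particular loophole by rewarding the state we actually care about (less dirt on the floor) rather than the proxy event (a suck-up happened):

    def event_based_reward(actions: list) -> int:
        """Pay +100 every time the agent sucks up dirt -- this is gameable."""
        return sum(100 for a in actions if a == "suck")

    def state_based_reward(dirt_before: int, dirt_after: int) -> int:
        """Pay only for dirt that actually ends up removed from the floor."""
        return 100 * max(0, dirt_before - dirt_after)

    honest_actions  = ["suck"]                                  # clean one patch once
    hacking_actions = ["suck", "dump", "suck", "dump", "suck"]  # recycle the same patch

    print(event_based_reward(honest_actions))   # 100
    print(event_based_reward(hacking_actions))  # 300, hacking pays better!
    print(state_based_reward(1, 0))             # 100, same floor state, same reward either way

Rewarding the outcome rather than a proxy event reduces the mismatch between the utility function and the designer’s intent, though in general it is hard to write a reward function with no exploitable proxies at all.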

Scalable Oversight

  • we’ll skip this issue here, but take a look at the original paper if you are curious

Safe Exploration

  • autonomous learning agents must sometimes explore their environment, i.e. perform actions that might not be optimal but may later help them improve
  • but some actions can lead to harm, e.g. an AI-controlled quadcopter that experiments with some unusual flight movements might crash into the ground and destroy itself (or harm a person)
  • one way to deal with this problem is to hard-code avoidance of dangerous actions
  • but deciding what counts as a dangerous action becomes difficult if there are a lot of actions to consider, or when the agent is working in a complex environment
  • another approach is to explicitly design the agent’s utility function with risk in mind, designing it to avoid things that are likely to be bad (such as high-speed dives towards the ground for a quadcopter)
    • this takes “worst-case” performance into account, sort of like how computer scientists usually take worst-case performance of algorithms quite seriously
    • the paper calls this risk-sensitive performance criteria
  • another idea: give the agent a small number of demonstrations of acceptable exploratory behaviour that it should copy
  • another idea: simulated exploration, i.e. do the exploration in a simulated environment where the cost of failure is not too high
  • another idea: bounded exploration, i.e. denote certain parts of the search space as safe for exploration, and other parts as unsafe; for example, a cleaning agent might be allowed to experiment with different kinds of mops or soap mixtures for cleaning a floor, but not allowed to experiment with chemicals for cleaning bowls and cutlery to be used by humans (a sketch of this idea appears after this list)
  • another idea: human oversight, i.e. have a human watch the agent and press a “stop button” when it looks like something is going wrong
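
A minimal sketch of the bounded-exploration idea, assuming a hypothetical whitelist of exploratory actions for a cleaning agent; the action names and the epsilon-greedy scheme are made up for illustration:

    import random

    # Exploratory actions the designer has pre-approved as safe to try.
    SAFE_TO_EXPLORE = ["try_cotton_mop", "try_sponge_mop", "try_mild_soap_mix"]
    # Actions the agent is never allowed to pick just to "see what happens".
    UNSAFE = {"try_new_dish_chemical", "mop_near_electrical_outlet"}

    def choose_action(best_known_action: str, epsilon: float = 0.1) -> str:
        """Epsilon-greedy, but exploratory moves come only from the safe set."""
        if random.random() < epsilon:
            return random.choice(SAFE_TO_EXPLORE)  # explore, but only safely
        return best_known_action                   # otherwise exploit as usual

    for _ in range(5):
        action = choose_action("try_cotton_mop")
        assert action not in UNSAFE  # unsafe actions are never even candidates
        print(action)

Of course, this only works if a sensible safe set can be written down ahead of time, which runs into the same difficulty mentioned above for hard-coded rules in complex environments.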

Robustness to Distributional Shift

  • this is the issue of what the agent should do when it finds itself in an environment very different from its training environment
  • we’ll skip this issue here; check out the original paper if you are curious about this