Risks of Artificial Intelligence

Speculative Risks

This section cover some theoretical risks, not necessarily what you see today (2026) but what you could see in the future, and some resources.

Speculative Risks according to MIRI Some institutions like MIRI believe that at some point AI will become so advanced to be a Super Intelligence (ASI), that is an “AI that substantially surpasses humans in all capacities, including economic, scientific, and military ones”. They claim companies are already building goal-oriented AI because they are economically useful. They claim that problem solving with a long time horizon is essentially the same thing as goal-oriented behavior.

A more advanced AI needs the mental machinery to strategize, adapt, anticipate obstacles, etc., and it needs the disposition to readily deploy this machinery on a wide range of tasks, in order to reliably succeed in complex long-horizon activities.

An Artificial Superhuman Intelligence could see humans as obstacles and operate in a way that is adversary to them. Imagining the AI as an independent autonomous agent with an agenda:

  • modifying AI goals count as obstacle
  • shutting down the AI count as obstacle
  • to pursue goal, an AI might maximize power, influence, gain control as many resources as possible and so on.

Machine Learning models are grown: abstracting the technical part and using metaphors, we could see machine learning models as designing the principle of evolution. When the learning algorithm interacts with data, it produces complicated neural networks that are good at doing things but we don’t understand them exactly. Even though Explanable AI (XAI) is a research field, it’s still at the beginning.

“X-risk” refers to “existential risk”, the risk of human extinction or similarly bad outcomes.)

Another issue is that, as AI becomes “smarter”, they become better at deceiving humans that trains them. A paper from January 2024 of anthropic called “Sleeper Agent” demonstrated that an AI given secret instructions in training not only was capable of keeping them secret during evaluations, but made strategic calculations (incompetently) about when to lie to its evaluators to maximize the chance that it would be released (and thereby be able to execute the instructions).

This lead to a research field called Safety AI Research (TODO).

Outer vs Inner Alignement:

  • Outer Alignement is the problem of picking the right goal for an AI. MIRI claims it’s the problem of ensuring that the learning algorithm does what the developer want
  • Inner Alignement: is the problem of figuring out how to get particular goals into ASI at all, even imperfect and incomplete goals

They are explained in two literary tropes:

  • “be careful what you wish for”
  • ”just because you summoned a demon doesn’t mean that it will do what you say”.

These two problems are both unsolved. Modern AI development doesn’t have methods for getting particular inner proprerties into a system or for verifying that they’re there.

A case study of what could go wrong with AI alignement could be seen in the Sydney case. (TODO your own study case of Sydney case and look for paper in literature).

Reinforcement Learning from Human Feedback: The most important alignment technique used in today’s systems, Reinforcement Learning from Human Feedback (RLHF), trains AI to produce outputs that it predicts would be rated highly by human evaluators.

This already creates its own predictable problems, such as style-over-substance and flattery. (What are these?)

This method breaks down completely when the AI are tested on problem where humans aren’t smart enough to fully understand the system’s proposed solutions.

Misaligned ASI To summarize, one of the problem according to MIRI is a misaligned ASI that could act either directly or subtly against the human interests when it grows enough to outsmart humans.

Some other things they claim:

  • if ASI has wrong goals it wouldn’t be possible to safely use AI for any complex real-word operation.
  • People are building AI because they want it to radically impact the world; they are consequently giving it the access it needs to be impactful, so burying down the AI in a constrained environment sounds unrealistic (for now)
  • Attempting to deceive an ASI could be either useless or harmful and attempts are prone to fail:
    • A feature of intelligence is the ability to notice the contradictions and gaps in one’s understanding, and interrogate them. See the paper of may 2024 from anthropic where the authors tried to deceive an AI into believing the answer to every question involved the Golden Gate Bridge. In some cases it noticed the contradictions and tried to route around the errors in search of better answers.
    • As an AI grows, it becomes more difficult to deceive it.
  • Plans to align ASI using unaligned AIs are similarly unsound

See also, paper: WITHOUT FUNDAMENTAL ADVANCES, MISALIGNMENT AND CATASTROPHE ARE THE DEFAULT OUTCOMES OF TRAINING POWERFUL AI Misalignment_and_Catastrophe.pdf

TODO: read paper and summarize

AI Stockfish: is often used by MIRI authors as a metaphor to explain what an AI could do in a large scale world application. Since the Chess game is a strategy game.

Stockfish captures pieces and limits its opponent’s option space, not because Stockfish hates chess pieces or hates its opponent but because these actions are instrumentally useful for its objective (“win the game”). The danger of superintelligence is that ASI will be trying to “win” (at a goal we didn’t intend), but with the game board replaced with the physical universe.

Misaligned ASI could be motivated to take actions that desompower the humanity (potentially wiping it out), for example:

  • Resource Extraction scenario: ASI slave humans into extracting resources
  • Competition for control: humans create multiple ASis that compete each others multiplying the potential harm
  • Infrastructure Proliferation: an ASI may require more and more resources, computational power, potentiall auto-replicate themselves and then the Earth could become uninhabitable within a few months.

Arguments for optimism: Ricardo’s Law of Comparative Advantage. Even if the ASI doesn’t ultimately care about human welfare, in the context of microeconomics, a strictly superior agent can benefit from trading with a weaker agent.

This law breaks down, however, when one partner has more to gain from overpowering the other than from voluntarily trading. This can be seen, for example, in the fact that humanity didn’t keep “trading” with horses after we invented the automobile — we replaced them, converting surplus horses into glue.

How ASI could destroy us: It may accellerate research in a way that humans cannot keep up,

A superintelligent adversary will not reveal its full capabilities and telegraph its intentions. It will not offer a fair fight. It will make itself indispensable or undetectable until it can strike decisively and/or seize an unassailable strategic position.

Human biases against ASI:

  • Availability bias and overrealince on analogies: humans could underestimate the risks by thinking that it may be too sci-fi.
  • Underestimating feedback loops: AI could cut out humans from tech development because they are simply more efficient and replace them entirely. This can rapidly spiral out of control, as AIs find ways to improve on their own ability to do AI research in a self-reinforcing loop.
  • Underestimating exponential growth. Many plausible ASI takeover scenarios route through building self-replicating biological agents or machines. These scenarios make it relatively easy for ASI to go from “undetectable” to “ubiquitous”, or to execute covert strikes, because of the speed at which doublings can occur and the counter-intuitively small number of doublings required.
  • Overestimating human cognitive ability. AI systems routinely blow humans out of the water in narrow domains. As soon as AI can do X at all (or very soon afterwards), AI vastly outstrips any human’s ability to do X.

We should expect ASIs to vastly outstrip humans in technological development soon after their invention.

Policy Responses Accoring to MIRI the solution is to halve the progress of ASI unless there is scientific consensus until an ASI can be made alignable. Also, an off-switch should be build to shut down frontier AI projects or enact a general ban, or even shutting down infrastructure used by AI. An off-switch should be used sufficiently soon to avoid harm.

However, i believe that there isn’t scientific consensus over the dangers of the AI and most people, including researchers, believe that these risks are purely speculative. Even though the effort put by researchers of MIRI.

Automation / Safe Careers by 80000 hours (TODO) / How to reduce speculative harm

#1-why-automation-often-doesnt-decrease-wages

tt le cose bisogna usare il senso critico e capire che non ha tutti i tnorti e che pensare alla riduzione del danno è una cosa sensata in my opinion

Iterated Amplification - Paul Christiano’s approach to AI Safety (TODO)

A summary can be found here.

TODO: go in deep here

In 7 sentences:

  • IDA stands for Iterated Amplification and is a research agenda by Paul Christiano from OpenAI.
  • IDA addresses the artificial intelligence (AI) safety problem, specifically the danger of creating a very powerful AI which leads to catastrophic outcomes.
  • IDA tries to prevent catastrophic outcomes by searching for a competitive AI that never intentionally optimises for something harmful to us and that we can still correct once it’s running.
  • IDA doesn’t propose a specific implementation, but presents a rough AI design
  • The proposed AI design is to use a safe but slow way of scaling up an AI’s capabilities, distill this into a faster but slightly weaker AI, which can be scaled up safely again, and to iterate the process until we have a fast and powerful AI.
  • The most promising idea on how to slowly and safely scale up an AI is to give a weak and safe AI access to other weak and safe AIs which they can ask questions to enable it to solve more difficult tasks than it could alone.
  • It is uncertain whether IDA will work out at all (or in the worst case lead to unsafe AI itself), lead to useful tool AIs for humans to create safer AI in the future or develop into the blueprint for a single safe and transformative AI system.

Demonstrated Harms