ARC Evals

ARC Evals is the evaluations project at the Alignment Research Center. Its work assesses whether cutting-edge AI systems could pose catastrophic risks to civilization.

What problem is ARC Evals working on?

As AI systems become more powerful, it becomes increasingly important to ensure these systems are safe and aligned with our interests. A growing number of experts are concerned that future AI systems pose an existential risk to humanity. According to one survey of machine learning researchers conducted by AI Impacts, the median respondent put a 5% chance on an “extremely bad outcome (e.g. human extinction)”. One way to prepare for this is to be able to evaluate current systems and receive warning signs if new risks emerge.

What does ARC Evals do?

ARC Evals is contributing to the following AI governance approach:

  1. Before a new large-scale system is released, assess whether it is capable of potentially catastrophic activities.
  2. If so, require strong guarantees that the system will not carry out such activities.

ARC Evals’ current work focuses primarily on evaluating capabilities (the first step above), in particular a capability they call autonomous replication: the ability of an AI system to survive on a cloud server, obtain money and compute resources, and use those resources to make more copies of itself.

ARC Evals was given early access to OpenAI’s GPT-4 and Anthropic’s Claude to assess them for safety. They determined that these systems are capable of only “fairly basic steps towards autonomous replication”, but even some of those steps are somewhat alarming. One highly publicised example from ARC Evals’ assessment was that GPT-4 successfully pretended to be a vision-impaired human to convince a TaskRabbit worker to solve a CAPTCHA.

If AI systems could autonomously replicate, what would the risks be?

  • They could become extremely powerful tools for malicious actors.
  • They could replicate, accrue power and resources, and use these to further their own goals. Without guarantees these goals are aligned with our own, this could have catastrophic — potentially even existential — consequences for humanity.

Therefore, ARC Evals is also exploring the development of safety standards that could ensure that even systems powerful enough to be dangerous won’t be. These could include security against theft by people who would use the system for harm; monitoring, so that any surprising and unintended behaviour is quickly noticed and addressed; and sufficient alignment with human interests that the system would not choose to take catastrophic actions (for example, reliably refusing to assist users seeking to use the system for harm).

What evidence is there of ARC Evals’ effectiveness?

After investigating ARC Evals' strategy and track record, one of our trusted evaluators, Longview Philanthropy, recommended the Longtermism Fund provide a grant of $220,000. Longview shared that they thought ARC Evals had among the most promising and direct paths to impact on AI governance: “test models to see if they’re capable of doing extremely dangerous things; if they are, require strong guarantees that they won’t.”

There are a few other positive indicators of the organisation’s cost-effectiveness:

  • ARC Evals is already partnered with OpenAI (the creators of ChatGPT and GPT-4) and Anthropic (the creators of Claude). These partnerships serve as a proof of concept for their primary work: evaluating the capabilities of AI systems prior to their release.
  • ARC Evals has leadership with highly relevant experience and credentials:
    • The project is led by Beth Barnes, a researcher who previously worked at DeepMind and OpenAI.
    • The Alignment Research Center as a whole is led by Paul Christiano, who helped pioneer reinforcement learning from human feedback (RLHF), the main safety mechanism that frontier labs such as OpenAI, Anthropic, and DeepMind use to make their AI systems better aligned with human values.

As of July 2023, ARC Evals could make good use of millions of dollars in additional funding over the next 18 months.