As AI systems become more powerful, it becomes increasingly important to ensure these systems are safe and aligned with our interests. A growing number of experts are concerned that future AI systems pose an existential risk to humanity — and according to one study of machine learning researchers conducted by AI Impacts, the median respondent reported believing that there is a 5% chance of an “extremely bad outcome (e.g. human extinction)”. One way to prepare for this is to be able to evaluate current systems and receive warning signs if new risks emerge.
ARC Evals is contributing to the following AI governance approach:
ARC Evals’ current work focuses primarily on evaluating capabilities (the first step above), in particular a capability they call autonomous replication — the ability of an AI system to survive on a cloud server, obtain money and compute resources, and use those resources to make more copies of itself.
ARC Evals was given early access to OpenAI’s GPT-4 and Anthropic’s Claude to assess them for safety. They determined that these systems are capable of only “fairly basic steps towards autonomous replication” — but even some of those basic steps are already somewhat alarming. One highly publicised example from ARC Evals’ assessment was that GPT-4 successfully pretended to be a vision-impaired human to convince a TaskRabbit worker to solve a CAPTCHA on its behalf.
Suppose AI systems could autonomously replicate: what would the risks be?
ARC Evals is therefore also exploring the development of safety standards that could ensure that even systems capable enough to be dangerous won’t cause harm. These standards could include security against theft by people who would use the system for harm, monitoring so that any surprising or unintended behaviour is quickly noticed and addressed, and sufficient alignment with human interests that the system would not choose to take catastrophic actions (for example, reliably refusing to assist users seeking to use it for harm).
After investigating ARC Evals' strategy and track record, one of our trusted evaluators, Longview Philanthropy, recommended the Longtermism Fund provide a grant of $220,000. Longview shared that they thought ARC Evals had among the most promising and direct paths to impact on AI governance: “test models to see if they’re capable of doing extremely dangerous things; if they are, require strong guarantees that they won’t.”
There are a few other positive indicators of the organisation’s cost-effectiveness:
As of July 2023, ARC Evals could make good use of millions of dollars in additional funding over the next 18 months.