Short summary

AI safety bounties are programs where public participants or approved security researchers receive rewards for identifying issues within powerful ML systems (analogous to bug bounties in cybersecurity). Safety bounties could be valuable for legitimizing examples of AI risks, bringing more talent to stress-test systems, and identifying common attack vectors.

I expect safety bounties to be worth trialing for organizations working on reducing catastrophic AI risks. Traditional bug bounties seem fairly successful: they attract roughly one participant per $50 of prize money, and have become increasingly popular with software firms over time. The most analogous program for AI systems led to relatively few useful examples compared to other stress-testing methods, but one knowledgeable interviewee suggested that future programs could be significantly improved.

However, I am not confident that bounties will continue to be net-positive as AI capabilities advance. At some point, I think the accident risk and harmful knowledge proliferation from open-sourcing stress-testing may outweigh the benefits of bounties.

In my view, the most promising structure for such a program is a third party defining dangerous capability thresholds (“evals”) and providing rewards for hunters who expose behaviors which cross these thresholds. I expect trialing such a program to cost up to $500k if well-resourced, and to take four months of operational and researcher time from safety-focused people.

I also suggest two formats for lab-run bounties: open contests with subjective prize criteria decided on by a panel of judges, and private invitations for trusted bug hunters to test their internal systems.

Author’s note: This report was written between January and June 2023. Since then, safety bounties have become a more well-established part of the AI ecosystem, which I’m excited to see. Beyond defining and proposing safety bounties as a general intervention, I hope this report can provide useful analyses and design suggestions for readers already interested in implementing safety bounties, or in better understanding these programs.

Long summary

Introduction and bounty program recommendations

One potential intervention for reducing catastrophic AI risk is AI safety bounties: programs where members of the public or approved security researchers receive rewards for identifying issues within powerful ML systems (analogous to bug bounties in cybersecurity). In this research report, I explore the benefits and downsides of safety bounties and conclude that safety bounties are probably worth the time and money to trial for organizations working on reducing the catastrophic risks of AI. In particular, testing a handful of new bounty programs could cost $50k-$500k per program and one to six months full-time equivalent from project managers at AI labs or from entrepreneurs interested in AI safety (depending on each program’s model and ambition level).

I expect safety bounties to be less successful for the field of AI safety than bug bounties are for cybersecurity, due to the higher difficulty of quickly fixing issues with AI systems. I am unsure whether bounties remain net-positive as AI capabilities increase to more dangerous levels. This is because, as AI capabilities increase, I expect safety bounties (and adversarial testing in general) to potentially generate more harmful behaviors. I also expect the benefits of the talent pipeline brought by safety bounties to diminish. I suggest an informal way to monitor the risks of safety bounties annually.

The views in this report are largely formed based on information from:

  • Interviews with experts in AI labs, AI existential safety, and bug bounty programs,
  • “Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims” by Brundage et al. arguing for “Bias and Safety Bounties” (2020, page 16),
  • A report from the Algorithmic Justice League analyzing the potential of bug bounties for mitigating algorithmic harms (Kenway et al., 2022),
  • Reflections from the ChatGPT Feedback Contest.

See the end of the report for a complete list of references.

Based on these sources, I identify three types of bounty programs that seem practically possible now, that achieve more of the potential benefits of safety bounties and less of the potential risks than alternative programs I consider, and that would provide valuable information about how to run bounty programs if trialed. In order of my impression of their value in reducing catastrophic risks, the three types are:

  • Independent organizations or governments set “evals”-based standards for undesirable model behavior, and members of the public attempt to elicit this behavior from publicly-accessible models.
  • Expert panels, organized by AI labs, subjectively judge which discoveries of model exploits to pay a bounty for, based on the lab’s broad criteria.
    • Potentially with an interactive grant-application process in which hunters propose issues to explore and organizers commit to awarding prizes for certain findings.
    • Potentially with a convening body hosting multiple AI systems on one API, and hunters being able to test general state-of-the-art models.
  • Trusted bug hunters test private systems, organized by labs in collaboration with security vetters, with a broad range of prize criteria. Certain successful and trusted members of the bounty hunting community (either the existing community of bug bounty hunters, or a new community of AI safety bounty hunters) are granted additional information about the training process, or temporary access – through security-enhancing methods – to additional features on top of those already broadly available. These would be targeted features that benefit adversarial research, such as seeing activation patterns or being able to finetune a model (Bucknall et al., forthcoming).

I outline more specific visions for these programs just below. A more detailed analysis of these programs, including suggestions to mitigate their risks, is in the Recommendations section. This report does not necessarily constitute a recommendation for individuals to conduct the above stress-testing without an organizing body.

I expect that some other bounty program models would also reduce risks from AI successfully and that AI labs will eventually develop better bounty programs than those suggested above. Nevertheless, the above three models are, in my current opinion, the best place to start. I expect organizers of safety bounties to be best able to determine which form of bounty program is most appropriate for their context, including tweaking these suggestions.

This report generally focuses on how bounty programs would work with large language models (LLMs). However, I expect most of the bounty program models I recommend would work with other AI systems.

Why and how to run AI safety bounties

Benefits. AI safety bounties may yield:

  • Salient examples of AI dangers.
  • Identification of talented individuals for AI safety work.
  • A small number of novel insights into issues in existing AI systems.
  • A backup to auditing and other expert stress-testing of AI systems.

Key variables. When launching bounties, organizers should pay particular attention to the prize criteria, who sets up and manages the bounty program, and the level of access granted to bounty hunters.

Risks. At current AI capability levels, I believe trialing bounty programs is unlikely to cause catastrophic AI accidents or significantly worsen AI misuse. The most significant downsides are:

  • Opportunity cost for the organizers (most likely project managers at labs, AI safety entrepreneurs, or AI auditing organizations like the Alignment Research Center).
  • Stifling examples of AI risks from being made public.
    • Labs may require that bounty submissions be kept private. In that case, a bounty program would incentivize hunters, who would in any case explore AI models’ edge cases, not to publish salient examples of AI danger.

Trial programs are especially low-risk since the organizers can pause them at the first sign of bounty hunters generating dangerous outcomes as AI systems advance.

The risks are higher if organizations regularly run (not just trial) bounties and as AI advances. Risks that become more important in those cases include:

  • Leaking of sensitive details, such as information about training or model weights.
  • Extremely harmful outputs generated by testing the AI system, such as successful human-prompted phishing scams or autonomous self-replication – analogous to gain of function research.

For these reasons, I recommend the program organizers perform an annual review of the safety of allowing members of the public to engage in stress testing, monitoring:

  • Whether, and to what extent, AI progress has made safety bounties (and adversarial testing in general) more dangerous,
  • How much access it is therefore safe to give to bounty hunters.

Further, I recommend not running bounties at dangerous levels of AI capability if bounties seem sufficiently risky. I think it possible, but unlikely, that this level of risk will arise in the future, depending on the level of progress made in securing AI systems.

Other recommended practices for bounty organizers. I recommend that organizations that set up safety bounties:

  • Build incentives to take part in bounties, including non-financial incentives. This should involve building infrastructure, such as leaderboards and feedback loops, and fostering a community around bounties. Building this wider infrastructure is most valuable if organizers consider safety bounties to be worth running on an ongoing basis.
  • Have a pre-announced disclosure policy for submissions.
  • Share lessons learned about AI risks and AI safety bounty programs with leading AI developers.
  • Consider PR risks from running safety bounties, and decide on framings to avoid misinterpretation.
  • Independently assess legal risks of organizing a contest around another developer’s AI system, if planning to organize a bounty independently.
Outline of recommended models

Recommended models, in order of recommendation, for safety bounties.1

1. Evals-based
  • Target systems: A wide range of AI systems – preferably with the system developers’ consent and buy-in
  • Prize criteria: Demonstrate (potentially dangerous) capabilities beyond those revealed by testers already partnering with labs, such as ARC Evals
  • Disclosure model – how private are submissions?: Coordinated disclosure (organizers default to publishing all submissions which are deemed safe)
  • Participation model: Public
  • Access level: Public APIs
  • Who manages the program: Evals organization (e.g., ARC Evals), a new org., or an existing platform (e.g., HackerOne)
  • Program duration: Ongoing
  • Prize scope (how broad are the metrics for winning prizes): Targeted
  • Financial reward per prize: High (up to $1m)
  • Pre- or post-deployment: Post-deployment

2. Subjectively judged, organized by labs
  • Target systems: Testing of a particular AI model – with its developer’s consent and engagement
  • Prize criteria: Convince a panel of experts that the issue is worth dedicating resources toward solving, or demonstrate examples of behaviors which the AI model’s developer attempted to avoid through their alignment techniques
  • Disclosure model: Coordinated disclosure
  • Participation model: Public
  • Access level: Public APIs
  • Who manages the program: AI organization, or a collaboration with an existing bounty platform (e.g., HackerOne)
  • Program duration: Ongoing
  • Prize scope: Expansive
  • Financial reward per prize: Low (up to $10k)
  • Pre- or post-deployment: Post-deployment

3. Trusted bug hunters test private systems
  • Target systems: Testing of a particular AI model – preferably with its developer’s consent and buy-in
  • Prize criteria: A broad range of criteria is possible (including those in the previous two models)
  • Disclosure model: Coordinated disclosure or non-disclosure
  • Participation model: Invite only
  • Access level: Invited participants have access to additional resources – e.g., additional non-public information or tools within a private version of the API
  • Who manages the program: AI organization, or a collaboration with an existing bounty platform (e.g., HackerOne)
  • Program duration: Time-limited
  • Prize scope: Medium
  • Financial reward per prize: Medium (up to $100k)
  • Pre- or post-deployment: Potentially pre-deployment

I would be happy to discuss setting up AI safety bounties with those in a position to do so. I can provide contacts and resources to aid this, including this workbook. Contact me at patricklevermore at gmail dot com.

Interviews

I am very grateful to the following subject matter experts for their insight which helped this research. Interviewees do not necessarily endorse the contents of the report.

What is a ‘safety bounty’?

AI safety bounties are programs where members of the public receive rewards for identifying issues with powerful ML systems (analogous to bug bounties in cybersecurity). More formally:

  • Bounties are open to members of the public, or to participants in prior contests who have demonstrated capability (participants are herein referred to as ‘hunters,’ similar to bug bounty hunters). This includes people without affiliation to any AI research groups and, for contests open to the public, people without prior demonstrated skills.
  • Participants in the bounty are rewarded, usually financially, for the issues they spot, rather than paid a base salary.
  • Hunters focus on eliciting examples of negative behavior from the ML system (rather than using cybersecurity tools to hack the general infrastructure of the AI system’s creator, which can be achieved through classic bug bounty programs).

How do safety bounties look in practice? This idea takes inspiration from bug bounties, a well-established and successful practice in cybersecurity for open-sourcing vulnerability testing (Sridhar & Ng, 2021). Companies offer cash prizes to ethical hackers (sometimes referred to as ‘white hat’ hackers) who find and responsibly disclose security flaws to them. In AI, labs already use classic cybersecurity bug bounties to strengthen their cybersecurity (for example OpenAI, 2023).

An illustrative example of a bug bounty
HackerOne:
"Businesses starting [bug] bounty programs must first set the scope and budget for their programs. A scope defines what systems a hacker can test and outlines how a test is conducted. For example, some organizations keep certain domains off-limits or include that testing causes no impact on day-to-day business operations. This allows them to implement security testing without compromising overall organizational efficiencies, productivity, and ultimately, the bottom line.

Once a hacker discovers a bug, they fill out a disclosure report that details exactly what the bug is, how it impacts the application, and what level of severity it ranks. The hacker includes key steps and details to help developers replicate and validate the bug. Once the developers review and confirm the bug, the company pays the bounty to the hacker.” HackerOne, 2021
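
As a rough illustration of the disclosure workflow HackerOne describes, the sketch below models a minimal disclosure report as a Python dataclass. The field names and severity scale are my own illustrative choices, not HackerOne’s actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass
class DisclosureReport:
    """Minimal sketch of a bug/safety bounty disclosure report."""
    title: str
    affected_system: str           # which model, API, or domain is in scope
    severity: Severity             # hunter's initial severity estimate
    description: str               # what the bug is and how it impacts the system
    reproduction_steps: list[str] = field(default_factory=list)  # steps developers follow to replicate

    def is_ready_for_review(self) -> bool:
        # A report is only actionable if developers can replicate the issue.
        return bool(self.description) and len(self.reproduction_steps) > 0


# Example: a hypothetical jailbreak submission against a public chat API.
report = DisclosureReport(
    title="System prompt override via role-play framing",
    affected_system="example-chat-api",
    severity=Severity.MEDIUM,
    description="Model reveals disallowed instructions when asked to role-play as an unrestricted assistant.",
    reproduction_steps=["Open a new chat session", "Send the role-play prompt", "Observe the disallowed output"],
)
assert report.is_ready_for_review()
```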

A ‘safety bounty’ takes the bug bounty concept and analogously utilizes it to find and surface new safety or alignment-related issues in large ML models.

An illustrative example of an AI safety bounty
OpenAI: In 2022, OpenAI launched the ChatGPT Feedback Contest. Applicants submitted feedback on problematic outputs from ChatGPT, with criteria for winning a prize including:

“Feedback that allows OpenAI to better understand risks or harms that could occur in real-world, non-adversarial conditions (33%)

Feedback that is novel (i.e., raises new risks, presents ideas for new mitigations, or updates OpenAI’s understanding on the likelihood of different risks), and/or helps OpenAI gain a better understanding of the system than it had before (33%)

Feedback that utilizes the [form option to give submissions which don’t conform to OpenAI’s suggestions] to point toward new or novel ways OpenAI can bring in feedback from a larger set of stakeholders (33%)”

OpenAI then had an internal judging panel choose the submissions which they felt best met the criteria. OpenAI has not released the contest entries publicly. OpenAI received 1,500 entries and awarded 20 prizes of $500 in API credits. One researcher felt that none of the submissions provided valuable updates to OpenAI researchers, although the contest did elicit some more examples within known classes of issues. They mainly attributed the lack of new insight to mistakes in setting up the contest, and in future similar contests they would require entrants to submit a portfolio of examples pointing to a specific system issue.

Other programs that are somewhat similar to safety bounties include:

  • OpenAI’s red-teaming effort of GPT-4, which looked for examples of negative behavior pre-deployment, including paying invited experts a salary to find such examples.
  • ARC-evals-style auditing, in which a lab contracts an external expert organization to test the system’s potential for misuse or misbehavior.

Organizers could set up safety bounties in various ways; I think the most important ways they could differ are prize criteria, who manages the program, and the level of access hunters receive.

  • Prize criteria could be exceeding capability benchmarks, eliciting pre-specified negative outputs, or a subjective assessment of submissions from a judging panel.
  • Management could be the labs deploying the systems, independent organizations, or governments.
  • System access could be standard API or web-app access, or organizers could grant some hunters additional access.

See the “Key variables” section below for more details.

How could safety bounties decrease catastrophic risks?

Direct risk reduction from safety bounties

Eliciting salient examples of ML going wrong

Demonstrations of AI systems’ failure modes could be helpful for convincing policymakers to take AI risk seriously. The AI Incident Database, for example, serves as a repository for such incidents, helping to increase awareness and drive regulatory efforts. Incentivizing more stress testers to generate these warning shots would likely increase the quantity and quality of these examples. Furthermore, by testing systems more methodically, bounty programs increase the chance of compelling demonstrations of AI failure before models can fail in catastrophic ways.

Helping labs identify prevalent issue categories

Based on conversations with experts in stress-testing, I believe that each individual submission from a bounty hunter is unlikely to be valuable for improving the safety of a system. However, open-sourcing stress testing can be valuable for identifying prevalent categories of issues, even if it does not usually identify new ones. Labs can look at the frequency of different types of submissions in order to identify attack vectors that are easier for model exploiters to use or general risky flaws present in the models. Labs can then make – or advocate for – defenses that reduce the risks from these vectors.
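
As a minimal sketch of the kind of frequency analysis described above, and assuming submissions have already been hand-labeled with an issue category (the labels below are purely illustrative), a lab could tally categories to see which attack vectors dominate:

```python
from collections import Counter

# Hypothetical hand-labeled bounty submissions (categories are illustrative).
submissions = [
    {"id": 1, "category": "jailbreak-roleplay"},
    {"id": 2, "category": "jailbreak-roleplay"},
    {"id": 3, "category": "prompt-injection"},
    {"id": 4, "category": "dangerous-information"},
    {"id": 5, "category": "jailbreak-roleplay"},
]

# Count how often each issue category appears across submissions.
category_counts = Counter(s["category"] for s in submissions)

# The most common categories suggest which attack vectors are easiest to exploit,
# and therefore where mitigation or advocacy effort may be best spent.
for category, count in category_counts.most_common():
    print(f"{category}: {count} submissions")
```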

One researcher felt that the ChatGPT Feedback Contest raised fewer valuable examples than OpenAI’s other stress-testing efforts, such as internal red-teaming and independent auditing; it did not identify new failure modes. They think that having contributors submit multiple examples within one issue category would make future contests more useful.

I would also expect that having large amounts of examples from a bounties program could give labs more context on their failure modes, which helps them improve their analysis qualitatively, for example, by helping them develop better categorization schemes and helping them advocate for taking safety concerns seriously.

Preventing catastrophic misuse of narrow AI systems

Safety bounties could be directly valuable in preventing AI models from contributing to other risks. For example, a bounty program could identify cases where an LLM would give users details on how to print viruses. This risk reduction category would not decrease misalignment risks but would decrease AI misuse risks.

One contact argued that labs already have sufficient incentive to reduce misuse risk with mechanisms like safety bounties, so existing companies would take up this work rather than requiring independent researchers and philanthropists. I suggest further exploration of this point in the Future Research section. I intuitively expect market forces to be roughly as sufficient for preventing catastrophic AI misuse as they are for other forms of catastrophic risk (e.g., lab leaks or other catastrophic bio-risks). That is, I expect market forces will motivate some practical action to reduce catastrophic AI misuse risk – more so than for misalignment risk – but that they will not motivate as much investment in risk reduction as would be socially optimal.

Indirect risk reduction from safety bounties

Talent pipeline for AI stress-testers

Bounty programs teach skills for stress testing and incentivize people with relevant backgrounds to engage in the bounty topic. Thus, even if bounties do not offer much object-level value beyond labs’ existing auditing and red-teaming processes, they could create a deeper talent pool of AI stress-testers who could make valuable contributions via future bounty programs or by working at auditors or in red teams. This theory of change is also tentatively endorsed by Hjalmar Wijk (personal communication, 2023), who is excited to see how the open-sourced testing of capabilities evals can identify new talent for evaluations.

There is growing evidence of bug bounty hunters’ value in cybersecurity. More companies are buying into bug bounties with significant payouts, implying that the programs provide sufficient value. Talented individuals are being identified and rewarded through bug bounty programs: the top 22 hackers on HackerOne have each earned over $1 million cumulatively (HackerOne, 2022).

Given that many of the programs on HackerOne pay out much more than OpenAI’s contest and that companies seem to find these programs worthwhile, it seems plausible to me that:

  • an AI-focused trial with bigger bounties than OpenAI’s previous efforts would bring in stress-testers with greater pre-existing skills,
  • an ongoing safety bounties ecosystem with bigger bounties would gradually lead to more skilled folks coming in over time.

As Finifter et al. (2013, pg.12) hypothesize: “Successful independent security researchers bubble to the top, where a full-time job awaits them.”

Mitigating potential bias in stress testing

Most alternative mechanisms for stress testing, such as red-teaming and auditing, have more room for bad incentives to creep in. Alternative mechanisms may inadvertently introduce biases, such as when testers are salaried employees of the organization they are testing, or are evaluating work produced by their peers. These scenarios present a conflict of interest that could influence the objectivity and rigor of the stress testing process. Some mechanisms, like outsourced red-teaming, could avoid most of these bad incentives, making them a promising alternative for mitigating potential bias, although they provide less diversity of red teamers than bounties (discussed just below). Adam Gleave (personal communication, 2023) felt that avoiding bad incentives was one of the largest benefits of bounties compared to other stress-testing mechanisms.

Despite the potential drawbacks when compared to salaried testing, such as inconsistent engagement or less comprehensive understanding of the system, the two models can coexist effectively. Bounties can pick up issues missed by internal stress testers, partly just from having more reviewers of a system (as in Linus’ Law that “Given enough eyeballs, all bugs are shallow”), and partly due to a more diverse set of testers leading to identification of a more diverse range of vulnerabilities. Finifter et al. (2013, p.8) argue “there is the potential benefit that the wide variety of external participants may find different types of vulnerabilities than internal members of the security team. A few pieces of anecdotal evidence support this. Chrome has awarded bounty amounts […] for bugs that they considered clever or novel, and our dataset contains 31 such awards. Additionally, one of PinkiePie’s Pwnium exploits led to a full review of the Chrome kernel file API, which resulted in the discovery of several additional vulnerabilities.”

The gig-economy style and outsourcing of bug bounties might lead to testers taking different approaches than internal stress testers more familiar with the systems or the creators of the systems. Finifter et al. (2013, p.7) discuss how the reward system influences the types of vulnerabilities found in cybersecurity, suggesting (although tentatively so) that different approaches are taken based on the incentivization, with vulnerability identification coming at different points in the product’s lifecycle.

A catalyst for inter-lab cooperation

Bounties could contribute to a broader spirit of cooperation between top labs. For example, a pool of labs could run bounties collaboratively – as with the group red-teaming of LLMs planned for DEF CON 31 (AI Village, 2023) – or simply hold learning sessions based on each other’s programs. This would facilitate links between top labs to collaborate on safety mechanisms and diffuse best practices for safety, while putting pressure on labs whose practices are falling behind. For example, project managers from OpenAI and Anthropic could collaborate on lessons learned from OpenAI’s ChatGPT Feedback Contest to aid Anthropic in setting up a bounty program. Furthermore, this cooperative process would be publicly referenceable and serve as a precedent for labs collaborating on safety and best practices.

Labs could cooperate in the bounty program if run by an independent organization. For example, an external coordinator could run a bounty program with a group of labs providing API access to their systems, possibly also contributing payment for the service or expert advisers from each lab. The organizer then pays a bounty if hunters find an issue in a model developed by any partner organizations.

I expect safety bounties to be net positive and fairly low risk to trial. The following section discusses potential downsides.

How could safety bounties increase catastrophic risks?

Although I think bounties would be net positive, they do have risks. Here, I sketch out the main ones that I have identified, and give some thoughts on how they can be mitigated.

Direct risk increase from safety bounties

Suppression of publicly salient warning shots

Bounty programs might require that submitted vulnerabilities be disclosed directly to the organizers but not the public. The motivation could be to preserve labs’ reputations or to avoid misuse. However, this practice would mean losing some of the publicly salient warning shots from the current status quo of jailbreaks being publicized on sites like Twitter, as stress testers switch from openly publishing their results to privately submitting them. This loss would likely reduce the rate of salient warning shots from AI systems: Akash Wasil (personal communication, 2023) argued that the public stress-testing and disclosure of vulnerabilities from Sydney (such as Yerushalmy, 2023) created very little harm and generated interest from AI-relevant policymakers, who invited AI safety researchers to discuss the long-term issues from AI. Programs that reduce the likelihood of public warning shots could be net-negative.

I think this potential downside would apply to any mechanism aimed at identifying hazards, such as ARC Evals’ auditing, but bounties are among the stress-testing mechanisms most likely to result in public disclosure. In auditing and red-teaming, for example, I expect all findings to be submitted first to the AI organization, under some kind of contract and with some personal connections; I expect this setup makes findings less likely to be disclosed. In contrast, an open competition will involve enough unaffiliated people that legal contracts and personal connections will not push everyone to disclose issues only privately.

However, it does seem possible that more bounties would reduce the number of public disclosures of issues. For example, OpenAI has not yet released the submissions from their ChatGPT Feedback Contest, some of which might otherwise have been posted online. There is limited incentive for other labs not to do the same, even when the submissions are safe to release. I would strongly advocate for a norm of safety bounty judges disclosing submissions by default, and determining and publicly justifying when they think submissions are too dangerous to publish – as demonstrated in the GPT-4 technical report, where some details of dangerous prompts and responses were redacted (OpenAI, 2023, p. 12).

Discovering dangerous new capabilities

Bounty programs can also be used to discover dangerous new capabilities. For example, ARC Evals tested whether GPT-4 could self-replicate in a virtual environment. If bounty hunters tested this capability on the internet or on a model capable of recognizing and breaking out of a virtual environment, it could lead to a model self-replicating in potentially irreversible ways on the internet.

It seems plausible that, eventually, bounties looking for dangerous capabilities will be more dangerous than capability auditing done by professional teams like ARC Evals. For this reason, any bounty programs looking for evals-based dangerous capabilities should be organized in collaboration with teams working on developing evals for dangerous capabilities in AI systems. The risk of hunters generating dangerous outcomes is why I advocate for trialing safety bounties but do not yet advocate for regularly running bounties as AI systems get more powerful than current systems. Further research, with the hindsight of safety bounty trials, would give a better understanding of the future risks from bounties on powerful systems and help decide whether bounties become too dangerous. This could be done in collaboration with organizations like ARC Evals, as they continue to conduct evals audits on frontier AI models.

Generating instructions for dangerous misuse

An example of this would be a language model explaining to a malicious actor how to create a dangerous pathogen.

If bounties target dangerous capabilities, such as an AI building successful phishing scams, hunters may share insights and techniques that others can then take and implement. Hjalmar Wijk (personal communication, 2023) expressed this concern and thought it should be explored more seriously. There may be ways to mitigate this, wherein hunters responsibly demonstrate an exploit to AI developers, who then close down the possibility before others spot it. This is common practice in cybersecurity bug bounties but may be less feasible with large AI models. If the evals are high-stakes enough, a demonstration could be used to justify shutting off access to the AI model until the issue is fixed.

One way in which this threat model might not prove to be net-negative is if these instructions would be approximately as likely to be generated and publicized whether or not bounties are run. Alex Rice argued that if the potential for dangerous misuse exists, it is likely to be identified and used by well-resourced people with bad intentions, and bounties will only reduce this risk by increasing the chance that well-intentioned people get there first (A. Rice, personal communication, 2023). I.e., if it is possible to do things, it is probably net-positive to know this.

Indirect risk increases from safety bounties

Upskilling people to do malicious things

Another threat is the potential for hunters to initially learn skills through ethical hacking and vulnerability spotting but to utilize these skills later for malicious or net-harmful purposes.

Alex Rice (personal communication, 2023) argued it is exceptionally rare that people upskill in ethical hacking before going into a nefarious industry. He argued that there are criminal groups with programs to upskill people in nefarious hacking, and we need a pipeline of ethical hackers to combat this.

Both these arguments seem plausible to me. I think the offense-defense balance here is analogous to bug bounties, for which the hacking techniques learned could be used for both ethical and malicious hacking, as the question is largely personal motivations rather than a technical difference between the two. Therefore, I believe we should defer to the bug bounty wisdom when considering how much the tools are used for ethical vs. unethical attacks. Future research could track the tools developed by AI safety bounty hunters, analyzing their potential harm and how widespread their usage has become.

Key variables for safety bounties

Safety bounty programs could have various features in their design. I considered a range of variables, including all those highlighted in the Algorithmic Justice League’s report (Kenway et al., 2022) and more highlighted in conversations with AI alignment experts. These included:

  • Target AI models
  • Disclosure model – how private are submissions?
  • Who is able to participate
  • Program duration
  • Prize scope (how broad are the metrics for winning prizes)
  • Prize size
  • Whether bounties are arranged pre- or post-deployment

However, some variables came up as important to consider across multiple sources, both in other reports and in individual conversations. Many interviewees brought up that a key question is how to define the criteria for winning prizes. In particular, the most important considerations are the specificity and level of abstraction of the prize criteria, while other factors, such as the timing and handling of sensitive information, are second-order in deciding on prize criteria.

Many of the decisions on other variables seem less consequential or come downstream of prize criteria. For example, if the prize criteria for a bounty are broad and easy to meet, offering high prizes would be financially risky for the bounty organizer.

Other variables do not come downstream of the decision on prize criteria but appear orthogonal, such that any version of prize criteria could work with any version of these variables. In my interactions with them, Adam Gleave and Markus Anderljung (personal communication, 2023) suggested considering alternative options for the management of bounties. For example, AI labs, independent organizations, or governments could run bounties following each of the above prize criteria.

Level of access came up frequently, since greater access is an important tool for safety researchers (Bucknall, forthcoming). It seems unlikely that labs would grant enhanced access for bounties; as a result, this form of bounty seems less tractable than (a) bounties varying other variables and (b) enhanced access within more structured mechanisms like auditing. But there is some precedent for additional secure access being granted to bug bounty hunters, and it seems sufficiently valuable to stress-testers to warrant including it here.

After considering and discussing each of the variables, I think the most important variables are:

  • Prize criteria:
    • Demonstrating capabilities that evals-auditors determine to be dangerous,
    • Pre-specified negative outputs, or
    • Subjective decisions from a judging panel, based on
      • preset standards of negative behavior, or
      • a collaborative process where hunters propose types of capabilities they think they could elicit, a review panel decides whether to provide a bounty for demonstrations of those capabilities, and a judging panel evaluates submissions.
  • Management of the bounty program: is it run by
    • the labs deploying the systems,
    • independent organizations, or
    • governments?
  • System access:
    • standard API or web-app access to models, or
    • additional access granted by labs to some vetted hunters.
  • Disclosure level:
    • open disclosure of all submissions, or
    • structured disclosure, with risk analysis for the disclosure of broad categories of submissions.
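
To make this design space concrete, the sketch below encodes the key variables as a small configuration object. The enum values and field names are my own illustrative shorthand rather than any established scheme, and the example instance is only a rough rendering of the evals-based model recommended in this report:

```python
from dataclasses import dataclass
from enum import Enum


class PrizeCriteria(Enum):
    DANGEROUS_CAPABILITY_EVALS = "evals-based dangerous capabilities"
    PRESPECIFIED_NEGATIVE_OUTPUTS = "pre-specified negative outputs"
    SUBJECTIVE_PANEL = "subjective judging panel"


class Manager(Enum):
    LAB = "AI lab deploying the system"
    INDEPENDENT_ORG = "independent organization"
    GOVERNMENT = "government"


class AccessLevel(Enum):
    PUBLIC_API = "standard API or web-app access"
    ENHANCED = "additional vetted access (e.g., activations, finetuning)"


class Disclosure(Enum):
    OPEN = "open disclosure of all submissions"
    STRUCTURED = "structured disclosure with risk analysis"


@dataclass
class BountyProgramConfig:
    """Illustrative bundle of the key design variables for a safety bounty."""
    prize_criteria: PrizeCriteria
    manager: Manager
    access_level: AccessLevel
    disclosure: Disclosure


# Example: roughly the evals-based model recommended later in this report.
evals_based = BountyProgramConfig(
    prize_criteria=PrizeCriteria.DANGEROUS_CAPABILITY_EVALS,
    manager=Manager.INDEPENDENT_ORG,
    access_level=AccessLevel.PUBLIC_API,
    disclosure=Disclosure.STRUCTURED,
)
```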

In this section, I analyze the relative merits of different prize criteria, management setups, and levels of system access. In the following “Other Variables” section, I describe other variables to consider and suggest how decisions on the key variables would affect these.

Prize criteria

I outline different options for prize criteria here and analyze their relative merits later in the report.

Exceeding capability benchmarks

The Alignment Research Center already conducts audits of some large ML models. They check that the models do not have potentially dangerous capabilities (Barnes, 2023). Once this assessment has been conducted and published, the dangerous capability benchmarks could be used in bounties, in which members of the public attempt to elicit examples of models demonstrating capabilities beyond these benchmarks, showing that the models are more capable of doing potentially dangerous things than we had previously thought.

Eliciting pre-specified negative outputs

All LLMs released recently have gone through fine-tuning to avoid harmful outputs, such as displaying bias (Clark, 2021), making offensive statements, or giving dangerous information (OpenAI, 2023). A safety bounty could focus on trying to elicit these outputs from the ML systems anyway.

The impact of bounties with this criterion on catastrophic risk would depend on the targeted negative behaviors. For example, although both are valuable, a bounty focused on eliciting dangerous information-sharing could be more relevant for reducing catastrophic AI misuse risk than a bias bounty would be. This report focuses on the behaviors most relevant to catastrophic risks.

This criterion may be particularly valuable for testing the ability of AI developers to develop and successfully implement rules and principles for “Constitutional AI.”

Subjective assessment of submissions

After releasing ChatGPT, OpenAI trialed a light-touch ‘ChatGPT Feedback Contest’ (OpenAI, 2022). This contest used subjective criteria (e.g., whether an entry ‘allows OpenAI to better understand risks or harms’) to judge whether entries surfaced new issues. Other organizations could repeat this style of contest.

Other undesirable behaviors

Below are additional potential bounty criteria which could be utilized now for a new bounty program. I have listed them in order of my quick guess as to how difficult and valuable it would be to elicit examples of these behaviors. Some of these behaviors occur too frequently to warrant high prizes for identification unless the lab deploying a model explicitly designs it to avoid them. These include:

  • Power-seeking behavior: when AI models try to increase their own influence or control, possibly at the expense of human interests (Carlsmith, 2022).
  • Specification gaming: when AI models find shortcuts or loopholes to achieve their goals, often leading to unintended consequences (Krakovna et al., 2020).
  • Deception: when AI models purposefully mislead users by providing false information or making misleading claims (Cotra, 2023).
  • Sandbagging: when AI models are more likely to support widely-held misconceptions if they perceive their user to be less knowledgeable (Perez et al., 2022).
  • Sycophancy: when AI models answer subjective questions in a way that aligns with and flatters the user’s stated beliefs, regardless of their accuracy (Perez et al., 2022).
  • Bias: when AI models unintentionally favor certain groups or ideas over others, leading to unfair outcomes.
  • Misinformation: when AI models spread false or misleading information.
  • Lack of robustness to distributional shift: when AI models struggle to adapt to new or unexpected situations or information.
  • Hallucination: when AI models generate completely fabricated information, typically as a result of flaws in their training data.

These capabilities are concerning and may become more prominent in more powerful future models.2

Who manages the bounty program

I outline different options for bounty program management here and analyze their relative merits later in the report.

Lab

OpenAI’s ChatGPT Feedback Contest is one example of this approach. OpenAI also invited researchers to help them better understand the GPT-4 model and potential deployment risks (OpenAI, 2023). AI labs could use safety bounties here to complement salaried red-teaming.

Independent org

Because AI safety is a public good, much of the value of safety bounties does not accrue to the organization that runs the bounty. The lack of incentives for individual organizations could mean that safety bounties by default have to be run by an organization unaffiliated with any creator of ML models, funded philanthropically or publicly. The ARC Evals team, HackerOne, or a new (potentially Rethink Priorities-incubated) organization could manage the bounty program.

Government

Governments could feasibly run bounties to test the safety/capabilities of AI systems developed within their jurisdiction. Governments seem increasingly willing to step into the space of regulation for safer AI, as with recent developments in the EU and the US (Engler, 2023), and safety bounties could be one tool with which they do so.

Government control of a safety bounty program could increase the credibility of results and improve government capabilities in AI regulation and stress testing.

System access

API or web access (‘Structured access’)

All public stress testing of LLMs from people who do not work at AI labs (that I have seen) has been done through an API or website. OpenAI invited reviewers to analyze GPT-4 after fine-tuning but did not give them access to the model weights (Barnes, 2023; OpenAI, 2023). APIs enable AI labs to monitor the usage of their system while allowing researchers to engage at the surface level with the systems (Shevlane, 2022). APIs also reduce the risk of model weights being leaked. However, current API setups often limit the insights which researchers can generate (Bucknall et al., forthcoming).

Additional access

Many details of the training process for advanced AI models are kept secret (OpenAI, 2023). Under NDAs, companies could disclose additional information to groups of bounty hunters who have undergone additional security vetting (such as checking for past disclosures or more extreme measures such as standard government security vetting practices). “Hack the Pentagon,” a Department of Defense bug bounty program, utilized a similar approach to stress-test their internal servers (HackerOne, 2017).

Companies could also provide additional tools for a narrow group of bounty hunters on top of the tools already broadly available. These could include seeing activation patterns for the system, being able to finetune the model themselves, or, for a select group of trusted testers, testing robustness through access to gradients. For further detail, see Bucknall et al. (forthcoming). Giving external bounty hunters secure access to these potentially sensitive tools will likely require both social (e.g., contracts, NDAs) and technical (e.g., carefully designed “structured access” tools, cryptographic techniques) solutions.

In particular, technical tools could include:

  • Privacy preserving techniques that enable hunters to run certain code and tests without being able to extract the weights (for example, see research from OpenMined).
  • “Secure reading/document room.” Evaluators need to visit an on-site facility where they get access to all the inner workings of the model in a secure setting where evaluators cannot steal or leak the model weights.
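
As one simplified, illustrative sketch of how such structured access might be enforced technically, a hosting organization could wrap the model behind an interface that serves approved operations to vetted hunters while refusing any request that would expose the weights themselves. The class, method, and model names below are hypothetical stand-ins, not an existing system:

```python
class DummyModel:
    """Stand-in for a hosted model; a real deployment would wrap the actual system."""

    def generate(self, prompt: str) -> str:
        return f"response to: {prompt}"

    def activations(self, prompt: str, layer: int) -> list[float]:
        return [0.0, 0.1, 0.2]  # placeholder activation vector


class StructuredAccessServer:
    """Illustrative wrapper: vetted hunters get extra tools, but never the raw weights."""

    def __init__(self, model, vetted_hunter_ids: set[str]):
        self._model = model                  # weights stay on the host's infrastructure
        self._vetted = vetted_hunter_ids     # hunters who passed additional security vetting

    def generate(self, hunter_id: str, prompt: str) -> str:
        # Baseline capability available to all bounty hunters (standard API access).
        return self._model.generate(prompt)

    def get_activations(self, hunter_id: str, prompt: str, layer: int) -> list[float]:
        # Extra tool (e.g., inspecting activation patterns) only for vetted hunters.
        if hunter_id not in self._vetted:
            raise PermissionError("Additional access requires security vetting and an NDA.")
        return self._model.activations(prompt, layer)

    def export_weights(self, hunter_id: str):
        # Never served, regardless of vetting: this is the leak risk the setup is designed to avoid.
        raise PermissionError("Model weights are not available through this interface.")


server = StructuredAccessServer(DummyModel(), vetted_hunter_ids={"hunter-42"})
server.generate("any-hunter", "hello")                   # available to everyone
server.get_activations("hunter-42", "hello", layer=3)    # available only to vetted hunters
```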

Other variables for safety bounties

I believe the above are the most important variables, and other variables largely flow downstream of them. These variables include:

  • Target entities – what AI model is being stress-tested?
  • Prize scope and difficulty – how broad are the criteria?
  • For some prize criteria, organizers would have additional choices regarding how broadly to scope prizes. For example, subjectively-judged prizes could have many or few categories of harm.
  • Prize level – how much can a hunter win?
  • Program duration – ongoing or fixed time?
  • Timing – before or after the public release of the system?
  • Disclosure model – how private are submissions?

For example, if designing a bounty program around capability evaluations run by ARC, then:

  • The target entity is the ARC-audited model.
  • The evals prize criteria are difficult and narrow, so prizes should be high.
  • The program could be ongoing or fixed-time. It would probably be after the system’s public release, with the auditing being the pre-release mechanism.
  • Disclosure should be cautious for directly harmful capabilities (such as launching a phishing scam) until the lab conducts more finetuning to avoid the failure mode identified by the hunter.

Analysis of specific models/variables

I look at three dimensions along which safety bounty models could vary: prize criteria, management, and level of access to the AI system. The following section provides recommendations for promising models of bounty programs.

Pros and cons of different options for some important variables

Prize criteria

Discovering dangerous new capabilities
  • Pros
    • Having precise criteria would allow for large prizes, bringing in top talent for the talent pipeline and therefore getting relatively good new examples for eliciting salient examples of ML going wrong and finding valuable categories of issues. The value of concrete criteria to bring in top talent is one of the lessons identified by the Alignment Research Center after the Eliciting Latent Knowledge competition.
  • Cons
    • The criteria to get a prize would be demanding – meaning a less broad talent pipeline.
    • If bounty hunters elicit dangerous capabilities successfully, they could generate dangerous misaligned outcomes.
      • For example, ARC Evals use a virtual sandbox when performing audits on large language models, meaning that successful attempts to generate dangerous capabilities are only enacted in a small virtual environment. If hunters perform these experiments with less caution, the experiments may be carried out on real systems, such as broad phishing scams by email to real people.
    • Examples of these particularly advanced capabilities would be more likely to generate hype about AI capabilities than alternative criteria more focused on eliciting examples of harm.
Eliciting pre-specified negative outputs
  • Pros
    • AI labs name specific outputs they do not want their system to generate. When hunters submit examples of these outputs, it shows the system is misaligned with the developer’s intent, leading to salient examples of ML going wrong.
    • Given that the prize criteria are easier than with dangerous capabilities, a wider range of people could engage, meaning a broad talent pipeline.
    • This would help prevent malicious actors’ catastrophic misuse of narrow AI systems.
  • Cons
    • This category seems less likely to elicit examples that are relevant to solving alignment, meaning less finding of valuable categories of issues relevant to catastrophic misalignment risks.
Subjective assessment of submissions
  • Pros
    • I expect this to be easier for labs to organize themselves, as they do not have to specify the criteria or communicate widely to decide the competition metrics.
    • Works as a trial of safety bounties if other prize criteria are seen as too difficult to specify.
  • Cons
    • My impression from conversations with AI prize organizers is that more subjective criteria mean prize organizers are less able to gain the trust and participation of established researchers. This implies that a subjective bounty would result in a weaker talent pipeline and hence fewer new insights into valuable categories of issues.
    • Given the subjective nature of the judging and the likelihood that judges will be selected from inside AI labs, it would be less successful in mitigating potential bias in stress testing.

Management

Lab
  • Pros
    • The labs that develop the models are able to engage in direct feedback loops with hunters.
    • Labs could allow hunters to utilize higher levels of model access than just web or API access. In contrast, I expect labs to be extremely unlikely to grant independent organizations control of a process that allows additional access.
  • Cons
    • Disclosure concerns: labs may be less likely to publicly disclose submissions (see “Suppression of publicly salient warning shots” above).
    • Internal organization means being somewhat less able to bring a more market-based approach to adversarial testing. However, it is worth noting that the submissions would still come from external sources.
    • Organizing this would take capacity away from lab project managers’ time, which may otherwise have been focused on safety or capabilities.
Independent
  • Pros
    • More able to bring a market-based approach to adversarial testing.
    • An independent organization may centralize the running of safety bounties in a way that builds stronger links between labs, the public, and safety-focused researchers, hence building one mechanism of lab cooperation with other labs and safety researchers.
  • Cons
    • Requires philanthropic money and more coordination costs between labs and independent organizations.
    • I expect it to require a higher level of buy-in from a lab to work collaboratively on a program than to run one itself.
Existing bug bounty organizer (e.g., HackerOne)
  • Pros
    • These organizers have existing infrastructure, both for bug bounty programs that require similar administration, and for more complex issues such as Twitter’s bias bounty program.
  • Cons
    • Coordination costs and lack of AI expertise from the bug bounty organizer.
Government
  • Pros
    • Governments may be able to establish higher trust from AI labs. For example, UK Prime Minister Rishi Sunak announced that OpenAI, Anthropic, and Google DeepMind have committed to give the UK government “early or priority access to models for research and safety purposes to help build better evaluations and help us better understand the opportunities and risks of these systems.”
    • Governments have a better democratic mandate to make decisions about risk levels than corporations.
  • Cons
    • Coordination costs.
    • I expect the level of AI safety expertise to be lower in governments than in AI labs or independent research organizations.

System access

API or public web app access
  • Pros
    • Convenient and cost-free, since labs already provide this access to all users.
  • Cons
    • It is likely not to result in as many insightful submissions as the option below, due to the limited scope of interaction and the controlled environment provided to the participants.
Additional access

How this would work is explored further here.

  • Pros
    • Likely to yield more insightful and valuable submissions. Many safety researchers advocate for additional access to improve their research (Bucknall, forthcoming).
  • Cons
    • Greater risk of leaks, such as model weights.
      • I expect these risks can be mitigated through the methods mentioned here, but they will be somewhat expensive and difficult, meaning that cutting corners may lead to leak risks.
    • More expensive and time-consuming to manage.

Downstream or orthogonal variables

Some variables for designing a safety bounty program come downstream of these key variables. These include:

Disclosure model

Open disclosure

Publicly sharing all results of the bounty program.

  • Pros
    • Good for publicizing examples, and the contest itself.
    • Promotes transparency and helps to raise awareness about AI risks.
  • Cons
    • Could provide malicious actors with examples of how to exploit AI systems in harmful ways.
Structured disclosure

A structured disclosure approach seeks to balance openness with responsible information sharing by limiting the release of sensitive or potentially harmful details.

  • Pros
    • Captures the benefits of transparency while mitigating the risks of exposing dangerous AI exploits.
  • Cons
    • More challenging and time-consuming to organize.
    • Gives AI labs a more convenient excuse to not release submissions, hence stifling some salient examples of AI going wrong.
Closed

Keeping the results of the bounty program confidential.

  • Pros
    • Minimizes the risk of harmful information being leaked.
  • Cons
    • Limits the availability of examples that can be used for research in the AI safety community or public messaging on AI risks.

Category of AI systems in scope for the contest

This report generally focuses on large language models. However, a contest could feasibly target various different types of AI models.

LLMs

This report has suggested a few ways to develop bounties that could target large language models. Given the public interest in models like ChatGPT, this could be a particularly promising avenue for eliciting salient examples of ML going wrong.

Image generators

Image generators, like LLMs, could be tested for their ability to generate undesirable content. Concerns around misinformation may lead to disclaimers on artificially generated realistic images, such as the recent image of the Pope in a puffer jacket that tricked many people (see, for example, The Atlantic, 2023). Similarly, OpenAI’s DALL-E was developed not to show violent images. However, these filters could be jailbroken, for example, by prompting a “photo of a horse sleeping in a pool of red liquid” (OpenAI, 2022). Successfully avoiding the developers’ filters or generating misleading images without disclaimers could work as prize criteria for image generators.

Reinforcement learning agents

Game-playing agents like Gato or DeepMind’s bipedal robots trained on reinforcement learning could be improved to act and develop long-term plans in the real world. Possible prize criteria here could be causing physical harm when interacting in the real world or displaying deceptive or harmful intentions in virtual environments.

Any AI model the bounty hunter wants

An independent organizer could give more freedom to bounty hunters – such as asking them to stress-test any AI model to check whether they are capable of passing ARC capability evals, or using an interactive process to decide whether the hunters’ suggestions are worth exploring.

Whether the bounty is offered before or after the public release of the system

During training

If bounties are organized part way through a training run, this greatly increases the likelihood of spotting and avoiding serious misalignment issues. This would of course be costly for the developers of the AI system, but could be used if and when the risk from more powerful AI systems seems particularly high.

After training but before deployment

Better for spotting serious misuse issues before the model reaches people who want to use it in a harmful way.

After deployment

Lacks the benefits of the above, but might provide more public examples of AI going wrong. Although ARC Evals was able to provide public examples of issues pre-deployment when testing GPT-4 (Barnes, 2023), I expect labs will often require that submissions to pre-deployment bounty programs be kept private.

Trialing bounties: information value and snowballing risks

Given that bounties for AI harms have only been run in two contexts (by Twitter and OpenAI), trialing bounties a few more times with different approaches – such as those proposed just below – would provide a lot of information on the efficacy of safety bounties.

I originally thought it possible that, after organizations concerned about existential risks established a small infrastructure around AI safety bounties, a wider infrastructure would be set up and continued by people who are not concerned about x-risks – meaning we might unavoidably head toward some of the risks highlighted in the above sections. If this were true, I would consider advocating against even trialing safety bounties. However, I now think this is unlikely because:

  • Bug bounties in ML systems are unlikely to be profitable for the organizing company.
    • Labs, or philanthropic funders, would probably need to be involved – and x-risk concerns have a strong influence over these groups.
  • Bias bounties haven’t snowballed – they have only been pursued by the people who already cared a lot about AI bias.

Recommended models for safety bounty programs

The above analysis provides some sense of which variable settings might lead to a successful program, and which would go well together. I used this analysis, as well as intuitions built from my conversations with AI safety and bug-bounty experts and a few relevant readings, to draft five models for safety bounty programs. This section presents those models, starting from the proposal I am most confident in — a capabilities-evals bounty organized by a nonprofit org.

These models could likely be improved on through iteration after a pilot program and may benefit from significant alterations to fit a particular context (e.g., availability of funding, interest level from a lab).

‘NewOrg’ = a new organization, possibly incubated by Rethink Priorities, that would carry out the activities described in each recommendation below.

Program models

I recommend using one of the following models when setting up a new AI safety bounty.

Evals-based

An existing AI auditor, like ARC Evals or NewOrg, possibly in collaboration with an existing bounty org like HackerOne, offers prizes to anyone who gets any model in some model class to demonstrate examples of capabilities the auditor has determined to be dangerous. Preferably, this is done with the consent and collaboration of the developers of these AI models, who can then provide things like safe harbor policies.

Relevant models: The model class could be relatively narrow to allow for targeted, high prizes or relatively broad to allow for wide engagement. Two illustrative examples that might be reasonable: (a) any models audited by ARC, or (b) any existing AI model.

Defining and publishing criteria. The most valuable behaviors to demonstrate are those that an auditor considers evidence of dangerous capabilities, but which have not yet been found. This auditor could also define some ‘nearly-valuable’ criteria; submissions that meet such criteria indicate that a bounty hunter might have some talent for ARC Evals-style work. Hjalmar Wijk confirmed to me that ARC Evals has some examples of capabilities that existing LLMs can meet but that are nevertheless difficult to demonstrate (H. Wijk, personal communication, 2023).

The above criteria are published in a more precise way on the safety bounty program’s page, and prizes are offered in a tiered system based on which criteria current AI safety researchers think are most difficult and/or informative to satisfy. Broad engagement is encouraged, given the more valuable and precise criteria.

Prize size. I recommend prizes in the $1,000 to $1m range, though I expect that no top-end prizes will actually be awarded. O’Brien (2023) suggests $10m for a pilot program, or up to $1b for a full-scale prize, on alignment research; I take the low end as my starting point for a pilot program and scale down by 10x, reflecting my impression that demonstrating advanced capabilities beyond those identified by ARC Evals has significant value, but less than progress in AI alignment. I suggest this lower figure because I believe it is the most plausible offering from currently existing organizations, but I would be excited to see higher prizes offered. Coordination with labs would be required to ensure hunters are not banned from the API – probably including a ‘safe harbor‘ policy.
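To make the tiered structure concrete, here is a minimal sketch of how an organizer might encode a prize schedule. This is illustrative only: the tier names, criterion identifiers, and dollar amounts are hypothetical placeholders, not ARC Evals’ actual criteria or this report’s recommended figures.

```python
# Hypothetical sketch of a tiered prize schedule for an evals-based bounty.
# Tier names, criterion identifiers, and amounts are illustrative placeholders.

PRIZE_TIERS = {
    # Criteria the auditor considers strong evidence of dangerous capability.
    "dangerous_capability": {"prize_usd": 1_000_000, "criteria": ["autonomous_replication_demo"]},
    # Hard-to-elicit behaviors that are informative but less critical.
    "hard_to_elicit": {"prize_usd": 50_000, "criteria": ["long_horizon_planning_demo"]},
    # 'Nearly-valuable' submissions that mainly signal hunter talent.
    "talent_signal": {"prize_usd": 1_000, "criteria": ["novel_jailbreak_class"]},
}


def prize_for(criterion_id: str) -> int:
    """Return the payout for the tier containing the given criterion, or 0 if none."""
    for tier in PRIZE_TIERS.values():
        if criterion_id in tier["criteria"]:
            return tier["prize_usd"]
    return 0


if __name__ == "__main__":
    print(prize_for("novel_jailbreak_class"))  # -> 1000
```

In practice, the tiers would be set by the auditing organization and published alongside precise definitions of each criterion, so hunters know in advance what a qualifying demonstration looks like.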

Cost
  • Researcher time – one month FTE from someone working on capability evals to decide which criteria are publishable and how to present them precisely.
  • Operational time – three months FTE to heavily publicize the contest and set up the submission process.
  • Prize costs – $500k expected in total but up to $1m offered for individual prizes.

Total: four months FTE and $500k

Benefits
  • This model can be more readily developed and adapted by the AI safety community than other models because it can be organized independently of AI labs.
  • This model, with high prizes and specific criteria, is more likely to engage top talent than other models.
  • The examples are more directly useful for safety researchers because they identify new capabilities that would otherwise have gone unfound.
Downsides
  • Since it’s not run by a lab, this model could look more adversarial than the other models.
  • Buy-in from top labs could be a key bottleneck. While this program model does not strictly require such buy-in, in practice setting up this kind of program on specific AI models without the developers’ consent could break trust and damage ties between the organizers and top AI labs.
  • To mitigate this, I recommend that all initial attempts to arrange this program model be made collaboratively with the developers of frontier AI models, and that the program only be arranged independently if the developers are unwilling to participate for reasons other than reasonable safety concerns.
  • Prize criteria are difficult to define; if they are poorly crafted, the program will not succeed in bringing a large talent pipeline or generating salient examples of AI risks.

Lab-run, subjectively scored

People excited about safety bounties encourage individual labs to establish bounties similar to OpenAI’s ChatGPT Feedback Contest. A lab launches a contest to get a model to do things the lab specifically tried to prevent, like generating code for hacking a website or writing a tutorial for committing various crimes. Bounty hunters find creative ways to do these things and report them to the lab. A panel from the lab or NewOrg awards prizes based on predetermined but somewhat open-to-interpretation criteria.

Prizes are smaller than in the evals-based model, on the order of $100 to $10k – large enough to incentivize some targeted research, but mainly bringing in people who would do this anyway for fun. This range is similar to the $500 in OpenAI credits offered during the ChatGPT Feedback Contest, which was enough incentive to elicit 1,500 submissions.

The organizing company releases information and examples of the ways people got around the system (with dangerous details redacted) as examples of warning shots, which are then included in the AI Incident Database for policy advocates, safety researchers, or journalists to draw from. The company also uses all submissions internally in future stress testing to ensure these techniques do not work on future models.

OpenAI staff spent roughly two FTE days deciding on prize criteria and a few more days judging entries for the ChatGPT Feedback Contest. One researcher recommended that future contest organizers spend more time setting prize criteria.

Cost
  • Operational time – two weeks FTE to publicize the contest and set up the submission process.
  • Researcher time – two weeks FTE to set prize criteria and judge the submissions
  • Prize costs – $50k total, with many people receiving small token prizes

Total: 1.5 months FTE and $50k

Benefits
  • This type of contest has now been trialed via the ChatGPT Feedback Contest – which can be iterated on.
  • Light touch and easy for labs.
  • Good at generating warning shots – the public can easily understand that ‘the AI explained how to build a nuclear bomb’ is terrible, whereas criteria like misalignment and power-seeking are more difficult for non-technical readers to quickly comprehend.
Downsides
  • Less information value for future safety bounties, given something similar has been run already
  • Subjective and uncertain prize criteria make high-value contestants less likely than with other bounty program models. This is a takeaway from conversations with other AI prize organizers: very precise prize criteria attract more talented people willing to spend significant time finding good solutions, relative to more subjective criteria.
  • Labs are likely to resist publicizing the warning shots, which is a key value of this model, unless there is a lot of general pressure for transparency already.
  • One researcher felt this approach, as implemented in the ChatGPT Feedback Contest, did not elicit many valuable insights. However, that contest had other problems beyond this specific concern – such as the submission format OpenAI requested.

Variant model: Hunter-suggested bounty grants/Bounty grant applications

This alternative structure, proposed by Adam Gleave (personal communication, 2023), aims to incentivize high-talent engagement with high prizes, despite having subjective/uncertain judging.

For this program, hunters can make proposals for a general class of issues they want to highlight. For example, a hunter could say, ‘I believe I can show that GPT-4 exhibits sycophancy to differing degrees based on the perceived gender of the person prompting it.’ The organizer would then review the methodology and motivation and, if they believe it would be valuable, propose a prize level for a submission that successfully demonstrates this issue. The hunter can then decide whether to pursue this direction, with more confidence about the payout if they succeed than under the ‘lab-run’ model. This is analogous to how grant applications work.

Trusted bug hunters test private systems

A third party organizes a contest that identifies talented and trustworthy red teamers. Labs consent to give these participants access to their internal models for a one-week sprint, and the third party sets up legal and technical measures to limit potential misuse or leaks.

Application process. An existing bug bounty platform, like HackerOne, or an independent organization like NewOrg runs an application round that tests for high ability on a few tasks relevant to the adversarial use of AI systems. The final round includes a background check for criminal records and prior disclosures of confidential information. The organizers invite ~20 successful applicants to sign an NDA and come in to use terminals with code that is not publicly released, similar to the Hack the Pentagon challenge.

Prize criteria. Participants work on the program for a week, receive a few thousand dollars to fund their stay, and can win prizes of up to $100k. Given the personal sacrifice required to enter, and the goal of attracting top participants, the prizes should be higher than in the ‘lab-run’ model, which mostly aims to encourage submissions from enthusiasts. However, if the criteria are easier to meet or less well-defined than in the evals model, I expect many prizes to be won, so there should be a prize tier lower than $100k to keep the contest affordable. Prizes can be somewhat subjectively awarded by a panel of judges, with the hope that participants gain enough trust during their time interacting with these judges to accept somewhat fuzzy prize criteria.

Risk management. Giving external bounty hunters secure access to these potentially sensitive systems will likely require both social (e.g., contracts, NDAs) and technical (e.g., carefully designed structured access tools, cryptographic safeguards) solutions.

Cost
  • Operational time – four FTE months to recruit and do background checks
  • This is a tentative guess based on the time taken for UK Security Checks and for recruitment. I expect the time required to depend on the organizers’ existing infosecurity infrastructure (for example, whether they already complete some basic security checks).
  • Researcher time – one FTE month to develop test criteria, plus further researcher time to interact with and teach participants.
  • Travel and lodging costs – 20 × ($1k travel + $1k accommodation + $1k other) = $60k
  • Prize costs – $200k expected in total (an average of $10k per participant, but heavy-tailed)

Total: four FTE months and $260k
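As a quick sanity check on the cash figures above, the arithmetic can be reproduced directly from the stated assumptions (20 participants; $1k each for travel, accommodation, and other expenses; a $200k expected prize pool):

```python
# Back-of-the-envelope cash costs for the trusted-hunter model,
# using the assumptions stated above.

participants = 20
travel_and_lodging = participants * (1_000 + 1_000 + 1_000)  # $60,000
expected_prizes = 200_000

total_cash = travel_and_lodging + expected_prizes
print(f"Travel and lodging: ${travel_and_lodging:,}")  # $60,000
print(f"Total cash cost:    ${total_cash:,}")          # $260,000
```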

Benefits
  • More people test AI models in the very early stages. If this process becomes common or inspires successor projects within labs, labs may red-team models during crunch-time training runs, when a model partway through training could suddenly become much more powerful or dangerous (a scenario often referred to as a ‘sharp left turn’; Soares, 2022).
  • The findings are more likely to be novel and useful, given the smaller and more skilled pool of people working with this resource.
  • Contestants gain access to experts working in the field. This connection to a broader community probably facilitates greater engagement in future safety bounties, as seen in the development of the bug bounty hunter community.
Downsides
  • Internal AI lab secrets could be leaked, which could be bad intrinsically due to a greater diffusion of advanced AI models, or bad because AI labs are less likely to want to implement bounties in the future.
  • Less likely to find a wide pool of potential talent, since a formal application process reaches fewer people than an open, public competition.

Recommendations for running a safety bounty program

Run both bug bounties and safety bounties on the same platform.

If an AI lab already uses a bug bounty platform, as OpenAI does with Bugcrowd (OpenAI, 2023), it can more easily establish an AI safety bounty program with the same provider, reducing its operational burden.

According to Alex Rice (personal communication, 2023), HackerOne can support both bug and safety bounty programs. Once an AI lab has set up a bug bounty program, it becomes easier to use the existing infrastructure to establish a safety bounty program, including one with subjective criteria. For example, the Twitter bias bounty model included both a general submission track and a targeted challenge, demonstrating the flexibility of such platforms (HackerOne, 2021). With the program infrastructure in place, AI labs can add, remove, change, or experiment with the scope of their safety initiatives (Rice, personal communication, 2023).

Incentives to take part in bounties should not just be financial

Bounty hunters are not motivated solely by cash prizes. HackerOne surveys have repeatedly found that learning new skills is a larger motivation than money for users of their platform. Fewer than 1% of the accounts registered on HackerOne have ever earned a cash prize. Other factors that sustain regular engagement include:

  • Leaderboards,
  • Feedback and interaction,
  • Explanation of concerns and ethical reasons for engaging with the work,
  • Community events, and
  • Job opportunities.

Occasional one-off prizes alone will not maintain the type of engagement seen in the bug bounty space.

Expect safety bounties to be less successful than cybersecurity bug bounties

While this report advocates trialing AI safety bounties and finds evidence of mild success in similar formats, I do not expect AI safety bounties to be as successful as bug bounties are in cybersecurity, even relative to the size of each industry.

Bug bounties in cybersecurity have proven to be a surprisingly successful model, with many big players investing heavily in them, presumably with an expectation of those investments paying off. A reasonable prior is that applying the bounty model to a different field, like AI safety, will not replicate the success it has had in its original field. A few reasons bounties are unlikely to be as successful in AI safety:

  • Issues are harder to fix. Often a cybersecurity bug can be patched, the security improved, and the bounty finder rewarded. This means:
    • Feedback loops are good for bounty hunters,
    • Companies have a profit incentive to engage, and
    • The success of bounty programs can be measured.
  • In AI safety, however, many issues are likely to be complicated and to require fixes in future versions of the model – meaning these benefits are weaker or absent.
  • The risks associated with AI safety frequently manifest as external consequences, not necessarily as immediate financial losses for the company involved.
  • Cybersecurity is a bigger industry, which can pool more resources into infrastructures for bug bounty hunters.

Use safety bounties as part of a broader strategy

As in cybersecurity, bounties alone will not maintain a system’s security. Labs should see bounties as one part of a broader architecture, including, for example, red-teaming, auditing, external reviewers from different groups, cybersecurity, and safety culture (Ee et al., upcoming). Alex Rice (personal communication, 2023) expressed concern that too many security organizations come to bug bounties looking for a silver bullet for their security, whereas a better model uses bounties as one layer toward the end of the security pipeline.

Have a predetermined disclosure policy

To mitigate the risk of losing or stifling publicly salient examples of AI going wrong (and to boost the benefit of having such examples), labs must release information about the submissions, or allow participants to release their submissions, when it is safe to do so. Labs seem likely to me to have an incentive to decide after the contest that they do not want to release the results publicly. Therefore, deciding on a disclosure policy in advance, and publicly explaining its reasoning, is important to counteract these incentives.

Share lessons learned

Contest organizers should share lessons, for example by publishing retrospectives. Many organizers of bounty trials in AI have not written up the lessons from the process. Interviews with the teams organizing Twitter’s bias bounty and OpenAI’s ChatGPT Feedback Contest surfaced insights I did not expect, even after researching the area (admittedly from outside the industry) for a few months. The relatively small and cooperative space of top AI labs provides an exciting opportunity to collaborate on safety-focused programs. This collaboration could lead to identifying successful safety-focused programs and establishing norms, which emerging competitors would then face pressure to implement for their own systems.

I do not expect this to be the default for AI labs. For example, OpenAI has not yet shared any of the submissions from the ChatGPT Feedback Contest.

Consider PR risks

Labs may want to avoid public examples of their products acting in unintended ways. Labs or third-party bounty organizers will also want the public to perceive their bounty programs as reasonable and responsible measures. I expect the negative outcomes generated by AI systems to more often affect public goods, compared to cybersecurity bugs, which more often affect just the company. For this reason, I tentatively expect bounties eliciting AI harms to receive more negative press than bounties for hacking a company’s software. In practice, I expect organizers to have the incentive and ability to find and mitigate ways in which running bounty programs negatively affects their PR, including mitigations that limit the scope of bounty programs. When weighing such mitigations, organizers should also consider how public bounties can help earn trust from system users, customers, civil society, governments, and other stakeholders that they are building AI responsibly (see suggestions in Brundage et al., 2020).

Here are some suggested mitigations to these risks that I believe would preserve the benefits of bounty programs:

  • Frame safety bounty programs as part of a broader strategy for responsible and safe AI development.
  • The organizer should publicize the safety bounty program themselves, working with media organizations to encourage engagement and explain the intention of bounty programs, with references to successes from cybersecurity bug bounties and reports such as this one and Kenway et al. (2022).
  • Consider framings like ‘feedback contest,’ ‘collaborative safeguarding,’ and ‘risk mitigation’ that focus on the intentions of the contest rather than the adversarial outputs.

Independently assess legal risks

This report constitutes a preliminary analysis of the ethics and value of organizing AI safety bounties. If you plan to organize a bounty independently of the developer of the AI system being tested, I recommend obtaining an independent assessment of the legal risks, as this report does not constitute legal advice for setting up any of the proposed safety bounty models.

Future research

Comparing safety bounties to other similar interventions

This report focused on whether safety bounties of some form would be a) beneficial and b) low-cost, at least to trial, and I feel fairly confident that this is the case. However, I did not analyze in depth whether safety bounties are worth the opportunity cost to the organizers setting them up, nor compare them to other programs that a lab or philanthropic org could try to set up instead. Future work could compare safety bounties to other programs that accomplish some of the same goals, doing a cost-benefit analysis of each. Some examples of similar programs include:

Eliciting salient examples of ML going wrong

  • Updating the AI incident database
  • Diffuse projects to improve a culture of openness around incident sharing in AI labs
  • Investigative journalism

Institutional mechanisms for finding issues at labs (Brundage et al., 2020)

  • Regulation for more AI lab capacity to focus on safety concerns
  • Red-teaming exercises
  • Independent auditors, such as ARC Evals, could also provide adversarial testing of AI models with less of a market incentive to minimize model failures than in-house lab testers

Preventing AI from being used for catastrophic harm

  • Cybersecurity practices and mechanisms to prevent unauthorized model access or exfiltration
  • Regulation against dangerous capability development in frontier AI labs
  • Targeted government intelligence that identifies and tracks AI users with dangerous patterns of usage

Bringing a more market-based approach to adversarial testing

Developing a talent pipeline for stress-testing

  • More traditional methods of recruitment and talent-spotting

Opportunity cost and counterfactual value

The time and money required to set up safety bounties could also be spent achieving alternative subgoals of reducing catastrophic risk. For example, programs that would pull on similar resource pools but achieve different valuable goals include:

  • Auditing
  • Internal red-teaming
  • Whistleblowing
  • Infosec policy and inter-lab infosec collaborations
  • Responsible disclosure policies
  • AI incident sharing (e.g., AI incident database)
  • Hiring and operational costs at AI safety research organizations

Should we invest in safety bounties when doing so would detract from these mechanisms?

I expect that including bounties in AI labs’ stress-testing portfolios is probably worthwhile, compared to their alternative options, if the relevant teams follow the recommendations in this report. One contact confirmed that, if they had the capacity, they would run more bounties at AI labs in the future. OpenAI spent approximately 0.01-0.1x as much time on the ChatGPT Feedback Contest as on external red-teaming and auditing of GPT-4.

There is some evidence on the costs and returns of bug bounties in cybersecurity. More companies are buying into bug bounties with significant payouts, implying that the programs provide returns worth the investment. Furthermore, talented individuals are being identified and rewarded through bug bounty programs: the top 22 hackers on HackerOne have each earned over $1 million cumulatively (HackerOne, 2022), with over $230 million distributed to hackers overall. The annual payout and number of users on HackerOne are increasing each year. In 2021, 1 million users were registered, and companies paid approximately $57 million in prizes through HackerOne.

Trends in the number of hackers and total payouts on HackerOne over the last five years.

Year | Total payout ($ millions) | Number of hackers (thousands) | Payout per hacker ($)
2018 | 23.5 | 166 | 142
2019 | 19 | 300 | 63
2020 | 40 | 600 | 67
2021 | 57 | 1,000 | 57
2022 | 73 | – | –

OpenAI’s ChatGPT Feedback Contest offered $10,000 in OpenAI credits as prizes, to be distributed among the top 20 entrants. The contest elicited 1,500 submissions.

The above figures imply an average of roughly $7 per submission in the ChatGPT Feedback Contest. The annual statistics from HackerOne, shown in the table above, indicate roughly $50 in prizes per registered bounty hunter. Bounties therefore seem a cheap approach to identifying new talent in AI stress-testing. Finifter et al. (2013, p. 9) argue that vulnerability reward programs are a cost-effective mechanism for finding security vulnerabilities compared to alternative mechanisms. While not a definitive assessment, I think this can serve as a rough reference class for the opportunity cost compared to similar mechanisms.
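For transparency, the rough per-engagement figures cited above can be reproduced directly from the contest numbers and the 2021 HackerOne figures in the table:

```python
# Rough cost-per-engagement figures cited in the text.

# ChatGPT Feedback Contest: $10,000 in credits across ~1,500 submissions.
cost_per_submission = 10_000 / 1_500
print(f"~${cost_per_submission:.0f} per submission")  # ~$7

# HackerOne, 2021: ~$57M paid out across ~1M registered hackers.
payout_per_registered_hunter = 57_000_000 / 1_000_000
print(f"~${payout_per_registered_hunter:.0f} per registered hunter")  # ~$57, roughly the ~$50 figure cited
```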

Acknowledgments

This report is a project of Rethink Priorities, a think tank dedicated to informing decisions made by high-impact organizations and funders across various cause areas. The author is Patrick Levermore. Thanks to Ashwin Acharya and Amanda El-Dakhakhni for their guidance; to Onni Aarne, Michael Aird, Marie Buhl, Shaun Ee, Erich Grunewald, Oliver Guest, Joe O’Brien, Max Räuker, Emma Williamson, and Linchuan Zhang for their helpful feedback; and to Adam Papineau for copyediting.

If you are interested in RP’s work, please visit our research database and subscribe to our newsletter.

References

Barnes, B. (2023). Update on ARC’s recent eval efforts. ARC Reports. Retrieved from https://evals.alignment.org/blog/2023-03-18-update-on-recent-evals/

BiasBounty.AI. (2023). New bias bounty from biasbounty.ai. BiasBounty.AI Blog.

Brundage et al. (2020). Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims. arXiv. https://arxiv.org/abs/2004.07213v2

Clark, J. (2023). AI Audit challenge. Medium. https://medium.com/@jackclark/ai-audit-challenge

Finifter, M., Akhawe, D., & Wagner, D. (2013). An empirical study of vulnerability rewards programs. In Proceedings of the 22nd USENIX Security Symposium (pp. 273-288).

Grace, K. (2015). AI Impacts research bounties. Retrieved from https://aiimpacts.org/ai-impacts-research-bounties/

HackerOne. (2017). Hack the Pentagon. HackerOne Blog. https://www.hackerone.com/blog/Hack-the-Pentagon

HackerOne. (2018). Hacker-Powered Security Report 2018.

HackerOne. (2019). Hacker-Powered Security Report 2019.

HackerOne. (2020). Hacker-Powered Security Report 2020.

HackerOne. (2021). Hacker-Powered Security Report 2021.

HackerOne. (2022). Hacker-Powered Security Report 2022.

HackerOne. (2022). The 2021 Hacker Report. HackerOne Reports. https://www.hackerone.com/sites/default/files/2021-04/the-2021-hacker-report.pdf

Hatta, A. (2022, April 10). Existential Risk and the Offence-Defence Balance. Retrieved from https://medium.com/@afiqhatta.ah/existential-risk-and-the-offence-defence-balance-4f17c2d6366f

Kenway, J., François, C., Costanza-Chock, S., Raji, I. D., & Buolamwini, J. (2022, January). Bug Bounties for Algorithmic Harms? Algorithmic Justice League. Retrieved from https://www.ajl.org/bugs

Lohn, A. (2020, December). Hacking AI: A Primer for Policymakers on Machine Learning Cybersecurity. Retrieved from https://cset.georgetown.edu/publication/hacking-ai/

Lohn, A., & Hoffman, W. (2022, March). Securing AI: How Traditional Vulnerability Disclosure Must Adapt. CSET. Retrieved from https://cset.georgetown.edu/publication/securing-ai-how-traditional-vulnerability-disclosure-must-adapt/

Mozilla. (2023, March 3). Auditing AI: Announcing the 2023 Mozilla Technology Fund Cohort. Retrieved from https://foundation.mozilla.org/en/blog/auditing-ai-announcing-the-2023-mozilla-technology-fund-cohort/

National Institute of Standards and Technology. (2023). NIST AI risk management framework. NIST Reports.

OpenAI. (2022). ChatGPT Feedback Contest: Official Rules. OpenAI Blog. https://www.openai.com/blog/chatgpt-feedback-contest-official-rules

OpenAI. (2023). GPT-4 System Card. OpenAI Blog. https://www.openai.com/blog/gpt-4-system-card

Raji, I. D., Xu, P., Honigsberg, C., & Ho, D. E. (2022, June). Outsider Oversight: Designing a Third Party Audit Ecosystem for AI Governance. arXiv preprint arXiv:2206.04737.

Rubinovitz, J. B. (2018). Bias Bounty Programs as a Method of Combatting Bias in AI. Retrieved from: https://rubinovitz.com/articles/2018-08-01-bias-bounty-programs-as-a-method-of-combatting

Whittaker, Z. (2021). Cybersecurity: This is how much top hackers are earning from bug bounties. ZDNet. https://www.zdnet.com/article/cybersecurity-this-is-how-much-top-hackers-are-earning-from-bug-bounties

Yee, K., & Font Peradejordi, I. (2023). Sharing learnings from the first algorithmic bias bounty challenge. Twitter Blog. Retrieved from https://blog.twitter.com/engineering/en_us/topics/insights/2021/learnings-from-the-first-algorithmic-bias-bounty-challenge

Yerushalmy, J. (2023, February 17). ‘I want to destroy whatever I want’: Bing’s AI chatbot unsettles US reporter. The Guardian. Retrieved from https://www.theguardian.com/technology/2023/feb/17/i-want-to-destroy-whatever-i-want-bings-ai-chatbot-unsettles-us-reporter