Types of specification problems in forecasting

This research was written by Juan Gil, Visiting Fellow at Rethink Priorities, in Summer 2021 as part of RP's internship program.

In the context of forecasting, how you ask a question can be a large factor in how much information you get from the forecasts on that question. We will thus explore the problems associated with the way you operationalize a question on a forecasting platform, which we will call the question specification. (Terms also used include resolution criteria or operationalization.) Specifically:

  • What types of specification problems reduce or misalign the optimization power going towards prediction (in different forecasting settings)?
  • What are some ways to resolve these different problems?

There are various other alignment problems related to forecasting platforms that this post will not address (see this paper for more on alignment of forecasting platforms):

  • Alignment related to the scoring system, i.e. rewards that in some cases incentivize forecasters not to report their true probabilities
  • “Dead people can’t get paid”: Forecasting on existential catastrophes causes self-interested forecasters to condition on no existential catastrophe happening, since they won’t receive a reward in situations where they’re dead

Summary

In this post, I’ll describe three major categories of specification problems on forecasting platforms, discuss how they’re related to one another, and outline how you might mitigate or trade off some of the costs. Note that these categories are not “crisp”; they are related to one another and a question can simultaneously have problems in different categories.

  • Ambiguous - a specification might not be tight enough for it to be clear how it should resolve, which imposes costs on forecasters and people using those forecasts in their decision-making
  • Misaligned - a specification might be misaligned with what the question-asker cares about such that they get less information from the forecasts
  • Uncompelling - a question might just not be interesting or important enough to incentivize forecasters to think about it, especially in settings like Metaculus where interest or desire to help are primary motivators (rather than e.g. financial incentives).

In particular, I’ll explore how these problems impose costs on both forecasters and question-askers. (Note: I’ll use “question-asker” to refer to any party that benefits from accurate predictions on a particular question.)

Ambiguous questions

Ambiguous questions do not have clear, well-defined specifications. It might not be clear what states of the world will cause the question to resolve one way or another, or there’s room for competing interpretations.

For an example from the EA Forum, take a look at this bet between Michael Dickens and Buck Shlegeris on the proposition: “By the end of 2021, a restaurant regularly sells an item primarily made of a cultured animal product with a menu price less than $100.”

This bet was made in 2016, but it was quite unclear how the bet should resolve near the end of 2021. What does it mean to “regularly[1] sell[2]” something? What is a “restaurant”? Ultimately, an arbiter was selected to resolve the dispute and further operationalize the question to remove ambiguity.[3]

Metaculus has a guide on how to write unambiguous and useful questions here. I find that it illustrates well the different ways that a specification can be unintentionally ambiguous. Especially check out the section “Guidelines for resolution conditions”:

  • Aim for tight resolution conditions. The resolution conditions should leave little room for discretion in deciding the resolution. As best you can, try to limit the scope for ex-post quarrels about what really happened, and who was right.
  • Define your terms. Questions of the sort “will X occur?” often hinge on how X is defined. It is therefore important to spell out your definitions with extra care. Don’t worry about being a little pedantic here!
  • Be concrete. Try to specify precisely and in detail which steps should or shouldn’t be followed when resolving the question. Examples are helpful for making these instructions concrete.
  • Use authoritative sources, when possible. Good options are numerical data regularly published by a reliable publicly available source. Note that you should be sure that the sources will be available at the time of resolution, or otherwise you might want to specify alternative sources of information.
  • Consider and account for edge-cases. Try to imagine scenarios for which the resolution conditions fail to cleanly apply, or cases that are just on the edge of counting towards resolution. If such scenarios or edge-cases are plausible, you should clarify how the question should resolve when such events rear their head.
  • Consider fall-back criteria. When you have a resolution that should be easy to check assuming all goes well, try to handle also the case where all doesn't go well. What if the data source you specified stops being published? Is there anything else odd that might happen to make the outcome unclear?
  • Try to account for unknown unknowns. Think about how the resolution criteria behave when something you don’t expect happens anyway.

There are different levels of ambiguity, and while I don’t think it’s often feasible to remove all ambiguity, reducing unintended ambiguity is typically better. In the worst cases, too much ambiguity can make the question incoherent or self-contradictory. More commonly, the question only becomes ambiguous with low probability (e.g. through edge cases or unanticipated events in the world). The lower the probability of ambiguity, the less costly it will be to forecasters.

This problem becomes harder on longer time scales since there’s more time for the world to change in a way that makes your question no longer make sense or resolve clearly one way or another. For example, if the resolution criteria are based on the reporting of some media outlet, but that outlet ceases to exist, then the question will resolve ambiguously unless you have fall-back criteria.

Forecasting platforms primarily deal with ambiguous questions in two ways:

  • Invalidation of question → the question resolves neither positively nor negatively. Depending on the platform, forecasters usually get no reward (though they may get their money back if it was a market).
  • Arbiter decides → some arbiter has final say on all resolutions such that they can resolve the question in accordance with the spirit of the question

Costs of ambiguity

Costs to forecasters:

  • When an ambiguous question is invalidated:

    • This reduces the incentive to participate in the first place (since there’s a chance that the forecaster will get no reward).
    • This might actively disincentivize forecasters since it’s annoying for a question to be invalidated after you’ve invested resources in predicting.
  • When an ambiguous question is resolved by an arbiter (and the arbiter’s decision is hard to predict)[4]:

    • The forecaster would need to predict how the arbiter might decide (rather than the object-level question). This adds uncertainty that might be harder to account for, so risk-averse forecasters might choose not to participate.
    • Trying to predict what an arbiter will decide is often less interesting/important to think about, so forecasters motivated by those factors will be less interested in participating.
    • Depending on the context, forecasters might choose not to participate in a market if they fear corruption of the arbiter. Specifically, a corrupt arbiter might directly or indirectly participate in the prediction market and then resolve the question in a way that causes them to profit. More ambiguity gives the arbiter more flexibility to resolve the question in a corrupt way.

      • This doesn’t seem to be a big problem in today’s prediction markets, and I don’t think it’ll become a big problem as long as there’s competition between different platforms (which would cause forecasters to leave platforms with corrupt arbiters over time).
  • A question with lots of ambiguity can be hard to start forecasting in the first place. One benefit of concrete specifications is that they ground the question closer in reality, and when specifications are poor, forecasters might have to partly do this work themselves anyway.
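
As a rough illustration of the first cost above, here is a minimal sketch of how the chance of invalidation eats into a forecaster's expected reward. It assumes a deliberately simple model in which an invalidated question pays nothing and the forecaster's effort is already spent; the function name `expected_reward` and all numbers are hypothetical.

```python
def expected_reward(p_invalid: float, reward_if_valid: float, effort_cost: float) -> float:
    """Expected net reward when an invalidated question pays nothing.

    Assumption: the effort cost is paid whether or not the question is invalidated.
    """
    if not 0.0 <= p_invalid <= 1.0:
        raise ValueError("p_invalid must be between 0 and 1")
    return (1.0 - p_invalid) * reward_if_valid - effort_cost


# Hypothetical numbers: a reward worth 10 units and 4 units of effort.
for p_invalid in (0.0, 0.2, 0.5):
    print(p_invalid, expected_reward(p_invalid, reward_if_valid=10.0, effort_cost=4.0))
```

Under these made-up numbers, participating stops being worthwhile once the chance of invalidation passes 60%. The model ignores the annoyance and risk-aversion effects described above, which would push that threshold lower.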

Costs to question-asker:

  • If it’s less clear what forecasters are actually predicting, it’s harder to use that information effectively.
  • Costs to forecasters will result in reduced optimization power working on forecasting the question.

Reducing the costs for forecasters of a market resolving ambiguous:

  • A forecasting platform can make it less likely that the market resolves ambiguous by having a trusted arbiter that can resolve the question according to the “spirit” of the question.

    • For example, Polymarket indicates on all of their questions that, in the event of ambiguity, the question will be resolved at the discretion of their Markets Integrity Committee.
  • Forecasters could hedge against the risk of invalidation if there were rewards associated with predicting the invalidation of the question itself.

    • For example, Augur, a decentralized prediction market using Ethereum, lets forecasters buy not just contracts based on the possible resolutions, but also a contract that pays out if the question is invalidated (see here).
    • This allows forecasters to hedge their risk by buying contracts for invalidation, and a high-priced invalidation contract indicates that a question might be risky to spend resources predicting (since this means that the market puts high probability on the question being invalidated).
    • This also incentivizes forecasters to determine problems with the specification that would lead to invalidation (since they could buy contracts for invalidation beforehand).
    • On a platform like Metaculus, you could have a secondary question that asks for the likelihood that the original question resolves ambiguous (or changes specification, etc.). Unfortunately, you might run into a problem where you create too many questions since the simplest version of this would just double the number of questions on the platform.
    • Additionally, many forecasters on platforms like Metaculus might be annoyed at invalidation of questions in a way unrelated to points received, since the points are only one of many incentives (others being the intellectual challenge or importance of the questions).
    • Rewarding predictions of invalidation can also create perverse incentives: people might intentionally launch invalid questions in order to earn rewards when those questions are invalidated.[5]
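
To make the hedging idea concrete, here is a minimal sketch of the payoffs. It assumes a simplified market in which each contract (YES, NO, or INVALID) costs its price and pays 1 unit if the question resolves that way. This is an illustrative model, not Augur's actual payout mechanics, and the `Position` class, the `net_profit` function, and all prices are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    outcome: str   # "YES", "NO", or "INVALID"
    shares: float  # number of contracts held
    price: float   # cost per contract, between 0 and 1


def net_profit(positions, resolved_outcome):
    """Payout minus cost, assuming each winning contract pays 1 unit."""
    cost = sum(p.shares * p.price for p in positions)
    payout = sum(p.shares for p in positions if p.outcome == resolved_outcome)
    return payout - cost


# Hypothetical portfolios: the second adds a small INVALID position as a hedge.
unhedged = [Position("YES", 100, 0.60)]
hedged = [Position("YES", 100, 0.60), Position("INVALID", 20, 0.10)]

for outcome in ("YES", "NO", "INVALID"):
    print(outcome, net_profit(unhedged, outcome), net_profit(hedged, outcome))
```

With these hypothetical numbers, the hedge cuts the loss under invalidation from 60 to 42 units while costing 2 units of profit in the other two outcomes. The INVALID price of 0.10 can also be read as the market putting roughly a 10% probability on invalidation.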

Misaligned questions

A misaligned question has a specification that, while potentially unambiguous, fails to cover what the question-asker actually cares about. In other words, a misaligned question is one in which the letter of the question diverges from the spirit of the question.

There are different degrees of misalignment. Creating a question that predicts a proxy for what you care about can be fine if the proxy gives you substantial information about the thing you care about. For example, if I’m creating a market to predict the presidential election, it’s probably fine to resolve based on some combination of credible media reports, since these are pretty tightly correlated with the actual election results (though they do diverge sometimes!). So, a question is only misaligned to the extent that the chosen proxy diverges from what the question-asker cares about.

An example of a well-specified but misaligned resolution criterion (here): a question asked whether North Korea would launch a missile by a certain date. The launch did occur, but the resolution criteria required confirmation from the US Department of Defense. That confirmation never came, so the question resolved negatively.

One of the contract’s rules is that “the source used to confirm a test missile being launched and leaving North Korean airspace will be the U.S. Department of Defense.” The problem is that, according to Tradesports spokesman Matt Bonner, they made “numerous efforts to receive direct confirmation from the DoD” but were told “no statement involving the missile test and North Korean airspace would be forthcoming, as those specifics are considered a matter of national intelligence/security.” Bonner emphasized that “a confirmation source is, by definition and necessity, an integral part of the proposition on which contracts trade” – and said that traders are “obligated to be familiar with the rules of a contract before they place an order.”

I don’t think this example is particularly egregious (as in, I don’t think the question-askers were obviously wrong in their specification) since DoD reports do seem correlated with North Korea missile launches, but this happened to be a high-profile case where the spirit of the question and the letter diverged.

It can be hard to tell how tightly your proxy is tied to what you care about, especially over longer time horizons. For example, maybe I want my resolution to depend on how some reputable organization reports on it. But, what if that organization changes methodology, leadership, focus, etc. over time such that its reporting diverges from my actual question at resolution time?

Goodhart’s Law and forecasting

Another form of alignment failure can happen if the optimization power of the forecasting platform is strong enough (e.g. if there’s lots of money at stake) that forecasters try to change the world to meet some resolution criteria in a way that makes the resolution criteria less correlated with what you care about. This is an example of Goodhart’s law, often summarized as “When a measure becomes a target, it ceases to be a good measure.”

For example, maybe the number of daily visits to the EA Forum is a reasonable proxy for size of the EA community. But, if the stakes are high enough in a forecasting setting, it’s potentially easy to artificially inflate those numbers using bots.

An additional problem (besides the forecast potentially becoming less useful for the original purposes) involves how the optimizers change the world to meet their goal. They might make the world a worse place in doing so, i.e. create negative externalities.

I don’t think this is a common problem, but it becomes more likely as the stakes get higher and for questions that individuals in the market are well-positioned to influence.[6]
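
The tweet-count market described in footnote 6 shows how the choice of metric determines what can be gamed. Below is a minimal sketch comparing two ways of counting. The function names and the numbers are hypothetical, and the second metric assumes that the timestamps of individual posts can be observed, which may not hold in practice (for example, if posts are deleted before anyone records them).

```python
from datetime import datetime, timedelta


def net_count_change(total_at_start: int, total_at_end: int) -> int:
    """Metric that subtracts account totals; deleting old posts lowers it."""
    return total_at_end - total_at_start


def posts_in_window(post_times, start: datetime, end: datetime) -> int:
    """Alternative metric: count posts timestamped in [start, end)."""
    return sum(1 for t in post_times if start <= t < end)


# Hypothetical week: 50 new posts, but 200 older posts deleted mid-week.
start = datetime(2021, 6, 7)
end = start + timedelta(days=7)
new_posts = [start + timedelta(hours=3 * i) for i in range(50)]

total_at_start = 1000
total_at_end = total_at_start + len(new_posts) - 200

print(net_count_change(total_at_start, total_at_end))  # driven by the deletions
print(posts_in_window(new_posts, start, end))          # unaffected by deleting old posts
```

In this made-up scenario the subtraction-based metric reports a change of -150 even though 50 posts were made, so anyone able to induce deletions can move the resolution. The windowed count is harder to manipulate this way, though it is not immune to other forms of influence (such as paying for more posts).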

Costs of misalignment

In the case of misalignment, there seems to be a tradeoff between costs to the forecasters and costs to the question-asker, depending on whether the misalignment is anticipated by forecasters or not:

  • If the forecasters predict that the letter of the question will diverge from the spirit of the question, then their predictions will no longer be about the thing that the question-asker cares about. In this case, the question-asker pays a cost.
  • If the forecaster makes their predictions based on the spirit of the question, then they will incur costs (e.g. not earning points / money, getting mad) if the question resolves in a misaligned way instead. However, the question-asker gets the information they wanted from the forecasters.

    • In the above example of the North Korea missile launch, the misalignment seems to have mostly hurt forecasters rather than the question-askers. Forecasters seem to have predicted on the question as if the DoD report would happen in the case of a missile launch, so the question-asker got useful predictions about the missile launch.
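
The gap between forecasting the letter and forecasting the spirit can be sketched with a little arithmetic. This is a minimal illustration assuming the question only resolves positively if the event happens and the named source confirms it, and that the source never confirms an event that did not happen. The function names and the probabilities are hypothetical.

```python
def p_letter_yes(p_event: float, p_confirm_given_event: float) -> float:
    """Probability the question resolves positively under its literal criteria.

    Assumption: confirmation never occurs without the event.
    """
    return p_event * p_confirm_given_event


def letter_spirit_gap(p_event: float, p_confirm_given_event: float) -> float:
    """How far a letter-based forecast sits below the probability of the event itself."""
    return p_event - p_letter_yes(p_event, p_confirm_given_event)


# Hypothetical numbers: a 70% chance of the event, 50% chance the source confirms it.
print(p_letter_yes(0.7, 0.5))
print(letter_spirit_gap(0.7, 0.5))
```

With these made-up inputs, a forecaster who predicts the letter would report about 35%, while one who predicts the spirit would report 70%. The question-asker cannot tell which kind of forecast they are reading unless they know how forecasters treated the confirmation requirement.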

There are also the costs of the negative externalities in the cases where the optimizing force of the prediction market causes the world to change, discussed above.

Tradeoff between ambiguity and alignment

I want to briefly point out the apparent tradeoff between the ambiguity and alignment of a question, in the context where some arbiter decides how the ambiguity is resolved.

When a forecasting question is unambiguous, the arbiter has little or no wiggle room to decide how the question will resolve. In many situations, this is beneficial to forecasters and thus the question-asker.

However, in a situation where misalignment seems likely (or it seems hard to create an unambiguous specification that is aligned), ambiguity might be beneficial. In this case, a trusted arbiter could resolve the question according to the spirit of the question.

Uncompelling questions

An uncompelling question is one that, while maybe unambiguous and well-specified, is just not interesting or important enough for forecasters to care about participating.

This is specifically a problem for platforms like Metaculus. While a forecaster might be modelled theoretically as someone trying to maximize their points, in reality, points are often secondary to the question actually being interesting/important for them to think about. Relatedly, these platforms and individuals on them have more limited attention, which might be less the case in markets with larger financial incentives.

So, this problem is less about the strong optimization force of the prediction platform being misaligned, and more about the limited optimization power of the platform being reduced.

Identifying and implementing better question specifications

This section will cover two questions:

  • How can question-askers get better specifications?
  • How should platforms allow question-askers to modify questions?

How to write better specifications

  • Get more people to consider alternative specifications:

    • The platform itself might have moderators skilled at writing good specifications, either volunteers or paid by the platform, that could look at your question.

    • External to the platform, a question-asker could hire contractors specifically for this purpose.

    • Crowdsourcing improvements to specifications:

      • Offer a bounty for identifying a flaw in resolution criteria (see Polymarket, which offers bounties of hundreds of dollars).
      • In settings like Metaculus, the platform could award non-monetary prizes of some sort, like a new or existing karma system[7], for improved specifications.
    • Outsourcing the work of fixing a specification might be challenging since the question-asker needs to clearly communicate what they want from the question to the moderators/contractors/users who are improving the specification. This is similar to the problem of communicating to employees, grantees, contractors, etc. what exactly you want when you outsource or delegate work to them.

  • Solicit forecasts for a range of proxies to the question you care about (see CSET, which asked forecasters to forecast several well-specified questions on proxies to the phenomena they care about)

    • If you create several strong specifications for questions that are proxies to what you really care about, then you’re more likely to get useful information on what you care about while still having unambiguous specifications.
    • Costs to this approach:

      • You might have too many questions, which can “dilute” the optimization power to some extent as forecasters may choose to not predict on many of the questions.
      • If the questions are less obviously tied to the central question (that the question-asker cares about), then forecasters might also care less about that question.

How to change the specification

If you identify a better specification before you launch the question, then it’s simple to change it before it’s open for forecasters. However, if the problem is only identified after the launch of the question, it’ll be trickier.

Here are a few options:

  • For low-stakes markets (few forecasters / not long-running / no large investments), you might just be able to change the specification in reasonable ways without resetting the question and without making anyone mad.

    • However, this might be a judgment call. There are some cases where it might appear to be low stakes, but by changing the specification, you might unfairly disadvantage forecasters who were relying on the initial specification.
  • The moderator of the platform could reserve the authority to modify the specification.

    • For example, FTX writes in many of their new prediction questions that they reserve the right to change the question specification at any time.
    • This could solve misspecification and ambiguity problems if the arbiter is trusted and if there’s a clear, coherent “spirit” of the question when it first launches.
  • A new question could be launched with the updated specification. In these situations, there are a few things you could do to the old question:

    • Invalidate the old question.

      • When playing for points, this could mean that nobody gets any points. In a market, this could mean participants get their money back.
      • Invalidation can be costly, and the risk of invalidation can reduce the incentives for forecasters to participate.
      • I think this option makes sense if the question is very ambiguous or incoherent.
    • Keep the old question up alongside the new question.

      • The simplest solution might just be to keep the old questions up alongside the new questions, even if they’re about approximately the same thing.
      • However, this might have some big downsides.

        • If forecaster attention is limited, then having multiple similar questions will dilute the optimization power.
        • This might also lead to confusion if there are multiple questions on approximately the same thing.
      • On the other hand, if the questions are similar enough, a forecaster would definitely not have to do double the work to forecast on both questions (and in many cases might do nearly the same amount of work).
    • Close the question early.

      • On platforms like Metaculus, the question-asker can close the question such that no new predictions come in.
      • When the (old) question is resolved, you award points following the old question’s specifications[8].
      • This allows forecasters to get the points they deserve on the old question while bringing them and new forecasters to predict on the better specification.

Credits

This research is a project of Rethink Priorities.

It was written by Juan Gil, an intern at Rethink Priorities. Thanks to alexrjl, Michael Aird, Nuño Sempere, and Linch Zhang for helpful feedback and conversations that led to this post. If you like our work, please consider subscribing to our newsletter. You can see more of our work here.

Notes


  1. One issue that came up was whether “regularly” and “by” mean “the period leading up to the end of the specified time period”, or just “any period of time before the EOY 2021, inclusive”. ↩︎

  2. One possible resolution source was Supermeat’s test kitchen, which offered cultured meat that was initially free. ↩︎

  3. Even after resolution, it was still sufficiently ambiguous that the judge was not certain that the resolution was correct (private info) ↩︎

  4. Note that these costs are only significant if the arbiter’s decision is not predictable. In many cases, the spirit of the question is clear and the arbiter is trustworthy, so forecasters can just predict based on the spirit of the question. ↩︎

  5. An old version of Augur suffered from a similar problem, and scammers profited from creating invalid questions. See here. ↩︎

  6. An example in which the stakes were high and the resolution was easy to influence, from Avraham Eisenberg on Polymarket questions: “There was a market on how many times Souljaboy would tweet during a given week. The way these markets are set up, they subtract the total number of tweets on the account at the beginning and end, so deletions can remove tweets. Someone went on his twitch stream, tipped a couple hundred dollars, and said he'd tip more if Soulja would delete a bunch of tweets. Soulja went on a deleting spree and the market went crazy. Multiple people made over 10k on this market; at least one person made 30k and at least one person lost 15k.” I’m not sure what the point of this question was in the first place, so I’m not sure how costly this was to the question-asker, but if the goal was to predict how much Soulja tweets on an average week, this market certainly failed to do that. ↩︎

  7. Tachyons could be used for this purpose. ↩︎

  8. In this case, you may need to truncate the points awarded to keep the scoring rule proper, though I haven’t thought about it in depth. See here for motivation about why that might be the case. ↩︎

Juan Gil

Juan Gil was a 2021 fellow at Rethink Priorities working on longtermism research. He recently graduated from MIT with a major in Mathematics with Computer Science. Previously, he researched topics in combinatorics and machine learning. He also organized the Effective Altruism MIT student group, serving as president for two years.
