
Impact ranking database: A feasibility study for Charity Navigator

Editorial note

This report was commissioned by Charity Navigator and produced by Rethink Priorities over the course of ~100 hours in April and May 2024. Charity Navigator does not necessarily endorse our conclusions.


Our report aimed to provide Charity Navigator with insights into a wide range of potential methods it could use to meet its objectives, as well as to sense-check approaches already under consideration. This required a broad set of research activities, from defining impact to concretely assessing the advantages and disadvantages of methodological options. Due to the wide scope, our research into each method was not exhaustive and sometimes relied on first principles reasoning in areas outside our expertise, such as impact definitions for arts and culture.


We don’t intend for this report to be Rethink Priorities’ final word on impact-ranking, and we have tried to flag major sources of uncertainty in the report. We recommend that Charity Navigator and other readers verify the claims and conduct more detailed research before making any investments based on our findings.

Executive summary

Charity Navigator currently publishes data on over 230,000 charities on their website and has asked Rethink Priorities to explore the feasibility of creating an impact-ranking database that aims to score these charities based on their impactfulness. The resources available to develop and maintain such a database are limited: about one full-time equivalent (FTE) for development, and considerably less than one FTE annually for maintenance.

We believe that guiding donations towards more effective charities could have a significant impact, as Charity Navigator influences many philanthropic donors. Inspiring even a small fraction of these donors could lead to substantial positive effects. For this reason, we are supportive of the idea of Charity Navigator investing in sharing information about the impactfulness of charities with its website users.

We argue that supporting donors who seek to make impactful donations necessarily requires taking cost-effectiveness into account, meaning the goal of an impact-ranking database would be to estimate how cost-effectively charities achieve a certain impact. That impact still needs to be clearly defined and operationalized, possibly per cause area, since Charity Navigator has shown a preference for ranking charities within cause areas to align with donor preferences. We are uncertain whether Charity Navigator could develop a clear and measurable definition of impactfulness or cost-effectiveness for all cause areas, because we were unable to find a feasible definition of impact for arts, culture, and humanities, which we used as a case study. Together with Charity Navigator, we found it difficult to identify examples of charities that would intuitively score relatively high or relatively low on impactfulness or cost-effectiveness, which makes us more skeptical that such a definition can be developed. Charity Navigator could try involving cause-specific experts to develop a feasible operationalization of impact, but we are uncertain whether this would work. If it does not, we consider this an important barrier to creating an impact-ranking database.

We researched three methods that could theoretically be used to rank all charities on the Charity Navigator website, but ultimately advise against using any of these:

  • The first method involves developing a machine learning algorithm trained to rate a charity’s ‘impactfulness’ or cost-effectiveness based on various characteristics. We consider this approach infeasible within Charity Navigator’s anticipated capacity; the primary problem is that a large number of charities would need to be manually evaluated and scored for impact for the algorithm to be effective. There is also a reasonable likelihood that even a sophisticated model would perform poorly.
  • The second method involves using large language models (LLMs) like ChatGPT or Claude to analyze supplied information and produce ratings of charities’ impact, or of specific dimensions relevant to assessing impact. We advise against this method because of concerns about capacity and accuracy. We think that any available standardized form of data (such as information on tax forms) contains too little information to accurately rank charities on cost-effectiveness. This means that information would need to be gathered from less standardized sources, such as charities’ websites. Gathering and pre-processing this information so that it can be fed into an LLM requires substantial technical resources. Moreover, creating a series of indicators with corresponding prompts and examples, and testing whether they have the intended effect, is also time-consuming and may need to be done separately for individual cause areas. We are uncertain whether this would be feasible within the capacity Charity Navigator has available. Even if some form of this were possible, capacity constraints mean the evaluation would have to be based on over-simplified indicators of cost-effectiveness, which would ultimately lead to a high risk of misclassification in the scoring. Even with a higher capacity investment, we think it is unlikely that accurate cost-effectiveness assessments are feasible with current LLMs.
  • In the third method, impact ratings would be based on a taxonomy of interventions that have been ranked on impactfulness; charities working on impactful interventions would score higher than those that do not. While a simplified version of this method might be feasible, it would have very low accuracy due to several challenges. Existing taxonomies, like the Nonprofit Program Classification, are not designed to align with impact measures and often group together charities with widely varying levels of impactfulness, making them impractical for impact ranking. It may be possible to create a new taxonomy better suited to impact ranking, but we think this would still not yield a precise indication of which charities have a high impact. Developing a new taxonomy would also require investing capacity, as would developing a method to map charities onto it, whether through self-reporting by charities or by using LLMs; the latter introduces further risks of inaccuracy and capacity issues.

Overall, we think each of these methods would require more capacity to implement and maintain than Charity Navigator has available, and each has a high likelihood of low accuracy and errors. The latter presents several issues. First, errors might be publicly exposed, potentially damaging donors’ confidence in Charity Navigator. Less significant errors can still be costly if they require extensive time to verify and correct, especially when raised as complaints by inaccurately rated charities. Given the large number of charities involved, even a minor error rate could strain Charity Navigator’s ability to handle these issues effectively. Furthermore, an error-prone system could unjustly influence donation decisions, harming charities whose ratings are inaccurately low.

We think there are good alternatives for Charity Navigator to guide donations towards impactful giving and to offer a service to impact-oriented donors. In particular, donors could be presented with a small curated list of highly impactful charities, based on a combination of existing and new prioritization research. This list could have separate sections for specific cause areas to guide donors with specific preferences. Charity Navigator has already done something similar with recommendations from Giving Green and could expand that work into other areas. Additionally, we think creating an educational module to guide users about how they could approach making the most impactful donations could be valuable.

Introduction

Charity Navigator currently publishes ratings for over 230,000 charities on their website. It has asked Rethink Priorities to do a feasibility study on creating an impact-ranking database that would ultimately aim to score all charities eligible for a rating on its website with respect to their impactfulness. The resources Charity Navigator would have available to create and maintain such a database are limited: about one FTE (possibly external) to develop such a database, and considerably less than one FTE each year to maintain the impact-ranking database.[1]

We determined through discussion with Charity Navigator that it prefers a solution that splits charities into broad cause areas, with impact scoring applied within each area, rather than a ranking that applies across all cause areas. Therefore, for each solution we consider below, we take a ‘case study’ or ‘exemplar-based’ approach to demonstrate how the solution might be applied in two broad cause areas: Health and Arts, Culture, and Humanities. Health was selected because Charity Navigator has already done some work on ranking health interventions, and Rethink Priorities has experience in thinking about health-related interventions (in particular for global health). Arts, culture, and humanities was selected because it is very different from health and could represent a challenge in terms of defining and comparing impactfulness. However, we also consider some potential solutions that could apply a more overarching impact assessment across charities as a whole.

Defining the scope of cause areas

Rating charities on their relative impactfulness within particular cause areas requires a clear definition of what such cause areas should be. Charity Navigator has not yet developed this. In this section, we discuss various considerations for deciding upon an appropriate scope for cause areas when comparing interventions. We refer to this as the ‘scope of comparison’. We begin with some general considerations that we think should be taken into account when defining a cause, and then apply this to the two case studies.

Considerations can be separated into pragmatic concerns and substantive concerns. Pragmatic concerns refer to constraints resulting from the practicalities of implementing such a system and the capacity for Charity Navigator and its users to meaningfully interact with and use it. Substantive concerns revolve around ensuring that the chosen scope of cause areas allows meaningful distinctions to be made between charities within each cause area, so that the ranking can in fact help people direct their donations to more impactful charities if they wish to do so.

Pragmatically, the scope of cause areas should be such that:

  • Causes are recognizable to the users of Charity Navigator, and align with the way users already look for causes.
  • The list of causes is short enough to be navigable for users.
  • It is viable for Charity Navigator to go through the final list of causes one by one to set up a methodology for each.
  • Where possible, the taxonomy that Charity Navigator already uses to categorize charities[2] maps easily onto the cause area definitions.

Substantively, the scope of cause areas should be such that:

  • The causes are broad enough so that we can meaningfully distinguish between low and high impact charities within each cause area.
    • The broader the category, the bigger the range of impactfulness we expect to find among charities within it, and the greater the difference Charity Navigator can make by guiding users to the most impactful giving opportunities.
  • The causes are specific enough to meaningfully compare charities and have a uniform definition of or criterion for ‘impactfulness’ within the cause area.
    • Note that according to some perspectives, all charities could in theory be made comparable by defining their impact according to a single common metric, such as improvements in wellbeing.

Given that some of these considerations may be in tension with one another, there is no single ‘best’ answer for the optimal scope of cause areas. We suggest defining the scope of cause areas as broadly as possible while still retaining reasonable alignment with categories that users would recognize as areas in which they might be interested in donating. Charity Navigator could consider conducting user interviews to determine how well different categorizations fit with user understanding. For now, we will apply this logic and our subjective intuition to arrive at a best guess of what a reasonable scope of comparison could be for the two case studies.

Box 1: Case studies on scope of comparison

Scope of comparison for health

For the purposes of this report, we are considering ‘health’ (of humans) as one cause area, interpreting it broadly to include mental health and physical health in all regions of the world.


To get a sense of what charities and interventions fall under this cause, we will use the Nonprofit Program Classification (NPC) taxonomy as a starting point, as suggested by Charity Navigator.[3] Categories and subcategories of the NPC are listed in this spreadsheet, and we also list some examples in the appendix. There are four main categories (‘hierarchies’) in the NPC taxonomy that relate directly to health: 1) Health Care, 2) Diseases, Disorders & Medical Disciplines, 3) Mental Health, Substance Abuse, and 4) Medical Research. For this case study, we will define the cause ‘health’ as all charities that fall under any of these NPC categories, aiming to keep the scope of cause areas broad for the reasons explained above.


Some alternative possible choices for the scope of comparison which we will not pursue for now are:

  • Combine the first three categories together, but treat medical research separately.
  • Consider ‘mental health’ separately, and cluster the other health interventions together.
  • Divide between ‘health focused on lower-income countries’ and ‘health focused on higher-income countries’.
  • Divide between themes such as “health related research”, “providing health care”, “prevention”, “other support for health related problems”. This would mean investing capacity to create a new taxonomy and to map charities onto this.
  • Consider each of the four NPC categories mentioned above as an individual cause. We expect there to be a lot of overlap between the first two categories (for example, the first category includes a subcategory ‘Health Diagnostic, Intervention & Treatment Services’, which could relate to any of the diseases listed in the second category).

Examples of interventions/organizations that fall under this category

There are 99,171 charities in the Charity Navigator database that fall under the four health categories described above. For reference, some charities that fall under this cause are:[4]

  • African Child Care and Growth Foundation
  • Ara African Bone Marrow Program Inc.
  • Global Council for Education in Congenital Heart Surgery
  • Medical Aid for Africa
  • Mini Magic Therapy Horses
  • Miracle of Music
  • Ms Wheelchair Pennsylvania Organization
  • Natural Family Planning of Nebraska
  • Sustainable Global Surgery
  • The Against Malaria Foundation

Scope of comparison for arts, culture, and humanities

For the purposes of this report, we are considering all charities that fall under ‘arts, culture, and humanities’ in the NPC taxonomy as one cause, because we do not have a strong argument to deviate from the NPC taxonomy in this case.


Some options which we considered but will not pursue for now are:

  • Considering humanities separately from arts and culture. We did not pursue this because, all things equal, there is more scope for differential impact by guiding people to the most impactful charities among a broader as opposed to a narrower set of charities.
  • Including other NPC categories such as ‘recreation & sports’ (because of a connection to entertainment). We chose not to include them because of a subjective intuition that there are large groups of donors who are interested in these topics specifically, while not being interested in arts and culture, and vice versa.

Examples of interventions/organizations that fall under this category

There are 120,350 charities in the Charity Navigator database that fall under the NPC category ‘arts, culture, and humanities’. For reference, some charities that fall under this cause are:[5]

  • Downtown Los Angeles Art Walk
  • Modern Quilt Guild Inc.
  • Puppetry in Practice Inc.
  • Southern Arizona Alliance for Bilingual Education
  • Taubman Museum of Art

Defining and operationalizing impactful giving

Note: most of the information below is first principles reasoning, based on discussions with Charity Navigator and discussions within the team.

Charity Navigator’s mission is “to make impactful giving easier for all” (Charity Navigator, n.d-a). In this project we are thinking about how to create a ranking that distinguishes between more and less impactful charities. In order to do so, a clear and coherent definition of impact is required, and this must also be operationalized in terms of observable features of a charity—or ‘proxies’ of impact—that can be quantified. The way in which Rethink Priorities defines impact may not match the way Charity Navigator defines impact, and there may also be mismatches between how Charity Navigator conceives of impact and what different donors/website users mean. We begin with a theoretical definition, and then discuss proxies for impact. At the end of this section we explore which proxies for impact seem most applicable to the case studies.

Theoretical definitions of impact

While certainly not exhaustive, we think three broad ways of conceiving of impact can highlight some of the major considerations for Charity Navigator when choosing how to define impact for ranking charities. Rather than being mutually exclusive, these are better thought of as different levels of abstraction, ranging from the most all-encompassing and overarching to the most specific. The first two levels of abstraction seem most relevant for the current project:[6]

  • An overarching or unitary scale of value. One ultimate concept you care about, such as the wellbeing of humans and nonhuman animals. Metrics that tend towards this kind of impact definition include Disability-Adjusted Life Years (DALYs), Quality-Adjusted Life Years (QALYs), Wellbeing Years (WELLBYs), and lives saved. These each aim to measure outcomes against a unified scale.
    • The advantage of this approach is that, in principle, it allows comparisons to be made across all causes. In addition, a lot of work has already been done to measure impact of interventions in terms of outcomes such as DALYs. In practice, it may be difficult or impossible to compare interventions on such an overarching definition.
  • Cause-specific scales representing different values. You could assume that different cause areas can have different end goals that are not necessarily comparable to each other or collapsible into an overarching scale. For example, “improved welfare for animals” could be treated as categorically different from “human welfare”, which again could be considered separately from “preservation of cultural heritage”.[7]

    • This approach allows you to take into account a diverse set of goals without having to translate these into a single underlying ‘currency’. It requires a definition of what the general goal(s) of each cause area should be considered to be, which is difficult in many cases.[8]

  • Charity-specific scales. Where impact is defined as achieving whatever the charity says it wants to achieve, such as the number of children from low-income families who visit a particular museum.
    • The disadvantage of this approach is that it does not allow meaningful comparisons of the impact of different charities unless they are working on precisely the same goal. The practical way to measure impact on this scale could be similar to what Charity Navigator currently does: measuring whether an organization seems generally well-functioning (such as being financially healthy). We will not consider this definition in the rest of this report.

Impact per dollar and proxies of cost-effectiveness

Supporting ‘impactful giving’ involves considering the impact per donation, or ‘impact per additional dollar given to the charity’. We think that knowing whether charities have significantly influenced certain impact metrics is of low relevance to donors unless we also take into account how cost-effectively these charities achieve those outcomes. This matters because any donor has finite resources to give. Therefore, the terms ‘impactful charities’ and ‘cost-effective charities’ will be used interchangeably in this report.

However, cost-effectiveness analyses for any single charity are often highly complex and time-consuming projects. In some of our case studies below, we do consider whether it might be possible to conduct cost-effectiveness analyses at scale using automated approaches, but we believe this is not possible. Consequently it is necessary for an impact-ranking database to rely upon proxies of impact/cost-effectiveness. If we consider a full-fledged cost-effectiveness analysis of each charity to be the gold standard (but unattainable) form of assessment with which we would like to be able to guide donors, a proxy would be a more easily observable indicator that we would expect to be associated with cost-effectiveness. By their nature, these proxies would be noisier than a true cost-effectiveness analysis.

We list some possible proxies here, each of which would need to be made even more concrete in order to be able to assess them at scale.

  • Charities that work with interventions that are typically considered cost-effective may be more likely to be cost-effective than those that do not.[9] In the absence of charity-by-charity cost-effectiveness analyses, it may be possible to note when a charity utilizes interventions that have been assessed to be cost-effective when employed by others.

  • Charities that claim to be cost-effective and report some evidence to support that claim may be more likely to be cost-effective.
  • Charities that have good monitoring and evaluation practices may be more likely to be cost-effective than those that do not.
  • Charities that weigh evidence into their decision making may be more likely to choose more impactful interventions, and therefore be more cost-effective, than those that do not.

In practice, there are limitations to which proxies we could use for each charity, and not all proxies would give meaningful results for all causes. In addition, some proxies may be more or less appropriate depending on how impact itself is conceived. Below we explore some possibilities more concretely for the two case studies.

Box 2: Case studies on definition of impact

Definition and operationalization of impact for health

In order to rank health charities in terms of impact per donated dollar, we could define this in terms of costs per Disability-Adjusted Life Years (DALYs) averted. DALYs aggregate the total health burden of diseases by combining years of life lost due to premature mortality with years lived with disability, quantifying both fatal and non-fatal health outcomes in a single figure (GiveWell, 2016). DALYs are widely used, including by the World Health Organization (2019). A lot of work has gone into translating impact from different diseases and interventions into effects in terms of DALYs.[10]
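In formula form, DALYs follow the standard decomposition used by the World Health Organization; the cost-effectiveness expression below is an illustrative sketch of how such a metric would combine with cost data in a ranking, not an established Charity Navigator metric:

```latex
% DALYs combine fatal and non-fatal health outcomes on a single scale:
%   YLL = years of life lost to premature mortality
%   YLD = years lived with disability, weighted by severity
\mathrm{DALYs} = \mathrm{YLL} + \mathrm{YLD}

% A charity-level ranking metric would then take the form
\text{cost per DALY averted} = \frac{\text{total program cost}}{\text{DALYs averted}}
```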


In practice, as noted above, directly estimating cost-effectiveness to the level of precision of DALYs would be challenging for many charities, but more importantly seems infeasible for the number of different charities under consideration by Charity Navigator. Proxies for charity-level assessments of cost-effectiveness that could be considered instead include:

  • Whether the charity works on a specific intervention which we believe to be cost-effective (we could use different methods to get information on cost-effective interventions)
  • Whether the charity works on a disease for which we know there are cost-effective interventions
  • Whether the charity mentions specific terms related to cost-effectiveness or evidence-based interventions on their website

Definition and operationalization of impact for arts, culture, and humanities

For the arts, culture, and humanities cause, we were not able to come up with a feasible definition of impact that Charity Navigator could use. In order to calculate some proxy for ‘impact’, ‘cost-effectiveness’, or ‘strength of evidence’, we would first need to determine some quantifiable or estimable goal (or set of goals) for charities within this cause area, similar to ‘averting DALYs’ for health; we were unable to do this.


Ambitious but unrealistic definitions of impact might relate to the impact of exposure to arts and culture on wellbeing, or the likelihood that some artistic or cultural event supported by a charity inspires a new artist to go on to contribute meaningfully to further aesthetic achievements. Exceedingly rough proxies for these might be the reported reach or number of people exposed to some cultural artifact supported by the charity. To factor in cost, one might search for information with respect to costs per exposure. However, it is clear that such a definition would apply only weakly to many charities in this space.


One way to search for such a definition is to consider concrete examples of charities (or hypothetical charities) that intuitively should score highly, and others that should score poorly on a metric of impact. We can then ask ourselves why one charity scores higher than the other, with the aim of finding a generalizable difference between the higher and lower-scoring charities. Such examples are not immediately obvious to us, and Charity Navigator indicated that they do not have examples either.


Clearly, defining impactfulness for arts, culture, and humanities is challenging. This poses a fundamental issue for any method we would try to apply: If we rank charities without a clear sense of what we are ranking them on, we cannot know whether that ranking actually adds any value, or whether a donation to a highly-ranked charity is in any sense better than a donation to a poorly-ranked charity.


However, Rethink Priorities has relatively little expertise in philanthropy surrounding arts, culture, and humanities, and there are likely experts who could help Charity Navigator develop a meaningful definition. This would require investing in external capacity to develop a definition of impact and might be necessary for several other causes as well.

Data sources and data processing methods for an impact ranking system

Note: all information below is first principles reasoning, based on discussions with Charity Navigator, discussions within the team, as well as our experience with impact assessments and automated methods in other projects. For this project we did not speak to experts or review literature.

There are many potential approaches that could be taken to rank charities on impact, each of which can use different sources of charity data, and different data processing techniques. In this section, we move away from more abstract considerations about defining impact, and look towards actual implementation of an impact ranking system. This includes consideration of potential sources of data, how such data could be processed to draw useful insights, and—for particular combinations of data and processing techniques—the kinds of proxies that might be used.

We outline a range of options in terms of data sources, and in terms of processing approaches, in the following sections, and summarize this in Tables 1 and 2, which also include an overview of some key advantages and disadvantages.[11]

Data sources

Table 1: Which charity data could be used?

| Data source | Advantages | Disadvantages |
| --- | --- | --- |
| Standardized public documents, including NTEE classifications (e.g., tax forms) | Standardized; readily available for all charities; contains some objective data; difficult to manipulate | Contains very limited information; does not capture all relevant activities of charities |
| Annual reports of charities | Potentially detailed; could provide insights into operations and goals | Information can be incomplete, biased, and not standardized, making it difficult to process; only available for a selection of charities |
| The charities’ websites | Potentially detailed; could provide insights into operations and goals; likely available for most charities | Information can be incomplete, biased, and not standardized, making it difficult to process |
| Self-reported data by charities (questionnaire) | Can be tailored to capture specific data needed for assessment; standardized | May be biased or manipulated; only available for a selection of charities (not all charities would report data) |
| Peer-reviewed literature or other forms of research and analysis (e.g., non-peer-reviewed research white papers) | Can provide in-depth and quantifiable analyses of key metrics of impact; peer-reviewed literature has been vetted by other experts | Requires highly tailored searching for relevant literature; relevant information likely unavailable for many charities; non-standardized formats make it difficult to process; not all peer-reviewed literature is of sound quality |
| Expert information | Domain experts can provide nuanced information with limited bias when enough expert perspectives are included, and can supply information in readily usable, standardized formats | Very unlikely to be implementable at the level of each charity, so necessarily introduces an element of noise |

Standardized public documents

Charity Navigator already makes use of a range of publicly available documents that charities are required to file, such as tax forms. Given Charity Navigator’s familiarity with these sources of information, we will not discuss them in further detail in this section.

Annual reports

One additional source of information that many charities provide, but are not legally obligated to provide, is annual reports. Such reports would be expected to provide more in-depth and nuanced information about the activities the charity is involved with than tax filings, which typically only outline activities at a very simplistic level. However, annual reports also function as a form of promotional material, and we would therefore expect them to be positively biased in favor of the charity and its importance or impact. Annual reports are also in a non-standardized format, which may make it difficult to extract usable information from them, and they may not be available for all charities (or at least be much harder to locate for some than for others). Given the number of charities that Charity Navigator wishes to rank, even a small amount of time spent locating such reports for a small percentage of charities could make acquiring this source of information exceedingly time-intensive. However, it may be possible to download and extract information from annual reports more automatically via a web scraper, described in the following section.

Charity websites

Similarly to annual reports, charity websites can offer a wealth of information that can aid in rating or categorizing charities, with more nuanced (but, similar to annual reports, probably more biased) information than is available in tax filings. Some key challenges we foresee with using information from charity websites are a) getting the correct website(s) for each charity, b) extracting information in a usable form, and c) checking whether the information is still up to date.

The first of these issues may be partially avoided by focusing on the subset of charities for which Charity Navigator already has web addresses. The second challenge—extracting information in a useful format—will require technical solutions and involvement of people with expertise in developing web crawlers and scrapers.

A web crawler is a program used to collect information from web pages. At a relatively simple level, a web crawler might start by navigating to a particular web page (e.g., the home page). Once on that page, it seeks to detect additional links on the page for the same website, and collect those URL links. It will then navigate to each of those pages and extract additional URLs, and thereby iteratively cover the content of a whole website. One might simply store all of the HTML from each page, or develop a custom web scraper that targets and extracts specific elements from a web page.
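As a rough illustration of the mechanics just described, a minimal crawler might look like the following Python sketch (assuming the widely used requests and BeautifulSoup libraries; the starting URL and page limit are placeholders):

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl_site(start_url: str, max_pages: int = 50) -> dict:
    """Iteratively visit pages of one website and store their raw HTML."""
    domain = urlparse(start_url).netloc
    to_visit, seen, pages = [start_url], set(), {}
    while to_visit and len(pages) < max_pages:
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable pages rather than failing the crawl
        pages[url] = response.text
        soup = BeautifulSoup(response.text, "html.parser")
        # Collect further links, staying within the same website
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if urlparse(link).netloc == domain:
                to_visit.append(link)
    return pages


# Hypothetical usage for a single charity:
# html_by_url = crawl_site("https://www.example-charity.org")
```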

While web scraping and crawling sound simple, in practice there can be difficulties. In particular, getting information from so many different websites, which all have different formats, and storing it in a coherent format to be used by other programs (e.g., to be fed into an LLM) would require time investment and debugging/checking before the generated data could be trusted. Depending on how much data such a process unearthed, storage solutions for large amounts of data might also be required.

We think that getting large amounts of useful information (mostly in the form of simple text) from charity websites would be feasible, but would itself represent a technical project requiring relevant expertise. We are less optimistic about cases such as extracting information about program delivery costs or other highly specific pieces of information from a host of different charity websites, because this more targeted kind of scraping usually involves very specific targeting of elements that will not be identifiable in the same way across different sites.

Self-reported data from charities

The aforementioned sources of data rely upon whatever information about charities already exists. Charity Navigator could instead consider soliciting specific types of information from charities in a standardized format. The key advantage of doing so is that Charity Navigator could tailor its questions towards highly specific things that bear on a particular definition of impact, rather than having to adjust its definition of impact to accommodate the potentially highly limited sources of information that already exist. A key drawback is that we would not expect complete coverage of all charities, as a potentially sizable proportion might not respond to such a request. In addition, depending on what questions were asked, charities might provide inaccurate or insincere responses that they expect would make them score more highly in terms of impact.

Peer-reviewed literature and other research

Academic research geared towards assessing the evidence for or against a host of interventions, or towards estimating their cost-effectiveness, could provide readily quantifiable and relevant indicators of impactfulness in a host of cause areas. In general, such research ought to be more reliable than unvetted claims in a charity’s annual report or webpages. However, obtaining such academic research would require the development of carefully constructed search strategies, ideally by a librarian with expertise in performing literature searches. The scope of interventions or areas that the literature would need to cover would rapidly become unmanageably vast. Extracting useful information from the resultant corpus of papers may present a further challenge, as although many papers follow a similar general outline, the key pieces of information will not be identifiable in any standardized way. At a very practical level, access to such literature may also require expensive subscriptions to a host of academic publishers. Finally, it is also not possible at such a scale to vet the quality or reliability of all the literature obtained through such a process. For these reasons, we think that reliance upon scientific or peer-reviewed literature would be appropriate only for more ‘boutique’ cost-effectiveness/impact assessments of much more limited scope than the hundreds of thousands of charities that Charity Navigator hopes to rank.

Information provided by experts

Given such a large number of individual charities and interventions, it is clearly infeasible to have experts rate or rank all of them. However, expert assessments can provide coarser-grained information that can then be applied at the charity level, representing another possible source of information. For example, if charities could be broadly categorized in terms of the areas in which they operate—for example, in terms of the cause (e.g., disease prevention) and location (e.g., low-income countries)—then experts could provide some meaningful information to attach to these categories.

Such ratings could concern things such as whether there are funding gaps for particular causes, or the potential for impact at a very broad scale (e.g., experts might reasonably judge that money donated to a low-income country is likely to go further than the same amount donated to a high-income country, in the same cause area). These high-level assessments could then be attached at the charity level, based upon each charity’s higher-level characteristics.

This approach to gathering information would require the careful development of clearly defined tasks for experts to perform. This development may be time-consuming, as may the identification of relevant experts, and the experts would also need to be compensated for their time. Because information would not be gathered directly at the charity level, there would necessarily be an element of noise when assessments made at a higher level are linked to specific charities: two charities may vary considerably in actual impact even if they operate in very similar areas.

A risk related to this method is that experts can be biased or have a view on what they consider ‘impactful’ that does not match with Charity Navigator’s intended definition. Accounting for these biases or differences in opinion requires capacity.

Data processing tools/methods

Having gathered relevant data for charities, multiple broad classes of methods for processing that data would be available. For the purposes of this report, we have focused on processing methods that were either raised by Charity Navigator, or that can plausibly be employed at the scale of assessment Charity Navigator is aiming for.

We think that the different processing methods that could be applied to charity-related data, and their relative strengths and limitations, are best understood with reference to their concrete applications in example use cases, which are the focus of the next section. Hence, we only provide a summary table here so as to orient the reader regarding the kinds of methods we cover.

Table 2: How could such data be processed? These methods would be applied to each individual charity to give it a final impact rating.

| Processing method | Advantages | Disadvantages |
| --- | --- | --- |
| Expert assessment | Can provide nuanced, accurate insights; quality control; not simple to manipulate | Manual (so likely too labor-intensive to apply to all charities); perhaps subjective |
| Custom machine learning model trained on charity evaluations | Scalable | Relatively opaque; likely inaccurate; high up-front investment (requires high-quality training data and pre-processing of input data); possible to manipulate |
| Large language model (LLM) with tailored prompts (potentially fine-tuned with a training data set) | Scalable | Relatively opaque; likely inaccurate; high up-front investment (requires high-quality training data and pre-processing of input data); possible to manipulate |
| Simple word-detection or rules-based computer program (e.g., counts of particular terms) | Scalable; transparent | Limited depth of analysis; inaccurate; easy to manipulate; high up-front investment (pre-processing of input data) |

Concrete approaches for implementing an impact ranking system

Having outlined potential sources of data and methods of processing that data, we can consider more concretely some possible ways that Charity Navigator might go about implementing an impact ranking system. The sections below go into more detail on how the different methods might be developed, and what potential barriers they might face. We do not exhaustively consider all possible combinations of data source and method, but instead aim to expand upon those approaches that have at least surface-level plausibility, or that were raised by Charity Navigator as possibilities.

We discuss three methods in the following order:

  • Automated with machine learning
  • Automated with a large language model (LLM)
  • Ranked taxonomy of interventions

For a summary of the key characteristics for each of the methods we considered, see Table 4 in the next section.

The main text in this section provides information with respect to the general properties and considerations for each approach, whereas the boxes delve into more nuances and details of possible applications. It is not necessary for the reader to go through each box to get an impression of the different possible approaches.

Automated with machine learning

Brief description of method: This method involves developing a machine learning (ML) algorithm that is trained to make ratings of a charity in terms of ‘impactfulness’/cost effectiveness. This rating would be based on various characteristics of each charity (in machine learning terminology, ‘features’). Examples of such features might be the size of the charity or the cause area in which the charity operates.[12] Notably, the training process requires a large data set of charities, their features, and the ‘true’ outcomes (or ‘ground truth’) with respect to the impact metric of interest for each charity. This is necessary for the machine learning model to ‘learn’ how features are associated with impactfulness. Once such an algorithm is developed, it would be possible to get a predicted impact score for new charities (i.e., charities that the algorithm was not directly trained on) on the basis of a new charity’s features.

Choices to make: This approach involves myriad choices:

  • The scope of cause areas within which to develop the algorithm (e.g., one overarching model, or different models for multiple smaller cause areas).
  • What kinds of features might be relevant
  • What sources of data may be used to develop these features, and exactly how that data should be converted to specific and useful features that are fed into the model.
  • The exact type of machine learning model to be run would also be an important choice.[13]

Accuracy: We would expect the accuracy of the kind of model that could be developed within the scope of this project to be unacceptably poor. Hypothetically, a machine learning model could perform well on this task. However, this requires massive upfront investment in collecting, cleaning, and coding data, and manually scoring many thousands of charities. This is necessary in order to train the model on what the correct impactfulness ratings are (the adage ‘garbage in, garbage out’ applies to these kinds of models). Even with considerable investment, it is possible that there is no reliable signal to be picked up in the data. If so, the model would perform poorly even with large amounts of data and a sophisticated modeling approach.

Upfront investment: This approach requires vast upfront investment in sources of training data. This includes not only finding and selecting sources of information and coding these into a format that can serve as a ‘feature’ for a machine learning model, but also actually evaluating thousands of charities on their impact. This is because the model learns by reducing error in predicting the outcomes in the training set. This means that the data the model is trained on also needs to include the correct answer regarding impactfulness, so that it can learn how features are associated with impact. This training set would need to be supplemented with a test set of charities that the model was not trained on, but for which you have also provided features and an impact rating. This is necessary in order to assess how well the model performs on charities it has not been trained on, which indicates how well it would perform ‘in the wild’ at rating new charities.[14]
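Purely to illustrate the train/test mechanics described above (not a recommended implementation), a sketch using scikit-learn with entirely hypothetical features and ground-truth scores might look like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix: one row per charity, e.g. [log revenue,
# cause-area code, has an evaluation team, % spent on programs].
X = np.random.rand(5000, 4)
# 'Ground truth' impact scores: in reality, each value would require a costly
# manual evaluation -- the main bottleneck of this method. Random placeholders
# here also illustrate the 'no signal' failure mode: the model cannot beat
# chance if the features carry no information about impact.
y = np.random.rand(5000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Held-out error indicates how the model might perform 'in the wild'.
print("MAE on unseen charities:", mean_absolute_error(y_test, model.predict(X_test)))

# Once trained, scoring a new charity only requires its features.
new_charity = np.array([[0.4, 2.0, 1.0, 0.8]])
print("Predicted impact score:", model.predict(new_charity)[0])
```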

Scalability: Once the algorithm has been developed, it can be deployed at scale to generate impact ratings for new charities. However, the algorithm requires information about the ‘features’ of each charity in order to generate a rating. Therefore, how scalable this is will also depend on the sources of data that were required and how manual or automated the conversion of data into features was. If data had to be searched out for each charity and its conversion to features was anything but extremely simple, we would assume that this would not be feasible for Charity Navigator to do.

Maintenance: Some maintenance concerns are noted in scalability above: continued involvement in the collection or provision of data for new charities could be a high maintenance cost. In addition, it is typical for ML models to be assessed over time for ‘drift’. Drift refers to changes in model accuracy over time due to changing underlying conditions. With changes in the charity ecosystem, it can be that the relationships between different features and ultimate impact change. This means there should be efforts over time to confirm that models continue to track the reality of impact. This might include efforts at certain intervals in which a large number of new charities are manually assessed with respect to their impact. These more rigorous manual ratings would then be compared with ratings given by the algorithm, to confirm that accuracy is maintained over time.

Overall feasibility: We consider this approach to be infeasible. The principal problem is that a massive number of charities would have to be manually evaluated and scored for impact in order for a machine learning algorithm to have any information to work with. We also think there is a reasonable likelihood that even a sophisticated model would perform poorly.

Automated with a large language model (LLM)

Brief description of method: This method involves developing a pipeline in which the capacities of LLMs (e.g., ChatGPT or Claude) are leveraged. The pipeline would provide the LLM with information about a charity, and the LLM would provide an assessment of the charity according to instructions as to the kinds of assessments it should be making. Hypothetically, these instructions could ask for direct assessments of impactfulness/cost effectiveness, or for assessments with respect to simpler proxies of impact. There are a host of ways that this might be approached, but in brief would look something like:

  • Providing an LLM with information about each charity
  • Describing to the LLM particular proxies to pay attention to and how it should respond depending on what it finds
  • The LLM returning an assessment of whether a particular proxy is present or not.[15]

Various different proxies might be evaluated and aggregated into a composite score that is intended to reflect the likely impactfulness of the charity.
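A skeleton of such a pipeline might look like the following sketch; it assumes the OpenAI Python client, a placeholder model name, and a single illustrative ‘evidence-based practices’ proxy:

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

PROXY_PROMPT = """You will receive text from a charity's website.
Respond with JSON: {"mentions_evidence_based_practices": true or false,
"supporting_quote": "<verbatim text from the input, or empty>"}.
Only report the proxy as present if the text explicitly describes using
evidence-based interventions or evaluating the charity's own programs."""


def assess_proxy(website_text: str) -> dict:
    """Ask the LLM whether one pre-defined proxy is present in the text."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable model could be substituted
        messages=[
            {"role": "system", "content": PROXY_PROMPT},
            {"role": "user", "content": website_text[:20000]},  # truncate long pages
        ],
    )
    # A production pipeline would need defensive parsing and retries here.
    return json.loads(response.choices[0].message.content)
```

Each proxy would need its own prompt of this kind, and the outputs would have to be validated against human-coded examples before being trusted.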

Choices to make: As with the automated machine learning approach above, this approach involves many choices, including what data sources to use, and at what level the LLM should be aiming to assess impact (e.g., asking it to simply make a judgment of how cost-effective a charity is, or instead asking about a host of simpler proxies). We expect that direct judgments of cost-effectiveness would be well beyond the capacities of current LLMs. Instead, it would probably be better to pick a range of simpler proxies for impact that would be easier for an LLM to detect and report on. Having decided upon such proxies, a key question would be where to find information about each charity (or, approaching this the other way around, given readily available information, what proxies could be picked that you can expect to find indications of in the data). The exact type of LLM to use would also be a choice to make, although it would be possible to leverage multiple LLMs in the hope of achieving greater accuracy by aggregating their judgments.

Accuracy: If accuracy is defined simply in terms of an LLM accurately detecting certain simple proxies for impact, then the accuracy could be high. The question is then the extent to which scores based upon such proxies are an accurate reflection of real impact. One might consider selecting a range of charities from different areas that are expected to vary in terms of their ‘true’ impact, and see how they score according to the LLM. This would require manually assessing this selection of charities in terms of impact in order to have a ground truth against which to assess the LLM-based scores.

Upfront investment: Before interacting with LLMs, this approach would require careful consideration of the proxies for impact that charities would be rated upon, and the sources of data to be fed into the LLM. This may be more principled/top down, starting from intended proxies and then finding a way of scraping/collating data sources that might be relevant for their assessment, or more bottom up—starting with what data is known to be readily available and assessing what proxies might be assessable in that data. Charity Navigator would then need to develop very precise prompts, including a host of examples for the LLM to take into account to understand exactly how each proxy is defined and how they should be assessed. This would likely be an iterative process to find prompts that produce the greatest accuracy. Hence, it would be necessary to test the prompting approach against a manually assessed sample—probably at least a few hundred examples for which someone has manually generated assessments of the different proxies of interest, against which the LLM assessments based upon the same data could be compared.[16]

Scalability: Once a reliable prompting approach has been developed, it could be deployed at scale to rate charities on proxies for impact. As with ML models, however, how scalable this is will depend greatly on the sources of data required. If data (e.g., annual reports, websites) had to be searched out for each charity or was generally hard to come by, this would likely require more time than Charity Navigator has available. Once the approach is deployed at scale, you would also want to randomly sample evaluated charities to further assess how well the LLM’s ratings of the chosen proxies track the ‘correct’ (i.e., human) ratings of those proxies.

Maintenance: Besides ensuring that sources of data for new charities remain available, LLMs may have other maintenance requirements. In particular, companies often change the underpinnings of their LLMs. If using off-the-shelf versions of models, a change in how the LLM works could suddenly change how accurate it is, or present an unexpected block (e.g., due to overly strict content rules the model may suddenly refuse to evaluate certain types of charities). This could possibly be prevented by pinning a specific, stable version of the LLM. Depending on what sources of information were used as inputs to the LLM, these might also change over time: for example, websites are updated and may suddenly score differently on the proxies of interest, or a new set of yearly funding documents may be released. This would require maintenance over time to keep ratings up to date.

Overall feasibility: This approach seems more feasible in principle than a machine learning approach. In practice, however, we think that developing such an approach would be a very substantial undertaking that would not be feasible within the capacity Charity Navigator has available. It would require expertise and time-consuming development at every point in the process, from defining proxies and acquiring data to developing and testing prompts, assessing accuracy, and ensuring performance is maintained over time. After going through such a process, it could turn out that the rating system is simply inaccurate, a possibility that may not become apparent until quite far into development. We also believe the kinds of judgments that could be made on the basis of LLM assessments might be quite basic and may not accurately track actual impact/cost-effectiveness.

Box 3: Case studies on using an LLM to create an impact ranking, and practicalities of using an LLM for automated assessments

With the advent of readily available and impressive LLMs such as ChatGPT and Claude, there is considerable excitement about their possible use cases. Although LLMs can be tremendously powerful, it should be stressed that they are not a panacea, and have many limitations. In particular, although LLMs respond to readily interpretable text inputs and produce human-like outputs, the user should not expect that the LLM will approach the task it is set in the same fashion as a human might.


For example, if we simply give an LLM a charity and a link to the charity’s website, and ask the LLM to assess the cost-effectiveness of that charity, or the evidence for the interventions it employs, it is unlikely that the LLM will actually ‘perform a cost-effectiveness analysis’ (e.g., carefully break down the task, exhaustively search for direct evidence or the most relevant indirect evidence, carefully interpret that scientific literature with respect to possible biases and inadequacies of the research designs, then peruse a host of other sources to estimate monetary costs, and so on). Rather, the user will typically have to specify precisely the kinds of things they would like the LLM to do, provide key sources of information and a framework for how they would like these interpreted, and may also need to develop pipelines that chain various LLM responses together, with careful assessment of performance at each step in the chain.


In our estimation, developing such a complex LLM pipeline to directly assess the cost-effectiveness or ‘impact’ of many thousands of charities, across a host of different domains, would be infeasible given the time and budgetary constraints Charity Navigator is considering, and may be infeasible in principle for current LLMs even with greater investment. Consider, for example, that in Rethink Priorities’ experience, evaluations of charities and their interventions often involve solicitation of information directly from experts and other stakeholders involved. Such information may be essential for making a judgment but would not be available to any pre-trained LLMs. The breadth and complexity of the information that might need to be fed to such an LLM would also present a challenge. For example, one might consider ‘fine tuning’[17] an LLM by providing it with additional training on all the most relevant literature for assessing evidence of various interventions. However, even narrowing down the scope of what literature ought to be included in such a set of resources would be a large-scale project in itself.


Instead of using LLMs to literally conduct assessments of cost-effectiveness or to directly try to assess the impact of various charities, it may be more realistic to have LLMs aid in tagging, categorizing, or otherwise labeling charities based upon relatively simpler or more easily identifiable proxies that have been pre-specified by Charity Navigator as potentially indicative of cost-effectiveness, impactfulness, or evidence-based practices. Exactly how this would look would depend on factors such as what definition of impact was chosen, what proxies seem relevant to that definition of impact, and at what scope of cause areas the LLM would be making judgments.

Using an LLM to rank all charities according to a universal definition of impact

Imagine that Charity Navigator chose to rank all charities in terms of how cost-effectively they achieve improvements in human wellbeing. Charity Navigator could consider what proxies for this might be. One such proxy could be that the charity uses evidence-based practices and assesses the efficacy of its own programs.[18] If a focus on evidence-based practices and evaluation were a core value of a charity, one would expect them to promote this on their website. Rather than tasking an LLM with assessing the evidence base behind a specific charity, one might instead try to develop a yet simpler proxy, such as claims or mentions of using evidence-based approaches on the charity’s website.


Accordingly, the LLM would be prompted with content from a charity’s website and instructed to pay attention to particular terms or phrases that indicate a concern with or reliance upon evidence-based practices in deploying interventions. The prompt might also include a range of examples of the kinds of things it should be looking out for, and the corresponding response it should provide. It would also be possible to request that the LLM extract the particular parts of text that it ‘thinks’ are indicative of the chosen proxy and return this information as part of its response. This could then be checked to confirm the LLM has a correct grasp of what it should be doing, and also to confirm that the extracted text is in fact present in the web content provided to the LLM, so as to detect possible hallucinations.[19]
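A sketch of how such a prompt-and-verify step might look, reusing the hypothetical `call_llm` helper from the sketch above; the prompt wording and JSON schema are assumptions for illustration:

```python
import json

def detect_evidence_claims(website_text: str) -> dict:
    """Flag claims of evidence-based practice and quote the supporting text."""
    prompt = (
        "Does the following charity website text claim the use of "
        "evidence-based practices or program evaluation? Respond in JSON as "
        '{"claims_evidence": true or false, "quotes": ["..."]}, quoting '
        "supporting passages verbatim from the text.\n\n" + website_text
    )
    response = json.loads(call_llm(prompt))  # call_llm as sketched earlier

    # Hallucination check: keep only quotes that literally appear in the
    # material the LLM was given.
    verified = [q for q in response.get("quotes", []) if q in website_text]
    response["quotes"] = verified
    if response.get("claims_evidence") and not verified:
        # A positive flag with no verifiable quote is treated as unsupported.
        response["claims_evidence"] = False
    return response
```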


This proxy could be one of several indicators on which LLMs assess charities. For example, another prompt could be developed to focus on references to ‘cost-effectiveness’, or on terms defined with reference to causes or interventions that Charity Navigator judges to be more severe or urgent. Charities prioritizing more urgent issues could receive higher scores. A range of such scores could form a composite score, with the various items intended to address the inadequacies of each individual assessment, ultimately aiming to indicate the extent to which a charity effectively promotes human wellbeing.
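A minimal sketch of how such a composite might be computed; the proxy names and weights are invented for the example and would need to be chosen and validated by Charity Navigator:

```python
# Illustrative composite scoring; proxy names and weights are invented
# placeholders, not a recommended weighting.
PROXY_WEIGHTS = {
    "mentions_evidence_based_practice": 0.4,
    "mentions_cost_effectiveness": 0.3,
    "works_on_urgent_cause": 0.3,
}

def composite_score(proxy_results: dict[str, bool]) -> float:
    """Combine binary proxy detections into a single 0-1 score."""
    return sum(
        weight for proxy, weight in PROXY_WEIGHTS.items()
        if proxy_results.get(proxy, False)
    )

# Example: a charity flagged on two of the three proxies.
print(composite_score({
    "mentions_evidence_based_practice": True,
    "mentions_cost_effectiveness": False,
    "works_on_urgent_cause": True,
}))  # 0.7
```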


Developing such a collection of indicators, honing the most effective prompts, and assessing the validity of LLM judgments would be more feasible than literally expecting an LLM to conduct cost-effectiveness assessments. Still, we expect it would prove challenging (see our summary of Overall Feasibility) and likely require significant investment.

Using an LLM to rank health charities

Focusing more specifically upon health-related charities, it might be possible to develop a finer-grained set of proxies that charities might be rated upon. For example, it might be that charities geared towards fighting diseases or illnesses in low-income countries are typically more cost-effective than those that focus on such disease prevention in the United States or other high-income countries. The area of operations for a charity could be something that an LLM is tasked with detecting, based upon information from the charity’s website or other documentation that Charity Navigator has available.


Through consultation with experts, Charity Navigator might arrive at additional proxies. Experts might suggest that certain diseases with very high global burden, or some combination of high burden and high neglect in terms of funding, are likely more impactful to focus upon than others. It may be possible to set up a pipeline in which an LLM detects which diseases a charity focuses upon, and charities that focus upon particular high burden or neglected diseases could be given a higher score. Again, this could be one proxy among multiple.
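A sketch of how these coarse health proxies might be combined for a single charity; the disease list and point values are placeholders standing in for what expert consultation would produce:

```python
# Illustrative only: the disease set and point values are placeholders for
# what expert consultation and burden-of-disease data would provide.
HIGH_BURDEN_OR_NEGLECTED = {"malaria", "tuberculosis", "schistosomiasis"}

def health_proxy_score(detected_diseases: set[str],
                       operates_mainly_in_lmics: bool) -> int:
    """Combine two coarse proxies into a 0-2 score. The inputs would come
    from LLM detection steps like those sketched above; the score is one
    indicator among several, not a measure of impact."""
    score = 0
    if operates_mainly_in_lmics:
        score += 1
    if detected_diseases & HIGH_BURDEN_OR_NEGLECTED:
        score += 1
    return score

print(health_proxy_score({"malaria", "hiv"}, operates_mainly_in_lmics=True))  # 2
```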


As with the previous LLM example of more generic indicators, such development would involve considerable setup and checking, and would also be subject to limitations. These include cases in which cost-effective charities operating in high-income countries might be overlooked, or ineffective charities in potentially impactful disease areas might be unfairly boosted.

Using an LLM to rank arts, culture, and humanities charities

Applying an LLM-based approach specifically within arts, culture, and humanities charities would involve similar processes to the application of LLMs in the other domains noted above. As in the cases above, it should be stressed that a clear specification of exactly what the LLM should be aiming to detect or determine would need to be provided, which would require a clear definition of what differential impact within this category of charities means.

Error detection and evaluation

When such an automated LLM approach is deployed over many thousands of charities, one could not simply trust the results. Beyond checks built directly into the LLM pipeline (such as confirming that a detected phrase is present in the materials the LLM was provided with), a considerable amount of manual coding and checking would be strongly advised to determine the accuracy of the LLM. This would involve humans being given the same materials (or smaller subsets of them) as the LLM, and being tasked with performing the same evaluations. For a reliable assessment of accuracy, this might require rating several hundred charities. Note, however, that since we suggest giving an LLM simple detection or categorization tasks, it might be possible to outsource manual assessments to non-experts. Such measures could help determine the accuracy of the LLM in performing the specific task, but would not confirm whether the scoring system reflects actual cost-effectiveness or impact. Knowing how scoring is associated with cost-effectiveness would require fuller, manual assessments of cost-effectiveness for a host of the charities scored, and then assessing how well the scoring predicts those outcomes; these activities would likely need to be done by experts.
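Once human coders and the LLM have labeled the same sample, standard agreement metrics could quantify accuracy. A minimal sketch using scikit-learn, with invented labels:

```python
# Sketch of validating LLM labels against human coders on the same sample.
# Requires scikit-learn; the labels below are invented for illustration.
from sklearn.metrics import accuracy_score, cohen_kappa_score

human_labels = ["evidence", "no_evidence", "evidence", "no_evidence", "evidence"]
llm_labels   = ["evidence", "no_evidence", "no_evidence", "no_evidence", "evidence"]

print("Accuracy:", accuracy_score(human_labels, llm_labels))  # 0.8
# Cohen's kappa corrects for chance agreement, which matters when one
# label is much more common than the other.
print("Kappa:", cohen_kappa_score(human_labels, llm_labels))
```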

Ranked taxonomy of interventions

Brief description of method: In this method, impact ratings are based on a taxonomy of interventions which have been ranked on impactfulness. Charities that work on impactful interventions score higher than charities that don’t.

Choices to make: This method requires an answer to the following questions:

  • Granularity: how broadly do you define an intervention? If interventions are too broad we cannot meaningfully research their impact, but if interventions are too specific the list of interventions becomes very long, making the methodology difficult to develop.
  • Taxonomy: once you decide on a granularity, how do you create a taxonomy corresponding to that granularity that contains all the interventions in a cause area?
  • How do you map charities onto the taxonomy, i.e. how do you determine which intervention a charity works on? You need to select the data sources and processing methods described above.
  • How do you rate a charity that works on multiple interventions?

Accuracy: The accuracy of this method depends on the choices above—in theory it could be very accurate if there is a fine granularity in the taxonomy of interventions, the impactfulness of each intervention is researched in detail, and the charities are mapped onto interventions accurately. However, doing so would require lots of capacity—much more capacity than Charity Navigator has budgeted for this exercise. In practice, either the granularity would have to be very coarse, or the method of assessing impact per intervention and the interventions per charity would be done automatically (likely by an LLM), both leading to a low accuracy overall.

Upfront investment: This method requires upfront investment to create a taxonomy, to rank that taxonomy, and to map charities onto it. The finer the granularity of the taxonomy, the more investment this method requires up front.

Scalability: Once the methodology is developed, this method is scalable if it is possible to automatically map charities onto interventions using an LLM. We think that this could be possible, but only with low accuracy/certainty. In addition, developing the tools to do this would itself require additional investment, and would effectively have to follow many of the same procedures as generally developing an ‘automated’ approach. Alternatively, charities could be asked to self-report the interventions they work on (from the interventions in the taxonomy), or they could be given the option to adjust the automated mapping for their own charity. However, offering these options for user input increases the chances of bias and manipulation of the impact-ranking system.

Maintenance: Any automated elements in this method would require maintenance, as described above. Also, the taxonomy and impact ranking of interventions require maintenance, as new interventions and new evidence become available. For example, it may be necessary for Charity Navigator to intermittently assess whether their originally chosen taxonomies continue to meaningfully segregate charities in terms of effectiveness in light of new research; it can often happen that with further research, particular classes or subclasses of intervention emerge as more or less impactful. In addition, charities themselves may add to or remove interventions they work with, and so charity-related data would also need to be kept up to date.

Overall feasibility: We think a simple version of this method might be feasible, but it would lead to a very low accuracy due to several challenges, which we describe more fully in our case studies below.[20] Existing taxonomies like the NPC aren’t designed to align with impact measures, and often include a wide spectrum of charities with varying levels of impactfulness. This makes it impractical to use them for impact ranking. It may be possible to create a new taxonomy that is more suited to impact ranking, but we think this often still won’t lead to a precise indication of which charities have high impact, as illustrated in the case study below. Developing a new taxonomy also requires investing capacity and setting a clear definition of impact for each cause. Finally, you would need to develop a method to map charities onto such a taxonomy, which can be done either by self-reporting from charities or by using LLMs. Both methods introduce the risk of inaccuracies.

Box 4: Case studies on using a ranked taxonomy of interventions to create an impact ranking

Using a ranked taxonomy of interventions to rank health charities

For this method, an important choice is which taxonomy of interventions to use. Ideally, we would use an existing taxonomy, in particular one that Charity Navigator already works with. Creating a logical and universally applicable taxonomy requires a large time-investment, and for some taxonomies there is an existing mapping from charities onto these taxonomies. We will investigate this option first below, and then we will investigate the possibility of creating a new taxonomy for the purpose of ranking charities.

Using an existing taxonomy: In practice we do not think it is feasible to create a ranking within any of the taxonomies that Charity Navigator has provided us with, nor within other existing taxonomies. We will make this concrete with an example, using NPC categories and subcategories. In order to use the NPC taxonomy to rank charities, we would need to give each subcategory in the taxonomy a score that indicates how likely it is that charities in that subcategory are cost-effective. As we are considering health charities, we assume that to be cost-effective means to avert DALYs at a low cost, as described in the definition of impact for this cause above. In practice, the variety within many NPC subcategories is too large to do this meaningfully, meaning that subcategories will contain a mix of charities with high and low cost-effectiveness. This is an issue even when only considering a very coarse ranking with two possible scores (i.e., above or below a chosen threshold of cost-effectiveness).


Some examples of NPC subcategories which we do not think could meaningfully be rated as cost-effective or not, versus some threshold, are:[21]

  • Health Care Issues
  • Health Care Reform
  • Quality of Health Care
  • Health Diagnostic, Intervention & Treatment Services
  • Mental Health Treatment
  • Fund Raising and/or Fund Distribution
  • Birth Defects, Genetic Disorders & Developmental Disorders Research
  • International Public Health/International Health
  • Preventive Health


For most interventions in these categories the cost-effectiveness depends on specific information about the intervention which is not included in the description of the subcategory, such as the disease, the treatment and the population of the treatment.


Theoretically you could score a subcategory by estimating whether more than half the charities in that subcategory are cost-effective (giving the subcategory a ‘cost-effective’ ranking in that case), but that would implicitly mean that up to half the charities get a score that does not reflect their cost-effectiveness.[22]


Even if it were possible to give all subcategories a reasonable score,[23] there would still be examples of very effective charities in categories that are ranked as non-effective, or vice versa. This could lead to charities for which there is compelling evidence of cost-effectiveness (such as the Against Malaria Foundation) receiving a negative score.[24] See this earlier section for the implications of a low accuracy and charities getting an incorrect score.


The NTEE taxonomy has similar issues, and we think this is the case for other existing taxonomies as well unless there is a taxonomy that was developed explicitly to distinguish between charities with high- and low cost-effectiveness. This seems unlikely to us, but we did not explicitly search for one.


Creating a new taxonomy for the purpose of impact-ranking: If we cannot use existing taxonomies, Charity Navigator would have to develop a new taxonomy to apply the ranked-taxonomy method to. Mapping charities onto a new taxonomy may present challenges, as we will explain below. For now, we will focus only on creating a taxonomy that represents differences in cost-effectiveness.


One way to do this would be to categorize the charities using simple questions that may relate to cost-effectiveness. For example, a very coarse taxonomy would be to put all charities that work mainly in low- and middle-income countries (LMICs) in one group, and charities that work in high-income countries (HICs) in another group. The LMICs subcategory would then get a better cost-effectiveness score than the HICs one. This is a very imprecise way to measure cost-effectiveness, and it will be directionally incorrect in many cases: there are undoubtedly examples of charities that work in HICs that are more cost-effective than some specific charities that work in LMICs.[25]


Some alternative questions that may correlate somewhat with high cost-effectiveness are:

  • Whether the charity works mainly on diseases with a high burden, i.e., diseases causing a lot of suffering for a large number of people globally,
  • Whether the charity focuses on diseases that are prominent and have a high burden of disease in LMICs (such as malaria),
  • Whether the charity works on diseases which are neglected, meaning the global spending on reducing its burden is low relative to the global health burden it causes,
  • Whether the charity works on diseases for which there are cost-effective cures, preventative measures, or other cost-effective interventions,
  • Whether the charity works on a particular cost-effective intervention.


For the first four, you could theoretically create a list of all diseases and score them on the questions. For the first two, the Global Burden of Disease (GBD) study would be a promising starting point (see for example the GBD data tool). We are unsure to what extent it lists all possible diseases, but that does not need to be an issue: you could, for example, make one subcategory with ‘the 100 diseases mentioned by GBD as having the largest health burden’, and a second subcategory with all diseases that are not on this list. This would require some research into what a reasonable cut-off point is, and what to count as separate diseases (for example, whether you count cancer as one disease or each type of cancer as a separate disease would affect the ranking), but seems doable within a reasonable time frame.
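To illustrate how mechanical this step could be once a burden table is in hand, here is a sketch assuming a hypothetical CSV export from the GBD data tool; the file and column names are invented:

```python
import pandas as pd

# Hypothetical input: a table of diseases with total DALYs, e.g. an export
# from the GBD data tool. File and column names are invented for the sketch.
gbd = pd.read_csv("gbd_dalys.csv")  # columns assumed: disease, dalys

top_100 = set(gbd.nlargest(100, "dalys")["disease"])

def disease_bucket(disease: str) -> str:
    """Two-bucket taxonomy: top-100 burden diseases vs. everything else."""
    return "top_100_burden" if disease in top_100 else "other"
```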


The last three questions likely require a larger time-investment, and we do not know of any one source that you could use to answer them. It would likely require analyzing several sources that investigate neglectedness and cost-effectiveness of interventions. From these you could create a list with a relatively small number of diseases or interventions that you would score as cost-effective. The more accurate you want this ranked taxonomy to be, the more research time you would need to invest.


You could use a combination of several of the questions listed above to create a taxonomy; for example, you could make ‘buckets’ of diseases for each combination of burden in LMICs, neglectedness, and availability of cost-effective cures.
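A minimal sketch of such buckets, where the three yes/no flags per disease would be the output of the research described above:

```python
# Sketch of 'buckets' from three yes/no questions per disease; the flags
# and labels are invented for illustration.
def bucket_label(high_burden_in_lmics: bool, neglected: bool,
                 has_cost_effective_intervention: bool) -> str:
    parts = [
        "high-burden-LMIC" if high_burden_in_lmics else "lower-burden",
        "neglected" if neglected else "well-funded",
        "tractable" if has_cost_effective_intervention else "less-tractable",
    ]
    return " / ".join(parts)  # 2 x 2 x 2 = 8 possible buckets

print(bucket_label(True, True, True))  # high-burden-LMIC / neglected / tractable
```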


The main issue with these taxonomies is that they will not give a precise indication of whether a charity has a positive impact. They do not measure whether the charity actually functions well, and the more feasible of the options (focusing on the diseases rather than the interventions) don’t even indicate whether the charity works on a cost-effective intervention. For example, a hypothetical charity that aims to implement a low-cost intervention that has benefits but also causes serious harm could score positively on each of these aspects.[26]


Mapping charities on to the taxonomy: Once you have developed a ranked taxonomy, charities need to be mapped onto the chosen taxonomy. There are some fundamental difficulties with this mapping:

  • Perhaps not all charities can be categorized into this taxonomy, because they do not work on a specific disease or intervention listed in the taxonomy. For example, a charity that works on health systems strengthening may be difficult to categorize.
  • Charities may work on several things. You would need to define some cut-off for when to say that a charity does enough work related to a particular disease or intervention to give it the score of that subcategory (e.g., they need to spend at least 20% of their budget on it). However, this makes it more difficult to measure, and there will always be border-cases.


Since the mapping needs to be done at scale, we think the only options are to use a survey in which charities choose their own category (or possibly multiple), or to use an LLM that analyzes the charity’s websites, their annual reports, or another form of self-reported data. The pros and cons of these methods are described above.


If using an LLM, there may be errors in the mapping of charities, and accuracy will likely be an issue. The clearer and simpler the distinction between categories, the more likely it is that categorizing the charities with an LLM is feasible. However, as described above, this requires a large investment, and will always be prone to some error. Our tentative conclusion is that taxonomies based on diseases may offer clearer distinctions between categories than those based on interventions, as interventions often encompass a broad spectrum of activities that can be subject to varying interpretations.


All in all, we think that asking charities to self-report one or multiple categories seems like the most viable option, though we recognize that this reduces the objectivity of the process.


Rating a charity that works on multiple interventions: We did not spend much time developing methods for rating charities that work on multiple interventions, because we think there are workable solutions once the issues mentioned above have been solved. Four ways to deal with this question are listed below (a sketch of two of them follows the list):

  • You could present the scores for different interventions of each charity separately (ranking at the ‘program’ level, rather than at the ‘charity’ level).
  • A second approach could involve calculating a weighted average of the effectiveness scores for each intervention, where the weights reflect the proportion of resources allocated to each intervention. This method does require self-reported estimates of relative spending on different interventions.
  • More simply, you could use the ‘maximum of all scores’ for each charity. This approach emphasizes the charity’s potential for high impact, even if not all activities meet the same standard.
  • Finally, you could give the charity the score of whichever intervention or disease it spends most of its resources on. This would likely require self-reporting (because it will not always be clear from the charity’s website or annual report).
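A sketch of the second and third options; the scores, spending shares, and intervention names are invented for the example:

```python
# Sketch of the second and third aggregation options; scores, spending
# shares, and intervention names are invented for the example.
def weighted_average(scores: dict[str, float],
                     spend_share: dict[str, float]) -> float:
    """Option 2: weight each intervention's score by its (self-reported)
    share of spending; shares are assumed to sum to 1."""
    return sum(scores[i] * spend_share[i] for i in scores)

def max_score(scores: dict[str, float]) -> float:
    """Option 3: take the charity's best-scoring intervention."""
    return max(scores.values())

scores = {"bednets": 0.9, "awareness_campaign": 0.3}
shares = {"bednets": 0.4, "awareness_campaign": 0.6}
print(round(weighted_average(scores, shares), 2))  # 0.54
print(max_score(scores))                           # 0.9
```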

Using a ranked taxonomy of interventions to rank arts, culture, and humanities charities

The approach for applying this method to the cause area arts, culture, and humanities is similar to the approach described above for health. As for health, we do not think Charity Navigator could rank the subcategories in the existing NPC taxonomy on impactfulness in a meaningful way.


However, different from health, we are uncertain whether you could make any taxonomy that you could rank on impactfulness. This is related to the fact that we could not come up with a clear definition of impact for this cause, as described here. Perhaps you could work with experts to develop such a definition and a taxonomy that enables you to distinguish between charities with high and low impactfulness according to that definition.

Alternative approaches to achieving Charity Navigator’s impact goals

Besides trying to assess or rank all the charities in terms of their impact, we believe there may be alternative ways for Charity Navigator to achieve the goals of increasing the impact of the donations that it facilitates, and of helping donors to find more impactful charities. We present these ideas as largely independent options, but some combination of them is also possible and may have synergistic benefits. For example, combining ‘impact highlights’ for specific charities with informative material about how to give effectively could be particularly beneficial. A highlighted charity may not be appealing to a user because it is in a cause area that they aren’t excited by, but it may inspire them to read more about effective giving. They could then use that information to apply some effective giving principles to pick a more effective charity in a domain they are more excited about.

Expert analysis for a subset of charities

This method was somewhat out of scope, since Charity Navigator asked us to research methods for rating charities at scale. However, we have included it because we think it is a feasible alternative which could help Charity Navigator achieve the goals of supporting users to give to more impactful charities.

Brief description of method: Ranking all charities on the Charity Navigator website for impactfulness will require a high degree of automation, because of the large numbers involved. As an alternative, Charity Navigator could use expert capacity to rank only a small selection of charities. The goal would be to present donors with a small, curated list of highly impactful charities. A practical approach would be a funnel strategy that uses multiple time-boxed stages of research to first develop a longlist of potentially impactful charities or interventions. This list would be progressively refined through further research to produce a final list of top charities.

Choices to make: The first choice to make is whether you want to offer a selection of impactful charities for each cause separately, in which case you would apply the method above several times, or to apply the method once for all causes combined.[27] Other choices relate to making the selection method outlined above more concrete, in particular how much time to spend on each iteration of refining the list. More fundamentally, this comes down to finding a balance between doing a broad search to include all potentially impactful charities, and doing in-depth research on the final candidates to be more certain that they are truly impactful.

Accuracy: This method has the potential to be quite accurate, in the sense that Charity Navigator could potentially score a small number of charities with somewhat high confidence. However, there are two limitations. The first is that you are unlikely to identify all highly impactful charities. Applying this method in a limited timeframe requires quite ruthless prioritization, so that there will be charities that do not make it to the final list despite being highly impactful. The second is that evaluating just one charity in detail to get to a high confidence judgment of its impactfulness requires a lot of capacity and often access to data that is not public. For some charities and some cause areas, there is existing research of this type that Charity Navigator could (and in our opinion should) build upon. However, relying on other sources does limit the number and scope of charities that Charity Navigator could consider. In particular, there may not be existing research for all cause areas.

Implementation: This method could be implemented by experts with experience in impact assessments, and would likely also require more specific expertise for specific causes.

Scalability: This method cannot be applied to all charities on the Charity Navigator website within the dedicated capacity. It may be plausible to build up sets of willing volunteer experts for cause area panels—motivated by the impact or by career capital. This type of system would require preparing extensive guidance materials for the volunteers, and creating some form of quality assurance.

Maintenance: The maintenance for this method consists of periodically assessing whether it seems reasonable to assume that the highly rated charities are still highly impactful, which can likely be done with a limited time investment. Additionally, it is likely that new, highly impactful opportunities for charities to work on will arise in the future, and it would be advisable to consider researching these—although this is not essential for the initial implementation of the method.

Overall feasibility: Evaluating a small group of charities within the available time frame is feasible. However, extending this method across all areas would require focused effort, strict prioritization, and, where possible, integration of findings from other evaluations.

Impact highlights/individual promotion of impactful charities

There are also less time-intensive versions of the ‘Expert analysis for a subset of charities’ method, which rely more heavily on readily available resources. Charity Navigator could highlight a small collection of charities that are already considered to be highly effective—as it has done in collaboration with Giving Green in the past. For example, it could aggregate across cost-effectiveness assessments and charity selections from other evaluators that have focused on a more specific subset of charities with a plausible chance of being highly effective. Alternatively or in addition, Charity Navigator could convene a roundtable or conduct a Delphi study (see Iqbal & Pipon-Young, 2009) with a selection of experts in cost-effectiveness and charity evaluation to highlight a small selection of charities that are highly cost-effective, perhaps only in a selection of core areas (e.g., animal welfare, global poverty).

Charity Navigator could then have a specific section on the website for charities ‘vetted’ or highlighted for their potential for impact, or could intermittently promote these charities in specific sections of the website or as part of ‘impact awareness’ campaigns. This could include some informational materials as described above, as well as explanations of the ways in which these particular charities were seen to meet a particular bar for impact.

Some positives of this approach would be:

  • It would likely align donations with legitimately impactful charities
  • It would not leave impact-focused donors with a ‘paradox of choice’ among the multitude of rated charities
  • It could be explained in such a way that it does not claim that charities not included are definitely not impactful, and therefore avoids concerns over unfairly misrating charities
  • The reasoning and overall process could be made highly transparent
  • Its implementation could be spread out over any desired period of time by dividing the work up by cause area.

Providing information and educational material on impactful giving to guide donor choices

Rather than scoring every charity on impactfulness, Charity Navigator could support donors in making more impactful choices by informing them about how they might approach making the most impactful donations. This would involve explaining to donors, in engaging and clear terms, how experts in charity impact evaluation have conceived of impact (e.g., helping the most people/animals in the most meaningful ways, as cost-effectively as possible), and conveying the possible magnitude of difference in effects of donating to the topmost effective charities as opposed to ‘average’ or ‘typical’ charities.

Difficulties in evaluating the exact impact of different causes could be conveyed, such that users are not misled into believing this is an exact science, but they could be informed about several key insights that have been gained. For example: health improvement or poverty reduction interventions in low-income countries tend to be much more cost-effective than in high-income countries/the USA. Animal-related interventions that focus on helping very large numbers of animals (e.g., policy improvements for farmed animals) tend to be much more cost effective in alleviating animal suffering than more niche or specific interventions. The materials may also highlight particular cause areas that are often considered especially impactful in terms of improving wellbeing or alleviating suffering.

Numerous such insights could be used to help donors make more informed judgments, while still giving them agency to incorporate their own preferences. Nudging people to consider what they want to achieve with their donations, and to reflect on the scope of what they might achieve, could itself be highly impactful.

The materials used to communicate these insights should be tailored to align with the preferences and characteristics of the website’s users, ensuring that the content is not only relevant and engaging but also presented at appropriate times and places on the website. This approach should aim to avoid alienation that might occur from a mismatch between the content and the personal interests or values of the users.

Some positives in favor of this approach would be:

  • The impression it gives of the relative impact of different charities/charitable causes may be just as accurate as an imperfect algorithm, while Charity Navigator would be insulated against key failure modes such as providing specific rankings or ratings that are incorrect.
  • Donors retain a high degree of agency.
  • The reasoning and process is highly transparent.
  • Even if donors/other charities disagree with the definition of impact and its implications, it cannot be claimed that Charity Navigator has made an objective mistake: you have merely explicitly conveyed a defensible approach that donors can choose to accept or not.

Note that some automated elements might enhance donor agency/impact. For example, if certain key dimensions are highlighted as being of importance, then it may be useful to try to categorize charities along such dimensions (similar to the method described above). This would enable donors to apply filters for them if they find the reasoning behind this dimension of impact to be compelling. However, this would not strictly be necessary for such an approach.

Interactive donor preferences

As opposed to either directly rating charities in terms of their impact or informing donors as to a possible conception of impact, it may be possible to match donors and charities on the basis of alignment between how they conceive of impact. For example, donors might be given a brief questionnaire that solicits their preferences and attitudes in a range of impact-relevant areas.

Charity Navigator could ask charities to answer the same questions as the donors. This data could then be used by a matching algorithm to connect donors with charities that share similar views on impact, enhancing the potential for meaningful contributions. To ensure that charities provide realistic and balanced responses, rather than only portraying their activities in an overly positive light, you could use a points-distribution method in the questionnaire. This approach requires charities to make thoughtful decisions about their priorities by allocating a finite number of points, or in some cases specifying how they would distribute a hypothetical $100, between different strategic choices. Below is a table illustrating some examples of the choices you could ask charities and donors to prioritize between. These examples are merely illustrative, and could be improved with more capacity.

Table 3: Example choices you could ask donors and charities to prioritize between

Choice 1 | Choice 2 | Explanation
Direct intervention spending | Monitoring and Evaluation | Allocates resources between directly implementing programs and assessing the effectiveness of those programs.
Using evidence-based interventions | Innovating new approaches | Balances adherence to interventions with proven success against trying innovative, potentially high-impact solutions.
Animal welfare | Human welfare | Balances efforts between improving conditions for animals versus addressing human needs and welfare issues.
Helping people in the US | Helping people in other countries | Chooses between focusing resources on domestic issues versus international challenges, reflecting geographic priorities.
Helping many people | Helping fewer people with a specific condition | Decides whether to address the needs of a larger general population or to focus on a smaller group with specific, potentially severe challenges.
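To illustrate how simple the matching step itself could be once questionnaire data exist, here is a sketch that scores donor-charity similarity from point allocations like those in Table 3, assuming each row allocates 100 points between the two choices; all numbers and the scoring rule are invented:

```python
# Sketch of matching on the point-allocation questionnaire: each list entry
# is the number of points (out of 100) given to 'Choice 1' in one row of
# Table 3. All numbers and the scoring rule are invented for illustration.
def match_score(donor: list[float], charity: list[float]) -> float:
    """Similarity in [0, 1]; 1 means identical allocations."""
    max_distance = 100 * len(donor)  # worst case: opposite answers everywhere
    distance = sum(abs(d - c) for d, c in zip(donor, charity))
    return 1 - distance / max_distance

donor   = [70, 80, 20, 40, 90]
charity = [60, 90, 10, 50, 85]
print(round(match_score(donor, charity), 2))  # 0.91
```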

Alternatively, the matching algorithm could be based on an automated system that categorizes charities according to the different dimensions of impact that the donor questions sought to differentiate between. If automated categorization were used, there would be similar difficulties regarding the accuracy of the categorizations as with the other automated approaches discussed above. Having charities provide their own values and perspectives would reduce this risk, although it could involve considerable overhead to organize, and it is possible that charities would not be forthcoming with such information.

This approach might also fail to align donations with impact in the sense of most effectively helping the most people to the greatest extent possible, if this is not the idea of impact that donors value. Still, it could be a valid way of satisfying donor preferences for having impact according to their own definition of the term. Hence, the value of this approach depends on whether Charity Navigator principally aims to increase its impact according to a particular definition of impact that it endorses, or rather to satisfy a donor preference for a (potentially very different) idea of impact. Our understanding is that while Charity Navigator is concerned with satisfying such donor preferences generally, it is especially interested in aligning donations with a well-reasoned definition of impact. We believe our other suggested alternatives achieve this better, while also respecting donor autonomy.

Summary of key characteristics of approaches

The table below presents a concise summary of the key characteristics for each of the methods we considered. Some nuances are inevitably lost in such a summary. The table should not be read in isolation from the rest of the document nor should it be used to make definitive decisions based solely on the color coding. For more detailed information, please refer to the sections above.

Table 4: Summary of the key characteristics for each charity ranking method discussed

Note. Accuracy refers to whether the ranking method will guide users to impactful charities. Scalability refers to the number of charities that can be fully covered. Transparency also refers to interpretability.

Considerations when scoring charities on impact

Consequences of errors in the impact ranking

Any system that involves rating or categorizing charities according to some standard involves the risk of incorrect assessments. Whether it is wise to pursue a rating approach may hinge on a) the likelihood of such mistakes, b) the consequences of such mistakes, and c) Charity Navigator’s tolerance of these risks and possible costs. While we cannot speak to Charity Navigator’s tolerance, in the following sections we aim to give some perspective on the likelihood of errors, and consider what some consequences might be.

Likelihood of error

If using an automated system, what kind of accuracy can Charity Navigator expect? There is no single answer to this question, as accuracy would depend on multiple factors, including the quality and informativeness of the data that the rating is based on, the complexity of the categorization/rating task, and how fine-grained the rating system is. However, it is certain that accuracy on such a task will not come close to the very high rates achievable when simply extracting objective pieces of information from data sources.

It is also worth highlighting two key types of errors that might arise. Firstly, if detection of some proxy for impact or some categorization is performed automatically, the automatic procedure may misdetect evidence for that proxy or miscategorize the target charity (e.g., an LLM miscategorizes an animal shelter as being involved with farm-animal welfare). Secondly, the chosen proxies may fail to be indicative of true impact (e.g., an LLM may perfectly categorize all the charities, but these categorizations may not correlate strongly with cost-effectiveness upon deeper analysis).

If we aim to assess the actual cost-effectiveness of a charity on a metric such as quality-adjusted life years (QALYs), or even produce a fairly coarse ranking on this dimension, using automated methods, then this amounts to an exceedingly complex task. There is limited time available to gather and optimally format data that may pertain to the question for each charity to be rated. In addition, it is questionable whether the data that would be accessible could meaningfully inform an estimate of cost-effectiveness for the vast number of charities under consideration. As a result, we would expect an automated method to produce a large number of mistakes on such a task.[28]

The complexity of the task could be greatly reduced by focusing not on some rating of impact or determination of cost-effectiveness per se, but rather upon detection of things that Charity Navigator deems to be indicative of or aligned with how it conceives of impact. For example, rather than asking whether there is compelling evidence for an intervention, the task would be flipped to simply detecting statements (e.g., on the charity’s website) that indicate use or consideration of particular evidence-based practices. This simplification of the task that an automated method is expected to perform would be expected to reduce (though not eliminate[29]) classification errors. However, there remains the question of whether these secondary indicators of impact themselves accurately reflect the impact of charities. Hence, it may be considered misleading if the terminology used for such a score implied it were an actual assessment of impact, as opposed to indicators of impact.

Charity Navigator may want to be transparent about what kinds of things are included in the scoring system, so that both donors and charities can understand how they are being rated. For many reasons, this transparency would be desirable, but it may also introduce another source of errors: manipulation or score hacking. It would be easy for charities to recognize terms or ideas that are considered indicators of impact and simply incorporate them into text on their websites or other materials. It would even be possible to insert reams of such text invisibly on a site, which would be picked up by a web scraper but not seen by a regular visitor.
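One partial mitigation would be to strip invisible elements from scraped pages before any text reaches the scoring system. A sketch using BeautifulSoup; this is a simple heuristic that catches only crude `display:none` stuffing, not every form of manipulation:

```python
from bs4 import BeautifulSoup  # requires the beautifulsoup4 package

def visible_text(html: str) -> str:
    """Return only the text a regular visitor would plausibly see."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove script/style blocks outright.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # Remove elements hidden via inline CSS (a crude check, illustration only).
    for tag in soup.find_all(style=lambda s: s and "display:none" in s.replace(" ", "")):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)
```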

Consequences of errors

The consequences of any errors will depend on the number of errors, whether they are caught and by whom, the type and magnitude of the errors, and Charity Navigator’s capacity to deal with them once revealed. In a worst-case scenario, a rating approach may have many errors, and this is highlighted publicly by an influential actor in the charity space. The mistakes may be large in magnitude or result from something particularly embarrassing, such as a ‘hallucinating’ LLM inventing that it has detected certain terms in a range of websites and categorizing the respective charities accordingly. Alternatively, an automated scoring system might be implemented correctly without technical mistakes, but be clearly mistaken at the object level—for example, rating a charity that is by all accounts deemed to be highly cost-effective as low impact. Such cases could result in reputational damage to Charity Navigator.

Less consequential errors, in terms of public reputation, could also have high costs if they necessitated considerable time investment for Charity Navigator to check and update ratings on the basis of claims from charities that they have been misrated. Given the number of charities to be rated, even a relatively small percentage error rate could generate more such claims than Charity Navigator has the capacity to handle.

Finally, even if errors are not caught, Charity Navigator should consider whether deploying a potentially error-prone system at scale—which may affect people’s donation decisions—is acceptable in terms of potential unfair costs to the charities that are ranked. Even if errors are never revealed, misrated charities may receive more or fewer donations than they deserve.

Transparency about the meaning of an impact score

When discussing the cost-effectiveness of charities, it is important to be transparent about the methodology used to derive such assessments, what these indicators measure, and the limitations of these methods. Using precise terminology helps prevent misinterpretation of the data. For instance, phrases like “impact ranking” might imply a more exact measure of a charity’s effectiveness than is accurate. A fitting term depends on the method chosen, but could be “likelihood of alignment with cost-effective methods,” or an acronym related to a similar description.

Ranking charities that work on controversial topics

Ranking charities that deal with sensitive and controversial topics such as LGBT rights, abortion, and end-of-life care presents distinct challenges, which we cannot expect an automated model to solve. These issues are deeply influenced by personal values and cultural norms, which can significantly affect how different stakeholders perceive the impact of related charities. In order to create an impact ranking that is applicable to all charities, Charity Navigator will need to take a stance on the value of work in these areas.[30] More generally, any system that ranks charities runs the risk of creating discrepancies with donors’ perceptions and preferences.

The choice to rank separately within causes has disadvantages

Choosing to rank charities only within specific causes to align with Charity Navigator donor preferences presents certain limitations. This approach narrows the scope compared to a cause-neutral strategy, which allows for greater diversity in charity selection and could lead donors to more cost-effective charities overall. Additionally, scoring charities by cause might result in misleading comparisons; for instance, donors could misinterpret a moderately effective intervention in one cause area as less cost-effective than a highly effective program in another, despite the impact-ranking values being incomparable across cause areas.

Risk aversion

Another aspect to consider when giving advice on impactful donations is risk aversion. How certain do donors want to be that their donation has an impact, and would they trade off that certainty for potentially improving their impact? Some charities may have a high impact in expectation, but a low chance of success.[31] In theory, we could introduce a threshold that aligns with the preferences of donors, for example that charities should have at least a 10% chance of success to be able to get a high score for impactfulness. In practice, we do not think we would be able to assess the chance of success at the charity level, so for the rest of this report we will not take risk aversion into account.

Our advice: should Charity Navigator move forward with the development of an impact ranking database?

We think it could be very impactful for Charity Navigator to guide donations towards more effective charities. Since Charity Navigator influences many philanthropic donors, influencing even a small fraction of those could already have a large positive impact. For that reason, we are very supportive of the idea of Charity Navigator investing capacity in sharing some information on impactfulness of charities with the users of the website.

Together with Charity Navigator, we investigated the possibility of giving all charities on its website a ranking related to impact. Ultimately, we advise against creating such an exhaustive system, for several reasons.

One fundamental barrier is that we are uncertain whether Charity Navigator could develop a clear and measurable definition of impactfulness or cost-effectiveness for all cause areas. Perhaps it could develop one by involving cause-specific experts, but we are uncertain whether this would succeed, and we consider it an essential step in creating an impact-ranking database.

Secondly, we think that all the exhaustive methods we researched would have a low accuracy, in the sense that for many charities the impact score would not give a good indication of their true cost-effectiveness. Impactfulness or cost-effectiveness is a difficult concept to measure. We think that every exhaustive method would either measure an oversimplified proxy for impact, or would misclassify many charities when trying to evaluate a more realistic proxy for impact. Relatedly, it is very time-consuming to check how accurate the ratings of any automated method are. This means that Charity Navigator and its users would not be well informed about how much value to place on the scores given by the system, and could not judge whether the system is leading to more effective donations.

We also think each of the exhaustive methods would cost a lot of capacity to implement and maintain. Of the methods we researched, we think (with some uncertainty) that only a simplified version of the ‘ranked interventions’ method could be implemented within the capacity Charity Navigator is able to invest (about one FTE).[32]

Finally, we think there are good alternatives for Charity Navigator to guide donations towards impactful giving and to offer a service to impact-oriented donors. Charity Navigator could present donors with a small curated list of highly impactful charities, based on a combination of existing and new prioritization research. This list could have separate sections for specific cause areas to guide donors with specific preferences. Charity Navigator has already done something similar with recommendations from Giving Green, and could expand that work into other areas. Transparency is important for this approach, so that users understand that you did not do exhaustive research and that charities that are not on this list can also be impactful. We also think an educational module to guide users about how they could approach making the most impactful donations could be valuable.

In summary, rather than attempting to create a comprehensive impact ranking system—which presents considerable challenges and may not accurately guide donors—we believe it would be more effective for Charity Navigator to focus on recommending a select group of high-impact charities and supporting donors with educational resources related to impact.

Contributions and acknowledgments

Carmen van Schoubroeck and Jamie Elsey jointly researched and wrote this report. Carmen van Schoubroeck also served as the project lead. Tom Hird supervised and reviewed the report. Special thanks to Lea Prince, Natalie Volin and Marcus A. Davis for helpful comments on drafts. Thanks also to Shaan Shaikh, Thais Jacomassi, James Hu, and Rachel Norman for assistance with editing and publishing the report online.

Appendix: Taxonomy

Charity Navigator has asked us to think of ways to compare impactfulness of charities within causes, but the term ‘cause’ is not yet well-defined within Charity Navigator. As described earlier, we think existing taxonomies of charities could be a good starting point to define the scope of cause areas. We focus on two taxonomies for which Charity Navigator currently has data:

The National Taxonomy of Exempt Entities (NTEE) is a classification system used by the Internal Revenue Service (IRS) and the National Center for Charitable Statistics (NCCS), among other entities, to categorize nonprofit organizations based on their purposes and activities (Jones, 2019). Charities select their NTEE code when filling out their tax forms, and Charity Navigator has access to these codes for all charities in their database.

The Nonprofit Program Classification (NPC) System, developed by the National Center for Charitable Statistics at the Urban Institute, categorizes the programs, services, and activities of public charities. This system complements the National Taxonomy of Exempt Entities (NTEE) by focusing on what organizations do, rather than just their organizational type (Lampkin et al., 2001).

From our analysis, we understand that each of the main categories of the NPC system (“hierarchies”) corresponds to exactly one of the main categories in the NTEE system (“Types”), though their names are slightly different. The subcategories are different.

In the report we use and refer to the NPC classification system in several sections, because Charity Navigator suggested this was the most logical option. We do not think our findings would change significantly if we used a different existing taxonomy. In some cases, we use data on main categories labeled in the NTEE classification, since this was the data we had available—but as explained above, these map uniquely onto the NPC categories.


  1. ^

     As indicated by Charity Navigator in conversations about this project.

  2. ^

     Charity Navigator currently does not have a single taxonomy it uses, but likely candidates are the National Taxonomy of Exempt Entities (NTEE) or the Nonprofit Program Classification (NPC) System, which are quite similar; see the appendix for more details.

  3. ^

     See the appendix for more details on the NPC and other taxonomies.

  4. ^

     These were selected by scrolling through the list, as well as searching for ‘Africa’, ‘heart surgery’, and ‘malaria’ to represent some of the diversity.

  5. ^

     These were selected by scrolling through the list and pulling out some examples from which the name of the charity gives a sense of its purpose.

  6. ^

     We reiterate that this treatment is far from exhaustive, and a complete definition of impact could be a project in itself.

  7. ^

     Charity Navigator (n.d.-b) defines impact as “the change in mission-driven outcomes net of what would have happened in the absence of the program (the ‘counterfactual’), relative to the cost to achieve that change”, which we think aligns with the cause-specific or charity-specific scales of impact.

  8. ^

     If one cause has multiple goals or multiple ways to measure a goal, we also need ways to translate those goals into a uniform scale in order to compare charities within that cause.

  9. ^

     Cost-effectiveness is not trivial to define, because it requires a clear definition of impact that is measurable and quantifiable, in order to express it in impact per dollar and compare it to the impact per dollar of other interventions.

  10. ^

     GiveWell (n.d.): “The Disease Control Priorities report frequently estimates the cost-effectiveness of different programs in terms of cost per DALY averted”. GiveWell themselves try to translate DALYs into “numbers that have clearer and more intuitive meanings”.

  11. ^

     Note that this categorization is imperfect and likely incomplete, and not all combinations of data and processing techniques will result in a sensible approach. For example, reliance upon expert judgment based upon their existing knowledge being applied to various specific interventions could be considered both a source of data and a means of processing data.

  12. ^

     It is also possible to develop more complex or informative features (a process known as ‘feature engineering’) by combining, summarizing, or transforming other pieces of information that can be linked to the charity.

  13. ^

     In practice though, this is often determined just by assessing the performance of multiple different models, rather than an in principle preference for a particular type of model in general.

  14. ^

     As an example, consider machine learning projects that are crowdsourced on the machine learning competition site Kaggle. To get a sense of data scale, an effort to develop an algorithm to score student essays on a 1-6 scale (which seems a much simpler task than assessing the impactfulness of multiple very different charities) involved a training set of 26,000 examples and a test set of 8,000 (see here).


  15. ^

     For example, one might ask the model to assess whether or not text from the charity’s website mentions working in a cause area deemed to be associated with high impact by Charity Navigator, or the extent to which it references using evidence or conducting evaluations of its activities.

  16. ^

     For accuracy/confidence, the more examples the better, but we would expect far fewer manually assessed examples to be necessary than would be the case for training an ML algorithm, and the assessments to test accuracy at this level would be simpler assessments of proxies, as opposed to actual assessments of impact.

  17. ^

     ‘Fine tuning’ involves taking a pre-trained LLM and training it with further information or instructions, usually with the intent of making it perform better in a specific domain of interest.

  18. ^

     This is a simple example and subject to several limitations—including that not all charities using evidence-based practices will be doing so cost-effectively, or for purposes that improve human wellbeing, but the point is to work through an example.

  19. ^

     ‘Hallucination’ is a term used in reference to LLMs when they essentially invent or make up things that are not true. In this case, it might be the claim that a particular phrase was present in the data that the LLM was fed, when in fact no such phrase was present.

  20. ^

     We did not spend time on mapping out the capacity necessary for each method, but our intuition is that a) defining the cause areas, b) choosing a definition of impact per cause, c) creating a simple ranked taxonomy for each cause, and d) creating a simple questionnaire for charities to self-identify which categories they fall into, would require about a year (1 FTE). The capacity estimate could be informed by a trial cause area.

  21. ^

     See this spreadsheet for all the NPC subcategories in the four categories related to health.

  22. ^

     Imagine a subcategory in which 51% of charities are cost-effective; the subcategory would then be labeled as ‘cost-effective’. Based on that taxonomy, the 49% of charities that are not cost-effective would also be labeled as cost-effective.

  23. ^

     One way to achieve this might be to rely on input from several experts. There are in total 681 subcategories (across all NPC categories), so if an expert spent eight hours a day, five days a week, for 52 weeks rating these, they could spend on average 3.05 hours per subcategory (=52*8*5/681). Realistically, no single expert could assess all NPC subcategories, so significant coordination would be required to identify, contact, and engage a large number of experts for this effort. We are also unsure that three hours is sufficient to achieve a “reasonable” score, however this is defined.

  24. ^

     The Against Malaria Foundation could in theory be categorized under several subcategories such as Infectious Diseases, International Public Health/International Health, or Preventive Health. It is not directly obvious to us that any of these categories should receive a score of ‘high cost-effectiveness’, and each of these subcategories could include charities that are not cost-effective.

  25. ^

     This taxonomy could also mis-classify charities that focus on alleviating health burden in LMICs by doing research in HICs.

  26. ^

     We do not intend to suggest that any existing charity is currently doing this.

  27. ^

     You could choose to apply this method to only a subset of causes, such as those for which you have a clear definition of impact or those in which you think impact-oriented donors are most interested.

  28. ^

     Consider a much, much simpler task—simply coding headlines that pertain to economic events as indicating positive or negative sentiment towards the economy, where the ‘ground truth’ is an aggregate of ratings provided by experts. Academic researchers approaching this task (Van Atteveldt et al., 2021) using a range of custom-built machine learning tools achieved an accuracy of around 63%. In my (Jamie’s) work to assess how well an LLM could perform this task, accuracy of around 80% was achieved (Elsey, 2023), and this would typically be considered a success for such a classifier.

  29. ^

     Even in very simple tasks or in response to simple questions, LLMs are known to ‘hallucinate’—that is, they invent and respond to completely made up information, which they appear to have just confabulated in response to a user prompt. Given low capacity for maintaining and evaluating performance, it would be very difficult to determine in such a large number of rankings how much information going into the LLM’s final categorizations might have been hallucinated.

  30. ^

     Or at least verify that the values an automated method has assigned to these topics align with Charity Navigator’s own values, which is also a form of taking a stance on them.

  31. ^

     For example, lobbying for a policy that would improve many people’s lives. This could have an enormous impact if it succeeds, but may have only a small chance of succeeding.

  32. ^

     The method we are referring to is dividing up cause areas into a taxonomy that is aimed at roughly identifying charities that are likely high-impact by simple classification questions. Charities would self-select in which category of the taxonomy they best fit.