Important evidence and knowledge come in many shapes and forms: the on-the-ground practical knowledge of practitioners, information about the location and terrain, the views of the communities intended to benefit, information about the quality and consistency of implementation, and academic/objective evaluations of effectiveness.
Our EGM and Guidebook in child protection show the results of a search for a specific type of evidence: rigorous ‘what works’ studies of the effects of interventions, i.e., studies that have assessed the link between a certain intervention (or programme) and an outcome (or result). Other types of study are therefore not included in the EGM even though they may be important: e.g., qualitative studies (e.g., of attitudes), studies of the prevalence of the problem, studies of activity (e.g., funder activity, or how many organisations have safeguarding policies), and studies of public policy. This is not because those are unimportant, but because this particular map is only about causal evidence between an intervention and outcome(s).
Each primary study* on the EGM looks at the effect of an intervention on an outcome: for example, the effect of an in-school programme that teaches children about good touches vs. bad touches on children’s knowledge of abuse. In that example, the intervention aims at prevention, so the study goes in the top row; the target outcome is ‘children’s knowledge’, so the study goes in the relevant cell of that row.
Where a study examines the effect of an intervention on several outcomes, we place the study in each relevant cell. For instance, the in-school programmes mentioned above are measured on their effects on children’s knowledge, children’s mental health, and sometimes parent knowledge. So the same study may appear several times on the EGM.
If a cell is empty, that means that we did not find any studies of the effect of that type of intervention on that type of outcome.
* The EGM also includes systematic reviews. These are a type of study-of-studies: they look for, collate and synthesise evidence, such as ‘what works’ studies. If a cell is empty apart from systematic review(s), that means that those systematic reviews did not find any primary studies of that type of intervention on that type of outcome – and neither did we. So, for our purposes, we can consider those cells to be empty.
The EGM shows where there is evidence from “what works” studies, and where there is not (yet). The Guidebook shows what the evidence says. Each cell in the EGM shows the studies of the effect of one intervention on one outcome. Across the EGM there is great diversity in the number of studies per cell:
- Empty cells (no evidence): 25 of the 72 cells.
- Cells with 1 or 2 primary studies: 11 of the 72 cells.
- Cells with 3 or more studies: 25 of the 72 cells; here the number of studies per cell ranges from 3 to more than 60.
This diversity means that different cells need different approaches: a cell with 1 or 2 studies cannot be summarised in the same way as one with more than 10 studies. The Guidebook therefore summarises what the evidence says in different ways, depending on how many studies a cell contains (a small sketch after this list illustrates the rule):
- Cells which are empty: clearly we cannot say anything about ‘what works’ there, because nothing is known.
- Cells with only 1 or 2 primary studies: the Guidebook has a plain language summary of each study. (Note that if a cell contains only one study and it is a systematic review, we have not written a summary of it, because that systematic review found no primary studies for that cell. If there are two studies, one of which is a systematic review, then the other is a primary study: we wrote a summary of that, but not of the systematic review.)
- Cells with 3 or more studies: The Guidebook has a synthesis of all the studies in that cell.
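To make the rule concrete, here is a minimal illustrative sketch in Python. The dictionary representation, cell names and study IDs are all invented for illustration; they are not the EGM’s actual data or format:

```python
# Illustrative sketch only: we represent the EGM as a mapping from
# (intervention, outcome) cells to lists of study records.
# All cell names and study IDs below are invented examples.
egm = {
    ("school-based prevention", "children's knowledge"): [
        {"id": "S1", "type": "primary"},
        {"id": "S2", "type": "primary"},
        {"id": "S3", "type": "primary"},
        {"id": "R1", "type": "systematic review"},
    ],
    ("awareness campaign", "parent knowledge"): [
        {"id": "R2", "type": "systematic review"},  # review found no primary studies
    ],
    ("helpline", "disclosure rates"): [],
}

def guidebook_approach(studies):
    """Pick the Guidebook's summarisation approach for one cell."""
    primary = [s for s in studies if s["type"] == "primary"]
    if not primary:
        # Includes cells holding only systematic reviews that found nothing.
        return "treated as empty: nothing known about 'what works'"
    if len(primary) <= 2:
        return "plain language summary of each primary study"
    return "synthesis of all the studies in the cell"

for cell, studies in egm.items():
    print(cell, "->", guidebook_approach(studies))
```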
Because we are interested in “what works” studies, we included only studies with a valid counterfactual, i.e., studies which show what would have happened without the programme (see Box 1). Such studies are randomised controlled trials and quasi-experimental designs. We did not include pre-post studies (i.e., studies which only look at the level of something, such as children’s knowledge, before a programme vs. after it, with no comparison group) because they lack a valid counterfactual.
We also included systematic reviews of studies – a type of study-of-studies which looks for, collates and synthesises evidence, such as ‘what works’ studies. As noted above, if a cell is empty apart from systematic review(s), those reviews did not find any primary studies of that type of intervention on that type of outcome – and neither did we – so, for our purposes, we can consider those cells to be empty.
Box 1: The importance of a valid counterfactual[1]: fair tests
The best way to understand whether or not an intervention (e.g., training children in good touch/bad touch) has had an effect on a certain population (e.g., raised children’s awareness) is to know what would have happened if we had NOT done the intervention. This is referred to as the counterfactual. The difference between the outcome (here: children’s awareness) with the programme vs. without the programme is the programme’s effect. Without a comparator, we simply know whether something rose, but have no idea why: perhaps it rose for everybody and had nothing to do with the programme. (See diagram below.)

[Diagram: the outcome over time, with the programme vs. without it]

Some studies look simply at the level of some measurable thing (e.g., children’s knowledge) before the programme vs. after it and, if that level rises, assume that the rise was due to the programme. This is not a good assumption – i.e., it is not a fair test of the intervention: there are many things going on in the world apart from this programme, and perhaps some change affected everybody, irrespective of the programme. These studies are called “pre-post studies” (because they simply compare the level before the programme with that after it). We do not include them on the EGM because they are so unreliable. Sometimes one even finds instances where some desirable outcome changed more for the group which did not get the programme! – i.e., the programme was a hindrance.

A valid counterfactual (a fair test) is often hard to find or create. One reason is that the people who got the programme need to be similar to the people who didn’t (often referred to as the control group). One way to do this is to find people who are eligible for the programme (e.g., the right age, in the right location, etc.) and choose randomly who gets the programme and who doesn’t. This is a randomised controlled trial. That method is used, for example, in testing medicines and vaccines, because it is a fair test. In this EGM, the primary studies all have a defensible counterfactual: either a randomised control group or a similar comparison group. (The sketch after this box illustrates the point with simulated numbers.)
[1] For a detailed explanation see: Gugerty and Karlan (2018)
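To see the fair-test logic in numbers, here is a minimal simulated sketch (our own illustration; the scores, group sizes and effect sizes are invented, and no study on the EGM is implied). A background trend lifts every child’s score; a naive pre-post comparison wrongly counts that trend as programme effect, while the comparison against a control group does not:

```python
import random

random.seed(1)

N = 1000            # children per group (invented)
TRUE_EFFECT = 5.0   # points the programme really adds (invented)
TREND = 8.0         # background change affecting everyone (invented)

# Baseline knowledge scores for two similar groups of children.
treated = [random.gauss(50, 10) for _ in range(N)]
control = [random.gauss(50, 10) for _ in range(N)]

# A year later: everyone drifts upward by TREND, and the treated
# group additionally gains the programme's true effect.
treated_after = [x + TREND + TRUE_EFFECT for x in treated]
control_after = [x + TREND for x in control]

def mean(xs):
    return sum(xs) / len(xs)

# Pre-post estimate: before vs. after, treated group only.
# It wrongly counts the background trend as programme effect.
pre_post = mean(treated_after) - mean(treated)

# Fair-test estimate: treated vs. control after the programme.
# The control group absorbs the trend, isolating the true effect.
fair_test = mean(treated_after) - mean(control_after)

print(f"pre-post estimate:  {pre_post:.1f}  (true effect is {TRUE_EFFECT})")
print(f"fair-test estimate: {fair_test:.1f}")
```

The pre-post estimate comes out around 13 (trend plus effect), while the fair-test estimate recovers roughly the true effect of 5: the control group is what lets us subtract out everything that would have happened anyway.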
If you study the same thing more than once, you are quite likely to get different answers each time – even if the study was well-designed and implemented. This can be because:
- the sample size differed between the studies: smaller studies are more likely to get inaccurate answers than large ones (see the sketch after this list). Or
- the people studied differed between the studies: a programme may have quite a different effect on happy, stable children than it does on children who have already had some traumatic experience of abuse. It may get different results in a high-income country than in a lower-income country. Or
- just random chance.
In short, the results of a study can depend on the context and conditions where the programme ran.
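The first reason above, sample size, is easy to demonstrate. The sketch below (invented numbers, for illustration only) re-runs the “same” study several times at different sample sizes; the small studies scatter widely around the true effect, while the large ones cluster close to it:

```python
import random

random.seed(2)

TRUE_EFFECT = 5.0  # the "real" difference the programme makes (invented)

def one_study(n):
    """Simulate one two-group study with n children per arm and
    return its estimated effect (difference in mean scores)."""
    treated = [random.gauss(50 + TRUE_EFFECT, 10) for _ in range(n)]
    control = [random.gauss(50, 10) for _ in range(n)]
    return sum(treated) / n - sum(control) / n

# Re-run the "same" study five times at each sample size: small
# studies scatter widely around the true effect; large ones cluster.
for n in (20, 200, 2000):
    estimates = [round(one_study(n), 1) for _ in range(5)]
    print(f"n={n:<5} estimates: {estimates}")
```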
Just because one study found that some intervention, in some country at some particular time, produced a particular effect, it does not follow that the intervention will produce the same effect in a different place and/or at a different time.[1] This matters because many cells in the EGM contain only one or two studies, so we should be cautious in interpreting the studies in those cells.
When a cell has multiple studies – i.e., the effect of some programme on some outcome has been studied multiple times – we can be more confident about the effect by analysing them together. We do this in the syntheses in the Guidebook: for cells with three or more primary studies, the Guidebook has a synthesis of the findings. For example, the most-studied programmes are those run in schools to prevent sexual abuse by teaching children about good touches vs. bad touches: those many studies all find broadly similar (positive) effects, so we can be reasonably confident that the programmes work. However, none of the studies is from Asia, so we cannot be confident about the effects of those programmes there.
[1] See a detailed account of this in Cartwright and Hardie (2012)
How single studies can cost lives; and combining studies can save lives:
Excerpt from Bad Science, by Dr Ben Goldacre {ref}. The excerpt below refers to a meta-analysis, a way of synthesising the results of multiple academic studies to see if any patterns arise.

“A meta-analysis is very simple in some respects: you just collect all the results from all the trials on a given subject, bung them into one big spreadsheet, and do the maths on that, instead of relying on your own gestalt intuition about all the results from each of your little trials. It’s particularly useful when there have been lots of trials, each too small to give a conclusive answer, but all looking at the same topic. So if there are, say, ten randomised, placebo-controlled trials looking at whether asthma symptoms get better with homeopathy, each of which has a paltry forty patients, you could put them all into one meta-analysis and effectively (in some respects) have a four-hundred-person trial to work with.

In some famous cases, meta-analyses have shown that a treatment previously believed to be ineffective is in fact rather good, but because the trials that had been done were each too small, individually, to detect the real benefit, nobody had been able to spot it.

[This diagram, which has become the logo of The Cochrane Collaboration, is] a graph of the results from a landmark meta-analysis which looked at an intervention given to pregnant mothers. When people give birth prematurely, as you might expect, the babies are more likely to suffer and die. Some doctors in New Zealand had the idea that giving a short, cheap course of a steroid might help improve outcomes, and seven trials testing this idea were done between 1972 and 1981. Two of them showed some benefit from the steroids, but the remaining five failed to detect any benefit, and because of this, the idea didn’t catch on.

Eight years later, in 1989, a meta-analysis was done by pooling all this trial data. If you look at the blobbogram [diagram above], you see what happened. Each horizontal line represents a single study: if the line is over to the left, it means the steroids were better than placebo, and if it is over to the right, it means the steroids were worse. If the horizontal line for a trial touches the big vertical ‘nil effect’ line going down the middle, then the trial showed no clear difference either way. One last thing: the longer a horizontal line is, the less certain the outcome of the study was.

Looking at the blobbogram, we can see that there are lots of not-very-certain studies, long horizontal lines, mostly touching the central vertical line of ‘no effect’; but they’re all a bit over to the left, so they all seem to suggest that steroids might be beneficial, even if each study itself is not statistically significant. The diamond at the bottom shows the pooled answer: that there is, in fact, very strong evidence indeed for steroids reducing the risk – by 30 to 50 per cent – of babies dying from the complications of immaturity.

We should always remember the human cost of these abstract numbers: babies died unnecessarily because they were deprived of this life-saving treatment for a decade. They died, even when there was enough information available to know what would save them, because that information had not been synthesised together, and analysed systematically, in a meta-analysis.”
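The “bung them into one big spreadsheet and do the maths” step can be sketched concretely. Below is a minimal fixed-effect, inverse-variance meta-analysis in Python; the trial effect estimates and standard errors are invented for illustration (they are not the steroid-trial data):

```python
import math

# Hypothetical trials: (effect estimate, standard error). Negative
# effect = fewer deaths. None is individually statistically
# significant (each |effect| < 1.96 * SE), echoing the blobbogram.
trials = [(-0.40, 0.25), (-0.15, 0.30), (-0.50, 0.35),
          (-0.30, 0.28), (-0.35, 0.40)]

# Fixed-effect inverse-variance pooling: weight each trial by
# 1/SE^2, so more precise trials count for more.
weights = [1 / se ** 2 for _, se in trials]
pooled = sum(w * eff for (eff, _), w in zip(trials, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

low, high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled effect: {pooled:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
# The pooled CI excludes zero: together the trials are conclusive
# even though each one alone is not.
```

Weighting each trial by 1/SE² simply means that more precise trials count for more in the pooled answer – the same logic by which the blobbogram’s diamond can sit clear of the ‘no effect’ line even when every individual horizontal line touches it.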
A study, or a group of studies, can find one of four things. Understanding the direction of the results is important, and hence it is highlighted in this Guidebook:
- Positive effect: the intervention seems to improve the outcome.
- No effect: the intervention did not change the outcome(s) at all. There are generally two reasons why a study may not find any result[1]:
- Implementation failure: the intervention was not implemented well, or not implemented at all. Clearly people do not benefit from an intervention which they did not receive. (This is what happened in the example in Indian schools mentioned here: the intervention simply never got delivered.)
- Theory failure: the intervention does not affect the outcome. The Theory of Change is simply not true: the mechanism/s it relies on do not affect the outcome/s in the predicted way/s.
- Negative effect: the intervention has a negative effect on the exact thing it was intended to improve, i.e., it is counterproductive. There are examples in crime, where interventions increase the rate of crime(!) Happily, we did not find any clear examples of this in child protection, so there are none on the EGM. (We say ‘clear’ examples because there are some unclear ones. E.g., in teaching young people about dating violence, some young people realise that they have been having sex under-age, which is illegal. The amount of crime which they realise they have been involved in therefore increases, even though the actual amount of crime has not changed.)
- Mixed or unclear effects: the intervention might improve some outcome/s but have no effect on others (there are examples of this on the EGM); or it might improve one outcome but worsen another – for instance, perhaps the children gain knowledge but also become more anxious; or it might have different effects on different groups, e.g., working for girls but not for boys. Mixed effects are common in groups of studies, but also quite common within a single study, given that most studies measure more than one outcome.
[1] For more detail, see: Astbury, B., & Leeuw, F. L. (2010). Unpacking black boxes: mechanisms and theory building in evaluation. American Journal of Evaluation, 31(3), 363-381.
One of the most important ways to use the EGM and Guidebook, but also one of the most difficult, is assessing whether the results will apply in your context. In other words, if you run the intervention in a different place or at a different time, will you get the same results as in the study? There is no standardised or straightforward way to make this assessment, and many factors come into play. In addition, a study may not say much about the extent to which its results can be generalised (see Box 2 below). The following examples illustrate the issues.
Box 2: Using evidence from one place in another place
Example 1: HIV

The following example is from Mary Ann Bates and Rachel Glennerster, then of J-PAL. Against HIV, J-PAL had run a “Sugar Daddies Risk Awareness” programme in Kenya, to reduce sexual relationships between teenage girls and older men. It taught girls how many older men have HIV/AIDS, and thus educated them about the risks of such relationships. It was “remarkably effective”. This was in part because the Kenyan “girls did not realize that HIV risk rose with age”: those girls under-estimated the risks, so the education programme discouraged the risky behaviour.

J-PAL considered running this programme in Rwanda. But there (i) HIV rates among older men are vastly lower than in Kenya (“1.7 percent compared with 28 percent in the district in Kenya where the original evaluation was carried out”), and (ii) girls in Rwanda “massively overestimated the percentage of both younger and older men who have HIV”, so educating girls about the actual prevalence might “lead teenage girls to increase the amount of unprotected sex they have with both younger and older men.” They say: “Note that the data that ultimately helped to diagnose whether the treatment might be effective in Rwanda did not come from an impact evaluation or an RCT. They were simple descriptive or observational data that were collected quickly (over two weeks) to assess whether the conditions were right for a program to be effective.”

Example 2: Nutrition

In a parenting programme in India focused on improving child growth, the Theory of Change relied on the transfer of knowledge to mothers to improve nutritional practices at home. After successful implementation in Tamil Nadu, the programme failed to achieve success with children in Bangladesh. An exploration of this result looked beyond the mothers and found that family members not present at the sessions – for example, the mothers-in-law – were the important decision-makers in the home. As such, including mothers-in-law in the parenting sessions, for example, could be a more effective strategy than increasing the number of sessions.[1]

[1] Example described in Evidence-Based Policy, Cartwright and Hardie, OUP, 2012
Both of these examples show how the mechanism of an intervention may not translate to a different place. In the HIV example, the programme worked in Kenya because it made girls realise that the risks were higher than they had thought; in Rwanda, the risks were not higher than girls thought, so it would not work. That difference between girls’ perceptions and reality was critical to the programme’s success, and it was not the same in the two places. Similarly, in the nutrition example, a key part of the mechanism was that the training involved the key decision-maker: that was assumed to be the mother, which was true in one place but not the other. So the mechanism did not apply in both places.
This shows why it is essential to understand the mechanism and establish whether it applies in a new place. Notice that in the HIV example, this did not require complicated data-gathering or analysis: it simply took a fortnight. (In the nutrition example, too, it was relatively simple to establish that the mechanism which had worked in Tamil Nadu did not apply in Bangladesh.)
A few considerations can help you identify the mechanism of an intervention, and hence assess whether it is likely to work in your context:
- Theory of Change and mechanisms: the Theory of Change and its “mechanisms” are essential to assessing whether a programme will work in a different place or time. The Theory of Change spells out how an intervention is supposed to affect some outcomes, the reasons why, and the circumstances under which it should work. Importantly, interventions from very different contexts and locations can share Theories of Change: that can allow a study in one place to inform a very different context. A mechanism[1] is usually a piece of theory, or hypothesis, that explains a causal link, i.e., why one activity would lead to the desired result. Famous examples are the incentives and motivation stimulated by (financial) rewards, or conformity with social norms. For both Theories of Change and their mechanisms, it is important to understand the circumstances under which they may, or may not, be activated. Where the ToC is given in a study on the EGM, we explain it in the Guidebook.
- Location and setting: studies conducted in a location or setting similar to yours are more likely to translate, but the key thing is to check that the mechanism applies, as discussed above. This does not mean that evidence from different locations or settings is by nature irrelevant: evidence from very different contexts may be very informative, but may require more scrutiny to translate (as the examples in Box 2 show).
- Target group: this means the people who receive the intervention, who may be different from the end-beneficiaries. For example, the end-beneficiaries may be children, but the intervention may be delivered to teachers, social workers, school heads, etc. While lessons may be learned from interventions delivered to target groups other than the one you are serving, the evidence is a stronger predictor if the target group is the same.
- Implementing organisation: the organisation implementing the intervention affects whether the results will be the same in a different place or time. For example, interventions run by well-established government agencies may have different dynamics from those run by new, local NGOs. Sometimes the latter may be less able to implement robustly; on the other hand, local NGOs are sometimes more trusted than government, and so may get better engagement. So understand the implementing organisation. Where studies report on the implementing organisation, this is included in the Guidebook, though sadly many studies do not report on it – hence the many calls globally for better implementation information and data.[2]
Where possible and available, the Guidebook provides information on these considerations.
An important point: whether an intervention gets the same results in one place as it does in another need not have anything to do with the quality of a study. Rather, it can be a feature of the world: things and people differ between places (and times). For example, the Good Schools Toolkit in Uganda, which is on the EGM, succeeded in reducing violence in schools: but it may not have that effect in, say, Scotland, where violence in schools is much lower than in Uganda. Social scientists distinguish between:
- A study’s internal validity: whether the study’s answer is reliable, i.e., whether the study was done well and accurately measured the thing it set out to measure in that context. And
- A study’s external validity: whether the study’s answer will be the same in different places. This is about whether the measured phenomenon is the same in different places: it has nothing to do with the study itself, and everything to do with the characteristics of the thing being studied – in our case, the people in the study.
This bears repeating because there is often confusion here. The Good Schools Toolkit may have been studied well, but it just happens that Scotland is very different to Uganda. (This happens also in physics: a good study of the strength of the Earth’s gravitational field will give different answers in Sweden than in Singapore. That is not because anything is wrong with those studies: rather it is a physical reality because the Earth is not spherical.)
[1] To learn more about mechanisms and evaluation, see: Astbury, B., & Leeuw, F. L. (2010). Unpacking black boxes: mechanisms and theory building in evaluation. American Journal of Evaluation, 31(3), 363-381.
[2] NYAS series ECD; Measurement for Change
We are often interested in studies because of the effect that an intervention has on a certain outcome (be it positive, neutral or negative). However, a study can be valuable for many other reasons. Three other things can be valuable to learn about from a study, and to apply in our work beyond the headline result:
- The Theories of Change on which interventions are based can be very informative, as they explain why and how an intervention is hypothesised to affect an outcome, and under which circumstances. Some recent literature suggests that a shortcoming of many interventions is a lack of theory, quite possibly leading to poorly designed interventions.[1] Learning from existing Theories of Change to improve future interventions is therefore crucial.
- The way in which an intervention was implemented, and the challenges and opportunities encountered, is another area from which many insights can be derived. Especially when an intervention has been very successful or very unsuccessful, understanding how it was implemented can be important in explaining the result and replicating success elsewhere.
- Outcome measurement: measuring outcomes is often not straightforward, and finding or developing the right measure for an outcome requires time and effort. Existing studies in an outcome area can act as a starting point for finding useful, relevant and reliable measures. If you are running an intervention in a new place, it is best, where possible, to use the same measures as have been used to study it before, because that allows you (and others) to compare the results directly.
[1] Deaton, A., & Cartwright, N. (2018). Understanding and misunderstanding randomized controlled trials. Social Science & Medicine, 210, 2-21; Lortie-Forgues, H., & Inglis, M. (in press). Rigorous Large-Scale Educational RCTs Are Often Uninformative: Should We Be Concerned? Educational Researcher.
Though the primary studies on the EGM have a valid counterfactual, they may nonetheless be biased, which makes their findings less reliable. We assessed the reliability of each study on the EGM with an established “risk of bias” tool, based on the information in the study report: the risk of bias for each study is shown on the EGM (the colour of its circle) and in the Guidebook.
There are various ways in which a study can be biased. For example, the method used to randomise people is important (see glossary); if the report doesn’t say which method was used, then we cannot be confident that randomisation was done well (i.e., was unbiased), so there is a risk that it was biased. (To be clear, this is about confidence and risk: the study may have been conducted brilliantly, but we do not know that.)
The results of the various risk of bias assessments show that we can have only limited/low confidence in most studies on the EGM, because almost all the studies on the map carry considerable risk of bias. Full results of this assessment are reported in the EGM reports.
One implication of this is the importance of monitoring, to ensure that you are achieving the kind of outcomes that you expect: i.e., not relying on an evaluation from a different place to predict precisely the outcomes that you will achieve.
A programme can be strengthened in two ways by looking at additional information – whether or not there are existing studies on the EGM about the effect of that intervention on that outcome.[1]
1. First, revisit and strengthen its Theory of Change. Critically thinking through why and how an intervention is supposed to generate an outcome, and under which circumstances, can help to stress-test the intervention. In the nutrition example in Box 2, the theory assumes that the mothers being trained are the decision-makers, and we can simply test whether that is true. Identifying the assumed causal links like this can expose assumptions which are not true, and the programme design can be amended accordingly. Most ToCs draw on existing knowledge, so there is a lot of ground to stand on (see some ideas in the box below).
Box N: Strengthening a Theory of Change
Even a new Theory of Change (ToC) is almost certainly not without evidence: most ToCs rely on influencing human behaviour, and quite a lot is known about that from the existing literature. Social science has identified and described helpful patterns, such as the prediction that, in general, demand falls when prices rise. We also know that people respond to incentives. And people forget things: so ToCs which rely on people remembering many things, or remembering things many times, often fail. (This can be a problem for medical regimes which require people to take tablets many times, for example.)

We know from behavioural science that people are more likely to do something if it is EAST: easy, attractive, social, timely. Those principles informed the design of chlorine dispensers in Kenya, which are easy to use and to remember because they are placed at the water well, where people see them when collecting water: the intervention is timely (people are thinking about water anyway: it is not a separate thing to remember) and social (they see other people adding chlorine to their water, and so are reminded).

The more that a programme accords with these known patterns of behaviour, the more likely it is to work.
2. Second, the organisation running the intervention can use many additional sources of data. These can include evidence in “neighbouring” cells on the EGM; data about the context; and the knowledge and experience of practitioners and the community. For example, if an intervention relies on beneficiaries travelling to some place but transport there is limited, it is unlikely to succeed. As our colleague Professor Howard White says: “you don’t need an RCT to tell you the effect of opening the clinic another day a week if the boat only goes to the clinic once a week”. A home-visiting intervention may work much better in such circumstances. Furthermore, it is important to understand whether the key stakeholders want the programme, and trust it. This assessment relies on data specific to the context.
[1] See also Gugerty and Karlan for a detailed outline of steps like these
Even if a rigorous evaluation with a counterfactual is not possible, an organisation should do proper monitoring. For example, when Covid vaccines were being rolled out, they had obviously been properly evaluated first, but the implementation was monitored very carefully to ensure, among other things, that everybody got the correct dose and that the vaccines were stored at the correct temperatures, and to monitor each recipient for adverse reactions.
Good monitoring will not indicate the effect of the programme. But it will indicate if it is being implemented at all, implemented as planned, whether everybody is receiving the whole intervention, and whether there are problematic outcomes which should be investigated.
What are primary research and systematic reviews?
Primary research is a study of people. It can involve questionnaires, surveys or interviews, or other measurements about people such as their income, height, or scores in tests.
A systematic review is a study of studies. It is a structured investigation to find, critically appraise and synthesise all the relevant primary research on a specific topic. Systematic reviews are stronger than non-systematic ‘literature reviews’ in that they: (i) can reconcile differences in the conclusions of different studies by looking across a larger set of participants, (ii) identify gaps to inform further research, (iii) are more transparent and hence can be reproduced by other researchers in future and (iv) are less prone to bias, as science writer, doctor and Oxford academic Ben Goldacre explains:
“Instead of just mooching through the research literature consciously or unconsciously picking out papers that support [our] pre-existing beliefs, [we] take a scientific, systematic approach to the very process of looking for evidence, ensuring that [our] evidence is as complete and representative as possible of all the research that has ever been done.”[i]
Thus a systematic review is more likely to be accurate, and hence more useful to practitioners for informing research and programme design, than a non-systematic literature review. It is also more credible, and hence more useful for convincing funders and policy-makers.
Each systematic review defines a scope (the topics, geography and timescale of interest) and the way it will search for studies within that remit (the ‘search strategy’). Most set some threshold for the quality of the primary studies they include in their analysis (the importance of the quality of primary studies is discussed in Box 1). This is significant because the systematic review process is not magic: if the primary studies on which a systematic review is based are unreliable, the review’s results will be unreliable. As a Yale cardiologist wrote on Twitter (Krumholz 2015): ‘You can’t just combine weak evidence and pretend that when mushed together it is strong. [Rather] it is meta-mush.’
[i] Bad Science, Ben Goldacre, Harper Collins, 2009