
DISCUSSION PAPER / 2010.01

Challenges in impact evaluation of development interventions: opportunities and limitations for randomized experiments

Jos Vaessen


Comments on this Discussion Paper are invited. Please contact the author at: jos.vaessen@ua.ac.be

Instituut voor Ontwikkelingsbeleid en -Beheer
Institute of Development Policy and Management
Institut de Politique et de Gestion du Développement
Instituto de Política y Gestión del Desarrollo

Postal address: Prinsstraat 13, B-2000 Antwerpen, Belgium
Visiting address: Lange Sint-Annastraat 7, B-2000 Antwerpen, Belgium

Tel: +32 (0)3 275 57 70
Fax: +32 (0)3 275 57 71
e-mail: dev@ua.ac.be
http://www.ua.ac.be/dev


Discussion Paper / 2010.01

Challenges in impact evaluation of development interventions: opportunities and limitations for randomized experiments

Jos Vaessen*

August 2010

* Jos Vaessen is lecturer at Maastricht University and researcher at the Institute of Development Policy and Management (IOB), University of Antwerp.


Table of Contents

Abstract
Résumé
1. The rising star of impact evaluation in development
2. Scope
3. Challenge 1: delimitation
3.1. The importance of stakeholder values
3.2. The impact of what?
3.3. The impact on what?
3.3.1. Institutional versus beneficiary level effects
3.3.2. Intended versus unintended effects
3.3.3. Short-term versus long-term effects
4. Challenge 2: Attribution 'versus' explanation
4.1. Addressing the attribution challenge with randomized experiments
4.2. Internal versus external validity
4.3. Interventions as theories
4.4. The law of comparative advantages: theory-based and multi-method evaluation
5. Challenge 3: Impact evaluation in practice
5.1. Threats to attribution analysis in experimental settings
5.2. Other design and implementation challenges
6. Some lessons for randomized experiments and impact evaluation in general from the perspective of a 'non-randomista'
References


Abstract

In recent years, debates on as well as funding of impact evaluations of development interventions have flourished. Unfortunately, controversy regarding the promotion and application of randomized experiments (REs) has led to a sense of polarization in the development policy and evaluation community. As some proponents claim epistemological supremacy for REs (with respect to attribution), the counter-reaction among others has been rejection. Needless to say, such extreme positions are counterproductive to reaching a goal that is commonly endorsed: to learn more about what works and why in development. This paper discusses the prospects and limitations of REs from the perspective of three categories of challenges in impact evaluation: delimitation and scope, attribution versus explanation, and implementation challenges. The implicit lesson is twofold. First of all, the question 'to randomize or not to randomize' is overrated in the current debate. Limitations in scope, applicability as well as implementation will necessarily restrict the use of REs in development impact evaluation. There is a risk that the current popularity of REs in certain research and policy circles might lead to a backlash, as too high expectations of REs may quicken their demise. More importantly, given the nature and scope of the challenges discussed in the paper, more energy should be devoted to developing and testing 'rigorous' mixed-method approaches within a framework of theory-driven evaluation.

Acknowledgments

I would like to thank Frans Leeuw and Robrecht Renard for their comments on this paper. Any remaining errors are my own.


Résumé

In recent years, impact evaluations of development interventions have been the subject of many debates and have received substantial funding. Unfortunately, the controversy surrounding the promotion and application of randomized experiments (REs) has created a sense of polarization among those who design and evaluate development policies. As some proponents claim epistemological supremacy for REs (with respect to attribution), the reaction of others has taken the form of rejection. Needless to say, such extreme positions are counterproductive to reaching a goal shared by all: learning more about what works in development, and why. This paper discusses the prospects and limitations of REs from the perspective of three categories of challenges in impact evaluation: delimitation and scope, attribution versus explanation, and implementation challenges. The implicit lesson is twofold. First, the question 'to randomize or not' carries too much weight in the current debate. Limitations in scope, applicability and implementation will inevitably restrict the use of REs in development impact evaluation. The current popularity of REs in certain research and policy circles could well provoke a backlash, as excessive expectations of REs may hasten their abandonment. Finally, and most importantly, given the nature and scope of the challenges discussed in this paper, more energy needs to be devoted to developing and testing 'rigorous' mixed-method approaches within a framework of theory-based evaluation.


1. The rising star of impact evaluation in development

The question of 'what works and why' in development assistance has received considerable attention over the past few years. The major reason is that many outside of development agencies believe that the achievement of results has been poor, or at best not convincingly established. In the last decades of the previous century, part of the development assistance paradigm was about 'thinking big', e.g. structural adjustment policies as key factors in generating stability and growth in developing countries. Correspondingly, a lot of intellectual effort went into analyses of development at the macro level, with scores of economists working on growth regressions, trying to identify the key factors determining country growth of GDP and the role of development assistance therein. Growing pessimism within the international community about the effectiveness of macro interventions, in part fuelled by the lack of decisive evidence from the academic community, gradually led to a shift in the development paradigm towards more 'thinking small' (Easterly, 2001; Cohen and Easterly, 2009). Yet the state of evidence on the effectiveness of concrete policy interventions (programs, policies) was far from promising either. Towards the end of the previous century the realization grew that, given the evidence base, it was hard to determine the extent to which interventions were making a difference (Baker, 2000). In 2006, an influential paper published by the Center for Global Development, "When will we ever learn?" (CGD, 2006), pointed at an evaluation gap in development: despite enormous investments in development policy, the evidence base on what works was diagnosed as weak. According to the paper, too much of the evaluative work in development focused on process instead of results, and credible evaluations of results were scarce. Fortunately, at the time of publication of this paper, the tide had already been gradually turning.
A number of key events, such as the endorsement of the Millennium Development Goals by the global community in 2001, the 2002 Monterrey Conference on Financing for Development, the 2005 Paris Declaration on Aid Effectiveness and the 2008 High-Level Forum on Aid Effectiveness in Accra, were signs of a growing results-focus in the development community, gradually paving the road for more attention to the assessment of effects of development interventions.

As a result of this evolution, in recent years debates on as well as funding of impact evaluation have flourished. Impact evaluation can be roughly defined as the (growing) field of evaluative practices aimed at assessing the intended and unintended effects of policy interventions. One of the particularly productive areas in impact evaluation is (quasi-)experimental impact evaluation, in particular randomized experiments. A number of initiatives, most notably the Poverty Action Lab (J-PAL), Innovations for Poverty Action and the World Bank's Spanish Impact Evaluation Fund, are creating a growing body of evaluative evidence based on randomized experiments. [1] The comparative advantage of the latter methodology, [2] as has been argued widely, is its inherent strength to address the attribution problem in evaluation through counterfactual analysis. The basic idea of a randomized experiment (RE) [3] is that the situation of a participant group (receiving benefits from/affected by an intervention) is compared with that of an equivalent control group that does not receive benefits from, or is not affected by, the intervention.

[1] Another fairly recent initiative, the International Initiative for Impact Evaluation, is funding proposals for research based on a broader palette of methodological designs, usually a combination of quantitative methods embedded in a theory-based approach.
[2] To keep things simple, I distinguish between methodologies and methodological designs on the one hand and methods on the other. The former can comprise multiple methods, the latter referring to specific tools of data collection and analysis.
[3] Given the scope of the interventions that are currently assessed with this methodology, I prefer the term randomized experiments (REs) instead of the "treatment-oriented" term of randomized controlled trials.


the shortcomings in econometric methods for explaining effectiveness, mainly due to the identification problem. [1] As REs have been enthusiastically taken up by development economists, sharp critique has come from the same discipline, most notably from Rodrik (2008), Ravallion (2008, 2009a) and Deaton (2009). [2] Banerjee and Duflo (2008), two prominent 'randomistas', acknowledge much of the critique on REs. At the same time, however, they counter the critique by arguing that most of it is not specific to randomized experiments but also applies to non-experimental observational studies. Yet this is exactly the point that critics make: the fact that randomized experiments can be justifiably criticized on a number of key issues (see below) undermines the claim to epistemological supremacy, a claim that many 'randomistas' explicitly or implicitly endorse.

[1] Generally, this refers to the problem that multiple values of parameters or multiple models might explain a certain pattern in the data.
[2] See also Cohen and Easterly (2009).


Challenge 3: Impact evaluation in practice. Good impact evaluation is good research. Whereas many of the current debates on impact evaluation tend to center on arguments for and against REs, in practice the validity of findings of any type of impact evaluation, including REs, heavily depends on the extent to which a number of key design and implementation challenges have been appropriately addressed.

In this paper the focus will be on methodological design and implementation aspects of impact evaluation. Less attention will be paid to the properties of statistical analysis within the context of REs or other methodological designs (see for example Heckman, 1992; Shadish et al., 2002; Cook, 2006; for development interventions see for example Deaton, 2009).


3. Challenge 1: delimitation

3.1. The importance of stakeholder values

A first important criterion for delimitation in impact evaluation concerns the question of 'impact according to whom'. There is a strong movement in development research and practice which endorses the idea that impact evaluation is not only about assessing the effects of an intervention but also about the underlying question of what types of processes of change and effects are valued as important (either positive or negative), and by whom. [1]

This line of thought is most manifest in the evaluation tradition of participatory impact evaluation. Nowadays, participatory methods have become 'mainstream' tools in development in almost every area of policy intervention. The roots of participation in development lie in the rural sector, where Chambers (1995) and others developed the now widely used principles of Participatory Rural Appraisal. [2] Participatory evaluation approaches (see for example Cousins and Whitmore, 1998) are built on the principle that stakeholders should be involved in some or all stages of the evaluation. In the case of impact evaluation, participation includes aspects such as the determination of objectives and the indicators to be taken into account, as well as stakeholder participation in data collection and analysis. In practice it can be useful to differentiate between stakeholder participation as a process and stakeholder perceptions and views as sources of evidence (Cousins and Whitmore, 1998).

Randomized designs are not about stakeholder participation or the elicitation of stakeholder values. However, there is no reason to assume that REs cannot be combined with participatory processes and methods of data collection (Karlan, 2009). In practice a wide variety of methods is available (see for example IFAD, 2002; Mikkelsen, 2005; Pretty et al., 1995; Salmen and Kane, 2006). Stakeholder participation in impact evaluation can be beneficial in many ways, e.g. by enhancing the ownership and (possibly) utilization of an evaluation, improving the quality of the data collected from target populations, or strengthening local processes of governance.
At the same time, participatory methods have been criticized on many grounds. Often mentioned critical aspects concern the limited applicability of impact evaluations with a high degree of participation, especially in large-scale, comprehensive, multi-site interventions. In such contexts, organizing processes of stakeholder participation may not be feasible due to high costs and logistical barriers. In addition, there is some doubt about the reliability of information based on stakeholder perceptions (e.g. due to risks of strategic responses, manipulation of information or advocacy by stakeholders). [3]

[1] The importance of stakeholder values may differ according to the type of intervention. For example, whereas stakeholder views on what is important in the evaluation of the effects of reforestation programs may differ widely, this may be less the case for health indicators in the case of nutrition programs.
[2] Participatory Impact Assessment is an extension of Participatory Rural Appraisal and involves the adaptation of participatory tools combined with more conventional statistical approaches specifically to measure the impact of humanitarian assistance and development projects on people's lives (Catley et al., 2008).
[3] In turn, proponents of participatory evaluation approaches are often very critical of the reliability of survey-based research and corresponding data analyses within the framework of experimental or non-experimental methodological designs. Especially well-known is Robert Chambers' (1983) discussion of what he labeled 'survey slavery', criticizing among other things the costs (also for respondents), inefficiencies, rigidities and data problems of survey research in rural development contexts.


An alternative approach for eliciting stakeholder values which does not rely on stakeholder participation is values inquiry, which "refers to a variety of methods that can be applied to the systematic assessment of the value positions surrounding the existence, activities, and outcomes of a social policy and program" (Mark et al., 1999: 183). Values inquiry exercises may be more useful than participatory evaluation approaches in situations where policy makers are interested in a representative picture of the value positions of large groups of beneficiaries dispersed over large territories.

3.2. The impact of what?

When talking about the scope and delimitation of impact evaluation it is useful to address the following two questions: the impact of what, and the impact on what (see also Vaessen and Todd, 2008)? Regarding the impact of what, today more than ever one can speak of a 'continuum' of interventions. At one end of the continuum are relatively simple projects characterized by 'single strand' initiatives with explicit objectives, carried out within a relatively short timeframe, where interventions can be isolated, manipulated and measured. An impact evaluation in the agricultural sector, for example, will seek to attribute changes in crop yield to an intervention such as a new technology or agricultural practice. In a similar guise, in the health sector, a reduction in malaria will be analyzed in relation to the introduction of bed nets. For these types of interventions, experimental and quasi-experimental designs may be appropriate for assessing causal relationships.
At the other end of the continuum are comprehensive programs with an extensive range and scope (increasingly at country, regional or global level), with a variety of activities that cut across sectors, themes and geographic areas, and with emergent specific activities. Many of these interventions address aspects that are assumed to be critical for effective development yet difficult to define and measure, such as human security, good governance, political will and capacity, sustainability, and effective institutional systems.

One of the trends in development is that donors are moving up the 'aid chain'. Whereas in the past donors were very much involved in 'micro-managing' their own projects and (sometimes) bypassing government systems, nowadays a sizeable chunk of aid is allocated to national support for recipient governments. Attention has to some extent shifted from micro-earmarking (e.g. donor money destined for an irrigation project in district x) to meso-earmarking (e.g. support for the agricultural sector) or macro-earmarking (e.g. support for the government budget to be allocated according to country priorities). Besides a continued interest in the impact of individual projects, donors, governments and nongovernmental institutions are increasingly interested in the impact of comprehensive programs, sector strategies or country strategies, often comprising multiple instruments, stakeholders, sites of intervention and target groups (see Jones et al. (2008) for a recent inventory of impact evaluations in different sectors of development intervention).

In most countries donor organizations are (still) the main promoters of impact evaluation. The partial shift of the unit of analysis to the macro and (government) institutional level requires impact evaluators to pay more attention to complicated and more complex interventions at national, sector or program level. Multi-site, multi-governance and multiple (simultaneous) causal strands are important elements of this (see Rogers, 2008).
At the same time, the need for more rigorous impact evaluation at the level of 'single strand' projects or activities remains as important as ever, since they are the building blocks of higher-level programs and policies.


Furthermore, the ongoing efforts in capacity-building on national M&E systems (see Kusek and Rist, 2004; Morra and Rist, 2009) and the promotion of country-led evaluation efforts stress the need for further guidance on impact evaluation at 'single intervention' level.

In the light of this heterogeneous landscape of interventions, critics and (most) proponents of REs alike acknowledge the limitations in applicability of REs. What is problematic is that the special status attributed to REs by some researchers and policymakers is likely to generate a bias in terms of too much evaluative focus on interventions that are amenable to this approach. Development evaluation is already biased in terms of what Ravallion calls a "'myopia bias' in our knowledge, [with evaluation] favoring development projects that yield quick results" (Ravallion, 2008: 6). Similarly, Blattman (2008) refers to the 'overevaluation' of certain economic, educational and health interventions and the 'underevaluation' of interventions on peace-building, crime reduction, and governance issues (e.g. public management, decentralization; see also Jones et al., 2008). REs are most readily applicable in the case of discrete, homogeneous interventions with clearly delineated target groups [1] rather than more complicated interventions, interventions that evolve during implementation, or full-coverage interventions such as laws or macroeconomic policies (Bamberger and White, 2007; Rossi et al., 2004). Even in the case of rather simple interventions, quantitative researchers such as those handling REs find themselves opposed by anthropologists and sociologists who criticize the rather simplistic view of deconstructing interventions into neatly delineated packages of activities, benefits or treatments (see for example Hulme (2000) for a discussion). In contrast, they emphasize the complexity of social development interventions. Staff in such interventions "are not functionaries dutifully providing a standardised service, such as immunising babies or distributing food rations; they are instead engaged in extensive face-to-face interaction with villagers over many months, making innumerable discretionary decisions.
In many respects 'the project' is itself a dynamic decision-making process rather than a static 'product', and as such attempts to make causal claims regarding overall impact must address endemic unobserved heterogeneity bias. In short, on both the 'demand side' (local context) and the 'supply side' (front-line project staff) there is, by design, enormous variation" (Woolcock, 2009: 5). Indeed, identifying what the intervention exactly is, where it begins and where it ends can be rather challenging (Pawson and Tilley, 1997).

In the case of complicated interventions (comprising multiple, possibly interacting, intervention components), at best only some of the components may be amenable to a RE. In the case of interventions that focus on the institutional level (e.g. capacity-building, technical assistance, administrative reform), the corresponding evaluations look at one or a few units of analysis (i.e. the institution); Bamberger and White (2007) call this the small n problem, a situation that is not amenable to a RE. Homogeneity of the intervention is another frequently mentioned condition for REs. A proper RE requires a high degree of homogeneity in intervention, target groups and context over time, conditions which are unlikely to hold in many cases. Often the nature of the intervention changes over time through adaptive learning, political pressures or the need to obtain more funding (e.g. adding training to subsidy programs). In addition, there might be changes in contextual factors that affect participants differently than control group members. For example, rising output prices might speed up technology adoption among participants of a training program more than among control group members who are facing a knowledge constraint. Given the above, if anything, a narrow focus on REs would reinforce an already existing evaluation bias towards particular types of interventions.

[1] This is in line with what Ravallion (2009b) calls assigned programs.


3.3. The impact on what?

3.3.1. Institutional versus beneficiary level effects

The second issue, the impact on what, concerns the type of effects that we are interested in. The causality chain linking policy interventions to ultimate policy goals (e.g. poverty alleviation) can be relatively direct and straightforward (e.g. the impact of vaccination programs on mortality levels) but also complex and diffuse. Impact evaluations of, for example, sector strategies or general budget support potentially encompass multiple causal pathways resulting in long-term direct and indirect impacts. Some of the causal pathways linking interventions to impacts may be fairly straightforward (e.g. from training programs in alternative income generating activities to employment and to income levels), whereas other pathways are more complex and diffuse in terms of going through more intermediate changes and being contingent upon more external variables (e.g. from stakeholder dialogue to changes in policy priorities to changes in policy implementation to changes in human welfare).

Figure 1. Levels of intervention, programs and policies and types of impact. The figure distinguishes institutional level impact (international conferences, treaties, declarations, protocols and policy networks; donor and government capacities/policies; other actors such as INGOs, NGOs, banks and cooperatives; macro-earmarking, e.g. debt relief and GBS; meso- and micro-earmarking, e.g. SBS; programs such as health reform, which may constitute multiple projects, e.g. agricultural extension, and policy measures, e.g. tax increases) from beneficiary level impact (communities, households, individuals as tax payers, voters, citizens, etc.), alongside replication and scaling-up and wider systemic effects (environment, markets, etc.). Source: Leeuw and Vaessen (2009)


Given this diversity it is useful, for purposes of 'scoping', to distinguish between two principal levels of impact: impact at the institutional level and impact at the beneficiary level (see Figure 1). [1] This distinction broadens impact evaluation beyond either measuring whether objectives have been achieved or assessing direct effects on intended beneficiaries. It includes the full range of impacts at all levels of the results chain, including ripple effects on families, households and communities, on institutional, technical or social systems, and on the environment. Interventions that can be labeled as institutional primarily aim at changing second-order conditions (i.e. the capacities, willingness, and organizational structures enabling institutions to design, manage and implement better policies for communities, households and individuals). Examples are policy dialogues, policy networks, training programs, institutional reforms, strategic support to institutional actors (i.e. governmental and civil society institutions, private corporations, hybrids) and public-private partnerships. Other types of interventions directly aim at or affect communities, households and individuals, including voters and taxpayers. Examples are fiscal reforms, trade liberalization measures, technical assistance programs, cash transfer programs, the construction of schools, etc.

3.3.2. Intended versus unintended effects

A widely endorsed reference for impact evaluation is the OECD-DAC definition, which defines impacts as (OECD-DAC, 2002: 24) "[p]ositive and negative, primary and secondary long-term effects produced by a development intervention, directly or indirectly, intended or unintended". However, when we look at the body of research under the banner of impact evaluation, a large part of it is not on long-term results, nor on indirect and unintended results. In fact, the body of evaluation research based on REs and quasi-experiments is usually about analyzing the attribution of short-term outcomes to a particular intervention (White, 2009).
Moreover, in order to permit a difference in difference estimation of the net effects of an intervention, only indicators of effects that are expected to be important are taken into account. As a result of the rather limited and rigid set of indicators employed within the framework of REs and quasi-experiments, they are not very useful for identifying and analyzing unanticipated effects (see for example Davidson, 2006; Bamberger and White, 2007).

Policymakers are very much interested in the indirect diffusion, replication or scaling-up effects of interventions. Whether intended or unintended, these effects usually stay under the radar of REs, as their analysis requires broadening the scope of indicators as well as the sampling framework of an impact evaluation. Replicatory effects, in terms of behavioral changes in actors beyond the original target group, can stem from market responses (given that participants and non-participants trade in the same markets), the (non-market) behavior of participants and non-participants, or the behavior of intervening agents (governmental/NGO). For example, "aid projects often target local areas, assuming that the local government will not respond; yet if one village gets the project, the local government may well cut its spending on that village, and move to the control village" (Ravallion, 2008: 10). Another example concerns displacement effects of environmentally damaging land use towards other areas beyond the grasp of an intervention; deforestation may increase elsewhere as land use becomes more restricted in certain areas.

[1] In addition, one can discern other "levels" such as replicatory effects and systemic effects.


3.3.3. Short-term versus long-term effects

In some types of interventions, effects emerge quickly. In others, effects may take much longer to become manifest, and they may change over time. The timing of the evaluation is therefore important. Development interventions are usually assumed to contribute to long-term development (with the exception of humanitarian disaster and emergency situations). Focusing on short-term or intermediate outcomes often provides more useful and immediate information for policy- and decision-making. However, intermediate outcomes may be misleading, often differing markedly from those achieved in the longer term. Many of the impacts of interest from development interventions will only be evident in the longer term, such as environmental changes or effects on subsequent generations. Searching for evidence of such effects too early might lead to the mistaken conclusion that interventions have failed.

In this context, the exposure time an intervention needs in order to make an impact is an important point. A typical agricultural innovation project that tries to change farmers' behavior with incentives (training, technical assistance, credit) is faced with time lags in both the adoption effect (farmers typically are risk averse, face resource constraints and start adopting innovations on an experimental scale) and the diffusion effect (other farmers want to see evidence of results before they copy). In such gradual, non-linear processes of change with cascading effects, the timing of the ex post measurement (of land use) is crucial. Ex post measurements just after project closure could either underestimate impact (full adoption/diffusion of the practices of interest has not taken place yet) or overestimate impact (as farmers will stop investing in those land use practices that are not attractive enough to be maintained without project incentives).

Woolcock (2009) has recently highlighted a related problem. He argues that REs, especially those that are based on a limited number of data points in time (e.g. before and after only), do not take into account the nature and dynamics of the processes of change induced by an intervention. Processes of change are often not linear. In practice, they often resemble J-curves or step functions.
Examples of such processes are the effects of microcredit on empowerment (e.g. initial resistance by men until persistent and collective pressures lead to a shift in norms) or the adoption of new agricultural technologies (e.g. Rogers, 2003). The implication for REs is that if, for example, the ex post measurement happens to take place when the change curve has hit a (temporary) low, estimates of net effects will be entirely unrealistic. REs that are not supported by theory or by data from additional methods of inquiry are not equipped to address the abovementioned issues. [1]

[1] It is important to mention that REs using a simple difference in difference estimate of net impact (ex ante versus ex post) are more prone to error than designs that rely on multiple waves of measurement within participant and control groups.
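To make the timing problem concrete, the following sketch (hypothetical numbers, not taken from any of the studies cited) simulates a treated group whose outcome follows a J-curve, an initial dip followed by later gains, and a control group that only follows a common secular trend. A simple difference in difference estimate computed at an early ex post measurement point then gives a very different, and misleading, picture of the net effect than the same estimate computed later on.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500  # households per group (hypothetical)

def treated_extra(t):
    # Hypothetical J-curve: an initial dip (e.g. income foregone while
    # experimenting with a new practice) followed by gains once adoption
    # and diffusion take hold.
    return -2.0 * np.exp(-(t - 1.0) ** 2) + 5.0 / (1.0 + np.exp(-(t - 3.0)))

def common_trend(t):
    # Secular trend affecting participants and controls alike.
    return 0.3 * t

def did_estimate(t_post):
    """Simple difference in difference estimate at ex post measurement time t_post."""
    y_t_pre = rng.normal(0.0, 1.0, n)
    y_c_pre = rng.normal(0.0, 1.0, n)
    y_t_post = treated_extra(t_post) + common_trend(t_post) + rng.normal(0.0, 1.0, n)
    y_c_post = common_trend(t_post) + rng.normal(0.0, 1.0, n)
    return (y_t_post.mean() - y_t_pre.mean()) - (y_c_post.mean() - y_c_pre.mean())

for t in (1.0, 3.0, 6.0):  # early, intermediate and late ex post measurement
    print(f"measurement at t={t}: DiD estimate {did_estimate(t):5.2f}, "
          f"true effect at t {treated_extra(t):5.2f}")
```

Under these assumptions the early estimate is negative while the later ones approach the long-run effect, which illustrates why a single ex ante/ex post comparison can over- or underestimate impact depending on when it is taken.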


4. Challenge 2: Attribution 'versus' explanation

4.1. Addressing the attribution challenge with randomized experiments

The OECD-DAC definition of impacts mentioned above refers to the 'effects produced by' an intervention, stressing the attribution aspect. This implies an approach to impact evaluation which is about attributing impacts rather than 'merely' assessing what happened. [1] Multiple factors can affect the livelihoods of individuals or the capacities of institutions. For policy makers as well as for stakeholders it is important to know what the added value of the policy intervention is apart from these other factors. The attribution problem is often referred to as the central problem in impact evaluation. The central question is: to what extent can changes in outcomes of interest be attributed to a particular intervention? [2] Attribution refers both to isolating and accurately estimating the particular contribution of an intervention and to ensuring that causality runs from the intervention to the outcome. In most contexts, adequate empirical knowledge about the effects produced by an intervention requires at least an accurate estimate of what would have occurred in the absence of the intervention (the counterfactual) and a comparison with what has occurred with the intervention implemented.

Box 1 – The attribution problem

Analyzing attribution requires comparing the situation 'with' an intervention to what would have happened in the absence of the intervention, the 'without' situation (the counterfactual). Such a comparison of the situation 'with and without' the intervention is challenging since it is not possible to observe how the situation would have been without the intervention; the counterfactual has to be constructed by the evaluator. The counterfactual is illustrated in the figure below. The value of a target variable (point a) after an intervention should not be regarded as the intervention's impact, nor is the impact simply the difference between the before and after situation (a-b, measured on the vertical axis; the dotted arrow).
The net impact (at a given point in time) is the difference between the target variable's value after the intervention (a) and the value the variable would have had if the intervention had not taken place (c).

Figure 2 - Graphical display of the net impact of an intervention (the figure plots the value of the target variable over time, marking the 'before' value b, the observed 'after' value a, and the counterfactual value c).

[1] Goal achievement can be assessed without the need for attribution analysis, as no differentiation is made between whether changes are due to the intervention or to other factors.
[2] Attribution is about causal relationships between intervention and effects. How and to what extent the attribution challenge has been addressed in an impact evaluation determines the internal validity of the findings.
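Using the notation of the figure, the distinction made in Box 1 can be written out compactly. The second line is the standard difference in difference estimator referred to in sections 3.3.2 and 3.3.3; it recovers the net impact only under the (strong) assumption that participants and controls would have followed the same trend in the absence of the intervention:

\[
\text{naive before-after estimate} \;=\; a - b,
\qquad
\text{net impact} \;=\; a - c .
\]

\[
\widehat{\Delta}_{DiD} \;=\; \left(\bar{Y}^{\,\text{participants}}_{\text{post}} - \bar{Y}^{\,\text{participants}}_{\text{pre}}\right) \;-\; \left(\bar{Y}^{\,\text{controls}}_{\text{post}} - \bar{Y}^{\,\text{controls}}_{\text{pre}}\right)
\]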


Usually interventions target particular groups, and in addition self-selection effects may occur as more motivated or socially well-positioned individuals or groups gain access to a particular program. Consequently, one cannot simply compare the situation of participants over time with that of the population at large: estimates would be distorted by this selection bias. The safest way to avoid selection effects is a randomized selection of intervention group and control group before the experiment starts. When the experimental group and the control group are selected randomly from the same eligible population, in sufficiently large samples both groups will have similar average characteristics (except that one group has been subjected to the intervention and the other not). Consequently, in a well-designed and correctly implemented RE, a simple comparison of average outcomes in the two groups can adequately resolve the attribution problem and yield accurate estimates of the net effect of the intervention on a variable of interest: by design, the only systematic difference between the two groups is the intervention.

This powerful feature of REs explains the increasing popularity of the methodology. With a long tradition in medicine and public health and a much younger tradition in policy fields such as education and crime and justice (Leeuw, 2009), REs are now also increasingly applied in the context of (social) development interventions. Randomization is not always feasible, but a wide variety of quasi-experimental designs is available to ensure a high internal validity of findings. Basically, designs differ in terms of the technique used for creating comparable groups (e.g. regression discontinuity, propensity score matching, pipeline approaches) as well as in terms of the structure of periodic measurement within participant and control groups (e.g. simple ex ante - ex post participant group design, interrupted time series design; see for example Campbell, 1969; Cook and Campbell, 1979; Shadish et al., 2002; Morgan and Winship, 2007; for development interventions see for example Bamberger et al., 2006). [1]

4.2. Internal versus external validity

REs are typically equipped to address the question of what works within the particular confines of the experiment; they are strong on internal validity. However, policymakers are often interested in other questions (Heckman, 1992; Heckman et al., 1997; Ravallion, 2009b). Does this intervention also work in other regions or contexts?
What happens when we go to scale with a particular intervention? What are the determinants of effectiveness? Another typical question that might not be easily answered with REs or quasi-experiments is whether and how people are differently affected by an intervention. [2] This question can be answered with additional quantitative data analysis, if (large) data sets are available which allow for extensive modeling of confounders and interaction effects. Alternatively (or in addition), many qualitative methods such as case study methods can help evaluators to study in detail how interventions work differently in different situations.

[1] Most quasi-experimental techniques are useful when selection characteristics are known and can be measured. Even in the case of unobservable characteristics which might differ between groups and affect intervention effects, this may not be a problem. If these characteristics are time invariant, in principle they can be controlled for by double differencing or multiple data points in time.
[2] This is an issue that is closely related to the idea of external validity. If one knows how an intervention affects groups of people in different ways, then one can more easily generalize findings to other similar settings.
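To make the attribution logic concrete, the following is a minimal sketch of the two estimators referred to above: the simple comparison of average outcomes that a well-implemented RE licenses, and the double differencing mentioned in footnote [1]. The data are simulated and all numbers are hypothetical; the sketch illustrates the principle, not any study cited here.

    import numpy as np

    rng = np.random.default_rng(42)
    n = 2000

    # Hypothetical baseline outcome with time-invariant household heterogeneity
    baseline = 50 + rng.normal(0, 10, n)

    # Randomized assignment: independent of all household characteristics
    treated = rng.random(n) < 0.5

    true_effect = 5.0
    # Follow-up outcome: common time trend + treatment effect + noise
    followup = baseline + 2.0 + true_effect * treated + rng.normal(0, 5, n)

    # (1) Simple difference in means at follow-up: unbiased here because of randomization
    diff_means = followup[treated].mean() - followup[~treated].mean()

    # (2) Double difference (difference-in-differences): nets out time-invariant
    #     differences between the groups, which matters when assignment is not random
    did = ((followup[treated] - baseline[treated]).mean()
           - (followup[~treated] - baseline[~treated]).mean())

    print(round(diff_means, 2), round(did, 2))  # both close to the true effect of 5

With non-random assignment the simple difference in means would also pick up pre-existing differences between the groups, whereas the double difference would still remove any time-invariant part of those differences.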


Without further information, results of a RE cannot be generalized beyond the experimental setting, as important confounders that are controlled for in the experiment are not revealed by the experiment itself. Moreover, "[t]he people who are normally attracted to a program, taking account of the expected benefits and costs to them personally, may differ systematically from the random sample of people who were included in the trial" (Ravallion, 2008: 17). Critics of the 'alleged superiority' of REs argue that internal validity is not typically the question that policy makers are interested in and argue for more attention to external validity of findings, an aspect on which REs enjoy no comparative advantage (e.g. Rodrik, 2008; see Shadish et al. (2002) for a discussion on experiments and external validity). They argue that in order to be able to generalize findings from a RE to other settings, one needs to know how an intervention works, what are the determinants of the processes of change (possibly induced by an intervention), how an intervention might affect people in particular circumstances in different ways and what the time path and nature of the changes might be. In order to answer these types of questions (and others, see Ravallion, 2009b) one needs an informative explanatory theory (e.g. based on research within the social sciences) and other (additional) methods of data collection and analysis (e.g. Van der Knaap et al., 2008; for the case of development interventions see Deaton, 2009).

'Randomistas' have asserted that external validity of findings can be enhanced by doing a series of experiments in different contextual settings (Banerjee and Duflo, 2008). Yet as Ravallion argues, "the feasibility of doing a sufficient number of trials – sufficient to span the relevant domain of variation found in reality for a given program, as well as across the range of policy options – is far from clear. The scale of the randomized trials needed to test even one large national program could well be prohibitive" (Ravallion 2008: 19). Moreover, as Rodrik argues, "[f]ew randomized evaluations – if any – offer a structural model that describes how the proposed policy will work, if it does, and under what circumstances it will not, if it doesn't. Absent a full theory that is being put to a test, it is somewhat arbitrary to determine under what different conditions the experiment ought to be repeated." (Rodrik, 2008: 21-22; italics added).
A similar point is made by Deaton, who asserts that "repeated successful replications of a 'what works' experiment is both unlikely and unlikely to be persuasive. Learning about theory, or mechanisms, requires that the investigation be targeted towards that theory, towards why something works, not whether it works" (Deaton, 2009: 31).

A special type of generalizability concerns the inference from small-scale pilots to up-scaled programs with broad coverage. In development contexts REs are often implemented on small-scale pilots. While the outcomes of REs are considered as useful inputs in the decision to go to scale, REs alone are insufficient to support such a decision. [1] In case of scaling-up, conditions at both the institutional level (of implementing agencies) as well as the beneficiary level invariably change, making the new intervention altogether different from the small-scale pilot it was derived from (Deaton, 2009). At the institutional level, for example, the organizational setup might change completely, with new people with divergent capacities and interests managing and implementing the new intervention, or corrupt public officials suddenly being drawn to a scaled-up intervention which has appeared on their radar. In addition, new contextual conditions and characteristics of target groups may confound intervention effects, this new heterogeneity not being covered by the original small-scale experiment (Ravallion, 2009a).

[1] In fact, there are good examples of REs convincing policymakers to scale up and replicate interventions. For example, the spread of conditional cash transfer programs over Latin America is in part fuelled by the evidence produced by REs.


On the other hand, it can be argued that a RE which has covered a sizeable sample can be quite informative on the likelihood that the same policy instrument will produce similar results when scaled up within the same population the RE sample was taken from. [1]

4.3. Interventions as theories

Most critics of REs subscribe to the point of view that there is nothing that precludes REs from being more theory-based or -driven. Of course, in such cases it would be the combination of theory (or theories) and RE that would increase the external validity (and also internal validity) potential of REs rather than the RE format alone. Recently, more examples of theory-based REs can be found in the literature (see for example Banerjee (2005) or Banerjee and Duflo (2008) for illustrations). [2] Moreover, as argued by Cook (2006), REs are never completely theory-empty. [3] Among other things, they require substantive theory as guidance for selecting or constructing suitable measures of effects.

While REs in principle are not or need not be theory-empty, the abovementioned critique of Deaton and others on REs remains valid. REs are geared towards the question of whether an intervention works and do not shed much light on how and why interventions work. In order to look at the latter question, evaluation researchers need to look into the black box between output, outcome and impact indicators and reconstruct the underlying causality (Pawson and Tilley, 1997). Two types of theory are of importance here. A reconstructed intervention theory should adequately represent the main assumptions of decision makers, target groups and other stakeholders about causal pathways running from intervention to effects (see for example Chen, 1990; Rogers et al., 2000; Leeuw, 2003). The intervention theory can be further enriched or tested by taking into account existing explanatory theories (from the social and behavioral sciences) on intervention contexts, processes of change and potential effects.

The idea that existing explanatory theories can improve the quality of evaluations is not a new one and has been discussed quite extensively in the literature (e.g. Riggin, 1990; Lipsey, 1993; Donaldson and Lipsey, 2006).
For example, Riggin (1990) discusses how Etzioni's theory of compliance can improve the quality of an evaluation of an employment assistance program both at the stage of conceptualization as well as at the interpretation stage of an evaluation.

A first important role for 'theory' lies in the conceptualization of an evaluation question. Substantive theories help to point the evaluator toward the relevant constructs and relationships between these constructs in order to make useful abstractions of the reality of a policy intervention, its intended effects and the wider context in which it is embedded, aspects which subsequently can be tested through empirical research. Substantive theory from past (evaluation) research to some extent can help to anticipate these effects, which subsequently can be taken into account by means of additional data collection.

[1] Thereby assuming that first a random sample was taken from the population, with subsequent randomized allocation of the intervention to treatment and control groups within the sample.
[2] One example is Duflo and Hanna (2008) on using an experiment to develop a model of teacher behavior.
[3] Theory-empty refers here to a situation in which the causal relationships between intervention outputs, outcomes and impact are not made explicit in a theoretical framework. Such a framework may be based on reconstructions of stakeholders' assumptions and/or existing research relevant to the causal pathways in question. For example, an analysis in which a claim to a relationship or even a causal relationship between two variables is based on statistical association only (whether in an experimental setting or not), can be called theory-empty (for a discussion see for example Coleman, 1986).


A second important role for theory lies in reinforcing the causal analysis, the analysis of how and to what extent changes in target variables can be attributed to a policy intervention. Relevant substantive theories can shed light on the nature of the causal relationships between interventions and (un)intended processes of change and help to rule out rival explanations for changes in target variables. [1] Finally, there is an important role for theory in the interpretation of evaluation findings. Theory can provide a useful framework for helping us to understand why certain changes have come about or provide insights into the relevant (contextual) variables which are likely to influence patterns of results across settings.

4.4. The law of comparative advantages: theory-based and multi-method evaluation

My account so far has gradually led us to the realization that REs are potentially strong on internal validity yet miss out on providing (strong) evidence on other aspects such as, for example, unanticipated and long-term effects and the external and construct validity of findings. This brings us to the important realization that impact evaluations by default cannot and should not rely on one methodological design only. [2] The remainder of this section will add some more illustrative power to this argument.

As discussed elsewhere (Leeuw and Vaessen, 2009), the intervention theory constitutes the guiding framework of impact evaluations. Typically, multiple causal assumptions emerge that require further testing in order to be able to claim whether the intervention has induced certain changes, and in what circumstances. However, not all causal assumptions can be tested with the same methodological design or specific method. For example, consider the effects of subsidies for sustainable land use on biodiversity. One can envisage a useful RE which tests the effects of subsidies on the land use behavior of farmers. The subsequent causal step from land use behavior to biodiversity cannot be tested so easily by means of a RE. One of the main reasons is that biodiversity depends on plot-specific variables (e.g. the type of vegetation on a certain plot of land) as well as on determinants at higher levels such as the connectivity between systems of land use, the proximity of certain biospheres, the geographical location with respect to migration routes of birds and other species, and so on.
Moreover, as argued above, there are other challenges such as the non-linearity, uncertainty and sustainability of changes.

Ideally, the intervention theory should provide guidance on the choice of causal assumptions to be analyzed (see Weiss (2000) for a discussion on this) and, correspondingly, different designs and methods can be used to assess specific assumptions. In addition, other complementary perspectives on the use of multiple methods can be discerned in the literature (for a general discussion see for example Tashakkori and Teddlie, 2003).

[1] An example of such an approach is Scriven's (2008) General Elimination Methodology. See also Pawson's (2006) discussion on adjudication between rival theories. In an evaluation of the impact of social funds, Carvalho and White (2004) reconstruct a theory and an 'anti-theory', which are both put to the test empirically, in order to arrive at a better understanding of the nature of change processes.
[2] "[I]t is trivial to argue about whether evidence-based research should be multi-method or not. Even causal research is deepened by learning about non-causal issues, such as what the substantive theory behind a program design is, who gets to participate in it or not, how well the program is implemented, who gets greater exposure, and what the program costs. So nearly all causal studies require multiple methods that complement each other. Multi-method, complementary research is desirable even when a causal claim is centrally at issue" (Cook, 2006: 1).


Let us briefly illustrate three 'logics' for multi-method approaches in impact evaluation. The first starts out from Campbell's framework of different types of validity. As suggested earlier in this paper, specific methods have comparative advantages in ensuring a high degree of internal/external/construct validity. Consider a similar example as introduced above, i.e. an intervention that provides monetary incentives and training to farmers in order to promote land use changes leading to improved livelihood conditions as well as other effects. Simplified, a comprehensive methodological design could be the following:

E.g. a randomized experiment with the use of survey data on participant and control groups could be used to assess the effectiveness of different incentives on land use change and/or socio-economic effects of these changes (potentially strengthens internal validity of findings);
E.g. further survey data (multivariate) analysis and case studies could tell us how incentives have different effects on particular types of farm households, as illustrated in the sketch at the end of this subsection (potentially strengthens internal validity and increases external validity of findings);
E.g. targeted syntheses of existing research, semi-structured interviews and focus group conversations could tell us more about the nature of effects in terms of production, consumption, poverty, environment etc. (potentially enhances construct and internal validity of findings).

A second framework of multi-method evaluation has been labeled the 'shoestring' approach (see Bamberger et al., 2004). It refers to a number of scenarios of multi-method approaches which are adapted to particular conditions of budget, time and data constraints. These scenarios all rely on an intervention theory model as a basis for different methodological strategies to simplify evaluation design (e.g. in relation to textbook REs), reduce costs of data collection and analysis, and integrate qualitative and quantitative methods. [1]

A third illustration of a multi-method perspective is Woolcock's account of different options to gain a better understanding of the nature of causal pathways. "There are at least three entry points, each of increasing degrees of sophistication.
The first is simply raw experience: seasoned project managers should have a good sense of how long and in what form the impacts associated with a particular project in a particular context should take to materialise. […] Astute intuition and seasoned field experience combined with solid theory should provide a second avenue: the very essence of a good theory should be that it provides a sense and a justification of the conditions under which, and mechanisms by which, certain project outcomes should be expected. […] Thirdly, the regular collection of empirical evidence can itself be a basis for determining the shape of the project's impact trajectory, and is ultimately (for researchers at least) the most defensible basis on which to do so" (Woolcock, 2009: 7-8).

[1] See also Bamberger et al. (2009) for further discussion on mixed methods in the context of impact evaluation.
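As a minimal sketch of the kind of multivariate analysis referred to in the second element of the design above, the following fragment estimates a treatment effect that is allowed to differ across household types through an interaction term; all variable names and numbers are hypothetical.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1500

    # Hypothetical survey data: treatment status and one household characteristic
    treated = rng.random(n) < 0.5
    smallholder = rng.random(n) < 0.4        # e.g. farms below two hectares

    # In this fabricated data the incentive works mainly for smallholders
    outcome = (10
               + 2.0 * treated
               + 1.0 * smallholder
               + 4.0 * treated * smallholder
               + rng.normal(0, 3, n))

    # OLS with an interaction term: outcome ~ treated + smallholder + treated*smallholder
    X = np.column_stack([np.ones(n), treated, smallholder, treated * smallholder])
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)

    base_effect, extra_effect_for_smallholders = coef[1], coef[3]
    print(round(base_effect, 2), round(extra_effect_for_smallholders, 2))  # roughly 2 and 4

The interaction coefficient captures precisely the kind of heterogeneity that matters for external validity: it indicates for whom, and by how much, the intervention makes a difference.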


5. Challenge 3: Impact evaluation in practice

5.1. Threats to attribution analysis in experimental settings

The main threats to the internal validity of findings from REs (and certain quasi-experiments) are well-known and widely discussed (Campbell and Stanley, 1963; Campbell, 1969; Cook and Campbell, 1979; Shadish et al., 2002):

Selection bias: refers to the problem of under- or overestimating intervention effects due to uncontrolled differences between different (treatment) groups that would lead to differences in effect variables even if none of the groups had received benefits.
Contagion (or treatment diffusion): refers to the problem that groups which are not supposed to be exposed to (or receiving) certain benefits are in fact benefiting from an intervention in one or more ways: by directly receiving the benefits from the intervention, by indirectly receiving benefits through other participants, or by receiving similar benefits from other organizations.
Aging/Maturation: An effect that arises when participants grow older, wiser, more tired, more self-confident due to factors other than the intervention.
History: The effect is caused by some event other than the intervention occurring during the same time period as the intervention.
Testing: The pre-test measurement causes a change in the post-test measure.
Instrumentation: The effect is caused by a change in the method of measuring the outcome.
Regression to the Mean: Where an intervention is implemented on units with unusually high scores (e.g. unusually high student performance scores), natural fluctuation will cause a decrease in these scores on the post-test which may be mistakenly interpreted as an effect of the intervention.
Attrition: Changes in effect measurements due to drop-outs.
Causal Order: It is unclear whether the intervention preceded the outcome.
Behavioral responses: Several unintended behavioral responses not caused by the intervention or 'normal' conditions might inhibit the reliability of comparisons between groups and hence the ability to attribute changes to the intervention. An example is expected behavior or compliance behavior: beneficiaries' behavior is not caused by intervention outputs (e.g. a subsidy) but due to reasons of compliance with a (formal/informal) contract between beneficiary and implementing organization, due to the (longstanding) relationship with a particular organization (delivering intervention outputs), or due to certain expectations about future benefits from the organization (not necessarily the intervention).

While each of these issues might be potential problems that render claims on attribution less valid, the fact that they have been systematically identified and addressed in the literature (see for example Cook and Campbell, 1979) enhances the scientific rigor of experimental and quasi-experimental designs.

A key underlying determinant of the internal validity of findings from REs is the extent to which those managing the experiment are capable and willing to safeguard the design from the threats to validity described above. This is certainly the case for selection bias and treatment diffusion issues; well-designed REs may be safeguarded from these problems. More complicated is the potential threat from unintended behavioral responses among participant or control groups.


As argued by several authors (e.g. Deaton, 2009; Cohen and Easterly, 2009), randomization can lead to hard feelings between the different groups involved in the experiment. "[S]ubjects may fail to accept assignment, so that people who are assigned to the experimental group may refuse, and controls may find a way of getting the treatment, and either may drop out of the experiment altogether. The classical remedy of double blinding, so that neither the subject nor the experimenter know which subject is in which group, is rarely feasible in social experiments" (Deaton, 2009: 36). In some cases researchers cannot (and indeed should not) withhold the information from stakeholders that they are part of an experiment. Banerjee and Duflo (2008) provide several examples of mechanisms (e.g. lotteries) which facilitate the collaboration of local populations in an experiment. Yet, even in situations where institutions and target groups agree to randomized allocation of benefits, it is well-known in the literature (e.g. Campbell, 1969; Shadish et al., 2002) that experimentation may lead to a range of unintended behavioral responses within the participant and control groups which can affect the validity of findings resulting from an experiment. An example of unintended behavioral responses comes from the famous PROGRESA study (see for example Skoufias and McClafferty, 2001). "One issue with the explicit acknowledgement of randomization as a fair way to allocate the program is that implementers will find that the easiest way to present it to the community is to say that an expansion of the program is planned for the control areas in the future (especially when it is indeed the case, as in phased-in designs). This may cause problems if the anticipation of treatment leads individuals to change their behavior. This criticism was made in the case of the PROGRESA programs, where control villages knew that they were going to eventually be covered by the program" (Banerjee and Duflo, 2008: 22).

5.2. Other design and implementation challenges

A key issue regarding the success of REs in practice revolves around the ethics and feasibility of randomization in practice and the corresponding reactions by stakeholders. Experiments can cause resentment as people do not understand or support the differences in benefits allocated to different groups. Random allocation in many cases is also unacceptable to policy makers (e.g. Bamberger and White, 2007; Ravallion, 2008). Interventions are often intended to be targeted to specific groups and outreach is a direct concern to policy makers.
Randomization would limit outreach and at the same time is often considered unethical in view of the pressing needs of target populations. [1]

Banerjee and Duflo (2008: 22) argue that collaboration with institutions in developing countries is becoming less of an issue. "Randomization that takes place at the level of location can piggy-back on the expansion of the organization's involvement in these areas limited by budget and administrative capacity, which is precisely why they agree to randomize. Limited government budgets and diverse actions by many small NGOs mean that villages or schools in most developing countries are used to the fact that some areas get some programs and others do not and when a NGO only serve some villages, they see it as a part of the organization's overall strategy. When the control areas [are] given the explanation that the program had only enough budget for a certain number of schools, they typically agree that a lottery was a fair way to allocate it---they are often used to such arbitrariness and so randomization appears transparent and legitimate".

[1] This often provides a compelling argument for using quasi-experimental designs. For example, regression discontinuity is very useful when targeting is based on clear and easy to measure criteria of selection.
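As an aside on the quasi-experimental alternative mentioned in the footnote above, the following is a minimal sketch of a sharp regression discontinuity comparison; the eligibility score, cutoff and bandwidth are invented for illustration, and a real application would involve more careful specification and robustness checks.

    import numpy as np

    rng = np.random.default_rng(7)
    n = 5000

    # Hypothetical targeting variable (e.g. a poverty score); benefits go to scores below 50
    score = rng.uniform(0, 100, n)
    treated = score < 50.0

    # Outcome depends smoothly on the score, plus a treatment effect of 3 at the cutoff
    outcome = 20 - 0.1 * score + 3.0 * treated + rng.normal(0, 2, n)

    # Simplest sharp regression discontinuity estimate: compare mean outcomes in a
    # narrow band just below and just above the cutoff
    bandwidth = 5.0
    just_below = (score >= 50 - bandwidth) & (score < 50)
    just_above = (score >= 50) & (score < 50 + bandwidth)

    rd_estimate = outcome[just_below].mean() - outcome[just_above].mean()
    print(round(rd_estimate, 2))  # close to 3, plus a small bias from the slope within the band

In practice one would fit local regressions on either side of the cutoff and vary the bandwidth; the point here is only that identification comes from the allocation rule itself rather than from randomization.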


Another way to enhance the goodwill among implementing agencies is to forge a longstanding relationship between the latter and RE researchers in which a series of experiments will constitute the basis of a cumulative process of learning. It still remains to be seen, however, what the costs and benefits of such a relationship would be in divergent institutional contexts, especially in view of the previously discussed issues. Institutional willingness to undertake a RE is not a black and white issue, as interventions typically comprise multiple institutional partners and multiple layers of management, from headquarters to field level. The idea of institutional willingness is also closely related to institutional capacities and incentives. REs are intrinsically linked to intervention implementation processes, and the question to what extent well-trained researchers are de facto present and able to ensure the quality control of experimental conditions is an empirical one that merits further inquiry. There is a marked difference between an experienced research team undertaking REs (as is the case for most of the published work on REs in development) and the idea of mainstreaming REs in the design of selected projects of donor and developing country agencies' portfolios. In the latter case invariably not all the tasks pertaining to the design and implementation of a RE are managed by experienced and motivated researchers.

Apart from institutional capacities, willingness to collaborate and ethics, other challenges remain. A first obvious condition for REs is the active involvement of researchers or evaluators in the intervention design and implementation phase. This involvement is essential for baseline data collection as well as quality control of randomization. In practice however, many impact evaluations are commissioned after an intervention has been implemented and baseline data continue to be a problem (Bamberger and White, 2007). Although preferably double difference (participant-control group comparisons over time) designs should be used, it is more usual that impact assessments are based on less rigorous – and reliable – designs, where baseline data are reconstructed or collected later during the implementation phase, or baseline data are collected only for the treatment group, or there are no baseline data for either treatment or control group (for options on how to address these constraints see Bamberger et al., 2004; Bamberger et al., 2006; or White and Bamberger, 2007).

A second aspect is the costs of doing REs. Early experiences of REs (and quasi-experimental studies) in development by the World Bank turned out to be rather costly (OED, 2005), but these studies were often quite ambitious in scope. More recent experiences seem to suggest that REs do not necessarily have to be more expensive than other non-experimental observational studies that are based on original fieldwork with multiple data points in time. However, given the narrow focus of REs, when large programs with multiple intervention activities need to be evaluated, REs need to compete with less expensive non-experimental methodologies which can cover a broader scope of activities with less budget. Proponents of REs will need to justify whether the potential gains in terms of the high internal validity of findings delivered by an RE will be worth the investment given the loss in scope. Alternatively, adversaries of experimentation or others that choose scope over depth will have to justify that alternative methodological designs adequately address crucial issues such as attribution.

A third issue concerns the quality of the data. Bamberger and White (2007) argue that problems of data quality, although relevant for any type of methodological design, might be particularly problematic in case of REs as they rely on a limited number of indicators. Banerjee and Duflo (2008) contest this idea.


In their view, if data is being collected especially for the purposes of a RE, then, given the limited number of observations that are usually necessary for reliable estimates, researchers are able to dedicate special attention to data collection and measurement issues. Ensuring high data quality is particularly challenging in rural contexts in developing countries. [1] Surveys continue to be the main instruments generating impact evaluation data. Potential measurement errors due to problems of recall, sensitivity of certain topics, intercultural communication, and translation errors are just a few of the problems that affect data quality (see for example De Leeuw et al., 2008; Mikkelsen, 2005; Bamberger et al., 2004; Bamberger, 2000). Moreover, surveys may not be appropriate for addressing sensitive topics (see for example Bamberger et al., 2009) such as domestic violence, household income or local norms, and in these cases are more likely to generate unreliable data. Unfortunately, data quality is not as sexy a topic as methodological design, especially if the latter is the acclaimed basis for delivering 'rigorous scientific evidence'.

[1] For a rather critical perspective on data quality in (rural) developing country contexts, see Gill (1993).


6. Some lessons for randomized experiments and impact evaluation in general from the perspective of a 'non-randomista'

The controversy surrounding the promotion and application of REs in development has led to a sense of polarization in the development policy and evaluation community. As some proponents claim epistemological supremacy of REs (with respect to attribution), the counter reaction among others has been rejection. Needless to say, such extreme positions are counterproductive to reaching a goal that is commonly endorsed: to learn more about what works and why in development. Polarization leads to 'argument mining', with proponents bringing up (arguably valid) arguments in defense of REs while adversaries pick their favorite (and arguably valid) arguments against REs. Clearly, this is not the way forward on the path to knowledge growth. If one explores the growing body of evidence in development generated through REs, one cannot deny the positive (though divergent) direct and indirect benefits to knowledge generation about what works. At the same time, as acknowledged by most scholars in the literature, the applicability of REs is limited to certain contexts. Moreover, as illustrated previously, the range of potential challenges (including different questions) to be addressed in an empirical exercise of assessing what works and why in development intervention is too broad to be adequately addressed by REs only. By presenting a diverse array of methodological and conceptual perspectives on impact evaluation this paper has illustrated potential benefits but also limitations as well as complementary perspectives to REs. The implicit lesson is twofold. First of all, the question 'to randomize or not to randomize' is overrated in the current debate. Other challenges demand the attention of policymakers and evaluation researchers alike. Second, 'do not throw out the baby with the bath water'. There is a risk that the current popularity of REs in certain policy circles might lead to a backlash. Too high expectations of REs may quicken their demise.

An important barrier to RE application is that policy-driven impact evaluations, i.e. impact evaluations commissioned by funding or implementing agencies, often favor scope over depth. With a limited budget, evaluators are forced to develop plausible statements on impact over a range of intervention activities in divergent contexts.
In such cases, a tension between accountability (implying a coverage of most or all of the funded activities) and learning (giving more attention to one particular intervention or type of intervention) may exist. [1] Even if the applicability range of REs broadens in the future due to new experiences (e.g. experiments at the institutional level), aspects such as implementation challenges (e.g. the ethics and logistics of doing REs but also the willingness among policy makers, implementing agencies and other stakeholders to transform intervention formats into randomized experiments), budget allocation priorities (e.g. scope versus depth) and other limitations will inevitably restrict RE-type evaluations to being a minority practice in the impact evaluation business (see below).

The realization that there are inherent limitations in the applicability of REs should not be an argument against the further promotion of REs. Nor should the range of threats to validity that can affect the analytical strength of an RE constitute an argument against REs. [2]

[1] See CGD (2006) for a broader discussion on the incentive problems that explain the limited investments of donors in REs.
[2] In fact, the systematic identification and discussion in the literature of threats to the (internal) validity of REs strengthens the scientific basis of the methodology. As such, it should not be considered as an argument against REs. However, caution is in order when talking about the possibility of mainstreaming REs in the intervention cycle of an agency. I remember a recent debate in a multilateral donor organization where there was talk of mainstreaming REs in the design of projects. It is very likely that such experiments will not benefit from the same level of quality control present in most (research-driven) RE studies which feature so prominently in the literature.


I briefly discuss a few points of interest here. First, experimentation should not be perceived as an anomaly in development intervention. The policy field of education in the Netherlands is in this respect not much different from the policy field of agricultural technologies in Latin America. From the former perspective, Borghans (2009) argues that teachers continuously experiment with teaching methods, whereas the target group, students, are continuously receiving different 'treatments', for example simply by going to different schools. Similarly, in Latin America (or elsewhere), farmers and development agencies alike continuously experiment with new techniques and new intervention activities to achieve particular desired objectives. [1] If one considers a specific sample of farmers or agencies, at any given point in time, one can identify a range of similar yet different practices aimed at boosting productivity per unit of land given particular resource constraints. Observation over time, in combination with systematic variation in the application or promotion of certain techniques, is common behavior among farmers, cooperatives, NGOs or state agencies. A RE is a more systematic approach to experimentation than what most stakeholders are used to. As argued by Borghans (2009) from the perspective of education in the Netherlands, if we take this experimenting attitude as a given, then one would expect it to become more acceptable and desirable for decision makers and target groups to organize and participate in more systematic experiments, such as REs. In the context of development interventions Banerjee and Duflo (2008) talk about developing long-term working relationships between researchers and development agencies. In such conditions it is possible to show that a RE is much more efficient in proving whether something works than much of the haphazard experimentation that goes on in the daily practice of target groups and implementing agencies. And it is in such cases that the comparative advantage of a RE can truly blossom: it magnifies heterogeneity in 'treatment' by introducing a clear comparative perspective between those with and without an intervention and it reduces bias in estimation of effects through the principle of randomization.
These are two powerful features that help us to identify efficiently what works for a particular group in a particular context.

Now, in order to go from a specific setting to a more generalizable conclusion about effectiveness we need theory. To illustrate this, consider the adoption of certain organic farming practices. We know from research that knowledge and labor substitute for capital. In other words, fewer inputs are bought on the market in exchange for an increased input of labor and knowledge. Peasant economics (e.g. Ellis, 1988) and the diffusion of innovation literature (e.g. Rogers, 2003) teach us that the opportunity cost of labor (next to a range of other variables, and contingent upon among other things the type of crop) is an important explanatory variable of smallholder adoption behavior. With this information in mind we can test whether this theory holds for a specific farmer population or region. For example, we may set up a clustered RE with different samples in several regions which mainly differ in the opportunity cost of labor yet are similar in other potential explanatory variables (as derived from theory). While this example may not be watertight, it can prove to be very informative on the effectiveness of certain policy instruments (e.g. subsidies, training) in promoting the adoption of organic practices and at the same time test whether the opportunity cost of labor is a decisive factor in adoption behavior. Of course, the stratification may be a little more complex as more explanatory variables (derived from theory) may determine the setup of the series of REs. The core idea is that in this way REs may more effectively contribute to existing theories on the determinants of adoption processes.

[1] However, smallholder farmers are often risk averse and tend to experiment on a small scale. Once a particular new practice has demonstrated its pay-off, experimentation may lead to broader adoption.
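A minimal sketch of how such a theory-driven, clustered design could be set up is given below; the region names, the number of villages and the classification by opportunity cost of labor are invented purely for illustration.

    import numpy as np

    rng = np.random.default_rng(3)

    # Hypothetical strata defined by the theoretically relevant variable
    regions = {
        "region_A": "low opportunity cost of labor",
        "region_B": "low opportunity cost of labor",
        "region_C": "high opportunity cost of labor",
        "region_D": "high opportunity cost of labor",
    }
    villages_per_region = 20

    assignment = {}
    for region in regions:
        villages = [f"{region}_village_{i:02d}" for i in range(villages_per_region)]
        # Cluster-randomize within each stratum: half of the villages receive the
        # promotion package (e.g. subsidy plus training), the other half serve as controls
        shuffled = rng.permutation(villages)
        half = villages_per_region // 2
        assignment.update({v: "treatment" for v in shuffled[:half]})
        assignment.update({v: "control" for v in shuffled[half:]})

    # Comparing adoption rates between treatment and control villages within each
    # stratum, and then across strata, speaks both to the effectiveness of the policy
    # instrument and to whether the opportunity cost of labor moderates its effect.
    print(sum(1 for a in assignment.values() if a == "treatment"), "treatment villages")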


In general, it shows that theory may be rigorously tested by means of REs and that theory itself can be a guideline for determining how to set up a series of REs (see also Cohen and Easterly, 2009, for a discussion on the theory-testing potential of REs). Limitations to external validity remain, especially with respect to scaling-up effects. Nevertheless, a theory-driven RE is potentially stronger on external validity than a 'theory-empty' RE. As commented by Deaton (2009) and also Banerjee and Duflo (2008), the number of theory-informed and theory-testing REs is on the increase. This is important as REs without a basis in theory are prone to be weaker not only in terms of external validity, but also the internal validity and construct validity of findings.

The biggest gains from this growing attention for rigorous impact evaluation, this 'new push for objective evidence' as one may call it, are not to be found in the growing body of REs but rather in its spin-offs. Whereas development evaluations used to be largely on process issues or output delivery, the paradigm shift in development policy towards more attention for results has strengthened the belief across the developing world that interventions should not be based on hunches, intuitions or ideologies but on evidence on whether an intervention is likely to make a difference in terms of the desired objectives. A randomized experiment in a way is one of the elegant flagships of this new evolution. As a methodological design it has drawn a lot of attention yet it is unlikely to win the war on its own. Several other trends and opportunities can be noted which in part have been strengthened by the 'randomista' movement. These (partial) 'spin-offs' [1] point at other challenges in impact evaluation. First of all, a RE is just one of the designs based on explicit counterfactual analysis. The number of quasi-experimental studies in development has increased sharply over the last ten years or so. For example, where randomization is not possible or appropriate, regression discontinuity analysis may be used instead, as it relies on a different principle for the definition of groups. [2]
[1] In fact, it is better to think in terms of an association between the growth in REs and the growth in other quantitative methodological designs.
[2] Comparing over time a participant group beneath a certain threshold value of a particular targeting variable (e.g. distance to a road) with a control group just above the threshold value.
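As an aside, the regression discontinuity idea described in footnote [2] can be illustrated with a small simulated example. The sketch below is hypothetical: the 5 km distance-to-road threshold, the sample size and the assumed effect are invented, and the estimator shown is a deliberately naive comparison of means within a narrow band around the cut-off; in practice one would typically fit local regressions on both sides of the threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical targeting rule: households more than 5 km from a road receive
# the intervention (all names and numbers here are illustrative assumptions).
n = 4000
distance_km = rng.uniform(0, 10, n)          # running (targeting) variable
treated = distance_km > 5.0                  # sharp assignment rule
true_effect = 0.8                            # assumed effect on the outcome

# Outcome varies smoothly with distance, plus a jump at the threshold for
# treated units, plus noise.
outcome = 2.0 - 0.1 * distance_km + true_effect * treated + rng.normal(0, 0.5, n)

# Naive sharp RD estimate: compare units in a narrow band on either side of
# the cut-off, where treated and untreated households are most comparable.
bandwidth = 0.5
just_below = (distance_km > 5.0 - bandwidth) & ~treated
just_above = (distance_km < 5.0 + bandwidth) & treated
rd_estimate = outcome[just_above].mean() - outcome[just_below].mean()
print(f"RD estimate near the threshold: {rd_estimate:.2f} (true effect {true_effect})")
```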


This paper has focused on design and implementation issues in impact evaluation, scarcely touching upon the large and expanding body of statistical impact evaluations. Statistical techniques are often used within the framework of (quasi-)experimental designs, but more often than not in non-experimental settings. Comparative advantages of (non-experimental) statistical impact evaluations are, among other things, their broad coverage in terms of the number of individuals, households, districts and regions encompassed by the data sets. The increased availability of data and the growing capacity (in terms of expertise and technological support) to process and analyze these data have enhanced the opportunities for impact evaluation at aggregate (e.g. regional) levels. The issue of the comparative advantage of REs vis-à-vis non-experimental statistical analyses has been raised briefly in this paper. One particular argument is all too often ignored. As a methodological design requiring active manipulation of an intervention, an RE relies on original data, collected specifically for the purposes of the study. In most cases, the evaluation researchers analyzing the data are also involved in the data collection. In qualitative research one also commonly finds a strong link between data collection and analysis. This is not necessarily so in statistical impact evaluation exercises. All too often data analyses are based on data sets constructed by others for other purposes. Econometricians and economists involved in statistical impact evaluations using (mostly) non-experimental data tend to overemphasize 'threats to validity' in the data analysis phase (e.g. specifying the right selection model) and often blatantly ignore (or are ignorant of) problems or biases that may have arisen in the data collection phase. The severed link between data collection and analysis in many impact evaluation exercises is distressing and merits a higher profile in methodological debates.
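To fix ideas about what a multivariate analysis with statistical controls involves, and why its credibility hinges on what was actually measured during data collection, consider the following simulated sketch. It is illustrative only; the variable names, the single observed confounder and the effect size are assumptions. The adjustment removes the selection bias here only because the confounder driving participation happens to be in the data set; no amount of modeling in the analysis phase can compensate for a confounder that was never collected.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative non-experimental data: households with more education are both
# more likely to participate and have better outcomes (selection bias).
n = 5000
education = rng.normal(0, 1, n)                       # observed confounder
participates = (education + rng.normal(0, 1, n)) > 0  # self-selection
true_effect = 1.0
outcome = 2.0 * education + true_effect * participates + rng.normal(0, 1, n)

# Naive comparison of participants and non-participants overstates the effect.
naive = outcome[participates].mean() - outcome[~participates].mean()

# Multivariate analysis with statistical controls: regress the outcome on
# participation and the confounder (ordinary least squares via lstsq).
X = np.column_stack([np.ones(n), participates, education])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
print(f"naive difference: {naive:.2f}, regression-adjusted effect: {coef[1]:.2f}")
```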
Data quality is also a concern in the literature on mixed methods in impact evaluation.[1] Previously, I underlined the importance of mixed methods from the perspective of the comparative advantages of methods. Two other reasons make this area of research particularly relevant to policymakers and evaluation researchers. First of all, practically all evaluation work is multi-method in nature. This is most clearly visible in large program or portfolio evaluations, where approach papers and evaluation matrices specify the range of data collection and analysis tools used to analyze specific questions. A second reason is that the majority of impact evaluations take place under less than ideal circumstances (see Bamberger et al., 2009). Often evaluation researchers are not present in the design phase of interventions; they are called in when an intervention is already in the implementation phase or after completion. Consequently, evaluators often have to resort to baseline reconstruction, secondary data or ex post data only. In addition, budgets often do not permit large sample sizes or elaborate designs with multiple group comparisons. Pressures on scope further limit the chances for evaluators to set up, for example, a solid quasi-experiment. Time pressures may force evaluators to do the work in 'quicker and dirtier' ways than what is needed to appropriately address the attribution issue. Methodological options for mixed method evaluations under less than ideal circumstances do exist; yet, as argued by one of the principal authors on this subject, Michael Bamberger, much remains to be done in terms of developing new methodologies and standards for mixed method evaluations (see for example Bamberger et al., 2004, 2009).

If decision makers want answers to their questions on what works across interventions and across contexts, they may need to shift part of their focus. Instead of seeing administrative categories such as projects or country portfolios as the only principal units of analysis (often mainly for accountability purposes), they need to look more at particular policy instruments or intervention types which recur throughout projects.[2] Consequently, decision makers may learn to appreciate and invest more in rigorous evaluations of particular intervention types (e.g. using REs) and see how these pieces of evidence connect to the macro picture, i.e. looking at the importance of a particular intervention type and the divergent contexts in which it is implemented. Such a focus on particular intervention types and policy instruments can perfectly coexist with more comprehensive approaches to impact evaluation which start out from programs and portfolios, including a variety of strategies and interventions.

[1] For examples see Leeuw and Vaessen (2009) or Bamberger et al. (2009). Karlan (2009) argues the case for more mixed method research in an RE context.
[2] For example, even though microcredit may only be a small component of a particular project or country portfolio of a donor agency, throughout its entire portfolio a sizeable portion of the budget may be allocated to supporting microcredit activities (at different levels).
We know there are no laws in the social sciences (Elster, 2007), yet we also know there are patterns of regularity, or demi-regularities, in individual, social and institutional behavior (Pawson, 2009). Identifying and refining such patterns of regularity requires a particular way of theorizing about interventions: deconstructing or unpacking interventions into their active ingredients, that is, policy instruments linked to contexts linked to behavioral mechanisms (see Pawson, 2006, 2009). It is our mission as development researchers and evaluators to uncover these patterns of (demi-)regularities, as they are the building blocks of knowledge about what works and why across interventions. Intervention theories with a certain degree of external validity are also becoming more and more important in the context of review and synthesis work. 3ie is one of the organizations currently investing in this type of work (for an example on microcredit see Vaessen et al., 2009).[1] We still have a long way to go. One thing is for sure: we need good empirical impact evaluations – which rely on intervention theory, explanatory theory and multiple methods tailored to a specific context – in order to succeed, eventually…

[1] Comparable to the work by institutions such as the Campbell and Cochrane Collaborations on health, education, crime and justice, and social work, based on empirical work from (mostly) OECD countries.
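Purely as an illustration of what 'unpacking' interventions into instrument-context-mechanism configurations might look like when organizing evidence for synthesis work, the sketch below records a few invented configurations and retrieves a simple demi-regularity. The structure and all entries are hypothetical and merely indicative of how such building blocks could be catalogued.

```python
from dataclasses import dataclass

# Hypothetical, simplified way of recording the 'active ingredients' of
# interventions for synthesis work: instrument linked to context linked to
# mechanism linked to outcome. All entries below are invented examples.
@dataclass
class Configuration:
    instrument: str   # policy instrument (e.g. training, subsidy, microcredit)
    context: str      # salient contextual condition
    mechanism: str    # behavioral mechanism assumed to be triggered
    outcome: str      # observed result in the study

evidence_base = [
    Configuration("training", "low opportunity cost of labor",
                  "skill acquisition lowers perceived risk", "adoption increased"),
    Configuration("training", "high opportunity cost of labor",
                  "time constraints dominate", "little or no adoption"),
    Configuration("subsidy", "low opportunity cost of labor",
                  "reduced capital constraint", "adoption increased"),
]

# A demi-regularity is a pattern that recurs across studies, e.g. the same
# instrument working in one type of context but not in another.
for config in evidence_base:
    if config.instrument == "training":
        print(f"{config.context}: {config.outcome}")
```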


References

Baker, J.L. (2000) Evaluating the impact of development projects on poverty, World Bank, Washington D.C.
Bamberger, M. (2000) "Opportunities and challenges for integrating quantitative and qualitative research", in: M. Bamberger (ed.) Integrating quantitative and qualitative research in development projects, World Bank, Washington D.C.
Bamberger, M. and H. White (2007) "Using strong evaluation designs in developing countries: Experience and challenges", Journal of Multidisciplinary Evaluation 4(8), 58-73.
Bamberger, M., J. Rugh, M. Church and L. Fort (2004) "Shoestring evaluation: Designing impact evaluations under budget, time and data constraints", American Journal of Evaluation 25(1), 5-37.
Bamberger, M., J. Rugh and L. Mabry (2006) Realworld evaluation: Working under budget, time, data, and political constraints, Sage Publications, Thousand Oaks.
Bamberger, M., V. Rao and M. Woolcock (2009) "Using mixed methods in monitoring and evaluation: Experiences from international development", MIMEO, World Bank, Washington D.C.
Banerjee, A. (2005) "New development economics and the challenge to theory", Economic and Political Weekly 40, October 1-7, 4340-4344.
Banerjee, A.V. and E. Duflo (2008) "The experimental approach to development economics", NBER Working Paper 14467, Cambridge.
Banerjee, A., S. Cole, E. Duflo and L. Linden (2007) "Remedying education: Evidence from two randomized experiments in India", Quarterly Journal of Economics 122(3), 1235-1264.
Blattman, C. (2008) "Impact Evaluation 2.0", presentation to the Department for International Development, February 14, 2008, London.
Borghans, L. (2009) "Leren over leren", in: R. Rouw, D. Satijn and T. Schokker (eds.) Bewezen beleid in het onderwijs, Ministerie van Onderwijs, Cultuur en Wetenschap, The Netherlands.
Campbell, D.T. (1969) "Reforms as experiments", American Psychologist 24, 409-429.
Campbell, D.T. and J.C. Stanley (1963) "Experimental and quasi-experimental designs for research on teaching", in: N.L. Gage (ed.) Handbook of research on teaching, Rand McNally, Chicago.
Carvalho, S. and H. White (2004) "Theory-based evaluation: The case of social funds", American Journal of Evaluation 25(2), 141-160.
Catley, A., J. Burns, D. Abebe and O. Suji (2008) Participatory impact assessment: A guide for practitioners, The Feinstein International Center, Tufts University, Medford.
CGD (2006) When will we ever learn? Improving lives through impact evaluation, Report of the Evaluation Gap Working Group, Center for Global Development, Washington D.C.
Chambers, R. (1983) Rural development: Putting the last first, Wiley, New York.


Chambers, R. (1995) "Paradigm shifts and the practice of participatory research and development", in: S. Wright and N. Nelson (eds.) Power and participatory development: Theory and practice, Intermediate Technology Publications, London.
Chen, H.T. (1990) Theory-driven evaluation, Sage Publications, Beverly Hills.
Cohen, J. and W. Easterly (eds.) (2009) What works in development? Thinking big and thinking small, Brookings Institution Press, Washington D.C.
Coleman, J.S. (1986) "Theory, social research and a theory of action", American Journal of Sociology 91(6), 1309-1335.
Coleman, J.S. (1990) Foundations of social theory, Belknap Press, Cambridge.
Cook, T.D. (2006) "Describing what is special about the role of experiments in contemporary educational research: Putting the 'Gold Standard' rhetoric into perspective", Journal of Multidisciplinary Evaluation 3(6), 1-7.
Cook, T.D. and D.T. Campbell (1979) Quasi-experimentation: Design and analysis for field settings, Rand McNally, Chicago.
Cousins, J.B. and E. Whitmore (1998) "Framing participatory evaluation", in: E. Whitmore (ed.) Understanding and practicing participatory evaluation, New Directions for Evaluation 80, Jossey-Bass, San Francisco.
Davidson, E.J. (2006) "The RCTs-only doctrine: Brakes on the acquisition of knowledge?", Journal of Multidisciplinary Evaluation 3(6), ii-v.
Deaton, A. (2009) "Instruments of development: Randomization in the tropics, and the search for the elusive keys to economic development", NBER Working Paper 14690, Cambridge.
De Leeuw, E.D., J.J. Hox and D.A. Dillman (eds.) (2008) International handbook of survey methodology, Lawrence Erlbaum Associates, London.
Donaldson, S.I. and M.W. Lipsey (2006) "Roles for theory in contemporary evaluation practice", in: I. Shaw, J.C. Greene and M.M. Mark (eds.) The SAGE handbook of evaluation, Sage Publications, Thousand Oaks.
Duflo, E. and R. Hanna (2008) "Monitoring works: Getting teachers to come to school", NBER Working Paper 11880, Cambridge.
Easterly, W. (2001) The elusive quest for growth: Economists' adventures and misadventures in the tropics, MIT Press, Cambridge.
Ellis, F. (1988) Peasant economics: Farm households and agrarian development, Cambridge University Press, Cambridge.
Elster, J. (2007) Explaining social behavior – More nuts and bolts for the social sciences, Cambridge University Press, Cambridge.
Gill, G. (1993) Ok the data is lousy, but it's all we've got (Being a critique of conventional methods), Gatekeeper Series 38, Sustainable Agriculture Program, International Institute for Environment and Development, London.
Heckman, J. (1992) "Randomization and social program evaluation", NBER Technical Working Paper 107, Cambridge.


Heckman, J., J. Smith and N. Clements (1997) "Making the most out of programme evaluations and social experiments: Accounting for heterogeneity in programme impacts", Review of Economic Studies 64, 487-535.
Hulme, D. (2000) "Impact assessment methodologies for microfinance: Theory, experience and better practice", World Development 28(1), 79-98.
IFAD (2002) Managing for impact in rural development: A practical guide for M&E, IFAD, Rome.
Jones, N., C. Walsh, H. Jones and C. Tincati (2008) Improving impact evaluation coordination and uptake – A scoping study commissioned by the DFID Evaluation Department on behalf of NONIE, Overseas Development Institute, London.
Karlan, D. (2009) "Thoughts on randomised trials for evaluation of development: Presentation to the Cairo evaluation clinic", Journal of Development Effectiveness 1(3), 237-242.
Karlan, D. and J. Zinman (2009) "Expanding microenterprise credit access: Using randomized supply decisions to estimate impacts in Manila", CEPR paper 7396, London.
Kusek, J. and R.C. Rist (2004) Ten steps to a results-based monitoring and evaluation system: A handbook for development practitioners, World Bank, Washington D.C.
Leeuw, F.L. (2003) "Reconstructing program theories: Methods available and problems to be solved", American Journal of Evaluation 24(1), 5-20.
Leeuw, F.L. (2009) "On the contemporary history of experimental evaluations and its relevance for policy making", in: O. Rieper, F.L. Leeuw and T. Ling (eds.) The evidence book: Concepts, generation, and use of evidence, Transaction Publishers, New Brunswick.
Leeuw, F.L. and J. Vaessen (2009) Impact evaluations and development – NONIE guidance on impact evaluation, Network of Networks on Impact Evaluation, Washington D.C.
Lipsey, M.W. (1993) "Theory as method: Small theories of treatments", in: L.B. Sechrest and A.G. Scott (eds.) Understanding causes and generalizing about them, New Directions for Program Evaluation 57, Jossey-Bass, San Francisco.
Mark, M.M., G.T. Henry and G. Julnes (1999) "Toward an integrative framework for evaluation practice", American Journal of Evaluation 20(2), 177-198.
McKenzie, D. and C. Woodruff (2008) "Experimental evidence on returns to capital and access to finance in Mexico", World Bank Economic Review 22(3), 457-482.
Miguel, E. and M. Kremer (2004) "Worms: Identifying impacts on education and health in the presence of treatment externalities", Econometrica 72(1), 159-217.
Mikkelsen, B. (2005) Methods for development work and research, Sage Publications, Thousand Oaks.
Morgan, S.L. and C. Winship (2007) Counterfactuals and causal inference – Methods and principles for social research, Cambridge University Press, Cambridge.
Morra, L.G. and R.C. Rist (2009) The road to results: Designing and conducting effective development evaluations, World Bank, Washington D.C.


Oakley, A. (2000) Experiments in knowing: Gender and method in the social sciences, Polity Press, Cambridge.
OECD-DAC (2002) Glossary of key terms in evaluation and results based management, OECD-DAC, Paris.
OED (2005) OED and impact evaluation: A discussion note, Operations Evaluation Department, World Bank, Washington D.C.
Olken, B. (2007) "Monitoring corruption: Evidence from a field experiment in Indonesia", Journal of Political Economy 115(2), 200-249.
Pawson, R. (2006) Evidence-based policy: A realist perspective, Sage Publications, London.
Pawson, R. (2009) "Middle range theory and program theory: From practice to provenance", in: J. Vaessen and F.L. Leeuw (eds.) Mind the gap: Perspectives on policy evaluation and the social sciences, Transaction Publishers, New Brunswick.
Pawson, R. and N. Tilley (1997) Realistic evaluation, Sage Publications, Thousand Oaks.
Pretty, J.N., I. Guijt, J. Thompson and I. Scoones (1995) A trainers' guide to participatory learning and action, IIED Participatory Methodology Series, IIED, London.
Ravallion, M. (2008) "Evaluation in the practice of development", Policy Research Working Paper 4547, World Bank, Washington D.C.
Ravallion, M. (2009a) "Should the Randomistas rule?", Economists' Voice, February 2009.
Ravallion, M. (2009b) "Evaluating three stylised interventions", Journal of Development Effectiveness 1(3), 227-236.
Riggin, L.J.C. (1990) "Linking program theory and social science theory", in: L. Bickman (ed.) Using program theory in evaluation, New Directions for Program Evaluation 33, Jossey-Bass, San Francisco.
Rodrik, D. (2008) "The new development economics: We shall experiment, but how shall we learn?", MIMEO, Harvard University, Cambridge.
Rogers, E.M. (2003) Diffusion of innovations, Free Press, New York.
Rogers, P.J. (2008) "Using programme theory for complex and complicated programs", Evaluation 14(1), 29-48.
Rogers, P.J., T.A. Hacsi, A. Petrosino and T.A. Huebner (eds.) (2000) Program theory in evaluation: Challenges and opportunities, New Directions for Evaluation 87, Jossey-Bass, San Francisco.
Rossi, P.H., M.W. Lipsey and H.E. Freeman (2004) Evaluation: A systematic approach, Sage Publications, Thousand Oaks.
Salmen, L. and E. Kane (2006) Bridging diversity: Participatory learning for responsive development, World Bank, Washington D.C.
Scriven, M. (2008) "Summative evaluation of RCT methodology: & an alternative approach to causal research", Journal of Multidisciplinary Evaluation 5(9), 11-24.
Shadish, W.R., T.D. Cook and D.T. Campbell (2002) Experimental and quasi-experimental designs for generalized causal inference, Houghton Mifflin Company, Boston.
Sherman, L.W., D.C. Gottfredson, D.L. MacKenzie, J. Eck, P. Reuter and S.D. Bushway (1997) Preventing crime: What works, what doesn't, what's promising, US Office of Justice Programs, Washington D.C.


Skoufias, E. and B. McClafferty (2001) "Is PROGRESA working? Summary of the results of an evaluation by IFPRI", FCND Discussion Paper 118, IFPRI, Washington D.C.
Tashakkori, A. and C. Teddlie (eds.) (2003) Handbook of mixed methods in social and behavioral research, Sage Publications, Thousand Oaks.
The Lancet Editorial (2004) "The World Bank is finally embracing science", The Lancet 364(9436), 731-732.
Vaessen, J. and D. Todd (2008) "Methodological challenges of evaluating the impact of the Global Environment Facility's biodiversity program", Evaluation and Program Planning 31(3), 231-240.
Vaessen, J., F. Leeuw, S. Bonilla, R. Lukach and J. Bastiaensen (2009) "Protocol for synthetic review of the impact of microcredit", Journal of Development Effectiveness 1(3), 285-294.
Van der Knaap, L.M., F.L. Leeuw, S. Bogaerts and L.T.J. Nijssen (2008) "Combining Campbell standards and the realist evaluation approach – the best of two worlds?", American Journal of Evaluation 29(1), 48-57.
Weiss, C.H. (2000) "Which links in which theories shall we evaluate?", in: P.J. Rogers, T.A. Hacsi, A. Petrosino and T.A. Huebner (eds.) Program theory in evaluation: Challenges and opportunities, New Directions for Evaluation 87, Jossey-Bass, San Francisco.
White, H. (2009) "Some reflections on current debates in impact evaluation", Working Paper 1, International Initiative for Impact Evaluation, New Delhi.
Woolcock, M. (2009) "Toward a plurality of methods in project evaluation: A contextualized approach to understanding impact trajectories and efficacy", Journal of Development Effectiveness 1(1), 1-14.
Worrall, J. (2007) "Why there's no cause to randomize", The British Journal for the Philosophy of Science 58(3), 451-488.
