
The doing gap in educational evaluation

April 19, 2026 · 8 min read · Evaluation · Field observations

Three decades of hiring and working alongside education PhDs have taught me something uncomfortable about my own field. Many highly credentialed researchers cannot actually operate the instruments of their own research. They can frame a study but not build the survey. They can specify an analysis but not clean the data. They can design an evaluation but not write the final report without a ghostwriter. This isn't a personal failing; it is what their training taught them to value, and it is finally starting to cost the field enough that the pattern is breaking.

1. What the gap looks like in practice

Over thirty-two years I have hired a fair number of education PhDs and brought more in as subcontractors. A consistent pattern: many can talk fluently about research design but cannot build a valid survey without help, cannot perform a basic data-cleaning pass, cannot write a funder-ready report from a raw dataset, and cannot produce a working dashboard. Often they cannot produce a serviceable Excel summary. When I offer to teach them, the reaction is almost never "yes, show me." It is a mix of discomfort and distance — as if the request itself were a category error.

The evaluators we most often replace in the middle of a project are the ones who were hired on credentials and on how they talked about evaluation, and who then could not operationalize the plan they had written.

The gap shows up especially sharply in statistical work and survey validation. Most education PhDs were trained to read statistical results, not to produce them. They can describe what a t-test does but cannot carry one out themselves on a real dataset in whatever software a project happens to use. They can cite Cronbach's alpha in a literature review but cannot compute it for a survey they fielded last week. They learned psychometric theory as vocabulary but have never run a Rasch analysis on their own instrument to see how the high, middle, and low scorers are actually separated by its items: which questions discriminate well, which don't, and which sit at the wrong difficulty level for the population they're measuring. The validation questions that matter most to a program ("is this instrument measuring what we think it's measuring?", "where are items failing to discriminate?", "which subgroup patterns are real and which are artifacts of item behavior?") simply don't get asked, because the person nominally responsible for answering them cannot operate the tools that would answer them.
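
To make "produce them" concrete: computing Cronbach's alpha for a fielded survey is roughly a ten-line task. Here is a minimal sketch in Python with pandas, with the file name and the q1..q10 item columns invented for illustration.

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a block of items: columns are items, rows are respondents."""
    items = items.dropna()                        # listwise deletion, kept deliberately simple
    k = items.shape[1]                            # number of items
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical export: one row per respondent, items q1..q10 on a 1-5 scale
survey = pd.read_csv("survey_export.csv")
print(cronbach_alpha(survey[[f"q{i}" for i in range(1, 11)]]))
```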

2. This is a status hierarchy, not a skill gap

Most of these researchers are not incapable. They have absorbed a clear status message from their own training: thinking is higher status than doing. Professors model it. Peer-reviewed journals reward it. Grant proposals foreground it. Departmental prestige flows to whoever is writing the theory section, not whoever is building the measurement instrument.

The consequence is that education PhDs learn early to delegate the doing — to graduate students, then to research assistants, later to contractors. The identity they are trained into is the person who frames the question. Building the instrument is someone else's job, and ideally an invisible one.

This pattern is not unique to education, but education has it worse than most applied fields, because educational research is usually research about people doing the work of teaching, and that work is itself low-status in the academy. The researcher is already one rung above the classroom; they are not eager to step back down to the mechanics of a Qualtrics survey or a pivot table.

3. Why the pattern is breaking now

Two forces are converging. First, employers outside the academy are telling colleges out loud that their graduates need to be able to actually produce work, not just describe it. This has been true for a long time; the willingness to say it publicly is newer. Industry is asking community colleges and universities to treat skill-building with the same seriousness they treat theory-building — and it is backing that up with hiring decisions.

Second, program funders increasingly want evaluation work that is executed, not just designed. Funders want the dashboard, not the description of what a dashboard could look like. They want a cleaned and documented dataset, not a methodology note about what the dataset will eventually contain. They want the evaluation report written by someone who understands the data, not by a project manager translating between a specialist and the funder. Evaluation firms either adapt or watch the work migrate to firms that can build things.

4. What thinking-and-doing looks like in the same person

The alternative to the status hierarchy is not anti-intellectualism. It is a practitioner model — a researcher who designs the survey and also fields it, validates the instrument and also interprets what the validation reveals, cleans the data and also analyzes it, frames the findings and also writes the report. Each of those pairs strengthens the other. Fielding a survey teaches you what the question actually measures in the mouths of real respondents. Running a Rasch analysis on the returned data teaches you which items are actually discriminating between high, middle, and low scorers — and which items were decoration that looked reasonable on paper. Cleaning a dataset teaches you where your design assumptions were wrong. Writing the report teaches you which findings are load-bearing and which are ornament.
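
As a sketch of what that item-level look involves: the snippet below computes first-pass Rasch-style difficulties using the centered log-odds step that opens the PROX approximation, then checks how each item separates high, middle, and low scorers. A production calibration would iterate, usually with a dedicated package; the data layout and file name here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical layout: one row per respondent, dichotomously scored items (0/1)
resp = pd.read_csv("instrument_scored.csv")

# First-pass difficulties: centered log-odds of an incorrect response.
# This is the opening move of the PROX approximation, not a full calibration.
p = resp.mean(axis=0).clip(0.01, 0.99)            # proportion correct per item
difficulty = np.log((1 - p) / p)
difficulty = difficulty - difficulty.mean()       # center the scale at zero

# Discrimination check: does each item separate high, middle, and low scorers?
total = resp.sum(axis=1)
tercile = pd.qcut(total.rank(method="first"), 3, labels=["low", "mid", "high"])
endorsement = resp.groupby(tercile, observed=True).mean().T   # items x score groups
spread = endorsement["high"] - endorsement["low"]

report = pd.DataFrame({"difficulty": difficulty, "spread": spread}).join(endorsement)
print(report.sort_values("spread"))               # weak items float to the top
```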

In practice, the people who actually produce change inside educational programs operate this way. They make curriculum decisions and they know how the attendance codes are entered. They run the program and they keep the roster. They make the theoretical argument and they can show you the spreadsheet. When you evaluate programs like these, you need evaluators who can match that mode — who can talk theory of change with the program director in the morning and fix a broken SQL query for the data manager in the afternoon.

5. What this means for hiring an evaluator

Ask your candidate not just what they would evaluate, but what they would build. Ask to see a dashboard they designed and can walk you through. Ask how they would clean a messy dataset — not in principle, in practice. Ask them to show you a survey they fielded and the raw exported data that came back from it. If they can only show you methodology documents and finished reports, that is a flag. If their CV includes five white papers and no artifacts, that is also a flag.

The best practical test is small and concrete: describe a modest data task and ask the candidate to outline, step by step, how they would accomplish it using a specific tool. The answer should sound operational, not theoretical. If they begin with "I would engage a data analyst to..." — you have your answer.
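
For example, a task like "here are two exports, a roster and an attendance log; give me the enrollees with no attendance records" has an operational answer that sounds like the steps below. The file and column names are invented for the scenario; the shape of the answer is what you are listening for.

```python
import pandas as pd

# Hypothetical exports from the task described above
roster = pd.read_csv("roster.csv")                # one row per enrollee
attendance = pd.read_csv("attendance_log.csv")    # one row per check-in

# Normalize the join key first; stray whitespace and type mismatches
# are the usual culprits in real exports.
for df in (roster, attendance):
    df["student_id"] = df["student_id"].astype(str).str.strip()

# Left merge with an indicator column, then keep enrollees with no check-ins.
merged = roster.merge(attendance[["student_id"]].drop_duplicates(),
                      on="student_id", how="left", indicator=True)
never_attended = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
never_attended.to_csv("never_attended.csv", index=False)
```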

6. What this means for training an evaluator

Resist the default that the dissertation is the apex of the training. Insist on hours operating instruments — writing SQL, cleaning CSV files, building surveys in Qualtrics or Jotform, running real statistical and psychometric analyses on the data those surveys produce, producing dashboards in whatever tool the client actually uses. The graduates who thrive in applied settings are the ones who did this early, even when their departments implicitly signaled it was beneath them. There is no shortcut that skips the hours on the keyboard.
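
What those hours look like, concretely: below is a basic cleaning pass over a hypothetical attendance export, with the rows that fail type coercion set aside for human review rather than silently dropped. Every name in it is illustrative.

```python
import pandas as pd

df = pd.read_csv("attendance_export.csv")         # hypothetical raw export

# Normalize headers, drop exact duplicates, coerce the obvious types.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df = df.drop_duplicates()
df["date"] = pd.to_datetime(df["date"], errors="coerce")    # bad dates -> NaT
df["hours"] = pd.to_numeric(df["hours"], errors="coerce")   # bad numbers -> NaN

# Set aside whatever failed coercion instead of discarding it silently.
bad_rows = df[df["date"].isna() | df["hours"].isna()]
bad_rows.to_csv("rows_needing_review.csv", index=False)

df.dropna(subset=["date", "hours"]).to_csv("attendance_clean.csv", index=False)
```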

And do not outsource the writing. Evaluators who cannot write a report without a ghostwriter will be limited for the whole of their career, because writing the report is how you discover what you actually found.

7. Why this has been Edstar's shape from the start

We built Edstar on the combined practitioner model from day one — partly because the co-founders were K-12 and college educators before they were researchers, and partly because the work we were hired to do in the early 1990s demanded it. School systems were not paying for methodological elegance. They were paying for something that worked on a Monday morning. Every hiring and consulting decision since has passed the same bar: can this person both frame the problem and build the solution?

The tools elsewhere on this site — the enrollment system, the STEM observation instrument, the dashboards, the grant-writing workflows — are what that standard produces over time. Each started as a research question and ended as a working piece of software that someone uses on a Monday. None of them could have been built by someone who only wanted to think about them.

Closing

The status hierarchy between thinking and doing is finally cracking in education, and not a moment too soon. Programs need evaluators who can operate instruments. Graduates need training that treats skill-building as a serious part of the work. Universities need to acknowledge that their least-prestigious faculty — the ones teaching Excel, SQL, database design, survey construction — are producing the most employable graduates. And evaluation firms need to stop outsourcing the doing, because the clients who hire them are about to stop paying for work that comes back in pieces.

Questions about this?

If anything here is relevant to your program, we'd love to hear what you're working on. The first conversation is free.