I'm posting this as a question because in conversation over the past week, many people have already contributed many perspectives on this topic, ranging from scientific validity to the education and onboarding experience, the design of the web interface, and more. Please add your perspective as an "answer"! Thank you.
Also, please let me know if you'd like to book club Chapter Two "The Idea of Replication" in Harry Collins' book Changing Order: Replication and Induction in Scientific Practice.
I have a few other ideas for how replication could work:
@mathew suggested: ways to encourage people to post "partway done" work, to ask for help and post their work so far
how (in terms of interface) to mark something with success/failure: a prompt that creates a replication:success vs. replication:unsuccessful tag?
echoing Gretchen, above -- who gets to mark it as such (author or replicator)
how success condition is described/tested/debated -- and how the activity can be refined to have a narrower and more clearly articulated "success condition"
perhaps a "mini-quiz" which replicators fill out, which helps establish and (through posting links to raw data) support a claimed "success"
also via @gretchengehrke -- a positive, discursive forum will be important to reach consensus on "success"
And one thought I had -- are people submitting replications incentivized to say that they've succeeded, or activity authors to agree? Or are activity authors incentivized to say that replications have failed? It probably depends a lot on the situation, but I could imagine these playing out this way...
Gretchen has a good point that the term replication has very different meanings when applied to builds, modifications, field tests, observations, experiments, or monitoring deployments. The meaning of replication that is generally associated with scientific research is not the meaning that will typically apply when someone tries to repeat a build, modification, field test, or observation that someone has described in a research note. To avoid confusion and pretension, it might be better to use a phrase like “I tried this” which can be applied to any category of activity.
Jeff’s idea of presenting a quiz to the poster can collect crucial information about the nature of the attempt to repeat someone else’s activity.
Hi. You have started to post your “I tried this” results. The activity you tried is [name of note].
Did you try to repeat all of the steps in the activity?
What numbered steps in the activity did you try? (describe them if there are no numbers)
Are you also reporting on additional steps that were not in the original activity?
Was your result different from the one described in the original post?
If your result was different, was it because: A….B….C….D…Other.
Do you think your attempt to repeat this activity was successful?
Thanks. Please describe your experience below. …
I am a little worried that this approach is creating a bit of a monster. My guess is that many hundreds of people try things they learn about in research notes, but very few of them post research notes about their experience. Posting useful research notes is not for everyone, and can be hard work. It can also be incredibly intimidating unless you are familiar with Public Lab and aware that we are happy almost every time anybody contributes anything in a research note. To most people it’s just too hard to know if you’re doing it right when you post, or whether the community will value your post.
The proposed changes associated with the activity/repetition types of posts have the potential to scare away even more potential posters. There will suddenly be much more to learn about how to do it right, and many more ways to do it wrong.
For example, the new question interface is often used by people who have tried an activity (e.g., using the Infragram or spectrometer) and have a question about the process or result. Should those people instead post an “I tried this” note? How do people decide which to do?
If you tried only one of the four steps in an activity, is it okay to post an "I tried this"? If you followed steps 1-3 and then added your own steps 4-5, is it okay to post an "I tried this"? If you build a new spectrometer that has no components in common with existing builds (e.g., Ebert when it was posted), is this a new activity or an "I tried this" with (lots of) modifications? How do you write the rules to guide people through this?
Posting an original activity requires that people determine which of the categories (build, observation, experiment, etc) it fits into. That’s hard, although repeating an activity (“I tried this”) does not require making this decision. However, reporting intelligently on your experience repeating an activity will require understanding the difference between builds and modifications, and between observations and experiments. This really raises the bar for all posters. Raising the bar is a good thing, unless few people can get over it.
As Gretchen describes, determining whether repeating a previous activity was successful or not will almost never be straightforward. Every category of activity needs to be evaluated in a different way. Training people to figure this out will be onerous and asking people to do it could drive them away.
It might be best to let the poster who tried to repeat an activity report whether they thought it was successful or not. If they are convinced it was successful, then it was. Trying to pin down whether a repetition of an activity validates the activity or provides “replication” in some scientific way might be too much to ask.
It's possible that this entire new structure could be much more relaxed and still provide the collaborative, community building benefits which seem to be the main goal of the project.
Some stray thoughts copied and pasted from an email exchange with Liz:
Theorizing replication is a cottage industry within STS. Collins deals with it in terms of reconstructing the TEA Laser in Changing Order. I think that's the most relevant case for Public Lab, because it deals with replicating the assembly of a piece of technical lab equipment (rather than replicating research results). He also deals with replication as a sort of red herring in his introductory chapters. Everyone talks about it, no one ever actually does it for most studies. And what does it mean to replicate? You can never perfectly replicate a study. The beakers were a little different. The lab was the wrong temperature. The moon was in the seventh house instead of the sixth and Jupiter had yet to align with Mars. And so the results are off. But are they off because the original results were invalid or because the replicator failed to sufficiently create 1:1 conditions for the second test? Who can say? Not the experiment itself. And so we get what Harry Collins calls "the experimenter's regress." When there's controversy over findings in this way, at some point there are social dynamics that take over from experimental protocols to end the infinite regress.
So, not only is replication rarely ever actually used as a litmus test for good science (why would I spend all that time and money trying to perform the same operations you're describing in your journal article? What glory is in it for me? And who's going to pay for it!?), but when replication results in controversy, the regress sets in.
Then there's a whole history of the construction of replication as a standard in science, which Steven Shapin focuses on in much of his work (beginning with Leviathan and the Air-Pump).
Once I read Collins' treatment of replication I kind of stopped paying attention to the topic. Sort of an intellectual dead end if you buy into the concept of the experimenter's regress. Worth noting all the psychological studies that have recently been "debunked" after failures to replicate the same findings. Like... duh, right? If you can use minor variations in highly controllable experiments in the biophysical sciences to ignite the experimenter's regress, then the near infinity of uncontrollable variables in human behavior makes it really easy to claim that later researchers just failed to adequately duplicate the conditions of the first study. As if we could clone and copy the psyches of the people in those first studies to use them again for replication.
Or, yeah, the studies were just bad science to begin with. But that's a different story.
Thanks, Phil - I'm curious if Collins' analysis is widely recognized among scientists -- i.e. is this simply common knowledge, and replication is not widely considered to be an important process as science is practiced today, or is Collins' analysis that science practitioners today hold a flawed understanding of their systems of producing knowledge, and that there is a separate mechanism for the wide recognition and adoption of new ideas (perhaps, publication, instead of replication)?
Whether or not it is broadly recognized, I'm interested in it as a mechanism of evaluation for people sharing ideas on this site, because it gives authors a motivation to provide clear, thorough, and easy-to-follow steps, so that they get replications. This may sound obvious but I think many folks end up just writing about what they did, not setting out to guide others through the process. And by celebrating "replications" or whatever we call them, we make a "thing" out of doing something twice, or three times, which chips away at the cult of originality we see in some creative spaces -- the rush to "be the one to do something" and the lack of glamor in being second.
Replication (reproducibility) in scientific studies has become a topic of conversation because:
Recent investigations suggest that few researchers ever try to replicate studies (reproduce the results) and when they do they often fail to get the same result. Most of the publicized examples of this type of failure are in the pharmaceutical and medical fields (where this outcome should concern us).
Sociologists noticed that published papers sometimes fail to include enough information to allow others to replicate a study (reproduce the study’s results). In one famous case, new researchers had to work closely with the original team to get the same result. For some reason, sociologists thought this was important.
The types of studies referred to above generally involve experiments. Precise measurements were made in carefully manipulated environments where many variables were controlled so that they could not confuse the outcome. It should be possible to replicate these studies (reproduce the results) – otherwise the results of the original experiment must be questioned. This is a key component of scientific research.
If you are not doing an experiment, this type of replication (reproducibility) might not be an important part of the process. If the other categories of Public Lab activities are being done (Build, Verify, Observations, Test tool limits, Field test, Monitor your environment) it will be good to see that someone else can do something similar, but the concept of reproducibility might not be applicable. There generally will not be any singular result to reproduce.
There will be exceptions when the original activity specifies a controlled environment, a carefully thought out procedure, and multiple trials including controls, in which case that activity was indeed an experiment. Attempts could be made to reproduce the results of that type of activity, but this might be a very rare situation in Public Lab activities.
In most Public Lab activities, the goal is far less specific: Does the kite fly well? Does your spectrum look like mine? Do the colors in my NDVI image look meaningful? Did the device log data every five minutes? Repeating this type of activity is a repetition, not a replication, and the sociological conversation about reproducibility does not really apply.
It will be good to have a record of multiple people building a certain spectrometer, or doing a careful job calibrating a spectrometer, or getting meaningful NDVI images from an infrared camera, or making a circuit respond to external stimuli. Calling these activities replications might confuse people about why reproducibility is critical in scientific research. Grasping the conceptual importance of reproducibility will come in handy when you finally get down to doing an experiment to see if your technique can identify environmental contamination. If nobody else can reproduce your results, your technique will not be useful.
When the scope of activities includes things that are experiments and also things that are not experiments, reserve the word experiment for the experiments. When the scope of activities includes reproducible experiments but also other things that are just attempts to repeat a procedure, don’t call the repeats replications. This is not just semantics; it is essential if your goal is to help people understand concepts they might not be familiar with.
The conceptual differences among replication, reproducibility, and repeating a procedure are not trivial. It does not help that the term replication is used two ways – it is a common term for the use of multiple samples or trials in an experiment, and is also used to refer to the reproducibility of an experiment’s results. These three concepts are distinct, and all are important concepts to grasp as you design, implement, and present scientific research. Those familiar with research might be less likely to take your results seriously if these concepts are confused.
It might be good to agree on definitions of some basic terms. Here is one suggestion.
1. Reproducibility, Reproducing the results of an experiment:
Repeating an experiment and getting a result that leads to the same conclusion. More technically, using the same experimental procedure to test the same hypothesis and coming to the same conclusion about the hypothesis (confirming or rejecting it). Also performing a different, related experiment that produces a result consistent with the conclusion of the original experiment.
This can also be applied to doing something that does not appear to be an experiment. For example, if someone uses a mercury thermometer to measure the temperature of water in an ice bath and gets a result of 32.1 ± 0.8°F (n= 10 measurements), this result could be reproduced using a Riffle and DHT sensor. If the Riffle results are 32.4 ± 1.1°F (n= 10 measurements), then the result has been reproduced (i.e., there is no statistical difference between those two results). Although this appears to be just two measurements and not an experiment, it could be done so that all of the requirements of an experiment are fulfilled:
A stated hypothesis (e.g., the measurement of ice water temperature is not different from 32°F)
A procedure appropriate for the system (e.g., lots of ice and water)
A number of replicates (multiple measurements) which is appropriate to describe the variability of the device (the thermometer) and the parameter (the water temperature).
An appropriate statistical test.
This would be a very simple experiment, but it is nonetheless an experiment. Therefore, its result should be reproducible. In this sense, the results of simple observations or measurements can be reproduced as long as the series of observations or measurements meets the above requirements and is therefore a bona fide experiment.
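To make the "appropriate statistical test" in the ice bath example concrete, here is a minimal sketch of one possible choice, Welch's two-sample t-test, run on the summary numbers quoted above. It assumes the ± values are sample standard deviations (the example does not say whether they are standard deviations, standard errors, or something else), and the function names are made up for illustration. It uses only the Python standard library, so the p-value is computed by numerically integrating the t density rather than by calling a statistics package.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1  # squared standard error of the first mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value, integrating the density from 0 to |t| with Simpson's rule."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # probability between 0 and |t|
    return max(0.0, 1 - 2 * area)

# Summary values from the example; treating +/- as a standard deviation is an assumption.
t_stat, df = welch_t(32.1, 0.8, 10, 32.4, 1.1, 10)
p_value = two_sided_p(t_stat, df)
print(f"t = {t_stat:.3f}, df = {df:.1f}, p = {p_value:.2f}")
```

Under that assumption the p-value comes out far above the usual 0.05 cutoff, which is consistent with the statement that the two results do not differ statistically. A different reading of the ± values, or a different test, could change the numbers.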
2. Replication, Replicates:
Multiple units of study (samples, trials, measurements, study plots, days, populations, etc.) which are required to account for the different types of variability in the subject of study and in the method of study.
These replicates (or replicate samples, replicate measurements, etc.) must be collected under the same conditions and in the same way. This type of replication is the basis for all statistical analysis because multiple data points allow the variability in some parameter to be quantified.
Replication must be done at multiple levels depending on the question being asked (i.e., on the hypothesis being tested). For example, if asking the question Do these two air samples differ in the amount of suspended silica? then a lab procedure could be done on five replicate subsamples from each air sample. However, if the question is Do silica mines pollute the air? then the experiment might require collecting 10 replicate air samples at each of 10 replicate sites near each of 10 replicate mines and also 10 replicate control locations on 10 different replicate days, and then running five replicate lab analyses on each sample. The number of replicates required at each level depends on how much the measured parameter varies at that level and is often not known until the samples are measured.
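The arithmetic behind the silica example shows how quickly replication at several levels adds up. The sketch below is only one possible reading of that design: it assumes 10 sites at each of 10 mines plus 10 control locations in total, every location sampled on each of 10 days, 10 air samples per location per day, and 5 lab analyses per sample. The variable names are illustrative, and other readings of the example would give different totals.

```python
# Counts are taken from the example; how they nest is an assumed reading.
mines = 10
sites_per_mine = 10
control_locations = 10            # assumed to be 10 in total, not 10 per mine
sampling_days = 10
air_samples_per_location_per_day = 10
lab_analyses_per_sample = 5

# Question 1: do these two air samples differ? Five subsamples from each.
simple_case_analyses = 2 * 5

# Question 2: do silica mines pollute the air? Replication at every level.
locations = mines * sites_per_mine + control_locations
air_samples = locations * sampling_days * air_samples_per_location_per_day
lab_analyses = air_samples * lab_analyses_per_sample

print("Two-sample question:", simple_case_analyses, "lab analyses")
print("Mine question:", locations, "locations,", air_samples,
      "air samples,", lab_analyses, "lab analyses")
```

Under this reading the broader question needs thousands of times more lab work than the narrow one, which is the point of the example: the hypothesis determines where replication is needed and how much.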
3. Repetition, Repeating a procedure:
Doing something multiple times.
When there is no way to test whether the outcome of repeating a procedure is the same every time it is done, then the concept of reproducibility does not apply.
Building and modifying devices
Following someone else’s instructions to build or modify a device is not reproducing a result unless a test can be done to determine if some predetermined specifications have been met. If such a test is available, then building or modifying a device can become part of an experiment. If the above requirements of an experiment are met, then the test can be used to determine if the build or modification has reproduced the results of the original. Multiple builds might be required because each would be a replicate in the experiment.
In most cases, the requirements of an experiment are not met when people follow instructions to build something, often because there are too many variables to control. However, anything can be part of an experiment if careful planning is done and replication is sufficient to account for the inherent variability.
If five people each use their Riffle to monitor the water temperature in a stream near their house, this is repetition. It might be difficult to argue that these activities are part of a single experiment or are replicates or are reproducing a result. It’s just five people more or less repeating the same activity. Additional measurements, restrictions, replications, or controls could allow this type of activity to be part of an experiment, but monitoring by itself is not an experiment and often the results cannot be reproduced (environmental variation can make this difficult).
Is replication the same as reproducing the result of an experiment?
The answer to this question is messy because the term replication is used to refer both to experimental replicates and to replicate experiments. An experimental replicate is a unit of study (e.g., one of five random study plots) and a replicate experiment is when an entire study is repeated to see if the same result is achieved.
But there are two different types of “replicate” experiments. The obvious one is when every detail of the original experiment is repeated. For several very good reasons, this almost never happens. The only thing you can learn from this is whether the original experiment might have been flawed in some way. When research has been peer reviewed, we assume it is not very flawed, so repeating the exact same experiment rarely tells us anything worthwhile. (The assumption that peer reviewed research is robust apparently does not apply to some fields of research.)
The more common response to earlier research is to do a similar study designed to apply a different test of the greater question at hand. So a new study can be done with some important variables changed. It can be done with a different organism, or at a different place, or at a different time of year. A different technique can be used to measure the important thing in the study, or a different but related thing can be measured. The goal is not to search for flaws, but to see if the results of the new study are consistent with the results of the original study.
For example, consider a study that tested the hypothesis that precipitation events carry nutrients from fertilized fields down the watershed toward streams. Three streams downhill from fertilized fields were sampled every two hours for nitrogen in the water, and precipitation events were recorded. The result was that nitrogen in the water increased significantly two hours after big rain events compared to two hours before rain events. This is consistent with the stated hypothesis that precipitation caused the movement of nutrients from fields toward streams.
Replicating this study just requires copying everything that was done the first time. But reproducing the original result (confirming the hypothesis) can be done with many different studies. For example, phosphorus could be measured instead of nitrogen. Water could be sampled while it is running across the ground instead of after it enters the stream. Isotopically labeled nitrogen can be added to the fields to see if those very molecules can be detected in the streams. Shallow ground water can be extracted from soil in transects from the fields down to the streams and analyzed for nutrients. The study can be repeated in spring instead of fall, or after bigger rain events or after snow melt events. The nutrient content of streamside plants could be measured near streams downhill from fertilized fields and also near streams far from fields.
Unlike the simple approach of repeating the original study, these studies can add important new information while also reproducing the results (confirming the hypothesis) of the original study. These new studies can strengthen our confidence that nutrients (do or don't) move from field to stream, and do so far more effectively than just repeating what the first researcher did. This is what usually happens in science.
This is why the idea of “replicating” a study needs clarification. The more important goal is to “reproduce the result” of a study, not just repeat the study. The term “reproducibility” can refer to this idea of confirming the big hypothesis with a strategically designed new study. That is the primary way that science advances, and the reason that in this context the term “replication” is misleading and should probably be avoided.
Replicating an experiment: Repeating an experiment to see if an earlier implementation of it was flawed.
Reproducing the results of an experiment: Performing a similar or related experiment to test the hypothesis confirmed by an earlier experiment and increase our understanding of the study system.