Critical Appraisal is the process of carefully and systematically assessing the outcome of scientific research (evidence) to judge its trustworthiness, value, and relevance in a particular clinical context. When critically appraising a research study, you want to think about and comment on:
Having a systematic approach to critically appraising a piece of research can be helpful. Below is a guide.
Random error is an error in a study that occurs due to pure random chance. For example, if the inherent prevalence of depression is 4%, and we do a study with a sample size of 100 examining the prevalence of depression, the study could by pure random chance have 0 people with depression even though the prevalence is 4%. Increasing the sample size decreases the likelihood of random errors occurring, because it increases your power to detect findings.
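The depression example above can be put into numbers. The sketch below assumes each participant independently has a 4% chance of having depression (a simple binomial model, which is an assumption and not something the study design guarantees). The function name `prob_zero_cases` is made up for illustration.

```python
def prob_zero_cases(prevalence: float, sample_size: int) -> float:
    """Probability that a random sample contains no cases at all.

    Assumes each participant independently has the condition with
    probability `prevalence` (a simple binomial model).
    """
    return (1.0 - prevalence) ** sample_size


# With a true prevalence of 4%, see how often a study would find zero cases.
for n in (100, 500, 1000):
    print(f"n={n}: P(0 cases) = {prob_zero_cases(0.04, n):.6f}")
```

With a sample of 100, the chance of finding no cases at all is a little under 2%, and it shrinks quickly as the sample size grows.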
Systematic error is also known as bias. This is an error in the design, conduct, or analysis of a study that results in a mistaken estimate of a treatment's or exposure's effect on the risk or outcome of a disease. These errors can distort the results of a study in a particular direction (e.g. - favouring a medication treatment, or not favouring a treatment).
Systematic errors (i.e. - biases) can threaten the validity of a study. There are two types of validity:
There are many types of biases, and some of them are listed in the table below. Researchers also use more comprehensive tools to measure and assess for bias. The most commonly used tool is the Cochrane Risk of Bias Tool. The components of the Cochrane Risk of Bias Tool are not described here.
Type of Bias | Definition | Example | How To Reduce This Bias |
---|---|---|---|
Sampling Bias | When participants selected for a study are systematically different from those the results are generalized to (i.e. - the patient in front of you). | A survey of high school students to measure teenage use of illegal drugs does not include high school dropouts. | Avoid convenience sampling, make sure that the target population is properly defined, and that the study sample matches the target population as much as possible. |
Selection Bias | When there are systematic differences between baseline characteristics of the groups that are compared. | A study looking at a healthy eating diet and health outcomes. The individuals who volunteer for the study might already be health-conscious, or come from a high socioeconomic background. | Randomization, and/or ensure the choice of the right comparison group. |
Measurement Bias | The methods of measurement are not the same between groups of patients. This is an umbrella term that includes information bias, recall bias and lack of blinding. | Using a faulty automatic blood pressure cuff to measure BP. See also: Hawthorne Effect | Use standardized, objective and previously validated methods of data collection. Use placebo or control group. |
Information Bias | Information obtained about subjects is inadequate, resulting in incorrect data. | In a study looking at oral contraceptive pill (OCP) use and risk of deep vein thrombosis, one MD fails to do a comprehensive history and forgets to ask about OCP use, while another MD does a very detailed history and asks about it. | Choose an appropriate study design, create a well-designed protocol for data collection, train researchers to properly implement the protocol and handle data, and properly measure all exposures and outcomes. |
Recall Bias | Recall of information about exposure to something differs between study groups. | In a study looking at chemical exposures and risk of eczema in children, one anxious parent might recall all of the exposures their child has had, while another parent does not recall the exposures in as much detail. | Could use records kept from before the outcome occurred, and in some cases, keep the exact hypothesis concealed from the case (i.e. - person) being studied. |
Lack of blinding | If the researcher or the participant is not blind to the treatment condition, the assessment of outcome might be biased. | A psychiatrist is tasked with assessing whether a patient's depression has improved using a depression rating scale, but he knows the patient is on an antidepressant. He may be unconsciously biased to rate the patient as having improved. | Blind the participant/researcher. |
Confounding | When two factors are associated with each other and the effect of one is confused with or distorted by the other. These biases can result in both Type I and Type II errors. | A research study finds that caffeine use causes lung cancer, when really it is that smokers drink a lot of coffee, and it has nothing to do with coffee. | Repeated studies; do crossover studies (subjects act as their own controls); match each subject with a control with similar characteristics. |
Lead-time bias | Early detection with an intervention is confused with thinking that the intervention leads to better survival | A cancer screening campaign makes it seem like survival has increased, but the disease’s natural history has not changed. The cancers are picked up earlier by screening; but even early identification (with or without early treatment) does not actually change the trajectory of the illness. | Measure “back-end” survival (i.e. - adjust survival according to the severity of disease at the time of diagnosis). Have longer study enrollment periods and follow up on patient outcomes for longer. |
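
Lead-time bias in the last row of the table can be illustrated with simple arithmetic. The sketch below uses one hypothetical patient (all ages are made-up numbers) whose age at death is the same whether the cancer is found by screening or by symptoms. Only the date of diagnosis moves.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # All values are hypothetical and chosen only for illustration.
    age_at_clinical_diagnosis: float  # diagnosed once symptoms appear
    age_at_screen_diagnosis: float    # diagnosed earlier through screening
    age_at_death: float               # NOT changed by earlier detection


def survival_after_diagnosis(age_at_diagnosis: float, age_at_death: float) -> float:
    """Years lived after the diagnosis was made."""
    return age_at_death - age_at_diagnosis


patient = Patient(age_at_clinical_diagnosis=67, age_at_screen_diagnosis=63, age_at_death=70)

without_screening = survival_after_diagnosis(patient.age_at_clinical_diagnosis, patient.age_at_death)
with_screening = survival_after_diagnosis(patient.age_at_screen_diagnosis, patient.age_at_death)
lead_time = with_screening - without_screening

print(f"Survival after diagnosis without screening: {without_screening} years")
print(f"Survival after diagnosis with screening:    {with_screening} years")
print(f"Apparent gain that is purely lead time:     {lead_time} years")
```

Measured survival goes from 3 to 7 years, yet the patient dies at the same age. The extra 4 years are lead time, not benefit.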
A research question that leads to a research study is often quite general, but the answer that we get from a published research study is actually extremely specific. As an example, as a researcher, I might want to know if a new SSRI is more effective than older SSRIs. This seems like an easy question to answer, but the reality is much more complicated. When looking at the results of a study, you want to think about how specific or generalizable the findings are.
| | The question I (the researcher) wish I could answer* | The question I'm actually answering when I do the study |
---|---|---|
Scope | General | Specific |
Question | Is this newer antidepressant more effective than the older antidepressant for people with depression? I want to know if it is a simple yes or no! | What proportion of depressed patients aged 18 to 64 with no comorbidities and no suicidal ideation and who have not been treated with an antidepressant experience a greater than 50% reduction in the Hamilton Rating Scale after 6 weeks of treatment with the new antidepressant compared to the old antidepressant? |
Ask yourself, what is the target population in the research question? Then ask yourself, does the study sample itself actually represent this target population? Sometimes, the study sample will be systematically different from the target population and this can reduce the external validity/generalizability of the findings (also known as sampling bias). After this, you should look at the inclusion and exclusion criteria for the participants:
Different study designs have different limitations. Having an understanding of each type of common study design is an important part of critically appraising a study. The most common study designs in psychiatric research are experimental designs, such as randomized controlled trials, and observational studies, including: cross-sectional, cohort, and case-control.
| | Description | Measures you can get | Sample study statement |
---|---|---|---|
Randomized Control Trial (RCT) | A true experiment that tests an experimental hypothesis. Neither the patient nor the doctor knows whether the patient is in a treatment or control (placebo) group. | • Odds ratio (OR) • Relative risk (RR) • Specific patient outcomes | “Our study shows Drug X works to treat Y condition.” |
Cross-Sectional | • Assesses the frequency of disease at that specific time of the study | • Disease prevalence | “Our study shows that X risk factor is associated with disease Y, but we cannot determine causality.” |
Case-Control | • Compares a group of individuals with disease to a group without disease. • Looks to see if the odds of a previous exposure or a risk factor are associated with whether a disease or event happens | • Odds ratio (OR) | “Our study shows patients with lung cancer had higher odds of a smoking history than those without lung cancer.” |
Cohort | • Cohort studies can be prospective (i.e. - follows people during the study) or retrospective (i.e. - the data is already collected, and now you're looking back). • Compares a group with an exposure or risk factor to a group without the exposure. • Looks to see if an exposure or a risk factor is associated with development of a disease (e.g. - stroke) or an event (e.g. - death). | • Relative risk (RR) | “Our study shows that patients with ADHD had a higher risk of sustaining traumatic brain injuries than non-ADHD patients.” |
Randomized Control Trials (RCTs) allow us to do a true experiment to test a hypothesis. Randomization, with sufficient sample sizes, ensures that both measurable and unmeasurable variables are evenly distributed across the treatment and non-treatment groups. It also ensures that reasons for assignment to treatment or no treatment are not biased (avoids selection bias).
Blinding is of course not always possible (e.g. - an active medication may have a very obvious side effect that a placebo doesn't have, making it very obvious to the participant if they are on a placebo or not). One needs to understand the potential impact of breaking the blind in these studies. Some studies may therefore implement active placebos rather than inert placebos, to counter this potential bias.[1]
The presence of a placebo or control group can adequately account for these confounding factors.
Why are clinical studies so obsessed with randomization and randomized control trials (RCTs)? Randomization allows us to balance out not just known biases and factors but, more importantly, unknown biases and risk factors. Randomization saves us from the arduous process of needing to account for every single possible bias or factor in a study. For example, in a study looking at the effectiveness of an antidepressant versus a placebo, a possible bias that might affect the result could be gender or other medication use (something you can measure). Of course, you could try to skip randomization and divide the groups equally by gender and medication use yourself. However, unmeasurable factors (like family support, resiliency, genetics) might also affect a participant's response to medications.
The beauty of a properly done randomization is you can eliminate (or come close to eliminating) all the unknown factors. This way, you can be sure that the outcomes of your study are not affected by these factors since randomization should equally distribute all known and unknown factors amongst all treatment groups. Randomization is most effective when sample sizes are large. When studies have small sample sizes, they are called underpowered studies, which makes it hard to ensure the sample has been adequately randomized.
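The balancing effect of randomization can be seen in a small simulation. The sketch below is illustrative only: it invents an unmeasured trait (think of "family support") as a standard normal score, randomly splits participants into two arms, and reports how far apart the arm averages are. The function names, the trait distribution, and the seeds are all assumptions made for the example.

```python
import random
from statistics import mean


def randomize(participants, seed=0):
    """Shuffle participants and split them into two equally sized arms."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def simulate_imbalance(n, seed=0):
    """Absolute difference between arms in the mean of an unmeasured trait."""
    rng = random.Random(seed)
    # Hypothetical unmeasured trait, one score per participant.
    trait = [rng.gauss(0, 1) for _ in range(n)]
    treatment, control = randomize(trait, seed=seed + 1)
    return abs(mean(treatment) - mean(control))


for n in (20, 200, 20000):
    print(f"n={n}: difference in mean trait between arms = {simulate_imbalance(n):.3f}")
```

Across repeated runs with different seeds, the difference between arms tends to shrink as the sample grows, which is why randomization in a small study gives weaker reassurance about balance.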
Observational Studies are studies where researchers observe the effect of a risk factor, medical test, treatment or other intervention without trying to change who is or is not exposed to it. Thus these studies do not have randomization and do not have control groups. Observational studies come in several flavours, including cross-sectional, cohort, and case-control studies.
Case-control studies (also known as retrospective studies) are studies where there are “cases” that already have the outcome (e.g. - completed suicide, diagnosis of depression, diagnosis of dementia), and “controls” that do not have the outcome (hence the name case-control). Rather than watch for disease or outcome development, case-control studies look back and compare the prevalence of risk factors that may have led to the outcome. For example, a researcher might look at antidepressant use, and whether it is associated with an outcome they already know (e.g. - cases of completed suicides).
A cohort study (also known as a prospective study) follows a group of subjects over time.
Systematic Reviews (SR) and Meta-Analyses (MA) synthesize all the available evidence in the scientific literature to answer a specific research question. Systematic reviews describe the outcomes of each study individually (always look for a forest plot in the paper). The meta-analysis is an extension of a systematic review that uses complex statistics to combine outcomes (if outcomes of different studies are similar enough). In most meta-analyses, there are strict criteria for study inclusion that usually weed out flawed research studies. However, this is not always the case! Poor study selection can often result in flawed meta-analysis findings (garbage in = garbage out)! It is also important to watch for publication bias (i.e. - certain articles might be favoured over others).
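To give a feel for what "combining outcomes" can mean, here is a minimal sketch of one common approach, fixed-effect inverse-variance pooling, where more precise studies get more weight. This is an assumption about method chosen for illustration. Real meta-analyses often use random-effects models and dedicated software, and the three study results below are made-up numbers.

```python
def pool_fixed_effect(effects, variances):
    """Inverse-variance weighted average of study effect estimates.

    `effects` are per-study estimates on a common scale (e.g. mean
    differences or log odds ratios); `variances` are their variances.
    Returns the pooled estimate and its variance.
    """
    if len(effects) != len(variances) or not effects:
        raise ValueError("need one variance per effect, and at least one study")
    weights = [1.0 / v for v in variances]  # precise studies weigh more
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    return pooled, 1.0 / total_weight


# Three hypothetical studies: effect estimates and their variances.
pooled, pooled_var = pool_fixed_effect([0.2, 0.4, 0.3], [0.04, 0.01, 0.02])
print(f"Pooled effect = {pooled:.3f}, variance = {pooled_var:.4f}")
```

The pooled estimate lands closest to the second study because it has the smallest variance. This also shows why a single flawed but "precise" study can dominate a pooled result.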
Depending on the design of the study, the results section will look very different. However, all studies should:
Results may first compare the mean (average) between a treatment and a non-treatment group:
Results can be expressed in a number of ways, including:
Measure | Calculation* | Example |
---|---|---|
Control Event Rate (CER) | (development of condition among controls) ÷ (all controls) | 4% as per above example |
Experimental Event Rate (EER) | (development of condition among treated) ÷ (all treated) | 3% as per above example |
Absolute Risk Reduction (ARR) | CER (risk in control group) - EER (risk in treatment group) | 4%-3% = 1% absolute reduction in death |
Relative Risk (RR), aka Risk Ratio (RR) | EER (risk in treatment group) ÷ CER (risk in control group) | 3% ÷ 4% = 0.75 (the outcome is 0.75 times as likely to occur in the treatment group compared to the control group). The RR is always expressed as a ratio. |
Relative Risk Reduction (RRR) | ARR ÷ CER (risk in control group) | 1% ÷ 4% = 25% (the treatment reduced risk of death by 25% relative to the control group) |
Number Needed to Treat (NNT) | 1 ÷ (ARR) | 1 ÷ 1% (i.e. 1 ÷ 0.01) = 100 (we need to treat 100 people for 1 year to prevent 1 death) |
Relative Risk (RR) is a ratio of the probability of an outcome in an exposed group compared to the probability of an outcome in an unexposed group
RR = 1 | No association between exposure and disease |
---|---|
RR > 1 | Exposure associated with ↑ disease occurrence |
RR < 1 | Exposure associated with ↓ disease occurrence |
Relative risk reduction (RRR) measures how much the risk is reduced in the experimental group compared to a control group. For example, if 50% of the control group died and 25% of the treated group died, the treatment would have a relative risk reduction of 0.5 or 50% (the rate of death in the treated group is half of that in the control group).
Absolute Risk Reduction (ARR) is the absolute difference in outcome rates between the control and treatment groups (i.e. - CER - EER). Since ARR does not involve an explicit comparison to the control group like RRR, it does not confound the effect size of the treatment/intervention with the baseline risk.
Number Needed to Treat (NNT) is another way to express the absolute risk reduction (ARR). NNT answers the question “How many people do you need to treat to get one person to remission, or to prevent one bad outcome?” By comparison, an RR or RRR value might appear impressive, but it does not tell you how many patients you would actually need to treat before seeing a benefit. The NNT is one of the most intuitive statistics to help answer this question. In general, an NNT between 2-4 means there is an excellent benefit (e.g. - antibiotics for infection), an NNT between 5-7 is associated with a meaningful health benefit (e.g. - antidepressants), while an NNT >10 is at most associated with a small net health benefit (e.g. - using statins to prevent heart attacks).[4]
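The measures above all follow from two numbers, the control event rate (CER) and the experimental event rate (EER). The sketch below computes them for the relative risk reduction example above (50% of controls died versus 25% of treated patients) and for a second, hypothetical trial with a much lower baseline risk. The function name `risk_measures` is illustrative.

```python
def risk_measures(cer: float, eer: float) -> dict:
    """Compute ARR, RR, RRR and NNT from control and experimental event rates."""
    if not 0 < cer <= 1 or not 0 <= eer <= 1:
        raise ValueError("CER must be in (0, 1] and EER in [0, 1]")
    arr = cer - eer  # absolute risk reduction
    return {
        "ARR": arr,
        "RR": eer / cer,   # relative risk (risk ratio)
        "RRR": arr / cer,  # relative risk reduction
        "NNT": float("inf") if arr == 0 else 1.0 / arr,
    }


high_baseline = risk_measures(cer=0.50, eer=0.25)  # example from the text
low_baseline = risk_measures(cer=0.02, eer=0.01)   # hypothetical trial

for label, m in (("50% vs 25%", high_baseline), ("2% vs 1%", low_baseline)):
    print(f"{label}: RRR = {m['RRR']:.0%}, ARR = {m['ARR']:.0%}, NNT = {m['NNT']:.0f}")
```

Both trials have the same 50% relative risk reduction, but the NNT is 4 in the first and 100 in the second. This is why an RRR on its own can look more impressive than the benefit a patient can expect.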
OR > 1 | The exposure is associated with greater odds of the outcome |
---|---|
OR = 1 | There is no association between the exposure and the outcome |
OR < 1 | The exposure is associated with lower odds of the outcome |
| | High Pesticides | Low Pesticides |
---|---|---|
Leukemia (Yes) | 2 (a) | 3 (b) |
Leukemia (No) | 248 (c) | 747 (d) |
Total | 250 (a+c) | 750 (b+d) |
Note that even though the RR and OR are both approximately 2 here, they do not mean the same thing!
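Both measures can be computed directly from the four cells of the table above, using the same a, b, c, d labels (columns are exposure, rows are disease status). This is a minimal sketch and the function names are illustrative.

```python
def relative_risk(a: int, b: int, c: int, d: int) -> float:
    """Risk of disease in the exposed column divided by risk in the unexposed column."""
    risk_exposed = a / (a + c)
    risk_unexposed = b / (b + d)
    return risk_exposed / risk_unexposed


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds of disease in the exposed column divided by odds in the unexposed column."""
    return (a * d) / (b * c)


# Cell counts from the pesticide table: a=2, b=3, c=248, d=747.
rr = relative_risk(2, 3, 248, 747)
odds = odds_ratio(2, 3, 248, 747)
print(f"RR = {rr:.3f}, OR = {odds:.3f}")
```

The RR compares risks (2/250 versus 3/750), while the OR compares odds (2/248 versus 3/747). The two are close here because the disease is rare in both groups, and they drift apart as the outcome becomes more common.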
In psychiatry, studies may typically look at the time to relapse for an event (e.g. - time until a depressive episode relapse). There are several “buzz words” that may be used:
Linear regression generates a beta, which is the slope of a line
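As a rough illustration of beta as a slope, the sketch below fits a straight line by ordinary least squares to a few made-up data points. The variable names and data are hypothetical, and real analyses would normally use a statistics package that also reports confidence intervals.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + beta * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx  # slope: change in y per one-unit change in x
    intercept = mean_y - beta * mean_x
    return beta, intercept


# Hypothetical data: y rises by 2 for every 1-unit increase in x.
beta, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(f"beta (slope) = {beta}, intercept = {intercept}")
```

Here beta is 2, meaning the outcome increases by 2 units for each 1-unit increase in the predictor.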
If the 95% CI for a mean difference between 2 variables includes 0… | There is no significant difference and H0 is not rejected |
---|---|
If the 95% CI for odds ratio (OR) or relative risk (RR) includes 1… | There is no significant difference and H0 is not rejected |
If the CIs between 2 groups do not overlap… | A statistically significant difference exists |
If the CIs between 2 groups overlap… | Usually no significant difference exists |
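
The first two rows of the table boil down to one check: does the interval contain the "no effect" value (0 for a difference, 1 for a ratio)? A minimal sketch follows, with made-up intervals and an illustrative function name.

```python
def ci_excludes_null(lower: float, upper: float, null_value: float) -> bool:
    """True if the confidence interval does not contain the null value."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    return not (lower <= null_value <= upper)


# Hypothetical 95% CI for an odds ratio; the null value for a ratio is 1.
or_significant = ci_excludes_null(0.8, 1.9, null_value=1.0)
# Hypothetical 95% CI for a mean difference; the null value for a difference is 0.
diff_significant = ci_excludes_null(0.5, 3.2, null_value=0.0)

print(f"OR 95% CI 0.8 to 1.9 excludes 1: {or_significant}")
print(f"Mean difference 95% CI 0.5 to 3.2 excludes 0: {diff_significant}")
```

The first interval includes 1, so H0 is not rejected. The second excludes 0, so the difference is statistically significant at the 5% level.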
Most studies only demonstrate an association (e.g. - antidepressant use in pregnancy associated with an increased rate of preterm birth). How can we decide whether association is, in fact, causation (e.g. - does antidepressant use in pregnancy actually cause preterm birth)? The Bradford Hill criteria, otherwise known as Hill's criteria for causation, are a group of nine principles that can be useful in establishing epidemiologic evidence of a causal relationship between a presumed cause and an observed effect.
Evidence-Based Medicine (EBM) typically will depend on the use of statistical and critical appraisal approaches. However, there are also inherent limitations and potential issues with the way EBM is applied in current medical research.