The GRADE Approach, an Introductory Workshop on Making Recommendations, Part 2

By Adem Lewis

So, the next hour or so — I will have to reduce it
to an hour — we’ll be focusing on assessing
the quality. And I will walk you through
these individual criteria that I mentioned earlier. Just remember our general
evidence hierarchy. And this is to remind you
that this evidence assessment and moving from evidence
to a recommendation is really a gray box when it comes
to the evidence assessment. I mentioned these examples
from the various guidelines about atrial fibrillation. So, what this is about is making this process a bit more transparent, removing this gray box, and providing information there. One of the criticisms that we frequently hear
from various groups is that their field is special and, because their field
is special, they can’t apply something
that is a generic framework. Now, my answer to that is that every single field is special: surgical interventions have a limited body of evidence, public-health interventions have their own kind of evidence, and so on. Because you're all different, you all are the same. And that is probably true
because it is about a framework that can be applied across
different interventions and we believe that it can
actually be applied and I will tell you how
it can be done, hopefully. So, we talked about
the determinants of quality. Let me start
with the first criterion, which is design and execution,
or risk of bias. So, there are general principles
for observational studies and for randomized trials. If you look at observational studies, the four key problems really are: failure to develop and apply appropriate eligibility criteria, which can lead to under- or over-matching in case-control studies or to inappropriate selection of exposed and unexposed groups in cohort studies; flawed measurement of exposure and outcome — once again, this leads to a difference in measurement of exposure, and therefore it can lead to bias; failure to adequately control for confounding; and incomplete or inadequately short follow-up. Those are the four key criteria. And it becomes pretty clear that the assessment tools that we have for observational studies at the present time — the Newcastle-Ottawa scale, for instance — won't fit every purpose. It is possible that, for your body of evidence from observational studies, you need to look at specific criteria that probably fall into these broad issues here. But sometimes you need to develop the data-abstraction tables specifically for your question. In randomized controlled trials, we have issues
such as lack of allocation concealment, the intention-to-treat principle violated when it shouldn't be, inadequate blinding of the different groups that could be blinded — outcome assessors, care providers, statisticians and so on, and patients, obviously — loss to follow-up, early stopping for benefit — trials stopped early for benefit likely overestimate results — and then selective outcome reporting. Now,
these are general principles, in terms of design
and execution of trials. I will give you an example now,
not from the vaccine field. I don’t want you to get involved
with the topic. I want you to get involved
with the methods, okay? So, this is an example looking
at serious adverse events from a medication given
for chronic asthma patients. And what the systematic review
authors did is they identified approximately 30 randomized
controlled trials that fulfilled
their inclusion criteria, and they specifically looked at the reporting
of severe adverse events. They found 30 trials
that are depicted here. I will make this larger. They used this —
the Cochrane “Risk of bias” tool and looked at three key criteria
for assessing bias — there’s the first criterion, limitations in design
and execution — looked at bias in these trials. And a green dot basically means that that was appropriately
performed. It’s a judgment by the authors
of the systematic review. In terms of what is reported, a question mark means
that it was hard to tell, it was not reported, and even contacting the authors
didn’t provide the information. And red means
that it was not done. So, in other words, for this
particular first example here, this study was a study
that should have reported on serious adverse events,
but it did not. So, in other words, there’s a risk of selective
outcome reporting. It’s something that was discussed widely over the past couple of years. There have been some nice papers in “JAMA” on selective outcome reporting. Anyway, over the body
of evidence of 30 studies, 30 randomized trials, this is
how it approximately looks. So, another way to depict this is that approximately 50%
of the studies possibly are at high risk
of bias, in terms of the overall
estimate of effect for serious adverse events. So, let’s say
that the meta-analysis showed a relative risk of 2
for serious adverse events. You don’t really know
how large it is because 50% of the studies
didn’t even tell you about them when they should have had
the data. Okay, this is an imperfect representation because it doesn’t weight the studies, but it gives you a simple count of the studies that didn’t report it. Okay. There are two things
that become clear. One is that it’s pretty
complicated, right, to make these judgments. But it would be more complicated for a general practitioner
to make this judgment than it is for somebody
who’s trained in the methods and who has spent time
reviewing the literature. The other thing
that becomes clear is that you do need to make
an overall judgment because all of these studies fulfilled your inclusion
and exclusion criteria. Now, you could go out and, say,
do a sensitivity analysis and look at whether
there is a difference in the estimates of effect. But under those circumstances, you don’t have the estimate
of effect. All that you know is that there are another 15 studies — we don’t know whether the risk
is even larger or smaller. You need to make
an overall judgment. And the judgment would be one — whether you think that
there is a risk of bias that would influence
your confidence in the estimate of effect. And I’ve provided you with this
information — about 30 studies, about 15 didn’t report on the serious adverse events. And the question now is, do you
believe that this risk of bias, of selective outcome reporting, would be a concern in expressing
your confidence in any estimate of effect
that you have? And you all have to answer. You can ask
for more information. Do you need more information? No? Okay,
then I’m going to ask you. So, who believes that,
in this example, simple example, the risk of bias, in terms of
expressing any risk estimate for serious adverse events,
would be a problem? You have to all make a judgment. And I will ask you at the end why you didn’t make
a judgment or — So, raise your hand if you think
it’s a problem. Okay.
Who thinks it’s not a problem? And who is unde– You think it’s not a problem? Okay. Why is it not a problem? [ Woman speaking indistinctly ] DR. SCHUNEMANN: Okay, then let
me go to “Don’t know/undecided.” So, who is — who thinks —
who’s undecided? So, tell — Okay. As far as I can tell, there was nobody who said
there was not a problem, or was there anybody who said
there was not a problem? Tell me why you’re undecided. [ Woman speaking indistinctly ] DR. SCHUNEMANN: Okay. Okay. What is the information
that you would need? [ Woman speaking indistinctly ] DR. SCHUNEMANN: Okay. Okay,
got you, got you, got you. Anybody else who was undecided?
Why were you undecided? Was it mostly because of
information that was missing? Or somebody who would say, you
know, “It’s not a problem”? Okay — “more information” —
I buy this argument. It’s a quick exercise. I can’t provide you with much more information
in the time frame, but the point is — So, if I remember correctly, there was nobody who said,
“This is not a concern.” Actually, a majority of you —
about 80% — probably raised your arm and
said, “It’s a problem,” right? “There is a risk of bias —
possibly a risk of bias.” And in the real world, one or two reviewers
would make this judgment and would possibly downgrade
the quality of evidence because this is really
a problem, right? It is really a problem because any estimate of effect
would be totally uncertain. Even if you get
an increased risk of 2, it may be actually 5
if everybody had reported it. Okay. All right. I’ll give you
another quick example. So, you may have a table that —
Is this — Excuse me. So, you may have a table
that looks like this — smaller body of evidence. This is a question that deals
with giving anticoagulation to patients with cancer, a
systematic review that we did, one of a series
of systematic reviews. And the table may look like
this, where now you say, “We can prevent mortality by giving patients with cancer anticoagulation, because they are at a high risk of having a thromboembolic event.” I get five studies. And when we actually
do the assessment on the five studies —
they were all blinded — there was really concern, major
concern, only in about one study where there was some loss
to follow-up that was greater
than we hoped it to be. So, it’s a judgment at the end
that you need to make. This is what you’ve got. You assess the key criteria — in this case,
for randomized trials. And you need to say, you know,
“Do we downgrade? Do we not downgrade?” Obviously, this would come — As you said,
you need more information, and you would go through
a detailed assessment. But if I were to present
at table like this — Is there anybody who would
want more information? If I were to ask you, “Do you think risk of bias is
a problem or not?” — Yes? [ Woman speaking indistinctly ] DR. SCHUNEMANN: Okay, so,
the duration of the study. Let’s say that they all had
adequate follow-up — so, they were all
under two years. Yes? [ Woman speaking indistinctly ] DR. SCHUNEMANN: Okay,
so, what were the issues with adequate sequence
generation? They were all described as
randomized controlled trials, but we couldn’t find
in the paper exactly the method
how they did it. And that was labeled
as “unclear.” So, we were very severe
in our ratings. At the same time — so it’s good
that you raised this issue — when people describe
allocation concealment, it would be very unusual,
very unusual that they wouldn’t
adequately randomize because it’s very hard
to conceal randomization without proper randomization. It’s almost impossible, right? So, we were very severe
in our rating. We said it was not
clearly described. But in reality, you could use
this collateral information to say,
“It’s probably not a problem,” in two of these studies. Yes? [ Woman speaking indistinctly ] DR. SCHUNEMANN:
Okay. What do you want to know? [ Woman speaking indistinctly ] DR. SCHUNEMANN: Okay,
the relative importance — Another way to look at this
is to say, “Did this study influence the estimate of effect?” Right? “Was this study in any way showing larger effects or smaller effects, based on the fact that they have incomplete outcome reporting?” And I can tell you that
that particular study — so, if you go back,
this was the study by Clerk. This study actually was very much within the overall
estimate of effect. So, it was not either showing
a larger effect or a smaller effect. So, that question
was raised earlier. This assessment frequently
requires you to look at a forest plot, right? This is a forest plot showing the individual
estimates of effect with confidence interval. We did a subgroup analysis here, but the red dots
represent the five studies. So, you need to look, right? And now you know
that this particular study didn’t influence the overall
estimate of effect, but it fulfilled your
inclusion/exclusion criteria. So, going back to this table,
I will now ask you, based on these five randomized
controlled trials — This is all that you have. This time, we can’t ask for more information. You need to make a judgment. In the ideal world, you would have more information,
obviously. I would ask you,
“So, who thinks, under those circumstances, the
risk of bias is a big concern?” Who thinks it’s not a concern? Okay. And there’s still some
people perhaps undecided. A couple people undecided
at the most. Everybody else thought it was not a concern. These were two relatively
clear-cut examples. There will be lots of judgment
in between, a lot of judgment in between. The key is to describe it
transparently, what actually happens. So, in the second example, we would say, “Despite the fact that one of the five trials had possibly incomplete follow-up, we didn’t downgrade the quality of evidence, because it didn’t influence the overall estimate of effect and because all of the other quality criteria are correct. So, overall, we would not downgrade.” So, it’s about transparency so that next time, when you update your review or your guideline and you have another four studies, you actually remember why you made a certain quality assessment, or, when somebody asks you why you made a certain quality assessment, you can be transparent. Okay, I will take you through these individual criteria. Now, the second criterion
is inconsistency of results. And what we are looking for relates to inconsistency between different study results. In other words, we are always looking at the totality of evidence. Are the studies showing different effects? And if they do, we look for an explanation. And we usually use the PICO format. That’s why the PICO format is extremely helpful. It will come back with indirectness, as well. We look at whether there were differences in the population, the intervention, the comparator, or the way that the outcome was assessed: even if the outcome itself is the same — say, myocardial infarction or allergic reactions — were there different ways in which it was measured? Does that explain the inconsistency? If we can’t explain the inconsistency, then we would usually downgrade. So we lower the quality. This is an example,
and it makes two points. One is, once again, we look at the totality of the evidence. These are five studies that compared reminders with controls — they used letter reminders in patient groups and then measured whether recipients of the reminders were immunized. And across the five studies — one, two, three, four, five — the overall estimate of effect was an increase of about 60% in the uptake of the immunization, with fairly tight confidence intervals here, and these five results. So, I’m gonna ask you — are there concerns
about inconsistency in the results
between these studies? Do you think that these studies
all pretty much show the same? Most of you are nodding. Okay. It’s pretty clear-cut. So, what are your criteria
to look at this? There is no one criterion, and there is no “yes” or “no”
answer, necessarily. But there are indicators
that make you confident. Perhaps — You were nodding. Perhaps you can — Why do you
think that they’re consistent? Anybody? Anybody? [ Woman speaking indistinctly ] If you use the microphone. Yeah. WOMAN: The estimate of effect is
all in the same direction, so — DR. SCHUNEMANN: Okay, so they’re
all in the same direction. Okay. All showing benefit. WOMAN: There are — Yeah, I mean, the overall estimate is — you know, has confidence intervals that don’t overlap one — DR. SCHUNEMANN: Okay, so they’re significant,
statistically significant. Yeah. What else?
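As a numerical sketch of the indicators being collected here, and of the two statistical measures the discussion turns to next (the heterogeneity P-value and I-squared), the calculation can be illustrated in a few lines of Python. The log relative risks and standard errors below are hypothetical, not the actual data from the five reminder trials:

```python
import math

# Hypothetical log relative risks and standard errors for five studies
# (illustrative values only, chosen to resemble a consistent body of evidence).
log_rr = [0.41, 0.53, 0.47, 0.44, 0.50]
se = [0.10, 0.12, 0.08, 0.15, 0.11]

# Inverse-variance weights and the fixed-effect pooled estimate
w = [1 / s ** 2 for s in se]
pooled = sum(wi * y for wi, y in zip(w, log_rr)) / sum(w)

# Cochran's Q: weighted squared deviations of each study from the pooled estimate
q = sum(wi * (y - pooled) ** 2 for wi, y in zip(w, log_rr))
df = len(log_rr) - 1

# I-squared: proportion of variability beyond chance, floored at 0%
i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

print(f"pooled relative risk: {math.exp(pooled):.2f}")
print(f"Q = {q:.2f} on {df} df, I-squared = {i_squared:.0f}%")
```

With these illustrative numbers, Q falls below its degrees of freedom, so I-squared comes out at 0%, the consistent case; in the heterogeneous example discussed shortly, with an I-squared of 97%, the same calculation would flag serious inconsistency.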
Anybody else? Okay. I can tell you
how I would have looked at this. I would have said —
Okay, the estimates of effect are all still
relatively similar. It always depends on the scale, but they all
are relatively similar. I would have possibly not looked
at whether they all show benefit because the inconsistency
can still be of concern if you’re looking for
a certain effect, and I’ll get back to that. But I would have possibly said
they’re all relatively similar. The confidence intervals
are all overlapping and — the confidence intervals
of the individual studies. And then we have two statistical
measures that tell us about inconsistency. They’re also not perfect. One is obviously the P-value, testing for heterogeneity. The P-value here was relatively high — basically, not rejecting the hypothesis of homogeneity. And the I-squared value here — I-squared is a measure of inconsistency, and it expresses the proportion of the variability that is due to variation between study results, as opposed to random variation within the studies. So, that —
A value of 100% indicates that there’s lots of difference
between the studies, between the estimates of effect
in the studies. 0% indicates
there is no concern. There are some
interpretation guides. But anyway, so,
there are four factors that we typically look at. And under those circumstances,
most of you — nobody had concern about this
being fairly consistent results. What about this? Similar type of intervention,
similar outcome. Are they homogeneous,
or is there inconsistency? Okay. Why are they inconsistent? They are on one side
of the point estimate. Why are you concerned
about the inconsistency? [ Woman speaking indistinctly ] DR. SCHUNEMANN:
Yeah. Pretty far apart. Okay. Only two of them. They’re far apart, right? And the confidence intervals
are not overlapping, as we just talked about. We talked about I-squareds —
97%. We talked about the P-value —
highly significant. So, we have these four
indicators, right? I showed you
two extreme examples. There will be lots of examples
in between, okay, where you need to provide
a much better description. But the point here is — it’s
an important point, actually — that they are probably
heterogeneous, despite the fact
that they’re on one side, because, as you mentioned,
the effect is quite different. So, say the intervention comes with certain costs — under those circumstances, perhaps no harm, but cost — and you were looking for an effect of at least a 3.5-fold increase. You would be uncertain whether it really is a 3.5-fold increase, because one study showed a relative estimate of effect of 1.92 and the other one of 6.77. So, the heterogeneity,
in terms of whether you could apply these
results in a recommendation without lowering
your confidence, would probably influence it. Okay. The heterogeneity
would influence it. So, we look at these four issues
here that I just described — I-squared, P-value, overlap
in confidence intervals, difference in point estimates. The third criterion is
directness of the evidence. I’ll move through this
relatively quickly, although it’s a very important
issue for you. We once again look at PICO. Is the evidence
that we actually find direct enough to answer
the question? Directness covers the issue
of generalizability, transferability, applicability. We strongly believe
that each of these concepts doesn’t really cover everything and that it is much better
to think of it in the PICO framework —
that is, once again, are there differences
between the evidence and how you would like
to see it applied? And so we look at —
Is the population the same? Or is it similar enough that
you don’t lower your confidence that the estimate
actually applies? So, when you transfer results
from adults to children, can we be confident that the results apply
to both population groups? Interventions — you are trying
to make a recommendation for a class of vaccines, and all of the evidence is
coming from the older vaccines, and there are a couple
of newer ones in the group. You need to make a judgment. Ideally, you have evidence
to support it, but sometimes you don’t, and you need to, once again,
generalize it or apply it to a slightly
different question. The comparator —
is your comparator appropriate? So, you may be interested
in making a comparison against the old vaccines, but all you have is information about no vaccine
being administered or placebos being administered, which would lower
your confidence. And then finally, outcomes — is it a patient-important
outcome or a surrogate? How confident can you be
that the surrogate outcome really is related
to the important outcome? So, hepatitis “B” infection
and liver cancer may be one example where
you need to make this judgment. But just by inducing
an immune response, you may not prevent mortality, and you would lower
your confidence, possibly. So, these judgments
are critical. And it comes back
to the question formulation, the issue
that you mentioned earlier. Sometimes we start with
the patient-important outcomes. We would like to see information
on liver cancer. This is what we are looking for, but we don’t find
this information, so what we represent is hepatitis “B” infection,
perhaps. And then you need to make
a judgment of whether you can apply
the relative estimates of effect to your recommendation. Can you take these relative
estimates of effect? Or is your confidence in these
relative estimates of effect — because it may be just driven by the baseline risk
for mortality — are you confident that they
actually really apply narrowly? Can you take them
and need to make a judgment that they’re indirect or direct? Okay? So, you use
the indirect outcome. You use the relative effect
measure, possibly, from hepatitis “B”
to make inferences about what happens
to liver cancer, in terms of different
baseline risks. When you do that, you usually —
or in many instances, you would lower your confidence
in estimates of effect. Okay, and then — That’s why it
is not necessarily covered well by the term “generalizability,
transferability.” Indirect comparisons
are critical here in making judgments
about directness — that is, if you’re interested in intervention “A”
versus intervention “B” and you only have information from intervention “A”
versus “C” and “B” versus “C,” then you may downgrade
the quality of the evidence or the confidence
in the estimates of effect because of this indirect
comparison. And that is frequently the case
for new interventions, obviously,
because usually the industry is not interested in
making head-to-head comparisons. So, for instance, you may have
one rotavirus vaccine being compared
to no intervention and another one being compared
to no intervention. But what you would want to know
is how well they compare against each other, just
hypothetically, as an example. So, under those circumstances, if you don’t have the direct
information, the direct studies, you need to make a judgment about how confident you can be in the indirect comparisons — that is, the estimate of effect that you get here versus the estimate of effect that you would be getting there. Okay. Publication bias. So, who feels he or she would like to hear a little bit more about publication bias? Okay, there are several people
in the room, so let’s do that. So, publication bias,
in our view, should always be suspected,
right? Unfortunately,
that’s probably reality, that a lot of the studies — in particular,
observational studies — remain unpublished. Investigators
investigate a hypothesis. It doesn’t confirm
the hypothesis. It ends in the file drawer. So that’s why it’s also called
the “file drawer problem.” You don’t write it up
because you have a better study that you would like to publish, and then you never get back
to what you’ve done previously. So, observational studies
are particularly prone to publication bias. You should suspect it when you have only a few small
positive studies, when there’s lots of for-profit
interest — that’s what the empiric
evidence shows. There are lots of methods
to evaluate it. None of them is perfect,
but it remains a problem. So, one of the ways to
conceptualize publication bias, for those of you who wanted
to hear a little bit more, is the following. So, in the 1980s, there was
a group of investigators that questioned whether
intravenous magnesium reduces the risk of dying in patients who had an acute
myocardial infarction. And they looked at
all of the studies that were actually reported, all of the randomized
controlled trials under those circumstances
that were reported. And they found a relatively
large evidence base. They found studies that,
in part, included
several thousand patients. Here you see the number of
patients on a logarithmic scale, and here you see the odds ratio
for the risk — or the odds of mortality if you
have received the intervention. And this is what
the investigators found. They then went on to calculate
an overall estimate of effect, the meta-analysis
that Salim Yusuf did, and they found that there was an approximate
20% odds reduction in the risk of dying
from intravenous magnesium. A group of investigators —
in part, the same investigators who were involved in the
meta-analysis — then went on to do
a mega-trial — 50,000 patients. And that’s what
the mega-trial showed. The mega-trial showed that the
odds ratio was 1.0, all right? So, a very large trial that didn’t confirm
this meta-analysis. So, why was that the case,
all right? And the answer is
in the picture, is in this particular picture. If you look at this carefully, any overall estimate of effect should be surrounded by the random estimates of effect of the various studies, right? And each of these dots represents an estimate of effect from a single study, with the dot reflecting the study’s size. So, just because of chance and randomness, you should see, in most circumstances, a random distribution of small studies around the point estimate of effect, on both sides of this particular overall estimate of effect. The smaller the study, the more
likely they are to deviate, and you would have a random
distribution of small studies that show much larger
or much smaller effects. In this case, that’s not what we see. All of the studies are positive
except for one, all right? And even the small studies
are only on one side. So, it’s an example
of publication bias, and it explains,
under those circumstances, why the ISIS-4 study
actually didn’t show it — because there are probably 10,
15 studies that were performed that actually showed an increase in the risk but were not published because they did not confirm the hypothesis. Okay, so, funnel plots are a good way to conceptualize
publication bias. They can be used. The guidance says
that you should have at least a certain number
of studies, between 5 and 10, in order to use this, but sometimes it gives you
information even with fewer studies. A funnel plot basically plots the size of the study (here expressed as the standard error, which is a better way of measuring it) against the estimate of effect. So, ideally, you would have
a symmetric distribution around this. Under certain circumstances,
you don’t have this. And then you should be
particularly more concerned about publication bias. So, there are no clear-cut
methods for doing this, but there are indicators, such as using these plots to assess publication bias — because if you don’t, an estimate of effect that appears to be, perhaps, 0.6 may, because of publication bias, truly be much larger, or you may see an effect when there was no effect. Okay? All right. Imprecision is the fifth criterion. Imprecision has to do
with small sample sizes, as well as small numbers
of events. Both of these lead to possibly wide confidence intervals, and that causes uncertainty about the magnitude of the true effect. I’ll give you
a couple of examples. When we do this, we look
at the totality of evidence. This is a similar type of
example that I showed you before from the immunization
literature. Under those circumstances,
the investigators try to address whether inactivated vaccines
are related to otitis media — vaccines to prevent influenza. And the outcome was
otitis media, as I just said. And they found one
randomized controlled trial that addressed
that particular question and that particular patient
group, and this is the result. So, about 120 people involved,
24 events. This is the estimate
of effect — 0.48. Confidence intervals like that. So, who would think there
is a problem with imprecision? Yeah. Why is there a problem
with imprecision? Wide confidence intervals,
right? So, it’s wide confidence
intervals. We’re not necessarily worried
about a single study because a single study
can be very large, very tight confidence intervals. But that’s basically the point. So, you wouldn’t know whether
it’s really associated with it. As I said, only 24 events — small number of patients,
overall. Alternative example — many of
you have seen these examples, but I think they basically
make the point. Once again,
similar type of question. The question is not
that relevant here, but under those circumstances, there are now 5 studies,
about 1,700 patients enrolled in these studies. This is the estimate of effect. Who would be concerned
about imprecision? Nobody? Okay, so, the point to make here is — yes, the estimates of effect are precise. There are approximately
250 events in 1,700 patients. That’s really a good —
That’s a large number of events and a fairly large number
of patients. We are not concerned
that a single study is not statistically
significant. We don’t look at single studies. We look at the overall
estimates of effect when we assess imprecision,
but we consider all the studies, even if they contribute
very few events, ideally also if they contribute
no event. So, the reason here is that there are lots of events and lots of people. So, we provide some guidance
for assessing imprecision, and I will take you through this
relatively quickly because that is
one of the issues that most frequently comes up. And in fact, there’s
a reliability study that the evidence-based
practice centers just did, and I think this is out
for consultations, not confidential information. But one of the major reasons for not coming up with the same ratings for quality or confidence in estimates of effect was that there was disagreement about judging imprecision. And the main reason is that the people who are doing these types of assessments didn’t have any guidance. Any guidance will likely be imperfect, but it will still lead to greater consistency and provide help for those who need to make judgments
about imprecision. So, we’ve developed
some criteria. The first one reads, “If the 95% confidence interval in a systematic review excludes a relative risk of 1.0 and the total number of events or patients exceeds what is called the optimal information size criterion, precision is adequate.” So, in other words, if you go
back to that example — statistically significant. And for a systematic review,
you may want to say that this is precise if the optimal information size
criterion is fulfilled. The optimal information size
criterion is nothing else but a reasonable
sample-size calculation for this particular
intervention. So, to come back to that… And what we frequently find is
overestimates of effect, right? And we also frequently find spuriously statistically
significant findings. They happen 5% of the time, right? So, one way of addressing whether — beyond statistical
significance — we can be confident that we really have
a sufficient body of evidence is to look at optimal
information size. And it would be
a sample-size calculation, just as for any trial
or observational study, a power calculation,
where you calculate, given a certain estimate of effect, whether you have a sufficient number of events and a sufficient number of people enrolled in the study to actually show a certain difference. And if that’s the case,
in addition to seeing the statistically
significant findings, you can rest assured that imprecision is unlikely
to be a problem. And sample size calculations
are simple. They are based on the P-value,
obviously, on the power, and on the difference
that you actually expect. And that pretty much
is all that you need. And that, once again, can be
done very, very simply. It’s one guide. And then if the 95%
confidence interval includes appreciable benefit
or harm — and, you know,
the suggestions are to use relative risk of 0.75 or 1.25
as a rough guide — rating down for imprecision
may be appropriate, even if the optimal information
size criterion is met. In other words, if you find
a relative risk estimate of 1.0, but the confidence intervals
are very, very wide, you obviously can’t exclude that there is still appreciable
benefit or harm. And that should worry you and
shouldn’t make you too confident that there really
is no association. And the guides
are what they are. They would depend
on the thresholds that you would like to provide. The optimal information size — Once again, just to provide
more guidance here that you can find
in the handbook there, as well, and the profiler,
we suggest the following rule — if the total number of patients
included in a systematic review is less than the number
of patients generated by a conventional
sample size calculation for a single
adequately powered trial, consider rating down
for imprecision. And as I said, this has been
called the optimal information size. This can be made clearer
visually — say this line marks a
relative risk reduction of 25%, and this one a relative
risk reduction of 0% — so, no effect. If you find a result that looks
like this in your meta-analysis — with a point estimate larger than
a 25% risk reduction and the confidence interval
not overlapping it — it’s pretty clear-cut —
the results are not imprecise. If you find something like that and your threshold for relative
risk reduction is really 25% — and this is what you would need
to achieve in order to be confident that the results
are precise enough — despite the fact that they may
be statistically significant, you may rate down
for imprecision because you really are not
confident that the effect that you would try
to achieve is achieved. At the same time — this is
the example that I described — you may see no effect
of an intervention, and the confidence interval
may be relatively narrow and not include what we use
as a rough guide — the 25% relative risk reduction. You may say,
“This is precise enough. “We shouldn’t downgrade further. “We don’t expect
additional information to change this dramatically,” as opposed to a situation
like this, where, despite the fact
that you have no effect, your confidence interval
still includes the possibility of an appreciable benefit
or harm. Under those circumstances, you
really are not very confident that you can really say
that there is no effect. And the issue, then,
when you go to guidelines, becomes that your thresholds
are becoming key. And I’ll — The thresholds
usually are based on absolute estimates of effect. So, just to take you through
this relatively quickly — So, if, for instance, you would see mortality
estimates as follows — so, these are absolute estimates
of effect — risk differences of 2%, 0.5%,
0%, and a 0.5% increase. So, let’s assume that your threshold for applying
an intervention would be a risk difference
of 0.5% — that is, one
fewer death per 200 people who receive
the intervention. And if your true estimate
of effect was the following — right? — so, this is including
thresholds — was the following, you would say, “Okay,
I have enough information. “I’m pretty confident
that these estimates of effect “are good enough for me to say that we don’t need
to downgrade.” If your threshold, however,
because of cost, downsides, and other side effects,
would be a risk reduction of approximately 1% —
which comes with an NNT
of 100 — so, a risk difference of 1% — and if your true estimate
of effect was the following, despite it showing benefit,
it would cross this line. You may still seek
more information, or you would ask
for more information, and you might downgrade
for the quality of evidence. Okay? So, thresholds
really become important when you look at guidelines. Okay, there are rough guides
that you can use when you make this judgment. These are curves
that we’ve produced which basically tell you about
the optimal information size, and you can see
where your body of evidence actually falls on these curves. What it explains is,
if you are above the line, the optimal information size
criteria are met for the various relative
estimates of effect, the control group event rate,
and the total sample size. So, this is fairly easy to apply if you use this
as a rough guide. Okay, so, that will hopefully
help with making judgments about precision and imprecision. The last 15 minutes
or 20 minutes that we have before the break, I will focus on what can raise
the quality. There are three criteria. The first is when you have
very large estimates of effect. And we use two guides. One is relative risk reduction
of 50% or a relative risk of 2, or we are even more confident
that an effect truly exists if the relative risk reduction
is 80% or the relative risk is 5. You can just think of this
as the parachute example. Sometimes the
observational-study evidence just comes
with such large effects that we increase our confidence
that the effect truly exists. And I will not spend
much more time on that. These are the guides
that we have out there. They are related to precision, so ideally these estimates
are also precise. Okay. And then I give you
just one example to clarify one frequent misunderstanding
and that is… Now, what if you have
an estimate of effect that is statistically
significant and exceeds the effect
that I just mentioned, the relative risk of 2,
but the individual studies are not necessarily showing
a relative risk
of at least 2? This is just to emphasize
the point. We do look at whether
the results are precise enough when we make a judgment
about the magnitude of effect. But we base it on the overall
estimate of effect. So, if the overall estimate
of effect is larger than 2, with sufficiently narrow
confidence intervals — ideally those that meet
the optimal information size — we would upgrade
the quality of evidence. And that’s one example. The next example
for raising the quality is some hypothetical example
here, in this case
that we have discussed with this age group
in particular because that was a question
that they had — whether dose-response relations
also apply to populations — because dose-response relations
are one criterion for upgrading
the quality of evidence. So, if we look at this
hypothetical example of vaccine efficacy, if you would find the following
on a population level — If 50% of the population
was immunized and you find a 20% lowering
of the risk, as opposed to when the proportion of the population
that is immunized becomes larger and there would be an even
further lowering of the risk, this would qualify
as a dose-response relation because it clearly expresses that there’s efficacy
of a vaccine, right? The more people
who are immunized, the lower is the risk
on a population level. So, that would be one example
for dose-response relation, as opposed to the other
clear-cut dose-response relations
in the clinical arena. And then the third criterion
that is a bit more complicated relates to plausible residual
confounding and biases, and they may be working
to reduce a demonstrated effect or increase an effect. And I will present you with an example from
the clinical literature and then tell you
that it applies to the vaccine literature,
as well, in our view. So, those of you who have
treated patients with diabetes or have relatives with diabetes
may know that, based on early work
on the drug phenformin — this drug may cause
lactic acidosis, which is a very
severe condition, with a 50% mortality — that a closely related drug,
metformin, was suspected to have
the same toxicity. In fact, it’s on the labeling
of the drug, and it’s very commonly known
amongst practitioners because there was a lot
of concern by the producers that the drug
would actually cause it. So, given that that is the case,
it is very likely that there would have been
over-reporting of this particular complication because all the practitioners
who treat patients, all the G.P.s —
or many of them — would be aware of this
particular problem. Now, when people actually looked
at the evidence and looked at the large
observational studies that were done, despite the fact that there
was likely over-reporting, the observational studies failed
to demonstrate an association, so that all biases — or in
this case, outcome reporting, which can be considered
a bias — would have worked toward
showing an association, and yet
no association was observed. Under those circumstances, groups may increase
the quality of evidence. And a good example is really
from the vaccine field — in my view, at least — and it
has to do with the MMR vaccine and the concern about autism,
where there was a lot of concern based on this spurious
publication or the publication
in “The Lancet.” And this is based
on a Cochrane review. When the Cochrane reviewers
actually looked at the evidence, at the observational studies
that were out there, they did not find an increased
risk of autism, right? They did not find it. So, despite the sensitivity created by the publication
in “The Lancet” — which was later withdrawn —
and the possible over-reporting
of vaccine adverse effects — in this case, autism — despite all that, there is
no association. And that may increase
your confidence in the fact that there is no association, exemplified by the fact
that “The Lancet” had to actually withdraw
the publication with the falsified data,
the fabricated data, which actually
supports it, right? Okay. So, those are criteria for upgrading the quality
of evidence. There are three criteria for upgrading the quality
of evidence so that overall, we would be looking
at the following. We would be looking at —
And I apologize. I can’t move from here — I was told not to, except
for taking the microphone. So I need to stay here and turn around a little bit
in order to point to the screen, but that we would start
with a body of evidence from either randomized trials
or observational studies. We don’t suggest mixing them. We suggest if you have a body of
evidence from randomized trials and a body of evidence
from observational studies for one outcome
to do this separately and see which body of evidence
ends up as higher quality and report that or report both, that you would start with
an initial quality rating of high or low
and you would assess whether these five criteria
lower the quality of evidence and whether these three criteria
increase the quality of evidence and then come up
with an overall rating that is in four categories, clearly recognizing
that this is a continuum. There is a continuum
of the quality of evidence. We use categories in order
to facilitate communication and to provide
an estimate — or information — to practitioners about how confident
they actually can be in this particular
recommendation. That is the overall summary. This is done by outcome. The overall quality
of the evidence reflects the extent then
of our confidence that the estimates of an effect,
as I said, are adequate to support
a particular decision or recommendation. Guideline developers
must specify and determine the importance
of all relevant outcomes in our view. And the overall quality, as I
also said earlier, of evidence is based on the lowest quality
of all critical outcomes. So now this is something
to pay attention to. It has to do with assessing the
overall quality of the evidence. Now — this frequently comes up — there are concerns
that we over-penalize or that we are too severe,
too stringent in the application
of these criteria. So, let’s assume that we have
a systematic review, a meta-analysis of several
critical and important outcomes. Okay? The intervention may just be
any intervention. And hospitalizations were considered to be
a critical outcome. And this is what you would find. You would find a risk reduction
for hospitalizations. No downgrading takes place. It’s high-quality evidence
for hospitalizations. Let’s assume that you have
a second outcome, which is mortality. It is considered critical. And the quality here
is moderate, and perhaps this is due
to imprecision, because you’re not entirely sure whether mortality is
increased or decreased — the direction of the effect
on mortality is uncertain. And let’s now also assume
that you have a third outcome that is rated as important, but
not critical, which is nausea. And it comes with the following
estimate of effect. And the fourth outcome,
that is serious adverse events. It’s critical,
and it’s considered high because the confidence interval is considered to be
narrow enough under those circumstances. So based on what I said before, what would be the overall
quality of the evidence? It’s a test question,
the first test question. There are gonna be
lots of test questions. No, I’m just kidding. So, what do you think the overall quality
of the evidence should be? Let’s say that these
were the four outcomes and this is what you find. Moderate.
So, why is it moderate? It’s the lowest critical, right?
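The rule being applied here can be captured in a few lines of code — a hypothetical sketch of the “lowest critical outcome” rule as just stated, before the exceptions discussed later; the function name and data layout are my own, not from GRADEpro:

```python
# Sketch of the rule just stated: the overall quality of evidence is the
# lowest quality among the *critical* outcomes; outcomes rated merely
# "important" (like nausea here) do not pull the overall rating down.
# Names and structure are illustrative, not GRADEpro's.

QUALITY_RANK = {"high": 4, "moderate": 3, "low": 2, "very low": 1}

def overall_quality(outcomes):
    """outcomes: iterable of (name, importance, quality) tuples."""
    critical_qualities = [
        quality for _, importance, quality in outcomes if importance == "critical"
    ]
    # The lowest-ranked quality among the critical outcomes wins.
    return min(critical_qualities, key=QUALITY_RANK.__getitem__)

example = [
    ("hospitalizations", "critical", "high"),
    ("mortality", "critical", "moderate"),
    ("nausea", "important", "low"),
    ("serious adverse events", "critical", "high"),
]
print(overall_quality(example))  # -> moderate
```

Nausea’s low quality is ignored because it was rated only “important,” which is exactly why the answer here is moderate rather than low.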
It’s pretty straightforward. It’s the lowest critical,
so, yes. So, the overall quality
of evidence is not low because nausea, despite the fact
that it is only low quality, it was rated as important
and not critical. So moderate. Okay. Let’s assume the following case. Now, mortality is a critical — So these are all critical
outcomes now, all critical outcomes. Mortality —
It’s high-quality evidence that this intervention
reduces mortality. You were interested in disease-specific
quality of life. It was rated
as a critical outcome — moderate due to imprecision. Hospitalization
was also high quality. And serious adverse events
were also high quality. It was felt that the confidence intervals were narrow
enough to not downgrade. So, what would the overall
quality be here? They’re all critical. Okay. Why would it be high? Okay. So, if we were to apply
the criteria that I just said, that it would be based on the lowest quality
of the critical outcomes, it would be only moderate,
right? But either you did the reading or our common sense was similar
to your common sense that it would be wrong to
penalize this body of evidence. So, you said three out
of the four were high. That could be one way
of dealing with it. Our way, or the way that
we apply this criteria — because it is really important
for many of your questions, I believe —
is the following. This outcome
that would determine the lowest quality of evidence is actually going in the same
direction, right? And even having more information
about it would not alter
the recommendation that you would like to make because there are
two critical outcomes that clearly go
in one direction. They cross the threshold for
recommending an intervention against serious adverse events. And it is very unlikely, apart from the fact
that I just mentioned, that you would ever get
more information on the disease-specific
quality of life. But the point is,
it goes in the same direction with the other
critical outcomes, and under those circumstances, we would not penalize
the body of evidence and maintain
a high quality rating. And that, in particular,
once again, if the threshold for
the acceptable harm is crossed. So here, this is the threshold that the serious adverse
events should fall within, considering the benefits
that are obtained. Okay? So it’s important. So the quality of the evidence
would be high, rather than moderate. Okay. Last example —
All critical outcomes — hospitalization is one outcome, disease-specific quality of life
is another outcome — both high; mortality is moderate, and the serious adverse events
are high. And if you take all
of this together — You know, if you take these effects
together and then look at how large
a plausible increase in the risk of serious
adverse events you would be willing to accept
in order to recommend this — If you consider that and if you consider that it
wouldn’t cross the threshold, that it would not be clearly
on one side of the threshold, it means that you really do need
additional information and that your overall confidence
really should be reduced. And under those circumstances,
rightly so, the overall quality of
the evidence would be moderate, based on the critical outcomes
that you have here, the lowest critical outcome,
in particular, because the threshold
is not crossed. Okay? So the overall quality is determined by the lowest
critical outcome, except for the circumstances, the situation
that I described there. Okay, so, then, to sum up
this presentation — Again, I’m glad that we are
staying on time. The rest of the afternoon —
in particular, after lunch — will be much more your work. In fact, it will be your work
for the most part. I will speak relatively little,
except for in my small group. And you will apply what
you learned in the morning. Obviously, no tests,
but a hands-on exercise. So, to sum up, we defined the
overall quality of the evidence, and this is very much in line with what the ACIP
has decided to do. It basically determines
the confidence in estimates of the effect
of the intervention. When we provide a rating of 1,
4+, “A,” or high, we are very confident
that the true effect actually lies close
to the estimate of effect that we have obtained. Down to the very low category, where we have
very little confidence in the estimate of effect
and the true estimate is likely to be substantially
different from the estimate of effect
that we have in front of us. So, what we have covered so far
is the following. We’ve talked about
question development, we’ve talked to some degree
about the evidence profiles. We’ve talked about the eight
criteria, with examples. And we’ve talked
about this issue of how we arrive at the overall
quality of the evidence. And this is what
we’re gonna apply in the first
small-group session. Okay? In the break, I’ll be happy to
help those who had difficulties with installing the software. Possibly, I can solve some
of the problems, hopefully. But what we will be doing
after the break is, I will just provide
a very brief introduction. We will assign small groups
or divide up into small groups, and the introduction will tell
you what you will be doing for the rest of the afternoon. Then we’ll get together once
more in the large group, where we’ll talk about, once
again, this process, then. And you will do this in
the small group, then, as well. And we finish up with issues,
challenges, questions, and feedback. We have two minutes, three
minutes for questions until… It’s two minutes to 12:00. Are there…?
Do you have any…? Yes. WOMAN:
…with the different examples of how you would rate things
with strength of evidence. Back a few. Okay, yeah, in this situation, what about the situation where,
for the adverse events, you had poor quality evidence,
low evidence, but it didn’t show any risk? DR. SCHUNEMANN: Okay. So, there must be reasons for
why you rate it as low quality. And given that there are reasons
for low quality and given that
you must have determined that they are critical
for decision making, we typically suggest that you lower your overall
quality of the evidence because, once again,
you’ve determined that it’s critical for decision
making, and for some reason, you don’t have the information
that you really need. Now, there are some exceptions,
as I’ve said earlier. Sometimes there are suggested
or surmised some severe adverse events
that really don’t exist. And all that you can gather is observational study
evidence, right? And you fail to detect
an association. And under those circumstances,
it is maybe appropriate to say that this is really
not a critical outcome. It may be important
because you want to report it, but it is not a critical outcome
because there is no association. People make up
all sorts of associations, but sometimes they don’t exist. And in particular, this could
be, then, supported by when the events are very, very
rare in a certain situation for the severe adverse
consequences. And under those circumstances, it’s not that the evidence
would change, but that your rating of whether that’s critical
or not would change, and therefore,
it would not be downgraded for the overall quality
of evidence. You see, because people make up
serious — Well, “make up.” But we always think of a lot
of severe adverse events, and sometimes they just
are not related to it. Well, it really, unfortunately,
happens after the fact, but when you have the evidence, that’s why you can’t completely
separate it. You make the judgment after you’ve put together
the evidence profile because you find out
that despite the fact that the panel thought
it was a critical outcome, it really is not
a critical outcome because it’s not associated. It doesn’t happen. So a practical example —
For certain depression drugs, there was a concern about
increased suicide rates, and when people looked at actually hundreds of thousands
of patients in registries, there was no indication of
an association of suicide risk with this particular
antidepressant drug. There were, in a million
patients, only 20 events, and they were clearly — Once again, the relative risk
was not increased. So that,
under those circumstances, it may not be a critical outcome
related to that intervention. And then, there is another way
of dealing with this that I will explain better
after the break. Okay? DR. SCHUNEMANN: Once again, this
is for you to work sequentially through the process, and the
software can be downloaded. It’s free.
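Before the walkthrough, it may help to picture the information the software accumulates as you work through it sequentially. A rough, hypothetical sketch for orientation only — the field names are mine, not GRADEpro’s actual format:

```python
# Hypothetical sketch of what an evidence profile gathers as you work
# through the tool sequentially; names are illustrative, not GRADEpro's.
profile = {
    "question": {
        "intervention": "rotavirus vaccine",   # example values only
        "comparator": "no vaccination",
        "population": "healthy infants",
        "setting": "outpatient",               # optional field in the tool
    },
    "outcomes": [
        {
            "name": "hospitalizations",
            "type": "dichotomous",             # or "continuous"
            "n_studies": 4,                    # number of studies identified
            "design": "randomized trials",
            # The sequential quality assessment, one criterion after another:
            "judgments": {
                "risk_of_bias": "not serious",
                "inconsistency": "not serious",
                "indirectness": "not serious",
                "imprecision": "not serious",
                "publication_bias": "undetected",
            },
            # Footnotes record *why* each judgment was made.
            "footnotes": ["narrative explanation of each judgment"],
        },
    ],
}
print(profile["outcomes"][0]["type"])  # -> dichotomous
```

The point of the sequential structure is the one made repeatedly in the talk: you don’t forget things, and every judgment carries a recorded reason.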
We regularly update it. There’s a really
major, major update expected in about a year or so, year and a half that will
transform this into a guideline-authoring tool
where you actually can, then, just produce
an entire recommendation. There is already
a recommendation module in here. So let me just start. When you open it up, you would
typically create a new profile, which is shown there. So you click on your profile. Oh, this is to remind you that there is this
context-specific help file, which, as I just said to Jean,
it’s about 300 printed pages, so no normal human being
would ever read this. But that’s why
it’s context-specific, so when you actually have
a question in the profile, while you’re producing
something, and you move the cursor around,
you get the question mark. And that’s when you can click,
and it brings you to the relevant sections
in the handbook. Or you have the search function
and things like that, which is really very helpful
for most groups. So you start by defining
the profile group, and what we mean by profile
group is simply — So, frequently, you would work
on recommendations in relation
to hepatitis “B” vaccine, and there would be different
recommendations, so you could just call this hepatitis “B” vaccine
recommendation. Or you could —
Under those circumstances, we are looking at rotavirus. You could just say rotavirus
profiles or something like that. It’s just, then,
because in many groups, we work on
very different questions, and they wanted this
to be separated, obviously. So profile groups. And then when you open this,
you just say “Add a profile,” and that will then open
the interface with a question formulation
just as a reminder. This is what you have to choose. There are four different types
of questions that you can choose from. This is not to be filled out unless you click
the edit button. This comes up automatically
once you select the question. [ Clears throat ] And this is how the question
might be formulated. So, is there a comparison? Is it dealing with a health
problem or a healthy population? There’s a pulldown menu there,
intervention, the comparator. And the health problem
or the population, depending on what format
you are choosing. And, then, the setting
is something that you can fill out
or not fill out. It relates to whether
you’re dealing, for instance, with an inpatient setting
or an outpatient setting or you can just click on
this particular word, and the help file will explain
what we mean by that. And, then, the next step
is to — Then you start adding
your outcomes, and your outcomes are — I will go through that
in the handout in a second. The outcomes are obviously
determined by the guideline panel, by the group
that you’re working with. And you start by adding
the first outcome. Input the outcome name, and then it really takes you
through it sequentially. You need to decide whether it’s a continuous
or a dichotomous outcome, whether it is pooled
in a meta-analysis, or there are different pulldown
options there. Then you start by entering
the number of studies that were identified. Once you’ve done that, it asks you for the underlying
study design. And then you do your quality
assessment, one after another. And the big benefit here is,
as I’ve repeatedly said, you don’t forget things. You work through it
sequentially, and you make sure that you
provide information that is relevant
for your judgment, indicated to you
by the red arrow. These little pin buttons there mean that you should provide
an explanation or a footnote. We’ve called it footnote here. There’s a footnote manager, so you provide
a narrative description of why you are making
certain judgments. It serves as a record. It actually is automatically
integrated in the evidence profile. And then you get to the screen that asks you to provide
the numerical information. So we have different ways
of filling this out. One is, we can direct the import
from Cochrane review if you have the html files
or the relevant files. Alternatively,
the information is put in. You put in the number of events and the total number
of patients, the relative estimates
of effect. One question that frequently
comes up is how you choose
these control group risks. I’m just saying it now. It will become clear as you are
working through it. The control-group risks
basically reflect reasonable estimates of
what we just talked about — burden of disease or
baseline risks in a population. And it helps you because
the relative estimates of effect are usually similar or the same. It helps you to determine what happens in terms of
absolute estimates of effect. So, for instance, while,
for this particular outcome, the relative risks
may be 0.92 — so a very small 8%
relative risk reduction — it does influence how large
the absolute effect is on the basis of what you think
your burden of disease or your baseline risk truly is. And, once again, the absolute estimates of effect
can differ quite dramatically. So obviously, with about
1/3 lower baseline risk, you would have 1/3
lower absolute benefit or 1/3 lower risk difference. So it’s important
to sometimes just — We talked about it over lunch — to just get a sense of what the different burden
of disease or baseline risk does to your absolute estimates
of effect. Then, quickly, to continue,
there’s a possibility of including continuous
outcome measures. That is not part of the example
here, but it’s all there. And then… Then you can actually preview what you’re doing
in the preview tab, preview the summary
findings table. You can see how it looks as you
progress through the exercise. And there is also the print
and export function in this particular screen. Two different ways
of presenting it — complete evidence profiles or we have
the summary of findings tables, which is the same
information, just condensed. You may want to take
a look at that. And that, I think,
is pretty much it. If you continue to use
this software, you can check for software
updates occasionally. As I said, we are doing
this frequently. So, any questions
about the software? This was about four minutes,
so we will do much more as we are moving
through the small groups. DR. SCHUNEMANN:
So, I’m going to speak about moving from evidence
to recommendations. And as I said earlier, I showed this gray box in terms
of the quality assessment, the quality of evidence
assessment. When it actually comes
to formulating recommendations on the basis of the evidence, we have probably been dealing
with a black box because it’s much more
complicated. And laying out the processes
that we went through is probably also more
complicated because there’s less written
about it. So let’s talk about how we get,
actually, to a recommendation and how we can perhaps make this
black box more transparent. Starting with the definition
that I provided to you — “The strength
of a recommendation reflects the extent to which
we can, across the range of patients, populations,
or people for whom the recommendations
are intended, be confident that the desirable
effects of a management strategy outweigh the undesirable effects.” And, you know, we make this
category “A” or category “B” recommendations there. I mentioned the four factors
to you. I will not go through them
in great detail anymore because, you know,
you have them in the handout, and we’ve talked about them
a little bit, but I wanted to give you
a little bit of — Well, I can skip that. It’s not, maybe,
that important to you, as I’m talking too much
about history. So, it is about weighing the benefits against
the downsides here — or the desirable effects
against the downsides. Now…it is pretty clear
that each of these outcomes could potentially provide
benefits against the downsides and lead to this whole
balance here being a continuum. The way that ACIP and other
groups are doing it is that there
are two categories — category “B” and category “A,”
or strong and conditional, and four, again,
as I’ve repeatedly said. But it is a continuum, so we
just need to be aware of that. However, the continuum doesn’t necessarily
help practitioners in making decisions because either they are being
told to do something — If it comes to immunizations, I would have expressed it this
way, the “just do it” — or they, perhaps, need to think whether the patient
in front of them is the appropriate candidate. But there is not much more
in between. It’s either this or the other
so that at the end, it’s really dichotomous
decisions. And we provide a little bit
more gradient to that decision by having what we would say,
perhaps, four categories. It’s either do it
or don’t do it — stop vaccinating people
with certain vaccines — or it’s in the middle. And in the middle, a guideline
panel may still say, you know, “We think that it’s
usually providing more benefit than downsides,” or, “It’s usually providing
more downsides than benefits and probably don’t do it.” So, we weigh these factors by the magnitude
of their occurrence, as well as the importance
to the patients as we’ve gone through the “values and preference”
exercise. So, if the panel believes that the benefits probably
outweigh the downsides, and they’re not
entirely certain, then they would make
a category “B” recommendation to provide the vaccine
or a conditional recommendation to provide the vaccine
and immunize that population. If they believe
that the downsides actually outweigh the benefits,
they would do the opposite. They would say, you know, “It’s probably not indicated
in this population.” However, certain values
and preferences — risk takers who are extremely
downside averse may accept the vaccination
and would want to receive it to avoid any potential
disease consequences and are not that worried about the adverse reaction
that they experience. And they would accept it. Under those circumstances,
the patient and the clinician just need to have
a good conversation, and there needs to be
informed decision making. Once again, I think there are situations
in the immunization field where both of these conditions or both of these type
of recommendations could be issued. Perhaps there are not, but I would have thought
it does exist. Many instances,
it will be like this. You will make strong
recommendations — either for an intervention, because the benefits, as you look at the estimates of effect as well as how important they are, clearly outweigh the importance and the occurrence of the downsides, or you make strong
recommendations against. All right, that’s the framework. You need to balance it out. It is complicated, and ideally,
what decision — You would use decision modeling. In most instances, you don’t
have the decision modeling, where you actually plug
all of this together. And that is the field
that I mentioned earlier that really still requires
a lot of work. But let’s come back
to some examples in terms of
how this has been used. So, I talked a lot about
the Avian Influenza example and the oseltamivir example. And it is a good example
for what we’ve just described in terms of the quality of evidence — an exception where the quality of evidence may be very low, and you would still make
strong recommendations, which, in this framework,
is actually possible. So Avian Influenza example
that I talked about earlier. So, these are the methods that we have actually been talking about here, used with WHO to develop guidelines for dealing with Avian Influenza. We began the work in 2006. It’s a panel that I chaired, with 13 voting members. There were all sorts
of representatives, clinicians who treated influenza
A(H5N1) patients, infectious-disease experts,
basic scientists, public health officers,
methodologists. And there was actually
a UNICEF representative as a patient advocate. At the time,
we had independent groups providing
the systematic reviews. They looked at the RCTs, which were obviously not performed in patients with Avian Influenza but in patients with seasonal influenza, and so provided indirect evidence. This question came up
a few times. There were case series
of these Avian Influenza cases. At that time, there were 36 cases described in the literature, out of about a hundred or so, when we started this work. And then we looked at basic
science evidence as well. That’s what the panel
came up with, so there was no direct clinical trial
in this field, obviously. There were four systematic
reviews that reported on five randomized trials
in seasonal influenza. Hospitalization was
one of the outcomes. I’m just showing you
the outcomes that showed some benefit here. I’m not showing you
the full-evidence profile. The odds ratio’s 0.22 —
not statistically significant. In pneumonia 0.15 — So some benefit there, a
relatively large odds reduction. But obviously wide confidence
intervals in very few cases. Three published case series
summarizing 36 cases, with a mortality risk of 1.0. But there were big problems in terms of when the drug was administered — a lot of heterogeneity in these cases. And, then, at that time,
there was no alternative that was more promising
in terms of drug treatment, and the cost was $40
per treatment course. This is Avian Influenza —
non-pandemic, obviously. So, this is what the panel had
in their hands, and these were
the first attempts — as I said, five years ago, the first attempts — to make the decision making transparent, with all the limitations that are out there, and without the much more detailed decision-analytic modeling that you could have done. We didn’t do it
under those circumstances because it was rapid advice. But the panel laid out what
their considerations were. So, in terms of the quality
of evidence, it was very low-quality evidence because it was all indirect
evidence, as I just described — that was the overall quality of evidence. In terms of the balance between
benefits and downsides, yes, there was uncertainty,
but the key issue here was that the baseline risk of mortality was between 50% and 80%. Right? So dying — not only being infected, dying. 50% to 80%. So that even a relatively
small effect. As we saw there for pneumonia,
it was a large effect, but even the small effect for preventing mortality
or pneumonia, given the high baseline risk, would still lead to a relatively
large risk reduction. So just consider that. You have an 80% baseline risk
of dying. Even if oseltamivir had only a 5% relative risk
reduction for mortality, all right, it would still amount to
a 4% absolute risk reduction. So, 5% of 80% — a 4% absolute risk reduction — which is a number needed to treat of 25, with a treatment course of 5 days, to prevent one death. And that is very different from many of the other
interventions that we are dealing with. There was a lot of uncertainty
around it, but that was one of the
underlying considerations here, that, despite small effects,
large baseline risks, you may still have
a large risk difference. Values and preferences. So, it was considered that there
was very little variability. Most patients would place a high value on preventing death, and a very low value on temporary nausea — when most of these patients were very sick — or on the neurologic complications, perhaps. Everybody thought it was very important to reduce hospitalizations. And the resources
were considered to be not as important because
it was $40 per treatment course under non-pandemic conditions. So relatively cheap. As I described here,
there was another issue, and that was the issue of regret. Many of these patients were pediatric patients, and, through the values and preferences, most parents would have placed a high value on avoiding the regret of not having done everything. So, anyway, those were
the considerations that influenced it, and this is the recommendation
that was actually issued. In patients with confirmed,
or strongly suspected, infection with Avian Influenza
A(H5N1) virus, clinicians should administer
oseltamivir treatment as soon as possible. And it was considered
to be a strong recommendation based on very
low-quality evidence. A complete exception
to the rule, right? For perhaps good reasons. Anyway, we use this as
an example for when, perhaps, a strong recommendation
is justified in the face of very
low-quality evidence. Explanation, in terms of values
and preferences — The recommendation places a high
value on the prevention of death in an illness
with a high-case fatality. It places relatively low value
on adverse reactions, development of resistance,
and costs of treatment. And what we have
in the background document is that the panel was asked to assume patients’ values
and preferences when making this decision. Sorry. What we actually provided
in the full document was also the information
that this recommendation was actually based on a vote,
which we very rarely do. We usually have consensus
methods to find consensus under those circumstances. The results of the vote
were actually in the document. Does ACIP do that? Are the results
of the voting…? Okay. It was new for WHO for sure.
[ Laughs ] Okay, so, one other point
to touch on is implications of strong
or weak recommendations. So, for policy makers —
I mentioned this earlier. For policy makers, it means that a recommendation can be adopted as a policy in most situations, and for clinicians, that most
patients should receive the recommended course
of action. For weak or conditional
recommendations, I mentioned it also earlier —
for policy makers, there’s a need
for substantial debate and involvement of stakeholders. Clinicians —
They should be more prepared to make shared decisions
with the patients. Now, from this simple
decision table, we’ve moved
with guideline groups to actually be more explicit. And this is the area where
we need a lot of work, a lot, a lot of work
over the next years. But the decision criteria
are the same. It’s these four factors
that I mentioned. When we now meet with panels, we have all of the evidence
profiles prepared, as we are doing here. And we also have started to provide these additional
summary tables because frequently if you’re dealing with
a lot of recommendations, there’s not enough time to go
through everything in detail and then provide the summary
at the meeting, or write the summary
at the meeting, so that we have independent
groups summarize it narratively. So, for instance, this was a question about TB
with the WHO guideline panel, where we then describe
the evidence, provide an explanation for why
there’s low-quality evidence. We then describe the benefits
and the downsides and have a summary here. We describe the values
and preferences, and we provide information about the potential research
implications that the panel assumed. And then we make these “yes or no” ratings where, once again, it indicates that if you are very certain about each of these factors, you are more likely to make
a strong recommendation and less likely to make
a weak recommendation. If you have a lot
of “no” answers, you are more likely to make a weak or conditional
recommendation — category “B.” And, then, we
leave this open, though. We just provide this,
and we work with the panel through this decision criteria
and try to find consensus on each of these four factors, which, once again,
ensures engagement, it ensures that everything
is well-prepared, it avoids the guessing
at the meeting, and it provides the transparency
that is being asked for what actually influenced
these decisions. Because it is these decisions
that you need to make in order to make the move
to a recommendation. And then, the other possibility that you can consider, or that we consider, in the work
with larger panels is, in order to move the work ahead, we have these neutral
recommendations formulated. ACIP, are you doing that
when you move to the panel, that you have the
recommendations formulated, pre-formulated? Is that how you do it? So, in order to keep the panel
engaged and make sure that they actually
do contribute, you don’t want to necessarily
write everything out and determine beforehand
what the recommendation is. So we formulate
these neutral recommendations so that we have something
to work with and so that
we’re not going astray and that all of the issues are
covered in the recommendations so that,
under those circumstances, you could formulate it
neutrally. The guideline group recommends
[indistinct] are or are not used in the treatment
of, you know, all patients with multi-drug resistant TB
in this case. And, then, we just leave
this open, whether it’s a strong
or conditional recommendation — either for or against,
as you can see. And sometimes, because
indirectness ratings may influence the overall
quality of the evidence, we sometimes don’t
pre-specify the quality. Now, just a few final words
on why we need to be clearer in terms of how we formulate
recommendations. It becomes a very, very, very
important issue in our view, and there is always
a lot of debate about this. There is actually evidence
that guideline groups have not been doing a great job in terms of formulating
recommendations. This is from Rick Shiffman’s
group, from a Yale paper from 2009. They actually looked at the AHRQ guideline clearinghouse, identified 1,275
recommendations randomly. They found huge inconsistency
across guidelines as well as within guidelines, in terms of how they were
formulated. About 1/3 did not express
the recommendation clearly. Most of them were not written
as executable action items. And that’s just to re-emphasize that we really should be
doing this. Over half of them didn’t
indicate any strength. So that is really important —
in my view, important evidence that we could be doing
much better in terms of formulating
recommendations. Just thought to provide you
with another example of how a recommendation could be
formulated, how we do this, and how it’s actually
recommended in Farouk’s ACIP document, which is for strong
or conditional recommendations. Oh, actually, this should read
“suggests.” Apologies. For strong recommendations, we
use the wording “recommend.” For weak recommendations, we can use the wording “may”
or “suggests.” And for recommendations against,
for strong recommendations, we would “recommend against,” and for weak recommendations, “suggest not” using it. So, one additional example is,
the guideline group recommends rapid DST testing
for resistance to INH and rifampin TB drugs
or rifampin alone, over conventional testing or no testing at the time
of diagnosis of TB. Conditional recommendation,
low-quality evidence, or 2+ quality evidence,
and then values and preferences. A high value was placed on outcomes such
as preventing death and transmission of MDR-TB as a result of delayed diagnosis, as well as on avoiding spending resources. So, just a brief explanation
and preferences and that the values
and preferences came from the guideline panel. Hard to read? And just to explain to you that
we’ve now moved even further when we provide
these decision tables. So, based on these experiences
with TB panels and so on, we’ve realized that we can
provide narrative information to make it even more
transparent. I’m not saying that this needs to necessarily
be done all the time, but it helps with working
through the issues, in particular, with complicated
panels or difficult panels, where we summarize the quality
of the evidence and provide key reasons
for down- and upgrading in a narrative format for people who have difficulty
with evidence profiles and want to see it
in a narrative format. We provide information about
the baseline risk for benefits, is it similar across different
subgroups of patients, should there be separate
recommendations for subgroups so that groups are actually
considering this. Then, what perspective is taken for the values and preferences — usually the patients’, but it could obviously be society’s. The source of the values —
Is it from the guideline panel, or is it from the literature
that was assessed? The source of the variability —
if there is any. And, then, the methods
for determining values and whether it was satisfactory. So is it considered satisfactory
that the panel actually provides the value
and preference statements, or should it have
been the literature? And, then, the cost
of treatment — We are dealing
with feasibility issues as well as opportunity costs and whether they’re different
across settings. Because these are
the decision criteria. This is just to enhance the transparency around formulating recommendations. It is 2:55. These are the decision tables that lead us to make recommendations, so I would stop here with this presentation and just take questions… …if there are any. There are none. Yeah, I know it is hard to read. So, the third column says, “This is a summary of
the reasons for the decision.” So, in other words, we’re asking
panels to make a decision about whether, you know, the quality
of evidence is high or moderate, which would make you more likely to offer a strong
recommendation. The next question here is balance of benefits
versus harms and burden. So, is there certainty that
the benefits outweigh the harm? This would say yes. And then here, it would just provide
a brief narrative description that there
is considerable benefit while very little clinical harm
or downsides are expected. And the same is true here
for the values and preferences. So the benefits are much higher-valued than the expected minor harms, and we provide a brief explanation. So, you can think of this as an unfolding explanation, where it becomes
more detailed as you move from left to right. Yes. WOMAN: So, I have a sort of
practical question, probably for both you
and Farouk. Since we’re just getting started
with this in ACIP, still, I’m not quite clear yet
on the actual format of what the recommendations
from ACIP will look like, whether published
as brief policy notes in the weekly MMWR or as full recommendation
and report statements. And I’m just curious,
in practical terms, of what kind of tables do you generally end up
including for readers, whether they would actually
be published in the body
of the recommendation or, I think, as SAGE does,
as online links. DR. SCHUNEMANN: Farouk,
do you want to…? Do you want me to take…? I could tell you what we’ve been
doing in other groups. WOMAN: Like, for example,
I’m not sure. Is this a kind of table that
would be presented to the reader in the end or to the user? DR. SCHUNEMANN: Okay. This is probably
one of the tables that we would put
in the online repository for people to see —
We would put there for those who want to understand
the rationale for a recommendation
to go through. The work that we’ve done in terms of what people really
want to see probably is more — So, we haven’t assessed whether people want to see this,
necessarily, but we do know that they want to see
the summary of findings tables when you present
the recommendations. So either
the full-evidence profile or an abbreviated
evidence profile. But we provide this for reasons
of transparency in the appendices. Farouk, sir, did you…? DR. MOOKADAM: Yeah, I think what
we put in the vaccine paper — I think most people would want
to see that information. So any additional details, we
can think about where to put it. DR. SCHUNEMANN: Yeah. So… WOMAN: Some of the slides that you have shown
were very, very condensed. [ Speaks indistinctly ] DR. MOOKADAM: Yeah, I think
it depends on your situation. If you think this will help them
in making a decision, then you can, you know,
write it out. Not all ACIP members
may understand all the statistics and numbers. DR. SCHUNEMANN: Yeah, so
sometimes these narratives, once again, help panel members
understand better. We need the tables in order
to make the quality assessment and to have
a numerical estimate. But sometimes the narratives just help them to make
informed judgments. So, this is probably
a stage in between. This is some —
It’s hard for you to see, but if you actually sit
in front of it or — Yeah, if you have it
in front of you, it helps the chair of the panel to actually move through
the issues relatively quickly because you can say, “Okay,
for this recommendation, “what is the quality
of the evidence? “The quality of the evidence
is high. Remember that when you make
a recommendation.” Balance of benefits
versus harms and burden. So, under those circumstances,
he could just quickly say, “Remember, there was some
big mortality reduction, “there was not a lot
of adverse consequences, so it looks like the balance
is clearly in favor,” which is a yes. So it helps the process along. In terms of publishing it, you
can put it in the background, or you can make it
your main table. We don’t really know
what works best.
