Study design, sample size and statistical power

Simon Spooner · Jan 18, 2013

All,

In another thread, David Holland wrote:

davidh said:

I know that statistics can be wrong. I know that small cohorts are not as trustworthy as big cohorts..........

I think it's time to try to discuss sample size and its implications in research studies, so that we may all have a better grounding in reading and interpreting research. I'll try this thread; if no one chimes in, I'll leave it and assume no one is interested. If you don't want to post, hit the thanks button; at least that way I'll know someone else is interested in exploring this subject. The caveat: I'm a chiropodist, not a statistician. I have a reasonable grounding in this, but I don't consider myself an expert in this topic. There are entire textbooks written on the subject.

Let's start with the second part of David's statement that: "I know that small cohorts are not as trustworthy as big cohorts.........."

So, hands up all those who think davidh is right...

Take two differing studies, one with a cohort of 50 subjects, one with a cohort of 100 subjects, but both designed to examine for statistically significant differences between two differing sets of variables. Can we automatically assume, as David Holland appears to, that the study with 100 subjects is more "trustworthy" than the study with 50 subjects?

Rob Kidd · Jan 18, 2013

While sample size can be important, the key issues are more about study design and the "power function". The question "how big should my sample size be?" has been addressed in texts many times. The answer is largely down to how big the difference is between the samples of the thing you are examining (and how variable they are). The other issue is "common sense" when it comes to interpretation. Stats do not provide answers; your (correct) interpretation does. Another angle on this last sentence is recognising that a statistical difference does not necessarily indicate a clinical (biological) difference. Example: with a group of 20 study and 20 control subjects, a blood pressure drug does not show a statistically significant difference (the correct test was used). The sample size was increased to 1000. Using a different, but correct, test (as the sample size was different), a lowering of the diastolic BP by 3 mm of mercury was found to be significant. Does that mean the drug is a success? Many would argue not, as 3 mm is stuff all from a clinical standpoint. Stats lead to answers; scientists provide them... Several times I have examined a thesis where the (probably) wrong stats have been used, but the outcome was sensible from a biological viewpoint. And then, I have examined others where the correct stats were used, but the interpretations of the results were not biologically plausible. At all times engage brain first!
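Rob's blood pressure example can be made concrete with a rough calculation. This is a minimal sketch, not taken from any real trial: it assumes a true 3 mmHg difference in diastolic pressure, a between-subject SD of 10 mmHg (a made-up figure), 1000 subjects split as 500 per group, and it uses a normal approximation to the two-sample test rather than an exact t-test. The function name `two_sided_p` is hypothetical.

```python
import math

def two_sided_p(diff, sd, n_per_group):
    """Approximate two-sided p-value for a difference in means between two
    equal-sized groups, using a normal (z) approximation with a common SD."""
    se = sd * math.sqrt(2.0 / n_per_group)  # standard error of the difference
    z = abs(diff) / se
    # P(|Z| > z) for a standard normal, via the complementary error function
    return math.erfc(z / math.sqrt(2.0))

# Assumed values for illustration only
DIFF_MMHG = 3.0  # true reduction in diastolic BP
SD_MMHG = 10.0   # hypothetical between-subject SD

for n in (20, 500):
    p = two_sided_p(DIFF_MMHG, SD_MMHG, n)
    print(f"n = {n:3d} per group: p = {p:.2g}")
```

Under these assumptions the p-value is about 0.34 with 20 per group and falls well below 0.001 with 500 per group, even though the 3 mmHg effect itself has not changed at all.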

Simon Spooner · Jan 18, 2013

Thanks for taking the time, Robert. I concur with your point regarding clinical versus statistical significance. You seem to be suggesting that sample size, the use of correct statistical tests and statistical significance are not too important in clinical research, and that correct clinical inferences from the data are what matter. Without getting diverted too much from the topic of the thread: how can the correct inferences be drawn from a data set when inappropriate statistical analyses have been performed? And, back on topic, what about when the sample size is so small that the power is low and the potential for beta error high?

Kevin Kirby · Jan 18, 2013

Simon Spooner said:

Let's start with the second part of David's statement that: "I know that small cohorts are not as trustworthy as big cohorts.........."

So, hands up all those who think davidh is right...

Take two differing studies, one with a cohort of 50 subjects, one with a cohort of 100 subjects, but both designed to examine for statistically significant differences between two differing sets of variables. Can we automatically assume, as David Holland appears to, that the study with 100 subjects is more "trustworthy" than the study with 50 subjects?

I have found that people who constantly complain that research can't be good just because the sample size is too small have never actually done much human research of their own.

Even small sample sizes in clinical research can be quite useful for the clinician if the research is well designed and the authors don't make claims in their paper that are over and above what the data demonstrate.

Good thread, Simon. But please keep it simple for those of us who are statistically challenged.... what is a beta error?

Griff · Jan 18, 2013

My understanding of statistical errors is as follows:

Type 1 error (alpha) - You find a difference between two groups when one does not exist
Type 2 error (beta) - You don't identify a difference that does exist

Will need to dig out my notes on 'power analysis' and refresh my memory on this topic before I quickly get out of my depth...

Simon Spooner · Jan 18, 2013

Beta is the probability of making a type 2 error; a type 2 error is the failure to reject a false null hypothesis. So, say you do a study where your null hypothesis is that there is no difference in arch height between subjects with hallux valgus and subjects without hallux valgus. You measure arch height in a sample, perform an appropriate statistical test and find no significant difference between the group with hallux valgus and the group without. Therefore, you fail to reject the null hypothesis that arch height is no different in those with and without hallux valgus. If a real difference does exist, you have made a type 2 error, and beta is the probability of making that error when a difference of a given size is really there. 1 - beta = statistical power, which, as most people know, is linked to sample size. But in order to calculate beta in our example, we first need to calculate gamma. It's late for me, and this is a complex topic. I'll be back on it tomorrow though, promise.
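The definitions of alpha and beta can be checked by brute force. This sketch rests on assumptions that are not in the thread: normally distributed data with a known SD of 1, a two-sample z-test standing in for the t-test (so the nominal alpha of 0.05 holds exactly), 20 subjects per group, and an assumed true difference of 0.5 SD for the power estimate. It simulates 4,000 studies with no real difference to estimate the type 1 error rate, then 4,000 studies with a real difference to estimate power and beta. The function names are hypothetical. Note that beta here belongs to that one assumed difference; a different true difference gives a different beta.

```python
import math
import random
import statistics

Z_CRIT = 1.959963984540054  # two-sided critical value for alpha = 0.05

def rejects_null(a, b, sd):
    """Two-sample z-test with a known common SD (a simplifying assumption)."""
    se = sd * math.sqrt(1.0 / len(a) + 1.0 / len(b))
    z = (statistics.fmean(a) - statistics.fmean(b)) / se
    return abs(z) > Z_CRIT

def rejection_rate(true_diff, n_per_group, sd=1.0, trials=4000, seed=1):
    """Fraction of simulated studies in which the null hypothesis is rejected."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(true_diff, sd) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        hits += rejects_null(a, b, sd)
    return hits / trials

alpha_hat = rejection_rate(true_diff=0.0, n_per_group=20)  # null true: type 1 error rate
power_hat = rejection_rate(true_diff=0.5, n_per_group=20)  # null false: power = 1 - beta
print(f"type 1 error rate ~ {alpha_hat:.3f}")
print(f"power ~ {power_hat:.3f}, beta ~ {1 - power_hat:.3f}")
```

With these settings the estimated type 1 error rate should land close to 0.05 and the power around 0.35, so a study of this size would miss a half-SD difference roughly two times in three.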

Rob Kidd · Jan 18, 2013

Simon Spooner said:

Thanks for taking the time, Robert. I concur with your point regarding clinical versus statistical significance. You seem to be suggesting that sample size, the use of correct statistical tests and statistical significance are not too important in clinical research, and that correct clinical inferences from the data are what matter. Without getting diverted too much from the topic of the thread: how can the correct inferences be drawn from a data set when inappropriate statistical analyses have been performed? And, back on topic, what about when the sample size is so small that the power is low and the potential for beta error high?

Points one at a time. I am not making any statement about sample size per se. What I meant to say is that the important issues are design, the power function (i.e. how big should my sample size be?), and not losing sight of the clinical (biological) question. To the other point re: incorrect tests, perhaps I will rephrase that by analogy: there is no such thing as a bad pub, only good pubs and better pubs. For instance, if you have a data set of interval data, e.g. interlandmark distances of bone dimensions in mm, then you could use summary measures such as the median and interquartile range, and they would not be wrong. However, assuming the usual stuff about data distributions, a more powerful way (in the statistical sense) would be to use the mean and standard deviation.
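Rob's two kinds of summary can be computed side by side with Python's standard `statistics` module. The ten distances below are invented purely for illustration; they are not from any study mentioned in this thread.

```python
import statistics

# Hypothetical interlandmark distances in mm (illustrative values only)
distances_mm = [41.2, 43.5, 39.8, 44.1, 42.7, 40.9, 45.3, 42.0, 43.8, 41.6]

# Distribution-free summary: median and interquartile range
median = statistics.median(distances_mm)
q1, _, q3 = statistics.quantiles(distances_mm, n=4)  # the three quartile cut points

# Parametric summary: mean and sample standard deviation (n - 1 denominator)
mean = statistics.fmean(distances_mm)
sd = statistics.stdev(distances_mm)

print(f"median = {median:.2f} mm, IQR = {q1:.2f} to {q3:.2f} mm")
print(f"mean = {mean:.2f} mm, SD = {sd:.2f} mm")
```

Both summaries are legitimate descriptions of the same data; as Rob says, the mean and SD become the more powerful choice when the usual distributional assumptions hold.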

Among the several things that I have said in this and the last post is a clear message that stats lead to answers; they are not answers in themselves. The worker's interpretation is the key issue; just because something is statistically significant does not make it clinically (biologically) real. For a crass example, I seem to remember from very many years ago our glorious administrator once quoting that there was an association between ice cream consumption and rape... biological rubbish though.

Athol Thomson · Jan 18, 2013

Kevin Kirby said:

I have found that people who constantly complain that research can't be good just because the sample size is too small have never actually done much human research of their own.

Even small sample sizes in clinical research can be quite useful for the clinician if the research is well designed and the authors don't make claims in their paper that are over and above what the data demonstrate.

I agree.

It seems like common sense to look at the methodology involved before making a judgment on sample size.

Some studies take a lot of heat about small sample sizes but if you read through the methods and see what was actually involved in testing each subject it is apparent that the process was time-consuming and that vast amounts of data were produced that must be processed.

Keep in mind that not all researchers are full-time or have grants to support them.

Must admit I know nothing about stats though....so good thread.

Athol

Ros Kidd · Jan 19, 2013

My bible on this subject is "Foundations of Clinical Research: Applications to Practice" by Portney and Watkins. This is simply because it is directly applicable to clinical issues, whereas most stats books are bogged down in engineering or whatever. We (i.e. my staff and I) usually found journal club and Grand Rounds excellent events for discussing clinical significance.

markjohconley · Jan 19, 2013

Excellent thread. I passed stats at uni, but 40 yrs ago, and all I remember is that it went from extremely easy to comprehend in the early lectures to WHOOOAAAHH, how did you get from there to here!
Going to get my daughter's textbook and update myself. Mark

(The physiotherapy database PEDro has a couple of short but excellent tutorials on 'valid trials' and 'clinically useful therapy')

Kevin Kirby · Jan 19, 2013

My last statistics course was at UC Davis in 1978 (35 years ago), where we had an Eastern Indian visiting professor with a nearly unintelligible accent, teaching material that was quite different from the material in the book we were told to purchase. I do remember something about him talking about pulling red and white balls out of a big glass jar and (if I understood him correctly through the accent) something about probability... that was, by far, my worst undergraduate educational experience!!

.....so go easy on me......

Simon Spooner · Jan 19, 2013

Morning.

So, last night we established that beta is the probability of making a type 2 error, a type 2 error is the failure to reject a false null-hypothesis. And that statistical power is 1-beta.

One of the workers who wrote textbooks on statistical power was Cohen (1977); he and others provide tables of statistical power (1 - beta) as a function of the significance criterion (alpha) and delta, a quantity that combines the effect size with the sample size (defined below).

Alpha is the probability of making a type 1 error; a type 1 error is the incorrect rejection of a true null hypothesis. Usually researchers set the alpha level at 0.05 or 0.01 (these numbers look familiar?), but theoretically you can set it to whatever you want. Of course, the larger the alpha you set, the greater the probability of making a type 1 error.

So, let's say we measure hallux abductus angle in a sample of males and females and we want to know if there is a statistically significant difference between the men and the women. We can employ a two-sample unpaired t-test to do this (provided the data meet the assumptions of the t-test; more on the assumptions of statistical tests later, if you want), and we can retrospectively perform a power analysis to calculate beta.

We set alpha at 0.05, but in order to find beta from Cohen's tables we also need to know delta, which combines the effect size with the sample size and can be calculated in the following way:

delta = gamma × √(n/2)

where: gamma = the population effect size and n = the number of subjects in each group

So, now we have to work out gamma in order to find delta (we already know our sample size (n) because this is a retrospective power analysis). Gamma, the population effect size, can be calculated as follows:

gamma = mu / SD

where: mu is the minimum detectable difference in measurements and SD is the pooled standard deviation of the males and females within our sample.
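The chain above (gamma from mu and SD, then delta from gamma and n) can be sketched in Python. This is an illustration under stated assumptions, not Cohen's own procedure: n is taken as the number of subjects per group, and the lookup in Cohen's tables is replaced by a normal approximation to power, which is slightly optimistic for small samples. The function names and the hallux abductus figures (group SDs of 7.5 and 8.5 degrees, 25 subjects per group, mu = 4 degrees) are hypothetical.

```python
import math
from statistics import NormalDist

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent samples."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def approx_power(mu, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison.

    gamma = mu / SD              (effect size)
    delta = gamma * sqrt(n / 2)  (n = subjects per group)
    Power uses a normal approximation in place of Cohen's tables.
    """
    std_normal = NormalDist()
    gamma = mu / sd
    delta = gamma * math.sqrt(n_per_group / 2)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Probability the test statistic falls in either rejection region
    power = (1 - std_normal.cdf(z_crit - delta)) + std_normal.cdf(-z_crit - delta)
    return gamma, delta, power

# Hypothetical hallux abductus data (degrees): 25 males, 25 females
sd = pooled_sd(sd1=7.5, n1=25, sd2=8.5, n2=25)
gamma, delta, power = approx_power(mu=4.0, sd=sd, n_per_group=25)
print(f"pooled SD = {sd:.2f}, gamma = {gamma:.2f}, delta = {delta:.2f}")
print(f"power = {power:.2f}, beta = {1 - power:.2f}")
```

With no real difference (mu = 0) the "power" collapses to alpha, which is a handy sanity check on the formula.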

So, now we need to know what mu is equal to. This is where the precision of our instrumentation and the reliability of our measurements come into play. Let's say we used a goniometer to measure the hallux abductus angle in our sample population and that the goniometer in question was calibrated in two-degree increments. We could say that the minimum detectable difference in measurements (mu) was 2 degrees, right? However, we may also have performed a reliability study which demonstrated that the within-day (or between-day) variance in our measurements means that we can only really measure confidently to within 4, 6 or 8 degrees, right? Note what the formula implies: for a given SD, a larger mu gives a larger effect size (gamma) and therefore more power for the same sample size, while detecting a smaller difference requires a larger sample. Better measurement precision helps mainly by reducing measurement error, which shrinks the SD and so increases gamma; the more precise your clinical measurements, the smaller the sample size you'd need to achieve the same statistical power, all other factors being equal. The more astute of you will have realised that we could also set mu equal to the smallest difference which we believe is "clinically significant", which is Robert's point.
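Turning the calculation around shows how mu and SD drive the sample size. This sketch solves delta = gamma × √(n/2) for n, given the delta needed for a target power. It assumes a normal approximation, alpha = 0.05, a target power of 0.80 and a pooled SD of 8 degrees; the function name `n_per_group` and all the numbers are hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(mu, sd, alpha=0.05, power=0.80):
    """Approximate subjects needed per group (normal approximation),
    obtained by solving delta = gamma * sqrt(n / 2) for n."""
    z = NormalDist()
    gamma = mu / sd
    delta_needed = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return math.ceil(2 * (delta_needed / gamma) ** 2)

SD_DEGREES = 8.0  # hypothetical pooled SD
for mu in (2, 4, 6, 8):
    print(f"mu = {mu} deg, SD = {SD_DEGREES}: about {n_per_group(mu, SD_DEGREES)} per group")

# Same mu, but a more reliable measurement that lowers the observed SD
print(f"mu = 4 deg, SD = 6.0: about {n_per_group(4, 6.0)} per group")
```

With these assumptions, detecting 4 degrees needs about 63 subjects per group, detecting 2 degrees roughly four times as many, and cutting the SD from 8 to 6 degrees brings the 4-degree requirement down to about 36.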

I'll leave it there for now. Not least because that's a lot for you to get your heads around, but also because I don't have Cohen's tables of power in front of me. However, I hope you can see the relationship between sample size and the precision of measurement in the calculation of delta, which in turn is used along with alpha to calculate beta. So you can't just say that because one study has a bigger sample size than another, the study with the bigger sample is automatically more "trustworthy"- it's more complicated than that. Robert can probably add some more, as he seems well versed in the statistical arts. I'm off to walk the dog now, but I'll leave you with a quote from Francis Galton:
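The point that the bigger study is not automatically the more trustworthy one can be illustrated with the same normal-approximation power formula. The two studies below are hypothetical: both look for a 4-degree difference, but study A has 50 subjects (25 per group) with an SD of 5 degrees, while study B has 100 subjects (50 per group) with an SD of 10 degrees.

```python
import math
from statistics import NormalDist

def approx_power(mu, sd, n_per_group, alpha=0.05):
    """Normal-approximation power for a two-sided, two-sample comparison."""
    z = NormalDist()
    delta = (mu / sd) * math.sqrt(n_per_group / 2)
    z_crit = z.inv_cdf(1 - alpha / 2)
    return (1 - z.cdf(z_crit - delta)) + z.cdf(-z_crit - delta)

# Hypothetical studies looking for the same 4-degree difference
study_a = approx_power(mu=4.0, sd=5.0, n_per_group=25)   # 50 subjects, precise measurement
study_b = approx_power(mu=4.0, sd=10.0, n_per_group=50)  # 100 subjects, noisier measurement
print(f"study A (n = 50):  power = {study_a:.2f}")
print(f"study B (n = 100): power = {study_b:.2f}")
```

Under these assumptions study A has roughly 80% power and study B only about 50%, so the smaller, more precise study is the one less likely to make a type 2 error.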

“Statistics are the only tools by which an opening can be cut through the formidable thicket of difficulties that bars the path of those who pursue the Science of Man”

Note that he said statistics could cut an opening in the thicket, he didn't say they would allow you to pass through it!

davidh · Jan 19, 2013

I'm grateful to Simon for starting this thread, and for mentioning me in his first post, since it gives me something to answer.

I don't think there is any doubt that Simon is more au fait with statistical analysis than I am. I last studied stats 23 years ago at undergrad level.
I did, however, buy and read relevant parts of Douglas Altman's Practical Statistics for Medical Research when doing my Masters, but I'm afraid I had a colleague from the Maths Dept at Durham work through my statistical analysis - quite permissible BTW.

Simon quoted a paper on another thread - here.
I said I read as far as the sample - n = 11, then stopped.
Hence this thread.

I re-read the abstract of that paper this morning.
11 subjects tested some orthotics. Did each of the 11 subjects possess identical feet?
The chances are not - although I think an assumption has been made that they had. Anyone who has looked will know there is huge variability in "normal" feet.

Was diurnal variation plotted so that peak ligamentous laxity for each subject was noted and subjects only tested either then or at the time when lowest ligamentous laxity occurred? You can have up to 16 degrees of variation in AJC inversion over the course of a day.
It was not - so at the very least we can be reasonably sure that like was not tested with like.
The only constants in this study were the orthotics themselves (and if the shoes had wear patterns, that would tend to make them less constant), the surface the test was carried out on, and the Vicon gait analysis system.

Here's a snippet.
"Although not statistically significant, peak power absorption decreased by 23% and 34.6% between the no orthoses and standard conditions and between the no orthoses and inverted orthoses conditions, respectively (Table 2). This lack of significance may be due to a low number of subjects included in the study. A post hoc analysis suggests that greater than 40 subjects would have been necessary to have adequate statistical power for these variables. An important limitation of this study was that there was only one clinician referring subjects to the Motion Analysis lab for this study and the inverted orthotic technique is only used for severe patients. This made recruitment of a large number of subjects difficult."

Rob Kidd · Jan 19, 2013

With respect, and I do mean with respect, we are losing track of the biology. Getting bogged down in Alpha to Omega (joke) is no help to the clinician, IMHO. However, you do note, quite rightly, that alpha may be set where one thinks it fits, so to speak. And that is the key issue: it is a question of what you can live with in order to answer your biological (clinical) questions. At the risk of repeating myself more than 3 times, stats are tools to find answers, not the answers themselves. I am delighted that people are asking these questions. Rob

Simon Spooner · Jan 19, 2013

When you are talking about diurnal variation, David, you are really talking about within-day measurement error, since one cannot be differentiated from the other in such studies. So long as the researchers take their within-day error into account in the analyses (as I described above), the confidence limits of the analyses can be calculated. Since the subjects were symptomatic, can you assume their feet were "normal"? There is no such thing as normal feet, really. What the study set out to do was to examine why those specific patients did not get better with one type of foot orthosis, but did get better with another type. What they reported was that the rearfoot kinematics were the same with both kinds of orthoses, but the kinetics were only altered by the type of orthosis which made them get better. This suggests that foot orthoses do not create their therapeutic effect by holding the subtalar joint in neutral during stance, as you have conjectured.

We don't need to worry about the power too much, as we don't need to statistically test for significant differences between the data; we can just look at the kinematic plots and the kinetic plots. It doesn't take a statistical test to see the obvious lack of differences in the kinematic plot, and the obvious differences between foot orthoses in the kinetics plot, as per Robert's point. So, the subjects got better with one type of device than another; neither device seems to make much difference to rearfoot kinematics, but one of the designs of device (the one that made them get better) seems to change the rearfoot kinetics, while the one they didn't get better while wearing didn't. So we can look at the data, without performing any statistical tests that would be prone to type 1 or type 2 error, and qualitatively describe what we see. And what I see is that, in this study, the foot orthoses more likely exerted their therapeutic effects by altering the kinetics at the rearfoot than by holding the subtalar joint in neutral.
I also gave you a list of other studies which had demonstrated that foot orthoses didn't alter rearfoot kinematics much; perhaps you'd be so kind as to provide a reference which supports your assertion that foot orthoses exert their therapeutic effect by holding the subtalar joint in neutral during stance?

Anyway, let's keep this thread on topic. I'd be happy to continue our discussion where you left it last night in the other thread, here: http://www.podiatry-arena.com/podiatry-forum/showthread.php?t=85779 For those interested, David is arguing that foot orthoses exert their therapeutic effects by "holding the subtalar joint in an approximation of neutral during stance"...

Simon Spooner · Jan 19, 2013

Rob Kidd said:

With respect, and I do mean with respect, we are losing track of the biology. Getting bogged down in Alpha to Omega (joke) is no help to the clinician, IMHO. However, you do note, quite rightly, that alpha may be set where one thinks it fits, so to speak. And that is the key issue: it is a question of what you can live with in order to answer your biological (clinical) questions. At the risk of repeating myself more than 3 times, stats are tools to find answers, not the answers themselves. I am delighted that people are asking these questions. Rob

Yes, we understand that. I am trying to show how we can incorporate the levels of what are considered "biologically significant" within our statistical analyses, to ensure that what the statistics tell us relates to what we need to know clinically. In my opinion, it's more about setting mu right than setting the alpha level (see above).

Rob Kidd · Jan 19, 2013

"gamma = mu / SD

where: mu is the minimum detectable difference in measurements and SD is pooled standard deviation for both the males and females within our sample."

Above is a much cut version of your post. My alarm bells go off straight away when I hear about pooled SD - the reasons largely being that the female of the species (in all but a few primates, but including sapiens) has a greater SD than the male. The reasons lie in sexual dimorphism, which takes us away from this thread. And anyhow, the question may not relate to a sexual difference, or it may be about one sex or the other.

The stats analysis package that I use is SAS, and among the results it routinely belts out, it gives two versions of the t-test: one with and one without a pooled SD. When I was a graduate student, about a million years ago, I tested it at length to see what the differences might be; in fact, you really had to abuse the data to achieve a different answer. BUT, pooling things like SD might be obscuring the very thing that you are looking for, as in my case sexual dimorphisms... Rob
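The two versions of the t-test Rob mentions can be sketched as follows: Student's t with a pooled variance, and Welch's t, which does not pool and uses the Welch-Satterthwaite approximation for its degrees of freedom. The data are hypothetical, the function names are illustrative, and this does not attempt to reproduce SAS's output. One property worth knowing: when the two groups are the same size, the two t statistics are identical and only the degrees of freedom differ, which may be part of why it takes unbalanced, unequal-variance data to make the two versions disagree.

```python
import math
import statistics

def student_t(a, b):
    """Student's t with a pooled variance (assumes equal variances)."""
    n1, n2 = len(a), len(b)
    sp2 = ((n1 - 1) * statistics.variance(a) + (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

def welch_t(a, b):
    """Welch's t without pooling, with Welch-Satterthwaite degrees of freedom."""
    v1 = statistics.variance(a) / len(a)
    v2 = statistics.variance(b) / len(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (len(a) - 1) + v2**2 / (len(b) - 1))
    return t, df

# Hypothetical measurements: group 2 is larger and far more variable than group 1
group_1 = [14.0, 15.5, 13.2, 16.1, 14.8, 15.0, 13.9, 15.7, 14.4, 15.2]
group_2 = [18.5, 11.0, 21.3, 9.8, 16.4, 23.1, 12.7, 19.9, 14.2, 25.0, 10.5, 17.8, 20.6, 13.3]

t_pooled, df_pooled = student_t(group_1, group_2)
t_welch, df_welch = welch_t(group_1, group_2)
print(f"pooled (Student): t = {t_pooled:.2f}, df = {df_pooled}")
print(f"unpooled (Welch): t = {t_welch:.2f}, df = {df_welch:.1f}")
```

Welch's degrees of freedom can never exceed the pooled test's n1 + n2 - 2, so when variances differ it is the more cautious of the two.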

Simon Spooner · Jan 19, 2013

That's just the way Cohen defines how to calculate it. It's the pooled standard deviation from the two samples being compared in the t-test; I just happened to use gender as an example, but it could be any two groups. And the test itself is looking for differences between the two groups, so if there is sexual dimorphism in the variable of interest, and you have enough statistical power, that's what you'll find!

davidh · Jan 19, 2013

Simon Spooner said: ↑

When you are talking about diurnal variation david, you are really talking about within-day measurement error, since one cannot be differentiated from the other in such studies. So long as the researchers take into account their within day error within the analyses (as I described above) the confidence limits of the analyses can be calculated. (much cut........) For those interested, David is arguing that foot orthoses exert their therapeutic effects by "holding the subtalar joint in an approximation of neutral during stance"...

As long as they recognise the magnitude of diurnal variation (I bet they didn't) all should be well.

David isn't arguing that orthoses do anything. David stated, in reply to a question, how he thought orthoses work.
We also debated the word "hold". Blinda and Robert queried what I wrote. I showed a definition from Google which seemed to fit, but admitted it was a bad choice of word.

blinda · Jan 19, 2013

davidh said: ↑

....

David isn't arguing that orthoses do anything. David stated, in reply to a question, how he thought orthoses work.
We also debated the word "hold". Blinda and Robert queried what I wrote. I showed a definition from Google which seemed to fit, but admitted it was a bad choice of word.

blinda is going off thread and asking davidh `why is davidh talking in the third person?`

She is also enjoying this thread, despite possessing a very limited understanding of stats, so she wishes to thank Simon and Rob for simplifying a rather complex subject.

davidh · Jan 19, 2013

blinda said: ↑

blinda is going off thread and asking davidh `why is davidh talking in the third person?`

She is also enjoying this thread, despite possessing a very limited understanding of stats, so she wishes to thank Simon and Rob for simplifying a rather complex subject.

To emphasise a point, is all.

Kevin Kirby · Jan 19, 2013

I would ask that all those involved please keep this thread on topic, so that the information contained here can be of maximal educational benefit to all those following along.

Understanding statistics is vitally important for both the researcher and the clinician. However, statistics can also be confusing unless the individual has had a good undergraduate or postgraduate course on the subject. Unfortunately, for most podiatric clinicians, statistics is not one of their best subjects. With that in mind, let me try to direct some questions that will hopefully maximize the educational benefit to all of us on this important subject.

Question #1: I want to perform a research study on the correlation between the weight of the individual and the minimum tension force required to dorsiflex the hallux to a 10 degree dorsiflexed position while the subject is standing in relaxed bipedal stance. I can measure the weight of the individual to within 100 g with a scale and can measure tension force to within 0.1 Newtons with the tension measuring device and can measure hallux dorsiflexion to within 0.5 degrees of accuracy.

I would like to have enough subjects within my study so that I can determine, at a 0.05 level of significance, whether there is a positive correlation between the weight of the individual and the tension force required to dorsiflex their hallux to a 10 degree dorsiflexed position.

A. How would I determine, before I start the study, the number of subjects needed to be relatively certain that I could determine a 0.05 level of significance in the results of my study?

B. What would be the most proper statistical test or types of analyses to properly determine if body weight of my subjects correlated to tension force required to reach 10 degrees of hallux dorsiflexion in my research study? Why?

Thanks in advance.

Kevin Kirby · Jan 19, 2013

For those of you who are unfamiliar with the term "statistical significance", here is a nicely written article explaining the concept.

What is Statistical Significance?

Simon Spooner · Jan 19, 2013

Kevin Kirby said: ↑

A. How would I determine, before I start the study, the number of subjects needed to be relatively certain that I could determine a 0.05 level of significance in the results of my study?

B. What would be the most proper statistical test or types of analyses to properly determine if body weight of my subjects correlated to tension force required to reach 10 degrees of hallux dorsiflexion in my research study? Why?


Kevin, I'll come back to your A. on Monday if I may, since I don't have my books in front of me and wouldn't want to provide inaccurate information. Suffice it to say that when calculating statistical power prospectively, it is sometimes necessary to make certain educated guesses.

As for your B, here is how I would approach it. I'm not saying this is the best way, just how I'd do it.

The first thing I would do is plot a scattergram of the body weight data versus the tension data (x = body weight, y = tension) and inspect the graph visually to get an instinctive idea of whether the relationship between the two sets of data is linear or is better represented by some other function. Next, I would use some software to fit some "lines of best fit". I'd start with a linear function and look at the r square value I got for that model (basically, r square tells us how well the data fit the model: 1 would be a perfect fit, 0 a rubbish fit). A linear model is $y = mx + c$, where y = tension, m = the slope of the line, x = body weight, and c is a constant, the intercept on the y axis.
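To show what "fitting a line and reading off r square" involves, here is a minimal Python sketch of an ordinary least-squares fit of y = mx + c, with r square computed as 1 minus the residual sum of squares over the total sum of squares. The function names are my own. The weight and tension values are made up; in a real study they would come from your own measurements.

```python
import statistics


def linear_fit(x, y):
    """Ordinary least-squares fit of y = m*x + c. Returns (m, c)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    m = sxy / sxx          # slope
    c = my - m * mx        # intercept on the y axis
    return m, c


def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    my = statistics.mean(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical pilot data: body weight (kg) vs tension (N), invented for illustration.
weight = [55, 62, 68, 74, 81, 88, 95]
tension = [21.0, 23.5, 24.1, 27.2, 28.0, 31.3, 32.9]

m, c = linear_fit(weight, tension)
pred = [m * w + c for w in weight]
print(f"tension = {m:.3f} * weight + {c:.2f}, r square = {r_squared(tension, pred):.3f}")
```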

Next, I would fit a quadratic model, the basic form of which is $y = mx + bx^2 + c$, and look at how much this improves the r square value. By including another x term in the model, we've lost a degree of freedom in any subsequent analysis, so you have to decide whether the increase in fit outweighs the loss of degrees of freedom. Then I would fit a cubic model, $y = mx + bx^2 + dx^3 + c$, and look at the r square value again; you've now lost another degree of freedom. That's no problem if you've got a big sample, but a hell of a problem if your sample is n=3!!! Etc., etc.
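Here is a sketch of that model-comparison step. It assumes polynomial fits by ordinary least squares, solved via the normal equations and written out with the standard library only. The sketch reports both r square and adjusted r square. The adjustment, 1 - (1 - r²)(n - 1)/(n - p - 1) with p non-constant terms, is one common way to weigh the gain in fit against the degree of freedom each extra term costs. Adjusted r square is my choice of criterion here, not something from the post, and the data are invented.

```python
import statistics


def poly_fit(x, y, degree):
    """Least-squares polynomial fit via the normal equations (X^T X) b = X^T y.

    Returns coefficients [c, b1, b2, ...] for y = c + b1*x + b2*x**2 + ...
    """
    k = degree + 1
    # Augmented matrix [X^T X | X^T y], where X has columns 1, x, x**2, ...
    a = [[sum(xi ** (i + j) for xi in x) for j in range(k)]
         + [sum(yi * xi ** i for xi, yi in zip(x, y))] for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, k):
            f = a[r][col] / a[col][col]
            for j in range(col, k + 1):
                a[r][j] -= f * a[col][j]
    # Back substitution.
    coef = [0.0] * k
    for i in reversed(range(k)):
        coef[i] = (a[i][k] - sum(a[i][j] * coef[j] for j in range(i + 1, k))) / a[i][i]
    return coef


def predict(coef, x):
    """Evaluate the fitted polynomial at each x."""
    return [sum(b * xi ** i for i, b in enumerate(coef)) for xi in x]


def r2_and_adjusted(y, y_pred, n_terms):
    """r square, plus adjusted r square penalising each non-constant term."""
    my = statistics.mean(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    n = len(y)
    adj = 1 - (1 - r2) * (n - 1) / (n - n_terms - 1)
    return r2, adj


# Hypothetical data, invented for illustration.
weight = [55, 62, 68, 74, 81, 88, 95]
tension = [21.0, 23.5, 24.1, 27.2, 28.0, 31.3, 32.9]
mean_w = statistics.mean(weight)
xc = [w - mean_w for w in weight]  # centring keeps the normal equations well conditioned

for degree in (1, 2, 3):
    coef = poly_fit(xc, tension, degree)
    r2, adj = r2_and_adjusted(tension, predict(coef, xc), degree)
    print(f"degree {degree}: r square = {r2:.4f}, adjusted = {adj:.4f}")
```

Plain r square can never fall when a term is added. Adjusted r square can fall, and a fall is one signal that the extra term is not worth the degree of freedom it costs.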

At some stage you need to decide whether the linear model is a "good enough fit" for the data, or whether to attempt to coax the variables into a more linear relationship, or whether to perform a non-linear analysis.

If you decide that the data are "linear enough", you could perform a Pearson's correlation test to glean the strength of the correlation. But of greater clinical value should be the ability to predict the x variable, i.e. tension, from the y variable, i.e. body weight. In that case, a simple linear regression analysis would be the way forward, since it would provide an equation in the form $y = mx + c$ that allows the clinician to predict the tension in the plantar fascia from a person's body weight, within the 95% confidence intervals of the model. However, we should be mindful that an n=2 sample will always fit a straight line perfectly, leaving zero residual degrees of freedom! So some have suggested that a sample of about a hundred should be collected before a linear regression is attempted, though this number seems somewhat arbitrary.
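Here is a minimal sketch of the Pearson test. It assumes the standard t statistic for H0: ρ = 0, t = r·sqrt((n - 2)/(1 - r²)), with n - 2 degrees of freedom. Getting a p-value requires the t distribution, which Python's standard library does not provide, so the sketch stops at t and its degrees of freedom for comparison with a t table. It also shows the n = 2 trap: two points always lie exactly on a line. The data and function names are hypothetical.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_t_statistic(r, n):
    """t statistic for H0: rho = 0, returned with its n - 2 degrees of freedom."""
    df = n - 2
    if df < 1:
        raise ValueError("need n >= 3: with n = 2 a straight line always fits perfectly (df = 0)")
    if abs(r) >= 1:
        return math.copysign(math.inf, r), df
    return r * math.sqrt(df / (1 - r ** 2)), df


# Hypothetical data, invented for illustration.
weight = [55, 62, 68, 74, 81, 88, 95]
tension = [21.0, 23.5, 24.1, 27.2, 28.0, 31.3, 32.9]

r = pearson_r(weight, tension)
t, df = r_t_statistic(r, len(weight))
print(f"r = {r:.3f}, t = {t:.2f} on {df} df (compare with a t table at your chosen alpha)")

# Any two points with distinct x and y values give a "perfect" correlation.
print(pearson_r([60, 80], [20, 30]))  # → 1.0
```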

If the linear r square value is poor, you can then try applying scale transformations to the data to make the relationship between the variables more linear. For example, you might take logarithms of the body weight data and re-plot log body weight versus tension to see if this improves the r square of the linear model. If it does, you can perform the analysis as described above; if not... you might want to consider a non-linear analysis. Semmes-Weinstein filaments are calibrated on a log-linear scale for this reason, BTW.
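Here is a sketch of the log-transform check. It uses invented data that follow a logarithmic curve and computes the same straight-line r square twice: once against raw x and once against log x. For a simple linear fit, r square equals the square of Pearson's r, and the helper relies on that.

```python
import math
import statistics


def r_squared_linear(x, y):
    """r square of a simple straight-line fit (equal to Pearson's r squared)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Invented data following y = 10 * ln(x) + 2, purely for illustration.
x = [1, 2, 5, 10, 20, 50, 100]
y = [10 * math.log(v) + 2 for v in x]

log_x = [math.log(v) for v in x]
print(f"r square with raw x: {r_squared_linear(x, y):.3f}")
print(f"r square with log x: {r_squared_linear(log_x, y):.3f}")
```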

Take-home message: just because a correlation statistic such as Pearson's product-moment correlation reports no significant relationship between two variables does not mean that the two variables are not correlated. It merely means that the linear model does not fit the relationship between the two variables very well, since that is what it tests. In reality, the two variables may be correlated, but not in a linear fashion.
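Here is a small illustration of this take-home message, using invented data. Below, y is completely determined by x, yet Pearson's r against x is zero because the relationship is U-shaped rather than linear. Correlating against x squared instead recovers a perfect linear fit.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# Invented data: y depends perfectly on x, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

print(pearson_r(x, y))                    # → 0.0
print(pearson_r([v ** 2 for v in x], y))  # → 1.0
```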

That's all I've got time for tonight.

Kevin Kirby · Jan 19, 2013

Simon Spooner said: ↑


Excellent response, Simon.

Now, I'm going to publish your words verbatim in my next book...but not give you credit for any of it.

Seriously, looking forward to Monday's installment of Statistics for Dummies.

Kevin Kirby · Jan 19, 2013

To supplement Simon's excellent response to my question, here are some good articles further explaining a few of the terms used.

Pearson's Product-Moment Correlation

R-squared or Coefficient of Determination

Scatter Diagrams and Linear Regression

Rules for Developing Regression Models

Data Transformation

Simon Spooner · Jan 20, 2013

Simon Spooner said: ↑

But of greater clinical value should be the ability to predict the x variable, i.e. tension, from the y variable, i.e. body weight.

Oops, that should be the other way around: predict the y variable (tension) from the x variable (body weight). As Kevin intimated, my mind was distracted by another issue last night when I was writing this.

Petcu Daniel · Jan 21, 2013

Kevin Kirby said: ↑

... With that in mind, let me try to direct some questions that will hopefully maximize the educational benefit to all of us on this important subject.


Thank you, Simon, for starting this very useful thread! And also Kevin, for pursuing its educational benefit!

It would be very interesting for me (and, I hope, not only for me) if you could address this subject from the practical point of view of the everyday clinical decision process of prescribing FFOs. I'm thinking here of the value of single-case experimental design.
In this regard, my question concerns the order in which the following steps should be taken:

A. which kind of experimental design will be used,
B. number of subjects,
C. type of statistical test.

Is that correct?
Respectfully,
Daniel

Simon Spooner · Jan 21, 2013

Kevin Kirby said: ↑

Question #1: I want to perform a research study on the correlation between the weight of the individual and the minimum tension force required to dorsiflex the hallux to a 10 degree dorsiflexed position while the subject is standing in relaxed bipedal stance. I can measure the weight of the individual to within 100 g with a scale and can measure tension force to within 0.1 Newtons with the tension measuring device and can measure hallux dorsiflexion to within 0.5 degrees of accuracy.

I would like to have enough subjects within my study so that I can determine, at a 0.05 level of significance, whether there is a positive correlation between the weight of the individual and the tension force required to dorsiflex their hallux to a 10 degree dorsiflexed position.

A. How would I determine, before I start the study, the number of subjects needed to be relatively certain that I could determine a 0.05 level of significance in the results of my study?

OK Kevin, as promised. As I said, when performing a prospective sample size calculation, we have to make certain educated guesses, principally about the standard deviations of the variables measured within your study.

We can specify our alpha level (you wanted 0.05), and we know that the accuracy of your tension measuring device is 0.1 N. What we don't know are the standard deviations of the variables of interest, in this case the body weight and plantar fascial tension data. So you'd need either to look for existing standard deviation data in the literature to apply to your sample size calculation, or to carry out a pilot study to get an idea of what these standard deviations might be.

Once you've done this, you can calculate gamma, which is 0.1 N divided by your standard deviation estimate.
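Here is a sketch of how such a calculation might be done in Python with the standard library's `NormalDist`. It uses normal approximations rather than exact t-based methods, which tend to add a subject or two.

The first function follows the gamma described above (smallest detectable difference divided by an SD estimate) and the usual two-independent-groups formula, $n = 2(z_{1-\alpha/2} + z_{1-\beta})^2/\gamma^2$ per group.

Kevin's design is actually a correlation, so the second function swaps in a different, commonly used technique: Fisher's z transformation. With that approach you guess the smallest correlation worth detecting rather than an SD.

The 80% power default, the 0.2 N SD and the r = 0.3 target are hypothetical planning values, not figures from the thread.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def gamma_effect_size(min_detectable_diff, sd_estimate):
    """Standardised effect size: smallest difference worth detecting / SD estimate."""
    return min_detectable_diff / sd_estimate


def n_per_group_two_means(gamma, alpha=0.05, power=0.80):
    """Approximate subjects per group, two independent groups, two-sided test."""
    z_a = Z.inv_cdf(1 - alpha / 2)
    z_b = Z.inv_cdf(power)
    return math.ceil(2 * ((z_a + z_b) / gamma) ** 2)


def n_for_correlation(r_expected, alpha=0.05, power=0.80, one_sided=True):
    """Approximate total n to detect a correlation of r_expected (Fisher z method)."""
    z_a = Z.inv_cdf(1 - alpha) if one_sided else Z.inv_cdf(1 - alpha / 2)
    z_b = Z.inv_cdf(power)
    c = 0.5 * math.log((1 + r_expected) / (1 - r_expected))  # Fisher z of r
    return math.ceil(((z_a + z_b) / c) ** 2 + 3)


# Hypothetical planning values: 0.1 N device resolution, SD of 0.2 N guessed from a pilot.
g = gamma_effect_size(0.1, 0.2)
print(f"gamma = {g}, n per group = {n_per_group_two_means(g)}")

# Kevin asked about a *positive* correlation, hence a one-sided test by default.
print(f"n to detect r = 0.3 (one-sided): {n_for_correlation(0.3)}")
```

Note how sensitive these numbers are to the guesses. Halving the assumed SD or doubling the target correlation changes n several-fold, which is why the pilot data matter.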

Simon Spooner · Jan 21, 2013

Petcu Daniel said: ↑

Thank you, Simon, for starting this very useful thread! And also Kevin, for pursuing its educational benefit!

It would be very interesting for me (and, I hope, not only for me) if you could address this subject from the practical point of view of the everyday clinical decision process of prescribing FFOs. I'm thinking here of the value of single-case experimental design.
In this regard, my question concerns the order in which the following steps should be taken:

A. which kind of experimental design will be used,
B. number of subjects,
C. type of statistical test.

Is that correct?
Respectfully,
Daniel

Daniel, I'm not sure what it is that you are asking.

jb3 · Jan 21, 2013

Dear All,

A very interesting discussion, but no-one has qualified their contributions by making it clear that the discussion so far has focused purely on quantitative research. I think it's worth saying, when addressing the very first point about sample size, that in some research methodologies it's perfectly acceptable to have a sample size as small as single figures. I'm of course referring to case study research, one of the mainstays of phenomenological research. Before everyone says "hold on, everyone knows that case studies aren't valid", I would suggest that folk examine the work of Flyvbjerg (2011) and Silverman (2010). Their excellent work suggests that there are significant misconceptions, usually regarding generalisability, which prejudice people's views. They both suggest that studies of very small groups or single cases help develop a strategy for raising questions about something, a good example being Karl Popper's black swan falsification test.

Of course, if you aren't interested in qualitative methods then feel free to ignore this, but I thought it worth saying that we need to remember that not all research is about p values, significance, and correlation coefficients.

JBB

Petcu Daniel · Jan 22, 2013

Simon Spooner said: ↑

Daniel, I'm not sure what it is that you are asking.

Sorry for my English!

If the sample size depends on the type of study (for example, a parallel study or a crossover study; see http://hedwig.mgh.harvard.edu/sample_size/size.html), then shouldn't we first establish the type of study and only then determine the sample size?

Sincerely,
Daniel

Simon Spooner · Jan 22, 2013

Petcu Daniel said: ↑

Sorry for my English!

If the sample size depends on the type of study (for example, a parallel study or a crossover study; see http://hedwig.mgh.harvard.edu/sample_size/size.html), then shouldn't we first establish the type of study and only then determine the sample size?

Sincerely,
Daniel

Right. The type of study will determine the sample size and the analysis performed to calculate sample size and power.
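To illustrate this with Daniel's example, here is a minimal sketch comparing a parallel-group design with a simple two-period crossover. It makes several assumptions: normal approximations, no period or carry-over effects, and a crossover analysed as paired within-subject differences. The crossover's advantage depends on the within-subject correlation ρ between the two periods, because the SD of a difference is SD·sqrt(2(1 - ρ)). All planning values below are hypothetical.

```python
import math
from statistics import NormalDist


def z_sum(alpha, power):
    """z for a two-sided alpha plus z for the desired power."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)


def n_parallel_per_group(delta, sd, alpha=0.05, power=0.80):
    """Two independent groups: each subject contributes one measurement."""
    return math.ceil(2 * (z_sum(alpha, power) * sd / delta) ** 2)


def n_crossover_total(delta, sd, rho, alpha=0.05, power=0.80):
    """Two-period crossover analysed as paired differences (no period/carry-over effects)."""
    sd_diff = sd * math.sqrt(2 * (1 - rho))  # SD of the within-subject difference
    return math.ceil((z_sum(alpha, power) * sd_diff / delta) ** 2)


# Hypothetical planning values: detect a 0.1 N difference, SD 0.2 N,
# within-subject correlation 0.7 between the two treatment periods.
per_group = n_parallel_per_group(0.1, 0.2)
print("parallel:", per_group, "per group,", 2 * per_group, "in total")
print("crossover:", n_crossover_total(0.1, 0.2, rho=0.7), "in total")
```

With ρ = 0.5, the crossover needs roughly half as many subjects in total as each arm of the parallel study. Higher within-subject correlation reduces the number further, provided the no-carry-over assumption is reasonable for the intervention.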

Petcu Daniel · Apr 10, 2017

Two very good papers on this thread's subject:
J Orthop Res. 1990 Mar;8(2):304-9.
Statistical significance and statistical power in hypothesis testing.
Lieber RL.
Abstract
Experimental design requires estimation of the sample size required to produce a meaningful conclusion. Often, experimental results are performed with sample sizes which are inappropriate to adequately support the conclusions made. In this paper, two factors which are involved in sample size estimation are detailed--namely type I (alpha) and type II (beta) error. Type I error can be considered a "false positive" result while type II error can be considered a "false negative" result. Obviously, both types of error should be avoided. The choice of values for alpha and beta is based on an investigator's understanding of the experimental system, not on arbitrary statistical rules. Examples relating to the choice of alpha and beta are presented, along with a series of suggestions for use in experimental design.
Full text on: http://muscle.ucsd.edu/More_HTML/papers/pdf/Lieber_JOR_1990.pdf

J Bone Joint Surg Am. 1989 Jul;71(6):800-10.
Corrective shoes and inserts as treatment for flexible flatfoot in infants and children.
Wenger DR, Mauldin D, Speck G, Morgan D, Lieber RL.
Abstract
We performed a prospective study to determine whether flexible flatfoot in children can be influenced by treatment. One hundred and twenty-nine children who had been referred by pediatricians, and for whom the radiographic findings met the criteria for flatfoot, were randomly assigned to one of four groups: Group I, controls; Group II, treatment with corrective orthopaedic shoes; Group III, treatment with a Helfet heel-cup; or Group IV, treatment with a custom-molded plastic insert. All of the patients in Groups II, III, and IV had a minimum of three years of treatment, and ninety-eight patients whose compliance with the protocol was documented completed the study. Analysis of radiographs before treatment and at the most recent follow-up demonstrated a significant improvement in all groups (p less than 0.01), including the controls, and no significant difference between the controls and the treated patients (p greater than 0.4). We concluded that wearing corrective shoes or inserts for three years does not influence the course of flexible flatfoot in children.
Full text on: http://muscle.ucsd.edu/More_HTML/papers/pdf/Wenger_JBJSA_1989.pdf
