Kaggle data. Available at https://www.kaggle.com/datasets.

FiveThirtyEight Data. Available at https://github.com/fivethirtyeight/data.

Maciej J. Swat, Pierre Grenon, and Sarala Wimalaratne. ProbOnto. Available at https://sites.google.com/site/probonto/home.

First, there are several ways to remove leading and trailing blanks. I used the commands

mv1[, i] %<>% sub("^ +", "", .)
mv1[, i] %<>% sub(" +$", "", .)

which might not be the best approach, but it worked for every instance of " Y" and "Y " except one value that still looked like "Y ". It took forever to find, because searching on "Y " obviously didn't work. I finally searched for anything that was not among what I thought the possible values might be. It was row 168. When you printed the value,

> mv1[168, 14]
[1] "Y "

it looked fine. So I figured I had to look at the underlying ASCII codes. There's a way to do this in R: the charToRaw function.

> charToRaw(mv1[168, i])
[1] 59 a0

I knew that the ASCII code for a blank is 32 (hex 20), so something was definitely up here. A quick web search shows that a0 is a non-breaking space. How that got into the data set is a mystery to me. Now, how do you fix it? There is a rawToChar function, but if you just say

rawToChar(a0)

it won't work; a0 by itself is not a valid R object. There's an as.raw function, but it won't accept the hexadecimal string "a0"; you have to convert a0 to its decimal equivalent, 160, and then rawToChar(as.raw(160)) works. There's also an intToUtf8 function and probably other ways as well.
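Putting those pieces together, here is a minimal sketch of finding and replacing a non-breaking space. The string x is a made-up stand-in for the problem value in the data set, not the original data:

```r
# build a test value: "Y" followed by a non-breaking space (decimal 160, hex a0)
nbsp <- intToUtf8(160)
x <- paste0("Y", nbsp)

# it prints like an ordinary trailing blank, but the byte codes reveal it
print(x)
print(charToRaw(x))

# sub() with an ordinary blank does nothing; target the nbsp directly
x_clean <- sub("\u00a0+$", "", x)
print(x_clean)                       # "Y"

# alternatively, turn every nbsp into a regular space, then trim as usual
x_clean2 <- trimws(gsub("\u00a0", " ", x))
print(x_clean2)                      # "Y"
```

Note that intToUtf8(160) produces the Unicode non-breaking space; in a latin-1 encoded file the same character is the single byte a0, which is why charToRaw showed 59 a0 above.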

This xkcd cartoon by Randall Munroe is open source, so I can hotlink the image directly. But if you go to the source, https://xkcd.com/1838/, be sure to hover over the image for a second punch line.

William Vorhies, Why R is Bad for You. Data Science Central Blog, posted on May 15, 2017. Available at http://www.datasciencecentral.com/profiles/blogs/why-r-is-bad-for-you.

Before I say too much, I have to point out that many statisticians hate, hate, hate Microsoft Excel. There are good reasons for this. But for your analysis, it might be okay.

The discrepancy that you are seeing is due to rounding. When SPSS reports a p-value that prints as .000, it means that when you round to three decimal places it is closer to .000 than it is to .001. So you know the p-value is smaller than .0005. A p-value of 0.0000000039433 is also smaller than .0005. So the two p-values are consistent.

As a general rule, I round any p-value smaller than .001 up to .001. Some people think that very small p-values should be reported in all their microscopic glory because they make the results seem more impressive. It’s almost like a competition. Ha! My p-value has more zeros in front of it than your p-value. But beyond three decimal places, the p-value is extremely sensitive to even very minor departures from the underlying assumptions.

There are a few exceptions to this in Physics and Genetics, but reporting more precision in your p-value than you really need marks you as an amateur. So be a reasonable person and report the p-value as .001.

Some people might prefer p < .001 and I won't complain about that. But never, never, never use scientific notation on very small p-values to give your p-value a false sense of precision.
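The rounding rule above can be captured in a tiny helper. format_p is a hypothetical name for illustration, not a standard R function:

```r
# floor the reported p-value at .001, as argued above
format_p <- function(p) {
  if (p < 0.001) "<.001" else sub("^0", "", sprintf("%.3f", p))
}

format_p(0.0000000039433)  # "<.001"
format_p(0.047)            # ".047"
format_p(0.3)              # ".300"
```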

Cancer.net. One in Five Clinical Trials for Adults with Cancer Never Finish – New Study Examines the Reasons. Genitourinary Cancers Symposium. January 28, 2014. Available at http://www.cancer.net/one-five-clinical-trials-adults-cancer-never-finish-%E2%80%93-new-study-examines-reasons.

One person shared the transcript of an FDA review of a drug that was tested in 41 patients, 38 of whom came from Brigham & Women's Hospital, with the remaining 3 coming from a hospital associated with the University of Nebraska. As you might expect, the FDA panel was not thrilled with the ability to generalize from this study to the overall population, but they did acknowledge that the patients themselves were geographically dispersed and came to Brigham & Women's Hospital because this was a rare condition that few places were qualified to treat.

Another person suggested a 2.5 to 1 ratio maximum, meaning that if you had K centers, no more than 2.5/K of the patients could come from a single center. For example, with 10 centers, no more than 2.5/10 = 25% should come from any one center. For pivotal studies you might want to consider a stricter limit like 2/K. This is a very ad hoc approach, of course, but you could run some simulations using an influence function and a score function to see how much the imbalance hurts. I can't say that I completely follow this last comment.
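As a sanity check on the arithmetic, the suggested caps are easy to compute. enrollment_cap is just an illustrative name for the 2.5/K (or stricter 2/K) rule described above:

```r
# maximum share of patients any single center should contribute
enrollment_cap <- function(K, factor = 2.5) factor / K

enrollment_cap(10)              # 0.25, i.e. 25% with 10 centers
enrollment_cap(10, factor = 2)  # 0.20 under the stricter pivotal-study rule
enrollment_cap(4)               # 0.625 -- with few centers the rule barely binds
```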

Another person shared a guidance document, CPMP/EWP/2330/99, which discusses meta-analyses and relying on a single pivotal trial. This guidance offers the general admonition that none of the centers in a multi-center trial should dominate the study, either in the number of subjects or in the magnitude of the effect. The latter means, I presume, that if the efficacy finding comes mostly from positive results at just one of the centers, that is a serious concern.

Addendum: I did a quick Google search and found a discussion on enrollment caps in a multicenter trial on the MEDSTATS forum on January 28, 2015. One reply had a link to an article in the Journal of the American College of Cardiology arguing that, at least in one clinical trial, high-enrolling sites differed in important ways from low-enrolling sites. This article has an accompanying editorial.

Another search found a Johns Hopkins report that discussed (among other things) the impact of enrollment caps on individual sites. And a 2007 blog entry suggested a cap of 3/K to 5/K.
