Author Archives: pmean

Recommended: ROSE: A package for binary imbalanced learning

Logistic regression and other statistical methods for predicting a binary outcome run into problems when the outcome being tested is very rare, even in data sets big enough to ensure that the rare outcome occurs hundreds or thousands of times. The problem is that attempts to optimize the model across all of the data will end up predominantly optimizing for the negative cases, and could easily ignore and misclassify all or almost all of the positive cases, since they constitute such a small percentage of the data. The ROSE package generates artificial balanced samples to allow for better estimation and better evaluation of the accuracy of the model.
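To make the problem concrete, here is a minimal Python sketch. It is not the ROSE package (which is written for R and, as I understand it, uses a smoothed bootstrap to create synthetic cases); it only uses naive random oversampling of the rare class as a stand-in. The data, the 1% prevalence, and the function names are all made up for illustration.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of cases classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def sensitivity(y_true, y_pred):
    """Fraction of the true positive cases that the model flags."""
    flagged = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(flagged) / len(flagged)

def oversample_minority(labels, seed=0):
    """Return row indices for a balanced sample.

    Naive oversampling (an assumption for this sketch, not ROSE's method):
    rare-class rows are redrawn with replacement until both classes
    have the same count.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

# Hypothetical data: 100 positives among 10,000 cases (1% prevalence).
y = [1] * 100 + [0] * 9900

# A "model" that ignores the positives entirely.
always_negative = [0] * len(y)
print(accuracy(y, always_negative))     # looks excellent
print(sensitivity(y, always_negative))  # but finds no positive case at all

# After oversampling, half of the rows are positive cases.
balanced = oversample_minority(y)
print(sum(y[i] for i in balanced), len(balanced))
```

A model fit to the rows in `balanced` can no longer score well by ignoring the positive cases, which is the basic motivation for balancing. Copying rows like this adds no new information, though, so it is a cruder fix than generating artificial cases.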

PMean: How big is the stuff I’m working on

I have been working part-time on a project for the Great Plains Collaborative (GPC) under the direction of Russ Waitman and the gentle guidance of Dan Connolly, both at Kansas University Medical Center. I am hoping to submit a paper soon on the work I’ve done, but if you are curious about the size and scope of the electronic health records that I’ve been slinging around, this blog entry might help.

Recommended: Proving the null hypothesis in clinical trials

I’m attending a great short course on non-inferiority trials, and the speaker provided a key reference of historical interest. This reference is the one that got the statistics community interested in the concept of non-inferiority. The full text is behind a paywall, but you can look at the abstract. As a footnote, another paper, Dunnett and Gent 1977 (also trapped behind a paywall), addressed this problem earlier.

Recommended: Blind analysis: Hide results to seek the truth

This paper advocates something I would call a triple blind: keeping the doctor, the patient, and the statistician who analyzes the data in the dark as to which treatment group is which. This avoids problems where the people analyzing the data will, either consciously or subconsciously, manipulate the data to get a preferred result. It is an interesting idea, though it takes an awful lot of work to pull off.