Recommended: What would Florence Nightingale make of big data?

This is a nice, professionally produced, and very short (4 minutes) video that shows the importance Florence Nightingale attached to Statistics. It reviews how she used Statistics aggressively to lobby for improvements to health care, and speculates on what she would think about today's efforts to use big data for decision making. The narrator is David Spiegelhalter, a famous statistician.

Recommended: Case for omitting tied observations in the two-sample t-test and the Wilcoxon-Mann-Whitney Test

When you are running a non-parametric test, like the Wilcoxon-Mann-Whitney test, you can only be 100% sure of the properties of that test (including Type I and Type II error rates) if the data are continuous. If there are ties in the data, the properties of the test are unknown. This paper describes four commonly used approaches for settings where values might be tied, and runs simulations to measure Type I and Type II error rates for both the two-sample t-test and the Wilcoxon-Mann-Whitney test under a range of tied values and a range of distributions. The results are, at least to me, quite surprising.
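To give a feel for what a simulation like this involves, here is a minimal sketch in Python. It is not the paper's code and does not reproduce its four approaches. It assumes one particular way of handling ties (midranks plus the usual tie-corrected normal approximation, with no continuity correction), and it creates ties by rounding standard normal data. The function names `midranks`, `wmw_p_value`, and `type1_error_rate` are my own, chosen for illustration.

```python
import math
import random
from statistics import NormalDist


def midranks(values):
    """Rank the values (1-based), giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wmw_p_value(x, y):
    """Two-sided Wilcoxon-Mann-Whitney p-value via the normal approximation.

    Assumption: ties get midranks and the variance uses the standard
    tie correction; no continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t**3 - t for t in counts.values())
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var <= 0:  # every observation identical: no evidence either way
        return 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(var)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def type1_error_rate(n_per_group=20, n_sims=2000, alpha=0.05,
                     decimals=None, seed=1):
    """Estimate the rejection rate when both groups share one distribution.

    decimals=None keeps the data continuous; decimals=0 rounds to whole
    numbers, which produces many ties.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        x = [rng.gauss(0, 1) for _ in range(n_per_group)]
        y = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if decimals is not None:
            x = [round(v, decimals) for v in x]
            y = [round(v, decimals) for v in y]
        if wmw_p_value(x, y) < alpha:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    print("continuous data:", type1_error_rate())
    print("rounded to integers:", type1_error_rate(decimals=0))
```

Comparing the two printed rates against the nominal 0.05 shows how much (or how little) heavy rounding moves the Type I error rate for this one tie-handling rule. A fuller study would also vary the distribution, the sample sizes, and the tie-handling approach, as the paper does.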

Recommended: When life gives you coloured cells, make categories

A lot of people use formatting to denote important information in an Excel spreadsheet. In particular, they will use the color of a cell to designate a particular category. Pretty much all formatting information is lost when you import from Excel to R or any other statistical package. But rather than ask people to go back and fix things, there are a few tricks that you can use to recover this information, as shown in this blog entry.

PMean: Writing the introduction section of a research thesis or dissertation

The introduction section of your research thesis or dissertation is the first thing that most people will read after reading the abstract. Some people use the introduction section to provide a literature review, and I won’t talk about that here. I did offer a nice recommendation on how to write a literature review in an earlier post. The introduction should present your research problem (research question, research hypothesis), but first you have to offer some context.

Recommended: Data Science Has Become About Lending False Credibility To Decisions We’ve Already Made

A rather harsh and cynical take on data science, but still worth reading. Let me share a story about this. Back in my college days (that would be the 1970s), someone found a New Yorker cartoon and shared it with me. It showed a politician, obviously a very powerful politician because his office had a view of the Washington Monument. He was speaking to his aide: “That’s the gist of what I want to say. Now go and find me some statistics to base it on.” So the issues that this person brings up are no different from those of four decades ago. There’s no easy solution to the problem. You can’t say, “I’ll only work with people who have a commitment to the truth, no matter where it might lead,” because even people without strong overt biases still have subtle biases that can profoundly skew the results. Requiring a priori specifications and reserving a holdout sample for a final quality check can help, but mostly it comes down to being careful, detail-oriented, and transparent in all your work.
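As a small illustration of the holdout idea, here is a minimal sketch in Python. The function name `split_holdout`, the 20% fraction, and the fixed seed are my own assumptions for the example, not anything taken from the article. The point is that the split is made once, reproducibly, before any analysis starts, and the holdout portion is not examined until the final check.

```python
import random


def split_holdout(records, holdout_fraction=0.2, seed=2024):
    """Shuffle once with a fixed seed and set aside a holdout sample.

    Returns (working, holdout). The holdout should stay untouched
    until the analysis plan is final.
    """
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


if __name__ == "__main__":
    working, holdout = split_holdout(range(100))
    print(len(working), "records to explore;", len(holdout), "held back")
```

Recording the seed and the fraction alongside the a priori analysis plan makes it possible for someone else to confirm that the holdout sample was chosen before the results were known.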