In contrast to the Five Thirty Eight data sets, which are mostly smallish, the Kaggle data sets are, as a rule, very large. They also include a lot of text data for natural language processing, sentiment analysis, etc.

# Recommended: Five Thirty Eight Data

This is a GitHub repository of a lot of interesting data sets created by the Five Thirty Eight website. I presume there is a story associated with most of these data sets. The data sets look to be moderate in size for the most part and would make interesting teaching examples.

# Recommended: ProbOnto

If you work with probability distributions a lot, you will find that there are multiple parameterizations (e.g., the two different forms of the exponential distribution), as well as interesting relationships (the geometric distribution is a discrete version of the exponential distribution). I have found Wikipedia to be a nice guide for some of this, but the coverage is uneven in quality. One of the Wikipedia links mentioned a new website, ProbOnto, that offers a systematic and standardized attempt to catalog every important probability distribution and the relationships among these distributions.
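Here is a minimal sketch in Python of the two points in parentheses. It is only an illustration: the function names are my own, and nothing here comes from ProbOnto itself. It shows that the rate and scale forms of the exponential distribution describe the same density when scale = 1 / rate, and that rounding an exponential variable down to a whole number gives a geometric distribution (counting failures before the first success) with p = 1 - exp(-rate).

```python
import math

def exp_pdf_rate(x, rate):
    """Exponential density written with a rate parameter (lambda)."""
    return rate * math.exp(-rate * x)

def exp_pdf_scale(x, scale):
    """Exponential density written with a scale parameter (the mean)."""
    return math.exp(-x / scale) / scale

def exp_survival(x, rate):
    """P(X > x) for an exponential variable with the given rate."""
    return math.exp(-rate * x)

def geometric_pmf(k, p):
    """P(K = k) for k = 0, 1, 2, ... (failures before the first success)."""
    return (1.0 - p) ** k * p

rate = 0.5  # arbitrary illustrative value

# The two parameterizations agree when scale = 1 / rate.
same_density = math.isclose(exp_pdf_rate(1.3, rate),
                            exp_pdf_scale(1.3, 1.0 / rate))

# P(k <= X < k + 1) equals the geometric probability with p = 1 - exp(-rate).
p = 1.0 - math.exp(-rate)
matches_geometric = all(
    math.isclose(exp_survival(k, rate) - exp_survival(k + 1, rate),
                 geometric_pmf(k, p))
    for k in range(20)
)

print(same_density, matches_geometric)
```

Note that the geometric distribution itself has two common conventions (support starting at 0 or at 1), which is another example of the parameterization problem.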

# PMean: It only looks like a blank

I was having trouble with trailing blanks in an R program. There were some strings that looked like “ Y” and “Y ” and it’s easy enough to fix this, but one of the “Y ” values was not converting properly. The second character wasn’t a blank, but it looked like it. Here’s what I had to do.
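As a rough illustration of this kind of problem (in Python rather than R, and not the fix from the original post), the sketch below assumes the look-alike character is a no-break space, U+00A0. That is only a guess at a common culprit. The helper names `describe_chars` and `clean` are hypothetical.

```python
import unicodedata

def describe_chars(s):
    """List each character with its code point and Unicode name."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
            for ch in s]

ordinary = "Y "        # Y followed by an ordinary space (U+0020)
lookalike = "Y\u00a0"  # Y followed by a no-break space (assumed culprit)

# The two strings print the same way but are not equal.
print(ordinary == lookalike)

# Stripping only ordinary spaces leaves the look-alike untouched.
print(repr(lookalike.strip(" ")))

# Inspecting code points reveals the hidden character.
for row in describe_chars(lookalike):
    print(row)

def clean(s):
    """Replace no-break spaces with ordinary spaces, then strip spaces."""
    return s.replace("\u00a0", " ").strip(" ")

print(repr(clean(lookalike)))
```

The general idea carries over to any language: when a blank refuses to be trimmed, print the code points before assuming it is a blank.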

# Recommended: This is your machine learning system?

This xkcd cartoon by Randall Munroe is open source, so I can hotlink the image directly. But if you go to the source, https://xkcd.com/1838/, be sure to hover over the image for a second punch line.

# Recommended: Why R is Bad for You

Arguing about R versus SAS often takes on a religious fervor, so I normally hesitate to recommend articles that trash one package or the other. But this one raises an interesting point which makes it worth reading. Note that “recommended” does not mean that I endorse these conclusions. But rather than bias you with my perception of the issue, just read this on your own.

# PMean: A p-value of .000

Dear Professor Mean, I ran a statistical test in SPSS and got a p-value of .000. I re-ran the same data in Microsoft Excel and got a p-value of 3.9433E-9. I know from scientific notation that this is the same as 0.0000000039433. Why are these numbers different?
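A short Python sketch shows how a single number can produce both displays. It assumes the difference is purely one of formatting: one display rounds to three decimal places and the other uses scientific notation. The `report_p` helper is hypothetical and follows a common reporting convention, not anything built into SPSS or Excel.

```python
p_value = 3.9433e-9

# Rounded to three decimal places, the value displays as zero.
three_decimals = f"{p_value:.3f}"            # '0.000'
no_leading_zero = three_decimals.lstrip("0")  # '.000'

# Scientific notation keeps the significant digits.
scientific = f"{p_value:.4E}"                # '3.9433E-09'

# Written out in full, it is a very small but nonzero number.
written_out = f"{p_value:.13f}"              # '0.0000000039433'

def report_p(p, digits=3):
    """Format a p-value, using 'p < threshold' when it would round to zero."""
    threshold = 10 ** -digits
    if p < threshold:
        return f"p < {threshold:.{digits}f}"
    return f"p = {p:.{digits}f}"

print(no_leading_zero, scientific, written_out)
print(report_p(p_value))   # p < 0.001
print(report_p(0.0421))    # p = 0.042
```

A p-value is never exactly zero here, so reporting it as "p < 0.001" avoids the misleading impression that ".000" gives.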

# Recommended: One in Five Clinical Trials for Adults with Cancer Never Finish

This is a research summary of a study that found that one out of every five cancer trials “did not finish,” which, if I am reading between the lines properly, actually means that they stopped early for futility. Of those studies, 40% stopped early because of poor accrual.

# PMean: Extremely imbalanced multi-center trials

There was some recent discussion of issues with multi-center trials where one center dominates, contributing as much as 94% of all the patients. What does this do to the generalizability of the study? I wanted to summarize these comments here, because they relate to some of the issues I’m looking at right now in accrual models for multi-center trials.

# Recommended: The numbers for the Science March…

I normally don’t recommend other people’s tweets, but these two, from Siobhan Tompson, were too funny to pass up.