I attended a talk about a decade ago on the problems with for-profit publishing of scientific research and the need to aggressively adopt the open access publication model. It was a message I was ready for, because I had benefited greatly from citing open access resources on my website. I knew that if I cited an open access resource, anyone anywhere could look up that resource. They didn’t need access to a university library.
This article explains how the for-profit research journals (perhaps better described as a reader-pays model, in contrast to an author-pays model) developed a system that locked research libraries into their product and then hiked the price. Then they developed journal bundles that further squeezed libraries by forcing them into a take-it-all-or-leave-it-all system that devastated their budgets.
There is still a struggle between the reader-pays model of for-profit publishing and the author-pays model of open access publishing, and I believe there is room for both approaches, though I would argue that we need to promote open access publishing more aggressively than we currently do.
This article provides a very nice historical context to the development of for-profit publishing in scientific research. It oversimplifies things, perhaps, and may be a bit too harsh, but it is definitely worth reading.
As an ironic footnote, newspapers have been devastated by the Internet because of the expectations of readers that all of their content should be available for free. There is a note at the bottom of the Guardian article that reads: “Since you’re here we have a small favour to ask. More people are reading the Guardian than ever but advertising revenues across the media are falling fast. And unlike many news organisations, we haven’t put up a paywall – we want to keep our journalism as open as we can. So you can see why we need to ask for your help. The Guardian’s independent, investigative journalism takes a lot of time, money and hard work to produce. But we do it because we believe our perspective matters – because it might well be your perspective, too.”
Take some time to read this and think about it. I normally ignore pitches like this on Wikipedia and elsewhere, but the irony of citing a newspaper article available for free to criticize for-profit research publishing got to me, so I became a supporter of the Guardian at $6.99 per month.
This is yet another interesting source of data. This site specializes in databases prepared by the United States government.
In contrast to the FiveThirtyEight data sets, which are mostly smallish, the Kaggle data sets are, as a rule, very large. They also include a lot of text data for natural language processing, sentiment analysis, etc.
This is a GitHub repository of a lot of interesting data sets created by the FiveThirtyEight website. I presume there is a story associated with most of these data sets. The data sets look to be moderate in size for the most part and would make interesting teaching examples.
If you work with probability distributions a lot, you find there are multiple parameterizations (e.g., the two different forms of the exponential distribution), as well as interesting relationships (the geometric distribution is a discrete version of the exponential distribution). I have found Wikipedia to be a nice guide for some of this, but the coverage is uneven in quality. One of the Wikipedia links mentioned a new website, ProbOnto, that offers a systematic and standardized attempt to catalog every important probability distribution and the relationships among these distributions.
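To make the two examples in that paragraph concrete, here is a minimal Python sketch using only the standard library. It is my own illustration, not something taken from ProbOnto, and the function names are hypothetical. It checks that the rate and scale forms of the exponential density agree when the scale is the reciprocal of the rate, and that rounding an exponential variable up to the next whole number gives a geometric distribution with success probability p = 1 - exp(-rate).

```python
import math

def exp_pdf_rate(x, rate):
    """Exponential density in the rate form: rate * exp(-rate * x)."""
    return rate * math.exp(-rate * x)

def exp_pdf_scale(x, scale):
    """Exponential density in the scale (mean) form: exp(-x / scale) / scale."""
    return math.exp(-x / scale) / scale

def exp_survival(x, rate):
    """P(X > x) for an exponential variable with the given rate."""
    return math.exp(-rate * x)

def geom_survival(k, p):
    """P(N > k) when N counts trials up to and including the first success."""
    return (1 - p) ** k

rate = 0.5  # illustrative value, chosen arbitrarily

# The two parameterizations describe the same distribution when scale = 1 / rate.
same_density = math.isclose(exp_pdf_rate(1.3, rate), exp_pdf_scale(1.3, 1 / rate))

# If X is exponential and N = ceil(X), then P(N > k) = P(X > k) = exp(-rate * k),
# which is a geometric survival function with p = 1 - exp(-rate).
p = 1 - math.exp(-rate)
matches_geometric = all(
    math.isclose(exp_survival(k, rate), geom_survival(k, p)) for k in range(10)
)

print(same_density, matches_geometric)
```

The practical point is the one that makes a catalog useful: before comparing output from two packages, check which parameterization each one expects.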
This xkcd cartoon by Randall Munroe is released under a Creative Commons license, so I can hotlink the image directly. But if you go to the source, https://xkcd.com/1838/, be sure to hover over the image for a second punch line.
Arguing about R versus SAS often takes on a religious fervor, so I normally hesitate to recommend articles that trash one package or the other. But this one raises an interesting point which makes it worth reading. Note that “recommended” does not mean that I endorse these conclusions. But rather than bias you with my perception of the issue, just read this on your own.
This is a research summary of a study that found that one out of every five cancer trials “did not finish,” which actually means that they stopped early for futility, if I am reading between the lines properly. Of those studies, 40% stopped early because of poor accrual.
I normally don’t recommend other people’s tweets, but these two, from Siobhan Tompson, were too funny to pass up.
Logistic regression and other statistical methods for predicting a binary outcome run into problems when the outcome being tested is very rare, even in data sets big enough to ensure that the rare outcome occurs hundreds or thousands of times. The problem is that attempts to optimize the model across all of the data end up predominantly optimizing the fit to the negative cases, and could easily ignore and misclassify all or almost all of the positive cases, since they constitute such a small percentage of the data. The ROSE package generates artificial balanced samples to allow for better estimation and better evaluation of the accuracy of the model.
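The imbalance problem and the general idea of rebalancing can be sketched in a few lines of Python. This is not the ROSE algorithm or its R interface. It is a simplified illustration under my own assumptions: the function `balance_by_oversampling`, its `noise_sd` parameter, and the simulated data are all hypothetical. It resamples the rare class with replacement and adds a little Gaussian jitter so the artificial cases are not exact duplicates.

```python
import random

def balance_by_oversampling(rows, labels, noise_sd=0.0, seed=0):
    """Return rows and labels with the rarer class oversampled to parity.

    rows is a list of numeric feature lists and labels is a list of 0/1 values.
    Extra minority rows are drawn with replacement and jittered with
    Gaussian noise of standard deviation noise_sd (hypothetical parameter).
    """
    rng = random.Random(seed)
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    minority_label = 1 if minority is pos else 0
    extra = []
    for _ in range(len(majority) - len(minority)):
        base = rng.choice(minority)  # resample a rare case with replacement
        extra.append([v + rng.gauss(0.0, noise_sd) for v in base])
    return rows + extra, labels + [minority_label] * len(extra)

# Simulated data: 990 negative cases and 10 positive cases (1% prevalence).
data_rng = random.Random(1)
rows = [[data_rng.gauss(0, 1)] for _ in range(990)]
rows += [[data_rng.gauss(3, 1)] for _ in range(10)]
labels = [0] * 990 + [1] * 10

# A "model" that always predicts negative looks accurate but finds no positives.
baseline_accuracy = labels.count(0) / len(labels)

bal_rows, bal_labels = balance_by_oversampling(rows, labels, noise_sd=0.1)
print(baseline_accuracy, bal_labels.count(0), bal_labels.count(1))
```

A model fit to the balanced sample can no longer score well by ignoring the positive cases. Its accuracy should still be evaluated on data that reflect the original prevalence.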