If you are interested in text mining, this is a good data set to start with. It is a bunch of text messages, each one line long, that have been classified by a human as either spam or ham (ham is a legitimate message). Continue reading
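To give a flavor of what you might do with such a corpus, here is a minimal sketch of a naive Bayes spam/ham classifier in Python, using only the standard library. The messages below are made-up examples, not drawn from the actual data set.

```python
import math
from collections import Counter

# Toy training data: (label, message) pairs. Made-up examples,
# not from the actual SMS corpus.
train = [
    ("spam", "win a free prize now"),
    ("spam", "free cash claim your prize"),
    ("ham",  "are we still on for lunch"),
    ("ham",  "see you at the meeting"),
]

def fit(data):
    """Count word frequencies per class and class frequencies."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for label, msg in data:
        class_counts[label] += 1
        word_counts[label].update(msg.split())
    return word_counts, class_counts

def predict(msg, word_counts, class_counts):
    """Pick the class with the higher log-posterior, using
    add-one (Laplace) smoothing so unseen words don't zero out
    the probability."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in msg.split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

word_counts, class_counts = fit(train)
print(predict("claim your free prize", word_counts, class_counts))
```

A real analysis of the full data set would use the same idea, just with thousands of messages and a proper train/test split.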
I have to help write NIH grants from time to time, and I need to always keep front and center the criteria that NIH peer reviewers use when they evaluate grants. They look at five broad areas: significance, investigators, innovation, approach, and environment. This document explains what each of these five broad areas means. Continue reading
This xkcd cartoon by Randall Munroe is released under a Creative Commons license, so I can hotlink the image directly. But if you go to the source, https://xkcd.com/327/, be sure to hover over the image for a second punch line.
I’m giving a talk about i2b2 (among other things), and while browsing through their website I came across an interesting project, SHRINE. This is an acronym for Shared Health Research Informatics Network, and it represents a way of allowing users to review information across multiple i2b2 sites. It requires the individual institutions that have i2b2 systems to cooperate with one another, which is not always easy. But this has tremendous potential. Continue reading
This xkcd cartoon by Randall Munroe is released under a Creative Commons license, so I can hotlink the image directly. But if you go to the source, https://xkcd.com/1179/, be sure to hover over the image for a second punch line.
I’ve been using a version of LaTeX (MiKTeX) for a couple of years, and it’s not bad. But when I heard about Yihui Xie’s R package, tinytex, I jumped at the opportunity to try it. Dr. Xie is the author of knitr, a package that makes it easy to create well-documented R programs where the code and the output are gracefully merged. He created this new package, tinytex, because he felt that current LaTeX distributions had complex installation processes and forced you to choose between a minimal installation that couldn’t do anything useful and a full installation that was bloated with features you’d never use. I can’t say much about the package yet, except that he is right: it is very easy to install. If I find out more, I’ll let you know. Continue reading
What percentage effort is reasonable for Biostatistics support on a research grant? The UC Davis Biostatistics Group says 10% as a bare minimum, 35-60% for straightforward projects with uncomplicated analyses, and 50-100%+ for large or complex projects. They give examples of large and complex projects: interim analysis, multi-site projects, development of novel statistical methods, and assembly of data from large, complex, or poorly documented administrative or survey data sets.
They also describe how to split the effort between a PhD Biostatistician, who supervises the overall effort, and an MS Biostatistician, who does most of the data management and statistical analysis.
Another point worth noting is that any grant listing less than 10% effort for a Biostatistician requires a special sign-off. Continue reading
There has been a lot written about data management problems with using spreadsheets, and there is a group, the European Spreadsheet Risks Interest Group, that has documented the problem carefully and thoroughly. This page on their website lists the big, big, big problems that have occurred because of spreadsheet mistakes. Any program is capable of producing mistakes, of course, but spreadsheets are particularly prone to errors for a variety of reasons that this group documents. Continue reading
There has been a lot written about how lousy Microsoft Excel (and other spreadsheet products) are at data management, but the warning sinks in so much more effectively when you can cite an example where the use of Excel led to an embarrassing retraction. Perhaps the best example is the paper by Carmen Reinhart and Kenneth Rogoff, where a major conclusion was invalidated when a formula inside their Excel spreadsheet accidentally included only 15 of the relevant 20 countries. Here’s a nice description of that event and some suggestions on how to avoid this in the future. Continue reading
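Here is a toy illustration of that kind of range error, with made-up numbers rather than Reinhart and Rogoff's actual data: a formula that stops at row 15 when all 20 rows were intended, the spreadsheet equivalent of =AVERAGE(B1:B15) where =AVERAGE(B1:B20) was meant.

```python
# Made-up growth rates for 20 hypothetical countries: the last
# five rows pull the average down, so omitting them matters.
growth = [3.0] * 15 + [-2.0] * 5

wrong_mean = sum(growth[:15]) / 15   # formula stops at row 15
right_mean = sum(growth) / 20        # formula covers all 20 rows

print(wrong_mean, right_mean)  # 3.0 vs 1.75
```

The bug is silent: both formulas compute a perfectly valid average, and nothing in the spreadsheet flags that the second range was the one intended.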
At first glance, you might think that this article looks like a vindication of traditional statistics: classical time series models (methods that were available in the 1960s) outperform newer machine learning forecasting models. Then you might worry that the comparisons were unfair. But neither viewpoint is accurate. The classical time series models have certain structural advantages for certain types of problems, but you might do better with machine learning if you use classical time series methods as a preprocessing step, such as de-seasonalizing your data. If nothing else, this article provides a nice overview of some of the major machine learning methods. Continue reading
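To make the de-seasonalizing idea concrete, here is a minimal sketch (with a synthetic series, assuming monthly data with period 12) of seasonal differencing, one of the simplest preprocessing steps you could apply before fitting a machine learning model:

```python
# Seasonal differencing as a preprocessing step, period 12
# (monthly data assumed). The series is synthetic: a linear
# trend plus a repeating 12-month seasonal pattern.
period = 12
seasonal = [5, 3, 0, -2, -4, -5, -5, -4, -2, 0, 3, 5]
series = [0.5 * t + seasonal[t % period] for t in range(48)]

# y'_t = y_t - y_{t-12} cancels the repeating pattern, leaving
# (here) only the constant 12-step trend increment of 6.0.
deseasonalized = [series[t] - series[t - period]
                  for t in range(period, len(series))]

print(deseasonalized[:3])  # every value is 0.5 * 12 = 6.0
```

A forecasting model would then be trained on the differenced values, and its predictions converted back by adding the corresponding value from one season earlier.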