Logistic regression and other statistical methods for predicting a binary outcome run into problems when the outcome being tested is very rare, even in data sets big enough to ensure that the rare outcome occurs hundreds or thousands of times. The problem is that optimizing the model across all of the data ends up focusing predominantly on the negative cases, and can easily ignore and misclassify all or almost all of the positive cases, since they constitute such a small percentage of the data. The ROSE package generates artificial balanced samples to allow for better estimation and better evaluation of the accuracy of the model.
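A minimal sketch of the idea, using the ROSE package's `ovun.sample()` function on a simulated data set (the data frame and column names here are illustrative, not from a real analysis):

```r
# Sketch: rebalancing a rare binary outcome with the ROSE package.
library(ROSE)

set.seed(42)
n <- 10000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
# Make the positive outcome rare (roughly 2% of cases)
dat$y <- rbinom(n, 1, plogis(-4 + dat$x1))
table(dat$y)  # heavily imbalanced

# Draw a roughly 50/50 sample by combining over- and under-sampling
balanced <- ovun.sample(y ~ x1 + x2, data = dat,
                        method = "both", p = 0.5)$data
table(balanced$y)

# Fit the logistic regression on the rebalanced data
fit <- glm(y ~ x1 + x2, data = balanced, family = binomial)
summary(fit)
```

The package also offers `ROSE()` itself, which generates fully synthetic balanced data by smoothed bootstrap rather than resampling the original rows.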

# Tag Archives: Big data

# Recommended: How Bright Promise in Cancer Testing Fell Apart

A nice overview of the problems with shoddy research in genetics testing. It highlights the “forensic statistics” work of Keith Baggerly and Kevin Coombes.

# Recommended: Standards and guidelines for the interpretation of sequence variants

This article outlines a standardized way to describe genetic variants.

# Recommended: Genetics: Clues in the code

If you want to understand the value of genomic medicine, you can learn a lot by reviewing the case of Nicholas Volker, one of the first success stories in this area. Here’s a nice review.

# Recommended: Text Mining with R

This is an O’Reilly book (the cute animal on the cover is a rabbit) that is available online for free. It’s a great resource for someone just getting started with text mining.

# PMean: Examining the storage format for sparse matrices in R

I’ve been working with sparse matrices a bit for my work with the Greater Plains Collaborative. They are a very useful way of storing matrices where most of the entries are zero, which occurs quite often in medical data. There are thousands of medical procedures that you can torture your patients with, so any matrix that has indicator variables for every medical procedure will be quite big. Fortunately, both for us and for the patients, the number of procedures that a particular patient has to endure is quite a bit smaller. So for each row of the matrix, the number of non-zero entries will be very small, probably in the single digits. A sparse matrix will be much smaller because it stores only the locations and values of the non-zero entries. Here’s some R code that shows how this works. I have the code available at my new GitHub site.
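To see the size difference concretely, here is a small sketch using the Matrix package (the patient and procedure counts are made up for illustration):

```r
# Sketch: dense vs sparse storage of a patient-by-procedure indicator matrix.
library(Matrix)

set.seed(1)
n_patients <- 10000
n_procedures <- 2000

# Give each patient a handful of procedures (single digits per row)
i <- rep(1:n_patients, times = 5)                          # row indices
j <- sample(n_procedures, 5 * n_patients, replace = TRUE)  # column indices
sparse_mat <- sparseMatrix(i = i, j = j, x = 1,
                           dims = c(n_patients, n_procedures))

dense_mat <- as.matrix(sparse_mat)

object.size(dense_mat)   # on the order of 160 MB: every zero is stored
object.size(sparse_mat)  # well under 1 MB: only non-zero entries stored
```

The sparse version only records, for each non-zero cell, which row and column it sits in and what its value is, which is why its size scales with the number of procedures performed rather than the number that could have been performed.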

# Recommended: Tessera. Open source environment for deep analysis of large complex data

I have not had time to preview this software, but it looks very interesting. It takes large problems and converts them to a form suitable for parallel processing, not by changing the underlying algorithm, which would be very messy, but by splitting the data into subsets, analyzing each subset, and recombining the results. Such a method, “Divide and Recombine,” should work well for some analyses, but perhaps not so well for others. It is based on the R programming language. If I get a chance to work with this software, I’ll let you know what I think.
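The general idea can be sketched in a few lines of plain R (this is just an illustration of divide and recombine, not Tessera’s actual API, and the data are simulated):

```r
# Sketch of "Divide and Recombine": split the data into subsets,
# fit the same model on each subset, then recombine the estimates.

set.seed(7)
n <- 100000
dat <- data.frame(x = rnorm(n))
dat$y <- 2 + 3 * dat$x + rnorm(n)

# Divide: assign rows to 10 subsets
dat$subset <- rep(1:10, length.out = n)

# Analyze: fit each subset independently (parallelizable in principle)
fits <- lapply(split(dat, dat$subset),
               function(d) coef(lm(y ~ x, data = d)))

# Recombine: average the per-subset coefficient estimates
Reduce(`+`, fits) / length(fits)
```

For a simple linear model like this the averaged coefficients land very close to the full-data fit; the interesting question, as noted above, is which analyses recombine this cleanly and which do not.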

# Recommended: Special issue–Using Big Data to Transform Care

The July 2014 issue of Health Affairs is devoted entirely to “big data”. The articles provide a general overview of big data, several applications of big data, big data and genomics, use of electronic health records, and ethical issues including privacy concerns. For now, at least, the articles are available for free to any user.

# Recommended: More on Big Data Training for the Scientific Workforce

The United States National Institutes of Health is very interested in big data and has developed a working group, Big Data to Knowledge (BD2K). This blog post from Sally Rockey, the Deputy Director for Extramural Research, summarizes some of the recent activity of BD2K.

# PMean: NIH is interested in big data

The National Institutes of Health has shown a recent interest in “big data.” You can define big data in several ways, but a common characterization is the three V’s: big data takes up a lot of space (volume), and/or it comes at you very rapidly (velocity), and/or it comes in a wide range of differing formats (variety). One of the recent Requests for Applications (RFAs) from NIH spells out what types of research into big data they are interested in seeing. I might be interested in applying, and would love to find some collaborators. Here’s a summary of what the RFA is all about.