PMean: NIH is interested in big data

The National Institutes of Health has recently shown an interest in “big data.” You can define big data in several ways, but a common characterization is the three V’s. Big data takes up a lot of space (volume) and/or it comes at you very rapidly (velocity) and/or it comes in a wide range of differing formats (variety). One of the recent Requests for Applications (RFAs) from NIH spells out the types of big data research that it is interested in seeing. I might be interested in applying, and would love to find some collaborators. Here’s a summary of what the RFA is all about.

The RFA calls for “the development of innovative approaches to tough problems.” You don’t have to have “fully hardened” software, but you do need to demonstrate proof of concept. NIH stresses repeatedly throughout the RFA the need to address biomedical data. Your application needs to fit into one of four areas.

1. Data compression/reduction. Big data often strains your computational resources. To make big data work for you, you need some way to compress the data or minimize redundancy while still maintaining data integrity.

2. Data visualization. You need graphical tools that allow you to “intuitively delve into and explore within the data, gaining both insight into a dataset’s structure and interrelationships and extracting knowledge from the data.” This is the area that I might be most interested in, and I’ve done some work on visualizing large text files that might be worth expanding on (here and here).

3. Data provenance. This was a new one on me. The RFA defines data provenance as “the chronology or record of transfer, use, and alteration of data that documents the reverse path from a particular set of data back to the initial creation of a source dataset.” Although the RFA does not use the term “reproducible research,” it is pretty clear that data provenance is an important component of reproducible research.

4. Data wrangling. This is another term that is new to me, but I like the imagery that it provides. The RFA defines data wrangling as “activities that make data more usable by changing their form but not their meaning.” Tools for data wrangling will help you in lots of ways: submitting data to a repository, making it available on the Internet, and loading data into standard statistical software packages. The RFA contrasts data wrangling with data mining, which is not a focus of the RFA. Data mining, as the RFA defines it, involves “the extraction of data content.”
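
To make the idea of “changing their form but not their meaning” a bit more concrete, here is a minimal Python sketch of one very small wrangling step: converting CSV text to JSON and checking that nothing was lost along the way. This is only an illustration under my own assumptions, not anything taken from the RFA. The sample data, the column names, and the helper functions are all hypothetical.

```python
import csv
import io
import json

# Hypothetical sample data, invented purely for illustration.
RAW_CSV = """patient_id,visit,sbp
001,baseline,142
001,week4,135
002,baseline,128
"""


def csv_to_records(text):
    """Parse CSV text into a list of dicts, keeping every value as a string."""
    return list(csv.DictReader(io.StringIO(text)))


def records_to_json(records):
    """Serialize the records as JSON (a change of form only)."""
    return json.dumps(records, indent=2)


def json_to_records(text):
    """Read the JSON back into a list of dicts."""
    return json.loads(text)


records = csv_to_records(RAW_CSV)
as_json = records_to_json(records)

# Round-trip check: the form changed (CSV to JSON), the content did not.
assert json_to_records(as_json) == records
```

Keeping every value as a string is a deliberate choice in this sketch. If the patient identifier "001" were converted to the number 1, the leading zeros would be lost, and that would be a change in meaning rather than just a change in form.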

I may or may not write a grant in response to the RFA, but even if I don’t, it was worthwhile to read the RFA to see the four areas in big data where NIH sees the need for more work.

You can view the full RFA at