Recommended: The FAIR Guiding Principles for scientific data management and stewardship

I’ve always been supportive of efforts to share data. For me, it’s a bit selfish, because I want to find interesting real world examples to use in teaching and on my web pages. But the issue goes way beyond this, of course. Sharing data is an ethical imperative, especially for federally funded research or research that relies on volunteer subjects. It has led to many important discoveries beyond the original context in which the data was collected. For data sharing to be effective, you need to embrace four guiding principles: your data needs to be findable, accessible, interoperable, and re-usable. This paper highlights those principles and offers some current examples of data sharing systems.

Continue reading

Recommended: 10 Easy Steps to a Complete Understanding of SQL

This page outlines some of the fundamental properties of SQL that you need to know as you start learning the language. For example, SQL is a declarative language, meaning that you tell it what you want and not how to compute it. Also, SQL syntax is not well-ordered, meaning that the order in which the clauses of a SQL statement are evaluated is not the same as the order in which they are written. Continue reading
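Here is a small sketch of those two properties, not taken from the linked page. It uses Python's built-in sqlite3 module, and the `visits` table and its rows are made up for illustration. The query states which rows are wanted and leaves the "how" to the engine. The written order (SELECT first) also differs from the usual logical order (FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY), which is why an aggregate works in HAVING but is rejected in WHERE.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (clinic TEXT, cost REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("north", 100.0), ("north", 50.0), ("south", 80.0)],
)

# Declarative: describe the result wanted; the engine decides how to get it.
# Written order:  SELECT, FROM, GROUP BY, HAVING, ORDER BY
# Logical order:  FROM, GROUP BY, HAVING, SELECT, ORDER BY
rows = conn.execute(
    """
    SELECT clinic, COUNT(*) AS n_visits
    FROM visits
    GROUP BY clinic
    HAVING COUNT(*) > 1
    ORDER BY n_visits DESC
    """
).fetchall()
print(rows)

# WHERE is applied before the groups are formed, so an aggregate
# in WHERE is rejected by SQLite.
try:
    conn.execute("SELECT clinic FROM visits WHERE COUNT(*) > 1 GROUP BY clinic")
    where_error = None
except sqlite3.OperationalError as exc:
    where_error = str(exc)
print(where_error)
```

The exact error text depends on the database engine, so treat the message printed here as specific to SQLite.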

Recommended: Tibbles (Tibbles are a modern take on data frames)

I’m an old dog R programmer who tends to rely on features of R that were available 10 years ago (an eternity for computers). But it’s time for this old dog to learn new tricks. One thing I need to use in my R programs is called a “tibble” (sometimes called a “tidy tibble”). It’s a minor but important improvement on data frames, and many of the newer packages are using tibbles instead of data frames. Tibbles are available in the tibble package. This web page offers a nice description of the improvements that tibbles make on data frames. Continue reading

Recommended: dplyr and pipes: the basics

One of the recent developments in R that I was unaware of until I attended some talks at the Joint Statistical Meetings was the use of dplyr and pipes. It’s an approach to data management that isn’t fundamentally different from earlier approaches, but the code is much easier to read and maintain. This blog post explains in simple terms how these work and why you would use them. Continue reading

PMean: Bad examples of data analysis are bad examples to use in teaching

I’m on various email discussion groups and every once in a while someone sends out a request that sounds something like this.

I’m teaching a class (or running a journal club or giving a seminar) on research design (or evidence based medicine or statistics) and I’d like to find an example of a research study that uses bad statistical analysis.

And there’s always a flood of responses back. But if I were less busy, I’d jump into the conversation and say “Stop! Don’t do it!” Here’s why. Continue reading

Recommended: The Importance of Reproducible Research in High-Throughput Biology

I have not viewed this video yet, but I have attended a similar talk by Keith Baggerly and read a similar research paper of his. His general message is that large biological and genetic experiments are sometimes designed so poorly as to invalidate the results. You can often discover these design flaws through a careful examination of the data sets themselves and their metadata. This process of uncovering design flaws is sometimes called “Forensic Statistics.” Continue reading