Tag Archives: Datasets

Recommended: Medicare Claims Synthetic Public Use Files (SynPUFs)

This page is moving to a new website.

The Centers for Medicare & Medicaid Services (CMS) provides researchers with access to Medicare claims data, which is a wonderful resource. But you have to sign a restrictive agreement before they will give you this data, and you have to pay a non-trivial amount of money to get it. Fair enough, because CMS has to guarantee patient confidentiality, among other things. But what if you want to “play” with the data before taking the plunge? Thankfully, CMS has provided to the general public a synthetic (read: fake) data set that has the same data structure. This allows you to prototype your programs on the synthetic data and then transition easily to the real data. Continue reading →

Recommended: TSHS Resource Portal

The Teaching of Statistics in the Health Sciences (TSHS) section of the American Statistical Association has put together a set of resources for teachers, including several very interesting datasets. Some of the resources are open to anyone, but others require registration. Continue reading →

Recommended: 100+ Interesting Data Sets for Statistics

This list starts out with a data set of 216,930 previous Jeopardy questions and goes from there. Not everything suggested is easily amenable to statistical analysis, but the list is extremely interesting and diverse. In particular, this list is very helpful for anyone interested in text data. Continue reading →

Recommended: Institute for Digital Research and Education — Statistical Computing

This is a wonderful site, but for some reason, it is difficult to find. The Institute for Digital Research and Education (IDRE) at UCLA has put together some wonderful resources on how to do simple data analyses in R, SAS, SPSS, and Stata. The examples cover just about everything you’d ever want to do in any of these statistical packages. If you are making a transition from one statistical package to another, this site offers you the opportunity to see how things are done in the package you know well and compare it to how things are done in the package you are learning. Of special note are the worked textbook examples from many classic statistics textbooks. Continue reading →

Recommended: ENCODE: Encyclopedia of DNA Elements

The genetics research community should be lauded for the openness with which they share research data. You can find numerous data sources that are free and without ANY restrictions. One very good example is ENCODE, the Encyclopedia of DNA Elements. This repository holds mostly human data, but some mouse, fruit fly, and roundworm data as well. It has data from many different assays, including ChIP-seq, RNA-seq, and DNase-seq. It looks like a great teaching resource, though it does require a fairly hefty understanding of genetics to browse through the data. Continue reading →

PMean: Some interesting publicly available data sets

I’ve been looking for a few interesting data sets for use as teaching examples. I wanted data associated with peer-reviewed publications. It’s a difficult and tedious search, but here are a few promising leads. Continue reading →

PMean: Simple longitudinal data sets to illustrate data management

I am working on a class that will teach basic data management and graphics using the R programming language, with parallel classes in SPSS and SAS. On the third or fourth day of the class, we will look at managing longitudinal data sets, as these require special skills. I wanted to find a couple of reasonably simple longitudinal data sets that were available on the web and that had at least a few missing values in them. Here are a couple of data sets that might work. Continue reading →

PMean: Two data sets illustrating the analysis of continuous variables

This page has moved to a new website.