Tag Archives: Datasets

Recommended: Medicare Claims Synthetic Public Use Files (SynPUFs)

This page is moving to a new website.

The Centers for Medicare & Medicaid Services (CMS) provides researchers with access to Medicare claims data, which is a wonderful resource. But you have to sign a restrictive agreement before they will give you this data, and you have to pay a non-trivial amount of money to get it. Fair enough, because CMS has to guarantee patient confidentiality, among other things. But what if you want to “play” with the data before taking the plunge? Thankfully, CMS has provided the general public with a synthetic (read: fake) data set that has the same data structure. This allows you to prototype your programs on the synthetic data and then transition easily to the real data.

Recommended: Institute for Digital Research and Education — Statistical Computing

This is a wonderful site, but for some reason, it is difficult to find. The Institute for Digital Research and Education (IDRE) at UCLA has put together some excellent resources on how to do simple data analyses in R, SAS, SPSS, and Stata. The examples cover just about everything you’d ever want to do in any of these statistical packages. If you are making a transition from one statistical package to another, this site lets you see how things are done in the package you know well and compare that with how they are done in the package you are learning. Of special note are the worked examples from many classic statistics textbooks.

Recommended: ENCODE: Encyclopedia of DNA Elements

The genetics research community should be lauded for the openness with which they share research data. You can find numerous data sources that are free and without ANY restrictions. One very good example is ENCODE, the Encyclopedia of DNA Elements. This repository holds mostly human data, but some mouse, fruit fly, and roundworm data as well. It has data from many different assays, including ChIP-seq, RNA-seq, and DNase-seq. It looks like a great teaching resource, though it does require a fairly hefty understanding of genetics to browse through the data.

PMean: Simple longitudinal data sets to illustrate data management

I am working on a class that will teach basic data management and graphics using the R programming language, with parallel classes in SPSS and SAS. On the third or fourth day of the class, we will look at managing longitudinal data sets, as these require special skills. I wanted to find a couple of reasonably simple longitudinal data sets that were available on the web and that had at least a few missing values in them. Here are a couple of data sets that might work.