This page outlines several related efforts at RStudio to make it easier for you to work with data stored in various relational databases.
This is a series of videos and homework exercises that you can work on at your own pace. I have only viewed the outline for this, but anything from DataCamp comes highly recommended.
This is the GitHub repository of Ben Baumer. He is one of the co-authors of “Modern Data Science with R”, and the data and code from that book are available here. He also provides code and data for OpenWAR, an open source method for calculating a baseball statistic, Wins Above Replacement. Finally, there is an R library for extracting, transforming, and loading “medium” sized datasets into SQL. Medium here means multi-gigabyte sized files. Related to this are a couple of “medium” sized data sets from the Internet Movie Database and from the NYC CitiBike dataset.
This paper talks about how to get students to think about large databases in an introductory class that normally uses “toy” problems with a few dozen rows of data.
This is one of the best articles I have ever read in the popular press about the complexities of the research process.
This article by Susan Dominus covers some high profile research by Amy Cuddy. She and two co-authors found that your body language not only influences how others view you, but also influences how you view yourself. Striking a “power pose”, meaning something like “legs astride or feet up on a desk”, can improve your sense of power and control, and these subjective feelings are matched by physiological changes. Your testosterone goes up and your cortisol goes down. Both of these, apparently, are good things.
The research team publishes these findings in Psychological Science, a prominent journal in this field. The article receives a lot of press coverage. Dr. Cuddy becomes the public face of this research, most notably by garnering an invitation to give a TED talk, where she does a bang-up job. Her talk becomes the second most viewed TED talk of all time.
But there’s a problem. The results of the Psychological Science publication do not get replicated. One of the other two authors expresses doubt about the original research findings. Another research team reviews the data analysis and labels the work “p-hacking”.
It turns out that there is a movement in the research world to critically examine existing research findings and to see if the data truly supports the conclusions that have been made. Are the people leading this movement noble warriors for truth, or are they shameless bullies who tear down peer-reviewed research in non-peer-reviewed blogs?
I vote for “noble warriors” but read the article and decide for yourself what you think. It’s a complicated area and every perspective has more than one side to it.
One of the noble warriors/shameless bullies is Andrew Gelman, a popular statistician and social scientist. He comments extensively about the New York Times article on his blog, which is also worth reading, as are the many comments that others have made on his blog post. It’s also worth digging up some of his earlier commentary about Dr. Cuddy.
The authors looked at all systematic reviews (excluding methodological reviews) published in a few key journals as well as a random sample of Cochrane reviews to see how often the authors tried to search for unpublished data. The answer is not often enough (64% or 130/203). The article also describes the success rate in getting unpublished data when the attempt was made (89% or 116/130) and how often authors found evidence of publication bias when they did such an assessment (40% or 27/68). Although some people have argued that it is not that important to search for unpublished data, this is still a big concern. A closely related article is Searching for unpublished data for Cochrane reviews: cross sectional study.
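If you want to check the percentages quoted in this summary, the arithmetic is simple. Here is a minimal sketch in Python that recomputes each one from the reported counts. The labels and the `percent` helper are my own, not anything from the paper.

```python
# Counts as quoted in the summary: (numerator, denominator).
# The labels are paraphrases, not wording taken from the paper.
reported = {
    "searched for unpublished data": (130, 203),
    "obtained unpublished data when they tried": (116, 130),
    "found publication bias when they assessed it": (27, 68),
}


def percent(numerator, denominator):
    """Return the proportion as a percentage rounded to a whole number."""
    return round(100 * numerator / denominator)


percentages = {label: percent(n, d) for label, (n, d) in reported.items()}

for label, value in percentages.items():
    print(f"{label}: {value}%")
```

Each rounded value agrees with the figure quoted in the summary (64%, 89%, and 40%).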
This is a new effort to get data out into the open for others to use. A data note can be on data that was not published or it could be an addendum describing data used in another publication. This is just getting started, but could end up being a great teaching resource.
I have not had a chance to use this, but it comes highly recommended. OpenRefine is a program that uses a graphical user interface to clean up messy data, but it saves all the cleanup steps to ensure that your work is well documented and reproducible. I listed Martin Magdinier as the “author” in the citation below because he has posted most of the blog entries about OpenRefine, but there are many contributors to this package and website.
This is the first in a series of articles on reducing waste in research. It focuses on funding agencies and recommends that funders should support more work on making research replicable, be more transparent about how they set priorities, make sure that research proposals are justified through a systematic review of previous research, and encourage greater openness of research in progress to encourage collaboration. Other articles in this series cover research design, conduct, and analysis; regulation and management; inaccessible research; and incomplete reports of research.
While researchers often use data from health insurance systems to conduct observational studies, the authors of this research paper point out that you can also conduct randomized trials. You can randomly assign different levels of insurance coverage and then get claims data to evaluate how much difference there is, if any, in the levels of coverage. This approach is attractive because you do not need a lot of resources, and you can very quickly get a very large sample size. Since insurance data is collected for administrative needs rather than research needs, you have to contend with inaccurate or incomplete data, potentially causing loss of statistical efficiency or producing biased results. The authors offer some interesting examples of actual studies, propose new potential studies, and offer general guidance on how to conduct a randomized trial from health insurance systems.