This is a page outlining several related efforts at RStudio to make it easier for you to work with data stored in various relational databases. Continue reading
This is a series of videos and homework exercises that you can work on at your own pace. I have only viewed the outline for this, but anything from DataCamp comes highly recommended. Continue reading
This is the GitHub repository of Ben Baumer. He is one of the co-authors of “Modern Data Science with R” and the data and code from that book are available here. He also provides code and data for OpenWAR, an open source method for calculating a baseball statistic, Wins Above Replacement. Finally, there is an R library for extracting, transforming, and loading “medium” sized datasets into SQL. Medium here means multi-gigabyte sized files. Related to this are a couple of “medium” sized data sets from the Internet Movie Database and from the NYC CitiBike dataset. Continue reading
This paper talks about how to get students to think about large databases in an introductory class that normally uses “toy” problems with a few dozen rows of data. Continue reading
This is a chapter in a classic book, Medical Uses of Statistics. The writer of this particular chapter was a giant in Statistics, Frederick Mosteller. This chapter talks about some of the style issues associated with the data that you would normally present in your results section of your research paper. The advice is a bit dated, perhaps, but still well worth reading. Continue reading
This is one of the best articles I have ever read in the popular press about the complexities of the research process.
This article by Susan Dominus covers some high profile research by Amy Cuddy. She and two co-authors found that your body language not only influences how others view you, but also influences how you view yourself. Striking a “power pose”, meaning something like “legs astride or feet up on a desk”, can improve your sense of power and control, and these subjective feelings are matched by physiological changes. Your testosterone goes up and your cortisol goes down. Both of these, apparently, are good things.
The research team publishes these findings in Psychological Science, a prominent journal in this field. The article receives a lot of press coverage. Dr. Cuddy becomes the public face of this research, most notably by garnering an invitation to give a TED talk, where she does a bang-up job. Her talk becomes the second most viewed TED talk of all time.
But there’s a problem. The results of the Psychological Science publication do not get replicated. One of the other two authors expresses doubt about the original research findings. Another research team reviews the data analysis and labels the work “p-hacking”.
It turns out that there is a movement in the research world to critically examine existing research findings and to see if the data truly supports the conclusions that have been made. Are the people leading this movement noble warriors for truth or are they shameless bullies who tear down peer-reviewed research in non-peer-reviewed blogs?
I vote for “noble warriors” but read the article and decide for yourself what you think. It’s a complicated area and every perspective has more than one side to it.
One of the noble warriors/shameless bullies is Andrew Gelman, a popular statistician and social scientist. He comments extensively about the New York Times article on his blog. His post is worth reading, as are the many comments that others have made on it. It’s also worth digging up some of his earlier commentary about Dr. Cuddy. Continue reading
I ran across a nice discussion of how to write the results section of a research paper, but it has one comment about the phrase “trend towards significance” that I had to disagree with. So I wrote a comment that they may or may not end up publishing (note: it did look like they published my comment, but it’s a bit tricky to find).
Here’s what I submitted. Continue reading
I shouldn’t do this, because we’ve all made mistakes, especially me. But I took a peek at a website with the intriguing title “100+ commonly asked data science interview questions” with the thought “Maybe I could be a data scientist”. But the author of this list choked on the very first question. It’s interesting to examine why the question is bad. Continue reading
The authors looked at all systematic reviews (excluding methodological reviews) published in a few key journals as well as a random sample of Cochrane reviews to see how often the authors tried to search for unpublished data. The answer is not often enough (64% or 130/203). The article also describes the success rate in getting unpublished data when the attempt was made (89% or 116/130) and how often authors found evidence of publication bias when they did such an assessment (40% or 27/68). Although some people have argued that it is not that important to search for unpublished data, this is still a big concern. A closely related article is Searching for unpublished data for Cochrane reviews: cross sectional study. Continue reading
This is a new effort to get data out into the open for others to use. A data note can be on data that was not published or it could be an addendum describing data used in another publication. This is just getting started, but could end up being a great teaching resource. Continue reading