Networking is important, and until recently I have failed to build bridges with some of the very smart people working at the University of Kansas in Lawrence. But I will be giving a colloquium talk to a group (Center for Research Methods and Data Analysis) at KU in January. It may turn out to be for a different but closely related group, but that doesn’t matter. It’s an excuse to get out of the office and meet people. Here’s the tentative title and abstract for my talk and a brief review of some other talks I’ll be giving.

# Monthly Archives: November 2017

# Recommended: Can A.I. be taught to explain itself

This is a nice article in the popular press that talks about some of the problems with “black box” models (in particular deep neural nets) used extensively in many big data projects. It is a bit shy on technical details, which is understandable for a paper like the New York Times. Even so, the stories are quite intriguing. This is a wake-up call for those people who fail to recognize the serious problems with many big data models.

# PMean: Resources for R

I attended several talks about R at the Joint Statistical Meetings and noted some interesting packages and other resources during these talks. I lost track of that list until recently, but the items on it are still relevant, so here they are.

# PMean: Losing track of your transformed variables in R

I got an interesting question from one of my students, and it illustrates a subtle issue that may confuse beginning R programmers. The student was trying to compute a ratio of brain weight to body weight in a small data set, but then was unable to calculate any summary statistics on that ratio. Here’s what caused the problem.

# Recommended: Databases using R

This is a page outlining several related efforts at RStudio to make it easier for you to work with data stored in various relational databases.

# Recommended: Intro to SQL for Data Science

This is a series of videos and homework exercises that you can work on at your own pace. I have only viewed the outline for this, but anything from DataCamp comes highly recommended.

# Recommended: beanumber repository

This is the GitHub repository of Ben Baumer. He is one of the co-authors of “Modern Data Science with R” and the data and code from that book are available here. He also provides code and data for OpenWAR, an open source method for calculating a baseball statistic, Wins Above Replacement. Finally, there is an R library for extracting, transforming, and loading “medium” sized datasets into SQL. Medium here means multi-gigabyte sized files. Related to this are a couple of “medium” sized data sets from the Internet Movie Database and from the NYC CitiBike dataset.

# Recommended: Teaching precursors to data science in introductory and second courses in statistics

This paper talks about how to get students to think about large databases in an introductory class that normally uses “toy” problems with a few dozen rows of data.

# Recommended: Writing about numbers

This is a chapter in a classic book, Medical Uses of Statistics. The writer of this particular chapter was a giant in Statistics, Frederick Mosteller. This chapter talks about some of the style issues associated with the data that you would normally present in the results section of a research paper. The advice is a bit dated, perhaps, but still well worth reading.