I’m somewhat new to geocoding. One of the first things you might be interested in, if you have geographic data, is an indicator of whether a certain address, zip code, or county is urban or rural. This is actually quite a complex topic. This paper outlines some of the basic systems for classifying a location as urban, rural, or something in between (e.g., suburban). Continue reading
This is a nice compilation of issues that you should be concerned about. The examples are mostly from things that interest Google, but you will find this advice useful no matter what type of data you work with. The advice is split into three broad categories: technical (e.g., look at your distributions), process (e.g., separate validation, description, and evaluation), and communication (e.g., data analysis starts with questions, not data or a technique). Continue reading
I’m an experienced R programmer trying to learn a little about SQL. One of my good friends who lives totally in the database world (I call her the Teradata Queen), shared a link to a blog post at SQLServerCentral about using R. Microsoft is including R in its SQL Server distribution, so this is an opportunity for a lot of interesting work combining the data manipulation power of SQL Server with the data analysis power of R. Anyway, the blog post explains some of the cost and performance issues associated with R scripts running on a SQL Server CPU. Continue reading
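To make that combination concrete, here is a minimal sketch of what calling R from SQL Server 2016 or later looks like, using the sp_execute_external_script stored procedure. The table and column names here (dbo.Sales, Amount) are hypothetical placeholders, not from the blog post:

```sql
-- Run an R script against the results of a T-SQL query.
-- SQL Server passes the query results to R as InputDataSet;
-- whatever R assigns to OutputDataSet comes back as a result set.
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- data.frame(mean_amount = mean(InputDataSet$Amount))',
    @input_data_1 = N'SELECT Amount FROM dbo.Sales';
```

Because the R script runs on the database server itself, this is exactly the setting where the cost and performance issues the blog post discusses come into play.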
When evaluating a series of research articles, you often have to assess the quality of the individual papers based on, for example, the type of blinding. What do you do if the paper does not discuss these items? I have usually advocated a “no news is bad news” policy. If a paper does not mention blinding, assume that no blinding was done. That seems reasonable, but the paper by Mhaskar et al. provides empirical evidence that sometimes authors leave out information that would strengthen the credibility of their study. A similar paper is at https://www.ncbi.nlm.nih.gov/pubmed/22424985 Continue reading
This article is a synthesis of a panel discussion at the 2014 Joint Statistical Meetings on the flipped classroom. The article discusses it solely from the perspective of Biostatistics classes, though the authors offer some references for the flipped classroom in a more general setting. A flipped classroom is a course where the traditional didactic lectures are recorded and watched at home, and the homework that would normally be done at home is done instead in the classroom. This homework in a Biostatistics class often takes the form of active learning in small groups, such as critiquing published research studies or conducting analyses on real-world data sets. The key component, according to the authors, is the in-class interaction during these assignments. Students learn from each other as they work in groups.
Now you could do active learning in a traditional course format. What a flipped classroom does is increase the emphasis on, and the amount of time spent in, active learning.
The common theme of the paper is that the flipped classroom has been successfully applied in a variety of settings. It is not a “one size fits all” approach, but rather can be adapted to the needs of the particular class. Some students may not like the flipped classroom format, and you shouldn’t underestimate the amount of time needed to prepare the videotaped lectures (one rule of thumb is ten hours of work for every hour of video). Still, the student reactions and the instructors’ perceptions of the flipped classroom are generally positive. Continue reading
I have some guidance on how to organize data (written back in 1999), but these guidelines are far superior. To be honest, you should use a database for anything more than half complicated, but for those simple data sets where you can use a text file or a spreadsheet, Dr. Broman’s comments are very helpful. Continue reading
This interview with Nate Silver was conducted shortly after his keynote address at the 2013 Joint Statistical Meetings. I was at those meetings, but was stuck in a class (a very good class by the way, but I still felt stuck) on software engineering for statisticians. This article summarizes the main points of Mr. Silver’s keynote address and adds some extra insights through an interview after the speech. The best part was the quote at the end.
When asked, “Data science is the term of the day. Do you think there is a difference between data science and statistics?” Silver replied, “I think data-scientist is a sexed up term for a statistician,” and the reaction from the audience was, for most, one of instantaneous laughter and applause. He continued, “Statistics is a branch of science. Data scientist is slightly redundant in some way and people shouldn’t berate the term statistician.”
If Nate Silver can say something this controversial, then maybe I shouldn’t be so bashful. Continue reading
I’m an old dog R programmer who tends to rely on features of R that were available 10 years ago (an eternity for computers). But it’s time for this old dog to learn new tricks. One thing I need to use in my R programs is called a “tibble” (sometimes called a “tidy tibble”). It’s a minor but important improvement on data frames, and many of the newer packages are using tibbles instead of data frames. Tibbles are available in the tibble package. This web page offers a nice description of the improvements that tibbles offer. Continue reading
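As a minimal illustration of those improvements (assuming the tibble package is installed), here is how a tibble behaves differently from a plain data frame:

```r
library(tibble)

# A tibble prints only the first ten rows and shows each column's type,
# so printing a large data set doesn't flood the console
tb <- tibble(x = 1:1000, y = letters[(1:1000 %% 26) + 1])
print(tb)

# Unlike old-style data frames, tibbles never silently convert
# strings to factors, and subsetting with [ always returns
# another tibble rather than dropping down to a vector
class(tb$y)        # character, not factor
class(tb[, "x"])   # still a tibble
```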
One of the recent developments in R that I was unaware of until I attended some talks at the Joint Statistical Meetings was the use of dplyr and pipes. It’s an approach to data management that isn’t fundamentally different from earlier approaches, but the code is much easier to read and maintain. This blog post explains in simple terms how these work and why you would use them. Continue reading
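A small sketch of the idea (assuming the dplyr package is installed), using the built-in mtcars data set: the pipe passes the result of each step into the next, so a multi-step data management task reads top to bottom instead of inside out.

```r
library(dplyr)

# Keep the four-cylinder cars, group them by number of gears,
# and compute the average fuel efficiency within each group
mtcars %>%
  filter(cyl == 4) %>%
  group_by(gear) %>%
  summarize(mean_mpg = mean(mpg))
```

Without the pipe, the same computation would be written as nested function calls, summarize(group_by(filter(...)...)...), which is much harder to read.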
Hadley Wickham has written many popular R packages, so many that they are sometimes referred to as the “Hadleyverse.” This is a nice biography that emphasizes the impact that Dr. Wickham has had on R. Continue reading