How do you teach data science? That’s not an easy question, because data science means different things to different people. This site shows different curricula depending on what you want your program to emphasize. Continue reading
File this under the “dark side” of data science. Alfie Kohn is a critic of many of the motivational methods used in business and education, and he makes many good points in this blog post about relying on readily available data without questioning its quality. Continue reading
A rather harsh and cynical take on data science, but still worth reading. Let me share a story about this. Back in my college days (that would be the 1970s), someone found a New Yorker cartoon and shared it with me. It showed a politician, obviously a very powerful politician because his office had a view of the Washington Monument. He was speaking to his aide: “That’s the gist of what I want to say. Now go and find me some statistics to base it on.” So the issues that this person brings up are no different than those from four decades ago. There’s no easy solution to the problem. You can’t say, “I’ll only work with people who have a commitment to the truth, no matter where it might lead” because even people without strong overt biases still have subtle biases that can profoundly skew the results. Requiring a priori specifications and reserving a holdout sample for a final quality check can help, but mostly it comes down to being careful, detail-oriented, and transparent in all your work. Continue reading
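For readers who have not set aside a holdout sample before, here is a minimal sketch in Python of what that step might look like. It is only an illustration: the function name `split_holdout`, the 20% fraction, and the seed value are my own assumptions, not anything taken from the article. The idea is to fix the split before any analysis begins and to leave the holdout untouched until the final check.

```python
import random


def split_holdout(records, holdout_fraction=0.2, seed=2019):
    """Split records into a working sample and a holdout sample.

    The function name, the 20% default, and the seed are illustrative
    assumptions. A fixed seed makes the split reproducible, so the
    holdout cannot be quietly redrawn after the results are seen.
    """
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(records)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded, reproducible shuffle
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    holdout = shuffled[:n_holdout]         # reserved for the final quality check
    working = shuffled[n_holdout:]         # used for all exploratory modeling
    return working, holdout


# Hypothetical data: 100 record identifiers.
records = list(range(100))
working, holdout = split_holdout(records)
print(len(working), len(holdout))
```

The split is only half of the discipline. The other half is writing down the planned analysis before looking at the working sample, and then running that analysis on the holdout exactly once.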
This is one of those articles where you have to restrain yourself. Its message, that good old statistical tools like logistic regression can perform as well as these newfangled machine learning approaches that you haven’t taken the time to learn, is quite tempting. But I’d be cautious here. Maybe logistic regression is still competitive, but maybe the systematic overview got a bunch of biased studies. It’s worthwhile to cite this whenever someone makes an overly strong claim about machine learning models, but don’t use this as an excuse to keep from learning the new stuff yourself. This article is stuck behind a paywall. Sorry! Continue reading
Purchased from CartoonStock.com for this blog site only. Do not reproduce this cartoon without their permission.
“In a world where the price of calculation continues to decrease rapidly, but the price of theorem proving continues to hold steady or increase, elementary economics indicates that we ought to spend a larger and larger fraction of our time on calculation.” John Tukey, as quoted in “Sunset Salvo”, The American Statistician 1986; 40(1): 72-76.
Building a great statistical model does no one any good if it doesn’t pay attention to non-statistical issues. This story talks about a machine learning model to identify which houses in Flint, Michigan, were the best candidates for removal of lead pipes. The model worked fairly well, but came up against problems like individual city council members wanting to assure their constituents that enough was being done in their district. I’m not sure what the actual moral of this story is, but it does serve as a warning to be careful when you are modeling data in a contentious area. Continue reading
I’m planning to give a talk on “The Dark Side of Data Science” and I’m hoping to get some interesting references and articles from my colleagues. Here is a first draft of my abstract, with a few references that I am already familiar with. Continue reading
When you are working with text mining, you might want to reduce the dimensionality of your problem. The word2vec algorithm, developed by Tomas Mikolov and others at Google, offers a nice approach. This page shows how to apply this algorithm within R. Continue reading
This is a MOOC (Massive Open Online Course) covering deep learning models. I have not taken it, but it comes highly recommended by others. It uses Python as the underlying language. Continue reading