At first glance, you might think that this article looks like a vindication of traditional statistics. Classical time series models (methods that were available in the 1960s) outperform newer machine learning forecasting models. Then, you might worry that the comparisons were unfair. But neither viewpoint is accurate. The classical time series models have certain structural advantages for certain types of problems, but you might be better off with machine learning if you use classical time series as a preprocessing step, such as de-seasonalizing your data. If nothing else, this article provides a nice overview of some of the major machine learning methods. Continue reading
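To make the idea of de-seasonalizing as a preprocessing step concrete, here is a minimal Python sketch. It is not taken from the article: it assumes a purely additive seasonal pattern with no trend, and the function names (`seasonal_offsets`, `deseasonalize`, `reseasonalize`) and the toy quarterly data are made up for illustration.

```python
from statistics import mean


def seasonal_offsets(series, period):
    """Average of each position in the cycle, minus the overall average."""
    overall = mean(series)
    return [mean(series[i::period]) - overall for i in range(period)]


def deseasonalize(series, period):
    """Subtract the estimated seasonal offset from every observation."""
    offsets = seasonal_offsets(series, period)
    adjusted = [y - offsets[i % period] for i, y in enumerate(series)]
    return adjusted, offsets


def reseasonalize(values, offsets, start_index):
    """Add the seasonal offsets back onto forecasts of the adjusted series."""
    period = len(offsets)
    return [v + offsets[(start_index + i) % period] for i, v in enumerate(values)]


# Toy quarterly data (an assumption for illustration): level 100 plus a fixed pattern.
pattern = [10.0, -5.0, -10.0, 5.0]
series = [100.0 + pattern[i % 4] for i in range(12)]

adjusted, offsets = deseasonalize(series, period=4)

# A machine learning model would be trained on `adjusted`; here a flat forecast
# of 100 stands in for its output, and the seasonal pattern is added back.
forecast = reseasonalize([100.0, 100.0], offsets, start_index=len(series))
```

In practice the seasonal estimate would usually come from a proper decomposition that also handles trend, but the division of labor is the same: the classical method removes the seasonal structure, and the machine learning model only has to learn what is left.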

# Tag Archives: Big data

# PMean: Two articles published in the Encyclopedia of Big Data

I don’t call myself a “big data” analyst, but when a call went out seeking authors for various topics for the Encyclopedia of Big Data, I volunteered to write two articles. Here are the details. Continue reading

# Recommended: Welcome to developerWorks

I got this recommendation from a friend. IBM has a large number of free resources explaining things like cloud computing and blockchain. I’m most interested in their section on analytics. There’s a nice introduction, for example, to natural language processing. Continue reading

# PMean: My work on a CTSA grant

I’m on a Clinical and Translational Science Award (CTSA) research grant (5UL1TR000001-05, formerly 1U54RR031295-01A1), which is pretty cool. My name is even mentioned a few times in the grant. I thought that as I plan what I would do for this grant, I would see what the grant promised and write down what, exactly, those promises mean. As I talk with various people (especially Russ Waitman, who is supervising my work on this grant), I will revise and update my plans. Still, I thought it would be valuable to put some thoughts down now, both to help me focus on what I should be doing and to offer an early draft of those ideas to the various people that I will end up interacting with. Continue reading

# Recommended: The Origins of ‘Big Data’

I’m not a big fan of the term “big data” but I’ve been applying for a couple of jobs that ask for expertise in big data instead of expertise in Statistics. So in one of the cover letters, I wrote that I was doing big data analysis before the term was even coined. That forced me to do a quick fact check, and it looks like the term first came into wide use in the late 1990s. Here’s an article on the person who first coined the term “big data.” Continue reading

# Recommended: Can A.I. be taught to explain itself

This is a nice article in the popular press that talks about some of the problems with “black box” models (in particular deep neural nets) used extensively in many big data projects. It is a bit shy on technical details, which is understandable for a paper like the New York Times. Even so, the stories are quite intriguing. This is a wake-up call for those people who fail to recognize the serious problems with many big data models. Continue reading

# Recommended: beanumber repository

This is the GitHub repository of Ben Baumer. He is one of the co-authors of “Modern Data Science with R” and the data and code from that book are available here. He also provides code and data for OpenWAR, an open source method for calculating a baseball statistic, Wins Above Replacement. Finally, there is an R library for extracting, transforming, and loading “medium” sized datasets into SQL. Medium here means multi-gigabyte sized files. Related to this are a couple of “medium” sized data sets from the Internet Movie Database and from the NYC CitiBike dataset. Continue reading

# Recommended: Teaching precursors to data science in introductory and second courses in statistics

This paper talks about how to get students to think about large databases in an introductory class that normally uses “toy” problems with a few dozen rows of data. Continue reading

# Recommended: This is your machine learning system?

This xkcd cartoon by Randall Munroe is open source, so I can hotlink the image directly. But if you go to the source, https://xkcd.com/1838/, be sure to hover over the image for a second punch line.

# Recommended: ROSE: A package for binary imbalanced learning

Logistic regression and other statistical methods for predicting a binary outcome run into problems when the outcome being tested is very rare, even in data sets big enough to ensure that the rare outcome occurs hundreds or thousands of times. The problem is that attempts to optimize the model across all of the data will end up looking predominantly at optimizing the negative cases, and could easily ignore and misclassify all or almost all of the positive cases since they constitute such a small percentage of the data. The ROSE package generates artificial balanced samples to allow for better estimation and better evaluation of the accuracy of the model. Continue reading
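To give a feel for what generating artificial balanced samples means, here is a rough Python sketch of a smoothed bootstrap. ROSE itself is an R package, and this is not its code or its API: the function `balanced_smoothed_sample`, the fixed `noise_sd` (a crude stand-in for a proper kernel bandwidth), and the toy data are all assumptions made for illustration.

```python
import random


def balanced_smoothed_sample(rows, labels, n, noise_sd=0.1, seed=0):
    """Draw n artificial rows, with each class chosen with probability 1/2.

    Each new row is a randomly chosen real row from the selected class
    with a small amount of Gaussian noise added to every feature.
    """
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for x, y in zip(rows, labels):
        by_class[y].append(x)

    new_rows, new_labels = [], []
    for _ in range(n):
        y = rng.choice([0, 1])           # pick the class first, ignoring its rarity
        base = rng.choice(by_class[y])   # then pick a real observation from that class
        new_rows.append([v + rng.gauss(0.0, noise_sd) for v in base])
        new_labels.append(y)
    return new_rows, new_labels


# Toy data (an assumption for illustration): 990 negatives and 10 positives.
data_rng = random.Random(1)
rows = [[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(990)]
rows += [[data_rng.gauss(2.0, 1.0), data_rng.gauss(2.0, 1.0)] for _ in range(10)]
labels = [0] * 990 + [1] * 10

new_rows, new_labels = balanced_smoothed_sample(rows, labels, n=2000)
positive_share = sum(new_labels) / len(new_labels)
```

The original data are 1% positive, while the artificial sample is roughly half positive, so a model fit to it can no longer do well just by predicting "negative" for everything.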