Tag Archives: Data management

Recommended: EuSpRIG horror stories

There has been a lot written about the data management problems that come with using spreadsheets, and there is a group, the European Spreadsheet Risks Interest Group (EuSpRIG), that has documented the problem carefully and thoroughly. This page on their website lists the big, big, big problems that have occurred because of spreadsheet mistakes. Any program is capable of producing mistakes, of course, but spreadsheets are particularly prone to errors for a variety of reasons that this group documents.

Recommended: The Reinhart-Rogoff error – or how not to Excel at economics

There has been a lot written about how lousy Microsoft Excel (and other spreadsheet products) are at data management, but the warning sinks in so much more effectively when you can cite an example where the use of Excel led to an embarrassing retraction. Perhaps the best example is the paper by Carmen Reinhart and Kenneth Rogoff, where a major conclusion was invalidated because a formula inside their Excel spreadsheet accidentally included only 15 of the relevant 20 countries. Here's a nice description of that event and some suggestions on how to avoid this in the future.

Recommended: OpenRefine: A free, open source, powerful tool for working with messy data

I have not had a chance to use this, but it comes highly recommended. OpenRefine is a program that uses a graphical user interface to clean up messy data, but it saves all the clean-up steps to ensure that your work is well documented and reproducible. I listed Martin Magdinier as the "author" in the citation below because he has posted most of the blog entries about OpenRefine, but there are many contributors to this package and website.

Recommended: Good enough practices in scientific computing

There is more than one way to approach a data analysis, and some of those ways make modifications and updates easier and help make your work more reproducible. This paper describes the steps the authors recommend based on years of teaching Software Carpentry and Data Carpentry classes. One of the software products mentioned in this article, OpenRefine, looks like a very interesting way to clean up messy data in a way that leaves a well-documented trail.

PMean: Merging in dplyr is a lot faster

At the Joint Statistical Meetings, I found out that the advantages of some of the new libraries for data manipulation (like dplyr and tidyr) go beyond the flexibility of the new methods themselves. These libraries produce code that is easier to read and that also runs a lot faster. I did not appreciate how much faster until I tried a test today.
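A timing comparison along these lines can be sketched in a few lines of R. This is not the author's actual test; the data sizes, column names, and use of system.time are illustrative assumptions, and it requires the dplyr package to be installed.

```r
# Illustrative sketch: timing base R merge() against dplyr's left_join()
# on two data frames sharing an "id" key. Sizes are arbitrary.
library(dplyr)

n <- 1e5
x <- data.frame(id = sample(n), a = rnorm(n))
y <- data.frame(id = sample(n), b = rnorm(n))

# Elapsed wall-clock time for each approach
t_base  <- system.time(m1 <- merge(x, y, by = "id"))["elapsed"]
t_dplyr <- system.time(m2 <- left_join(x, y, by = "id"))["elapsed"]

# Both joins should return one row per matched id
c(base = t_base, dplyr = t_dplyr)
```

On data of any real size, the dplyr timing is typically the smaller of the two, which is the kind of difference the test above revealed.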