Tag Archives: Data management

Recommended: Good enough practices in scientific computing

There is more than one way to approach a data analysis, and some approaches make your work easier to modify, update, and reproduce. This paper describes practices that the authors recommend based on years of teaching Software Carpentry and Data Carpentry classes. One of the software products mentioned in the paper, OpenRefine, looks like a very interesting way to clean up messy data while leaving a well-documented trail.

PMean: Merging in dplyr is a lot faster

At the Joint Statistical Meetings, I found out that the advantages of some of the new R libraries for data manipulation (like dplyr and tidyr) go beyond the flexibility of their methods. These libraries produce code that is easier to read and that also runs a lot faster. I did not appreciate how much faster until I tried a test today.

PMean: What should go into a data codebook

Before you start your data entry, you should create a data codebook. If you don’t have a data codebook when you hand your data over to someone else, take the time to create one for their benefit and yours. The data codebook contains a description of your data set. There’s no standard form for a data codebook, and what you include may depend on a variety of factors, such as the complexity of your data set, the number of people involved in data collection and data entry, and the number of people that you are likely to share your data with. Here are some of the elements that you should think about putting in a data codebook.
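Since there is no standard form, the following is only a rough sketch of what a machine-readable codebook might look like. Everything in it is an assumption made for illustration: the `CodebookEntry` fields, the example variables (`id`, `sex`, `weight`), their codes, and the missing-value codes are hypothetical and do not come from any particular data set.

```python
from dataclasses import dataclass, field


@dataclass
class CodebookEntry:
    """One variable in a hypothetical codebook (field names are assumptions)."""
    name: str                                   # column name in the data file
    label: str                                  # human-readable description
    kind: str                                   # "numeric" or "categorical"
    units: str = ""                             # units of measurement, if any
    codes: dict = field(default_factory=dict)   # category code -> meaning
    missing: tuple = ()                         # codes that mean "missing"


# Illustrative entries only; a real codebook would describe your own variables.
codebook = [
    CodebookEntry("id", "Participant identifier", "numeric"),
    CodebookEntry("sex", "Sex of participant", "categorical",
                  codes={"1": "male", "2": "female"}, missing=("9",)),
    CodebookEntry("weight", "Body weight at enrollment", "numeric",
                  units="kg", missing=("-1",)),
]


def check_row(row, codebook):
    """Return a list of ways one data row disagrees with the codebook."""
    problems = []
    for entry in codebook:
        if entry.name not in row:
            problems.append(f"{entry.name}: variable not present")
            continue
        value = row[entry.name]
        if value in entry.missing:
            continue  # a documented missing-value code is acceptable
        if entry.kind == "categorical" and value not in entry.codes:
            problems.append(f"{entry.name}: undocumented code {value!r}")
        elif entry.kind == "numeric":
            try:
                float(value)
            except ValueError:
                problems.append(f"{entry.name}: not a number {value!r}")
    return problems


if __name__ == "__main__":
    print(check_row({"id": "2", "sex": "3", "weight": "-1"}, codebook))
```

Keeping the codebook in a structured form like this means the same description can be read by the person receiving the data and used to check the data file against it.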