Category Archives: Statistics

PMean: What do you hate most about independent consulting

Someone on the Statistical Consulting forum mentioned that she is going to become an independent consultant when she graduates and wanted to find out from people currently in that position what one thing they hate most. Her email drew a lot of responses, including several from people who cautioned her about the difficulties a young person faces in becoming an independent consultant. Here are the thoughts I shared on the thing I hate most and on the issues with striking out on your own as an independent consultant early in your career.

Recommended: Oracle Dates and Times

I’m working with R and SQL; some of the work uses SQLite and some uses Oracle. There are subtle differences between the two and, for that matter, between any two database programs. While there are SQL standards, most database systems have minor deviations or enhancements. Dates in Oracle represent one deviation. In particular, Oracle does not use the ISO 8601 standard date format (yyyy-mm-dd) by default. Here’s a nice overview of how to work with Oracle dates.

PMean: By the skin of my teeth

I have to brag a bit. I’m working part-time at Kansas University Medical Center (along with a couple of other part-time jobs) and my boss asked me two weeks ago if I was interested in writing a paper on the data analyses I had been working on. It would be submitted to the AMIA 2017 Joint Summit on Translational Research and I’d be the first author.

PMean: Turning off large blocks of an R Markdown document

When you’re running a large and complicated program using R Markdown, you can use the cache chunk option to save a lot of time. With caching turned on, knitr notices when a program chunk has stayed the same and avoids running it again. I tend to avoid the cache option, though, because sometimes it fails to execute something that you want executed, even though it looks on the surface like nothing has changed. So I created some simple program chunks that allow me to explicitly turn off parts of the R Markdown program that I don’t need to evaluate at the time. Think of it as a manual cache.
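
Here is a minimal sketch of what such a switch could look like. It is an assumption about the general approach, not the chunks from the full post: it relies on the fact that the knitr `eval` chunk option accepts an R expression, and the names `run_slow`, `slow_result`, and `slow_result.rds` are made up for illustration.

````markdown
```{r switches}
# Hypothetical switch: set to TRUE to run the slow section again.
run_slow <- FALSE
```

```{r slow-step, eval=run_slow}
# Skipped entirely while run_slow is FALSE.
Sys.sleep(60)                              # stand-in for a long computation
slow_result <- 42
saveRDS(slow_result, "slow_result.rds")    # keep a copy for later runs
```

```{r use-result}
# Reuse the copy saved the last time the slow chunk ran.
slow_result <- readRDS("slow_result.rds")
```
````

The slow chunk has to run at least once with `run_slow <- TRUE` so that the saved file exists. After that, flipping the switch to `FALSE` skips the slow work on every knit until you decide to turn it back on.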

It’s a very simple thing, but one that confounded me for a while, so I am writing about it here. That way I won’t forget six months down the road.

PMean: Merging in dplyr is a lot faster

At the Joint Statistical Meetings, I found out that the advantages of some of the new libraries for data manipulation (like dplyr and tidyr) go beyond just the flexibility of the new methods of data manipulation. These libraries produce code that is easier to read and that also runs a lot faster. I did not appreciate how much faster until I tried a test today.

Recommended: The FAIR Guiding Principles for scientific data management and stewardship

I’ve always been supportive of efforts to share data. For me, it’s a bit selfish, because I want to find interesting real-world examples to use in teaching and on my web pages. But the issue goes way beyond this, of course. Sharing data is an ethical imperative, especially for federally funded research or research that relies on volunteer subjects. It has led to many important discoveries beyond the original context in which the data was collected. For data sharing to be effective, you need to embrace four guiding principles: your data needs to be findable, accessible, interoperable, and reusable. This paper highlights those principles and offers some current examples of data sharing systems.
