Top R Questions

260
Andrie

When discussing performance with colleagues, teaching, sending a bug report or searching for guidance on mailing lists and here on SO, a reproducible example is often asked for and is always helpful.

What are your tips for creating an excellent example? How do you paste data structures from R in a text format? What other information should you include?

Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc?

How does one make a great reproducible example?

Answered By: Joris Meys ( 147)

A minimal reproducible example consists of the following items:

  • a minimal dataset, necessary to reproduce the error
  • the minimal runnable code necessary to reproduce the error, which can be run on the given dataset.
  • the necessary information on the used packages, R version and system it is run on.
  • in case of random processes, a seed (set by set.seed()) for reproducibility

Looking at the examples in the help files of the used functions is often helpful. In general, all the code given there fulfills the requirements of a minimal reproducible example: data is provided, minimal code is provided, and everything is runnable.

Producing a minimal dataset

For most cases, this can be easily done by just providing a vector / dataframe with some values. Or you can use one of the built-in datasets, which are provided with most packages.
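
For instance, iris and mtcars are datasets that ship with base R:

head(iris)    # built-in dataset of iris flower measurements
head(mtcars)  # built-in dataset of car specifications
data()        # lists the datasets available in attached packages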

Making a vector is easy. Sometimes it is necessary to add some randomness to it, and there are quite a number of functions to do that. sample() can randomize a vector, or give a random vector with only a few values. letters is a useful vector containing the alphabet; it can be used for making factors.

A few examples:

  • random values: x <- rnorm(10) for normal distribution, x <- runif(10) for uniform distribution, ...
  • a permutation of some values: x <- sample(1:10) for vector 1:10 in random order.
  • a random factor: x <- sample(letters[1:4], 20, replace = TRUE)

For matrices, one can use matrix(), e.g.:

matrix(1:10, ncol = 2)

Making dataframes can be done using data.frame(). Pay attention to naming the entries in the dataframe, and do not make it overly complicated.

An example:

Data <- data.frame(
    X = sample(1:10),
    Y = sample(c("yes", "no"), 10, replace = TRUE)
)

For some questions, specific formats can be needed. For these, one can use any of the provided as.someType functions: as.factor, as.Date, as.xts, ... Use these in combination with the vector and/or dataframe tricks.
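
For instance, a small sketch combining such a conversion with the dataframe trick above (the column names here are arbitrary):

Data <- data.frame(
    day   = as.Date("2012-01-01") + 0:9,
    group = as.factor(sample(c("treatment", "control"), 10, replace = TRUE)),
    value = rnorm(10)
)
str(Data)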

Copy your data

If you have some data that would be too difficult to construct using these tips, then you can always make a subset of your original data, using e.g. head(), subset() or the indices. Then use e.g. dput() to give us something that can be put into R immediately:

> dput(head(iris,4))
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6), Sepal.Width = c(3.5, 
3, 3.2, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5), Petal.Width = c(0.2, 
0.2, 0.2, 0.2), Species = structure(c(1L, 1L, 1L, 1L), .Label = c("setosa", 
"versicolor", "virginica"), class = "factor")), .Names = c("Sepal.Length", 
"Sepal.Width", "Petal.Length", "Petal.Width", "Species"), row.names = c(NA, 
4L), class = "data.frame")

In the worst case, you can give a text representation that can be read in using textConnection:

zz <- textConnection("Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
")
Data <- read.table(zz, header = TRUE)
close(zz)

Producing minimal code

This should be the easy part, but often isn't. What you should not do is:

  • add all kinds of data conversions. Make sure the provided data is already in the correct format (unless that is the problem, of course).
  • copy-paste a whole function / chunk of code that gives an error. First try to locate exactly which lines result in the error. More often than not, you'll find out what the problem is yourself.

What you should do is:

  • state which packages should be loaded, if you use any.
  • if you open connections or create files, add some code to close them or delete the files (using unlink()); see the short sketch after this list.
  • if you change options, make sure the code contains a statement to revert them to the original ones (e.g. op <- par(mfrow=c(1,2)) ...some code... par(op)).
  • test-run your code in a new, empty R session to make sure the code is runnable. People should be able to just copy-paste your data and your code into the console and get exactly the same results as you do.
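
A small sketch of the clean-up advice above (the temporary file and the plots are just illustrations):

tmp <- tempfile(fileext = ".csv")
write.csv(head(iris), tmp, row.names = FALSE)
con <- file(tmp, open = "r")
readLines(con, n = 2)
close(con)     # close the connection you opened
unlink(tmp)    # delete the file you created

op <- par(mfrow = c(1, 2))   # change graphical options...
plot(1:10); plot(10:1)
par(op)                      # ...and revert them afterwards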

Give extra information

In most cases, just the R version and the operating system will suffice. When conflicts arise with packages, giving the output of sessionInfo() can really help. When talking about connections to other applications (be it through ODBC or anything else), one should also provide version numbers for those, and if possible also the necessary information on the setup.
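
A minimal way to capture this (the exact output will of course vary by system):

R.version.string   # just the R version
sessionInfo()      # R version, OS, locale, and attached packages with their versions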

222
Paul McMurdie

For documenting classes with roxygen(2), specifying a title and description/details appears to be the same as for functions, methods, data, etc. However, slots and inheritance are their own sort of animal. What is the best practice -- current or planned -- for documenting S4 classes in roxygen2?

Due Diligence:

I found mention of an @slot tag in early descriptions of roxygen. A 2008 R-forge mailing list post seems to indicate that this is dead, and that there is no support for @slot in roxygen.

Is this true of roxygen2? The previously-mentioned post suggests a user should instead make their own itemized list with LaTeX markup. E.g. a new S4 class that extends the "character" class would be coded and documented like this:

#' The title for my S4 class that extends \code{"character"} class.
#'
#' Some details about this class and my plans for it in the body.
#'
#' \describe{
#'    \item{myslot1}{A logical keeping track of something.}
#'
#'    \item{myslot2}{An integer specifying something else.}
#' 
#'    \item{myslot3}{A data.frame holding some data.}
#'  }
#' @name mynewclass-class
#' @rdname mynewclass-class
#' @exportClass mynewclass
setClass("mynewclass",
    representation(myslot1="logical",
        myslot2="integer",
        myslot3="data.frame"),
    contains = "character"
)

However, although this works, this \describe, \item approach for documenting the slots seems inconsistent with the rest of roxygen(2), in that there are no @-delimited tags and slots could go undocumented with no objection from roxygenize(). It also says nothing about a consistent way to document inheritance of the class being defined. I imagine dependency still generally works fine (if a particular slot requires a non-base class from another package) using the @import tag.

So, to summarize, what is the current best-practice for roxygen(2) slots?

There seem to be three options to consider at the moment:

  • A -- Itemized list (as example above).
  • B -- @slot ... but with extra tags/implementation I missed. I was unable to get @slot to work with roxygen / roxygen2 in versions where it was included as a replacement for the itemized list in the example above. Again, the example above does work with roxygen(2).
  • C -- Some alternative tag for specifying slots, like @param, that would accomplish the same thing.

I'm borrowing/extending this question from a post I made to the roxygen2 development page on github.

Answered By: Full Decent ( 7)

For S4, I would say current best practice is documentation in the form:

\section{Slots}{
  \describe{
    \item{\code{a}:}{Object of class \code{"numeric"}.}
    \item{\code{b}:}{Object of class \code{"character"}.}
  }
}

This is consistent with the internal representation of slots as a list inside the object. As you point out, this syntax is different from other lines, and we may hope for a more robust solution in the future that incorporates knowledge of inheritance -- but today that does not exist.

As pointed out by @Brian Diggs above, this is implemented at https://github.com/klutometis/roxygen/pull/85

Fortunately, aside from the current best practice, which should clearly be improved, there are many developers working on the problem, and a better solution may come out, perhaps by the end of the year.
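
As an illustration only (the class name and slots are hypothetical, and this assumes roxygen2's @section tag as one way to carry the Rd block above), the documentation could sit directly above the class definition:

#' An example S4 class with a hand-written Slots section.
#'
#' @section Slots:
#'   \describe{
#'     \item{\code{a}:}{Object of class \code{"numeric"}.}
#'     \item{\code{b}:}{Object of class \code{"character"}.}
#'   }
#' @name exampleclass-class
#' @rdname exampleclass-class
#' @exportClass exampleclass
setClass("exampleclass",
    representation(a = "numeric", b = "character")
)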

219
Brian Campbell

I'm a programmer with a decent background in math and computer science. I've studied computability, graph theory, linear algebra, abstract algebra, algorithms, and a little probability and statistics (through a few CS classes) at an undergraduate level.

I feel, however, that I don't know enough about statistics. Statistics are increasingly useful in computing, with statistical natural language processing helping fuel some of Google's algorithms for search and machine translation, with performance analysis of hardware, software, and networks needing proper statistical grounding to be at all believable, and with fields like bioinformatics becoming more prevalent every day.

I've read about how "Google uses Bayesian filtering the way Microsoft uses the if statement", and I know the power of even fairly naïve, simple statistical approaches to problems from Paul Graham's A Plan for Spam and Better Bayesian Filtering, but I'd like to go beyond that.

I've tried to look into learning more statistics, but I've gotten a bit lost. The Wikipedia article has a long list of related topics, but I'm not sure which I should look into. I feel, from what I've seen, that a lot of statistics assumes everything is a combination of factors that combine linearly, plus some Gaussian random noise; I'm wondering what I should learn beyond linear regression, or if I should spend the time to really understand that before I move on to other techniques. I've found a few long lists of books to look at; where should I start?

So I'm wondering where to go from here; what to learn, and where to learn it. In particular, I'd like to know:

  1. What kind of problems in programming, software engineering, and computer science are statistical methods well suited for? Where am I going to get the biggest payoffs?
  2. What kind of statistical methods should I spend my time learning?
  3. What resources should I use to learn this? Books, papers, web sites. I'd appreciate a discussion of what each book (or other resource) is about, and why it's relevant.

To clarify what I am looking for, I am interested in what problems that programmers typically need to deal with can benefit from a statistical approach, and what kind of statistical tools can be useful. For instance:

  • Programmers frequently need to deal with large databases of text in natural languages, and help to categorize, classify, search, and otherwise process it. What statistical techniques are useful here?
  • More generally, artificial intelligence has been moving away from discrete, symbolic approaches and towards statistical techniques. What statistical AI approaches have the most to offer now, to the working programmer (as opposed to ongoing research that may or may not provide concrete results)?
  • Programmers are frequently asked to produce high-performance systems that scale well under load. But you can't really talk about performance unless you can measure it. What kind of experimental design and statistical tools do you need to use to be able to say with confidence that the results are meaningful?
  • Simulation of physical systems, such as in computer graphics, frequently involves a stochastic approach.
  • Are there other problems commonly encountered by programmers that would benefit from a statistical approach?
Answered By: Ian Fellows ( 96)

Interesting question. As a statistician whose interest is more and more aligned with computer science perhaps I could provide a few thoughts...

  1. Don't learn frequentist hypothesis testing. While the bulk of my work is done in this paradigm, it doesn't match the needs of business or data mining. Scientists generally have specific hypotheses in mind, and might wish to gauge the probability that, given their hypothesis isn't true, the data would be as extreme as it is. This is rarely the type of answer a computer scientist wants.

  2. Bayesian is useful, even if you don't know why you are assuming the priors that you are using. A Bayesian analysis can give you a precise probability estimate for various contingencies, but it is important to realize that the only reason you have this precise estimate is because you made a fuzzy decision regarding the prior probability. (For those not in the know, with Bayesian inference you can specify an arbitrary prior probability and update it based on the data collected to get a better estimate; a toy sketch follows below.)
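
A toy sketch of that prior-to-posterior update, using the Beta-Binomial conjugate pair (the numbers are made up): start from a flat Beta(1, 1) prior on a success rate, observe 7 successes in 20 trials, and the posterior is Beta(1 + 7, 1 + 13).

prior_a <- 1; prior_b <- 1                      # flat prior
successes <- 7; failures <- 13                  # observed data
post_a <- prior_a + successes
post_b <- prior_b + failures
curve(dbeta(x, post_a, post_b), from = 0, to = 1,
      xlab = "success rate", ylab = "density")  # posterior
curve(dbeta(x, prior_a, prior_b), add = TRUE, lty = 2)  # prior, for comparison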

Machine learning and classification might be a good place to get started. The machine learning literature is more focused on computer science problems, though its mission is almost identical to that of statistics (see: http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/).

Since you spoke of large databases with large numbers of variables, here are a few algorithms that come in handy in this domain.

  • adaboost: If you have a large number of crappy classifiers, and want to make one good classifier. (see also logit boost)
  • Support Vector Machines: A powerful and flexible classifier. Can learn non-linear patterns (okay linear in the non-linear kernel space if you want to be picky about it).
  • k-nearest neighbor: A simple but powerful algorithm. It does not scale well, but there are approximate nearest neighbor alternatives that are not quite so pathological.
  • CART: This algorithm partitions the data based on a number of predictor variables. It is particularly good if there are variable interactions, or there exists a very good predictor that only works on a subset of the data.
  • Least angle regression: if the value that you are trying to predict is continuous and you have a lot of data and a lot of predictors.

This is by no means complete, but should give you a good jumping-off point. A very good and accessible book on the subject is Duda, Hart, Stork: Pattern Classification.
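
To make a couple of those concrete in R (the package choices here are one common option, not the only one):

library(rpart)   # CART-style recursive partitioning
library(class)   # knn()

fit <- rpart(Species ~ ., data = iris)   # grow a classification tree
print(fit)

train <- iris[seq(1, 150, by = 2), ]
test  <- iris[seq(2, 150, by = 2), ]
pred  <- knn(train[, 1:4], test[, 1:4], cl = train$Species, k = 5)
table(pred, test$Species)                # confusion matrix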

Also, a big part of statistics is descriptive visualization and analysis. These are of particular interest to the programmer because they allow him/her to convey information back to the user. In R, ggplot2 is my package of choice for creating visualizations. On the descriptive analysis side (and useful in text analysis) is multi-dimensional scaling, which can give a spatial interpretation of non-spatial data (for example the ideologies of senators http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aoas/1223908041).
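
A short base-R sketch of that last idea (classical MDS on the built-in mtcars data, purely as an illustration):

d  <- dist(scale(mtcars))   # pairwise distances between standardized cars
xy <- cmdscale(d, k = 2)    # classical multi-dimensional scaling into 2D
plot(xy, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(xy, labels = rownames(mtcars), cex = 0.7)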

215
jebyrnes

Apparently, folk have figured out how to make xkcd style graphs in Mathematica and in LaTeX. Can we do it in R? Ggplot2-ers? A geom_xkcd and/or theme_xkcd?

I guess in base graphics, par(xkcd=TRUE)? How do I do it?

xkcd#1064

As a first stab (and as much more elegantly shown below) in ggplot2, adding the jitter argument to a line makes for a great hand-drawn look. So -

ggplot(mapping=aes(x=seq(1,10,.1), y=seq(1,10,.1))) + geom_line(position="jitter", color="red", size=2) + theme_bw()

It makes for a nice example - but the axes and fonts appear trickier. Fonts appear solved (below), though. Is the only way to deal with axes to blank them out and draw them in by hand? Is there a more elegant solution? In particular, in ggplot2, can element_line in the new theme system be modified to take a jitter-like argument?

Answered By: Mark Bulling ( 103)

Thinking along the same lines as some of the other answers, I've "un-ggplotted" the chart and also added flexibility in the x-axis label locations (which seems to be common in xkcd) and an arbitrary label on the chart.

Note that I had a few issues loading the Humor Sans font, so I manually downloaded it to the working directory.

[Image: xkcd-style plot of a sin and cos curve]

And the code...

library(ggplot2)
library(extrafont)

### Already have read in fonts (see previous answer on how to do this)
loadfonts()

### Set up the trial dataset 
data <- NULL
data$x <- seq(1, 10, 0.1)
data$y1 <- sin(data$x)
data$y2 <- cos(data$x)
data$xaxis <- -1.5

data <- as.data.frame(data)

### XKCD theme
theme_xkcd <- theme(
    panel.background = element_rect(fill="white"), 
    axis.ticks = element_line(colour=NA),
    panel.grid = element_line(colour="white"),
    axis.text.y = element_text(colour=NA), 
    axis.text.x = element_text(colour="black"),
    text = element_text(size=16, family="Humor Sans")
    )

 ### Plot the chart
 p <- ggplot(data=data, aes(x=x, y=y1))+
      geom_line(aes(y=y2), position="jitter")+
      geom_line(colour="white", size=3, position="jitter")+
      geom_line(colour="red", size=1, position="jitter")+
      geom_text(family="Humor Sans", x=6, y=-1.2, label="A SIN AND COS CURVE")+
      geom_line(aes(y=xaxis), position = position_jitter(h = 0.005), colour="black")+
      scale_x_continuous(breaks=c(2, 5, 6, 9), 
      labels = c("YARD", "STEPS", "DOOR", "INSIDE"))+labs(x="", y="")+
      theme_xkcd

ggsave("xkcd_ggplot.jpg", plot=p, width=8, height=5)

148
grautur

Whenever I want to do something "map"py in R, I usually try to use a function in the apply family. (Side question: I still haven't learned plyr or reshape -- would plyr or reshape replace all of these entirely?)

However, I've never quite understood the differences between them [how {sapply, lapply, etc.} apply the function to the input/grouped input, what the output will look like, or even what the input can be], so I often just go through them all until I get what I want.

Can someone explain how to use which one when?

[My current (probably incorrect/incomplete) understanding is...

  1. sapply(vec, f): input is a vector. output is a vector/matrix, where element i is f(vec[i]) [giving you a matrix if f has a multi-element output]
  2. lapply(vec, f): same as sapply, but output is a list?
  3. apply(matrix, 1/2, f): input is a matrix. output is a vector, where element i is f(row/col i of the matrix)
  4. tapply(vector, grouping, f): output is a matrix/array, where an element in the matrix/array is the value of f at a grouping g of the vector, and g gets pushed to the row/col names
  5. by(dataframe, grouping, f): let g be a grouping. apply f to each column of the group/dataframe. pretty print the grouping and the value of f at each column.
  6. aggregate(matrix, grouping, f): similar to by, but instead of pretty printing the output, aggregate sticks everything into a dataframe.]
Answered By: joran ( 200)

R has many *apply functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.

Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.

This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.

  • apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues).

    # Two dimensional matrix
    M <- matrix(seq(1,16), 4, 4)
    
    # apply min to rows
    apply(M, 1, min)
    [1] 1 2 3 4
    
    # apply min to columns
    apply(M, 2, max)
    [1]  4  8 12 16
    
    # 3 dimensional array
    M <- array( seq(32), dim = c(4,4,2))
    
    # Apply sum across each M[*, , ] - i.e Sum across 2nd and 3rd dimension
    apply(M, 1, sum)
    # Result is one-dimensional
    [1] 120 128 136 144
    
    # Apply sum across each M[*, *, ] - i.e Sum across 3rd dimension
    apply(M, c(1,2), sum)
    # Result is two-dimensional
         [,1] [,2] [,3] [,4]
    [1,]   18   26   34   42
    [2,]   20   28   36   44
    [3,]   22   30   38   46
    [4,]   24   32   40   48
    

    If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.

  • lapply - When you want to apply a function to each element of a list in turn and get a list back.

    This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath.

       x <- list(a = 1, b = 1:3, c = 10:100) 
       lapply(x, FUN = length) 
       $a 
       [1] 1
       $b 
       [1] 3
       $c 
       [1] 91
    
       lapply(x, FUN = sum) 
       $a 
       [1] 1
       $b 
       [1] 6
       $c 
       [1] 5005
    
  • sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.

    If you find yourself typing unlist(lapply(...)), stop and consider sapply.

       x <- list(a = 1, b = 1:3, c = 10:100)
       #Compare with above; a named vector, not a list 
       sapply(x, FUN = length)  
       a  b  c   
       1  3 91
    
       sapply(x, FUN = sum)   
       a    b    c    
       1    6 5005 
    

    In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:

       sapply(1:5,function(x) rnorm(3,x))
    

    If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:

       sapply(1:5,function(x) matrix(x,2,2))
    

    Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:

       sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
    

    Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.

  • vapply - When you want to use sapply but perhaps need to squeeze some more speed out of your code.

    For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.

    x <- list(a = 1, b = 1:3, c = 10:100)
    #Note that since the adv here is mainly speed, this
    # example is only for illustration. We're telling R that
    # everything returned by length() should be an integer of 
    # length 1. 
    vapply(x, FUN = length, FUN.VALUE = 0) 
    a  b  c  
    1  3 91
    
  • mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.

    This is multivariate in the sense that your function must accept multiple arguments.

    #Sums the 1st elements, the 2nd elements, etc. 
    mapply(sum, 1:5, 1:5, 1:5) 
    [1]  3  6  9 12 15
    #To do rep(1,4), rep(2,3), etc.
    mapply(rep, 1:4, 4:1)   
    [[1]]
    [1] 1 1 1 1
    
    [[2]]
    [1] 2 2 2
    
    [[3]]
    [1] 3 3
    
    [[4]]
    [1] 4
    
  • rapply - For when you want to apply a function to each element of a nested list structure, recursively.

    To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, so YMMV. This one is best illustrated with a user-defined function to apply:

    #Append ! to string, otherwise increment
    myFun <- function(x){
        if (is.character(x)){
        return(paste(x,"!",sep=""))
        }
        else{
        return(x + 1)
        }
    }
    
    #A nested list structure
    l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"), 
              b = 3, c = "Yikes", 
              d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))
    
    
    #Result is named vector, coerced to character           
    rapply(l,myFun)
    
    #Result is a nested list like l, with values altered
    rapply(l, myFun, how = "replace")
    
  • tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.

    The black sheep of the *apply family, of sorts. The help files' use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.

    A vector:

       x <- 1:20
    

    A factor (of the same length!) defining groups:

       y <- factor(rep(letters[1:5], each = 4))
    

    Add up the values in x within each subgroup defined by y:

       tapply(x, y, sum)  
        a  b  c  d  e  
       10 26 42 58 74 
    

    More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors. tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.). Hence its black sheep status.
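
    For comparison, a quick sketch (reusing x and y from above) of the same group-wise sum with two of those split-apply-combine relatives; this is only a pointer, not a full treatment:

       aggregate(x, by = list(group = y), FUN = sum)
       by(x, y, sum)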

144
Christopher DuBois

I want to sort a data.frame by multiple columns in R. For example, with the data.frame below I would like to sort by column z (descending) then by column b (ascending):

dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"), 
      levels = c("Low", "Med", "Hi"), ordered = TRUE),
      x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
      z = c(1, 1, 1, 2))
dd
    b x y z
1  Hi A 8 1
2 Med D 3 1
3  Hi A 9 1
4 Low C 9 2
Answered By: Dirk Eddelbuettel ( 193)

You can use the order() function directly without resorting to add-on tools -- see this simpler answer which uses a trick right from the top of the example(order) code:

R> dd[with(dd, order(-z, b)), ]
    b x y z
4 Low C 9 2
2 Med D 3 1
1  Hi A 8 1
3  Hi A 9 1

Edit some 2+ years later: It was just asked how to do this by column index. The answer is to simply pass the desired sorting column(s) to the order() function:

R> dd[ order(-dd[,4], dd[,1]), ]
    b x y z
4 Low C 9 2
2 Med D 3 1
1  Hi A 8 1
3  Hi A 9 1
R> 

rather than using the name of the column (and with() for easier/more direct access).