r - Row operations in data.table using `by = .I`


There is an existing explanation of row operations in data.table.

One alternative that came to mind is to use a unique id for each row, and then apply a function using the `by` argument, like this:

    library(data.table)

    dt <- data.table(v0 = letters[c(1, 1, 2, 2, 3)],
                     v1 = 1:5,
                     v2 = 3:7,
                     v3 = 5:1)

    # create a column with row positions
    dt[, rowpos := .I]

    # calculate the standard deviation by row
    dt[, sdd := sd(.SD[, -1, with = FALSE]), by = rowpos]

Questions:

  1. Is there any reason not to use this approach? Are there perhaps other, more efficient alternatives?

  2. Why doesn't using `by = .I` work the same way?

    dt[, sdd := sd(.SD[, -1, with = FALSE]), by = .I]

1) Well, one reason not to use it, at least for the rowSums example, is performance, along with the creation of an unnecessary column. Compare it to option f2 below, which is roughly 3.5x faster in this benchmark and does not need the rowpos column:

    library(data.table)
    library(microbenchmark)

    dt <- data.table(v0 = letters[c(1, 1, 2, 2, 3)], v1 = 1:5, v2 = 3:7, v3 = 5:1)

    # f1: add a row id column, then group by it
    f1 <- function(dt) {
      dt[, rowpos := .I]
      dt[, sdd := rowSums(.SD[, 2:4, with = FALSE]), by = rowpos]
    }

    # f2: a single vectorised rowSums() call, no helper column
    f2 <- function(dt) {
      dt[, sdd := rowSums(dt[, 2:4, with = FALSE])]
    }

    microbenchmark(f1(dt), f2(dt))
    # Unit: milliseconds
    #    expr      min       lq     mean   median       uq      max neval cld
    #  f1(dt) 3.669049 3.732434 4.013946 3.793352 3.972714 5.834608   100   b
    #  f2(dt) 1.052702 1.085857 1.154132 1.105301 1.138658 2.825464   100  a
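As a quick sanity check on the comparison above, the following sketch confirms that the grouped and the vectorised versions produce the same row sums. The names `by_row` and `vectorised` are illustrative only, and `copy()` is used so that each approach works on its own table.

```r
library(data.table)

dt <- data.table(v0 = letters[c(1, 1, 2, 2, 3)], v1 = 1:5, v2 = 3:7, v3 = 5:1)

# Grouped version: one group per row via a helper column (as in f1)
by_row <- copy(dt)
by_row[, rowpos := .I]
by_row[, sdd := rowSums(.SD[, 2:4, with = FALSE]), by = rowpos]
by_row[, rowpos := NULL]  # drop the helper column again

# Vectorised version: one rowSums() call over columns 2 to 4 (as in f2)
vectorised <- copy(dt)
vectorised[, sdd := rowSums(vectorised[, 2:4, with = FALSE])]

# Both approaches should agree on every row
same_result <- isTRUE(all.equal(by_row$sdd, vectorised$sdd))
print(same_result)
```

Only the route to the result differs, so the benchmark is measuring the overhead of grouping by row rather than any difference in output.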

2) On the second question: although `dt[, sdd := sum(.SD[, 2:4, with = FALSE]), by = .I]` does not work, `dt[, sdd := sum(.SD[, 2:4, with = FALSE]), by = 1:nrow(dt)]` works perfectly. Given that, according to ?data.table, ".I is an integer vector equal to seq_len(nrow(x))", one might expect these to be equivalent. The difference, however, is that .I is meant for use in j, not in by, because its value is produced as part of the by grouping rather than being evaluated beforehand.

It might also be expected (see the comment on the question above from @eddi) that `by = .I` should throw an error. This does not occur, because loading the data.table package creates an object .I in the data.table namespace that is accessible from the global environment, and whose value is NULL. You can test this by typing .I at the command prompt. (Note that the same applies to .SD, .EACHI, .N, .GRP, and .BY.)

    .I
    # Error: object '.I' not found
    library(data.table)
    .I
    # NULL
    data.table::.I
    # NULL

The upshot of this is that the behaviour of `by = .I` is equivalent to `by = NULL`.
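To make the distinction in part 2 concrete, here is a minimal sketch of where .I is meaningful (inside j) and of the explicit integer vector that works in by. The variable names are illustrative, and the sketch deliberately avoids `by = .I` itself, since its behaviour depends on the installed data.table version.

```r
library(data.table)

dt <- data.table(v0 = letters[c(1, 1, 2, 2, 3)], v1 = 1:5, v2 = 3:7, v3 = 5:1)

# Outside of a data.table call, .I is only an exported placeholder
placeholder_is_null <- is.null(data.table::.I)

# Inside j, .I holds the row numbers of the table
row_numbers <- dt[, .I]

# With grouping, .I in j holds the row numbers belonging to each group
first_row_per_group <- dt[, .(first_row = .I[1]), by = v0]

# An explicit integer vector in by gives one group per row
row_sums <- dt[, .(total = sum(.SD[, 2:4, with = FALSE])), by = .(row = 1:nrow(dt))]

print(row_numbers)
print(first_row_per_group)
print(row_sums)
```

The last call is the `by = 1:nrow(dt)` form from the answer, written so that the grouping column gets a readable name (`row`).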

3) We have already seen in part 1 that, in the case of rowSums, which already loops row-wise efficiently, there are faster ways than creating the rowpos column. But what about looping when we don't have a fast row-wise function?

Benchmarking the `by = rowpos` and `by = 1:nrow(dt)` versions against a for loop with set() is informative here, and demonstrates that the loop version is faster than either of the `by =` approaches:

    # group by a helper row id column
    f.rowpos <- function() {
      dt <- data.table(v0 = rep(letters[c(1, 1, 2, 2, 3)], 1e3), v1 = 1:5, v2 = 3:7, v3 = 5:1)
      dt[, rowpos := .I]
      dt[, sdd := sum(.SD[, 2:4, with = FALSE]), by = rowpos][]
    }

    # group by an explicit vector of row numbers
    f.nrow <- function() {
      dt <- data.table(v0 = rep(letters[c(1, 1, 2, 2, 3)], 1e3), v1 = 1:5, v2 = 3:7, v3 = 5:1)
      dt[, sdd := sum(.SD[, 2:4, with = FALSE]), by = 1:nrow(dt)][]
    }

    # plain for loop, updating one cell at a time by reference with set()
    f.forset <- function() {
      dt <- data.table(v0 = rep(letters[c(1, 1, 2, 2, 3)], 1e3), v1 = 1:5, v2 = 3:7, v3 = 5:1)
      dt[, sdd := 0L]
      for (i in 1L:nrow(dt)) {
        set(dt, i, 5L, sum(dt[i, 2:4]))  # column 5 is sdd
      }
      dt
    }

    microbenchmark(f.rowpos(), f.nrow(), f.forset(), times = 5)
    # Unit: seconds
    #        expr      min       lq     mean   median       uq      max neval cld
    #  f.rowpos() 4.465371 4.503614 4.510916 4.505922 4.521629 4.558042     5   b
    #    f.nrow() 4.499120 4.499920 4.541131 4.558701 4.571267 4.576647     5   b
    #  f.forset() 2.540556 2.603505 2.654036 2.606108 2.750719 2.769292     5  a

So, in conclusion, even in situations where there is no optimised function such as rowSums that already operates by row, there are alternatives to using a rowpos column that are faster and do not require the creation of a redundant column.
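As an illustration of that conclusion, the set() loop can be applied to the row-wise standard deviation from the original question. This is only a sketch under a few assumptions: the column names in `num_cols` are chosen for this example, and unlist() is used to turn each one-row selection into a plain numeric vector before calling sd().

```r
library(data.table)

dt <- data.table(v0 = letters[c(1, 1, 2, 2, 3)], v1 = 1:5, v2 = 3:7, v3 = 5:1)

# Columns to summarise (assumed for this example)
num_cols <- c("v1", "v2", "v3")

# Pre-allocate the result column as double, since sd() returns a double
dt[, sdd := NA_real_]

for (i in seq_len(nrow(dt))) {
  # Flatten the one-row selection into a numeric vector for sd()
  row_values <- unlist(dt[i, num_cols, with = FALSE])
  # Update a single cell by reference, without a helper rowpos column
  set(dt, i = i, j = "sdd", value = sd(row_values))
}

print(dt)
```

No rowpos column is created at any point, and each result is written in place. Whether this beats the `by =` variants for a given function and table size is best confirmed with a benchmark like the one above.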