Here is a question about row operations in data.table. One alternative that came to mind was to create a unique id for each row, and then apply a function using the `by` argument. Like this:
```r
library(data.table)
dt <- data.table(v0 = letters[c(1,1,2,2,3)], v1 = 1:5, v2 = 3:7, v3 = 5:1)

# create column with row positions
dt[, rowpos := .I]

# calculate standard deviation by row
dt[, sdd := sd(.SD[, -1, with = FALSE]), by = rowpos]
```
Two questions:

1. Is there any good reason not to use this approach? Perhaps there are other, more efficient alternatives?

2. Why doesn't using `by = .I` work the same way? `dt[, sdd := sd(.SD[, -1, with = FALSE]), by = .I]`
1) Well, one reason not to use it, at least for the `rowSums` example, is performance, together with the creation of an unnecessary column. Compare option `f2` below, which is almost 4x faster and does not need the `rowpos` column:
```r
dt <- data.table(v0 = letters[c(1,1,2,2,3)], v1 = 1:5, v2 = 3:7, v3 = 5:1)

f1 <- function(dt) {
  dt[, rowpos := .I]
  dt[, sdd := rowSums(.SD[, 2:4, with = FALSE]), by = rowpos]
}

f2 <- function(dt) {
  dt[, sdd := rowSums(dt[, 2:4, with = FALSE])]
}

library(microbenchmark)
microbenchmark(f1(dt), f2(dt))
# Unit: milliseconds
#    expr      min       lq     mean   median       uq      max neval cld
#  f1(dt) 3.669049 3.732434 4.013946 3.793352 3.972714 5.834608   100   b
#  f2(dt) 1.052702 1.085857 1.154132 1.105301 1.138658 2.825464   100  a
```
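For completeness, when the columns are known in advance, plain vectorised arithmetic avoids the `.SD` subsetting overhead entirely. This is a minimal sketch; `f3` is a name introduced here for illustration, not part of the benchmark above:

```r
library(data.table)

dt <- data.table(v0 = letters[c(1,1,2,2,3)], v1 = 1:5, v2 = 3:7, v3 = 5:1)

# f3 (hypothetical name): refer to the columns directly,
# so no .SD subsetting or with = FALSE indexing is needed
f3 <- function(dt) {
  dt[, sdd := v1 + v2 + v3]
}

f3(dt)
dt$sdd
# [1]  9 10 11 12 13   (same values as rowSums over v1:v3)
```

Note that unlike `rowSums(..., na.rm = TRUE)`, a plain `+` propagates `NA`s, so the two are only interchangeable on complete data.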
2) On the second question: although `dt[, sdd := sum(.SD[, 2:4, with = FALSE]), by = .I]` does not work, `dt[, sdd := sum(.SD[, 2:4, with = FALSE]), by = 1:nrow(dt)]` works perfectly. Given that, according to `?data.table`, ".I is an integer vector equal to seq_len(nrow(x))", one might expect these to be equivalent. The difference, however, is that `.I` is for use in `j` only, and not in `by`, because its value is only generated within the call to `[.data.table`, rather than being evaluated beforehand.
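As a quick illustration of `.I` doing its documented job inside `j` (a minimal sketch using the same example table):

```r
library(data.table)

dt <- data.table(v0 = letters[c(1,1,2,2,3)], v1 = 1:5, v2 = 3:7, v3 = 5:1)

# .I in j: the vector of row numbers of the selected rows
dt[, .I]
# [1] 1 2 3 4 5

# with a filter in i, .I gives the positions of the matching rows
dt[v1 > 3, .I]
# [1] 4 5
```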
It might be expected (see @eddi's comment on the question above) that `by = .I` should throw an error. That does not occur, because loading the data.table package creates an object `.I` in the data.table namespace that is accessible from the global environment, and whose value is NULL. You can test this by typing `.I` at the command prompt. (Note that the same applies to `.SD`, `.EACHI`, `.N`, `.GRP`, and `.BY`.)
```r
.I
# Error: object '.I' not found
library(data.table)
.I
# NULL
data.table::.I
# NULL
```
The upshot of this behaviour is that `by = .I` is equivalent to `by = NULL`.
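That equivalence can be checked directly. With the behaviour described above (current at the time of writing), neither call groups by row; both collapse over the whole table:

```r
library(data.table)

dt <- data.table(v0 = letters[c(1,1,2,2,3)], v1 = 1:5, v2 = 3:7, v3 = 5:1)

# .I is NULL when evaluated in by, so this does NOT group by row;
# it behaves exactly like by = NULL, i.e. one grand total
dt[, sum(v1), by = .I]
# [1] 15
dt[, sum(v1), by = NULL]
# [1] 15
```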
3) Although we saw in part 1 that in the case of `rowSums`, which already loops row-wise efficiently, there are faster approaches than creating a `rowpos` column, what about looping when we don't have a fast row-wise function? Benchmarking the `by = rowpos` and `by = 1:nrow(dt)` versions against a `for` loop with `set()` is informative here, and demonstrates that the loop version is faster than either of the `by =` approaches:
```r
f.rowpos <- function() {
  dt <- data.table(v0 = rep(letters[c(1,1,2,2,3)], 1e3), v1 = 1:5, v2 = 3:7, v3 = 5:1)
  dt[, rowpos := .I]
  dt[, sdd := sum(.SD[, 2:4, with = FALSE]), by = rowpos][]
}

f.nrow <- function() {
  dt <- data.table(v0 = rep(letters[c(1,1,2,2,3)], 1e3), v1 = 1:5, v2 = 3:7, v3 = 5:1)
  dt[, sdd := sum(.SD[, 2:4, with = FALSE]), by = 1:nrow(dt)][]
}

f.forset <- function() {
  dt <- data.table(v0 = rep(letters[c(1,1,2,2,3)], 1e3), v1 = 1:5, v2 = 3:7, v3 = 5:1)
  dt[, sdd := 0L]
  for (i in 1L:nrow(dt)) {
    set(dt, i, 5L, sum(dt[i, 2:4]))
  }
  dt
}

microbenchmark(f.rowpos(), f.nrow(), f.forset(), times = 5)
# Unit: seconds
#        expr      min       lq     mean   median       uq      max neval cld
#  f.rowpos() 4.465371 4.503614 4.510916 4.505922 4.521629 4.558042     5   b
#    f.nrow() 4.499120 4.499920 4.541131 4.558701 4.571267 4.576647     5   b
#  f.forset() 2.540556 2.603505 2.654036 2.606108 2.750719 2.769292     5  a
```
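The reason `set()` wins in the loop above is that it updates the table by reference and skips the argument-handling overhead of the `[.data.table` method on every iteration. A minimal standalone sketch of its arguments:

```r
library(data.table)

dt <- data.table(v0 = letters[c(1,1,2,2,3)], v1 = 1:5, v2 = 3:7, v3 = 5:1)

# set(x, i, j, value): update row i of column j by reference,
# bypassing the overhead of the [.data.table method
set(dt, i = 1L, j = "v1", value = 100L)
dt$v1
# [1] 100   2   3   4   5
```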
So, in conclusion, even in situations where there is no optimised function such as `rowSums` that already operates by row, there are alternatives to using a `rowpos` column that are faster, while not requiring the creation of a redundant column.