Use devtools or remotes to fetch the package from this repository:
if(!require(devtools)) install.packages("devtools")
devtools::install_github("Anirban166/data.table.threads")
if(!require(remotes)) install.packages("remotes")
remotes::install_github("Anirban166/data.table.threads")
findOptimalThreadCount(rowCount, columnCount) is the go-to function which runs a set of benchmarks for various data.table functions that are parallelizable.
> benchmarkData <- data.table.threads::findOptimalThreadCount(1e7, 10)
Running benchmarks with 1 thread, 10000000 rows, and 10 columns.
...
Running benchmarks with 10 threads, 10000000 rows, and 10 columns.
It returns an object with print and plot methods.
> benchmarkData
data.table function    Thread count    Fastest median runtime (ms)
-------------------------------------------------------------------
CJ                     8               82.736011
forder                 6               15.670897
GForce_sum             6               54.386931
subsetting             6               23.329410
frollmean              5               7.319135
fcoalesce              6               22.716911
between                10              18.825437
fifelse                10              7.006490
nafill                 1               3.194330
The output is a table that shows, for each data.table function, the fastest median runtime (in milliseconds) along with the thread count that achieved it.
> plot(benchmarkData)
The generated plot shows the speedup of each function across thread counts, from 1 to the number of threads available on your system (10 in this example).
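To make the notion of speedup concrete, here is a minimal hand-rolled sketch that times a single data.table operation at a few thread counts and divides the single-thread runtime by each result. It is not the package's implementation: `timeSortMs` is a hypothetical helper, and the table size, repetition count, and cap of 4 threads are arbitrary assumptions. Only `setDTthreads()` and `getDTthreads()` from data.table are relied on.

```r
library(data.table)

# Hypothetical helper (not part of data.table.threads): median elapsed time,
# in milliseconds, of sorting `dt` by column x with a given thread count.
timeSortMs <- function(dt, threads, reps = 3L) {
  previous <- setDTthreads(threads)   # returns the prior setting invisibly
  on.exit(setDTthreads(previous))     # restore it when the function exits
  elapsed <- vapply(seq_len(reps), function(i) {
    system.time(dt[order(x)])[["elapsed"]]
  }, numeric(1))
  median(elapsed) * 1000
}

set.seed(1)
dt <- data.table(x = runif(1e6), y = runif(1e6))  # assumed size, for illustration

# Try 1..4 threads, or fewer if data.table is currently limited to less.
threadCounts <- seq_len(min(4L, getDTthreads()))
runtimes <- vapply(threadCounts, function(n) timeSortMs(dt, n), numeric(1))

# Speedup relative to the single-thread run.
speedup <- runtimes[1] / runtimes
print(data.table(threads = threadCounts, medianMs = runtimes, speedup = speedup))
```

The numbers will vary by machine and run, which is why the package's repeated benchmarks and median runtimes are a better basis for a decision than a one-off timing like this.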
setThreadCount(benchmarkData, functionName, efficiencyFactor) can then be used to set the thread count based on the observed results for a user-specified function and efficiency value (in the range [0, 1]) for the speedup:
> setThreadCount(benchmarkData, functionName = "forder", efficiencyFactor = 0.5, verbose = TRUE)
The number of threads that data.table will use has been set to 3, based on an efficiency factor of 0.5 for data.table::forder() based on the performed benchmarks.
> getDTthreads()
[1] 3
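Because the thread count is a session-wide data.table setting, you may want to put the earlier value back once the tuned section of your code has run. The sketch below uses only data.table's own `getDTthreads()` and `setDTthreads()`. The `setDTthreads(1L)` call is a stand-in for whatever value was chosen from the benchmarks, and `previousThreads` is a name introduced here for illustration.

```r
library(data.table)

previousThreads <- getDTthreads()   # thread count before any tuning

setDTthreads(1L)                    # stand-in for the tuned thread count
dt <- data.table(x = 1:10)
filtered <- dt[x > 5]               # ... thread-sensitive work goes here ...

setDTthreads(previousThreads)       # restore the earlier setting
restored <- getDTthreads() == previousThreads
```

Saving and restoring the value this way keeps a tuning choice made for one function from silently affecting unrelated data.table calls later in the session.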