distcomp.Rd
distcomp is a collection of methods to fit models to data that may be
distributed at various sites. The package arose as a way of addressing the
issues regarding data aggregation; by allowing sites to have control over
local data and transmitting only summaries, some privacy controls can be
maintained. Even when participants have no objections in principle to data
aggregation, it may still be useful to keep data local and expose just the
computations. For further details, please see the reference cited below.
The initial implementation consists of a stratified Cox model fit with distributed survival data and a Singular Value Decomposition of a distributed matrix. General Linear Models will soon be added. Although some sanity checks and balances are present, many more are needed to make this truly robust. We also hope that other methods will be added by users.
We make the following assumptions in the implementation:
(a) the aggregate data is logically a stacking of the data at each site, i.e.,
the full data is row-partitioned into sites, where the rows are observations;
(b) each site has the package distcomp installed and a workspace set up
for (writeable) use by the opencpu server (see distcompSetup()); and
(c) each site is exposing distcomp via an opencpu server.
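Assumption (a) can be illustrated with a minimal base R sketch. It does not use any distcomp function; the site names, the simulated data, and the choice of a cross-product as the shared summary are all assumptions made purely for illustration.

```r
# Simulated "full" data: 9 observations (rows), 2 covariates (columns).
set.seed(1)
full <- matrix(rnorm(18), nrow = 9, ncol = 2)

# Hypothetical row partition of the observations into three sites.
site_rows <- list(siteA = 1:3, siteB = 4:7, siteC = 8:9)
site_data <- lapply(site_rows, function(idx) full[idx, , drop = FALSE])

# Each site shares only a small summary (a 2 x 2 cross-product), not its rows.
site_summaries <- lapply(site_data, crossprod)

# Adding the site summaries reproduces the cross-product of the full data.
combined <- Reduce(`+`, site_summaries)

# Stacking the site blocks row-wise recovers the aggregate data.
stacked <- do.call(rbind, unname(site_data))
```

Because the rows are observations, a summary that is a sum over observations can be computed at each site and added up, so no individual rows need to leave a site.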
The main computation happens via a master process, a script of R code,
that makes calls to distcomp functions at worker sites via opencpu.
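The master/worker pattern can be sketched in plain R by simulating the sites as local matrices. This is not the distcomp API and makes no opencpu calls: `worker_step` is a hypothetical stand-in for a function that would run at a worker site, and the power iteration shown here is just one simple way to obtain the leading singular vector of a row-partitioned matrix.

```r
# Simulated row-partitioned matrix; the first column is scaled so that the
# leading singular value is well separated and the iteration converges fast.
set.seed(42)
X <- matrix(rnorm(60), nrow = 20, ncol = 3)
X[, 1] <- 3 * X[, 1]
sites <- list(X[1:8, , drop = FALSE], X[9:20, , drop = FALSE])

# Hypothetical worker-side function: given the current vector v, return the
# local summary t(X_k) %*% X_k %*% v without exposing the rows of X_k.
worker_step <- function(local_x, v) crossprod(local_x, local_x %*% v)

# Master process: send v to every site, add the returned summaries, normalize.
v <- rep(1, ncol(X)) / sqrt(ncol(X))
for (iter in seq_len(200)) {
  total <- Reduce(`+`, lapply(sites, worker_step, v = v))
  v <- drop(total / sqrt(sum(total^2)))
}

# Leading singular value, again assembled from per-site summaries only.
sigma <- sqrt(sum(sapply(sites, function(s) sum((s %*% v)^2))))
```

In a real deployment each call to the worker function would be a remote request to a site's server, which is one reason such computations are much slower than an in-memory fit.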
The use of opencpu allows developers to prototype their distributed
implementations on a local machine, since the opencpu package can run such a
server locally using localhost ports.
Note that distcomp computations are not intended for speed/efficiency;
indeed, they are orders of magnitude slower than the equivalent computation on
aggregated data. However, the models that are fit are not meant to be
recomputed often. These and other details are discussed in the reference cited
below.
The current implementation, particularly the Stratified Cox Model, makes direct use of
code from survival::coxph(). That is, the underlying Cox model code is
derived from that in the R survival package.
For an understanding of how this package is meant to be used, please see the documented examples and the reference.
Software for Distributed Computation on Medical Databases: A Demonstration Project. Journal of Statistical Software, 77(13), 1-22. doi:10.18637/jss.v077.i13
Appendix E of Modeling Survival Data: Extending the Cox Model by Terry M. Therneau and Patricia Grambsch. Springer Verlag, 2000.
The examples in system.file("doc", "examples.html", package="distcomp")
The source for the examples: system.file("doc_src", "examples.Rmd", package="distcomp").
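A short sketch of how the two paths above might be resolved from an R session. It assumes nothing beyond base R; `system.file()` returns an empty string when the package or file is not found, so the snippet checks for that before trying to open anything.

```r
# Resolve the installed locations of the rendered examples and their source.
examples_path <- system.file("doc", "examples.html", package = "distcomp")
source_path <- system.file("doc_src", "examples.Rmd", package = "distcomp")

if (nzchar(examples_path)) {
  # Open the rendered examples in a browser when running interactively.
  if (interactive()) browseURL(examples_path)
} else {
  message("distcomp examples not found; is the package installed?")
}
```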