How to Extend GLOBEM
GLOBEM also provides flexible ways for researchers and developers to add new algorithms, new datasets, and new modeling targets.
How to add a new algorithm
The platform supports researchers in developing their own algorithms easily. Reading through Platform Description before implementing new algorithms is strongly recommended.
An algorithm just needs to extend the abstract class `DepressionDetectionAlgorithmBase` and implement the following:

- Define the function `prep_data_repo` (as the feature preparation module). It takes in a `DatasetDict` as the input and returns a `DataRepo` object (see the definition here), which is a simple data object that saves `X`, `y`, and `pids` (participant ids). This can be used for preparing both training and testing sets.
- Define the function `prep_model` (as the model computation module). It returns a `DepressionDetectionClassifierBase` object (see the definition here), which needs to support `fit` (model training), `predict` (model prediction), and `predict_proba` (model prediction with probability distribution).
- Add a configuration file in `config` (as the configuration module). At least one yaml file with a unique name needs to be put in the `config` folder. The config file contains controllable parameters that can be adjusted manually. Please refer to `config/README.md` for more details.
- Register the new algorithm in `algorithm/algorithm_factory.py` by adding the appropriate class import and if-else logic.
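Putting these steps together, a new algorithm might look like the sketch below. The class and method names `DepressionDetectionAlgorithmBase`, `DepressionDetectionClassifierBase`, `prep_data_repo`, `prep_model`, and `DataRepo` come from the platform, but the stub base classes, the shape of the input dictionary, and the majority-class model here are simplified stand-ins for illustration, not the real GLOBEM definitions:

```python
# Sketch of a new GLOBEM algorithm. The stub base classes below are
# simplified stand-ins for the real ones in the GLOBEM codebase.

class DepressionDetectionAlgorithmBase:       # stand-in for the real abstract class
    def prep_data_repo(self, dataset_dict):
        raise NotImplementedError
    def prep_model(self):
        raise NotImplementedError

class DataRepo:                               # stand-in: saves X, y, and pids
    def __init__(self, X, y, pids):
        self.X, self.y, self.pids = X, y, pids

class DepressionDetectionClassifierBase:      # stand-in classifier abstract class
    def fit(self, X, y): raise NotImplementedError
    def predict(self, X): raise NotImplementedError
    def predict_proba(self, X): raise NotImplementedError

class MajorityClassifier(DepressionDetectionClassifierBase):
    """Toy model: always predicts the most frequent training label."""
    def fit(self, X, y):
        self.majority = max(set(y), key=y.count)
    def predict(self, X):
        return [self.majority] * len(X)
    def predict_proba(self, X):
        # Degenerate two-class probability distribution for the toy model.
        p = [1.0, 0.0] if self.majority == 0 else [0.0, 1.0]
        return [p] * len(X)

class MyNewAlgorithm(DepressionDetectionAlgorithmBase):
    def prep_data_repo(self, dataset_dict):
        # Feature preparation module: flatten a (hypothetical)
        # {pid: (features, label)} mapping into X / y / pids.
        X, y, pids = [], [], []
        for pid, (features, label) in dataset_dict.items():
            X.append(features)
            y.append(label)
            pids.append(pid)
        return DataRepo(X=X, y=y, pids=pids)

    def prep_model(self):
        # Model computation module: return an object supporting
        # fit / predict / predict_proba.
        return MajorityClassifier()
```

A matching yaml file (e.g., a hypothetical `config/my_new_algorithm.yaml`) and an import plus an if-else branch in `algorithm/algorithm_factory.py` would complete the registration.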
The platform further provides two templates for easier implementation of common traditional ML and DL algorithms.
How to add an ML algorithm
We provide a basic traditional machine learning algorithm `DepressionDetectionAlgorithm_ML_basic` that extends `DepressionDetectionAlgorithmBase`.

Its `prep_data_repo` function:

- takes the feature vector on the same day as the collected label
- performs feature normalization
- filters out empty features and days with a large amount of missing data
- imputes the remaining missing data using the median
- puts the data into a `DataRepo` and returns it

Its `prep_model` function is left empty for custom implementation.

This object can serve as a starting point, and other traditional ML algorithms can extend `DepressionDetectionAlgorithm_ML_basic`. For example, the implementation of Saeb et al.'s algorithm can be found in `algorithm/ml_saeb.py` and `config/ml_saeb.yaml`.
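The preprocessing steps above can be illustrated with a small self-contained sketch. This is not the GLOBEM implementation (which lives in `DepressionDetectionAlgorithm_ML_basic` and works on richer data structures); the helper below, including its name and the `max_missing_ratio` parameter, is an illustrative stand-in for the same sequence of operations:

```python
import numpy as np

def prep_features(X, max_missing_ratio=0.5):
    """Illustrative stand-in for the preprocessing in prep_data_repo:
    drop days with too much missing data, drop empty features,
    impute remaining gaps with column medians, then normalize."""
    X = np.asarray(X, dtype=float)
    # 1. Filter days (rows) with a large amount of missing data.
    row_missing = np.isnan(X).mean(axis=1)
    X = X[row_missing <= max_missing_ratio]
    # 2. Filter empty features (all-NaN columns).
    X = X[:, ~np.all(np.isnan(X), axis=0)]
    # 3. Impute the rest of the missing values with the column median.
    medians = np.nanmedian(X, axis=0)
    X = np.where(np.isnan(X), medians, X)
    # 4. Normalize each feature to zero mean and unit variance.
    std = X.std(axis=0)
    std[std == 0] = 1.0                      # avoid division by zero
    return (X - X.mean(axis=0)) / std
```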
How to add a DL algorithm
We use ERM (`algorithm/dl_erm.py`) as the basic deep learning algorithm `DepressionDetectionAlgorithm_DL_erm`, which extends `DepressionDetectionAlgorithmBase`.

Its `prep_data_repo` function:

- prepares a set of data loaders `MultiSourceDataGenerator` as the training & validation or testing set
- puts them into a `DataRepo` and returns it

Its `prep_model` function:

- defines a standard deep-learning classifier `DepressionDetectionClassifier_DL_erm` that extends `DepressionDetectionClassifierBase`
- defines how a deep model should be trained, saved, and evaluated

The training setup is parameterized in config files such as `config/dl_erm_1dCNN.yaml`.

This algorithm can serve as a starting point, and other DL algorithms can extend `DepressionDetectionAlgorithm_DL_erm` and `DepressionDetectionClassifier_DL_erm`. For example, the implementation of the IRM algorithm can be found at `algorithm/dl_irm.py` and `config/dl_irm.yaml`.
For both traditional ML and DL algorithms, if the pre-implemented templates are not helpful, developers can also start from the plain `DepressionDetectionAlgorithmBase` and `DepressionDetectionClassifierBase`.
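When starting from the plain base classes, the main contract to satisfy is the classifier interface: `fit`, `predict`, and `predict_proba`. As a sketch, the base class below is a stand-in for GLOBEM's abstract class, and the nearest-centroid model is just a placeholder to make the interface concrete:

```python
import numpy as np

class DepressionDetectionClassifierBase:      # stand-in for GLOBEM's abstract class
    def fit(self, X, y): raise NotImplementedError
    def predict(self, X): raise NotImplementedError
    def predict_proba(self, X): raise NotImplementedError

class NearestCentroidClassifier(DepressionDetectionClassifierBase):
    """Placeholder binary model: classify by the closer class centroid."""
    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y)
        # One centroid per binary class label.
        self.centroids = {c: X[y == c].mean(axis=0) for c in (0, 1)}
        return self
    def predict_proba(self, X):
        X = np.asarray(X, float)
        d0 = np.linalg.norm(X - self.centroids[0], axis=1)
        d1 = np.linalg.norm(X - self.centroids[1], axis=1)
        # Relatively closer to centroid 1 -> higher probability of class 1.
        p1 = d0 / (d0 + d1 + 1e-12)
        return np.stack([1 - p1, p1], axis=1)
    def predict(self, X):
        return (self.predict_proba(X)[:, 1] >= 0.5).astype(int)
```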
How to add a new dataset
To include a new dataset in the pipeline, follow these steps:

- Define the name of the new dataset with the template `[group name]_[dataset NO in the group]`, e.g., `ABC_1`.
- Following the same structure as other dataset folders in `data_raw`, the new dataset folder (e.g., `ABC_1`) needs to contain three subfolders. Please refer to the GLOBEM Datasets page for more details:
  - `FeatureData`
    - A csv file `rapids.csv` indexed by `pid` and `date` for feature data, and separate files `[data_type].csv` indexed by `pid` and `date` for each data type.
    - Each row is a feature vector of a subject at a given date. Example columns: [`pid`, `date`, `feature1`, `feature2` ...].
    - Columns include all sensor features of Phone Location, Phone Screen, Calls, Bluetooth, Fitbit Steps, and Fitbit Sleep from the RAPIDS toolkit.
  - `SurveyData`
    - csv files indexed by `pid` and `date` for label data.
    - For depression detection specifically, there are two files: `dep_weekly.csv` and `dep_endterm.csv`.
    - For other tasks, there are three files: `pre.csv`, `post.csv`, and `ema.csv`.
  - `ParticipantsInfoData`
    - A csv file `platform.csv` indexed by `pid` for the data collection device platform (i.e., iOS or Android).
    - Example columns of the file: [`pid`, `platform`].
- Register the new path in `data/data_factory.py` by adding new key-value pairs in the following dictionaries: `feature_folder`, `survey_folder`, and `device_info_folder` (e.g., adding `{"ABC": {1: ...}}`).
- Register the new dataset key in `config/global_config.yaml` under `global_config["all"]["ds_keys"]` (e.g., appending `"ABC_1"`).
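The two registration steps amount to small dictionary and list edits. A sketch is shown below; the folder paths are hypothetical examples, and in GLOBEM the dictionaries live in `data/data_factory.py` while `ds_keys` is a yaml list in `config/global_config.yaml`:

```python
# Sketch of registering a new dataset "ABC_1". The folder paths here are
# hypothetical examples following the data_raw structure described above.
feature_folder = {"ABC": {1: "data_raw/ABC_1/FeatureData/"}}
survey_folder = {"ABC": {1: "data_raw/ABC_1/SurveyData/"}}
device_info_folder = {"ABC": {1: "data_raw/ABC_1/ParticipantsInfoData/"}}

# In config/global_config.yaml this is a yaml list; shown here as Python.
ds_keys = ["ABC_1"]

# A dataset key splits back into its group name and dataset number,
# which index into the dictionaries above.
group, num = ds_keys[0].split("_")
feature_path = feature_folder[group][int(num)]
```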
How to add a new modeling target
Our current platform only supports binary classification tasks. Future work will be needed to extend it to multi-class classification and regression tasks. To build a model for a new target other than depression detection, please follow these steps:

- Pick a column in either `ema.csv` or `post.csv` (see `data_raw/README.md` for more details) as the target name.
  - Note that the picked column needs to be consistent across all datasets defined in `config/global_config.yaml`. A column in `pre.csv` would also work as long as the date can be handled correctly. Here `UCLA_10items_POST` from `post.csv`, a metric measuring loneliness, is used as an example.
- Define the binary label for the target in `data/data_factory.py`'s `threshold_book`.
  - A simple threshold-based method is used: add a `key:value` pair to the `threshold_book`, where `key` is the target name and `value` is a dictionary `{"threshold_as_false": th1, "threshold_as_true": th2}` (note that `th1` is different from `th2`).
  - For example, for `UCLA_10items_POST`, scores <= 24 will be defined as `False`, and scores > 24 will be `True`. This corresponds to adding the following `key:value` pair to the `threshold_book`: `"UCLA_10items_POST": {"threshold_as_false": 24, "threshold_as_true": 25}`.
- Define it in `config/global_config.yaml` to involve it in the pipeline.
  - Replace `global_config["all"]["prediction_tasks"]` with `[the new target]`. Continuing the example, it will be `["UCLA_10items_POST"]`.
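The thresholding rule can be sketched as follows. The `threshold_book` entry mirrors the `UCLA_10items_POST` example above; the `score_to_label` helper is illustrative, not an actual GLOBEM function:

```python
# Entry added to threshold_book in data/data_factory.py (from the example above).
threshold_book = {
    "UCLA_10items_POST": {"threshold_as_false": 24, "threshold_as_true": 25},
}

def score_to_label(target, score):
    """Illustrative helper: map a raw survey score to a binary label."""
    th = threshold_book[target]
    if score <= th["threshold_as_false"]:
        return False
    if score >= th["threshold_as_true"]:
        return True
    return None  # scores strictly between the two thresholds stay unlabeled
```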