Running in parallel

Since the typical application of DynaSig-ML involves computing thousands of Dynamical Signatures, we have made it simple to parallelize the computation. Simply divide the list of PDB files in a number of sub-lists equal to the number of parallel jobs you would like to run. In the dynasigml_mir125a_example repository, the make_jobs.py script is already set up to generate 99 Slurm jobs that can be run in parallel, in a created folder called parallel_jobs. It can easily be adapted to your parallel scheduler of choice if you cannot use Slurm. Each job should take somewhere between 10-15 minutes to run.

The basic principle is to generate one DynaSigDF per sub-list of files, with all other parameters being the same (beta_values, exp_measures, models). For example, the run_one_dynasigdf.py script does exactly that for the miR-125a dataset, provided with an index from 0 to 98 (99 total parallel jobs) as its only command-line argument:

from dynasigml.dynasig_df import DynaSigDF
import sys
import glob
import numpy as np

def load_data(filename):
    with open(filename) as f:
        lines = f.readlines()
    data_dict = dict()
    for line in lines[1:]:
        ll = line.split()
        data_dict[ll[2]] = [float(ll[0]), float(ll[1])]
    return data_dict


if __name__ == "__main__":
    if len(sys.argv) != 2:
        raise ValueError("I need one argument: the job index")
    job_index = int(sys.argv[1])
    n_total_jobs = 99
    filenames_list = sorted(glob.glob("mir125a_variants/*.pdb"))
    step = int(len(filenames_list)/float(n_total_jobs) + 1)
    start = job_index * step
    stop = (job_index+1) * step
    sub_filenames_list = filenames_list[start:stop]
    data_dict = load_data('data_mir125.df')
    beta_values = [np.e ** (x / 2) for x in range(-6, 7)]
    exp_data = []
    for filename in sub_filenames_list:
        mutid = filename.split('.')[0].split('mir125a_')[-1]
        exp_data.append(data_dict[mutid])
    dsdf_name = "split_dsdfs/dsdf_{}".format(job_index)
    # eff is for maturation efficiency
    dsdf = DynaSigDF(sub_filenames_list, exp_data, ["eff", "mcfold_energy"], dsdf_name, beta_values=beta_values)