Demo Tutorial Guide

This guide walks through a short demo project to familiarize users with the nuts and bolts behind a typical ChemHTPS project work-flow.

$ chemhtpsshell --setup_project --project_name trial

The above command will start a new project called “trial” at the current location.

$ ls trial
archive  db  jobpool  job_templates  lost+found  queue_list.dat  screeninglib  trial.config

Among the contents of a project folder:

  • “archive/” is where successful jobs are tarred and archived
  • “db/” should be used for storing the generated data in a data-base
  • “jobpool/” keeps all generated jobs that are yet to be run, under the “short/” and “long/” sub-folders
  • “job_templates/” keeps the project-specific SLURM scripts and program-template files. These are essentially copies of the files in the “chemhtps/chemhtps/metadata/job_templates/” folder. Users can also create their own SLURM scripts in the original metadata location
  • “lost+found/” stores all the unsuccessful jobs
  • “queue_list.dat” contains the various cluster/partition limits that can be defined by the user before a job-run
  • “screeninglib/” contains all the files and folders for library-generation purposes
  • “trial.config” is the config-file for this project
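The layout described above can be checked with a few lines of Python. This is a minimal sketch and not part of ChemHTPS itself: the function name `missing_project_entries` is hypothetical, the expected entries are taken from the directory listing above, and it assumes the config-file is always named after the project folder (as with “trial/trial.config”).

```python
from pathlib import Path

# Directories expected in a freshly set-up project, taken from the listing above.
EXPECTED_DIRS = ["archive", "db", "jobpool", "job_templates", "lost+found", "screeninglib"]


def missing_project_entries(project_dir):
    """Return the names of expected entries that are absent from project_dir."""
    root = Path(project_dir)
    # Assumption: the config-file is named after the project folder.
    expected_files = ["queue_list.dat", root.name + ".config"]
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in expected_files if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    # An empty list means the project folder has the expected layout.
    print(missing_project_entries("trial"))
```

An empty return value indicates that nothing is missing; any names returned point to entries that were deleted or never created.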

Although ChemHTPS can be used purely for work-flow purposes, in our example we will show its use while interfacing with ChemLG for library generation.

Once the project has been set up as shown above, the config-file “trial.config” will look like:

project_name = trial
user_name = yudhajit
log_file = trial.log
error_file = trial.err

generatelib
generation_file = none
building_block = none
cores = 0
clusters = none

generatejobs
library_in = none
job_type = none
program = none
template = none

feedjobs
feed_local = TRUE
job_sched = SLURM

postprocess

populatedb
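The structure of this config-file (top-level “key = value” lines followed by module sections that each start with a bare module name) can be read with a small parser. The following is only a sketch under that assumption and is not the parser ChemHTPS uses; the function name `parse_chemhtps_config` and the “project” label for the top-level keys are hypothetical.

```python
def parse_chemhtps_config(text):
    """Parse config text into {section: {key: value}}.

    Keys that appear before the first module name are stored under "project".
    """
    config = {"project": {}}
    section = "project"
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if "=" in line:
            key, value = line.split("=", 1)
            config[section][key.strip()] = value.strip()
        else:
            # A bare word such as "generatelib" starts a new module section.
            section = line
            config.setdefault(section, {})
    return config


if __name__ == "__main__":
    sample = "project_name = trial\n\ngeneratelib\ncores = 0\n\npostprocess\n"
    print(parse_chemhtps_config(sample))
```

All values are kept as strings, so a caller would need to convert entries such as “cores” to integers itself.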

Once the user has downloaded and installed ChemLG (for instructions see https://hachmannlab.github.io/chemlg/), the “config.dat” file under the “chemlg/chemlg/templates/” folder should be copied into the “trial/screeninglib/” folder.

To avoid an in-depth discussion of the mechanics of ChemLG, this example uses a library generated from a file that already contains a list of SMILES codes for molecules (“trial-smiles.dat” in the “screeninglib/” folder):

C(=O)(N)N
C(=S)(N)N
CNC(=O)NC
CN(C)C(=O)N
CC(=O)N
C(=O)(C(F)(F)F)N
C(C(CO)O)O
C(CO)O
C(CCO)CO

We will use ChemLG only to convert the SMILES in the above file into the .xyz format. The ChemLG config-file for this will look like:

Do not change the order of the generation rules below.
1. Include building blocks                  == None
2. Min and max no. of bonds                 == None
3. Min and max no. of atoms                 == None
4. Min and max mol.weight                   == None
5. Min and max no. of rings                 == None
6. Min and max no. of aromatic rings        == None
7. Min and max no. of non-aromatic rings    == None
8. Min and max no. of single bonds          == None
9. Min and max no. of double bonds          == None
10. Min and max no. of triple bonds         == None
11. Max no. of specific atoms               == None
12. Lipinski rule                           == False
13. Fingerprint matching                    == None
14. Substructure                            == None
15. Substructure exclusion                  == None
16. Include_BB                              == False


Combination type for molecules  :: Link
Number of generations           :: 0
Molecule format in output file  :: xyz
Maximum files per folder        :: 1000
Library name                    :: trial_lib_
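A config-file like the one above can be checked for being a pure format conversion: no generation rule is active and the number of generations is 0. The sketch below assumes that rules use “==” and the remaining options use “::” as separators, as in the listing; the function names are hypothetical and ChemLG itself may read the file differently.

```python
def parse_chemlg_config(text):
    """Split ChemLG config text into generation rules ('==') and options ('::')."""
    rules, options = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        if "==" in line:
            name, value = line.split("==", 1)
            # Drop the leading rule number, e.g. "12. Lipinski rule" -> "Lipinski rule".
            if "." in name:
                name = name.split(".", 1)[1]
            rules[name.strip()] = value.strip()
        elif "::" in line:
            name, value = line.split("::", 1)
            options[name.strip()] = value.strip()
    return rules, options


def is_conversion_only(rules, options):
    """True when no rule restricts the library and no new generations are built."""
    unrestricted = all(value in ("None", "False") for value in rules.values())
    return unrestricted and options.get("Number of generations") == "0"
```

Lines without either separator, such as the header comment about rule order, are ignored.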

and the ChemHTPS config-file will look something like:

project_name = trial
user_name = yudhajit
log_file = trial.log
error_file = trial.err

generatelib
generation_file = ./screeninglib/config.dat
building_block = ./screeninglib/trial-smiles.dat
cores = 1
clusters = run locally

generatejobs
library_in = none
job_type = none
program = none
template = none

feedjobs
feed_local = TRUE
job_sched = SLURM

postprocess

populatedb

Note that the number of generations in the ChemLG config-file is kept at 0 because we are not generating any new molecules, only converting the existing ones between file formats. The library generator can now be called (note: this MUST be done from within the project folder):

$ chemhtpsshell --generatelib

After the run, the “screeninglib/” folder will look like:

$ ls screeninglib/
building_blocks.dat  config.dat  final_library.csv  logfile.txt  trial_lib_xyz  trial-smiles.dat
$ ls screeninglib/trial_lib_xyz/1_1000
1.xyz  2.xyz  3.xyz  4.xyz  5.xyz  6.xyz  7.xyz  8.xyz  9.xyz
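A quick sanity check after the run is to compare the number of input SMILES with the number of generated .xyz files. The sketch below assumes one SMILES string per line and, since no new generations are built, one .xyz file per input molecule (nine of each in this example); the function names are hypothetical.

```python
from pathlib import Path


def count_smiles(smiles_file):
    """Count non-empty lines, i.e. one SMILES string per molecule."""
    with open(smiles_file) as handle:
        return sum(1 for line in handle if line.strip())


def count_xyz(library_dir):
    """Count .xyz files in all sub-folders of the generated library."""
    return sum(1 for _ in Path(library_dir).rglob("*.xyz"))


def library_is_complete(smiles_file, library_dir):
    """True when every input SMILES has a corresponding .xyz file."""
    return count_smiles(smiles_file) == count_xyz(library_dir)
```

From within the project folder this could be called as `library_is_complete("screeninglib/trial-smiles.dat", "screeninglib/trial_lib_xyz")`.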

Moving on to the job-generation step, the options for generatejobs need to be modified. For our case the config-file looks like this:

project_name = trial
user_name = yudhajit
log_file = trial.log
error_file = trial.err

generatelib
generation_file = ./screeninglib/config.dat
building_block = ./screeninglib/trial-smiles.dat
cores = 1
clusters = run locally

generatejobs
library_in = trial_lib_xyz 
job_type = long
program = ORCA
template = ./job_templates/ORCA/test_orca.inp

feedjobs
feed_local = TRUE
job_sched = SLURM

postprocess

populatedb

Input files will be generated from the template file specified above. For more information on template files and how they work, please refer to the ChemHTPS Template Guide.

The job generator is then called from within the project folder:

$ chemhtpsshell --generatejobs

Once this step is completed, the job-folders look like the following:

$ ls jobpool/long/
jg1_ORCA_trial_lib_xyz_1_1000
$ ls jobpool/long/jg1_ORCA_trial_lib_xyz_1_1000/
1  2  3  4  5  6  7  8  9
$ ls jobpool/long/jg1_ORCA_trial_lib_xyz_1_1000/1/
1.inp  1.xyz

Note that in the above case the jobs are all placed together in a folder with the prefix jg1. The prefix signifies that this group of jobs was created first; groups created later are numbered accordingly. This distinction allows the job-feeder module to proceed serially and prioritize the jobs that were created earlier. Furthermore, every cycle of job-generation is recorded in the “job_gen.log” log-file, where one can check what the config-options were for every job-generator module call. Each individual job-folder within the larger group contains the corresponding .xyz and .inp input files, as seen above with 1.xyz and 1.inp.
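The idea of prioritizing earlier job groups by their prefix can be illustrated as follows. This is a sketch of the ordering logic only, not the job-feeder's actual code; the function name `order_job_groups` is hypothetical, and it assumes that every job-group folder name starts with “jg”, a number, and an underscore.

```python
import re

# Matches the job-group prefix, e.g. "jg1_" in "jg1_ORCA_trial_lib_xyz_1_1000".
JOB_GROUP = re.compile(r"^jg(\d+)_")


def order_job_groups(folder_names):
    """Return job-group folders sorted by the number after 'jg', oldest first."""
    groups = []
    for name in folder_names:
        match = JOB_GROUP.match(name)
        if match:  # ignore anything that is not a job-group folder
            groups.append((int(match.group(1)), name))
    return [name for _, name in sorted(groups)]
```

Sorting on the integer rather than on the folder name keeps a tenth group behind the second one, which a plain alphabetical sort would not do.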

After making the appropriate changes to the feed_local and job_sched options under the feedjobs section of the config-file, we can finally execute all the jobs under the “jobpool/” folder by:

$ chemhtpsshell --feedjobs

These jobs will be submitted according to the limits set in the “queue_list.dat” file:

#cluster partition limit type
ub-hpc general-compute 0 long
ub-hpc debug 0 short
ub-hpc gpu 0 long
ub-hpc largemem 0 long
chemistry beta 10 long

In our case, we set the job limit for the “chemistry”/“beta” cluster/partition combination to 10, since we have 9 jobs to run. The definitions of the cluster/partition combinations can be changed in this file, and the corresponding job-scheduler (SLURM) scripts should be available in the “job_templates/” folder. One can try it out and note the logs and the error messages in “feedjobs.out” and “job_checker.err”, respectively.
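The four columns of “queue_list.dat” can be read into simple records to see which cluster/partition combinations would accept jobs. The sketch below is illustrative rather than the job-feeder's own logic; the names `QueueEntry`, `parse_queue_list`, and `active_queues` are hypothetical, and it assumes that a limit of 0 means no jobs are submitted to that combination.

```python
from collections import namedtuple

QueueEntry = namedtuple("QueueEntry", "cluster partition limit job_type")


def parse_queue_list(text):
    """Parse whitespace-separated rows; blank lines and '#' comments are skipped."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        cluster, partition, limit, job_type = line.split()
        entries.append(QueueEntry(cluster, partition, int(limit), job_type))
    return entries


def active_queues(entries, job_type):
    """Queues of the given type whose job limit is above zero (assumed meaning)."""
    return [e for e in entries if e.job_type == job_type and e.limit > 0]
```

With the file shown above, only the “chemistry”/“beta” entry has a non-zero limit for jobs of type “long”.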

Finally, we would like to make two important points, one of them being a reminder. All module-calls apart from --setup_project must be made from within the project directory. Additionally, all these non-setup module-calls can be combined into a one-line command after making the desired changes for all the modules in the config-file. This command simply looks like:

$ chemhtpsshell --generatelib --generatejobs --feedjobs

Note that even if the module-calls are given in a different order, such as:

$ chemhtpsshell --generatelib --feedjobs --generatejobs

the execution will still proceed in the library --> job-generation --> job-feeder order.
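One way such order-independent behaviour can be obtained is to treat each module flag as a simple switch and then walk through a fixed pipeline. The following sketch shows that idea with Python's argparse; it is not the actual chemhtpsshell implementation, the function name `modules_to_run` is hypothetical, and only the three module flags used in this guide are included.

```python
import argparse

# Fixed execution order, independent of the order of flags on the command line.
PIPELINE = ["generatelib", "generatejobs", "feedjobs"]


def modules_to_run(argv):
    """Return the requested modules in pipeline order."""
    parser = argparse.ArgumentParser(prog="chemhtpsshell")
    for name in PIPELINE:
        parser.add_argument("--" + name, action="store_true")
    args = parser.parse_args(argv)
    return [name for name in PIPELINE if getattr(args, name)]


if __name__ == "__main__":
    print(modules_to_run(["--generatelib", "--feedjobs", "--generatejobs"]))
```

Because the flags only record which modules were requested, the order in which they are typed has no influence on the order of execution.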