Submit to a Compute Cluster via PBS

Some algorithms lend themselves well to massively parallel compute clusters. BCL Conversion is one such example, and it forms the basis of this application example.

The concepts illustrated here are not limited to BCL Conversion and may be applied in other scenarios. For instance, the example PBS script below uses syntax for Illumina's CASAVA tool, but it could easily be repurposed for the bcl2fastq tool.

In this example, Portable Batch System (PBS) is used as the job submission mechanism for the compute cluster, which has read/write access to the storage system holding the data to be converted.

Example PBS file

For illustrative purposes, an example PBS file is shown here. (As there are many ways to configure PBS, it is likely that the content of your PBS file(s) will differ from the example provided.)

#!/bin/bash
#PBS -N run_casava
#PBS -q himem
#PBS -l nodes=1:ppn=20

export RUN_DIR=/data/instrument_data/120210_SN1026_0092_BXXXXXXXXX
export OUTPUT_DIR=/data/processed_data/processed_data.1.8.2/120210_SN1026_0092_BXXXXXXXXX
export SAMPLE_SHEET=/data/SampleSheets/samplesheet.csv

cd $PBS_O_WORKDIR

source /etc/profile.d/modules.sh
module load casava-1.8.2

export TMPDIR=/scratch/
export NUM_PROCESSORS=$((PBS_NUM_NODES*PBS_NUM_PPN))

configureBclToFastq.pl --input-dir $RUN_DIR/Data/Intensities/BaseCalls --output-dir $OUTPUT_DIR \
    --sample-sheet $SAMPLE_SHEET --force --ignore-missing-bcl --ignore-missing-stats \
    --use-bases-mask y*,I6,y* --mismatches 1

cd $OUTPUT_DIR
make -j $NUM_PROCESSORS
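
Once the file has been saved on a host where the PBS client tools are configured (for example, as run_casava.pbs — an illustrative file name), the job can be submitted and monitored with the standard PBS commands:

qsub run_casava.pbs
qstat -u $USER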

Solution

Process configuration

In this example, the BCL Conversion process is configured to:

  • Accept a ResultFile input.

  • Produce at least two ResultFile outputs.

The process is configured with four process-level UDFs, which are referenced from the external program parameter string: Number of Cores, Number of mismatches, Bases mask, and Run Name.

The syntax for the external program parameter is as follows (wrapped here for readability; it is entered as a single command line):

python /opt/gls/clarity/customextensions/ClusterBCL.py -l {processLuid} -u {username} -p {password}
    -c {udf:Number of Cores} -m {udf:Number of mismatches} -b "{udf:Bases mask}"
    -a {compoundOutputFileLuid0}.txt -e {compoundOutputFileLuid1}.txt -r "{udf:Run Name}"

Parameters

The parameters passed to ClusterBCL.py map to the tokens in the command line above:

  • -l — the LIMS ID (LUID) of the BCL Conversion process ({processLuid}).

  • -u / -p — the API user name and password ({username}, {password}).

  • -c — the number of cores to request ({udf:Number of Cores}).

  • -m — the number of allowed mismatches ({udf:Number of mismatches}).

  • -b — the bases mask ({udf:Bases mask}).

  • -a / -e — the names of the .txt files attached to the two ResultFile outputs ({compoundOutputFileLuid0}, {compoundOutputFileLuid1}), typically used for log and error output.

  • -r — the run name ({udf:Run Name}).

User Interaction and Results

  1. The user runs the BCL Conversion process on the output of the Illumina Sequencing process. The sequencing process is aware of the Run ID, as this information is stored as a process-level user-defined field (UDF).

  2. The user supplies the following information, which is stored as process-level UDFs on the BCL Conversion process:

    • The name of the folder in which the converted data should be stored.

    • The bases mask to be used.

    • The number of mismatches.

    • The number of CPUs to dedicate to the job.

  3. The BCL Conversion process launches a script (via the EPP node on the Clarity LIMS server) that does the following (a simplified sketch is shown after this list):

    • Builds the PBS file based on the user's input.

    • Submits the job by invoking the 'qsub' command along with the PBS file.
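
The attached ClusterBCL.py implements this logic for Python 2.7. The snippet below is a minimal, hypothetical sketch of the same idea in modern Python, not the attached script: it assumes the UDF values arrive as command-line arguments (as resolved by the EPP token syntax shown earlier), that qsub is available on the PATH, and that the queue, module, and directory names from the example PBS file above apply. All file names and paths are illustrative only.

import argparse
import subprocess
from pathlib import Path

# Illustrative PBS template; queue, module, and paths would need to match your cluster.
PBS_TEMPLATE = """#!/bin/bash
#PBS -N run_casava
#PBS -q himem
#PBS -l nodes=1:ppn={cores}

export RUN_DIR=/data/instrument_data/{run_name}
export OUTPUT_DIR=/data/processed_data/processed_data.1.8.2/{run_name}
export SAMPLE_SHEET=/data/SampleSheets/samplesheet.csv

cd $PBS_O_WORKDIR
source /etc/profile.d/modules.sh
module load casava-1.8.2

export TMPDIR=/scratch/
export NUM_PROCESSORS=$((PBS_NUM_NODES*PBS_NUM_PPN))

configureBclToFastq.pl --input-dir $RUN_DIR/Data/Intensities/BaseCalls --output-dir $OUTPUT_DIR \\
    --sample-sheet $SAMPLE_SHEET --force --ignore-missing-bcl --ignore-missing-stats \\
    --use-bases-mask {bases_mask} --mismatches {mismatches}

cd $OUTPUT_DIR
make -j $NUM_PROCESSORS
"""


def main():
    parser = argparse.ArgumentParser(description="Build and submit a BCL Conversion PBS job.")
    parser.add_argument("-l", dest="process_luid", required=True, help="LUID of the BCL Conversion process")
    parser.add_argument("-u", dest="username", required=True, help="API user name")
    parser.add_argument("-p", dest="password", required=True, help="API password")
    parser.add_argument("-c", dest="cores", required=True, help="Number of Cores UDF")
    parser.add_argument("-m", dest="mismatches", required=True, help="Number of mismatches UDF")
    parser.add_argument("-b", dest="bases_mask", required=True, help="Bases mask UDF")
    parser.add_argument("-a", dest="log_file", required=True, help="file to attach as the job log")
    parser.add_argument("-e", dest="err_file", required=True, help="file to attach for errors")
    parser.add_argument("-r", dest="run_name", required=True, help="Run Name UDF")
    args = parser.parse_args()
    # The user name and password would be used for REST API calls in the full script;
    # they are not needed in this sketch.

    # Build the PBS file from the user's input.
    pbs_text = PBS_TEMPLATE.format(
        cores=args.cores,
        run_name=args.run_name,
        bases_mask=args.bases_mask,
        mismatches=args.mismatches,
    )
    pbs_path = Path("run_casava_{}.pbs".format(args.process_luid))
    pbs_path.write_text(pbs_text)

    # Submit the job; on success, qsub prints the assigned job ID to stdout.
    result = subprocess.run(["qsub", str(pbs_path)], capture_output=True, text=True)
    Path(args.log_file).write_text("Submitted PBS job: {}\n".format(result.stdout.strip()))
    if result.returncode != 0:
        Path(args.err_file).write_text(result.stderr)
        raise SystemExit("qsub failed: {}".format(result.stderr.strip()))


if __name__ == "__main__":
    main()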

Assumptions and Notes

  • Portable Batch System (PBS) is used as the job submission mechanism to the compute cluster.

  • The compute cluster has read/write access to the storage system holding the data to be converted.

  • There is an EPP node running on the Clarity LIMS server.

  • The PBS client tools have been installed and configured on the Clarity LIMS server, such that the 'qsub' command can be launched directly from the server.

  • When the 'qsub' command is invoked, a PBS file is referenced; this file contains the job description and parameters.

  • The script was written in Python (version 2.7) and relies upon the GLSRestApiUtil.py module. Both files are attached below. The required Python utility is available for download at Obtain and Use the REST API Utility Classes.

  • The example code is provided for illustrative purposes only. It does not contain sufficient exception handling for use 'as is' in a production environment.

Attachments

ClusterBCL.py:
