HTCondor is a specialized high-throughput computing (HTC) software system that enables users to harness the power of distributed computing resources to efficiently manage and execute large-scale computational workloads. It provides a framework for managing job scheduling, resource allocation, and workload management across a pool of computing resources.
HTCondor allows users (Galaxy) to submit jobs to the system, which are then scheduled and executed on available computing resources. It ensures efficient distribution of jobs based on factors such as resource availability, job requirements, and user-defined policies.
Ansible >= 2.9
Infrastructure:
| Resource | Recommended Images |
| --- | --- |
| Central Manager VM | VGGP |
| Submit VM | RockyLinux |
| Exec Nodes VMs | VGGP |
| NFS Server VM | VGGP |
| NFS Attached Volume | - |
VGGP images come with a set of prebuilt tools such as Pulsar, NFS, CVMFS, Apptainer, Docker, and Telegraf. These tools are required to run and monitor Galaxy jobs on the Virtual Galaxy Compute Nodes (VGCN). You can find the images at https://usegalaxy.eu/static/vgcn/; it is recommended to use the latest main version.
NB! Red Hat Enterprise Linux 9 (RHEL 9) deprecated SHA-1 signing for security reasons in favour of the more secure SHA-256, while older HTCondor versions still use SHA-1 signing. Take this into account when choosing the images and the HTCondor version (see the recommended images in the table above).
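If you nevertheless need to run an older, SHA-1-only HTCondor release on a RHEL 9 family host, one possible workaround (not part of the original setup, so test it in your environment) is to re-enable SHA-1 through the system-wide crypto policy:

```bash
# Allow SHA-1 signatures again via the DEFAULT:SHA1 subpolicy (weakens security; use only if required)
sudo update-crypto-policies --set DEFAULT:SHA1
# Reboot (or restart the affected services) so the new policy takes effect
sudo reboot
```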
Two directories are shared over NFS: `/opt/galaxy` (to access the Galaxy server) and `/data/share` (to access uploaded data and the result data produced by tools). On the NFS server, add them to `/etc/exports`:
/data/share *(rw,sync,no_root_squash)
/opt/galaxy *(rw,sync,no_root_squash)
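After editing `/etc/exports` on the NFS server, you can re-export and sanity-check the shares with the standard NFS utilities (a quick check, not required by the setup itself):

```bash
sudo exportfs -ra                 # re-read /etc/exports and apply changes
sudo exportfs -v                  # list what the server currently exports
showmount -e <nfs_server_ip>      # from a client: list the exports offered by the server
```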
Example Ansible playbook:

```yaml
- name: Mount NFS
  hosts: <[CM, submit_vm]>
  become: true
  become_user: root
  tasks:
    - name: Mount network share
      ansible.posix.mount:
        src: "{{ item.src }}"
        path: "{{ item.mountpoint }}"
        fstype: nfs
        state: mounted
      with_items:
        - { src: "<nfs_server_ip>:/data/share", mountpoint: /data/share }
        - { src: "<nfs_server_ip>:/opt/galaxy", mountpoint: /opt/galaxy }
```
Alternatively, you can use the `mount.yml` Ansible playbook provided in the infrastructure-playbook repository. It can be run with:
ansible-playbook --private-key <path_to_priv_key> -i hosts mount.yml
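Once the shares are mounted (either manually or via the playbook), you can confirm them on the client with standard tools:

```bash
findmnt -t nfs,nfs4               # list all NFS mounts on this host
df -h /data/share /opt/galaxy     # check that both paths are backed by the NFS server
```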
The shared directories can also be automounted with autofs, using the following configuration files.

/etc/auto.master.d/data.autofs:

/data /etc/auto.data nfsvers=3
/- /etc/auto.usrlocal nfsvers=3

/etc/auto.data:

share -rw,hard,intr,nosuid,quota <nfs_server_ip>:/data/share

/etc/auto.usrlocal:

/opt/galaxy -rw,hard,nosuid <nfs_server_ip>:/opt/galaxy
The `/etc/auto.master.d/data.autofs` file is included by the `auto.master` configuration. The `/data` directory is automounted using the map file `/etc/auto.data`, while `/-` declares a direct map whose absolute paths are defined in `/etc/auto.usrlocal`. NFS version 3 is used. The mount options specified for the NFS shares include read-write access (rw), hard mounting (hard), interruptible mounts (intr), no setuid binaries (nosuid), and quota support (quota).
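To apply and test the autofs configuration (assuming autofs is installed on the host), restart the service and trigger an automount by accessing the paths:

```bash
sudo systemctl restart autofs     # reload the maps
ls /data/share                    # accessing the path triggers the automount
ls /opt/galaxy
automount -m                      # dump the maps autofs has loaded (useful for debugging)
```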
This step can be automated with the usegalaxy-eu.autofs role by specifying the following variables:
- role: usegalaxy-eu.autofs
  vars:
    autofs_service.install: true
    autofs_service.enable: true
    nfs_kernel_tuning: true
    autofs_mount_points:
      - data
      - usrlocal
    autofs_conf_files:
      data:
        - share -rw,hard,nosuid <nfs_server_ip>:/data/share
      usrlocal:
        - /opt/galaxy -rw,hard,nosuid <nfs_server_ip>:/opt/galaxy
HTCondor consists of three primary roles: Submit, Central Manager and Executor:
Submit Role:
The Submit role is responsible for submitting jobs to the HTCondor system. Users (Galaxy) interact with the Submit role to define the requirements, input data, and execution details of the jobs. Once the job is submitted, the Submit role communicates with the Central Manager to find suitable resources for executing the job. It acts as the entry point for Galaxy to submit computational tasks to the HTCondor system.
Central Manager Role:
The Central Manager acts as the central coordination point within an HTCondor pool. It maintains a global view of the available computing resources and manages job scheduling and resource allocation. When a job is submitted, the Central Manager receives the request from the Submit role, matches it with appropriate resources in the pool, and provides monitoring and status updates to the Submit role and other components.
Executor Role:
The Executor role represents the computing resources available in the HTCondor pool, such as individual machines or clusters. Executors run on these resources and execute the submitted jobs. Once the Central Manager allocates a job to an Executor, it transfers the necessary input files and instructions to the Executor (in this setup the files and submit configurations are stored in shared directories). The Executor then manages the execution of the job, which can include launching and monitoring the application, handling input/output operations, and reporting the job status back to the Central Manager.
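Once the pool is up and running (after the installation steps below), you can see which hosts advertise each role using the standard condor_status selectors:

```bash
condor_status -collector      # Central Manager: collector daemon
condor_status -negotiator     # Central Manager: negotiator daemon
condor_status -schedd         # Submit host(s): schedd daemon
condor_status -startd         # Execute node(s): startd daemon (slots)
```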
Starting from version 9.0.0, HTCondor introduces a new authentication mechanism called IDTOKENS. With this method, it is necessary to set the same password for all machines in the cluster. To enable this authentication mechanism, the condor_collector process (located on the Central Manager in this configuration) will automatically generate the pool signing key named POOL upon startup if the file does not exist.
Using the latest stable version of HTCondor is recommended (v10.6.0 at the time of writing). When setting up HTCondor, it is generally advised to follow a specific sequence: start with the central manager, add the access point(s), and finally include the execute machine(s).
HTCondor can be set up either manually or with the existing `usegalaxy_eu.htcondor` Ansible role. To install it manually on the Central Manager, run:
curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="<htcondor_password>" /bin/bash -s -- --no-dry-run --central-manager <central_manager_name_or_IP>
Then add the following configuration to `/etc/condor/condor_config.local`. This configuration starts the necessary daemons; the collector daemon is the one responsible for issuing the pool signing key used to authorize the other machines.
ALLOW_WRITE = *
ALLOW_READ = $(ALLOW_WRITE)
ALLOW_NEGOTIATOR = $(ALLOW_WRITE)
ALLOW_ADMINISTRATOR = $(ALLOW_NEGOTIATOR)
ALLOW_OWNER = $(ALLOW_ADMINISTRATOR)
ALLOW_CLIENT = *
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD
FILESYSTEM_DOMAIN = <condor_fs_domain>
UID_DOMAIN = <condor_uid_domain>
TRUST_UID_DOMAIN = True
SOFT_UID_DOMAIN = True
sudo systemctl restart condor
sudo systemctl enable condor
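After restarting, you can check that the collector has created the pool signing key and that the daemons are up (the paths below are HTCondor's defaults; adjust them if your configuration overrides them):

```bash
sudo ls /etc/condor/passwords.d/    # should now contain the POOL signing key
sudo condor_token_list              # tokens installed in the trusted token directories
condor_status -any                  # ClassAds from all daemons currently known to the pool
```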
You can automate the installation using the `usegalaxy_eu.htcondor` Ansible role. NB! The `condor_password` variable is sensitive and should be defined in a vault-encrypted file. Specify the following variables:
- role: usegalaxy_eu.htcondor
  vars:
    condor_role: central-manager
    condor_host: <htcondor_CM_private_IP>
    condor_password: <htcondor_password> # sensitive value
    condor_allow_write: "*"
    condor_daemons:
      - COLLECTOR
      - MASTER
      - NEGOTIATOR
      - SCHEDD
    condor_allow_negotiator: $(ALLOW_WRITE)
    condor_allow_administrator: $(ALLOW_NEGOTIATOR)
    condor_fs_domain: <fs_domain_name>
    condor_uid_domain: <uid_domain_name>
    condor_enforce_role: false
The `condor_token_request_auto_approve` command automatically approves token requests from daemons starting on a specified network for a fixed period of time. Within the auto-approval rule's lifetime, start the submit and execute hosts inside the appropriate network; the token requests for the corresponding daemons (condor_master, condor_startd, and condor_schedd) will be automatically approved and installed into /etc/condor/tokens.d/.
This feature can be implemented simply by running:
condor_token_request_auto_approve -netblock <private_network_CIDR> -lifetime 3600
You can create a cron job that renews the rule every hour, with a lifetime slightly longer than the interval, so the rule is always active and token requests are always approved:
0 * * * * /usr/bin/condor_token_request_auto_approve -netblock <private_network_CIDR> -lifetime 3660
Example Ansible task:
tasks:
  - name: Condor auto approve
    ansible.builtin.cron:
      name: condor_auto_approve
      minute: 0
      job: '/usr/bin/condor_token_request_auto_approve -netblock <private_network_CIDR> -lifetime 3660'
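While the rule is active you can inspect the token flow, for example to debug a host that does not join the pool:

```bash
# On the Central Manager: list outstanding token requests (they should be approved automatically)
condor_token_request_list
# On a submit/execute host: approved tokens are installed into the trusted token directory
sudo ls /etc/condor/tokens.d/
```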
On the Submit host, install HTCondor by running:

curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="<htcondor_password>" /bin/bash -s -- --no-dry-run --submit <central_manager_name_or_IP>
Then add the following configuration to `/etc/condor/condor_config.local`:
ALLOW_WRITE = *
ALLOW_READ = $(ALLOW_WRITE)
ALLOW_NEGOTIATOR = $(ALLOW_WRITE)
ALLOW_ADMINISTRATOR = $(ALLOW_NEGOTIATOR)
ALLOW_OWNER = $(ALLOW_ADMINISTRATOR)
ALLOW_CLIENT = *
DAEMON_LIST = MASTER, SCHEDD
FILESYSTEM_DOMAIN = <condor_fs_domain>
UID_DOMAIN = <condor_uid_domain>
TRUST_UID_DOMAIN = True
SOFT_UID_DOMAIN = True
sudo systemctl restart condor
sudo systemctl enable condor
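To verify that the Submit host has joined the pool, query the pool from it (a quick check; the queue will still be empty at this point):

```bash
condor_status -schedd     # the new schedd should be listed
condor_q                  # confirms the local schedd answers; no jobs yet
```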
Automate this step using the `usegalaxy_eu.htcondor` role by specifying the following variables:
- role: usegalaxy_eu.htcondor
  vars:
    condor_role: submit
    condor_host: <htcondor_CM_private_IP>
    condor_password: <htcondor_password> # sensitive value
    condor_allow_write: "*"
    condor_daemons:
      - MASTER
      - SCHEDD
    condor_allow_negotiator: $(ALLOW_WRITE)
    condor_allow_administrator: $(ALLOW_NEGOTIATOR)
    condor_fs_domain: <fs_domain_name>
    condor_uid_domain: <uid_domain_name>
    condor_enforce_role: false
On the Execute node(s), install HTCondor by running:

curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="<htcondor_password>" /bin/bash -s -- --no-dry-run --execute <central_manager_name_or_IP>
Then add the following configuration to `/etc/condor/condor_config.local`:
ALLOW_WRITE = *
ALLOW_READ = $(ALLOW_WRITE)
ALLOW_ADMINISTRATOR = *
ALLOW_NEGOTIATOR = $(ALLOW_ADMINISTRATOR)
ALLOW_CONFIG = $(ALLOW_ADMINISTRATOR)
ALLOW_DAEMON = $(ALLOW_ADMINISTRATOR)
ALLOW_OWNER = $(ALLOW_ADMINISTRATOR)
ALLOW_CLIENT = *
DAEMON_LIST = MASTER, SCHEDD, STARTD
FILESYSTEM_DOMAIN = <condor_fs_domain>
UID_DOMAIN = <condor_uid_domain>
TRUST_UID_DOMAIN = True
SOFT_UID_DOMAIN = True
# run with partitionable slots
CLAIM_PARTITIONABLE_LEFTOVERS = True
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = True
ALLOW_PSLOT_PREEMPTION = False
STARTD.PROPORTIONAL_SWAP_ASSIGNMENT = True
sudo systemctl restart condor
sudo systemctl enable condor
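From the Central Manager or the Submit host you can verify that the execute node has joined and advertises a partitionable slot:

```bash
condor_status                                   # the new node should appear as Unclaimed/Idle
condor_status -af Name SlotType Cpus Memory     # SlotType should be "Partitionable"
```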
Automate this step using the `usegalaxy_eu.htcondor` role by specifying the following variables:
- role: usegalaxy_eu.htcondor
  vars:
    condor_role: execute
    condor_host: <htcondor_CM_private_IP>
    condor_password: <htcondor_password> # sensitive value
    condor_allow_write: "*"
    condor_daemons:
      - MASTER
      - SCHEDD
      - STARTD
    condor_allow_negotiator: $(ALLOW_WRITE)
    condor_allow_administrator: $(ALLOW_WRITE)
    condor_fs_domain: <fs_domain_name>
    condor_uid_domain: <uid_domain_name>
    condor_enforce_role: false
    condor_extra: |
      # run with partitionable slots
      CLAIM_PARTITIONABLE_LEFTOVERS = True
      NUM_SLOTS = 1
      NUM_SLOTS_TYPE_1 = 1
      SLOT_TYPE_1 = cpus=100%
      SLOT_TYPE_1_PARTITIONABLE = True
      ALLOW_PSLOT_PREEMPTION = False
      STARTD.PROPORTIONAL_SWAP_ASSIGNMENT = True
Some useful commands for checking that the installation and configuration succeeded, and for debugging:
| Command | Description |
| --- | --- |
| `condor_version` | Displays the version and build information of the HTCondor software, including the version number, release date, and details about the installation. |
| `condor_status` | Retrieves the current status of the HTCondor pool: the available resources, the number of slots, their state (idle, busy, etc.), and resource utilization. |
| `condor_status -af Name Slottype Cpus Memory Disk` | Extends `condor_status` with more detailed slot information: name, slot type, CPU count, memory, and disk space for each slot in the pool. |
| `condor_history` | Retrieves historical information about completed HTCondor jobs: job status, submission time, completion time, and resource usage. |
| `condor_q` | Displays the current status of jobs in the HTCondor queue: job ID, status (running, idle, held, etc.), priority, submission time, and other details. |
| `condor_q -better-analyze <job_id>` | Performs a detailed analysis of a specific job in the queue: its resource requirements, resource usage, priority, and other factors that affect its execution. |
| `condor_q -run` | Displays the jobs in the queue that are currently running, including their ID, status, resource usage, and other relevant details. |
| `condor_q -hold` | Lists the jobs in the queue that are currently on hold, including their ID and the reason for being held. |
| `condor_tail <job_id>` | Displays the output produced by a running or completed job, allowing you to monitor its progress and view its standard output and error streams. |
| `condor_ssh_to_job <job_id>` | Establishes an SSH connection to the machine where a specific job is running, giving interactive access to the job's execution environment for troubleshooting and debugging. |
| `condor_submit <submit_file>` | Submits a job to HTCondor using a job description (submit) file that specifies the job's requirements, input files, and execution details. |
| `condor_rm <job_id>` | Removes a specific job from the queue, cancelling its execution and freeing the allocated resources. |
| `condor_token_request_auto_approve` | Automatically approves daemon token requests coming from a specified netblock for a limited time, so new hosts can join the pool without manual approval of each request. |
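As an end-to-end check, you can submit a small test job from the Submit host. The submit file below is a generic sleep job written for this check (not one generated by Galaxy); adjust the paths if needed:

```bash
# Write a minimal submit description file
cat > /tmp/test.sub <<'EOF'
universe     = vanilla
executable   = /bin/sleep
arguments    = 60
output       = /tmp/test.out
error        = /tmp/test.err
log          = /tmp/test.log
request_cpus = 1
queue
EOF

condor_submit /tmp/test.sub    # submit the job; note the reported cluster ID
condor_q                       # the job should go from idle to running on an execute node
condor_rm <job_id>             # remove it again once the check succeeds
```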
- HTCondor Documentation.
- HTCondor Cluster Deployment using Terraform and Ansible provisioning.
- VGCN Infrastructure Management - Jenkins setup for managing Virtual Galaxy Compute Nodes.
- Pre-built VGGP Images repository.