HTCondor/ARC Installation on VObox¶
This documentation describes how to configure a VObox to let it submit ALICE jobs to HTCondor CEs or ARC CEs. Refer to the appropriate section as needed.
The VObox will typically have been set up first as a WLCG VObox as documented here:
WLCG VObox deployment documentation
Mind adding the VOMS client configuration for ALICE:
~# yum install wlcg-voms-alice
HTCondor¶
The VObox will run its own HTCondor services that are independent of the HTCondor services for your CE and batch system. The following instructions assume you are using CentOS/EL 7.5+. See below for installations compatible with EL 9.
Install HTCondor on CentOS 7¶
- Install the EGI UMD 4 repository rpm:
~# yum install http://repository.egi.eu/sw/production/umd/4/centos7/x86_64/updates/umd-release-4.1.3-1.el7.centos.noarch.rpm
- Install HTCondor 9.0.16 or a later 9.0.x version (not yet 10.x):
~# cd
~# yum update
~# yum install condor
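Since the instructions pin the 9.0.x LTS series, it can help to guard against an accidental jump to 10.x before restarting services. A minimal sketch, assuming the hard-coded version string stands in for the real output of `condor_version`:

```shell
# Sketch: guard against accidentally running a release outside 9.0.x.
# The hard-coded string is illustrative; in practice it would come from:
#   ver=$(condor_version | awk '/CondorVersion/ {print $2}')
ver="9.0.16"
case $ver in
  9.0.*) echo "OK: $ver is in the supported 9.0.x LTS series" ;;
  *)     echo "WARNING: $ver is outside the 9.0.x series" ;;
esac
```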
JAliEn Configuration¶
This configuration dates from when HTCondor was used to run a JobRouter on the VObox; the JobRouter itself is no longer needed.
- Go to the HTCondor configuration folder:
~# cd /etc/condor
- Create a local configuration file for HTCondor:
~# touch config.d/01_alice_jobrouter.config
- Add and adjust the following configuration content:
config.d/01_alice_jobrouter.config
DAEMON_LIST = MASTER, SCHEDD, COLLECTOR

# the next line is needed since recent HTCondor versions
COLLECTOR_HOST = $(FULL_HOSTNAME)

GSI_DAEMON_DIRECTORY = /etc/grid-security
GSI_DAEMON_CERT = $(GSI_DAEMON_DIRECTORY)/hostcert.pem
GSI_DAEMON_KEY = $(GSI_DAEMON_DIRECTORY)/hostkey.pem
GSI_DAEMON_TRUSTED_CA_DIR = $(GSI_DAEMON_DIRECTORY)/certificates

SEC_CLIENT_AUTHENTICATION_METHODS = SCITOKENS, FS, GSI
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI
SEC_DAEMON_AUTHENTICATION_METHODS = FS, GSI
AUTH_SSL_CLIENT_CADIR = /etc/grid-security/certificates

COLLECTOR.ALLOW_ADVERTISE_MASTER = condor@fsauth/$(FULL_HOSTNAME)
COLLECTOR.ALLOW_ADVERTISE_SCHEDD = $(FULL_HOSTNAME)

ALL_DEBUG = D_FULLDEBUG D_COMMAND
SCHEDD_DEBUG = D_FULLDEBUG
GRIDMANAGER_DEBUG = D_FULLDEBUG

FRIENDLY_DAEMONS = condor@fsauth/$(FULL_HOSTNAME), root@fsauth/$(FULL_HOSTNAME), $(FULL_HOSTNAME)
ALLOW_DAEMON = $(FRIENDLY_DAEMONS)
SCHEDD.ALLOW_WRITE = $(FRIENDLY_DAEMONS), *@cern.ch/$(FULL_HOSTNAME)

# more stuff from the CERN VOboxes
CONDOR_FSYNC = False
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 10000
GRIDMANAGER_JOB_PROBE_INTERVAL = 600
GRIDMANAGER_MAX_PENDING_REQUESTS = 500
GRIDMANAGER_GAHP_CALL_TIMEOUT = 3600
GRIDMANAGER_SELECTION_EXPR = (ClusterId % 2)
GRIDMANAGER_GAHP_RESPONSE_TIMEOUT = 300
GRIDMANAGER_DEBUG =

ALLOW_DAEMON = $(ALLOW_DAEMON), $(FULL_HOSTNAME), $(IP_ADDRESS), unauthenticated@unmapped
COLLECTOR.ALLOW_ADVERTISE_MASTER = $(COLLECTOR.ALLOW_ADVERTISE_MASTER), $(ALLOW_DAEMON)
COLLECTOR.ALLOW_ADVERTISE_SCHEDD = $(COLLECTOR.ALLOW_ADVERTISE_SCHEDD), $(ALLOW_DAEMON)

DELEGATE_JOB_GSI_CREDENTIALS_LIFETIME = 0
GSI_SKIP_HOST_CHECK = true
- Restart HTCondor now and enable it to start at boot time:
~# service condor restart
~# chkconfig condor on
- Check that HTCondor is running and produces the following initial output:
~# pstree | grep condor
 |-condor_master-+-condor_collecto
 |               |-condor_procd
 |               |-condor_schedd
 |               `-condor_shared_p
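One setting in the configuration above deserves a note: `GRIDMANAGER_SELECTION_EXPR = (ClusterId % 2)` makes HTCondor run one gridmanager process per distinct value of the expression, spreading the load over two processes by cluster id parity. The shell sketch below only illustrates the partitioning; the cluster ids are made up:

```shell
# Illustration only: (ClusterId % 2) yields two distinct values,
# so jobs are split over two gridmanager instances by parity.
for cluster in 1000 1001 1002 1003; do
  echo "ClusterId $cluster -> gridmanager instance $((cluster % 2))"
done
```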
Install HTCondor on EL 9¶
- Install the HTCondor LTS ("stable") release:
~# (umask 077; uuidgen > ~/pool-pwd-$$.txt)
~# curl -fsSL https://get.htcondor.org | GET_HTCONDOR_PASSWORD=`cat ~/pool-pwd-$$.txt` /bin/bash -s -- --no-dry-run --channel stable
- Move the original configuration file out of the way:
~# mv /etc/condor/config.d/00-minicondor ~/00-minicondor.orig
- Add the following configuration contents:
/etc/condor/config.d/99-minicondor.vobox
# HTCONDOR CONFIGURATION TO CREATE A POOL WITH ONE MACHINE
# --> modified to allow it to be used ONLY for submitting to REMOTE CEs!
#
# This file was created upon initial installation of HTCondor.
# It contains configuration settings to set up a secure HTCondor
# installation consisting of **just one single machine**.
# YOU WILL WANT TO REMOVE THIS FILE IF/WHEN YOU DECIDE TO ADD ADDITIONAL
# MACHINES TO YOUR HTCONDOR INSTALLATION! Most of these settings do
# not make sense if you have a multi-server pool.
#
# See the Quick Start Installation guide at:
#     https://htcondor.org/manual/quickstart.html

# --- NODE ROLES ---
# Every pool needs one Central Manager, some number of Submit nodes and
# as many Execute nodes as you can find. Consult the manual to learn
# about additional roles.

use ROLE: CentralManager
use ROLE: Submit
# --> next line commented out to prevent jobs from running on this host:
# use ROLE: Execute

# --- NETWORK SETTINGS ---
# Configure HTCondor services to listen to port 9618 on the IPv4
# loopback interface.
NETWORK_INTERFACE = 127.0.0.1
# --> next line added to allow job submissions to remote CEs:
NETWORK_INTERFACE = *
BIND_ALL_INTERFACES = False
CONDOR_HOST = 127.0.0.1
# --> next line added to avoid condor_status errors:
CONDOR_HOST = $(HOSTNAME)

# --- SECURITY SETTINGS ---
# Verify authenticity of HTCondor services by checking if they are
# running with an effective user id of user "condor".
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_INTEGRITY = REQUIRED
ALLOW_DAEMON = condor@$(UID_DOMAIN)
ALLOW_NEGOTIATOR = condor@$(UID_DOMAIN)

# Configure so only user root or user condor can run condor_on,
# condor_off, condor_restart, and condor_userprio commands to manage
# HTCondor on this machine.
# If you wish any user to do so, comment out the line below.
ALLOW_ADMINISTRATOR = root@$(UID_DOMAIN) condor@$(UID_DOMAIN)

# Allow anyone (on the loopback interface) to submit jobs.
ALLOW_WRITE = *
# Allow anyone (on the loopback interface) to run condor_q or condor_status.
ALLOW_READ = *

# --- PERFORMANCE TUNING SETTINGS ---
# Since there is just one server in this pool, we can tune various
# polling intervals to be much more responsive than the system defaults
# (which are tuned for pools with thousands of servers). This will
# enable jobs to be scheduled faster, and job monitoring to happen more
# frequently.
SCHEDD_INTERVAL = 5
NEGOTIATOR_INTERVAL = 2
NEGOTIATOR_CYCLE_DELAY = 5
STARTER_UPDATE_INTERVAL = 5
SHADOW_QUEUE_UPDATE_INTERVAL = 10
UPDATE_INTERVAL = 5
RUNBENCHMARKS = 0

# --- COMMON CHANGES ---
# Uncomment the lines below and do 'sudo condor_reconfig' if you wish
# condor_q to show jobs from all users with one line per job by default.
#CONDOR_Q_DASH_BATCH_IS_DEFAULT = False
#CONDOR_Q_ONLY_MY_JOBS = False

# next line added to run only the daemons necessary on a VObox
DAEMON_LIST = MASTER, SCHEDD, COLLECTOR
/etc/condor/config.d/99-alice-vobox.conf
# non-standard settings for an ALICE VObox
CONDOR_FSYNC = False
GRIDMANAGER_DEBUG =
GRIDMANAGER_GAHP_CALL_TIMEOUT = 3600
GRIDMANAGER_GAHP_RESPONSE_TIMEOUT = 300
GRIDMANAGER_JOB_PROBE_INTERVAL = 600
GRIDMANAGER_MAX_PENDING_REQUESTS = 500
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 10000
GRIDMANAGER_SELECTION_EXPR = (ClusterId % 2)
- Restart HTCondor now and enable it to start at boot time:
~# systemctl restart condor
~# systemctl enable --now condor
- Check that HTCondor is running and produces the following initial output:
~# pstree | grep condor
 |-condor_master-+-condor_collecto
 |               |-condor_procd
 |               |-condor_schedd
 |               `-condor_shared_p
LDAP and VObox Configuration (for the ALICE grid team)¶
In the Environment section, these values need to be added/adjusted as needed:
| Definition | Description |
| --- | --- |
| `CE_LCGCE=your-ce01.your-domain:9619, your-ce02.your-domain:9619, ...` | CE list example |
| `USE_TOKEN=[0-2]` | use X509 proxy, WLCG token, or both |
| `SUBMIT_ARGS=-append "+TestClassAd=1" ...` | extra options for the condor_submit command, e.g. to add extra ClassAd(s) to the job description |
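Taken together, the Environment section entries might look like the fragment below; the host names, the token mode, and the extra ClassAd are placeholders to adapt per site:

```
CE_LCGCE=your-ce01.your-domain:9619, your-ce02.your-domain:9619
USE_TOKEN=2
SUBMIT_ARGS=-append "+TestClassAd=1"
```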
Mind the firewall settings on the VObox. See Network setup for more details.
Miscellaneous Scripts¶
Cleanup script to compress and eventually remove job logs and stdout/stderr files:
Clean up script
#!/bin/sh
cd ~/htcondor || exit
GZ_SIZE=10k
GZ_MINS=60
GZ_DAYS=2
RM_DAYS=7
STAMP=.stamp
prefix=cleanup-
log=$prefix`date +%y%m%d`
exec >> $log 2>&1 < /dev/null
echo === START `date`
for d in `ls -d 20??-??-??`
do
(
echo === $d
stamp=$d/$STAMP
[ -e $stamp ] || touch $stamp || exit
if find $stamp -mtime +$RM_DAYS | grep . > /dev/null
then
echo removing...
/bin/rm -r $d < /dev/null
exit
fi
cd $d || exit
find . ! -name .\* ! -name \*.gz \( -mtime +$GZ_DAYS -o \
-size +$GZ_SIZE -mmin +$GZ_MINS \) -exec gzip -9v {} \;
)
done
find $prefix* -mtime +$RM_DAYS -exec /bin/rm {} \;
echo === READY `date`
Crontab line for the cleanup script:
37 * * * * /bin/sh $HOME/cron/htcondor-cleanup.sh
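Before enabling the cron job, the age-based selection rule can be rehearsed on a throwaway directory. This sketch (all paths are temporary, and `touch -d` assumes GNU coreutils) exercises only the `-mtime` branch of the script's gzip predicate:

```shell
# Sketch: rehearse the age-based gzip selection on a scratch directory.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
mkdir 2024-01-01
touch 2024-01-01/fresh.log                # too recent to be selected
touch -d '3 days ago' 2024-01-01/old.log  # older than GZ_DAYS=2
# Age branch of the predicate used in the cleanup script:
find 2024-01-01 ! -name '.*' ! -name '*.gz' -mtime +2
cd / && rm -rf "$tmp"
```

Only `2024-01-01/old.log` should be listed; in the real script such matches are gzipped, and whole day directories are removed after RM_DAYS.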
ARC¶
LDAP Configuration¶
The following configuration parameters need to be added/adjusted as needed:
LDAP configuration examples
# optional (normally not needed): the site BDII to take running and queued job numbers from
CE_SITE_BDII=ldap://site-bdii.gridpp.rl.ac.uk:2170/mds-vo-name=RAL-LCG2,o=grid
# specifies whether to use BDII and which GLUE schema version (only 2 is supported in JAliEn)
CE_USE_BDII=2
# a list of ARC CEs to be used for jobagent submission
CE_LCGCE=arc-ce01.gridpp.rl.ac.uk:2811/nordugrid-Condor-grid3000M, ...
# arguments for arcsub command (load-balancing is done by the JAliEn CE itself)
CE_SUBMITARG=--direct
# additional parameters to arcsub, in particular to pass XRSL clauses as shown
CE_SUBMITARG_LIST=xrsl:(queue=mcore_alice)(memory="2000")(count="8")(countpernode="8")(walltime="1500")(cputime="12000")
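The CE_SUBMITARG_LIST value packs several XRSL clauses into a single string, which is easy to break when hand-editing. A throwaway sketch for checking that the parentheses stay balanced before committing the value to LDAP:

```shell
# Sketch: count opening and closing parentheses in an XRSL clause string.
xrsl='(queue=mcore_alice)(memory="2000")(count="8")(countpernode="8")(walltime="1500")(cputime="12000")'
open=$(printf '%s' "$xrsl" | tr -cd '(' | wc -c)
close=$(printf '%s' "$xrsl" | tr -cd ')' | wc -c)
if [ "$open" -eq "$close" ]; then
  echo "OK: $open clause(s), parentheses balanced"
else
  echo "ERROR: unbalanced parentheses ($open vs $close)"
fi
```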
Debug ARC for Operations (to be tested)
Set the following variable in the ~/.alien/config/CE.env file to let the arc* CLI tools log debug output in CE.log.N files:
ARC_DEBUG=1