HTCondor/ARC Installation on VObox#
This documentation describes how to configure a VObox to enable it to submit ALICE jobs to HTCondor or ARC CEs. Refer to the appropriate section as needed.
The VObox will typically have been set up first as a WLCG VObox as documented here:
WLCG VObox deployment documentation
Mind adding the VOMS client configuration for ALICE:
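For example, `/etc/vomses` entries for ALICE typically look like the following (the server names, ports, and DNs here are assumptions; verify them against the CERN VOMS operations pages before use):

```
"alice" "voms2.cern.ch" "15000" "/DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch" "alice"
"alice" "lcg-voms2.cern.ch" "15000" "/DC=ch/DC=cern/OU=computers/CN=lcg-voms2.cern.ch" "alice"
```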
HTCondor#
The VObox will run its own HTCondor services that are independent of the HTCondor services for your CE and batch system. The following instructions assume you are using CentOS/EL 7.5+. See below for installations compatible with EL 9.
Install HTCondor on CentOS 7#
- Install the EGI UMD 4 repository rpm:
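A sketch of this step; the exact release rpm version and URL are assumptions, so check the EGI UMD repository for the current one:

```shell
# Install the UMD 4 release rpm (version/URL to be verified on repository.egi.eu):
yum install -y https://repository.egi.eu/sw/production/umd/4/centos7/x86_64/updates/umd-release-4.1.3-1.el7.centos.noarch.rpm
```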
- Install HTCondor 9.0.16 or a later 9.0.x version (not yet 10.x):
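For example, pinning an explicit 9.0.x version so that a later `yum update` does not pull in 10.x (9.0.16 is just the example version from the text):

```shell
# Install a specific HTCondor 9.0.x version from the configured repository:
yum install -y condor-9.0.16
```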
JAliEn Configuration#
This configuration dates back to when HTCondor was used to run a JobRouter on the VObox; the JobRouter is not needed anymore.
- Go to the HTCondor configuration folder:
- Create the local configuration for HTCondor:
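The two steps above might look like this (the file name matches the one used below):

```shell
cd /etc/condor
touch config.d/01_alice_jobrouter.config
```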
- Add and adjust the following configuration content:
config.d/01_alice_jobrouter.config
```
DAEMON_LIST = MASTER, SCHEDD, COLLECTOR
# the next line is needed since recent HTCondor versions
COLLECTOR_HOST = $(FULL_HOSTNAME)

GSI_DAEMON_DIRECTORY = /etc/grid-security
GSI_DAEMON_CERT = $(GSI_DAEMON_DIRECTORY)/hostcert.pem
GSI_DAEMON_KEY = $(GSI_DAEMON_DIRECTORY)/hostkey.pem
GSI_DAEMON_TRUSTED_CA_DIR = $(GSI_DAEMON_DIRECTORY)/certificates

SEC_CLIENT_AUTHENTICATION_METHODS = SCITOKENS, FS, GSI
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI
SEC_DAEMON_AUTHENTICATION_METHODS = FS, GSI
AUTH_SSL_CLIENT_CADIR = /etc/grid-security/certificates

COLLECTOR.ALLOW_ADVERTISE_MASTER = condor@fsauth/$(FULL_HOSTNAME)
COLLECTOR.ALLOW_ADVERTISE_SCHEDD = $(FULL_HOSTNAME)

ALL_DEBUG = D_FULLDEBUG D_COMMAND
SCHEDD_DEBUG = D_FULLDEBUG
GRIDMANAGER_DEBUG = D_FULLDEBUG

FRIENDLY_DAEMONS = condor@fsauth/$(FULL_HOSTNAME), root@fsauth/$(FULL_HOSTNAME), $(FULL_HOSTNAME)
ALLOW_DAEMON = $(FRIENDLY_DAEMONS)
SCHEDD.ALLOW_WRITE = $(FRIENDLY_DAEMONS), *@cern.ch/$(FULL_HOSTNAME)

# more stuff from the CERN VOboxes
CONDOR_FSYNC = False
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 10000
GRIDMANAGER_JOB_PROBE_INTERVAL = 600
GRIDMANAGER_MAX_PENDING_REQUESTS = 500
GRIDMANAGER_GAHP_CALL_TIMEOUT = 3600
GRIDMANAGER_SELECTION_EXPR = (ClusterId % 2)
GRIDMANAGER_GAHP_RESPONSE_TIMEOUT = 300
GRIDMANAGER_DEBUG =

ALLOW_DAEMON = $(ALLOW_DAEMON), $(FULL_HOSTNAME), $(IP_ADDRESS), unauthenticated@unmapped
COLLECTOR.ALLOW_ADVERTISE_MASTER = $(COLLECTOR.ALLOW_ADVERTISE_MASTER), $(ALLOW_DAEMON)
COLLECTOR.ALLOW_ADVERTISE_SCHEDD = $(COLLECTOR.ALLOW_ADVERTISE_SCHEDD), $(ALLOW_DAEMON)
DELEGATE_JOB_GSI_CREDENTIALS_LIFETIME = 0
GSI_SKIP_HOST_CHECK = true
```
- Restart HTCondor now and automatically at boot time:
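With systemd this step can be done as follows:

```shell
# Start HTCondor at boot time and restart it now to pick up the configuration:
systemctl enable condor
systemctl restart condor
```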
- Check HTCondor is running and produces the following initial output:
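The check can be done with the standard client commands; on a freshly restarted Schedd with no jobs submitted yet, `condor_q` should report an empty queue (0 jobs):

```shell
systemctl status condor
condor_q
```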
Install HTCondor on EL 9#
- Install the HTCondor LTS ("stable") release:
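A sketch of this step; the repository rpm URL and the LTS channel (23.0 here) are assumptions, so check htcondor.org for the current LTS release:

```shell
# Install the HTCondor release repository and then the LTS packages:
dnf install -y https://research.cs.wisc.edu/htcondor/repo/23.0/htcondor-release-current.el9.noarch.rpm
dnf install -y condor
```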
- Move the original configuration file out of the way:
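Assuming the default file installed by the `minicondor` package is `/etc/condor/config.d/00-minicondor` (check what your installation actually provides), this could be:

```shell
cd /etc/condor/config.d
mv 00-minicondor 00-minicondor.orig
```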
- Add the following configuration contents:
/etc/condor/config.d/99-minicondor.vobox
```
# HTCONDOR CONFIGURATION TO CREATE A POOL WITH ONE MACHINE
# --> modified to allow it to be used ONLY for submitting to REMOTE CEs!
#
# This file was created upon initial installation of HTCondor.
# It contains configuration settings to set up a secure HTCondor
# installation consisting of **just one single machine**.
#
# YOU WILL WANT TO REMOVE THIS FILE IF/WHEN YOU DECIDE TO ADD ADDITIONAL
# MACHINES TO YOUR HTCONDOR INSTALLATION! Most of these settings do
# not make sense if you have a multi-server pool.
#
# See the Quick Start Installation guide at:
#     https://htcondor.org/manual/quickstart.html

# --- NODE ROLES ---
# Every pool needs one Central Manager, some number of Submit nodes and
# as many Execute nodes as you can find. Consult the manual to learn
# about additional roles.

use ROLE: CentralManager
use ROLE: Submit
# --> next line commented out to prevent jobs from running on this host:
# use ROLE: Execute

# --- NETWORK SETTINGS ---
# Configure HTCondor services to listen to port 9618 on the IPv4
# loopback interface.

NETWORK_INTERFACE = 127.0.0.1
# --> next line added to allow job submissions to remote CEs:
NETWORK_INTERFACE = *
BIND_ALL_INTERFACES = False
CONDOR_HOST = 127.0.0.1
# --> next line added to avoid condor_status errors:
CONDOR_HOST = $(HOSTNAME)

# --- SECURITY SETTINGS ---
# Verify authenticity of HTCondor services by checking if they are
# running with an effective user id of user "condor".

SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_INTEGRITY = REQUIRED
ALLOW_DAEMON = condor@$(UID_DOMAIN)
ALLOW_NEGOTIATOR = condor@$(UID_DOMAIN)

# Configure so only user root or user condor can run condor_on,
# condor_off, condor_restart, and condor_userprio commands to manage
# HTCondor on this machine.
# If you wish any user to do so, comment out the line below.

ALLOW_ADMINISTRATOR = root@$(UID_DOMAIN) condor@$(UID_DOMAIN)

# Allow anyone (on the loopback interface) to submit jobs.
ALLOW_WRITE = *
# Allow anyone (on the loopback interface) to run condor_q or condor_status.
ALLOW_READ = *

# --- PERFORMANCE TUNING SETTINGS ---
# Since there is just one server in this pool, we can tune various
# polling intervals to be much more responsive than the system defaults
# (which are tuned for pools with thousands of servers). This will
# enable jobs to be scheduled faster, and job monitoring to happen more
# frequently.

SCHEDD_INTERVAL = 5
NEGOTIATOR_INTERVAL = 2
NEGOTIATOR_CYCLE_DELAY = 5
STARTER_UPDATE_INTERVAL = 5
SHADOW_QUEUE_UPDATE_INTERVAL = 10
UPDATE_INTERVAL = 5
RUNBENCHMARKS = 0

# --- COMMON CHANGES ---
# Uncomment the lines below and do 'sudo condor_reconfig' if you wish
# condor_q to show jobs from all users with one line per job by default.

#CONDOR_Q_DASH_BATCH_IS_DEFAULT = False
#CONDOR_Q_ONLY_MY_JOBS = False

# next line added to run only the daemons necessary on a VObox
DAEMON_LIST = MASTER, SCHEDD, COLLECTOR
```
/etc/condor/config.d/99-alice-vobox.conf
```
# non-standard settings for an ALICE VObox
CONDOR_FSYNC = False
GRIDMANAGER_DEBUG =
GRIDMANAGER_GAHP_CALL_TIMEOUT = 3600
GRIDMANAGER_GAHP_RESPONSE_TIMEOUT = 300
GRIDMANAGER_JOB_PROBE_INTERVAL = 600
GRIDMANAGER_MAX_PENDING_REQUESTS = 500
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 10000
GRIDMANAGER_SELECTION_EXPR = (ClusterId % 2)
```
- Restart HTCondor now and automatically at boot time:
- Check HTCondor is running and produces the following initial output:
LDAP and VObox Configuration (for the ALICE grid team)#
In the Environment section, these values need to be added/adjusted as needed:
| Definition | Description |
|---|---|
| `CE_LCGCE=your-ce01.your-domain:9619, your-ce02.your-domain:9619, ...` | CE list example |
| `USE_TOKEN=[0-2]` | use X509 proxy, WLCG token, or both |
| `SUBMIT_ARGS=-append "+TestClassAd=1" ...` | extra options for the `condor_submit` command, e.g. to add extra ClassAd(s) to the job description |
Mind the firewall settings on the VObox. See Network setup for more details.
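For example, with firewalld the shared HTCondor port 9618/tcp must be reachable from the CEs (a sketch; restrict the sources or zones according to your site policy):

```shell
firewall-cmd --permanent --add-port=9618/tcp
firewall-cmd --reload
```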
Miscellaneous Scripts#
Cleanup script to remove old job logs and stdout/stderr files:
Clean up script
```sh
#!/bin/sh
#
# Compress and eventually remove job log directories under ~/htcondor.

cd ~/htcondor || exit

GZ_SIZE=10k    # compress files bigger than this ...
GZ_MINS=60     # ... that were not modified for this many minutes
GZ_DAYS=2      # compress files older than this many days
RM_DAYS=7      # remove directories and cleanup logs older than this many days
STAMP=.stamp

prefix=cleanup-
log=$prefix`date +%y%m%d`
exec >> $log 2>&1 < /dev/null
echo === START `date`

for d in `ls -d 20??-??-??`
do
    (
        echo === $d
        stamp=$d/$STAMP
        [ -e $stamp ] || touch $stamp || exit
        if find $stamp -mtime +$RM_DAYS | grep . > /dev/null
        then
            echo removing...
            /bin/rm -r $d < /dev/null
            exit
        fi
        cd $d || exit
        find . ! -name .\* ! -name \*.gz \( -mtime +$GZ_DAYS -o \
            -size +$GZ_SIZE -mmin +$GZ_MINS \) -exec gzip -9v {} \;
    )
done

find $prefix* -mtime +$RM_DAYS -exec /bin/rm {} \;
echo === READY `date`
```
Crontab line for the cleanup script:
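A plausible example, assuming the script above is saved as `~/htcondor/cleanup.sh` and made executable (the path and schedule here are placeholders to be adjusted):

```
42 3 * * * $HOME/htcondor/cleanup.sh
```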
ARC#
LDAP Configuration#
The following configuration parameters need to be added/adjusted as needed:
LDAP configuration examples
```
# optional (normally not needed): the site BDII to take running and queued job numbers from
CE_SITE_BDII=ldap://site-bdii.gridpp.rl.ac.uk:2170/mds-vo-name=RAL-LCG2,o=grid

# specifies whether to use BDII and which GLUE schema version (only 2 is supported in JAliEn)
CE_USE_BDII=2

# a list of ARC CEs to be used for jobagent submission
CE_LCGCE=arc-ce01.gridpp.rl.ac.uk:2811/nordugrid-Condor-grid3000M, ...

# arguments for arcsub command (load-balancing is done by the JAliEn CE itself)
CE_SUBMITARG=--direct

# additional parameters to arcsub, in particular to pass XRSL clauses as shown
CE_SUBMITARG_LIST=xrsl:(queue=mcore_alice)(memory="2000")(count="8")(countpernode="8")(walltime="1500")(cputime="12000")
```