HTCondor/ARC Installation on VObox

This documentation describes how to configure a VObox so that it can submit ALICE jobs to HTCondor or ARC CEs. Refer to the appropriate section as needed.

The VObox will typically have been set up first as a WLCG VObox as documented here:

WLCG VObox deployment documentation

Be sure to add the VOMS client configuration for ALICE:

~# yum install wlcg-voms-alice
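
To verify the VOMS client configuration, you can request a short-lived ALICE proxy as a user with a valid grid certificate (a quick sanity check; this assumes the user certificate is in the usual ~/.globus location):

~$ voms-proxy-init -voms alice -valid 0:10
~$ voms-proxy-info -all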

HTCondor

The VObox will run its own HTCondor services that are independent of the HTCondor services for your CE and batch system. The following instructions assume you are using CentOS/EL 7.5+. See below for installations compatible with EL 9.

Install HTCondor on CentOS 7

  1. Install the EGI UMD 4 repository rpm:

    ~# yum install http://repository.egi.eu/sw/production/umd/4/centos7/x86_64/updates/umd-release-4.1.3-1.el7.centos.noarch.rpm
    
  2. Install HTCondor 9.0.16 or a later 9.0.x version (not yet 10.x):

    ~# cd 
    ~# yum update
    ~# yum install condor
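
You can confirm the installed version is a 9.0.x release before proceeding:

~# condor_version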
    

JAliEn Configuration

This configuration is needed for HTCondor, which used to run a JobRouter (no longer needed).

  1. Go to the HTCondor configuration folder:

    ~# cd /etc/condor
    
  2. Create local configuration for HTCondor:

    ~# touch config.d/01_alice_jobrouter.config
    
  3. Add and adjust the following configuration content:

    config.d/01_alice_jobrouter.config
    DAEMON_LIST = MASTER, SCHEDD, COLLECTOR
    
    # the next line is needed as of recent HTCondor versions
    
    COLLECTOR_HOST = $(FULL_HOSTNAME)
    
    GSI_DAEMON_DIRECTORY = /etc/grid-security
    GSI_DAEMON_CERT = $(GSI_DAEMON_DIRECTORY)/hostcert.pem
    GSI_DAEMON_KEY  = $(GSI_DAEMON_DIRECTORY)/hostkey.pem
    GSI_DAEMON_TRUSTED_CA_DIR = $(GSI_DAEMON_DIRECTORY)/certificates
    
    SEC_CLIENT_AUTHENTICATION_METHODS = SCITOKENS, FS, GSI
    SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI
    SEC_DAEMON_AUTHENTICATION_METHODS = FS, GSI
    
    AUTH_SSL_CLIENT_CADIR = /etc/grid-security/certificates
    
    COLLECTOR.ALLOW_ADVERTISE_MASTER = condor@fsauth/$(FULL_HOSTNAME)
    COLLECTOR.ALLOW_ADVERTISE_SCHEDD = $(FULL_HOSTNAME)
    
    ALL_DEBUG = D_FULLDEBUG D_COMMAND
    SCHEDD_DEBUG = D_FULLDEBUG
    GRIDMANAGER_DEBUG = D_FULLDEBUG
    
    FRIENDLY_DAEMONS = condor@fsauth/$(FULL_HOSTNAME), root@fsauth/$(FULL_HOSTNAME), $(FULL_HOSTNAME)
    ALLOW_DAEMON = $(FRIENDLY_DAEMONS)
    
    SCHEDD.ALLOW_WRITE = $(FRIENDLY_DAEMONS), *@cern.ch/$(FULL_HOSTNAME)
    
    # more stuff from the CERN VOboxes
    
    CONDOR_FSYNC = False
    GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 10000
    
    GRIDMANAGER_JOB_PROBE_INTERVAL = 600
    
    GRIDMANAGER_MAX_PENDING_REQUESTS = 500
    GRIDMANAGER_GAHP_CALL_TIMEOUT = 3600
    GRIDMANAGER_SELECTION_EXPR = (ClusterId % 2)
    GRIDMANAGER_GAHP_RESPONSE_TIMEOUT = 300
    GRIDMANAGER_DEBUG =
    ALLOW_DAEMON = $(ALLOW_DAEMON), $(FULL_HOSTNAME), $(IP_ADDRESS), unauthenticated@unmapped
    COLLECTOR.ALLOW_ADVERTISE_MASTER = $(COLLECTOR.ALLOW_ADVERTISE_MASTER), $(ALLOW_DAEMON)
    COLLECTOR.ALLOW_ADVERTISE_SCHEDD = $(COLLECTOR.ALLOW_ADVERTISE_SCHEDD), $(ALLOW_DAEMON)
    
    DELEGATE_JOB_GSI_CREDENTIALS_LIFETIME = 0
    
    GSI_SKIP_HOST_CHECK = true
    
  4. Restart HTCondor now and make it start automatically at boot:

    ~# service condor restart
    ~# chkconfig condor on
    
  5. Check that HTCondor is running and produces the following initial output:

    ~# pstree | grep condor
    
     |-condor_master-+-condor_collecto
     |               |-condor_procd
     |               |-condor_schedd
     |               `-condor_shared_p
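
Once the daemons are up, a quick sanity check is to query the local schedd and the collector; at this point the job queue should simply be empty:

~# condor_q
~# condor_status -schedd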
    

Install HTCondor on EL 9

  1. Install the HTCondor 10.x repository rpm:

    ~# yum install https://research.cs.wisc.edu/htcondor/repo/10.x/htcondor-release-current.el9.noarch.rpm
    
  2. Install HTCondor 10.x:

    ~# cd 
    ~# yum update
    ~# yum install condor
    
  3. Add the following configuration contents:

    /etc/condor/config.d/00-minicondor.vobox
    # HTCONDOR CONFIGURATION TO CREATE A POOL WITH ONE MACHINE
    # --> modified to allow it to be used ONLY for submitting to REMOTE CEs!
    #
    # This file was created upon initial installation of HTCondor.
    # It contains configuration settings to set up a secure HTCondor
    # installation consisting of **just one single machine**.
    # YOU WILL WANT TO REMOVE THIS FILE IF/WHEN YOU DECIDE TO ADD ADDITIONAL
    # MACHINES TO YOUR HTCONDOR INSTALLATION!  Most of these settings do
    # not make sense if you have a multi-server pool.
    #
    # See the Quick Start Installation guide at:
    #     https://htcondor.org/manual/quickstart.html
    #
    
    # ---  NODE ROLES  ---
    
    # Every pool needs one Central Manager, some number of Submit nodes and
    # as many Execute nodes as you can find. Consult the manual to learn
    # about additional roles.
    
    use ROLE: CentralManager
    use ROLE: Submit
    # --> next line commented out to prevent jobs from running on this host:
    # use ROLE: Execute
    
    # --- NETWORK SETTINGS ---
    
    # Configure HTCondor services to listen to port 9618 on the IPv4
    # loopback interface.
    # --> next line commented out to allow job submissions to remote CEs:
    # NETWORK_INTERFACE = 127.0.0.1
    BIND_ALL_INTERFACES = False
    CONDOR_HOST = 127.0.0.1
    # --> next line added to avoid condor_status errors:
    CONDOR_HOST = $(HOSTNAME)
    
    # --- SECURITY SETTINGS ---
    
    # Verify authenticity of HTCondor services by checking if they are
    # running with an effective user id of user "condor".
    SEC_DEFAULT_AUTHENTICATION = REQUIRED
    SEC_DEFAULT_INTEGRITY = REQUIRED
    ALLOW_DAEMON = condor@$(UID_DOMAIN)
    ALLOW_NEGOTIATOR = condor@$(UID_DOMAIN)
    
    # Configure so only user root or user condor can run condor_on,
    # condor_off, condor_restart, and condor_userprio commands to manage
    # HTCondor on this machine.
    # If you wish any user to do so, comment out the line below.
    ALLOW_ADMINISTRATOR = root@$(UID_DOMAIN) condor@$(UID_DOMAIN)
    
    # Allow anyone (on the loopback interface) to submit jobs.
    ALLOW_WRITE = *
    # Allow anyone (on the loopback interface) to run condor_q or condor_status.
    ALLOW_READ = *
    
    # --- PERFORMANCE TUNING SETTINGS ---
    
    # Since there is just one server in this pool, we can tune various
    # polling intervals to be much more responsive than the system defaults
    # (which are tuned for pools with thousands of servers).  This will
    # enable jobs to be scheduled faster, and job monitoring to happen more
    # frequently.
    SCHEDD_INTERVAL = 5
    NEGOTIATOR_INTERVAL = 2
    NEGOTIATOR_CYCLE_DELAY = 5
    STARTER_UPDATE_INTERVAL = 5
    SHADOW_QUEUE_UPDATE_INTERVAL = 10
    UPDATE_INTERVAL = 5
    RUNBENCHMARKS = 0
    
    # --- COMMON CHANGES ---
    
    # Uncomment the lines below and do 'sudo condor_reconfig' if you wish
    # condor_q to show jobs from all users with one line per job by default.
    #CONDOR_Q_DASH_BATCH_IS_DEFAULT = False
    #CONDOR_Q_ONLY_MY_JOBS = False
    
    /etc/condor/config.d/99-alice-vobox.conf
    # non-standard settings for an ALICE VObox
    
    AUTH_SSL_CLIENT_CADIR = /etc/grid-security/certificates
    AUTH_SSL_USE_CLIENT_PROXY_ENV_VAR = True
    
    CONDOR_FSYNC = False
    
    DELEGATE_JOB_GSI_CREDENTIALS_LIFETIME = 0
    
    GRIDMANAGER_DEBUG =
    GRIDMANAGER_GAHP_CALL_TIMEOUT = 3600
    GRIDMANAGER_GAHP_RESPONSE_TIMEOUT = 300
    GRIDMANAGER_JOB_PROBE_INTERVAL = 600
    GRIDMANAGER_MAX_PENDING_REQUESTS = 500
    GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 10000
    GRIDMANAGER_SELECTION_EXPR = (ClusterId % 2)
    
    GSI_SKIP_HOST_CHECK = true
    
  4. Restart HTCondor now and make it start automatically at boot:

    ~# systemctl enable --now condor
    
  5. Check that HTCondor is running and produces the following initial output:

    ~# pstree | grep condor
        |-condor_master-+-condor_collecto
        |               |-condor_negotiat
        |               |-condor_procd
        |               |-condor_schedd
        |               `-condor_shared_p
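
On EL 9 you can additionally check the service state and confirm that a setting from the files above was picked up, for example:

~# systemctl status condor
~# condor_config_val CONDOR_FSYNC
False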
    

LDAP and VObox Configuration

In the Environment section, add or adjust these values as needed:

CE_LCGCE=your-ce01.your-domain:9619, your-ce02.your-domain:9619, ...
    Example list of CEs to submit to.

USE_TOKEN=[0-2]
    Use an X509 proxy, a WLCG token, or both.

SUBMIT_ARGS=-append "+TestClassAd=1" ...
    Extra options for the condor_submit command, e.g. to add extra
    ClassAd(s) to the job description.
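
For illustration, a VObox submitting to two CEs might have lines like these in its Environment section (hostnames are placeholders; choose the USE_TOKEN value from the [0-2] range above):

CE_LCGCE=ce01.example.org:9619, ce02.example.org:9619
USE_TOKEN=2
SUBMIT_ARGS=-append "+TestClassAd=1"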

Mind the firewall settings on the VObox. See Network setup for more details.
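
A quick check of outbound connectivity from the VObox to a CE port (hostname is a placeholder; nc is provided by the nmap-ncat package on EL systems):

~# nc -vz your-ce01.your-domain 9619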

Miscellaneous Scripts

Cleanup script for removing job logs and stdout/stderr files:

Clean up script
#!/bin/sh
# Compress recent job log files and remove per-day directories that
# have been present for more than RM_DAYS days.

cd ~/htcondor || exit

GZ_SIZE=10k   # compress files bigger than this ...
GZ_MINS=60    # ... when not modified in the past hour
GZ_DAYS=2     # compress any file older than this many days
RM_DAYS=7     # remove directories (and old logs) after this many days

STAMP=.stamp
prefix=cleanup-
log=$prefix`date +%y%m%d`

# log all output of this script to a daily file
exec >> $log 2>&1 < /dev/null
echo === START `date`

for d in `ls -d 20??-??-??`
do
    (
        echo === $d
        stamp=$d/$STAMP
        # create the time-stamp file when the directory is first seen
        [ -e $stamp ] || touch $stamp || exit
        # remove the whole directory when its stamp has expired
        if find $stamp -mtime +$RM_DAYS | grep . > /dev/null
        then
            echo removing...
            /bin/rm -r $d < /dev/null
            exit
        fi
        cd $d || exit
        # compress files that are old, or big and not recently modified
        find . ! -name .\* ! -name \*.gz \( -mtime +$GZ_DAYS -o \
             -size +$GZ_SIZE -mmin +$GZ_MINS \) -exec gzip -9v {} \;
    )
done

# remove old logs of this script itself
find $prefix* -mtime +$RM_DAYS -exec /bin/rm {} \;
echo === READY `date`
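
The script can be run by hand (as the user owning ~/htcondor) to verify it works; it logs to a cleanup-YYMMDD file. Here it is assumed to be saved as $HOME/cron/htcondor-cleanup.sh, matching the crontab line below:

~$ sh $HOME/cron/htcondor-cleanup.sh
~$ cat $HOME/htcondor/cleanup-`date +%y%m%d`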

Crontab line for the cleanup script:

37 * * * * /bin/sh $HOME/cron/htcondor-cleanup.sh
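
To add that line non-interactively, a common pattern is to append it to any existing crontab (single quotes keep $HOME unexpanded until the job runs):

~$ (crontab -l 2>/dev/null; echo '37 * * * * /bin/sh $HOME/cron/htcondor-cleanup.sh') | crontab -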

ARC

LDAP Configuration

Add or adjust the following configuration parameters as needed:

LDAP configuration examples
# optional (normally not needed): the site BDII to take running and queued job numbers from
CE_SITE_BDII=ldap://site-bdii.gridpp.rl.ac.uk:2170/mds-vo-name=RAL-LCG2,o=grid

# specifies whether to use BDII and which GLUE schema version (only 2 is supported in JAliEn)
CE_USE_BDII=2

# a list of ARC CEs to be used for jobagent submission
CE_LCGCE=arc-ce01.gridpp.rl.ac.uk:2811/nordugrid-Condor-grid3000M, ...

# arguments for arcsub command (load-balancing is done by the JAliEn CE itself)
CE_SUBMITARG=--direct

# additional parameters to arcsub, in particular to pass XRSL clauses as shown
CE_SUBMITARG_LIST=xrsl:(queue=mcore_alice)(memory="2000")(count="8")(countpernode="8")(walltime="1500")(cputime="12000")
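
Before relying on the LDAP settings, a manual test submission can be useful. A minimal XRSL description might look like this (the file name, job name and queue are illustrative; the queue is taken from the example above):

test.xrsl
&(executable="/bin/hostname")
 (stdout="out.txt")
 (stderr="err.txt")
 (jobname="vobox-test")
 (queue="mcore_alice")

It can then be submitted with the same arguments JAliEn would use per CE_SUBMITARG:

~$ arcsub --direct -c arc-ce01.gridpp.rl.ac.uk test.xrsl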

Debug ARC for Operations (to be tested)

Set the following variable in the ~/.alien/config/CE.env file to let the arc* CLI tools log debug output in the CE.log.N files:

ARC_DEBUG=1