
HTCondor/ARC Installation on VObox

This documentation describes how to configure a VObox to enable it to submit ALICE jobs to HTCondor or ARC CEs. Refer to the appropriate section as needed.

The VObox will typically have been set up first as a WLCG VObox as documented here:

WLCG VObox deployment documentation

Remember to add the VOMS client configuration for ALICE:

~# yum install wlcg-voms-alice
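
To verify the VOMS client configuration is in place, you can list the files provided by the package and, if a user certificate is available on the VObox, request a test proxy:

~# rpm -ql wlcg-voms-alice
~# voms-proxy-init -voms alice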

HTCondor

The VObox will run its own HTCondor services, independent of the HTCondor services of your CE and batch system. The following instructions assume CentOS/EL 7.5+; see below for installation on EL 9.

Install HTCondor on CentOS 7

  1. Install the EGI UMD 4 repository rpm:

    ~# yum install http://repository.egi.eu/sw/production/umd/4/centos7/x86_64/updates/umd-release-4.1.3-1.el7.centos.noarch.rpm
    
  2. Install HTCondor 9.0.16 or a later 9.0.x version (not yet 10.x); a way to verify and pin the installed version is shown after this step:

    ~# cd 
    ~# yum update
    ~# yum install condor
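
To confirm that a 9.0.x version was installed, and optionally to keep yum update from later pulling in a 10.x release (a sketch, assuming the yum-plugin-versionlock package is available in your repositories):

~# condor_version
~# yum install yum-plugin-versionlock
~# yum versionlock condor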
    

JAliEn Configuration

This configuration is needed for HTCondor, which used to run a JobRouter (the JobRouter itself is no longer needed).

  1. Go to the HTCondor configuration folder:

    ~# cd /etc/condor
    
  2. Create local configuration for HTCondor:

    ~# touch config.d/01_alice_jobrouter.config
    
  3. Add and adjust the following configuration content:

    config.d/01_alice_jobrouter.config
    DAEMON_LIST = MASTER, SCHEDD, COLLECTOR
    
    # the next line is needed with recent HTCondor versions
    
    COLLECTOR_HOST = $(FULL_HOSTNAME)
    
    GSI_DAEMON_DIRECTORY = /etc/grid-security
    GSI_DAEMON_CERT = $(GSI_DAEMON_DIRECTORY)/hostcert.pem
    GSI_DAEMON_KEY  = $(GSI_DAEMON_DIRECTORY)/hostkey.pem
    GSI_DAEMON_TRUSTED_CA_DIR = $(GSI_DAEMON_DIRECTORY)/certificates
    
    SEC_CLIENT_AUTHENTICATION_METHODS = SCITOKENS, FS, GSI
    SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI
    SEC_DAEMON_AUTHENTICATION_METHODS = FS, GSI
    
    AUTH_SSL_CLIENT_CADIR = /etc/grid-security/certificates
    
    COLLECTOR.ALLOW_ADVERTISE_MASTER = condor@fsauth/$(FULL_HOSTNAME)
    COLLECTOR.ALLOW_ADVERTISE_SCHEDD = $(FULL_HOSTNAME)
    
    ALL_DEBUG = D_FULLDEBUG D_COMMAND
    SCHEDD_DEBUG = D_FULLDEBUG
    GRIDMANAGER_DEBUG = D_FULLDEBUG
    
    FRIENDLY_DAEMONS = condor@fsauth/$(FULL_HOSTNAME), root@fsauth/$(FULL_HOSTNAME), $(FULL_HOSTNAME)
    ALLOW_DAEMON = $(FRIENDLY_DAEMONS)
    
    SCHEDD.ALLOW_WRITE = $(FRIENDLY_DAEMONS), *@cern.ch/$(FULL_HOSTNAME)
    
    # more stuff from the CERN VOboxes
    
    CONDOR_FSYNC = False
    GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 10000
    
    GRIDMANAGER_JOB_PROBE_INTERVAL = 600
    
    GRIDMANAGER_MAX_PENDING_REQUESTS = 500
    GRIDMANAGER_GAHP_CALL_TIMEOUT = 3600
    GRIDMANAGER_SELECTION_EXPR = (ClusterId % 2)
    GRIDMANAGER_GAHP_RESPONSE_TIMEOUT = 300
    GRIDMANAGER_DEBUG =
    ALLOW_DAEMON = $(ALLOW_DAEMON), $(FULL_HOSTNAME), $(IP_ADDRESS), unauthenticated@unmapped
    COLLECTOR.ALLOW_ADVERTISE_MASTER = $(COLLECTOR.ALLOW_ADVERTISE_MASTER), $(ALLOW_DAEMON)
    COLLECTOR.ALLOW_ADVERTISE_SCHEDD = $(COLLECTOR.ALLOW_ADVERTISE_SCHEDD), $(ALLOW_DAEMON)
    
    DELEGATE_JOB_GSI_CREDENTIALS_LIFETIME = 0
    
    GSI_SKIP_HOST_CHECK = true
    
  4. Restart HTCondor now and make it start automatically at boot:

    ~# service condor restart
    ~# chkconfig condor on
    
  5. Check that HTCondor is running and shows the following initial process tree (a further configuration check follows below):

    ~# pstree | grep condor
    
     |-condor_master-+-condor_collecto
     |               |-condor_procd
     |               |-condor_schedd
     |               `-condor_shared_p
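
To double-check that the local configuration has been picked up, the resulting values can be queried with condor_config_val, e.g.:

~# condor_config_val DAEMON_LIST
~# condor_config_val GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE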
    

Install HTCondor on EL 9

  1. Install the HTCondor LTS ("stable") release:

    ~# (umask 077; uuidgen > ~/pool-pwd-$$.txt)
    ~# curl -fsSL https://get.htcondor.org | GET_HTCONDOR_PASSWORD=`cat ~/pool-pwd-$$.txt` /bin/bash -s -- --no-dry-run --channel stable  
    
  2. Move the original configuration file out of the way:

    ~# mv /etc/condor/config.d/00-minicondor ~/00-minicondor.orig
    
  3. Add the following configuration contents:

    /etc/condor/config.d/99-minicondor.vobox
    # HTCONDOR CONFIGURATION TO CREATE A POOL WITH ONE MACHINE
    # --> modified to allow it to be used ONLY for submitting to REMOTE CEs!
    #
    # This file was created upon initial installation of HTCondor.
    # It contains configuration settings to set up a secure HTCondor
    # installation consisting of **just one single machine**.
    # YOU WILL WANT TO REMOVE THIS FILE IF/WHEN YOU DECIDE TO ADD ADDITIONAL
    # MACHINES TO YOUR HTCONDOR INSTALLATION!  Most of these settings do
    # not make sense if you have a multi-server pool.
    #
    # See the Quick Start Installation guide at:
    #     https://htcondor.org/manual/quickstart.html
    #
    
    # ---  NODE ROLES  ---
    
    # Every pool needs one Central Manager, some number of Submit nodes and
    # as many Execute nodes as you can find. Consult the manual to learn
    # about additional roles.
    
    use ROLE: CentralManager
    use ROLE: Submit
    # --> next line commented out to prevent jobs from running on this host:
    # use ROLE: Execute
    
    # --- NETWORK SETTINGS ---
    
    # Configure HTCondor services to listen to port 9618 on the IPv4
    # loopback interface.
    NETWORK_INTERFACE = 127.0.0.1
    # --> next line added to allow job submissions to remote CEs:
    NETWORK_INTERFACE = *
    BIND_ALL_INTERFACES = False
    CONDOR_HOST = 127.0.0.1
    # --> next line added to avoid condor_status errors:
    CONDOR_HOST = $(HOSTNAME)
    
    # --- SECURITY SETTINGS ---
    
    # Verify authenticity of HTCondor services by checking if they are
    # running with an effective user id of user "condor".
    SEC_DEFAULT_AUTHENTICATION = REQUIRED
    SEC_DEFAULT_INTEGRITY = REQUIRED
    ALLOW_DAEMON = condor@$(UID_DOMAIN)
    ALLOW_NEGOTIATOR = condor@$(UID_DOMAIN)
    
    # Configure so only user root or user condor can run condor_on,
    # condor_off, condor_restart, and condor_userprio commands to manage
    # HTCondor on this machine.
    # If you wish any user to do so, comment out the line below.
    ALLOW_ADMINISTRATOR = root@$(UID_DOMAIN) condor@$(UID_DOMAIN)
    
    # Allow anyone (on the loopback interface) to submit jobs.
    ALLOW_WRITE = *
    # Allow anyone (on the loopback interface) to run condor_q or condor_status.
    ALLOW_READ = *
    
    # --- PERFORMANCE TUNING SETTINGS ---
    
    # Since there is just one server in this pool, we can tune various
    # polling intervals to be much more responsive than the system defaults
    # (which are tuned for pools with thousands of servers).  This will
    # enable jobs to be scheduled faster, and job monitoring to happen more
    # frequently.
    SCHEDD_INTERVAL = 5
    NEGOTIATOR_INTERVAL = 2
    NEGOTIATOR_CYCLE_DELAY = 5
    STARTER_UPDATE_INTERVAL = 5
    SHADOW_QUEUE_UPDATE_INTERVAL = 10
    UPDATE_INTERVAL = 5
    RUNBENCHMARKS = 0
    
    # --- COMMON CHANGES ---
    
    # Uncomment the lines below and do 'sudo condor_reconfig' if you wish
    # condor_q to show jobs from all users with one line per job by default.
    #CONDOR_Q_DASH_BATCH_IS_DEFAULT = False
    #CONDOR_Q_ONLY_MY_JOBS = False
    
    # next line added to run only the daemons necessary on a VObox
    DAEMON_LIST = MASTER, SCHEDD, COLLECTOR
    
    /etc/condor/config.d/99-alice-vobox.conf
    # non-standard settings for an ALICE VObox
    
    CONDOR_FSYNC = False
    
    GRIDMANAGER_DEBUG =
    GRIDMANAGER_GAHP_CALL_TIMEOUT = 3600
    GRIDMANAGER_GAHP_RESPONSE_TIMEOUT = 300
    GRIDMANAGER_JOB_PROBE_INTERVAL = 600
    GRIDMANAGER_MAX_PENDING_REQUESTS = 500
    GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 10000
    GRIDMANAGER_SELECTION_EXPR = (ClusterId % 2)
    
  4. Restart HTCondor now and make it start automatically at boot:

    ~# systemctl restart condor
    ~# systemctl enable --now condor
    
  5. Check that HTCondor is running and shows the following initial process tree (an optional smoke test follows below):

    ~# pstree | grep condor
        |-condor_master-+-condor_collecto
        |               |-condor_procd
        |               |-condor_schedd
        |               `-condor_shared_p
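
As an optional smoke test, a minimal grid-universe job can be submitted to a remote HTCondor CE from this VObox. This is a sketch: the CE host below is a placeholder and valid credentials for that CE (proxy or token) are assumed.

test.sub
universe      = grid
grid_resource = condor your-ce01.your-domain your-ce01.your-domain:9619
executable    = /bin/hostname
output        = test.out
error         = test.err
log           = test.log
queue

~# condor_submit test.sub
~# condor_q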
    

LDAP and VObox Configuration (for the ALICE grid team)

In the Environment section, add or adjust the following values as needed:

CE_LCGCE=your-ce01.your-domain:9619, your-ce02.your-domain:9619, ...
    example list of CEs to submit to

USE_TOKEN=[0-2]
    use an X509 proxy, a WLCG token, or both

SUBMIT_ARGS=-append "+TestClassAd=1" ...
    extra options for the condor_submit command, e.g. to add extra ClassAd(s) to the job description
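
For illustration, a complete Environment example could look like this (hypothetical hostnames):

CE_LCGCE=ce01.example.org:9619, ce02.example.org:9619
USE_TOKEN=2
SUBMIT_ARGS=-append "+TestClassAd=1"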

Mind the firewall settings on the VObox. See Network setup for more details.
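
For example, with firewalld the default HTCondor shared port (9618/tcp) can be opened as follows (a sketch; adapt it to your site's firewall policy):

~# firewall-cmd --permanent --add-port=9618/tcp
~# firewall-cmd --reload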

Miscellaneous Scripts

Cleanup script to remove old job logs and stdout/stderr files:

Cleanup script
#!/bin/sh
# Compress recent job output under ~/htcondor and remove old per-day directories.

cd ~/htcondor || exit

GZ_SIZE=10k    # compress files bigger than this, once idle for GZ_MINS minutes
GZ_MINS=60
GZ_DAYS=2      # also compress any files older than this many days
RM_DAYS=7      # remove per-day directories and cleanup logs after this many days

STAMP=.stamp
prefix=cleanup-
log=$prefix`date +%y%m%d`
exec >> $log 2>&1 < /dev/null
echo === START `date`
for d in `ls -d 20??-??-??`
do
    (
        echo === $d
        # the stamp file records when this directory was first seen
        stamp=$d/$STAMP
        [ -e $stamp ] || touch $stamp || exit
        if find $stamp -mtime +$RM_DAYS | grep . > /dev/null
        then
            echo removing...
            /bin/rm -r $d < /dev/null
            exit
        fi
        cd $d || exit
        # gzip plain files that are old enough, or big and no longer being written
        find . ! -name .\* ! -name \*.gz \( -mtime +$GZ_DAYS -o \
             -size +$GZ_SIZE -mmin +$GZ_MINS \) -exec gzip -9v {} \;
     )
done
# rotate this script's own logs as well
find $prefix* -mtime +$RM_DAYS -exec /bin/rm {} \;
echo === READY `date`

Crontab line for the cleanup script:

37 * * * * /bin/sh $HOME/cron/htcondor-cleanup.sh
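
One way to install that line, appending it to any existing crontab (adjust the path to wherever you saved the script):

~# (crontab -l 2>/dev/null; echo '37 * * * * /bin/sh $HOME/cron/htcondor-cleanup.sh') | crontab -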

ARC

LDAP Configuration

The following configuration parameters need to be added/adjusted as needed:

LDAP configuration examples
# optional (normally not needed): the site BDII to take running and queued job numbers from

CE_SITE_BDII=ldap://site-bdii.gridpp.rl.ac.uk:2170/mds-vo-name=RAL-LCG2,o=grid

# specifies whether to use BDII and which GLUE schema version (only 2 is supported in JAliEn)
CE_USE_BDII=2

# a list of ARC CEs to be used for jobagent submission
CE_LCGCE=arc-ce01.gridpp.rl.ac.uk:2811/nordugrid-Condor-grid3000M, ...

# arguments for arcsub command (load-balancing is done by the JAliEn CE itself)
CE_SUBMITARG=--direct

# additional parameters to arcsub, in particular to pass XRSL clauses as shown
CE_SUBMITARG_LIST=xrsl:(queue=mcore_alice)(memory="2000")(count="8")(countpernode="8")(walltime="1500")(cputime="12000")
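
Before letting the VObox use a CE, the ARC client tools can be tried by hand (a sketch, assuming the tools are installed and a valid proxy exists; test.xrsl is a hypothetical minimal job description and the CE hostname is the example from above):

test.xrsl
&(executable="/bin/hostname")(stdout="test.out")(stderr="test.err")

~# arcinfo -c arc-ce01.gridpp.rl.ac.uk
~# arcsub --direct -c arc-ce01.gridpp.rl.ac.uk test.xrsl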

Debug ARC for Operations (to be tested)

Set the following variable in the ~/.alien/config/CE.env file to let the arc* CLI tools log debug output in CE.log.N files:

ARC_DEBUG=1