.. _gwdatafind-htcondor:

##############################
Using GWDataFind with HTCondor
##############################

`HTCondor <https://htcondor.org/>`__ is a specialised workload management
system for compute-intensive processing.
HTCondor lets you describe discrete units of work (jobs), which it then
distributes across the available resources with sophisticated
scheduling, prioritisation, monitoring, and reporting capabilities.
The LIGO Scientific Collaboration and its partners use HTCondor to
run large volumes of scientific analysis.

=============================================
Configuring HTCondor job data with GWDataFind
=============================================

The most common use case of combining GWDataFind with HTCondor is to query
for the URIs of input data files as part of planning a job or workflow.

For large analyses, the URIs returned by GWDataFind are commonly split into
logical chunks of one or a few files, with each HTCondor job processing only
the data files in its own chunk.
The chunks are processed in parallel, and the results are combined in a
subsequent analysis stage.
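
As a minimal sketch of this chunking step, the following splits a list of
URIs into groups of at most ``size`` entries, one group per job.
The ``chunk_urls`` helper and the example URIs are hypothetical, used only
for illustration; they are not part of GWDataFind.

.. code-block:: python
    :caption: Splitting URIs into per-job chunks (illustrative sketch)

    from itertools import islice


    def chunk_urls(urls, size):
        """Yield successive lists of at most ``size`` URIs."""
        iterator = iter(urls)
        while True:
            chunk = list(islice(iterator, size))
            if not chunk:
                return
            yield chunk


    # placeholder URIs standing in for the output of gwdatafind.find_urls
    urls = [
        f"osdf:///example/X-EXAMPLE-{start}-4096.gwf"
        for start in range(0, 5 * 4096, 4096)
    ]

    # one chunk (list of URIs) per HTCondor job
    chunks = list(chunk_urls(urls, 2))

Each element of ``chunks`` can then be used to configure the file transfer
for a single job.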

The recommended way to use input data files with HTCondor is to specify
each data file needed by a job as part of the
`transfer_input_files <https://htcondor.readthedocs.io/en/lts/users-manual/file-transfer.html>`__
submit command.
Each argument passed to ``transfer_input_files`` can be a file path or URI;
HTCondor will then transfer each file into the (temporary) working directory of
the job.
The process that is started on the compute node can then see each of the input
files as a local file in the current working directory tree.

.. admonition:: Pelican and OSDF
    :class: tip
    :name: _gwdatafind-htcondor-pelican

    The LIGO Scientific Collaboration (and partners) leverage
    `the Open Science Data Federation (OSDF) <https://osg-htc.org/services/osdf.html>`__
    for data distribution.
    Depending on the GWDataFind Server you communicate with, you may be able
    to directly query for OSDF URIs to pass to HTCondor.

-----
Rules
-----

The basic requirements for using GWDataFind URLs with HTCondor are:

1. Pass *absolute* URLs or paths to ``transfer_input_files`` for each job,
   or via a macro variable for each DAGMan node.

2. Pass *relative* paths (normally just a file (base)name) to
   the executable, either directly or via a cache file.

3. Include the disk space required to store the data files in the
   ``request_disk`` command for the job. If you're not sure how big
   the files will be, it's probably OK to give a conservative overestimate.

4. If access to the files requires an authorisation token, include that
   in the job configuration.
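
For rules 1 and 2 in a DAGMan workflow, one possible approach is sketched
below: each node receives the absolute URIs and the matching basenames
through ``VARS`` macros.
The ``dag_node_lines`` helper, the macro names ``input_files`` and
``data_args``, and the example URIs are all hypothetical; the submit file is
assumed to contain ``transfer_input_files = $(input_files)`` and
``arguments = $(data_args)``.

.. code-block:: python
    :caption: Passing input files to a DAGMan node (illustrative sketch)

    from os.path import basename


    def dag_node_lines(name, urls, submit_file="job.submit"):
        """Return the DAG file lines describing one node."""
        # rule 1: absolute URIs for HTCondor file transfer
        input_files = ",".join(urls)
        # rule 2: relative paths (basenames) for the executable
        data_args = " ".join(basename(url) for url in urls)
        return [
            f"JOB {name} {submit_file}",
            f'VARS {name} input_files="{input_files}" data_args="{data_args}"',
        ]


    # placeholder URIs standing in for one chunk of gwdatafind.find_urls output
    urls = [
        "osdf:///example/X-EXAMPLE-0-4096.gwf",
        "osdf:///example/X-EXAMPLE-4096-4096.gwf",
    ]
    lines = dag_node_lines("chunk0", urls)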

------------------------------
Example 1: Explicit file paths
------------------------------

To configure a single job where the executable takes explicit file paths
as arguments, consider the following example:

.. code-block:: python
    :name: gwdatafind-htcondor-file-transfer-explicit
    :caption: Passing input files to HTCondor (explicit)

    from os.path import basename
    from gwdatafind import find_urls

    # find input data OSDF URIs for GW170817
    urls = find_urls(
        "L",
        "L1_GWOSC_O2_4KHZ_R1",
        1187008880,
        1187008884,
        host="datafind.gwosc.org",
        urltype="osdf",
    )
    filenames = map(basename, urls)

    # write condor file transfer instructions for the job
    with open("job.submit", 'w') as submit_file:
        print(f"""
    universe = vanilla
    executable = /bin/head
    arguments = -c4 {' '.join(filenames)}
    log = job.log
    error = job.err
    output = job.out
    request_cpus = 1
    request_disk = 10GB
    request_memory = 100MB
    should_transfer_files = YES
    transfer_input_files = {','.join(urls)}
    queue
    """, file=submit_file)

This will lead to a ``job.submit`` file that looks something like this:

.. code-block:: ini
    :name: gwdatafind-htcondor-file-transfer-explicit-submit
    :caption: ``job.submit``

    universe = vanilla
    executable = /bin/head
    arguments = -c4 L-L1_GWOSC_O2_4KHZ_R1-1187008512-4096.gwf
    log = job.log
    error = job.err
    output = job.out
    request_cpus = 1
    request_disk = 10GB
    request_memory = 100MB
    should_transfer_files = YES
    transfer_input_files = osdf:///gwdata/O2/strain.4k/frame.v1/L1/1186988032/L-L1_GWOSC_O2_4KHZ_R1-1187008512-4096.gwf
    queue

.. admonition:: Directory structure on the execute machine
    :class: note

    The simple example above demonstrates how to transfer files into the
    top-level job directory, assuming that the process spawned by the
    job doesn't attempt to change directories or expect data to exist in
    a subdirectory.

    If the executable doesn't run from the base directory, or changes
    directory *before* reading the data, ensure that the file paths passed
    to the executable (directly or via a cache file) are written from the
    point-of-view of the executable at the moment it attempts to read the data.

-----------------------------
Example 2: Using a cache file
-----------------------------

A common pattern is for an executable to read a file that lists the paths
of the data files to be used for the job.

GWDataFind includes a `gwdatafind.io.Cache` object that simplifies translating
lists of URLs into various common cache formats.
Consider the following example:

.. code-block:: python
    :name: gwdatafind-htcondor-file-transfer-cache
    :caption: Passing input files to HTCondor with a cache file

    from os.path import basename
    from gwdatafind import find_urls
    from gwdatafind.io import Cache

    # find input data OSDF URIs for GW170817
    urls = find_urls(
        "L",
        "L1_GWOSC_O2_4KHZ_R1",
        1187008880,
        1187008884,
        host="datafind.gwosc.org",
        urltype="osdf",
    )

    # create a cache containing just the basenames of each file, as seen
    # from the job running on the HTCondor Execute Point (compute node)
    cache = Cache(map(basename, urls))
    cachefile = "cache.txt"

    # write the cache in LAL format (by default) to be used by the job
    cache.write(cachefile)

    # write condor file transfer instructions for the job
    with open("job.submit", 'w') as submit_file:
        print(f"""
    universe = vanilla
    executable = /bin/science
    arguments = {cachefile}

    ... other instructions ...

    transfer_input_files = {','.join(urls)},{cachefile}
    queue
    """, file=submit_file)

This example will result in a local cache file that looks like this:

.. code-block:: text
    :name: gwdatafind-htcondor-file-transfer-local-cache
    :caption: ``cache.txt``

    L L1_GWOSC_O2_4KHZ_R1 1187008512 4096 L-L1_GWOSC_O2_4KHZ_R1-1187008512-4096.gwf

The job submit file should then include the following:

.. code-block:: ini
    :name: gwdatafind-htcondor-file-transfer-local-cache-submit
    :caption: ``job.submit``

    should_transfer_files = YES
    transfer_input_files = osdf:///gwdata/O2/strain.4k/frame.v1/L1/1186988032/L-L1_GWOSC_O2_4KHZ_R1-1187008512-4096.gwf,cache.txt

.. admonition:: Include the cache file in ``transfer_input_files``
    :class: important

    For jobs that use a cache file, it is critical to include the cache
    file itself in the ``transfer_input_files`` list; otherwise it won't
    be available to the executable.
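
A simple pre-submission check along these lines can catch a missing cache
file before the job is queued.
This is only a sketch: ``check_transfer_list`` is a hypothetical helper, not
part of GWDataFind or HTCondor, and it assumes that the cache lists plain
basenames, as in the example above.

.. code-block:: python
    :caption: Checking the file transfer list (illustrative sketch)

    from os.path import basename


    def check_transfer_list(transfer_input_files, cachefile, cache_paths):
        """Return a list of problems with a job's file transfer plan."""
        # names of the files as they will appear in the job directory
        transferred = {
            basename(item.strip())
            for item in transfer_input_files.split(",")
            if item.strip()
        }
        problems = []
        if basename(cachefile) not in transferred:
            problems.append(f"cache file {cachefile} is not transferred")
        for path in cache_paths:
            if path not in transferred:
                problems.append(f"data file {path} is not transferred")
        return problems


    # placeholder values for illustration
    data_url = "osdf:///example/X-EXAMPLE-0-4096.gwf"
    good = check_transfer_list(
        f"{data_url},cache.txt", "cache.txt", ["X-EXAMPLE-0-4096.gwf"]
    )
    bad = check_transfer_list(data_url, "cache.txt", ["X-EXAMPLE-0-4096.gwf"])

An empty list means that the cache file and every data file it names are
included in the transfer.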