Using GWDataFind with HTCondor
HTCondor is a specialised workload management system for compute-intensive processing. You use HTCondor to specify the discrete work units (jobs) you want completed; it then distributes them across the available resources with sophisticated scheduling, prioritisation, monitoring, and reporting capabilities. The LIGO Scientific Collaboration and its partners use HTCondor to run large volumes of scientific analysis.
Configuring HTCondor job data with GWDataFind
The most common use case of combining GWDataFind with HTCondor is to query for the URIs of input data files as part of planning a job or workflow.
For large analyses, the URIs returned by GWDataFind are commonly split into logical chunks, one or a few files at a time, where each HTCondor job will only process data files from that chunk. Other chunks are processed in parallel with results combined in a subsequent analysis stage.
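The chunking described above can be sketched in a few lines. This is a minimal illustration, not part of GWDataFind: the helper name `chunk_urls` and the example URLs are hypothetical, and in practice the list would come from `gwdatafind.find_urls`.

```python
from itertools import islice


def chunk_urls(urls, chunk_size):
    """Yield successive lists of at most ``chunk_size`` URLs."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    iterator = iter(urls)
    # keep pulling fixed-size slices until the iterator is exhausted
    while chunk := list(islice(iterator, chunk_size)):
        yield chunk


# hypothetical URLs standing in for the output of gwdatafind.find_urls
example_urls = [
    f"osdf:///example/path/X-EXAMPLE-{start}-4096.gwf"
    for start in range(0, 5 * 4096, 4096)
]

# two files per job; the final chunk holds the single remaining file
chunks = list(chunk_urls(example_urls, 2))
```

Each chunk can then be used to configure one HTCondor job, so that no job transfers more data than it needs.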
The best practice for using input data files with HTCondor is to specify
each data file needed by a job as part of the
transfer_input_files
submit command.
Each argument passed to transfer_input_files can be a file path or URI;
HTCondor will then transfer each file into the (temporary) working directory of
the job.
The process that is started on the compute node can then see each of the input
files as a local file in the current working directory tree.
Pelican and OSDF
The LIGO Scientific Collaboration (and partners) leverage the Open Science Data Federation (OSDF) for data distribution. Depending on the GWDataFind Server you communicate with, you may be able to directly query for OSDF URIs to pass to HTCondor.
Rules
The basic requirements for using GWDataFind URLs with HTCondor are:
- Pass absolute URLs or paths to transfer_input_files for each job, or via a macro variable for each DAGMan node.
- Pass relative paths (normally just a file (base)name) to the executable, either directly or via a cache file.
- Include the disk space required to store the data files in the request_disk command for the job. If you’re not sure how big the files will be, it’s probably OK to give a conservative overestimate.
- If access to the files requires an authorisation token, include that in the job configuration.
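For the DAGMan case mentioned in the first rule, one possible approach is to define a macro variable per node with the DAGMan `VARS` command. The sketch below only builds the text of the DAG file. The helper name `dag_lines`, the node names, and the macro names `input_files` and `input_names` are all hypothetical, and it assumes the shared submit file contains `transfer_input_files = $(input_files)` and `arguments = $(input_names)`.

```python
from os.path import basename


def dag_lines(chunks, submit_file="job.submit"):
    """Return DAGMan lines defining one node per chunk of URLs."""
    lines = []
    for index, chunk in enumerate(chunks):
        node = f"chunk_{index}"  # hypothetical node naming scheme
        # absolute URLs for HTCondor to transfer
        transfer = ",".join(chunk)
        # relative paths (basenames) for the executable to read
        names = " ".join(basename(url) for url in chunk)
        lines.append(f"JOB {node} {submit_file}")
        lines.append(
            f'VARS {node} input_files="{transfer}" input_names="{names}"'
        )
    return lines


# hypothetical chunk of URLs for a single node
example_chunks = [["osdf:///a/b/X-1.gwf", "osdf:///a/b/X-2.gwf"]]
lines = dag_lines(example_chunks)
```

This keeps the two path rules visible side by side: the full URLs go to HTCondor, and only the basenames go to the executable.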
Example 1: Explicit file paths
To configure a single job where the executable takes explicit file paths as arguments, consider the following example:
from os.path import basename
from gwdatafind import find_urls
# find input data OSDF URIs for GW170817
urls = find_urls(
    "L",
    "L1_GWOSC_O2_4KHZ_R1",
    1187008880,
    1187008884,
    host="datafind.gwosc.org",
    urltype="osdf",
)
# the job sees each file by its basename in the working directory
filenames = [basename(url) for url in urls]
# write condor file transfer instructions for the job
with open("job.submit", 'w') as submit_file:
    print(f"""
universe = vanilla
executable = /bin/head
arguments = -c4 {' '.join(filenames)}
log = job.log
error = job.err
output = job.out
request_cpus = 1
request_disk = 10GB
request_memory = 100MB
should_transfer_files = YES
transfer_input_files = {','.join(urls)}
queue
""", file=submit_file)
This produces a job.submit file that looks something like this:
job.submit
universe = vanilla
executable = /bin/head
arguments = -c4 L-L1_GWOSC_O2_4KHZ_R1-1187008512-4096.gwf
log = job.log
error = job.err
output = job.out
request_cpus = 1
request_disk = 10GB
request_memory = 100MB
should_transfer_files = YES
transfer_input_files = osdf:///gwdata/O2/strain.4k/frame.v1/L1/1186988032/L-L1_GWOSC_O2_4KHZ_R1-1187008512-4096.gwf
queue
Directory structure on the execute machine
The simple example above demonstrates how to transfer files into the top-level job directory, assuming that the process spawned by the job doesn’t attempt to change directories or expect data to exist in a subdirectory.
If the executable doesn’t run from the base directory, or changes directory before reading the data, ensure that any file paths it is given (as arguments or in a local cache file) are written from the point of view of the executable at the moment it attempts to read the data.
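As an illustration of writing paths from the executable's point of view, the sketch below computes the path to a transferred file relative to a subdirectory of the job directory. The helper name `path_as_seen_from` and the subdirectory name `analysis` are hypothetical, and it assumes the file was transferred into the top-level job directory.

```python
import posixpath


def path_as_seen_from(filename, workdir):
    """Return the path to ``filename`` relative to ``workdir``.

    Both arguments are interpreted relative to the top-level job
    directory on the execute machine.
    """
    return posixpath.relpath(filename, start=workdir)


# a process that has changed into 'analysis/' must look one level up
entry = path_as_seen_from(
    "L-L1_GWOSC_O2_4KHZ_R1-1187008512-4096.gwf",
    "analysis",
)
```

Here `entry` is the path that should be passed to the executable, or written to its cache file, in place of the plain basename.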
Example 2: Using a cache file
A common pattern is for an executable to read a file that lists the paths of the data files to be used for the job.
GWDataFind includes a gwdatafind.io.Cache object that simplifies translating
lists of URLs into various common cache formats.
Consider the following example:
from os.path import basename
from gwdatafind import find_urls
from gwdatafind.io import Cache
# find input data OSDF URIs for GW170817
urls = find_urls(
    "L",
    "L1_GWOSC_O2_4KHZ_R1",
    1187008880,
    1187008884,
    host="datafind.gwosc.org",
    urltype="osdf",
)
# create a cache containing just the basenames of each file, as seen
# from the job running on the HTCondor Execute Point (compute node)
cache = Cache(map(basename, urls))
cachefile = "cache.txt"
# write the cache in LAL format (by default) to be used by the job
cache.write(cachefile)
# write condor file transfer instructions for the job
with open("job.submit", 'w') as submit_file:
    print(f"""
universe = vanilla
executable = /bin/science
arguments = {cachefile}
... other instructions ...
transfer_input_files = {','.join(urls)},{cachefile}
queue
""", file=submit_file)
This example will result in a local cache file that looks like this:
cache.txt
L L1_GWOSC_O2_4KHZ_R1 1187008512 4096 L-L1_GWOSC_O2_4KHZ_R1-1187008512-4096.gwf
The job submit file should then include the following:
job.submit
should_transfer_files = YES
transfer_input_files = osdf:///gwdata/O2/strain.4k/frame.v1/L1/1186988032/L-L1_GWOSC_O2_4KHZ_R1-1187008512-4096.gwf,cache.txt
Include the cache file in transfer_input_files
For jobs that use a cache file, it is critical to include the cache
file itself in the transfer_input_files list; otherwise it won’t
be available to the executable.
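A simple pre-submission check can catch a missing cache file. The sketch below is a hypothetical helper, not part of GWDataFind or HTCondor, and it assumes that transfer_input_files is given on a single line as a comma-separated list.

```python
def transfer_list(submit_text):
    """Return the entries of transfer_input_files in a submit description."""
    for line in submit_text.splitlines():
        key, sep, value = line.partition("=")
        # submit command names are matched case-insensitively
        if sep and key.strip().lower() == "transfer_input_files":
            return [item.strip() for item in value.split(",") if item.strip()]
    return []


# hypothetical submit description using a placeholder URL
submit_text = """
should_transfer_files = YES
transfer_input_files = osdf:///example/path/X-EXAMPLE-0-4096.gwf,cache.txt
queue
"""
cache_is_transferred = "cache.txt" in transfer_list(submit_text)
```

If `cache_is_transferred` is false, the job would start without the cache file in its working directory.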