This page is our best attempt at getting you up and running on VAMPIRE as fast as possible. VAMPIRE is the Very Awesome Multi-Processor Integrated Research Environment. It combines the Linux desktop computers of the University of Maryland, College Park Department of Astronomy with the Condor software package to form a powerful distributed computing network. This page is designed for department members who want to learn how to use VAMPIRE. Outsiders are welcome to look at this page and learn a little about Condor, but you should know that the way we do things here may be slightly (or drastically) different from what you need.
First, a quick word about what kinds of jobs are appropriate for VAMPIRE. VAMPIRE is good for non-interactive, compute-intensive jobs: the kind of job that you would otherwise leave running on a normal desktop for a long time. The job can be memory intensive, but it cannot require more memory than a typical individual computer has. VAMPIRE is even better for running several jobs like that. If you have five jobs that would each take about one week on an average desktop, you would have to wait over a month for them to complete on one desktop. With VAMPIRE, they would complete in less than two weeks. VAMPIRE is better still at large numbers of smaller jobs. If you can split those five jobs into five hundred two-hour jobs, then VAMPIRE could finish them in less than two days.
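To make that arithmetic concrete, here is a rough sketch (in plain sh; any shell would do) of the ideal case, where jobs run in waves across however many machines are free. The machine counts of 5 and 50 are assumptions picked only to match the figures above. The number of free VAMPIRE machines varies, and interruptions will add to these times.

```sh
#!/bin/sh
# Ideal wall-clock hours for N equal jobs on M machines: jobs run in
# "waves" of M at a time, so the time is ceil(N / M) * hours_per_job.
# Interruptions and scheduling overhead are ignored.
estimate_hours() {
    njobs=$1; hours_each=$2; machines=$3
    waves=$(( (njobs + machines - 1) / machines ))
    echo $(( waves * hours_each ))
}

# The machine counts below (1, 5, 50) are illustrative assumptions.
echo "5 week-long jobs, 1 desktop:    $(estimate_hours 5 168 1) hours"
echo "5 week-long jobs, 5 machines:   $(estimate_hours 5 168 5) hours"
echo "500 two-hour jobs, 50 machines: $(estimate_hours 500 2 50) hours"
```

This prints 840, 168, and 20 hours. The first and last agree with "over a month" and "less than two days". The middle figure is the ideal one week, which presumably becomes "less than two weeks" once you allow for jobs being bumped and restarted.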
There are several main environments in which one can run a job with Condor. Condor calls these "universes." The "vanilla universe" is the simplest (and least powerful) of all Condor universes, and it is the one with which we are most familiar. This page deals almost entirely with the vanilla universe. The other universes available in Condor are:
- Standard Universe: "tricks" the process into thinking that it is running on your own machine even though it is running on another client. The standard universe also takes care of checkpointing and remote system calls for you. Your executable must be re-linked with the Condor libraries. This may not work for complicated programs. Read the Condor manual for more limitations. Use the vanilla or standard universe unless you specifically need one of the following.
- PVM Universe: Parallel Virtual Machine interface implementation.
- MPI Universe: Message Passing Interface implementation for Condor. It currently only works with MPICH and not with LAM/MPI. VAMPIRE is not appropriate for MPI jobs. Borg is your best bet for MPI jobs.
- Globus Universe: for GRID computing. Not currently available for VAMPIRE.
- Java Universe: for Java programs. This is currently not available for VAMPIRE, but it could be added if someone really wanted it.
- Scheduler Universe: runs a scheduler on top of Condor. Look into this if you want to run a DAG (a set of jobs with dependencies between them).
To understand more about Condor universes, read Condor's description.
In order to run a job on Condor in the vanilla universe, you must be able to run the program as a batch job. This means that all input must come from a file and all output must go to a file, without any user interaction. This is true of all Condor jobs. The vanilla universe has the added requirement that the program must take care of its own checkpointing. A vanilla universe job runs on whatever machine Condor assigns to it. If that machine suddenly becomes unavailable, Condor will stop the job and start it over from the beginning on another machine. Condor does not dump any memory or send any signals (other than SIGTERM) to the program. It might flush the output streams, though (I'm not sure). So when your program is executed, it must be able to figure out whether it is running for the first time or restarting.
This may sound complicated, but there are at least two easy strategies for doing it. The easier the strategy, the less efficient it is. The easiest way is to write no output until the job runs to completion, and then write everything at once. If the job is interrupted, however, the time it spent computing is wasted. The next method is to write output at regular intervals. When the program restarts, it reads the previous output and figures out what to do next. This is often done with a shell-script wrapper. How often to write output depends on how much computing time you are willing to lose. The most important consideration is how long each individual job will take. If your job will take two weeks, then writing output every hour should be sufficient. If, however, you have one hundred jobs that will take one hour each, then writing output every ten or fifteen minutes might make sense. Writing output too frequently decreases your efficiency.
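Here is a minimal sketch of the second strategy, written in plain sh rather than the csh used for the wrapper on this page. It is one possible variant, not the only way to do it: instead of re-reading the output, it keeps a small checkpoint file holding the number of steps already finished. The function name, the file names, and the "step" stand-in for real work are all made up for illustration.

```sh
#!/bin/sh
# Resume-from-checkpoint sketch. The checkpoint file holds the number of
# steps already completed; a missing file means this is the first run.
run_steps() {
    checkpoint=$1; total=$2; output=$3
    done_steps=0
    if [ -r "$checkpoint" ]; then
        done_steps=$(cat "$checkpoint")
    fi
    while [ "$done_steps" -lt "$total" ]; do
        step=$(( done_steps + 1 ))
        # Stand-in for one unit of real work.
        echo "result of step $step" >> "$output"
        done_steps=$step
        # Write the new count to a temporary file and rename it, so an
        # interruption never leaves a half-written checkpoint behind.
        echo "$done_steps" > "$checkpoint.tmp"
        mv "$checkpoint.tmp" "$checkpoint"
    done
}

# Hypothetical file names; run the same command again to "restart".
run_steps checkpoint.txt 10 results.txt
```

Note the trade-off: if the job is killed after a step writes its result but before the checkpoint is updated, that step is repeated on restart, so each step should be safe to run twice.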
Once the code is prepared, all that remains to be done is to submit the job to condor. This involves preparing a configuration file for the job and a simple command.
Let's go through some examples together. The first example is
very simple. You can find all the files you need here. Copy all files to your VAMPIRE machine in
a directory that is viewable by the entire network. Compile the C
code by running gcc -Wall vampirehello.c -o vampirehello.
There should be no warnings. This makes your main executable, but we
will run it within the wrapper, wrapper.csh.
If you take a look inside the wrapper, you will see that all it
does is run vampirehello and redirect input and output.
There is also a sleep command in there so that we can see what is
going on. Next, run the command pwd and make sure that
the directory is accessible from another computer (i.e., make sure the
path starts with /n/diskname/).
Look inside condor.cfg now. It gives the name of the
executable, the type of universe, the name of the condor log file
(useful for working out why the job did or did not run), a line that tells
condor you want an email sent to you if there is an error, the name of
the file to which you want stderr to point, and the magic word
Queue. This tells condor to put exactly one job in the
queue. That job will have all of the settings described in the config
file up to that point.
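For reference, a submit file matching that description would look roughly like the sketch below. This is a reconstruction from the description above, not a copy of the actual condor.cfg in the example directory, so the real file may differ in order or detail.

```text
# condor.cfg (reconstructed sketch; the real example file may differ)
Executable   = wrapper.csh
Universe     = vanilla
Log          = condor.log
Notification = Error
Error        = errormessages.txt
Queue
```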
Because the job will be run as user condor, you need to give condor
permission to read, write, and execute the files necessary. The way I
do this is to run touch vampirehello_output condor.log
errormessages.txt and then run chmod -R 777 . at
the command line. Make sure that you do not have any sensitive files
in or beneath your current directory. The reason that I create the
files rather than allowing condor to create them is that I will be the
owner of them. You can avoid this potential problem, I believe, by
using the standard universe. [N.B. It may be possible to avoid this
chmod unpleasantness by activating the setuid bit on the executable.
Currently, setuid is deactivated for most filesystems for obvious
security reasons. In the meantime, make sure to take precautions like
setting the sticky bit in your VAMPIRE directory.]
Now all you have to do is submit the job by running
/n/ida/condor/bin/condor_submit condor.cfg. Feel free to
put /n/ida/condor/bin in your path (I will assume you
have done so from now on). You must be logged into a VAMPIRE machine
to submit a job, and the preferred method is to log into ida (the
master condor node). Immediately after submitting your job, run
condor_q -g your_username. You should see your job listed
something like the following.
chaos> condor_q -g kayhan
-- Schedd: chaos.astro.umd.edu : <129.2.15.35:46242>
 ID      OWNER    SUBMITTED     RUN_TIME   ST PRI SIZE CMD
 21.0    kayhan   6/14 14:52  0+00:00:03   R  0   0.0  wrapper.csh
1 jobs; 0 idle, 1 running, 0 held
Under ID you have your cluster number (21 in this case), and under
ST (for status) will be R (running), I (idle), S (suspended), H (held), or X
(removed). If a VAMPIRE computer is available (hint: run
condor_status and look for "Unclaimed" to find out), then
your job should run to completion very quickly, and you will see that
you have written a friendly greeting to your output file.
Congratulations! You have just successfully run your first VAMPIRE
job.
Let's spice things up a bit, now. Running one "Hello World"
program is not interesting, so let's run 100. First, delete some
files with rm condor.log errormessages.txt
vampirehello_output. Now edit condor.cfg so that
the last line says Queue 100. Touch the files, chmod the
directory, and submit the job as before. Look at the queue again
(with condor_q) and you will see something like the
following.
chaos> condor_q -g kayhan
-- Schedd: chaos.astro.umd.edu : <129.2.15.35:46242>
 ID      OWNER    SUBMITTED     RUN_TIME   ST PRI SIZE CMD
 22.0    kayhan   6/15 12:53  0+00:00:00   I  0   0.0  wrapper.csh
 22.1    kayhan   6/15 12:53  0+00:00:00   I  0   0.0  wrapper.csh
 22.2    kayhan   6/15 12:53  0+00:00:00   I  0   0.0  wrapper.csh
 22.3    kayhan   6/15 12:53  0+00:00:00   I  0   0.0  wrapper.csh
 22.4    kayhan   6/15 12:53  0+00:00:00   I  0   0.0  wrapper.csh
 22.5    kayhan   6/15 12:53  0+00:00:00   I  0   0.0  wrapper.csh
 22.6    kayhan   6/15 12:53  0+00:00:00   I  0   0.0  wrapper.csh
 22.7    kayhan   6/15 12:53  0+00:00:00   I  0   0.0  wrapper.csh
 22.8    kayhan   6/15 12:53  0+00:00:00   I  0   0.0  wrapper.csh
 22.9    kayhan   6/15 12:53  0+00:00:00   I  0   0.0  wrapper.csh
 22.10   kayhan   6/15 12:53  0+00:00:00   I  0   0.0  wrapper.csh
 22.11   kayhan   6/15 12:53  0+00:00:00   I  0   0.0  wrapper.csh
Of course, the list continues through all 100 jobs (22.0 to 22.99). The number after
the dot in the ID column is the process number. So 22.6 refers to the
job which is cluster 22 and process 6. Use this number to refer to an
individual process or to the whole cluster. For example, the command
to remove a job from the queue is condor_rm. If I ran
condor_rm 22.6, it would remove that one process from the
queue. If I ran condor_rm 22, it would remove all 100
jobs in that cluster from the queue or all of them that remained in
the queue, anyway.
So now you have many friendly greetings in your output file. You may be wondering why it took so long. There are two reasons. First, there is always some overhead involved in getting the processes in memory on other computers and then writing to a networked disk. Second, there were 100 jobs that were trying to open the same file for appending. In fact, you may notice that there are not exactly 100 lines in your output file. Something went wrong there. Even if it did work correctly, having all of the data interspersed would not work in a real-world situation. We need different output files for each process. Let's do that.
Get rid of the output files again. Now edit the wrapper script so that it doesn't redirect stdout to a file:
./$EXECUTABLE < $INPUTFILE
While we're at it, get rid of the sleep statements, too. Next, edit the config file. Change the error message line and add a line after it:
Error = errormessages.$(Process).txt
Output = vampirehello_output.$(Process).txt
What we are doing is letting condor take care of redirecting stdout for us. Condor will redirect stdout (and stderr) to a file that is numbered based on its process number. We need to touch all of these files. Here is how to do that in tcsh:
chaos> touch condor.log
chaos> set i = 0
chaos> while ($i < 100)
while? touch errormessages.${i}.txt vampirehello_output.${i}.txt
while? @ i++
while? end
chaos> chmod -R 777 .
Unfortunately, I don't know how to get condor to pad the process number with zeroes, but at least that makes touching the files an easier task. Submit the job as usual, and see what happens. You should get 100 friendly greetings in 100 different files.
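If you want the files to sort nicely, one workaround is to rename them yourself once all the jobs have finished. A minimal sketch in plain sh (not the tcsh used above), assuming the file names from this example and a three-digit width. The function name pad_outputs is made up for illustration.

```sh
#!/bin/sh
# Rename prefix.N.txt to prefix.NNN.txt (zero-padded to three digits) for
# N = 0 .. count-1, skipping any file that does not exist.
pad_outputs() {
    prefix=$1; count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        padded=$(printf '%03d' "$i")
        if [ -e "$prefix.$i.txt" ] && [ "$i" != "$padded" ]; then
            mv "$prefix.$i.txt" "$prefix.$padded.txt"
        fi
        i=$(( i + 1 ))
    done
}

# Run this only after the jobs are done; Condor still expects the
# unpadded names while they are running.
pad_outputs vampirehello_output 100
pad_outputs errormessages 100
```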
Unless you are running Monte Carlo simulations that seed
themselves, running the same program exactly the same way 100
different times is not that useful. We want different inputs for each
instance of our program. One way to do that is to create a different,
numbered input file and set it up in the same way we set up stderr and
stdout (Input = vampirehello_input.$(Process).txt). It should
be pretty obvious how to do that now so let's try something else. Rather
than giving the wrapper an input file for it to read, we can give it
arguments as if it were at the command line.
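For anyone who does want the numbered-input-file route just mentioned, here is a small sketch in plain sh (not tcsh) that creates one input file per process. The function name and the single placeholder line written to each file are assumptions; what really goes in the file depends on what your program reads.

```sh
#!/bin/sh
# Create PREFIX.N.txt for N = 0 .. count-1, one input file per process.
# The line written to each file is only a placeholder.
make_inputs() {
    prefix=$1; count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "parameter $i" > "$prefix.$i.txt"
        i=$(( i + 1 ))
    done
}

# Matches Input = vampirehello_input.$(Process).txt with Queue 4.
make_inputs vampirehello_input 4
```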
First, get rid of all those output and error files, as well as the log file. Let's not run that many jobs any more. Other people might be using VAMPIRE for real stuff! Next, edit the wrapper so that it prints its first command-line argument:
if ($# > 0) then
echo $1
endif
./$EXECUTABLE < $INPUTFILE
Next, change the queue line in the config file so that it says:
Arguments = First
Queue 2
Arguments = Second
Queue 2
Because we wanted to do something different for each job, we had
to use the Queue keyword twice: once for each set of
processes with those characteristics. We could have changed any of
the settings that we wanted and then queued any number of runs with
those settings. We could have changed the name of the output file,
the executable, anything. Now, remembering that we need 4
errormessages and 4 outputs, touch the files, chmod, and submit again.
Output files 0 and 1 should have "First" written in them, and files 2
and 3 should have "Second" written in them. Now we're cooking with
gas!
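Typing these stanzas by hand gets tedious once you have more than a few argument sets. A minimal sketch in plain sh (not tcsh) that prints them for you; the function name queue_stanzas is made up for illustration.

```sh
#!/bin/sh
# Print one "Arguments = ... / Queue N" stanza per argument given.
# Usage: queue_stanzas RUNS_PER_ARGUMENT ARG...
queue_stanzas() {
    runs=$1
    shift
    for arg in "$@"; do
        printf 'Arguments = %s\nQueue %s\n' "$arg" "$runs"
    done
}

queue_stanzas 2 First Second
```

This prints the same four lines shown above. You could redirect the output onto the end of a file that holds the shared settings (executable, universe, log, and so on) but no Queue line of its own.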
The last thing we need to do is set up a wrapper that figures out if it is starting for the first time or restarting. Delete those obsolete outputs, and let's get to work. First, let's edit the config file:
Arguments = vampirehello_output.$(Process).txt
Queue 2
This tells the wrapper script what its output file is. There are many ways to do this. This is just one method that I thought was particularly straightforward. Change the main body of the wrapper so that it says:
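set i = 0
set checkfile = $1
# If the output file is already there, this may be a restart: the number
# of lines in it is the number of runs already completed.
if (-r $checkfile) then
    set i = `wc $checkfile | awk '{print $1}'`
endif
# Keep running the executable until there have been 10 runs in total.
while ($i < 10)
    ./$EXECUTABLE < $INPUTFILE
    @ i++
end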
The wrapper will now run the executable 10 times. If it gets bumped off one machine and onto another, it can figure out how many times it has already run the executable by counting the lines in the output file. In practice you will probably want to use what has already been output as input of some kind. Submit the job. Remember to touch and chmod, though! You now have twenty friendly greetings in two different files. In all likelihood, the job never had to make use of its ability to know where it was because the job was so short. If it had been a real job that took many hours or many days, that ability would have been essential, since no single machine is ever available for that long.
Now you should be able to write your own wrapper that can see if it has ever run before and use previous output to start in the middle of the job. Be sure to test its ability to start and restart before you submit the job. Try running it, interrupting it in the middle, and seeing whether it restarts properly. It has to be able to do this at any point.
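One way to run such a test without touching Condor at all is sketched below in plain sh (not tcsh). Here fake_job is a hypothetical stand-in for your own wrapper: it resumes by counting lines in its output file, as the wrapper above does. The test starts it in the background, sends it SIGTERM (the one signal Condor is said above to send), and then runs it again.

```sh
#!/bin/sh
# Stand-in for a restartable wrapper: appends one line per step and
# resumes from the number of lines already in the output file.
fake_job() {
    out=$1; total=$2
    i=0
    if [ -r "$out" ]; then
        i=$(( $(wc -l < "$out") + 0 ))
    fi
    while [ "$i" -lt "$total" ]; do
        echo "step $i" >> "$out"
        sleep 1
        i=$(( i + 1 ))
    done
}

# Start the job, interrupt it partway with SIGTERM, then run it again and
# report how many lines the output holds at each stage.
restart_test() {
    out=$1; total=$2
    : > "$out"                      # start from an empty output file
    fake_job "$out" "$total" &
    pid=$!
    sleep 1
    kill -TERM "$pid" 2>/dev/null || true
    wait "$pid" 2>/dev/null || true
    echo "after interruption: $(( $(wc -l < "$out") )) lines"
    fake_job "$out" "$total"
    echo "after restart: $(( $(wc -l < "$out") )) lines"
}

# Hypothetical output file name.
restart_test restart_test_output.txt 3
```

The count after the interruption depends on timing, but the count after the restart should always equal the total, with no step missing or repeated. Varying the sleep before the kill lets you interrupt at different points.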
To learn more about condor, you can read the fine manual here.