This page is our best attempt at getting you up and running on VAMPIRE as fast as possible. VAMPIRE is the Very Awesome Multi-Processor Integrated Research Environment. It combines the Linux desktop computers of the University of Maryland, College Park Department of Astronomy with the Condor software package to form a powerful distributed computing network. This page is designed for department members who want to learn how to use VAMPIRE. Outsiders are welcome to look at this page and learn a little about Condor, but you should know that the way we do things here may be slightly (or drastically) different from what suits your needs.

How to run a job on VAMPIRE.

First, a quick word about what kinds of jobs are appropriate for VAMPIRE. VAMPIRE is good for non-interactive, compute-intensive jobs: the kind of job that you would otherwise run on a normal desktop for a long time. The job can be memory intensive, but it cannot require more memory than a typical individual computer has. VAMPIRE is even better for running several such jobs. If you have five jobs that would each take about one week on the average desktop, you would have to wait over a month for them to complete on one desktop. With VAMPIRE, they would complete in less than two weeks. VAMPIRE is better still at large numbers of smaller jobs. If you can split those five jobs into five hundred two-hour jobs, then VAMPIRE could finish them in less than two days.
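The arithmetic behind these estimates can be sketched in a few lines of Python. This is only an idealized model, assuming identical machines that are never reclaimed by their owners and no scheduling overhead. The function name and the machine counts are made up for illustration and say nothing about how many VAMPIRE machines are actually free at any moment.

```python
import math

def wall_clock_hours(n_jobs, hours_per_job, n_machines):
    """Idealized time to finish n_jobs equal-length jobs on n_machines machines."""
    # Each machine runs one job at a time, so the jobs are processed in rounds.
    rounds = math.ceil(n_jobs / n_machines)
    return rounds * hours_per_job

if __name__ == "__main__":
    week = 7 * 24  # hours in one week
    print(wall_clock_hours(5, week, 1) / 24)   # days on a single desktop
    print(wall_clock_hours(5, week, 5) / 24)   # days with five free machines (hypothetical)
    print(wall_clock_hours(500, 2, 25) / 24)   # days with 25 free machines (hypothetical)
```

In this model a single desktop needs 840 hours (35 days) for the five week-long jobs, while 25 free machines would finish five hundred two-hour jobs in 40 hours. Real runs take longer because jobs get bumped and restarted.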

There are several main environments in which one can run a job with Condor. Condor calls these "universes." The "vanilla universe" is the simplest (and least powerful) of all Condor universes. It is the one with which we are most familiar, and this page deals almost entirely with it. Condor provides other universes as well, including the standard universe.

To understand more about Condor universes, read Condor's description.

In order to run a job on Condor in the vanilla universe, you must be able to run the program as a batch job. This means that all input must come from a file and all output must go to a file, without any user interaction. This is true of all Condor jobs. The vanilla universe has the added requirement that the program has to take care of its own checkpointing. A vanilla universe job runs on whatever machine Condor assigns to it. If that machine suddenly becomes unavailable, Condor will stop the job and start it over from the beginning on another machine. Condor does not dump any memory or send any signals (other than SIGTERM) to the program. It might flush the output streams, though (I'm not sure). When your program is executed, it must be able to figure out whether it is running for the first time or restarting.

This may sound complicated, but there are at least two easy strategies to do this. The easier the strategy is, the less efficient it is. The easiest way is to write no output until the job runs to completion, and then write everything at once. If the job is interrupted, however, the time it spent computing is wasted. The next method is to write output at regular intervals. If the program is restarting, it reads the previous output and figures out what to do next. This is often done with a shell-script wrapper. How often to write output depends on how much computing time you are willing to lose. The most important consideration is how long each individual job will take. If your job will take two weeks, then writing output every hour should be sufficient. If, however, you have one hundred jobs that will take one hour each, then writing output every ten or fifteen minutes might make sense. Writing output too frequently decreases your efficiency.
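The second strategy can be sketched as follows. This is a minimal illustration in Python and not the code used on VAMPIRE. It assumes the job can be split into numbered steps that each produce one line of output, and the names run, completed_steps, and work are hypothetical. The same idea can live in the program itself or in a wrapper script.

```python
import os

def completed_steps(path):
    """Count the complete result lines left behind by an earlier run, if any."""
    if not os.path.exists(path):
        return 0  # first run: nothing has been written yet
    with open(path, "rb+") as f:
        data = f.read()
        # Drop a trailing partial line left by an interruption in mid-write.
        f.truncate(data.rfind(b"\n") + 1)
        return data.count(b"\n")

def run(path, total_steps, work):
    """Run the steps not yet recorded in path; return the step we started from."""
    start = completed_steps(path)
    with open(path, "a") as out:
        for step in range(start, total_steps):
            out.write(f"{step} {work(step)}\n")
            out.flush()               # hand the line to the operating system
            os.fsync(out.fileno())    # ask for it to be written to disk now
    return start
```

Calling run("results.txt", 10, some_function) a second time after an interruption picks up at the first step that has no line in the file. Flushing after every step is the most cautious choice; a long job would flush less often, as discussed above.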

Once the code is prepared, all that remains to be done is to submit the job to Condor. This involves preparing a configuration file for the job and running a simple command.

Examples

Let's go through some examples together. The first example is very simple. You can find all the files you need here. Copy all files to your VAMPIRE machine in a directory that is viewable by the entire network. Compile the C code by running gcc -Wall vampirehello.c -o vampirehello. There should be no warnings. This makes your main executable, but we will run it within the wrapper, wrapper.csh.

If you take a look inside the wrapper, you will see that all it does is run vampirehello and redirect input and output. There is also a sleep command in there so that we can see what is going on. Next, run the command pwd and make sure that the directory is accessible from another computer (i.e., make sure the path starts with /n/diskname/).

Look inside condor.cfg now. It gives the name of the executable, the type of universe, the name of the Condor log file (useful for working out why the job did or did not run), a line that tells Condor you want an email sent to you if there is an error, the name of the file to which you want stderr to point, and the magic word Queue. This tells Condor to put exactly one job in the queue. That job will have all of the settings described in the config file up to that point.

Because the job will be run as user condor, you need to give condor permission to read, write, and execute the files necessary. The way I do this is to run touch vampirehello_output condor.log errormessages.txt and then run chmod -R 777 . at the command line. Make sure that you do not have any sensitive files in or beneath your current directory. The reason that I create the files rather than allowing condor to create them is that I will be the owner of them. You can avoid this potential problem, I believe, by using the standard universe. [N.B. It may be possible to avoid this chmod unpleasantness by activating the setuid bit on the executable. Currently, setuid is deactivated for most filesystems for obvious security reasons. In the meantime, make sure to take precautions like setting the sticky bit in your VAMPIRE directory.]

Now all you have to do is submit the job by running /local/pkg/condor/release/bin/condor_submit condor.cfg. Feel free to put /local/pkg/condor/release/bin in your path (I will assume you have done so from now on). You must be logged into a VAMPIRE machine to submit a job, and the preferred method is to log into ida (the master condor node). Immediately after submitting your job, run condor_q -g your_username. You should see your job listed something like the following.

chaos> condor_q -g kayhan 


-- Schedd: chaos.astro.umd.edu : <129.2.15.35:46242>

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  21.0   kayhan          6/14 14:52   0+00:00:03 R  0   0.0  wrapper.csh       

1 jobs; 0 idle, 1 running, 0 held

Under ID you have your cluster number (21 in this case), and under ST (for status) will be R (running), I (idle), H (held), S (suspended), or X (removed). If a VAMPIRE computer is available (hint: run condor_status and look for "Unclaimed" to find out), then your job should run to completion very quickly, and you will see that you have written a friendly greeting to your output file. Congratulations! You have just successfully run your first VAMPIRE job.

Let's spice things up a bit, now. Running one "Hello World" program is not interesting, so let's run 100. First, delete some files with rm condor.log errormessages.txt vampirehello_output. Now edit condor.cfg so that the last line says Queue 100. Touch the files, chmod the directory, and submit the job as before. Look at the queue again (with condor_q) and you will see something like the following.

chaos> condor_q -g kayhan


-- Schedd: chaos.astro.umd.edu : <129.2.15.35:46242>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  22.0   kayhan          6/15 12:53   0+00:00:00 I  0   0.0  wrapper.csh       
  22.1   kayhan          6/15 12:53   0+00:00:00 I  0   0.0  wrapper.csh       
  22.2   kayhan          6/15 12:53   0+00:00:00 I  0   0.0  wrapper.csh       
  22.3   kayhan          6/15 12:53   0+00:00:00 I  0   0.0  wrapper.csh       
  22.4   kayhan          6/15 12:53   0+00:00:00 I  0   0.0  wrapper.csh       
  22.5   kayhan          6/15 12:53   0+00:00:00 I  0   0.0  wrapper.csh       
  22.6   kayhan          6/15 12:53   0+00:00:00 I  0   0.0  wrapper.csh       
  22.7   kayhan          6/15 12:53   0+00:00:00 I  0   0.0  wrapper.csh       
  22.8   kayhan          6/15 12:53   0+00:00:00 I  0   0.0  wrapper.csh       
  22.9   kayhan          6/15 12:53   0+00:00:00 I  0   0.0  wrapper.csh       
  22.10  kayhan          6/15 12:53   0+00:00:00 I  0   0.0  wrapper.csh       
  22.11  kayhan          6/15 12:53   0+00:00:00 I  0   0.0  wrapper.csh       

Of course, it will go all the way down to 100. The number after the dot in the ID column is the process number. So 22.6 refers to the job which is cluster 22 and process 6. Use this number to refer to an individual process or to the whole cluster. For example, the command to remove a job from the queue is condor_rm. If I ran condor_rm 22.6, it would remove that one process from the queue. If I ran condor_rm 22, it would remove all 100 jobs in that cluster from the queue or all of them that remained in the queue, anyway.

So now you have many friendly greetings in your output file. You may be wondering why it took so long. There are two reasons. First, there is always some overhead involved in getting the processes into memory on other computers and then writing to a networked disk. Second, there were 100 jobs trying to open the same file for appending. In fact, you may notice that there are not exactly 100 lines in your output file. Appends made from different machines to one file on a networked disk are not coordinated, so some greetings were lost. Even if it had worked correctly, having all of the data interspersed in one file would not work in a real-world situation. We need a different output file for each process. Let's do that.

Get rid of the output files again. Now edit the wrapper script so that it doesn't redirect stdout to a file:

./$EXECUTABLE < $INPUTFILE

While we're at it, get rid of the sleep statements, too. Next, edit the config file. Change the error message line and add a line after it:

Error = errormessages.$(Process).txt
Output = vampirehello_output.$(Process).txt

What we are doing is letting condor take care of redirecting stdout for us. Condor will redirect stdout (and stderr) to a file that is numbered based on its process number. We need to touch all of these files. Here is how to do that in tcsh:

chaos> touch condor.log
chaos> set i = 0
chaos> while ($i < 100)
while? touch errormessages.${i}.txt vampirehello_output.${i}.txt
while? @ i++
while? end
chaos> chmod -R 777 .

Unfortunately, I don't know how to get condor to pad the process number with zeroes, but at least that makes touching the files an easier task. Submit the job as usual, and see what happens. You should get 100 friendly greetings in 100 different files.

Unless you are running Monte Carlo simulations that seed themselves, running the same program exactly the same way 100 different times is not that useful. We want different inputs for each instance of our program. One way to do that is to create a different, numbered input file and set it up in the same way we set up stderr and stdout (Input = vampirehello_input.$(Process).txt). It should be pretty obvious how to do that now so let's try something else. Rather than giving the wrapper an input file for it to read, we can give it arguments as if it were at the command line.
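For completeness, the numbered-input approach could be prepared with a short script like the one below. This is a sketch under stated assumptions: the file name pattern comes from the Input line above, while the function name write_input_files, the make_input callback, and the idea of one parameter per file are hypothetical.

```python
import os

def write_input_files(directory, n_processes, make_input):
    """Write vampirehello_input.<N>.txt for N = 0 .. n_processes-1; return the paths."""
    paths = []
    for process in range(n_processes):
        path = os.path.join(directory, f"vampirehello_input.{process}.txt")
        with open(path, "w") as f:
            f.write(make_input(process))  # contents differ from process to process
        os.chmod(path, 0o644)  # world-readable, since the job does not run as you
        paths.append(path)
    return paths
```

For example, write_input_files(".", 100, lambda n: f"{n}\n") gives each process its own number as input, which a program could use as a random seed.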

First, get rid of all those output and error files, as well as the log file. Let's not run that many jobs any more. Other people might be using VAMPIRE for real stuff! Now edit the wrapper so that it prints a command line argument:

if ($# > 0)	then
    echo $1
endif
./$EXECUTABLE < $INPUTFILE

Next, change the queue line in the config file so that it says:

Arguments = First
Queue 2

Arguments = Second
Queue 2

Because we wanted to do something different for each job, we had to use the Queue keyword twice: once for each set of processes with those characteristics. We could have changed any of the settings that we wanted and then queued any number of runs with those settings. We could have changed the name of the output file, the executable, anything. Now, remembering that we need 4 errormessages and 4 outputs, touch the files, chmod, and submit again. Output files 0 and 1 should have "First" written in them, and files 2 and 3 should have "Second" written in them. Now we're cooking with gas!
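The bookkeeping described here (which process number gets which argument, and which files must exist beforehand) can be modeled with a small Python sketch. It only mirrors the behavior described on this page and does not talk to Condor; the function names are hypothetical.

```python
def expand_queue(blocks):
    """blocks: (arguments, count) pairs, one per Queue line, in config-file order."""
    jobs = []
    for arguments, count in blocks:
        for _ in range(count):
            # Process numbers keep counting up across Queue lines in one cluster.
            jobs.append((len(jobs), arguments))
    return jobs

def files_to_touch(jobs):
    """Names of the stderr and stdout files to create before submitting."""
    names = []
    for process, _ in jobs:
        names.append(f"errormessages.{process}.txt")
        names.append(f"vampirehello_output.{process}.txt")
    return names

if __name__ == "__main__":
    jobs = expand_queue([("First", 2), ("Second", 2)])
    for process, arguments in jobs:
        print(process, arguments)
    print(files_to_touch(jobs))
```

For the config above, this pairs processes 0 and 1 with "First" and processes 2 and 3 with "Second", and lists the eight files that need to be touched.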

The last thing we need to do is set up a wrapper that figures out if it is starting for the first time or restarting. Delete those obsolete outputs, and let's get to work. First, let's edit the config file:

Arguments = vampirehello_output.$(Process).txt
Queue 2

This tells the wrapper script what its output file is. There are many ways to do this. This is just one method that I thought was particularly straightforward. Change the main body of the wrapper so that it says:

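# Start from zero unless an earlier run left output behind.
set i = 0
set checkfile = $1
if (-r $checkfile) then
    # The first field of wc's output is the number of lines already written.
    set i = `wc $checkfile | awk '{print $1}'`
endif
# Keep running the executable until there are 10 results in total.
while ($i < 10)
    ./$EXECUTABLE < $INPUTFILE
    @ i++
end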

The wrapper will now run the executable 10 times. If it gets bumped off of one machine and onto another, it will be able to figure out how many times it has already run the executable by counting the number of lines in the output file. In practice you will probably want to use what has already been output as input of some kind. Submit the job. Remember to touch and chmod, though! You now have twenty friendly greetings in two different files. In all likelihood, the job never had to make use of its ability to know where it was, because the job was so short. If it had been a real job that took many hours or many days, then this ability would have been essential, since no single machine is ever available for that long.

Now you should be able to write your own wrapper that can tell whether it has ever run before and use previous output to start in the middle of the job. Be sure to test its ability to start and restart before you submit the job. Try running it, interrupting it in the middle, and seeing whether it restarts properly. It has to be able to do this at any point.

To learn more about condor, you can read the fine manual here.
