VAMPIRE

VAMPIRE (Very Awesome Multi-Processor Interconnected Research Environment) is a serial, distributed computing network created and run by Kayhan Gultekin, Zoe Leinhardt, and Kevin Walsh. It takes Astronomy department PCs that would otherwise sit idle at night and turns them into the third most powerful (albeit serial) cluster in the department.

Proto-VAMPIRE is now running. There are currently about 43 processors on the network, all managed by Condor. This web page will eventually contain information on the status of VAMPIRE (!ON or !OFF) and the jobs that are running on it. Maybe Dave Rupke will teach me enough Perl so that I can have a webpage that reflects what is going on in real time!

VAMPIRE is undead. (!ON)
VAMPIRE is asleep. (!OFF)

WHAT IS VAMPIRE?

Very Awesome Multi-Processor Interconnected Research Environment

VAMPIRE is a series of computers running the Condor software package. Condor enables jobs to be run on these computers when they would otherwise be idle.

Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor, Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.

--Condor webpage

Unused Cycles in the Night

  • Most machines are idle at night and on weekends.
  • If you can see this bullet point, Kevin never made the plot.
  • Condor monitors machines for periods of inactivity and puts them to use.

    High-speed Processors in the Night

  • 14 processors unused for about 14 hours every day.
  • Equivalent to a dedicated cluster of about 8 machines (14 processors × 14/24 of each day ≈ 8.2).
  • Constantly getting bigger.
  • Will run on every public Linux box.
  • Also runs on the private Linux boxes of very generous owners.

    How it Draws Blood

  • Advertising daemons: each machine runs a Condor daemon that advertises its state to the rest of the pool.
  • Condor sees what machines are available and runs jobs on them.
  • When someone logs back into the machine (or unidles it), Condor removes the job from that particular machine and finds another home for it.
  • The job is not simply niced; it is taken off the machine entirely.

    Why Should you join the Army of the Night?

  • Use Condor if you want more computing time for CPU-intensive jobs. Who doesn't want more computing time?
  • Numerical Simulations
  • Automated Data Analysis/Reduction
  • Donate your private machine because:
      • There is essentially no impact on your end.
      • We will bump up your priority if you do.
  • Derek, Cole, Chris, MarkW, and Glen have already volunteered their machines!

    Good Blood and Bad Blood for VAMPIRE

    Good Blood

  • Embarrassingly parallel jobs (e.g., 100 different Monte Carlo runs).
  • Several very long runs with frequent checkpoints (CPU hogs).
  • A large array of shortish runs (short runs covering lots of parameter space).
  • A random-number generated novel. (The greatest novel ever written.)

    Bad Blood

  • User input required.
  • Infrequent outputs/checkpoints.
  • Memory-intensive jobs.
  • True Parallel Jobs. (For Now)
  • Java. (For Now)
  • Netscape.


    HOW DO I SET UP A JOB?

    Config Files

    Config files (Condor calls them submit description files) tell Condor what to run, how to run it, in which directory to run it, and on which machines to run it. Let's look at a sample config file:

    Executable = /home/kayhan/condor/run_my_job
    Universe = vanilla
    Log = condor.log
    Input = input.dat
    Output = output.dat
    Error = errormessages.txt
    Notification = Error
    Requirements = Memory >= 512
    Rank = (machine == "crater.astro.umd.edu") || (machine == "grus.astro.umd.edu")

    Initialdir = /home/kayhan/condor/A/
    Queue

    Initialdir = /home/kayhan/condor/B/
    Queue



    Let's take a look at each of those lines in turn.

    Executable = /home/kayhan/condor/run_my_job
    This line tells Condor what executable to run. In our Condor setup, the file must be executable by others, not just by its owner.

    Universe = vanilla
    This line tells Condor in which "universe" to run. There are two main universes: vanilla and standard. In the vanilla universe, when a job is suspended, it simply stops running. When it resumes, the executable is run again. It is up to the process to know whether it is being re-called and how to handle that. In the standard universe, when a job is moved off of a machine, its memory is written to disk and read in again when the job resumes. This may take prohibitively long for some jobs, e.g., ~30 min for an N = 10^5 body simulation. For more information on universes in Condor, see their webpage.

    Log = condor.log
    This is the file that logs important information for statistics and status of your jobs.

    Input = input.dat
    Output = output.dat
    Error = errormessages.txt
    These are the files to use for stdin, stdout, and stderr.

    Notification = Error
    This line tells Condor to email me only if there is an error.

    Requirements = Memory >= 512
    Only run on machines that have at least 512 MB of physical memory. There are many different items that can be used as a requirement.

    Rank = (machine == "crater.astro.umd.edu") || (machine == "grus.astro.umd.edu")
    Prefer to run on machines that match the above criteria if they are available.

    To get an idea of some of the criteria that can be used for Requirements and Rank, look here.

    Initialdir = /home/kayhan/condor/A/
    Queue

    Initialdir = /home/kayhan/condor/B/
    Queue

    This submits the job twice, once in each of two different directories. Separate directories matter if run_my_job always writes its output to a file with the same name, such as myjob.out, because two runs in one directory would overwrite each other's output. If the output file is instead named after the run's initial conditions, then the following could be used:
    Initialdir = /home/kayhan/condor/onlydir/
    Queue 2
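
For a large array of runs, writing one Initialdir/Queue pair per directory by hand gets tedious. The following is a minimal sketch, not part of the VAMPIRE setup, of generating such a config file with a short Python script. The function name make_config and the directory names C, D, and E are assumptions made for illustration; the shared settings are copied from the sample config above.

```python
# Settings shared by every run; values copied from the sample config above.
COMMON = {
    "Executable": "/home/kayhan/condor/run_my_job",
    "Universe": "vanilla",
    "Log": "condor.log",
    "Input": "input.dat",
    "Output": "output.dat",
    "Error": "errormessages.txt",
    "Notification": "Error",
}


def make_config(run_dirs, common=COMMON):
    """Return config-file text that queues one job per directory."""
    lines = [f"{key} = {value}" for key, value in common.items()]
    for run_dir in run_dirs:
        # One Initialdir/Queue pair per run, as in the two-directory example.
        lines.append("")
        lines.append(f"Initialdir = {run_dir}")
        lines.append("Queue")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical layout: one subdirectory per run under /home/kayhan/condor/.
    dirs = [f"/home/kayhan/condor/{name}/" for name in "ABCDE"]
    print(make_config(dirs), end="")
```

Redirecting the printed text to a file gives a config file that can be handed to condor_submit. Each directory must already exist and contain its own input.dat.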


    Wrappers

    Why do we need a wrapper?

  • Condor may start and stop your job several times.
  • Your executable file must "know" whether it is
      • being started for the first time, or
      • being restarted after changing machines.

    General Algorithm

  • Am I starting new or restarting?
  • If new, make a note of it and start job.
  • If restarting, look at most recent checkpoint.
  • While running job, make checkpoints.
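
The steps above can be sketched in a few lines of Python. This is only an illustration of the pattern, not the wrapper used for any of the codes listed under Examples: the checkpoint file name, the JSON format, and the toy "work" loop (a running sum) are all assumptions standing in for a real simulation and its own restart files.

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # hypothetical file name, kept in the run directory


def load_state(path=CHECKPOINT):
    """Return the last saved state, or a fresh state on the first start."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)        # restarting: resume from the checkpoint
    return {"step": 0, "total": 0.0}   # new run: start from scratch


def save_state(state, path=CHECKPOINT):
    """Write the checkpoint via a temp file so a stop mid-write cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)              # atomic rename over the old checkpoint


def run(n_steps, checkpoint_every=100, path=CHECKPOINT):
    """Toy long-running job: sums 0..n_steps-1, checkpointing as it goes."""
    state = load_state(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]   # one unit of "work"
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_state(state, path)
    save_state(state, path)               # final state, so a re-run is a no-op
    return state["total"]


if __name__ == "__main__":
    print(run(1000))
```

If the job is stopped and the executable is run again, at most checkpoint_every steps of work are repeated. How often to checkpoint is a trade-off between the cost of writing the file and the work lost on each restart.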

    Examples

  • HNBody
  • zeus
  • pkdgrav


    HOW DO I KEEP TRACK OF MY JOBS?

    Useful Commands

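    condor_submit
    Submits the jobs described in a config file to the queue.

    condor_q
    Lists the jobs currently in the queue.

    condor_status
    Shows the state of the machines in the pool.

    condor_rm
    Removes jobs from the queue.

    condor_userprio -all
    Shows detailed user priority information.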


    Condor View

    A Java applet that displays current and recent usage of VAMPIRE machines and VAMPIRE jobs.
  • View by machine.
  • View by user.
  • Live Feed?

    IS THIS REALLY WORKING?

    Yes.