Setting up a Condor cluster

By M. Shuaib Khan on September 01, 2006 (8:00:00 AM)

Have you ever been in a situation where you had to run multiple instances of the same application, with different input data each time, in sequence, because the job was too computation-intensive and your machine not powerful enough to run all the instances simultaneously? The solution to that problem could be to harness the machines that are already connected to your local network and apply their unused CPU cycles to your projects. Condor, a specialized batch system for managing compute-intensive jobs, may be your answer.

Condor lets you queue multiple jobs, searches for idle machines on the network (those with no keyboard activity, a low load average, and no active Telnet users), submits the jobs to them, and then returns the results to the machine from which they were submitted. Condor is a batch system -- once a job is submitted, there is no interaction between the job and the user. Any input to the job must be in a file that is submitted along with the executable to the Condor pool, and all output produced during execution is written to a file that is sent back to the submitting machine as the result.

Setting up Condor

Before you set up a Condor pool, you need to know the four roles a machine can play in your pool:

  • Central manager -- The central manager collects information about the resources available to the pool, and negotiates between a machine that is submitting a job and the machine that will execute the job. Only one machine in a pool can play this role.
  • Execute machine -- Any machine (including the central manager) configured to execute jobs submitted to the pool.
  • Submit machine -- Any machine (including the central manager) configured to submit jobs to the pool.
  • Checkpoint server -- Any one machine in the pool can act as a backup machine for the jobs running on the pool. Setting one up is optional, and for our basic pool, we are going to ignore it.

Before you set up a Condor pool, you must decide which machine will play the central manager role, and which of the remaining clients are going to be the submit and execute machines (or both). For the simplest case, we'll set up a pool of two machines. One will be the central manager and also a submit and execute machine; the other will be only a submit/execute machine. You can use the same procedure to set up Condor on as many machines as you want.

Before you set up Condor on a machine, create a Condor user on that machine whose home directory will hold Condor-related files, such as logs.

# groupadd condor
# useradd -m -g condor condor

Now copy the downloaded Condor tar-archive into /home/condor and unpack it. Change into the unpacked directory, which I'll refer to as the release directory, and run the condor_configure script in order to install Condor on the machine:

# ./condor_configure --install --type=execute,submit,manager --local-dir=/home/condor --verbose

The command above configures the central manager. To configure a submit/execute machine, use slightly different syntax:

# ./condor_configure --install --type=execute,submit --local-dir=/home/condor --central-manager=<hostname of central manager> --verbose

If you ever want to change the configuration of Condor on a machine, you can run the script again.

Open the etc/condor_config file in the release directory. Set the LOCAL_DIR variable to /home/condor, and set the HOSTALLOW_WRITE variable to an appropriate value (e.g. '*', which lets any host write to the pool, so use a narrower pattern on a network you do not fully trust). Make sure /dev/mouse points to your mouse device and /var/run/utmp points to the utmp file on your machine. Next, edit the /home/condor/condor_config.local file and set CONDOR_IDS to '0.0' (the format is uid.gid). This tells Condor to run its daemons as root.
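A quick way to confirm that the edits landed is to read the values back out of the file. The sketch below is a hypothetical helper (get_setting is not part of Condor). It assumes plain NAME = value lines and prints the last definition, since later definitions override earlier ones. It does not expand macros such as $(LOCAL_DIR) and does not follow the local configuration file.

```shell
# get_setting <config file> <NAME>
# Print the value from the last "NAME = value" line in the file.
# Hypothetical helper for illustration; not a Condor tool.
get_setting() {
    # Strip "NAME =" (with optional surrounding whitespace) and print
    # what remains; tail keeps only the final definition.
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}
```

For example, get_setting etc/condor_config LOCAL_DIR, run from the release directory, prints whatever LOCAL_DIR was last set to in that file. Once Condor is installed, its own condor_config_val tool is the better check, because it reports the value the daemons will actually use.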

Copy the files in the bin subdirectory of the release directory to a well-known location (such as /usr/local/bin) so that Condor users have access to them, and copy the files in the sbin subdirectory to a location that only the administrator has on the PATH.

Now you're ready to run condor_master on each machine to start the daemons:

# condor_master

On the central manager, you should see the following daemons when you run ps aux | egrep condor_:

  • condor_master
  • condor_collector
  • condor_negotiator
  • condor_startd
  • condor_schedd

On other machines, the following daemons should be running:

  • condor_master
  • condor_startd
  • condor_schedd

If you don't see these daemons running, there is a problem with your configuration. Look at /home/condor/log/MasterLog to figure out what might be wrong.
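This check can be scripted so that it names the daemon that is missing. A minimal sketch, assuming a POSIX shell and a grep that supports -w; missing_daemons is a hypothetical helper, not a Condor tool:

```shell
# missing_daemons "<process listing>" daemon1 [daemon2 ...]
# Print the name of every expected daemon that does not appear
# in the given process listing. Prints nothing if all are present.
missing_daemons() {
    listing=$1
    shift
    for daemon in "$@"; do
        # -w matches the name as a whole word, so a listing that
        # contains only a longer name (e.g. condor_master_off)
        # does not count as condor_master.
        if ! printf '%s\n' "$listing" | grep -qw "$daemon"; then
            printf '%s\n' "$daemon"
        fi
    done
}
```

On the central manager you might call it as missing_daemons "$(ps aux)" condor_master condor_collector condor_negotiator condor_startd condor_schedd; on the other machines, pass only the three daemons expected there. No output means every daemon was found.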

You can run condor_status on any machine to list the machines that are currently in the pool, and their state (Claimed, Unclaimed, Owner, etc.).

Once Condor is running, it's time to put it to use. To submit a job to Condor, you need to write a description file for it, as in the example below.

#Example description file foo.cmd for job foo
Executable = foo
Universe = vanilla
input = test.data
output = foo.out
error = foo.error
Log = foo.log
Queue

The Executable variable points to the program to be run (it's a good idea to specify the absolute path to the executable). input names the file that foo reads as its standard input, output names the file that receives foo's standard output, and error names the file that receives anything foo writes to standard error. Condor writes a log of what happened during the job to the file named by Log.
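Since input, output, and error correspond to the job's standard input, output, and error streams, you can rehearse a job on your own machine with the same redirections before handing it to the pool. A minimal sketch; run_locally is a hypothetical helper, and the file names in the usage note are the ones from the example description file:

```shell
# run_locally <executable> <input file> <output file> <error file>
# Run the executable once with the same redirections that the
# input/output/error entries of a description file ask Condor to
# set up on the execute machine.
run_locally() {
    "$1" < "$2" > "$3" 2> "$4"
}
```

For the example job this would be run_locally ./foo test.data foo.out foo.error. If foo.out looks right and foo.error is empty, the same files should behave the same way under the vanilla universe.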

Now you can submit the description file as a Condor job:

$ condor_submit foo.cmd

If you would like to run multiple instances of the same job with a different input file for each instance, here is how to write the description file:

#Example 2:
Executable = foo
Universe = vanilla
Error = error.$(Process)
Input = input.$(Process)
Output = output.$(Process)
Log = foo.log
Queue 100

Note the entry Queue 100. It tells Condor to run 100 instances of the job. $(Process) expands to each instance's process number, from 0 to 99, so the first instance reads input.0 and writes output.0 and error.0, the second uses input.1, and so on.
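Condor numbers the instances from 0, so with Queue 100 the files input.0 through input.99 need to exist before you submit. A sketch of preparing them, with make_inputs as a hypothetical helper and the file contents only a placeholder for your real data:

```shell
# make_inputs <directory> <count>
# Create input.0 .. input.(count-1) in the directory, one file per
# queued instance, matching Input = input.$(Process).
make_inputs() {
    dir=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        # Placeholder content: each instance just gets its own number.
        printf '%s\n' "$i" > "$dir/input.$i"
        i=$((i + 1))
    done
}
```

Running make_inputs . 100 in the directory you submit from creates the hundred input files the second example expects.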

To check the Condor queue and have a look at the status of the jobs being submitted, run:

$ condor_q

To remove a job from the queue, use the job ID that condor_q returns:

$ condor_rm <job_id>

Conclusion

Condor is a powerful yet easy-to-use software system for managing a cluster of workstations. You can configure it in various ways, such as allowing it to run jobs only at night, or only on particular machines or machines with particular resources. The owner of any machine in the Condor pool can change that machine's Condor configuration to their liking, so that the jobs executed on it are of a particular type or run at a particular time. Turn to the official documentation for ways to tune Condor for your needs.

Comments

Who's on first?

Posted by: Anonymous Coward on September 02, 2006 12:43 AM
"The solution to that problem could be to harness the machines that are already connected to your local network and apply their unused CPU cycles to your projects. "

Of course NextStep systems already had this with Zilla.

#

Ok Unix had it first.

Posted by: Anonymous Coward on September 04, 2006 03:33 PM
Linux got Mosix very quickly for unix in the form of Open Mosix. Mosix allows clients to join and leave at will.

Condor is a progression to hopefully a better interface and more effective system.

NextStep Zilla was nice but not first.

#

Is condor better than Beowulf or Open Mosix?

Posted by: Anonymous Coward on September 04, 2006 06:10 AM
Condor sounds a lot like Open Mosix. Why use Condor when Open Mosix is a mature product?

And, why configure the Condor software to run as root? It seems that all you have to be able to do is to submit a batch job. Any normally privileged user should be able to do that.

#

Re:Is condor better than Beowulf or Open Mosix?

Posted by: Anonymous Coward on September 04, 2006 10:33 PM
No, Condor is NOT better than OpenMOSIX.

OpenMOSIX doesn't just support batch jobs; it supports nearly all types of jobs, and you can migrate just about anything.
Condor is specially built for batch jobs, and in my experience is "primitive" compared to OpenMOSIX.

#

Re:Is condor better than Beowulf or Open Mosix?

Posted by: Anonymous Coward on September 06, 2006 03:27 AM
I would consider both Condor and Open Mosix as "mature". Condor has been around and evolving for over a decade and is in use at thousands of sites. Similar story for Mosix / Open Mosix.

They are different. Open Mosix is more a transparent load balancing system, Condor is more of a batch queueing/scheduling system (as is LSF, PBS, Torque, ...). Which approach is "better" depends very much on your workload. Load balancing is often preferred when the workload is heavy with interactive or short-lived tasks, batch queueing/scheduling is often preferred for longer-lived batch tasks. Also, OpenMosix assumes a Beowulf-style dedicated compute cluster setup. Although Condor can manage a dedicated cluster, it does not make this assumption: it can also manage across "grid" wide-area environments, non-dedicated desktop machines, meta-schedule across other schedulers, etc. But assumptions can simplify life if the assumptions apply to your desired setup.

As for why one would start Condor as root: the answer is that you do not have to do so. But if you do start the Condor daemons as root, then Condor is able to run jobs on that node with the same UID as the submitting user (i.e. root is used for UID switching), and also Condor is able to enforce node policies even in the face of a non-cooperating job (e.g. it is hard for management daemons to kill a job after X minutes if the job can just kill the daemons first!).

#

Re:Is condor better than Beowulf or Open Mosix?

Posted by: Anonymous Coward on October 30, 2006 08:41 PM
...openmosix is not ready for 2.6 kernels; and with new hardware that runs only with a new kernel, condor is a good choice!

#

Plan 9

Posted by: Anonymous Coward on September 06, 2006 02:46 AM
Makes me think of Plan 9. Distributed operating system. P9 protocol. Network transparency.

#
