Running Cluster Jobs

This page covers how to run jobs on the GPN cluster. The same jobs can also be used to test your own cluster. There are two main methods for submitting jobs on the GPN cluster. The first is using Condor on the head node. The second uses the Globus web service GRAM (WS-GRAM).

Submitting a job with Condor:

Submitting a job with Condor means submitting it on the head node (cluster.greatplains.net). You need an account on cluster.greatplains.net in order to submit a job to Condor. If you don't have an account, contact support@greatplains.net to get one.

Connect to cluster.greatplains.net using ssh. The ssh server runs on port 4545, so you will need to specify this port with your ssh client. If you are on Windows, you can use a free ssh client called PuTTY. If you are on Linux or Mac OS X, use one of the terminals that comes with the operating system. The following is an example of a user called test logging in:
ssh -p4545 test@cluster.greatplains.net
After logging in, download a sample Condor job. We provide a sample job taken from http://condor.rc.rit.edu/binaries/; the only thing we have modified is the queue count in virial.condor, to reduce the time the jobs take to complete. In your cluster.greatplains.net terminal, download the zip file using wget:
wget http://collaboration.greatplains.net/wiki/images/2/22/Virial.zip
Unzip the zip file:
unzip Virial.zip
Before running the job, you can look at the Condor status to see how many processors are available. In your terminal, type:
condor_status
The output of the command should look similar to this:
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@compute-0-0. LINUX      INTEL  Unclaimed Idle     0.000  1352  0+01:25:06
slot2@compute-0-0. LINUX      INTEL  Unclaimed Idle     0.000  1352  0+02:20:08
slot3@compute-0-0. LINUX      INTEL  Unclaimed Idle     0.000  1352  0+02:20:09
slot1@compute-0-1. LINUX      INTEL  Unclaimed Idle     0.000  1014  0+02:20:07
slot2@compute-0-1. LINUX      INTEL  Unclaimed Idle     0.000  1014  0+02:20:08
slot3@compute-0-1. LINUX      INTEL  Unclaimed Idle     0.000  1014  0+02:20:09
slot4@compute-0-1. LINUX      INTEL  Unclaimed Idle     0.000  1014  0+02:20:10
slot1@compute-0-2. LINUX      INTEL  Unclaimed Idle     0.000  1014  0+01:25:04
slot2@compute-0-2. LINUX      INTEL  Unclaimed Idle     0.000  1014  0+02:20:05
slot3@compute-0-2. LINUX      INTEL  Unclaimed Idle     0.000  1014  0+02:20:06
slot4@compute-0-2. LINUX      INTEL  Unclaimed Idle     0.000  1014  0+02:20:07
slot1@compute-0-3. LINUX      INTEL  Unclaimed Idle     0.000  1014  0+01:25:05
slot2@compute-0-3. LINUX      INTEL  Unclaimed Idle     0.000  1014  0+02:20:07
slot3@compute-0-3. LINUX      INTEL  Unclaimed Idle     0.000  1014  0+02:20:08
slot4@compute-0-3. LINUX      INTEL  Unclaimed Idle     0.000  1014  0+02:20:09

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

         INTEL/LINUX    15     0       0        15       0          0        0

               Total    15     0       0        15       0          0        0

You can see there are a total of 15 processors available, all of them unclaimed and idle. This means we are able to run 15 jobs at once, which is the number we have specified in virial.condor.
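If you would rather not count slots by eye, the listing can be summarized with a short script. The following is only a sketch: the function name count_free_slots is made up for this page, and it assumes the column layout shown above (State in the fourth column, Activity in the fifth), which may differ between Condor versions.

```shell
#!/bin/sh
# Hypothetical helper: count slots that are both Unclaimed and Idle.
# Reads condor_status output on stdin and assumes the column layout shown above.
count_free_slots() {
    awk '$1 ~ /^slot[0-9]+@/ && $4 == "Unclaimed" && $5 == "Idle" { n++ }
         END { print n + 0 }'
}

# Only query the pool when condor_status is actually on the PATH.
if command -v condor_status >/dev/null 2>&1; then
    condor_status | count_free_slots
fi
```

The totals table at the bottom of the condor_status output gives the same figure in its Unclaimed column, so the script is mainly useful inside other scripts.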

We are now ready to run the job. In your terminal, change to the virial directory:
cd virial
In this directory you will see the file virial.condor. This is the submit file we will be submitting to Condor. It creates 15 jobs, each doing the same calculations with different parameters. To submit this file to Condor, type the following in your terminal:
condor_submit virial.condor
The output should be similar to:
Submitting job(s)...............
Logging submit event(s)...............
15 job(s) submitted to cluster 85.

This means your fifteen jobs were submitted to condor and are referred to as cluster 85.

To see the progress of the jobs, execute:
condor_q

You should get output similar to:

-- Submitter: cluster.greatplains.net : <129.130.119.230:59379> : cluster.greatplains.net
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  85.0   test           11/17 15:51   0+00:02:04 R  0   0.0  virial7.$$(Arch) 2
  85.1   test           11/17 15:51   0+00:02:04 R  0   0.0  virial7.$$(Arch) 2
  85.2   test           11/17 15:51   0+00:02:04 R  0   0.0  virial7.$$(Arch) 2
  85.3   test           11/17 15:51   0+00:02:03 R  0   0.0  virial7.$$(Arch) 2
  85.4   test           11/17 15:51   0+00:02:03 R  0   0.0  virial7.$$(Arch) 2
  85.5   test           11/17 15:51   0+00:02:04 R  0   0.0  virial7.$$(Arch) 2
  85.6   test           11/17 15:51   0+00:02:03 R  0   0.0  virial7.$$(Arch) 2
  85.7   test           11/17 15:51   0+00:02:04 R  0   0.0  virial7.$$(Arch) 2
  85.8   test           11/17 15:51   0+00:02:04 R  0   0.0  virial7.$$(Arch) 2
  85.9   test           11/17 15:51   0+00:02:05 R  0   0.0  virial7.$$(Arch) 2
  85.10  test           11/17 15:51   0+00:02:04 R  0   0.0  virial7.$$(Arch) 2
  85.11  test           11/17 15:51   0+00:02:04 R  0   0.0  virial7.$$(Arch) 2
  85.12  test           11/17 15:51   0+00:02:04 R  0   0.0  virial7.$$(Arch) 2
  85.13  test           11/17 15:51   0+00:02:04 R  0   0.0  virial7.$$(Arch) 2
  85.14  test           11/17 15:51   0+00:02:04 R  0   0.0  virial7.$$(Arch) 2

15 jobs; 0 idle, 15 running, 0 held

The output shows the status of the 15 jobs we submitted. Each job is given an ID made up of the cluster number (85) and its job number. The ST column displays what each job is doing. Currently all our jobs are in state R (running). If a job has not begun executing yet, you will see an I for idle. The RUN_TIME column tells you how long each job has been running. When a job completes, its ST changes to C for complete, and after that the job no longer shows up in the output of this command. You can continue to execute condor_q to watch the status of your jobs.
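Instead of re-running condor_q by hand, you could poll it from a small script. This is a sketch under assumptions: jobs_remaining is a made-up helper name, the 30 second interval is arbitrary, and the script relies on the summary line having the form "15 jobs; 0 idle, 15 running, 0 held" as shown above. If that line is missing or formatted differently, the helper reports 0 and the loop ends early. Note that condor_q as used here may also list jobs belonging to other users.

```shell
#!/bin/sh
# Hypothetical helper: print the job count from the condor_q summary line,
# e.g. "15 jobs; 0 idle, 15 running, 0 held" gives 15. Prints 0 if no such line.
jobs_remaining() {
    awk '/^[0-9]+ jobs;/ { n = $1 } END { print n + 0 }'
}

# Poll until the queue is empty; only runs when condor_q is on the PATH.
if command -v condor_q >/dev/null 2>&1; then
    while [ "$(condor_q | jobs_remaining)" -gt 0 ]; do
        sleep 30
    done
    echo "All jobs have left the queue."
fi
```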

When all your jobs complete we can look at the results. In your terminal change directory to the output folder within virial:
 cd output
In this folder there will be three files for each of the fifteen jobs: log, error, and output files. The log files give information about the job submission and execution. The error files contain any errors encountered while the job was executing. Lastly, the output files contain the output of the job. The error files should be empty unless you ran into an error. For example, we can check the error file for job 85.0 with the following command:
cat job_85_0.err
If nothing is returned that means there were no errors. To check the output for job 85.0 use the following command:
cat job_85_0.out
You should see something similar to:
Calculating B7 at 1.400000
20000000 steps
sigmaHSRef: 1.500000
B7HS:  1.6273153424513e+03
MC Step size:  0.078649 0.094935
actual ref step freq: 0.329039
hard sphere average:            6.1214156626506e-02  3.1e-02
hard sphere overlap average:    3.5900840712208e-04  3.3e-05
lennard jones average:          7.9881831610044e-04  1.0e-03
lennard jones overlap average:  1.9519373488511e-06  3.9e-08
abs ratio:   3.9057694815257e+03  5.5e+03
Since I didn't write the program, I don't know exactly what the output means, but it does tell us the job executed successfully. You can check the output of the other jobs in the same way; the numbers in the output should be different since the parameters differ from job to job.
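Checking fifteen error files one at a time gets tedious. The sketch below lists only the error files that are not empty; nonempty_err_files is a made-up helper name, and it assumes the error files end in .err like job_85_0.err above.

```shell
#!/bin/sh
# Hypothetical helper: list the non-empty .err files in a directory
# (default: the current directory). Prints nothing when all are empty.
nonempty_err_files() {
    dir=${1:-.}
    for f in "$dir"/*.err; do
        # -s is true only when the file exists and is larger than zero bytes
        [ -s "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

Run from inside the output folder as nonempty_err_files with no argument. No output means every job finished without writing to its error file.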

Submitting a job with Globus:

We will use Globus WS-GRAM to submit a job to the cluster. The client tools are installed on the head node, cluster.greatplains.net. If you have the Globus client installed on a different machine, you can use globusrun-ws from there to submit to the head node.

First, log in to whichever machine has globusrun-ws installed. Log in using the same method as in the Condor example above. Make sure you have your DOE user certificate in a .globus folder in your home directory. You also need a grid proxy credential. If you don't have one, type:
grid-proxy-init
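If you are not sure whether you already hold a valid proxy, you can ask grid-proxy-info first and only create a new one when needed. This is a sketch: ensure_proxy is a made-up function name, and it assumes the grid-proxy-info shipped with your Globus client supports the -exists option (which returns 0 when a valid proxy exists).

```shell
#!/bin/sh
# Hypothetical helper: create a grid proxy only if no valid one exists.
# Assumes grid-proxy-info and grid-proxy-init are on the PATH.
ensure_proxy() {
    if grid-proxy-info -exists >/dev/null 2>&1; then
        echo "A valid grid proxy already exists."
    else
        echo "No valid proxy found; running grid-proxy-init." >&2
        grid-proxy-init
    fi
}

# Only run the check on machines where the Globus client tools are installed.
if command -v grid-proxy-info >/dev/null 2>&1; then
    ensure_proxy
fi
```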
Once you have your grid-proxy you can now submit a job with globus. We execute a simple hostname program and the output will be returned to the terminal.
globusrun-ws -submit -F cluster.greatplains.net:9443 -Ft Condor -s -c /bin/hostname
You should see the following output:
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:4618e4f4-6c64-11dc-9d97-0007e9d81215
Termination time: 11/22/2008 19:11 GMT
Current job state: Active
Current job state: CleanUp-Hold
compute-0-3
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.

So the hostname program was executed on the compute node compute-0-3. If you are still waiting for the job to run and you have access to cluster.greatplains.net, you can log in and check the status of your job using condor_q.

Condor Submits Your Code

This is how to submit a program you wrote or found on the net. This example program calculates pi; its source is /export/home/kate/code/pi-1.c

Since the source code is available, first compile pi-1.c so it is condor aware.

condor_compile gcc pi-1.c -o pi1

This will produce an executable file called "pi1".

Second we need to create a submit file for condor.

vi submit.condor
Universe = standard
Executable = pi1
# pi1 needs the number of digits of pi to calculate
Arguments = 21474836
# where the output goes, pi to many digits
Output = pi-cluster-answer
Log = pi1.log
Error = pi1.error
# run this program once
Queue

Time to submit.

condor_submit submit.condor

To see whether your job is idle (waiting) or running, use condor_q:

[kate@cluster code]$ condor_q


-- Submitter: cluster.greatplains.net : <129.130.119.230:59379> : cluster.greatplains.net
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  96.0   kate           12/8  12:58   0+01:55:45 R  0   1.5  pi1 21474836      

1 jobs; 0 idle, 1 running, 0 held

After the job is finished, the output will be written to pi-cluster-answer, and Condor will send an email to your account on the cluster. If you want this mail forwarded somewhere else, add a .forward file to your home directory.

Condor Submits Your Code Multiple Times

In this example, some simple code found on the internet is submitted to run 5000 times. It forks a child process, and both processes do some trivial calculation and exit. The name of the code is cluster-test1.c, and it is at /home/kate/code/cluster-test1

condor_compile gcc cluster-test1.c -o cluster-test1
vi submit-cluster-test1

# name of the executable from the condor_compile command above
executable = cluster-test1
universe = standard
# puts each job's error output in its own file, where $(Process) is the process number
error = err.$(Process)
log = cluster-test1.log

# run this job 5000 times
Queue 5000
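Because Queue 5000 numbers the jobs from 0 to 4999, $(Process) produces error files err.0 through err.4999. Once the jobs have run, a quick way to spot jobs that never produced an error file is to walk that range. The sketch below uses a made-up helper name, missing_err_files, and assumes the files are written to the directory you submitted from.

```shell
#!/bin/sh
# Hypothetical helper: print each process number in 0 .. count-1
# that has no err.<n> file in the given directory (default: current).
missing_err_files() {
    count=$1
    dir=${2:-.}
    i=0
    while [ "$i" -lt "$count" ]; do
        [ -e "$dir/err.$i" ] || printf '%s\n' "$i"
        i=$((i + 1))
    done
}
```

For the submit file above you would call it as missing_err_files 5000. An empty result means every process number has its error file.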