Running Cluster Jobs
This page covers how a user can run jobs on the GPN cluster. These jobs can also be used for testing one's own cluster. There are two main methods for submitting jobs on the GPN cluster. The first is to use Condor on the head node. The second is to use the Globus web service GRAM (WS-GRAM).
Submitting a job with Condor:
Submitting a job with Condor means submitting it on the head node (cluster.greatplains.net). You need an account on cluster.greatplains.net in order to submit a job to Condor. If you don't have an account, contact support@greatplains.net to get one.
Connect to cluster.greatplains.net using ssh. The ssh server runs on port 4545, so you will need to specify this port with your ssh client. If you are on Windows, you can use a free ssh client called PuTTY. If you are on Linux or Mac OS X, use one of the terminals that comes with the operating system. The following is an example of a user called test logging in:
ssh -p4545 test@cluster.greatplains.net
After logging in, you will need to download a sample Condor job. We have a sample job that you can use, taken from http://condor.rc.rit.edu/binaries/. The only thing we have modified is the queue count in virial.condor, to reduce the time the jobs take to complete. In your cluster.greatplains.net terminal, download the zip file containing the sample job using wget:
wget http://collaboration.greatplains.net/wiki/images/2/22/Virial.zip
Unzip the zip file:
unzip Virial.zip
Now we are ready to run the job. You can look at the Condor status to see how many processors are available. In your terminal type:
condor_status
The output of the command should look similar to this:
Name               OpSys  Arch   State      Activity  LoadAv  Mem   ActvtyTime
slot1@compute-0-0. LINUX  INTEL  Unclaimed  Idle      0.000   1352  0+01:25:06
slot2@compute-0-0. LINUX  INTEL  Unclaimed  Idle      0.000   1352  0+02:20:08
slot3@compute-0-0. LINUX  INTEL  Unclaimed  Idle      0.000   1352  0+02:20:09
slot1@compute-0-1. LINUX  INTEL  Unclaimed  Idle      0.000   1014  0+02:20:07
slot2@compute-0-1. LINUX  INTEL  Unclaimed  Idle      0.000   1014  0+02:20:08
slot3@compute-0-1. LINUX  INTEL  Unclaimed  Idle      0.000   1014  0+02:20:09
slot4@compute-0-1. LINUX  INTEL  Unclaimed  Idle      0.000   1014  0+02:20:10
slot1@compute-0-2. LINUX  INTEL  Unclaimed  Idle      0.000   1014  0+01:25:04
slot2@compute-0-2. LINUX  INTEL  Unclaimed  Idle      0.000   1014  0+02:20:05
slot3@compute-0-2. LINUX  INTEL  Unclaimed  Idle      0.000   1014  0+02:20:06
slot4@compute-0-2. LINUX  INTEL  Unclaimed  Idle      0.000   1014  0+02:20:07
slot1@compute-0-3. LINUX  INTEL  Unclaimed  Idle      0.000   1014  0+01:25:05
slot2@compute-0-3. LINUX  INTEL  Unclaimed  Idle      0.000   1014  0+02:20:07
slot3@compute-0-3. LINUX  INTEL  Unclaimed  Idle      0.000   1014  0+02:20:08
slot4@compute-0-3. LINUX  INTEL  Unclaimed  Idle      0.000   1014  0+02:20:09

             Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
INTEL/LINUX     15      0        0         15        0           0         0
Total           15      0        0         15        0           0         0
You can see there are a total of 15 processors available, all of them unclaimed and idle. This means we are able to run 15 jobs at once, which is the number of jobs specified in virial.condor.
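If you would rather not count the slots by eye, the check can be scripted. The following is a minimal sketch, not part of the original sample job: the function name count_free_slots is hypothetical, and it assumes the column layout shown above (slot names containing "@", State in the fourth column, Activity in the fifth).

```shell
#!/bin/sh
# count_free_slots (hypothetical helper): read condor_status output on stdin
# and print how many slots are both Unclaimed and Idle.
# Assumption: slot lines start with a name containing "@", with State in
# field 4 and Activity in field 5, as in the sample output above.
count_free_slots() {
    awk '$1 ~ /@/ && $4 == "Unclaimed" && $5 == "Idle" { n++ }
         END { print n + 0 }'
}
```

On the head node it could be used as: condor_status | count_free_slots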
We are now ready to run the job. In your terminal, change directory to the virial directory:
cd virial
In this directory you will see the file virial.condor. This is the submit file we will be submitting to Condor. It creates 15 jobs, each doing the same calculations with different parameters. To submit this file to Condor, type the following in your terminal:
condor_submit virial.condor
The output should look similar to this:
Submitting job(s)...............
Logging submit event(s)...............
15 job(s) submitted to cluster 85.
This means your fifteen jobs were submitted to Condor and are referred to as cluster 85.
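When scripting a submission, it can be handy to capture the cluster number instead of reading it off the screen. The sketch below assumes the "submitted to cluster N." wording shown above; the function name submitted_cluster_id is hypothetical.

```shell
#!/bin/sh
# submitted_cluster_id (hypothetical helper): read condor_submit output on
# stdin and print the cluster number from the "submitted to cluster N." line.
# Lines that do not contain that phrase are ignored.
submitted_cluster_id() {
    sed -n 's/^.*submitted to cluster \([0-9][0-9]*\).*$/\1/p'
}
```

For example, cluster=$(condor_submit virial.condor | submitted_cluster_id) would store the number, and condor_q "$cluster" would then list only the jobs in that cluster.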
To see the progress of the jobs, execute:
condor_q
You should get output similar to:
-- Submitter: cluster.greatplains.net : <129.130.119.230:59379> : cluster.greatplains.net
ID     OWNER  SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
85.0   test   11/17 15:51  0+00:02:04  R   0    0.0   virial7.$$(Arch) 2
85.1   test   11/17 15:51  0+00:02:04  R   0    0.0   virial7.$$(Arch) 2
85.2   test   11/17 15:51  0+00:02:04  R   0    0.0   virial7.$$(Arch) 2
85.3   test   11/17 15:51  0+00:02:03  R   0    0.0   virial7.$$(Arch) 2
85.4   test   11/17 15:51  0+00:02:03  R   0    0.0   virial7.$$(Arch) 2
85.5   test   11/17 15:51  0+00:02:04  R   0    0.0   virial7.$$(Arch) 2
85.6   test   11/17 15:51  0+00:02:03  R   0    0.0   virial7.$$(Arch) 2
85.7   test   11/17 15:51  0+00:02:04  R   0    0.0   virial7.$$(Arch) 2
85.8   test   11/17 15:51  0+00:02:04  R   0    0.0   virial7.$$(Arch) 2
85.9   test   11/17 15:51  0+00:02:05  R   0    0.0   virial7.$$(Arch) 2
85.10  test   11/17 15:51  0+00:02:04  R   0    0.0   virial7.$$(Arch) 2
85.11  test   11/17 15:51  0+00:02:04  R   0    0.0   virial7.$$(Arch) 2
85.12  test   11/17 15:51  0+00:02:04  R   0    0.0   virial7.$$(Arch) 2
85.13  test   11/17 15:51  0+00:02:04  R   0    0.0   virial7.$$(Arch) 2
85.14  test   11/17 15:51  0+00:02:04  R   0    0.0   virial7.$$(Arch) 2

15 jobs; 0 idle, 15 running, 0 held
The output shows the status of the 15 jobs we submitted. Each job is given an ID made up of the cluster number (85) and its job number. The ST column displays what each job is doing. Currently all our jobs are in state R, or running. If the jobs have not begun executing yet, you will see an I for idle. The RUN_TIME column tells you how long each job has been running. When a job completes, its ST changes to C for complete, and then the job no longer shows up in the output of this command. You can continue to execute condor_q to watch the status of your jobs.
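With many jobs in the queue, a per-state tally is easier to read than the full listing. The following sketch assumes the condor_q layout shown above, with the job ID in the first column and ST in the sixth; the function name job_states is hypothetical.

```shell
#!/bin/sh
# job_states (hypothetical helper): read condor_q output on stdin and print
# one line per job state (for example "I 3" or "R 12"), sorted by state.
# Assumption: job lines start with an ID of the form cluster.process and
# carry the ST value in field 6, as in the sample output above.
job_states() {
    awk '$1 ~ /^[0-9]+\.[0-9]+$/ { count[$6]++ }
         END { for (s in count) print s, count[s] }' | sort
}
```

On the head node it could be used as: condor_q | job_states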
When all your jobs have completed, we can look at the results. In your terminal, change directory to the output folder within virial:
cd output
In this folder there will be three files for each of the fifteen jobs: a log file, an error file, and an output file. The log file contains information about the job's submission and execution. The error file contains any errors encountered while the job was executing. Lastly, the output file contains the output of the job. The error files should be empty unless you ran into an error. For example, we can check the error file for job 85.0 with the following command:
cat job_85_0.err
If nothing is returned, there were no errors. To check the output for job 85.0, use the following command:
cat job_85_0.out
You should see something similar to:
Calculating B7 at 1.400000
20000000 steps
sigmaHSRef: 1.500000
B7HS: 1.6273153424513e+03
MC Step size: 0.078649 0.094935
actual ref step freq: 0.329039
hard sphere average: 6.1214156626506e-02 3.1e-02
hard sphere overlap average: 3.5900840712208e-04 3.3e-05
lennard jones average: 7.9881831610044e-04 1.0e-03
lennard jones overlap average: 1.9519373488511e-06 3.9e-08
abs ratio: 3.9057694815257e+03 5.5e+03
Since I didn't write the program, I don't know exactly what the output means, but it does tell us the job executed successfully. You can check the output of the other jobs in the same way. The numbers in the output should be different, since the parameters differ from job to job.
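Checking fifteen error files one at a time with cat gets tedious. A small sketch that lists only the error files that are not empty, assuming they end in .err like job_85_0.err above (the function name nonempty_err_files is hypothetical):

```shell
#!/bin/sh
# nonempty_err_files DIR (hypothetical helper): print the path of every
# *.err file in DIR that contains at least one byte.
nonempty_err_files() {
    for f in "$1"/*.err; do
        # -s is true only when the file exists and its size is greater than zero
        if [ -s "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Run from the output folder as nonempty_err_files . and it prints nothing when every job finished without writing errors.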
Submitting a job with Globus:
We will be using the WS-GRAM service of Globus to submit a job to the cluster. The client tools are installed on the head node, cluster.greatplains.net. If you have the Globus client installed on a different machine, you can use globusrun-ws from there to submit to the head node.
The first thing you need to do is log in to whichever machine has globusrun-ws installed on it. Log in with the same method as in the Condor example above. Make sure you have your DOE user certificate in a .globus folder in your home directory. You also need a grid credential (proxy). If you don't have one, type:
grid-proxy-init
Once you have your grid proxy, you can submit a job with Globus. We will execute a simple hostname program, and its output will be returned to the terminal:
globusrun-ws -submit -F cluster.greatplains.net:9443 -Ft Condor -s -c /bin/hostname
You should see output similar to the following:
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:4618e4f4-6c64-11dc-9d97-0007e9d81215
Termination time: 11/22/2008 19:11 GMT
Current job state: Active
Current job state: CleanUp-Hold
compute-0-3
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.
So the hostname program was executed on the compute node compute-0-3. If you are still waiting for the job to run and you have access to cluster.greatplains.net, you can log in and check the status of your job using condor_q.
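If you save the globusrun-ws output to a file, you can check the last reported state without reading the whole transcript. This sketch assumes each state is printed on its own line beginning with "Current job state: ", as in the run above; the function name final_job_state is hypothetical.

```shell
#!/bin/sh
# final_job_state (hypothetical helper): read globusrun-ws output on stdin
# and print the last state reported on a "Current job state: " line.
final_job_state() {
    grep '^Current job state: ' | tail -n 1 | sed 's/^Current job state: //'
}
```

For the run shown above this would print Done. Lines produced by the job itself, such as compute-0-3, are ignored.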
Condor Submits Your Code
This section shows how to submit a program you wrote or found on the net. The example program calculates pi and is located at /export/home/kate/code/pi-1.c
Since the source code is available, first compile pi-1.c with condor_compile so that it is Condor aware.
condor_compile gcc pi-1.c -o pi1
This will produce an executable file called "pi1".
Second, we need to create a submit file for Condor.
vi submit.condor
Universe = standard
# the executable produced by condor_compile above
Executable = pi1
# pi1 needs the number of digits of pi to calculate
Arguments = 21474836
# where the output goes, pi to many digits
Output = pi-cluster-answer
Log = pi1.log
Error = pi1.error
# run this program once
Queue
Time to submit.
condor_submit submit.condor
To see whether your job is idle (waiting), running, or held, use condor_q:
[kate@cluster code]$ condor_q
-- Submitter: cluster.greatplains.net : <129.130.119.230:59379> : cluster.greatplains.net
ID     OWNER  SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
96.0   kate   12/8 12:58   0+01:55:45  R   0    1.5   pi1 21474836

1 jobs; 0 idle, 1 running, 0 held
After the job has finished, the output will be written to pi-cluster-answer, and Condor will send an email to your account on the cluster. If you want this mail forwarded somewhere else, add a .forward file to your home directory.
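A .forward file is just a text file containing the address that mail should be sent on to. The sketch below writes one; the function name write_forward_file and the address are placeholders, and whether forwarding is honored depends on how mail is set up on the cluster.

```shell
#!/bin/sh
# write_forward_file ADDRESS [HOME_DIR] (hypothetical helper): write ADDRESS
# into a .forward file in HOME_DIR, defaulting to the caller's home directory.
# An existing .forward file is overwritten.
write_forward_file() {
    address=$1
    home_dir=${2:-$HOME}
    printf '%s\n' "$address" > "$home_dir/.forward"
}
```

For example: write_forward_file you@example.org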
Condor Submits Your Code Multiple Times
This example takes some simple code found on the internet and submits it to run 5000 times. The code forks a child process; both processes do some trivial calculation and exit. The code is named cluster-test1.c and is at /home/kate/code/cluster-test1
condor_compile gcc cluster-test1.c -o cluster-test1
vi submit-cluster-test1
# name of the executable from the condor_compile command above
executable = cluster-test1
universe = standard
# puts condor errors in a file, where $(Process) is the process number
error = err.$(Process)
log = cluster-test1.log
# run this job 5000 times
Queue 5000
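Condor numbers the jobs in a cluster starting from 0, so with Queue 5000 the error files run from err.0 to err.4999. The sketch below only imitates that substitution in plain shell to show which file names to expect; it does not call Condor, and the function name expand_process_macro is hypothetical.

```shell
#!/bin/sh
# expand_process_macro TEMPLATE COUNT (hypothetical helper): print TEMPLATE
# once per job, replacing the literal text $(Process) with 0, 1, ..., COUNT-1.
# This imitates how Condor fills in $(Process) for "Queue COUNT".
expand_process_macro() {
    template=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        # \$ makes sed match a literal dollar sign; the parentheses are
        # literal characters in a basic regular expression
        printf '%s\n' "$template" | sed 's/\$(Process)/'"$i"'/g'
        i=$((i + 1))
    done
}
```

For example, expand_process_macro 'err.$(Process)' 3 prints err.0, err.1 and err.2, one per line.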
