1. BMC Beowulf Supercomputer: What's New
1.1. Current Status:
Number of Nodes online: 47
Nodes offline: 35
Last checked date: Nov 7, 2006
1.2. Schedule
User usage requests
| User | Start Date/Time | Stop Date/Time | Nodes needed (by number, or "all") | Notes |
| Doug Blank | | | | |
| Tom Carroll | 9/29 | 10/06 | all | I can use fewer nodes, if needed. Please let me know - tcarroll |
| Helen Grundman | | | | |
| Mike Noel | | | | |
1.3. Detailed Status
1.3.1. New dispatch command
The new dispatch replaces the old dispatch and pdispatch commands. It works as follows.
dispatch [[flags]* [includes]* [excludes]*]* [command]+
The flags:
-p run in parallel
-d debug mode: don't actually do anything, just print out what it would do
-r report format (output all on one line per node)
-c add color to output
-u allow duplicate node numbers
The [includes] and [excludes] are either a single node number, a range, or a list:
singles: dispatch 2 4 6 7 uptime
range  : dispatch 1-48 uptime
list   : dispatch 2,3,4 uptime
Includes are given by a positive prefix or none, excludes by a negative prefix. The set of nodes is built up in a left-to-right order.
dispatch 1-48 -10-20 15 uptime
This example will add all of the nodes, remove 10-20, and then add back in 15.
The old version:
dispatch 1-48 -i 29,35,48 ls -al
The new version:
dispatch 1-48 -29,35,48 ls -al
Other examples:
dispatch -p -r 1 2 3 10-20 -15 30-40 -25 uname -a
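To make the left-to-right rule concrete, here is a minimal bash sketch of how a node set could be built from includes and excludes. It is an illustration only, not the actual dispatch source: the function name expand_nodes is made up for this page, it handles only node specs (no flags such as -p or -r), and it does not model the -u duplicate option.

```bash
#!/usr/bin/env bash
# Hypothetical helper, NOT part of dispatch: expands node specs the way
# the text describes (left to right; plain specs add, "-" specs remove).
expand_nodes() {
    local -a selected=()   # sparse array: index = node number
    local -a items=()
    local spec op item lo hi n
    for spec in "$@"; do
        op=add
        if [[ $spec == -* ]]; then
            op=remove
            spec=${spec#-}          # drop the negative prefix
        fi
        # A spec may be a comma-separated list of singles or ranges.
        IFS=',' read -ra items <<< "$spec"
        for item in "${items[@]}"; do
            if [[ $item == *-* ]]; then
                lo=${item%-*}; hi=${item#*-}
            else
                lo=$item; hi=$item
            fi
            # Force base 10 so that "08" is not read as octal.
            lo=$((10#$lo)); hi=$((10#$hi))
            for ((n = lo; n <= hi; n++)); do
                if [[ $op == add ]]; then
                    selected[n]=1
                else
                    unset "selected[$n]"
                fi
            done
        done
    done
    # Indices of an indexed array come back in ascending order.
    echo "${!selected[@]}"
}
```

For example, `expand_nodes 1-48 -10-20 15` prints nodes 1 through 9, then 15, then 21 through 48, which matches the description of the example above.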
1.3.2. System
The upgrade of the Beowulf cluster from Red Hat 9 to Fedora Core 4 is finished. What does this mean to you? As far as running your code is concerned, not much. Some programs (such as Java) are still the same version, and haven't changed in their functionality at all. Other items (such as MPI) have changed quite a bit, as the entire system is running up-to-date versions of almost everything else.
There are three machines that seem to be flaky that I have left offline: bw29, bw35, and bw48. If someone would like to look at those and try to determine whether it is a hardware problem, please do. In addition, bw25 needs to have its power light cable checked. Can someone take bw25 down, open it up, and check the wires?
For those of you that care, here is a list of what changed:
- the head node (bw01) is now a dual-processor Xeon computer. It has a SCSI disk, but only 256 MB of memory. We need to look into upgrading the memory. (Mike, do we have money left for that?)
- the old head node is currently not being used. It could be upgraded and used as a replacement for 29, 35, or 48. It has two new 40 GB hard drives. (Tom, can you look into that?)
- there are currently 45 nodes up and running.
- MPI works the same way as before: lamboot, lamnodes, mpirun, lamhalt
- dispatch and pdispatch work the same way as before, e.g.:
dispatch 01 48 -r -i 25,35,48 "uptime"
- the OS can now be updated automatically. We will need to make sure that if something critical changes on the root, then it will be updated on the rest of the nodes.
- all nodes (except the root) can be rebuilt using Matt's kickstart rebuilding tools. We need to back up at least /home and /var/www/ on beowulf.brynmawr.edu. Matt?
- we believe that we can install and use Mathematica in parallel. Matt, is this possible?
If you find anything that doesn't work as it did or as you would like, please let us all know. If you or your students would like to know more about the hardware (and help maintain it!), please let us know. I hope that we can all get together soon to discuss more effective uses of the cluster.
1.4. Maintenance
Hardware upgrades needed:
- Node 25: power light not connected?
- Nodes 1 - 4: need gigabit network cards installed
Software upgrades needed:
- All Nodes: need to test CPU and throughput speeds
Previous upgrades:
- Java 1.5 installed on all working nodes (rpm remove 1.4; add 1.5) Done: Feb 12, 2005
- Node 25: 25 and 27 had same IP number; rpm seems confused; reboot fixed it: Feb 12, 2005
- All working nodes with memory upgrades are reporting full 1 gig:
dispatch 1 48 -i 36,39 "cat /proc/meminfo" | grep MemTotal | cat -n
- Nodes 1 - 6: memory upgrade (3 x 1 gig DDR's sitting on Node 1) completed: Feb. 14, 2005
- Node 40: CD-ROM power and IDE weren't connected; it's working now: Feb. 14, 2005
- Node 40: upgrade.sh applied; node is now complete. Feb 14, 2005
- Node 36: cpu heatsink was backwards; cpu survived and it's working now: Feb. 14, 2005
- Node 36: fixed nis settings; Feb 14, 2005
- All Nodes: Applied script to install latest versions of ATLAS, CLAPACK, and NNI. Feb. 15, 2005
- Node 39: new Processor ordered 2/14/05; installed 3/1/05
Here's a way to get all of the machines that are free, and put them into a lam-bhost.def file:
dispatch 1-48 -r "uptime" | cut -d":" -f1,6 | cut -f1 -d"," | grep 0.00 | cut -f1 -d":" > lam-bhost.def
This uses a new -r flag on dispatch that formats the output in a report form, showing machine and results.
- New 40GB harddrives in root node (bw01) with RAID: Apr 9, 2005
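A note on the lam-bhost.def pipeline shown earlier in this list: the cut/grep chain seems to depend on the exact number of colons in the uptime output, and grep 0.00 would also match a load such as 10.00. A slightly sturdier variant might look like the sketch below. It assumes that dispatch -r prints one line per node in the form "name: uptime output" and that uptime prints "load average: " as it does on Linux; the function name idle_nodes is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical filter, NOT part of dispatch: reads "node: uptime-output"
# lines on stdin and prints the names of the nodes whose 1-minute load
# average is exactly 0.00.
idle_nodes() {
    awk -F'load average: ' 'NF == 2 {
        split($1, head, ":")   # node name precedes the first colon
        split($2, load, ",")   # 1-, 5-, and 15-minute load averages
        if (load[1] + 0 == 0) print head[1]
    }'
}
```

Under those assumptions it could be used as `dispatch 1-48 -r "uptime" | idle_nodes > lam-bhost.def`.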
1.5. Meetings:
Planning meeting: BMC Beowulf Supercomputer Project, Phase 2
Jan 7, 2005, 10am Park Science 230
We currently have 24 nodes, each with 512MB RAM, standard ethernet networking, and 10 GB harddrives. We need to decide which of the following have highest priority for this year:
- Add more nodes (i.e., computers): there is no limit to the number of nodes we can add, and each node can increase performance of the cluster.
- Add more memory: some programs require more RAM than we currently have available. This is especially true if we begin to utilize programs such as Mathematica, Matlab, Gaussian, or Maya (3D rendering).
- Add faster networking: currently, we have the slowest networking money can buy. Some processes require faster communication between processors in order to be effective.
- Add software: currently, we have only utilized free, open source software, and Mathematica. In the future, other scientists may wish to use other commercial packages.
We will start the meeting off with an overview of the Beowulf, and a brief introduction to using it.
