The DGSS project implements a distributed version of a virtual grid system simulator. It runs multiple schedulers, one of which is elected as the master node to perform matchmaking. All system state is replicated across all nodes, and the system tolerates random node failures. Experimental results show decent scalability, with minimal overhead in a scenario with 5 grid scheduler (GS) nodes, 80 resource manager (RM) nodes and a workload of 40,000 jobs.
Because the implementation is distributed, failures of any number of GS or RM nodes do not bring the system down, as long as at least one GSNode is alive at all times. The system handles this by electing one GSNode as the 'master', which is responsible for sequencing all messages sent through the system. This provides a global ordering of events and an easy way to broadcast messages.
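The sequencing idea can be sketched in a few lines of Java. This is a minimal, single-process illustration and not the project's actual code: the class names `Master`, `Receiver` and `Sequenced` are hypothetical, the network and the election itself are left out, and the only assumption is that the master stamps every message with a consecutive sequence number while each node delivers messages strictly in that order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SequencerSketch {

    /** A message stamped with a global sequence number (hypothetical type). */
    static final class Sequenced {
        final long seq;
        final String payload;

        Sequenced(long seq, String payload) {
            this.seq = seq;
            this.payload = payload;
        }
    }

    /** Role of the elected master: hands out consecutive sequence numbers. */
    static final class Master {
        private long next = 0;

        synchronized Sequenced stamp(String payload) {
            return new Sequenced(next++, payload);
        }
    }

    /** Any node: buffers early arrivals and delivers in sequence order. */
    static final class Receiver {
        private long expected = 0;
        private final Map<Long, Sequenced> pending = new HashMap<Long, Sequenced>();
        private final List<String> delivered = new ArrayList<String>();

        synchronized void receive(Sequenced message) {
            pending.put(message.seq, message);
            // Deliver every message that is now contiguous with what was delivered.
            Sequenced ready;
            while ((ready = pending.remove(expected)) != null) {
                delivered.add(ready.payload);
                expected++;
            }
        }

        synchronized List<String> delivered() {
            return new ArrayList<String>(delivered);
        }
    }

    public static void main(String[] args) {
        Master master = new Master();
        Sequenced a = master.stamp("job-added");
        Sequenced b = master.stamp("job-started");
        Sequenced c = master.stamp("job-finished");

        // Simulate the network reordering the messages on their way to a node.
        Receiver node = new Receiver();
        node.receive(c);
        node.receive(a);
        node.receive(b);

        System.out.println(node.delivered()); // [job-added, job-started, job-finished]
    }
}
```

Because every node applies the same rule, all nodes deliver the same messages in the same order regardless of arrival order, which is what makes replicating the full system state on every node workable.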
This project has been developed for the IN4391 - Distributed Computing Systems course at Delft University of Technology. The report submitted for this course is also available.
- Minimum: Java 1.6; recommended: Java 1.7 (for insight into the system and process CPU usage of nodes)
- Ant
- Optional: screen
- Edit the `system.properties` file to contain the external IP addresses of all the machines that will participate in the system.
- Deploy the code to the machines that will participate.
- Execute `ant`.
- On every participating machine, execute `rmiregistry`.
- Start any number of GSNodes, using `./start-gs.sh <external ip>`.
- Start any number of RMNodes, using `./start-rm.sh <external ip>`.
- Adding jobs
  - Execute `./add-jobs.sh` to add a configurable number of jobs to the system, or
  - Press `return` in the terminal window running `./start-rm.sh` to immediately add a job to that RMNode.
- Monitoring
  - Execute `./monitor-status.sh`.
  - Press `return` to retrieve the current job queue size, message queue size and CPU usage of all registered nodes in the system.
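For context on why Java 1.7 is recommended: Java 7 added system and process CPU load readings to the `com.sun.management.OperatingSystemMXBean` interface. The sketch below shows one way a node could sample these values. It is an illustration only, not necessarily how DGSS collects them, and the class name `CpuUsageSketch` and the `sample` method are hypothetical.

```java
import java.lang.management.ManagementFactory;

public class CpuUsageSketch {

    /**
     * Returns {systemLoad, processLoad}. Each value is a fraction between
     * 0.0 and 1.0, or negative when the JVM cannot provide a reading.
     */
    static double[] sample() {
        java.lang.management.OperatingSystemMXBean os =
                ManagementFactory.getOperatingSystemMXBean();

        // The extended interface is JDK-specific, so check before casting.
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            com.sun.management.OperatingSystemMXBean extended =
                    (com.sun.management.OperatingSystemMXBean) os;
            return new double[] {
                extended.getSystemCpuLoad(),   // load of the whole machine
                extended.getProcessCpuLoad()   // load caused by this JVM
            };
        }

        // Assumption: report "unavailable" on runtimes without the extension.
        return new double[] { -1.0, -1.0 };
    }

    public static void main(String[] args) {
        double[] loads = sample();
        System.out.println("system CPU load:  " + loads[0]);
        System.out.println("process CPU load: " + loads[1]);
    }
}
```

The first reading taken right after JVM start-up is often negative or zero, so a monitor would typically sample periodically rather than once.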