Modern computer systems are very robust, and SCI IT has infrastructure in place to keep all hosts up, functional, and available to use, 24×7. However, maximizing uptime also requires regular maintenance, including patching, updating, and rebooting machines, which temporarily affects the availability of those machines and the services running on them. Additionally, no infrastructure is 100% reliable, and sometimes, despite best efforts, unexpected downtime occurs.
This document covers how SCI IT handles system downtime, both planned and unplanned, and how we communicate with SCI users about downtimes. All times listed are local to Salt Lake City, UT (Mountain Time).
Patching and updating
A machine’s operating system and the software packages it provides are constantly being updated to fix bugs, add new features, and address security risks. Most of these updates do not require a reboot and can be completed without affecting anybody, but some require the machine to be restarted before they take effect. These updates are the most common reason a machine may be unavailable, although the downtime is generally limited to a few minutes.
These updates are applied differently across different categories of machines:
Infrastructure Servers
These machines, such as those that run SCI IT services (mailing list servers, monitoring, web servers, etc.), will have all software updates applied at 3am on Wednesday mornings and will be rebooted if necessary. This includes the shell.sci.utah.edu machines. Any interruption in services will be minimized, but some services may become unavailable while those servers are rebooting.
A message will be posted in the #sci-it Slack channel the day prior as a reminder that patching will occur.
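For anyone who wants to plan long-running jobs around this schedule, the next patch window can be computed from the rule above. The following is only a minimal sketch, not a tool provided by SCI IT: the function names `next_patch_window` and `reminder_date` are hypothetical, and it assumes the IANA zone `America/Denver` (Mountain Time) is the correct time zone for Salt Lake City.

```python
from datetime import date, datetime, time, timedelta
from zoneinfo import ZoneInfo

# Assumption: Salt Lake City local time is the IANA zone America/Denver.
SLC = ZoneInfo("America/Denver")
PATCH_TIME = time(3, 0)   # 3am, per the policy above
WEDNESDAY = 2             # datetime.weekday(): Monday is 0


def next_patch_window(now: datetime) -> datetime:
    """Return the first Wednesday 3am (Salt Lake City time) strictly after `now`.

    `now` should be timezone-aware; a naive datetime would be interpreted
    as the local time of the machine running this code.
    """
    local = now.astimezone(SLC)
    days_ahead = (WEDNESDAY - local.weekday()) % 7
    candidate = datetime.combine(
        local.date() + timedelta(days=days_ahead), PATCH_TIME, tzinfo=SLC
    )
    if candidate <= local:
        # This week's window has already started or passed; use next week's.
        candidate = datetime.combine(
            candidate.date() + timedelta(days=7), PATCH_TIME, tzinfo=SLC
        )
    return candidate


def reminder_date(window: datetime) -> date:
    """Return the calendar day before the window, when the reminder is posted."""
    return window.date() - timedelta(days=1)
```

Calling `next_patch_window(datetime.now(SLC))` gives the upcoming window, and `reminder_date` gives the day on which the #sci-it reminder would be expected.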
Desktops, Workstations, and Compute Clusters
As of July 2024, a procedure for non-server updates and reboots has not yet been decided upon or implemented. If you would like your desktop or workstation updated, please contact SCI IT.
Regular Maintenance
Sometimes work that falls outside of regular patching needs to be done on systems, and this work may risk or require an interruption in availability. The regular time that SCI IT has set aside for such work is Fridays from 5 to 7pm, although most weeks this window will not be used. If any work that may or will cause a service interruption is scheduled for a Friday downtime, SCI IT will announce it beforehand, both by email and in the #sci-it Slack channel.
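A script that schedules jobs could check whether a given moment falls inside this reserved Friday window. The sketch below is illustrative only: the function name `in_friday_maintenance_window` is hypothetical, the zone `America/Denver` is assumed for Salt Lake City, and treating 7pm itself as outside the window is a choice made here, not something the policy states. Because the window is unused most weeks, a match means only that maintenance is possible, not that it is scheduled.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

# Assumption: Salt Lake City local time is the IANA zone America/Denver.
SLC = ZoneInfo("America/Denver")
WINDOW_START = time(17, 0)  # 5pm
WINDOW_END = time(19, 0)    # 7pm, treated as exclusive in this sketch
FRIDAY = 4                  # datetime.weekday(): Monday is 0


def in_friday_maintenance_window(moment: datetime) -> bool:
    """Return True if `moment` falls on a Friday between 5pm and 7pm local time.

    `moment` should be timezone-aware; it is converted to Salt Lake City
    time before the day and hour are checked.
    """
    local = moment.astimezone(SLC)
    return local.weekday() == FRIDAY and WINDOW_START <= local.time() < WINDOW_END
```

Timestamps in other zones are converted first, so a UTC timestamp from a job scheduler can be passed in directly.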
Emergency Maintenance or Major Downtimes
Most work that would have a significant impact on SCI users will be scheduled for UofU breaks/holidays or low-usage times in the summer. This work will be announced ahead of time by email and in the #sci-it and #general Slack channels. Any unexpected service outage will be communicated in a similar manner, including identification of the outage, progress updates, and eventual resolution.