[WARNINGS]
link-id not found - [fmiosacomparativestudypartikernellandchapter1bootup|1
link-id not found - [fmiosacomparativestudypartikernellandchapter3scheduling|3
link-id not found - [fmiosacomparativestudypartikernellandchapter5memorymanagement|5
link-id not found - [fmiosacomparativestudypartikernellandchapter7deadlocks|7
link-id not found - [fmiosacomparativestudypartiiuserlandchapter8iolayers|8
link-id not found - [fmiosacomparativestudypartiiuserlandchapter10environmentlayer|10

Intro (I 1 2 3 4 5 6 7) (II 8 9 10 11) (D) (C) (B

I)

h1. Chapter 3: Scheduling

("Beware of bugs in the above code; I have only proved it correct, not tried it." - Donald Knuth)

Let's go back to the example of writing a 600-page document. Say you tell the word processor to save to disk. In the meantime, you switch to a web browser to do some research. The word processor and the browser are two different processes and allowing each to run in turn is the job of the scheduler. Scheduling is often applied to processes and threads in the same manner. Specifics on applying the scheduler to processes and threads differently will be mentioned when comparing the different case studies at the end of the chapter.

h2. 1. Basics

When looking at scheduling the first step is to look at when to schedule. The scheduler is typically not a daemon (background process). Therefore it has to be invoked during certain times through the course of a running system to make its scheduling decisions. These decisions on when to switch between processes (or threads) are usually made when one of these events occur:

  • Creating or exiting processes
  • Processes block
  • Hardware interrupt (i.e. clock, I/O)
  • Software interrupt (i.e. kill, message received)

When the word processor creates a thread to save your document, that thread must wait for the hard drive to write the data to the disk. While this is happening the thread may have nothing else to do, so it is blocked. When it blocks, the OS writes a Blocked Flag into the word processor's table entry. On most systems it also mentions that it's waiting on the disk task to complete. Therefore, when the scheduler is making decisions on which process to run next it will look at this information in the table. If the thread is blocked, then it won't be considered for CPU time. However, in the future, when the awaited I/O interrupt from the disk is handled by the OS, then the blocked thread is flagged as Ready to Run, and it may be given special consideration when the scheduler is called again to make a decision.

When an interrupt occurs, the CPU stops what it's doing and switches attention to the recently arrived interrupt. What the kernel actually does is make a context switch to an interrupt handler, which is assigned specifically to that event. This handler does several things depending on the interrupt. In some systems, specifically FMI/OS, the handler informs all of the programs subscribed (Chapter 4) to it that the event happened. After the handler does its job, the scheduler is called to determine who runs next.

Other information in the process tables, which the scheduler considers, may be a process' priority level (Section 4). Also, whether the process is compute-bound or I/O-bound. Compute-bound processes include things such as graphics rendering or calculating a million digits of pi. These processes mostly use the processor so they should get more CPU time. I/O-bound processes include things like web browsing or searching through directories. These processes spend most of their time blocking, waiting for the disk drive (or other I/O devices). Therefore, these processes don't require as much CPU time.

The two major categories of process scheduling fall under the names preemptive or nonpreemptive scheduling. Nonpreemptive scheduling allows applications to run indefinitely until they create a new process, exit, block, or otherwise surrender their CPU time. Preemptive scheduling allows a scheduler to take the CPU away from a process even though it has not reached the point in its computation where it willingly gives up the CPU. Other than the typical I/O and software interrupts, preemptive scheduling makes heavy use of interrupts from the computer's devices such as clock, disk, network devices, etc. In a preemptive-scheduled system, processes are given a small, fixed-length time slice called a quantum. If a time quantum expires, or a higher priority job arrives, the preemptive scheduler can context switch from the previous process to another.

Clarity of the differences between nonpreemptive and preemptive will come with some examples of different techniques and algorithms listed below.

Scheduling is also used for managing resources other than just CPU processing times. Many systems often employ a type of three-level scheduling which involves scheduling of three different resources:

  • Admission Scheduler
    • Deals with admission of processes to the queue of active processes.
  • Memory Scheduler
    • Applied to active processes.
    • Swaps out who's in memory and who's back-stored (written to disk).
    • Keeps a good mix of compute- and I/O- bound processes in memory so as to maximize throughput (i.e. more compute processes than I/O processes)
  • CPU Scheduler
    • Applied to processes in memory.
    • Schedules processes on processor.

These "levels" work in simultaneous succession, where one level effects which processes are in the next level:



digraph G { Admission->Memory->CPU; }

Systems may have different goals for the scheduler. The primary types of scheduled systems are: Batch, Interactive, and Realtime. All three share some common goals, but each has specialized goals, as well. Below are an overview of the different systems, the different goals of schedulers, and then, a short summary of some of the strategies used by schedulers.

h2. 2. Systems

  • Batch Systems. Batch systems stem from the very first computers. They are used in today's mainframes whose primary purpose is high throughput of processes with little regard for waiting times of individual processes.

  • Interactive Systems. Interactive systems are the most popular because they include most consumer computer systems. These systems are designed for regular user-interaction. Office workstations, gaming systems, web servers are all examples of interactive systems.

  • Realtime Systems. Realtime systems are time-sensitive. Systems such as medical equipment or driving telematics are time-critical systems. For example, if a controlling process misses a deadline a patient could die or a driver could crash.

h2. 3. Goals

  • Fairness. Fairness is important because every process should be treated with appropriate fairness. Allowing one process all of the CPU while five others of the same priority are waiting in line is not fair. However, allowing all processes to run with equal quanta even though one process is a much higher priority than the others (such as nuclear power plant safety program compared with the payroll program, or video playback compared with sending e-mail) is not fair, either. Fairness is a common goal among the different systems.

  • Policy enforcement. Scheduling systems are not written without some policy to determine the appropriate fairness of processes. Such policies include priority. Policy enforcement is also a common goal.

  • Keeping everything busy. To ensure the best efficiency when tackling tasks, the scheduler must keep the system as busy as possible. This is why compute-bound processes may run more than I/O-bound processes: while the I/O processes are blocking, the system can be kept busy with the compute-bound processes. Keeping everything busy is another common goal.

  • Maximize throughput. One of the ways to rate a system is by its throughput. Its throughput is determined by the amount of work accomplished in a certain length of time. To be deemed successful, some systems (such as mainframes processing millions of customer claims or employee payrolls) should have as high a throughput as possible. This goal is primarily seen in batch system schedulers.

  • Minimize turnaround. Another way a system can be rated is by its turnaround. Turnaround is determined by the average time a task takes, measured from the time it is submitted to when it is completed. The quicker the turnaround time, the higher the rating of the system. This goal originated in batch system schedulers, but is currently seen in almost all other systems.

  • Minimize response time. When interacting with a computer most users base their rating of the system on how quickly it responds. For example, if a user is typing into a word processor and doesn't promptly see the keystrokes appearing on the screen, it can be quite frustrating and unnerving. In this example a poorly designed scheduler may have not given the word processor sufficient privilege to run quickly enough to handle each keystroke in a timely manner. Therefore, a scheduler has to make informed decisions as to which processes to make higher priority, and when to make other processes lower priority. This goal is typically invoked in interactive system schedulers.

  • Meet deadlines. Some processes are given real-time deadlines so they must be completed by a certain time. These are typically reserved for processes on realtime systems where these deadlines are important, even critical.

h2. 4. Strategies, Techniques and Algorithms

  • FIFO. FIFO stands for First In First Out. This is perhaps the simplest of the schedulers. The first program submitted for processing is executed while the others line up after it. As the first completes, the second in the queue comes up for execution. The fairness of this technique comes from the first-come, first-served layout.

  • Shortest First (nonpreemptive) & Shortest Remaining Time Next (preemptive). These techniques are fairly self-explanatory. If the scheduler is handed a batch of processes it chooses the shortest one to run first. Although in the end, regardless of execution order, the batch is completed in the same amount of time, the average turnaround time is much shorter. For example, let's say there were four processes with 1, 2, 4, and 8 estimated minutes for completion, respectively. If they were run in the order 8, 4, 2, 1, the average turnaround time would be (8 + 12 + 14 + 15) / 4 = 12.25 minutes per task. However, if they were ran in order 1, 2, 4, 8, the average turnaround time would be (1 + 3 + 7 + 15) / 4 = 6.5 minutes per task.

    The shortest remaining time next technique is a preemptive version of shortest first. If a new process arrives that has a shorter run time than the current process' remaining time, then it is run instead.

    Both of these algorithms have the added requirement of needing to know the estimated execution time of each process. Therefore, these algorithms are only possible on systems that provide that functionality. For those that do not, the scheduler must externally predict the estimated execution time. One such method is called aging. This method bases a process' estimated running time on its previous running times on the CPU. For instance, suppose a process runs for A seconds before blocking. Then, when it is run again, it has running time B. Its estimated next running time could then be calculated by adding A/2 + B/2, with the aging factor being 1/2 (which is easy to implement since it's only a bit-shift to the right). Now, if it is run once more and its running time is C then the estimated running time for the next run will be A/4 + B/4 + C/2. The predicted run time can be maintained by the scheduler with one variable (number), P, only. In fact, if the last prediction was P, and T is the last run time, then the new prediction, P, is .5P(old) + .5T.

  • Round Robin (purely preemptive). This technique is one of the generic, and most widely used, scheduling algorithms. As the name implies, this circulates through the list of processes, giving each one its quantum on the CPU. In this manner it is very much preemptive. A big decision when writing a round-robin scheduling algorithm is striking a balance between short quanta (which cause many expensive context switches) and long quanta (which give perception of sluggishness to users).

    Since round-robin, in its pure form, does not account for priority, all processes get equal time on the CPU, regardless of priority. Therefore, this technique isn't ideal for implementing policy and fairness by itself. This is why round-robin algorithms are usually seen coupled with other algorithms that do incorporate priority.

  • Priority scheduling. The easiest way to incorporate priority is by assigning different sized quanta. Higher priority requires longer quanta. Some tweaks to the algorithm can prevent one high priority algorithm from consuming the majority of the machine. This can be accomplished by diminishing the priority of a process after each complete quantum. Another tweak is to increase efficiency by increasing priority to processes that used only fractions of their last quantum. These quick processes will likely just block again and leave the ready queue.

    Another way of accomplishing priority scheduling is by using Priority Classes. Each class is set aside for each level of priority. Within that level the scheduler will use round robin. Then the scheduler moves on to the the next level. To prevent starvation of lower-level processes the priority classes are frequently adjusted. One example of this is exponentially increasing quantum allocations depending on priority-class-tree depth. Therefore, when the scheduler does finally get to the lower processes, they will have a little more time to spend on the clock.

  • Lottery. This method closely resembles a lottery. Each process is given a "lottery ticket" and tickets are picked at random after each quantum. To account for priorities, high-priority processes get more tickets for better chances of winning. Through this technique, controlling proportions is easy. If tickets are evenly distributed then each process has even time on the CPU. If there are 10 tickets and three processes are given 1, 3, 6 tickets, respectively, then, in time, the CPU times of the processes will average out to 10%, 30%, and 60% of the total time.

  • Fair-share. This concept is based on of the idea that each user gets an even division of CPU time; regardless of the number of processes they are each running. If User A is running 8 processes and User B is running 2 processes then User B still gets 1/2 of the CPU time rather than 2/10.

  • Rate Monotonic Scheduling. Time slices are computed depending on the rate at which periodic events occur. For example, if a video is playing back at 24 frames per second the process gets CPU time 24 times a second.

  • Earliest deadline. This is similar to shortest-time first except it bases its selection on deadlines instead of shortest time. So, instead of each process announcing its estimated run time, they announce their deadline.


h2. 5. Thread scheduling

These algorithms and techniques can be applied to processes or threads. However, here are a couple different ways of implementing threads.

h3. User Space threads

If threads are implemented in user space then each process has its own thread scheduling. This provides the process with the power of using the appropriate algorithm for scheduling the threads, depending on the special needs of the process. The implication here is that there are two schedulers: one managing the processes and another one managing the threads. This redundancy slows the system down and is one of the reasons why user space threads are often not implemented.

Another disadvantage is that if one thread blocks then the process scheduler might block the entire process rendering the other threads in that process "blocked" and un-runnable, as well.

h3. Kernel Space threads

In this method, the kernel handles process scheduling and thread scheduling. Therefore, it can manage both in the most efficient way possible. However, sometimes a better way is to not discern between threads and processes at all. This means that the scheduler manages only threads since all of the processes are made up of threads anyway. In this regard, the concept of "process" reverts back to being a static structure, purely a holder of resources. Since threads are what are actually executed, the process structure is just a collection of threads with a common commonly held resources, such as memory space or opened files, that they all may share.

Although this sounds like a simple design, there are a few drawbacks. Since processes are transparent to the scheduler when it switches between threads, it is likely that the scheduler could also be switching between address spaces (i.e. different processes). This means that an expensive context switch, which happens when switching between different processes, could occur. To compensate for this, some scheduling algorithms take into consideration whether threads share an address space, thus minimizing address space switching and making the context switch less expensive.


h2. 6. O/S Comparisons

To help with some of the concepts discussed above a few case studies of different implementations are provided below, as well as FMI/OS's own implementation methods.

h3. UNIX

UNIX has two levels of the three-level schedulers. There is a memory scheduler that operates in round-robin to ensure that all process get time in active memory.

The other scheduler is the CPU scheduler. This uses priority classes with priorities ranging from n < 0 (highest; kernel processes) to n > 0 (lowest; user processes). A round-robin approach is taken through each priority range with a quantum typically of 100 milliseconds. The highest priority range (most negative) is gone through first until the processes are done. Then the scheduler moves down to the next level, and so forth.

Priorities are recalculated every second based on a few factors: previous CPU usage (using a type of aging), "niceness" factor given to it by parent process, and a base priority. The niceness factor allows the process to assign a level of priority to its child (if it assigns a lower priority it is being nicer to other processes). The base value ranges from -4 for low-level processes, such as I/O-bound, to positive numbers for user processes. I/O-bound is highest so the scheduler can get those out quickly since they usually block, anyway, waiting for the I/O.

h3. Linux

Linux uses priority classes, as well. They are: "Realtime" FIFO (highest priority), "Realtime" round-robin (high priority), and Timesharing (low priority). Although they are labeled Realtime they are not true Real-time. They basically mean that they meet realtime extensions to UNIX, standard IEEE1003.4 (POSIX substandard []). Linux also implements kernel threads and is strictly thread-scheduling.

The threads in the top level are scheduled in a FIFO configuration. These threads are not preemtable except from new FIFO threads. Once those threads are all completed the scheduler moves down to the next level, and so forth. Linux also reorganizes like UNIX does. It uses an encompassing algorithm that results in realtime threads getting most time, I/O-bound threads floating to top and gaining priority until done, and compute-bound threads getting fraction of CPU based on priority.

h3. FMI/OS

The FMI/OS scheduling system has led it to be coined a Preemptive Cooperative Multitasking Environment, or PCME. The scheduling system relies mostly on nonpreemptive, Cooperative Multitasking techniques, which occur during message passing, and reverts only to preemption from the clock, or other interrupt handlers, when no messages are being passed.

There is no actual running scheduler. The scheduler is an API, or Application Programming Interface, whose commands are called whenever something is calling for a scheduling event. Semaphores-when they block, IPC message-passing commands, and interrupt handlers are usually what call scheduling events. These scheduling events also only deal with threads since FMI/OS implements kernel threads.

The heart of the scheduler is made up of two parts:

    1. Runnable tree*.

    This is a structure containing all of the threads that are able to be run. For these threads, everything is set up and ready for them to run, they're just waiting on CPU time. Most schedulers refer to this structure as the "ready queues."

    The runnable tree has a very important data structure to facilitate constant time scheduling (O(1): see below). The main component of the structure is an array of linked lists. This array is as long as the number of priorities, typically 32 (a type of priority class scheduling). Each entry in the array is the head of a linked list of all of the threads that fall under that priority. The key to traversing the set in a timely manner is one of the other variables in the structure: @prioritymask@. This is a 32-bit variable (32 bits, one for each priority) where the bit number corresponding to the priority is set to 1 if there is a thread in that priority level, or 0 if there are none. For example, @prioritymask@ could look like @00000000001000000000000000001000@ which indicates there are threads in the linked-lists of priority 21 and 3 (bits going from left to right: 31 to 0). So, all the scheduler needs to do, to determine who to run next, is call the command @ffs()@ (find first set bit) on the mask, which returns the position of the first @1@ in the 32 bits (in this case, 21). It then uses that position as the index of the runnable tree array where it grabs the head of the linked list that resides there. The list that is there contains the threads that have that priority (priority level 21), so the first thread in it is run.

    The beauty of this method is that it is O(1) efficient, implying scalability of its operation. Regardless of the number of threads waiting to be run, the scheduler will still only take a constant amount of time. If there are hundreds of threads, the scheduler still only needs 2 lines of code to determine the next process to run.

    We chose this because the previous scheduler, from the original VSTA operating system, was bulky and inefficient. By invoking a scheduler that could not only easily implement POSIX-standard realtime capabilities, but still be able to handle a large number of processes without increasing search time, we are able to create a more versatile OS that could scale up to different sizes.

    1. The @setrunnable()@ and @runnext()@ commands.*

    The two main commands of the scheduler API are @setrunnable()@ and @runnext()@. The function @setrunnable()@ moves a thread into the runnable tree, attaching the thread to the tail of the list for its priority, and sets the appropriate bit in @prioritymask@.

    The function @runnext()@ determines the next thread to be run. It takes the current thread (that was running) and reinserts it into the runnable tree depending on the scheduling algorithm the thread has specified in its own table. If the thread flags itself as @schedfifo@, it is put back in the head of the list; if it's flagged as @schedrr@ (for round-robin) or @sched-other@ it is attached to the tail of the list. For more detail on this, check the IEEE 1003.1 standard (POSIX) ("System", SCHEDFIFO, etc). The next thread to run is now taken off of the runnable tree and is set to run.

    So when does FMI/OS call these scheduler functions?

    • After any message send or receive
    • After a process exits (terminates)
    • When a process blocks or yields
    • When an interrupt is handled

    For example, if a program sends a message to another, it blocks, waiting for a reply. The message sender sets the recipient of the message (if it is blocked, awaiting a message) as runnable (@setrunnable()@), then it calls @runnext()@ to run the highest priority thread. This next thread could be the recipient of the message, or it could be some other thread with higher priority.

    To handle clock preemption there is a clock timer that is set to the quantum when a thread is run (@runnext()@ does this). On 3x86 systems, FMI/OS uses a hardware (CPU) timer that counts down with every CPU clock cycle or tick. When it gets to zero an interrupt happens. The handler for this first decrements the priority of the currently running process (see badness, below), then sets it as runnable and calls @runnext()@.

h4. Fairness

Fairness of the scheduler comes from the implementation of badness and goodness factors. When a thread is "good" its priority is increased, when it is "bad" its priority is decreased. Goodness points are gained whenever a thread blocks or yields. Badness points are taken away whenever a thread has overrun its quantum. All threads start off as top priority, and eventually sort themselves out up and down the process tree in a sort of accordion effect, evening out the fairness of the scheduling.

One other important note is that of priority inheritance. Suppose the nuclear safety program forks off a thread for one of its safety subsystems. If that thread doesn't have the same priority as its parent then it could be treated as a lesser thread competing for the processor with, say, the payroll program. To compensate for this, the child thread inherits the priority of its parent. FMI/OS also incorporates priorities into message passing: a thread receiving a message inherits the priority of the thread sending it.

More of the message passing and interrupt calls will be looked at in Chapter 4 Inter-Process Communication.


(c)2006 Dimitri Hammond

AddComment

Also available in: HTML TXT