
Acknowledgement

These notes are based on the slides provided by the professor, Qiang Ye, for the class "CSCI-3120 Operating Systems". I am taking/took this class in Fall 2023-2024 semester. If there are corrections or issues with the notes, please use the contact page to let me know. A lot of the words and images are either directly taken from the slides or have been paraphrased slightly.

CH1: Introduction

Definitions of "Operating System":

  • Everything a vendor ships when you order an operating system.
  • The one program running at all times on the computer.
    • Actually, it is the kernel part of the operating system.
    • Everything else is either a system program or an application program.

Computer Architecture

Computer Architecture: Logical aspects of system implementation as seen by the programmer. Answers the question: "How to design a computer?".

Computer Organization: Deals with all physical aspects of computer systems. Answers the question: "How to implement the design?".

Multiprocessor Systems

Symmetric Multiprocessing (SMP)

SMP system architecture

Each processor performs all kinds of tasks, including operating-system functions and user processes.

Multicore Systems

Multiprocessor system architecture

Multicore systems are systems in which multiple computing cores reside on a single chip. Multicore systems can be more efficient than systems with multiple processors because on-chip communication between cores is faster than between-chip communication.

Non-Uniform Memory Access (NUMA)

NUMA Diagram

One approach to sharing memory is to provide each CPU (or group of CPUs) with its own local memory that is accessed via a local bus.

Clustered Systems

Clustered systems differ from multiprocessor systems in that they are composed of two or more individual systems. Each individual system is typically a multicore computer. Such systems are considered loosely coupled.

Computer Organization

Interrupts

An interrupt is a signal to the processor emitted by hardware or software indicating an event that needs immediate attention.

Two types of interrupts:

  • Hardware Interrupts: defined at the hardware level.
  • Software Interrupts: also known as traps or exceptions; defined at the OS level.

Interrupt diagram

Operating System Operations

Bootstrap Program

When a computer starts, it needs to run an initial program to find the operating system, which then needs to be loaded. This initial program, known as the bootstrap program, tends to be simple. It initializes all aspects of the system, from CPU registers to device controllers to memory contents. In addition, it loads the operating-system kernel into memory. Thereafter, the OS will take over the system.

  1. Power On: User turns on the computer.
  2. BIOS/UEFI Initialization: Basic Input/Output System (BIOS) or Unified Extensible Firmware Interface (UEFI) starts.
  3. POST: Hardware check via Power-On self-test.
  4. Bootstrap Program Loaded: Small program that starts the boot sequence.
  5. Bootloader Activated: Loads and starts the operating system's kernel.
  6. OS Kernel Loaded: Core of the operating system is loaded into memory.
  7. OS Initialization: Operating system is initialized and becomes fully operational.

Multiprogramming

The operating system keeps several processes in memory simultaneously. The operating system picks and begins to execute one of these processes at a time. Eventually, the process may have to wait for some task such as an I/O operation to complete. When this event occurs, the operating system simply switches to and executes another process.

Multitasking

Multiprogramming doesn't give any guarantee that a program will run in a timely manner. The very first program may run for hours without needing to wait. Multitasking is a logical extension of multiprogramming where the CPU executes multiple processes by switching among them, but the switches occur more frequently, providing the user with a fast response time.

Multimode Operation

The operating system is responsible for ensuring that the hardware and software resources of the computer system are protected from incorrect or malicious programs (including the operating system itself).

In order to ensure the proper execution of the system, we must be able to distinguish between the execution of operating-system code and user-defined code. The approach taken by most computer systems is to provide hardware support that allows differentiation of various modes of execution.

At the least, a system must have a kernel mode (supervisor mode, privileged mode, system mode) and a user mode. A bit, called the mode bit, is added to the hardware of the computer to indicate the current mode.

If an attempt is made to execute a privileged instruction in user mode, the hardware does not execute the instruction. Instead, it treats it as an illegal instruction.

  • Intel processors have 4 separate protection rings, where ring 0 is kernel mode and ring 3 is user mode. The other modes are for things such as hypervisors.

  • ARM v8 systems have 7 modes.

  • CPUs that support virtualisation frequently have a separate mode to indicate when the virtual machine manager (VMM) is in control of the system. In this mode, the VMM has more privileges than user processes but fewer than the kernel.

History of Free Operating Systems

In the early days of modern computing (1950s), software generally came with source code. However, companies sought to limit the use of their software to paying customers. To counter the move to limit software use and redistribution, in 1984, Richard Stallman started developing a free, UNIX-compatible operating system called GNU, which is a recursive acronym for "GNU's Not Unix".

Four Essential Freedoms

The Free Software Foundation (FSF) was founded to support the free-software movement. This was a social movement with the goal of guaranteeing four essential freedoms for software users.

  1. The freedom to run the program.
  2. The freedom to study and change the source code.
  3. The freedom to redistribute copies.
  4. The freedom to distribute copies of the modified versions.

CH2: Operating System Structures

Services

Operating systems provide a collection of services to programs and users. These are functions that are helpful to the user.

These are some areas where functions are provided:

  • User interface
  • Program execution
  • I/O operations
  • File-system manipulation
  • Communications (IPC)
  • Error detection.

Another set of OS functions exist to ensure the efficient operation of the system itself.

  • Resource allocation: CPU, Memory, file storage etc.
  • Accounting: Logging, tracking.
  • Protection and Security: controlling access so that resources owned by users are protected.

User Interfaces

Command-Line Interface (CLI)

The command-line interface (a.k.a. command interpreter) allows direct command entry. The CLI fetches a command from the user and executes it.

There are two types of commands:

  • Built-in commands: These are a part of the CLI program.
  • External commands: These commands correspond to independent programs.

You can check if a command is a built-in or an external using the type built-in command.

$ type type
type is a shell builtin

$ type cd
cd is a shell builtin

$ type cat
cat is /usr/bin/cat

System Calls

System calls provide an interface to services made available by an operating system. System calls are generally available as functions written in C and C++. Certain low-level tasks may have to be written using assembly-language instructions.

However, most programmers do not use system calls directly. Typically, application developers design programs according to a high-level API. The API specifies a set of functions that are available to an application programmer.

Example: the read and write functions offered by C's stdio library are ultimately wrappers around the corresponding read and write system calls.
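
To make this concrete, here is a small sketch of my own (not from the slides): it prints one line through the buffered stdio API and another by calling the write() system call directly on the standard-output file descriptor.

#include <stdio.h>    /* printf(): high-level, buffered stdio API */
#include <string.h>   /* strlen() */
#include <unistd.h>   /* write(): the underlying POSIX system call */

int main(void)
{
    const char *msg = "Hello via the write() system call\n";

    /* High-level API: the C library buffers this output and eventually
       calls write() on the standard-output file descriptor for us. */
    printf("Hello via printf()\n");
    fflush(stdout); /* push the buffered data out now */

    /* Direct system call: no buffering by the C library. */
    write(STDOUT_FILENO, msg, strlen(msg));

    return 0;
}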

Types of System Calls

  • Process control
    • create/terminate process
    • load/execute process
    • get/set process attributes
    • wait/signal events
    • allocate and free memory.
  • File management
    • create/delete files
    • open/close files
    • read/write/reposition file
    • get/set file attributes.
  • Device management
    • request/release device
    • read, write, reposition
    • get/set device attributes
    • logically attach or detach devices.
  • Information maintenance
    • get/set time, date
    • get/set system data.
  • Communications
    • create/delete communication connection
    • send/receive messages
    • transfer status information
    • attach/detach remote devices.
  • Protection
    • get/set file permissions
    • allow/deny user access.

The following table illustrates various equivalent system calls for Windows and UNIX operating systems.

Category                   Windows                          Unix
Process control            CreateProcess()                  fork()
                           ExitProcess()                    exit()
                           WaitForSingleObject()            wait()
File management            CreateFile()                     open()
                           ReadFile()                       read()
                           WriteFile()                      write()
                           CloseHandle()                    close()
Device management          SetConsoleMode()                 ioctl()
                           ReadConsole()                    read()
                           WriteConsole()                   write()
Information maintenance    GetCurrentProcessID()            getpid()
                           SetTimer()                       alarm()
                           Sleep()                          sleep()
Communications             CreatePipe()                     pipe()
                           CreateFileMapping()              shm_open()
                           MapViewOfFile()                  mmap()
Protection                 SetFileSecurity()                chmod()
                           InitializeSecurityDescriptor()   umask()
                           SetSecurityDescriptor()          chown()

Compiler, Linker, and Loader

  • Compiler: Source code needs to be compiled into object files. Object files are designed to be loaded into any physical memory location.

  • Linker: Linker combines object codes into a binary executable file. If the object code corresponding to a library is needed, the library code is also linked into the executable file.

  • Loader: Executable files must be brought into memory by the loader to be executed.

Compiler-linker-process

This is statically linking the libraries to the program. However, most systems allow a program to dynamically link libraries as the program is loaded. Windows, for instance, supports dynamically linked libraries (DLLs).

The benefit of this approach is that it avoids linking and loading libraries that may end up not being used by an executable file. Instead, the library is conditionally linked and is loaded if it is required during program run time.

Aside on Dynamic Libraries

Dynamically linked libraries (DLLs) on Windows and shared libraries on UNIX are kinda-sorta the same thing. They are loaded into the program on demand, when the program requires them.

The main advantage of shared libraries is the reduction of used memory (in my opinion). Since it is the same library that is loaded into multiple programs, if a program has already loaded a library, then a new program that requires it is only given a mapping to it and the memory is not actually copied. This means that each program THINKS it has a separate copy of the library, but in memory there is only one copy of it. So functions like printf or read, that are used a lot, are only loaded into memory ONCE rather than a few hundred times.
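
As an illustration (my own sketch, not from the slides), a Linux program can even load a shared library by hand at run time with the dlfcn API; the library name libm.so.6 and the cos symbol are just convenient examples.

#include <stdio.h>
#include <dlfcn.h>   /* dlopen(), dlsym(), dlclose(), dlerror() */

int main(void)
{
    /* Load the shared math library at run time (illustrative choice). */
    void *handle = dlopen("libm.so.6", RTLD_LAZY);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* Look up the cos() symbol inside the loaded library. */
    double (*cosine)(double) = (double (*)(double)) dlsym(handle, "cos");
    if (cosine == NULL) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return 1;
    }

    printf("cos(0.0) = %f\n", cosine(0.0));

    dlclose(handle); /* drop our reference to the library */
    return 0;
}

On Linux this typically compiles with gcc file.c -ldl.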


Object files and executable files typically have standard formats that include:

  • The compiled machine code
  • A symbol table containing metadata about functions
  • Variables that are referenced in the program.

For UNIX/Linux systems, this standard format is known as Executable and Linkable Format (ELF). Windows systems use the Portable Executable (PE) format. MacOS uses the Mach-O format.

Operating System Design

  • User goals: operating system should be convenient to use, easy to learn, reliable, safe, and fast.
  • System goals: operating system should be easy to design, implement, and maintain, as well as flexible, reliable, error-free, and efficient.

Operating System Structure

A system as large and complex as a modern operating system must be engineered carefully. A common approach is to partition the task into small components. Each of these modules should be a well-defined portion of the system, with carefully defined interfaces and functions.

Monolithic Structure

The simplest structure for organizing an operating system is no structure at all. That is, place all the functionality of the kernel into a single, static binary file that runs in a single address space.

An example of this type of structure is the original UNIX operating system, which consists of two separable parts: Kernel, System programs.

Monolithic structure design

The Linux operating system is based on UNIX and is structured similarly, as shown in the below figure. Applications typically use the standard C library (note that the standard C library on Linux is called glibc) when communicating with the system call interface to the kernel.

The Linux kernel is monolithic in that it runs entirely in kernel mode in a single address space. However, it does have a modular design that allows the kernel to be modified during run time. This is discussed in a later section.

Linux monolithic structure

Layered Structure

  • The operating system is divided into a number of layers (levels), each built on top of lower layers.

  • The bottom layer (layer 0), is the hardware; the highest (layer N) is the user interface.

  • With modularity, layers are selected such that each layer only uses functions and services provided by the layer immediately below it.

Layered structure design

Microkernel Structure

  • As UNIX expanded, the kernel became large and difficult to manage.
  • In the 1980s, researchers at Carnegie Mellon University developed an operating system called "Mach" that modularized the kernel using the microkernel approach.
  • This method structures the operating system by:
    • removing all non-essential components from the kernel
    • implementing the non-essential components as user-level programs that reside in separate address spaces, which results in a smaller kernel.
  • Typically, microkernels provide minimal:
    • process management (i.e. CPU scheduling)
    • memory management
    • communications (IPC).

Microkernel structure

The main function of the microkernel is to provide communication between the client program and the various services that are also running in user space. Communication is provided through message passing.

The performance of microkernels can suffer due to increased overhead. When two user-level services must communicate, messages must be copied between the services, which reside in separate address spaces. In addition, the operating system may have to switch from one process to the next to exchange the messages.

The overhead involved in copying messages and switching between processes has been the largest impediment to the growth of microkernel-based operating systems.

Modular Structure

Perhaps the best current methodology for operating-system design involves using loadable kernel modules (LKMs). The kernel has a set of core components and can link in additional services via modules, either at boot or during run time. This type of design is common in modern implementations of UNIX (such as Linux, MacOS, and Solaris) as well as Windows.

The modular structure resembles the layered structure in that each kernel section has defined, protected interfaces. However, modular structure is more flexible than a layered system, because any module can call any other module.

Hybrid Structure

Most modern systems do not follow any strictly defined structure. Instead, they adopt a hybrid structure, which combines multiple approaches to address performance, security, and usability issues.

Examples:
Linux is monolithic, because having the operating system in a single address space provides very efficient performance. However, it is also modular, so that new functionality can be dynamically added to the kernel.

Building and Booting Linux

Here's a guide to building the Linux kernel yourself.

  • Download Linux source code (https://www.kernel.org).
  • Configure kernel using the command make menuconfig.
    • This step generates the .config configuration file.
  • Compile the main kernel using the command make.
    • The make command compiles the kernel based on the configuration parameters identified in the .config file, producing the file vmlinuz, which is the kernel image.
  • Compile kernel modules using the command make modules.
    • Just as with compiling the kernel, module compilation depends on the configuration parameters specified in the .config file.
  • Install the kernel modules using the command make modules_install.
  • Install new kernel on the system using the command make install.
    • When the system reboots, it will begin running this new operating system.

System Boot

BIOS

For legacy computers, the first step in the booting process is based on the Basic Input/Output System (BIOS), a small boot loader.

Originally, BIOS was stored in a ROM chip on the motherboard. In modern computers, BIOS is stored on flash memory on the motherboard so that it can be rewritten without removing the chip from the motherboard.

  1. When the computer is first powered on, the BIOS is executed.
  2. This initial boot loader usually does nothing more than loading a second boot loader, which is typically located at a fixed disk location called the boot block. Typically, the second boot loader is a simple program and only knows the address and the length of the remainder of the bootstrap program.
  3. In the typical scenario, the remainder of the bootstrap program is executed to locate the OS kernel.

UEFI

For recent computers, the booting process is based on the Unified Extensible Firmware Interface (UEFI).

UEFI has several advantages over BIOS:

  • Better support for larger disks (over 2TB)
  • Flexible pre-OS environment: UEFI can support remote diagnostics and repair of computers, even with no operating system installed.

Details found here: wikipedia

CH3: Processes

Modern computer systems allow multiple programs to be loaded into memory and executed concurrently. Formally, a process is a program in execution.

A process has a layout in memory that looks like the following diagram:

layout of process in memory

  • Text: Contains executable code.
  • Data: Global variables.
  • Heap: Memory that is dynamically allocated during runtime.
  • Stack: Temporary data storage when functions are invoked.

The text and data sections are fixed, as their sizes do not change during program run time. The stack and heap, however, can grow and shrink dynamically during program execution.

Each time a function is called, an activation record containing function parameters, local variables, and the return address is pushed onto the stack; when control is returned from the function, the activation record is popped from the stack.

Memory Layout of a C program

Memory layout of a c program

  • The data section is divided into two sub-sections: initialized and uninitialized data.
  • A separate section is provided for the argc and argv parameters passed to the main() function from the operating system.
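
As a quick sketch of my own (not from the slides), the small program below has one variable in each of the sections just described:

#include <stdio.h>
#include <stdlib.h>

int initialized_global = 42;   /* data section (initialized) */
int uninitialized_global;      /* data section (uninitialized) */

int main(int argc, char *argv[])   /* argc/argv get their own section */
{
    (void) argv;                           /* unused here */
    int local = 7;                         /* stack */
    int *dynamic = malloc(sizeof(int));    /* heap */

    *dynamic = 99;
    printf("%d %d %d %d %d\n",
           argc, initialized_global, uninitialized_global, local, *dynamic);

    free(dynamic);
    return 0;   /* the machine code of main() itself lives in the text section */
}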

Process State

As a process executes, it can change state. The state of a process is defined in part by the activity of that process.

The possible states a process can be in:

  • New: The process is created.
  • Running: Instructions in the process are being executed.
  • Waiting: The process is waiting for some event to occur.
  • Ready: The process is waiting to be assigned to a processor.
  • Terminated: The process has been terminated.

Note that the state names are generic and they vary across operating systems.

Process state diagram

Process Control Block

The Process Control Block (PCB), also known as Task Controlling Block (TCB), is a data structure in the OS kernel, which contains the information needed to manage the scheduling of a process.

PCB diagram

  • State: state of the process.
  • Number: process ID.
  • Program counter: location of instruction to execute.
  • CPU registers: contents of all process-related registers.
  • CPU scheduling information: process priority, points to queues etc.
  • Memory management information: memory allocated to the process.
  • Accounting information: CPU used, clock time elapsed since start, time limits.
  • IO status information: I/O devices allocated to process, list of open files, etc.

Scheduling

In a computer running concurrent processes, the OS needs to schedule processes effectively. The number of processes currently in memory is known as the degree of multiprogramming.

Most processes can be described as either I/O bound (spends most of its time doing I/O operations) or CPU bound (spends most of its time doing computations).

To schedule processes effectively, an OS typically maintains multiple scheduling queues, such as the ready queue and the wait queue.

  • Ready queue: a set of processes residing in main memory, ready and waiting to be executed.

  • Wait queue: a set of processes waiting for an event (e.g. I/O).

Ready and wait queue

Here is an example of a queueing diagram:

Queueing diagram example

CPU Scheduler

The role of the CPU scheduler is to select a process from the processes in the ready queue and allocate a CPU core to the selected one. To be fair to all processes and guarantee timely interaction with users, the CPU scheduler must select a new process for the CPU frequently.

Context Switching

Switching the CPU core to another process requires performing a state save of the current process and a state restore of a different process. This task is known as a context switch. When a context switch occurs, the kernel saves the context of the old process in its PCB and loads the saved context of the new process.

Context switch time is overhead, because the system does not do useful work while switching. Switching speed varies from machine to machine, depending on memory speed, number of registers etc. Typically, it takes several microseconds.

Context switching diagram

Process Creation

The OS needs to provide mechanisms for process creation and process termination. The creating process is called a parent process, and the created process is called the child process of the parent process.

A child process may in turn create other processes, forming a tree of processes.

Process Tree

The figure above is a typical tree of processes for the Linux operating system, showing the name of each process and its pid. Most operating systems identify each process using a unique process identifier (pid), which is typically an integer number.

The systemd process (traditionally init on older UNIX/Linux systems) always has a pid of 1, serves as the root process for all user processes, and is the first user process created when the system boots.

When a process creates a new process, two possibilities exist for its execution:

  • The parent continues to execute concurrently with the children.
  • The parent waits until some or all of its children have terminated.

There are also two address-space possibilities for the new process:

  • The child process is a duplicate of the parent process; it has the same data as the parent (see the note below for more details).
  • The child process has a new program loaded into itself.

NOTE (Additional details): When a child process is created on Linux, the OS maps the same physical memory to the two separate processes. Although each process THINKS it has an identical but separate copy of the data, the data is only stored once. However, if a write operation is performed on the data, the affected data is first copied so that each process gets its own physical version. Linux follows the "copy-on-write" principle (https://en.wikipedia.org/wiki/Copy-on-write).

Creating Processes On Unix

On Unix, a child process is created by the fork() system call. The parent and child process both continue to execute, and the next instruction is the one right after the fork(), with one difference: the return code of fork() is zero for the child process and the positive process identifier of the child for the parent process. If fork() fails, it returns -1.

  • Parent: fork() returns child process ID.
  • Child: fork() returns 0.

After a fork() system call, the child process can use one of the exec() family of system calls to replace the process's memory space with a new program.

After a fork(), the parent can issue a wait() system call to move itself to the wait queue until the termination of the child.

When the child terminates, the wait() system call in the parent process is completed and the parent process resumes its execution.

process creation on unix with exec

Process Creation In C

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main()
{
    int execlp_status;
    int wait_status;

    pid_t pid;

    /* fork a child process */
    pid = fork();

    if (pid < 0) { /* error occurred */
        fprintf(stderr, "Fork Failed");
        return 1;
    } 
    else if (pid == 0) { /* child process */
        execlp_status=execlp("ls","ls","-l",NULL);
        if (execlp_status == -1) {
            printf("Execlp Error!\n");
            exit(1);
        }
    } 
    else { /* parent process */
        /* parent will wait for the child to complete */
        wait(&wait_status);
        printf("Child Complete!\n");
    }

    return 0;
}
  • pid_t pid is a variable created to hold the process ID.
  • fork() creates a child process.
  • execlp() is one of the exec() family of system calls.
  • wait(&wait_status) waits for the child.
  • wait() returns the PID of the process that ended (useful if wait called with multiple processes).
  • wait_status holds the exit status of the child that exited.

To learn more about any of these functions, you can use the man command to read their manual pages. There is also a function waitpid() that offers more control over waiting; man 2 wait will show the details of this function.

Useful man pages:

man 2 fork
man 2 wait
man pid_t
man execlp  # shows the family of exec functions

If you want your man pages to look better follow the instructions here: Pretty man pages.

Exec Family

The exec() family includes a series of system calls; we focus on two of them: execlp() and execvp().

  • The l indicates a list arrangement: the execlp() function takes its arguments as a NULL-terminated list.
  • The v indicates an array arrangement: the execvp() function takes its arguments as a NULL-terminated array.
  • The p indicates that the current value of the PATH environment variable is used when the system searches for executable files.

TODO: Insert details of each of the functions and how to use them.
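
Until that is filled in, here is a rough sketch of my own showing how the two variants are called; both try to run the same ls -l command:

#include <stdio.h>
#include <unistd.h>   /* execlp(), execvp() */

int main(void)
{
    /* execlp(): arguments passed as a NULL-terminated list.
       The first argument is the program name (searched for in PATH);
       the second becomes argv[0] of the new program. */
    execlp("ls", "ls", "-l", (char *) NULL);

    /* We only reach this point if execlp() failed, because a successful
       exec never returns. execvp() takes the same arguments, but as a
       NULL-terminated array instead of a list. */
    char *args[] = { "ls", "-l", NULL };
    execvp(args[0], args);

    perror("exec failed");
    return 1;
}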

Interprocess Communication

Processes executing concurrently can be divided into two categories:

  • Independent: if it does not share data with any other processes executing in the system.
  • Cooperating: if it can affect or be affected by the other processes executing in the system. Clearly, any process that shares data with other processes is a cooperating process.

Cooperating processes require an interprocess communication (IPC) mechanism that will allow them to exchange data. There are two fundamental models for interprocess communication: shared memory and message passing.

IPC models

Shared Memory

This uses shared memory which requires communicating processes to establish a region of shared memory.

Typically, a shared-memory region resides in the address space of the process creating the shared-memory segment. Other processes that wish to communicate using this shared-memory segment must attach it to their address space.

Consider a producer-consumer problem:

  • A producer process produces information that is required by other processes.
  • A consumer process reads and uses information that is produced by the producer.

We create a buffer that can be filled by the producer and emptied by the consumer. An unbounded buffer places no practical limit on the size of the buffer. The bounded buffer assumes a fixed buffer size. In the second case, the consumer must wait if buffer is empty and the producer must wait if buffer is full.

#define BUFFER_SIZE 10

typedef struct {
    // ...
} item;

item buffer[BUFFER_SIZE];
int in = 0;
int out = 0;

The shared buffer is implemented as a circular array with two logical pointers in and out.

The producer uses the following code to add items to the buffer:

item next_produced;

while (true) {
    /* Produce an item in next_produced */
    while (((in + 1) % BUFFER_SIZE) == out) {
        // do nothing while full.
    }

    buffer[in] = next_produced;
    in = (in + 1) % BUFFER_SIZE;
}

The consumer uses the following code to read items from the buffer:

item next_consumed; 
while (true) {
    while (in == out) {
        /* do nothing when empty */
    }
    next_consumed = buffer[out]; 
    out = (out + 1) % BUFFER_SIZE;
    /* consume the item in next consumed */ 
} 

This wouldn't work at the moment because buffer is just an array and the processes would maintain separate copies. buffer needs to be "modified" to be a region of shared memory. This scheme allows at most BUFFER_SIZE - 1 items in the buffer at the same time.
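
One way to do that (a sketch of my own, not from the slides) is to allocate the buffer with mmap() using MAP_SHARED | MAP_ANONYMOUS before calling fork(), so that the parent (producer) and child (consumer) see the same physical memory:

#define _DEFAULT_SOURCE      /* for MAP_ANONYMOUS on Linux */
#include <stdio.h>
#include <sys/mman.h>        /* mmap() */
#include <sys/types.h>
#include <unistd.h>          /* fork() */

#define BUFFER_SIZE 10

typedef struct {
    int data[BUFFER_SIZE];   /* stand-in for the item array */
    int in;
    int out;
} shared_buffer;

int main(void)
{
    /* Anonymous shared mapping: it stays shared across fork(). */
    shared_buffer *buffer = mmap(NULL, sizeof(shared_buffer),
                                 PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (buffer == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    buffer->in = 0;
    buffer->out = 0;

    pid_t pid = fork();
    if (pid == 0) {
        /* child: the consumer loop from above would go here,
           reading buffer->data[buffer->out] */
    } else {
        /* parent: the producer loop from above would go here,
           writing buffer->data[buffer->in] */
    }

    return 0;
}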

Message Passing

Message passing provides a mechanism to allow processes to communicate and synchronize their actions without sharing the same memory region. It is particularly useful in a distributed environment, where the communicating processes may reside on different computers connected by a network.

A message-passing facility provides at least two operations: send(message) and receive(message). If processes P and Q want to communicate, a communication link must exist between them.

Here are few options for the link:

  • Direct or indirect communication
  • Synchronous or asynchronous communication
  • no buffering or automatic buffering.
Direct Communication

In direct communication, each process that wants to communicate must explicitly name the recipient or sender of the communication.

A communication link in this scheme has the following properties:

  • A link can be established between every pair of processes that want to communicate. The processes need to know each other's identity to communicate.
  • A link is associated with exactly two processes.
  • Between each pair of processes, there exists exactly one link.
Indirect Communication

With indirect communication, the messages are sent to and received from mailboxes or ports.

  • A mailbox can be viewed abstractly as an object into which messages can be placed by processes and from which messages can be removed.
  • Each mailbox has unique identification.
  • For example, POSIX message queues use an integer value to identify a mailbox.

In this scheme, a communication link has the following properties:

  • A link is established between a pair of processes only if both members of the pair have a shared mailbox.
  • A mailbox may be associated with more than two processes.
  • Between each pair of communicating processes, a number of different links may exist, with each link corresponding to one mailbox.
Synchronous Communication

Communications between processes takes place through calls to send() and receive() primitives.

There are different design options for implementing each primitive. Message passing may be either blocking or non-blocking - also known as synchronous and asynchronous.

  • Blocking send: After starting to send a message, the sending process is blocked until the message is received by the receiving process or by the mailbox.
  • Non-blocking send: The sending process sends the message and resumes other operations.
  • Blocking receive: After starting to wait for a message, the receiver is blocked until a message is available.
  • Non-blocking receive: The receiver is not blocked while it waits for the message.
Buffering

Whether communication is direct or indirect, messages exchanged by communicating processes reside in a temporary queue. Both the sender and receiver maintain a queue.

  • Zero capacity: The queue has a maximum length of zero.
  • Bounded capacity: The queue has a finite length n; thus at most n messages can reside in it.
  • Unbounded capacity: The queue's length is potentially infinite.

The zero-capacity case is sometimes referred to as a message system with no buffering. The other cases are referred to as message systems with automatic buffering.
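
As a concrete (and hedged) sketch of my own, indirect message passing can be tried out with the POSIX message-queue API; the mailbox name /demo_mq and the sizes below are arbitrary choices, and in a real program the sender and receiver would be different processes.

#include <stdio.h>
#include <string.h>
#include <fcntl.h>      /* O_* constants */
#include <sys/stat.h>   /* mode constants */
#include <mqueue.h>     /* mq_open(), mq_send(), mq_receive(), mq_unlink() */

int main(void)
{
    struct mq_attr attr = { 0 };
    attr.mq_maxmsg  = 10;    /* bounded capacity: at most 10 buffered messages */
    attr.mq_msgsize = 128;   /* each message is at most 128 bytes */

    /* Create (or open) the mailbox. */
    mqd_t mq = mq_open("/demo_mq", O_CREAT | O_RDWR, 0666, &attr);
    if (mq == (mqd_t) -1) {
        perror("mq_open");
        return 1;
    }

    /* send(message): place a message into the mailbox. */
    const char *msg = "Hello from the sender";
    mq_send(mq, msg, strlen(msg) + 1, 0);

    /* receive(message): take a message out of the mailbox. */
    char buf[128];
    mq_receive(mq, buf, sizeof(buf), NULL);
    printf("received: %s\n", buf);

    mq_close(mq);
    mq_unlink("/demo_mq");   /* remove the mailbox */
    return 0;
}

Like the shared-memory example later in this chapter, this may need -lrt when compiling on Linux.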

Shared Memory vs Message Passing

Message passing:

Pros:

  • Easy to set up and use.
  • Can be used over networks and distributed environments.

Cons:

  • Kernel needs to be involved in every distinct exchange of data.
  • Is slow since messages need to be handled by the kernel.

Shared Memory:

Pros:

  • Faster, since the kernel is only needed to establish the shared-memory region; later accesses involve no kernel.
  • Most flexible: the programmer can access the shared memory in any way they like.

Cons:

  • More work for the programmer, since synchronization mechanisms need to be explicitly coded.
  • Less suitable for distributed environments - difficult to share memory.

Shared Memory Implementation

With POSIX, shared memory is implemented using memory-mapped files. The code is based on the instructions here: wikipedia

  1. Create a shared-memory object using the shm_open system call.
  2. Once the object is created, the ftruncate function is used to configure the size of the object in bytes.
  3. The mmap function maps the shared-memory object into the process's address space so that it can be accessed with ordinary memory operations instead of file operations.

POSIX Producer:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>    /* O_CREAT, O_RDWR */
#include <sys/mman.h> /* shm_open(), mmap() */
#include <sys/stat.h>

int main() {
    const int SIZE = 4096; // the size (in bytes) of shared memory object
    const char *name = "OS"; // name of the shared memory object
    const char *message_0 = "Hello"; // strings written to shared memory
    const char *message_1 = "World"; // strings written to shared memory

    int shm_fd; // shared memory file descriptor
    void *ptr;  // pointer to shared memory object

    // Create the shared memory object
    shm_fd = shm_open(name, O_CREAT | O_RDWR, 0666);

    // configure the size of the shared memory object
    ftruncate(shm_fd, SIZE);

    // memory map the shared memory object
    // When the first parameter is set to 0, the kernel chooses the address at 
    // which to create the mapping. This is the most portable method of 
    // creating a new mapping.
    // The last parameter is the offset. When it is set to 0, the start of the 
    // file corresponds to the start of the memory object.
    // Designed memory protection: PROT_WRITE means it may be written.
    // MAP_SHARED indicates that updates to the mapping are visible to other processes.
    ptr = mmap(0, SIZE, PROT_WRITE, MAP_SHARED, shm_fd, 0);

    // sprintf(): It generates a C string and places it in 
    // the memory so that ptr points to it. 
    sprintf(ptr, "%s", message_0);

    // Note the NULL character at the end of the string is not included.
    // strlen() returns the length of a string, excluding the NULL character
    // at the end of the string.
    ptr += strlen(message_0);

    sprintf(ptr, "%s", message_1);
    ptr += strlen(message_1);

    return 0;
}

POSIX Consumer:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h> /* shm_open(), mmap() */
#include <sys/stat.h>

int main() {
    const int SIZE = 4096; // the size (in bytes) of shared memory object
    const char *name = "OS"; // name of the shared memory object

    int shm_fd; // shared memory file descriptor
    void *ptr;  // pointer to shared memory object

    // open the shared memory object
    shm_fd = shm_open(name, O_RDONLY, 0666);

    // memory map the shared memory object
    // Designed memory protection: PROT_READ means it may be read.
    ptr = mmap(0, SIZE, PROT_READ, MAP_SHARED, shm_fd, 0);

    // read from the shared memory object
    printf("%s", (char *)ptr);

    // remove the shared memory object
    shm_unlink(name);

    return 0;
}

When compiling the above programs, you need to link the realtime extensions library by adding -lrt to the compiling command.

You can use the following functions to work with shared memory:

  • shm_open()
  • ftruncate()
  • mmap()
  • shm_unlink()

Pipe

Pipe is a type of message-passing interprocess communication method.

There are different types of pipes and each one type has different conditions and implementations:

  • Unidirectional vs bidirectional: The flow of information is either in one direction or both directions.

  • Half duplex vs full duplex: If information can flow in both directions, can it flow in both ways at the same time?

  • Relationship: Does the pipe enforce a relationship between the processes such as a parent-child relationship.

  • Network communication: Can a pipe be formed over a network or does it have to reside on the same machine?

Unnamed Pipes (ordinary pipe)

Ordinary pipes allow two processes to communicate in a standard producer-consumer fashion where the producer writes to one end of the pipe called the write end and the consumer reads from the other end called the read end.

Unnamed pipes are:

  • Unidirectional - allow only one-way communication. Two different pipes can be used to communicate in both ways.

  • A parent-child relationship is required to create a pipe this way. Usually a parent creates a pipe and uses it to communicate to the child process.

  • As ordinary pipes require a parent-child relationship, this forces the communication to be on the same machine. That is, network communication is not allowed.

Ordinary pipes on UNIX systems are created using the pipe(int fd[]) function. Note that fd is commonly used as the variable name because it stands for file descriptor.

  • fd[0]: the read end of the pipe.
  • fd[1]: the write end of the pipe.

On UNIX, pipes are just treated as a special type of file. Therefore, we can use the regular write() and read() functions to write and read from these pipes.

For more details on this topic, you can see: Analysing Pipe operating In C

Example
#include <sys/types.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUFFER_SIZE 25
#define READ_END 0
#define WRITE_END 1

int main (int argc, char *argv[]) {
    char write_msg[BUFFER_SIZE] = "Greetings";
    char read_msg[BUFFER_SIZE];

    int fd[2];
    pid_t pid;

    if (pipe(fd) == -1) {
        fprintf(stderr, "pipe failed.\n");
        return 1;
    }

    pid = fork();

    if (pid < 0) {
        fprintf(stderr, "Fork failed.\n");
        return 1;
    }

    if (pid > 0) {
        /* close the unused end of the pipe */
        close(fd[READ_END]);

        write(fd[WRITE_END], write_msg, strlen(write_msg) + 1);

        close(fd[WRITE_END]);
    } else {
        close(fd[WRITE_END]);

        read(fd[READ_END], read_msg, BUFFER_SIZE);
        printf("read: %s\n", read_msg);

        close(fd[READ_END]);
    }

    return 0;
}

Named Pipes

Named pipes provide a much more powerful communication tool:

  • Communication can be bidirectional.
  • Communication is only half-duplex (one at a time).
  • No parent-child relationship is required.
  • Named pipes continue to exist after the communicating processes have terminated.
  • Named pipes require that communicating processes reside on the same machine.

Note: in the MS Windows version of named pipes, full-duplex communication is allowed and the communicating processes may reside on different machines.

Named pipes are referred to as FIFOs in UNIX systems. Once created, they appear as files in the file system.

Named pipes are created on UNIX systems using the mkfifo() system call and manipulated using the open(), read(), write() and close() system calls.
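
Here is a small sketch of my own (the path /tmp/demo_fifo is arbitrary): a writer creates the FIFO and sends one message through it, while a separate reader would simply open() the same path for reading.

#include <stdio.h>
#include <string.h>
#include <fcntl.h>      /* open() */
#include <sys/stat.h>   /* mkfifo() */
#include <unistd.h>     /* write(), close() */

int main(void)
{
    const char *fifo_path = "/tmp/demo_fifo";   /* arbitrary name */

    /* Create the named pipe; it appears as a file in the file system.
       If it already exists from a previous run, mkfifo() fails harmlessly. */
    if (mkfifo(fifo_path, 0666) == -1) {
        perror("mkfifo");
    }

    /* open() blocks until some reader opens the other end of the FIFO. */
    int fd = open(fifo_path, O_WRONLY);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    const char *msg = "Greetings through a FIFO\n";
    write(fd, msg, strlen(msg));
    close(fd);

    return 0;
}

A quick way to test it: run this program in one terminal and cat /tmp/demo_fifo in another.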

Socket

A socket refers to an endpoint for sending or receiving data across a computer network. A pair of processes communicating over a network use a pair of sockets (one for each process).

A socket is identified by an IP address concatenated with a port number, such as 185.139.135.65:5000.

communication via sockets visualized
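
As a rough sketch of my own (not from the slides), here is what creating a socket and connecting it looks like in C; the IP address 127.0.0.1 and port 5000 are placeholders.

#include <stdio.h>
#include <string.h>
#include <unistd.h>       /* close() */
#include <arpa/inet.h>    /* struct sockaddr_in, inet_pton(), htons() */
#include <sys/socket.h>   /* socket(), connect(), send() */

int main(void)
{
    /* Create a TCP socket: an endpoint for network communication. */
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock == -1) {
        perror("socket");
        return 1;
    }

    /* Address of the remote endpoint (placeholder values). */
    struct sockaddr_in server;
    memset(&server, 0, sizeof(server));
    server.sin_family = AF_INET;
    server.sin_port   = htons(5000);
    inet_pton(AF_INET, "127.0.0.1", &server.sin_addr);

    if (connect(sock, (struct sockaddr *) &server, sizeof(server)) == -1) {
        perror("connect");
        close(sock);
        return 1;
    }

    const char *msg = "hello over a socket\n";
    send(sock, msg, strlen(msg), 0);
    close(sock);

    return 0;
}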

Remote Procedure Call (RPC)

RPC is a special function call. The function (procedure) is called on the local computer, but the function is executed on the remote computer. Once the function execution is completed, the result is returned back to the local computer.

CH4: Threads and Concurrency

A thread is a basic unit of CPU utilization. It involves a thread ID, a program counter (PC), a set of registers, and a stack.

It shares the following resources with other threads belonging to the same process:

  • Code/Text section
  • Data section
  • Heap section
  • Other OS resources.

single-thread vs multi-thread visual

Benefits of Multithreaded Programming

  • Responsiveness: Multithreading an interactive application may allow a program to continue running even if part of it is blocked or is performing a lengthy operation.

  • Resource sharing: Threads share the memory and the resources of the process to which they belong. This makes sharing data between threads simpler than sharing data between processes.

  • Cost: Allocating memory and resources for process creation is costly. Because threads share the resources of the process to which they belong, it is more economical to create and context-switch threads.

  • Speed: The benefits of multithreaded programming can be even greater in a multiprocessor architecture, where threads may run in parallel on different processing cores.

Multicore Programming

Multithreaded programming provides a mechanism for more efficient use of multiprocessor or multi-core systems.

Consider an application with four threads. On a system with a single computing core, concurrency merely means that the execution of the thread will be interleaved over time.

single-core multi-thread processing

On a system with multiple cores, concurrency means that some threads can run in parallel, because the system can assign a separate thread to each core.

multi-core multi-thread processing

Concurrency VS Parallelism

  • A concurrent system supports more than one task by allowing all tasks to make progress using context-switches.

  • A parallel system can perform more than one task simultaneously. Multiple instructions are being run at a given instant.

Therefore, it is possible to have concurrency without parallelism.

Challenges

Designing new programs that are multithreaded involves the following challenges:

  • Identifying tasks: Find tasks that can be done simultaneously.

  • Balance: Work between tasks should be similar to maximize speed.

  • Data splitting: Data access must be managed, as data may now be accessed by multiple tasks at the same time.

  • Data dependency: When one task depends on data from another, ensure that the execution of the tasks is synchronized to accommodate the data dependency.

  • Testing and debugging: Testing and debugging multithreaded programs is inherently more difficult.

User Level Threads VS Kernel Level Threads

  • User level threads are managed at the user level without the kernel being involved.

  • Kernel level threads are managed at the kernel level and are completely managed by the operating system.

user-threads and kernel-threads

Relationship

For the threads belonging to the same process, a relationship must exist between user threads and kernel threads. There are three common ways of establishing such a relationship:

  • Many-to-one model
  • One-to-one model
  • Many-to-many model.
Many-to-One Model

This model maps multiple user-level threads to one kernel level thread.

  • A process is given a single kernel thread. There are multiple user-level threads. Thread management is done by the thread library in the user space.

  • The entire process is blocked if one thread makes a blocking system call.

  • Multiple threads are unable to run in parallel on multicore systems.

many-to-one model of threads

One-to-One Model

This model maps each user-level thread to one kernel level thread.

  • A process includes multiple kernel threads. There are multiple user-level threads. Each user-level thread is mapped to one kernel thread.

  • It allows one thread to run when another thread makes a blocking system call.

  • It also allows multiple threads to run in parallel on multicore systems.

  • The drawback is that it creates a large number of kernel threads which may worsen system performance.

Linux and Windows implement the one-to-one model.

one-to-one model of threads

Many-to-Many Model

It multiplexes multiple user-level threads corresponding to a process to a smaller or equal number of kernel threads corresponding to the same process.

  • The number of kernel threads may depend on the particular application or the particular machine.

  • An application may be allocated more kernel threads on a system with eight processing cores than a system with four cores.

  • This means the developer can create as many user threads as they like while the number of kernel threads stays manageable.

  • However, it is difficult to implement in practice, and the increasing number of processing cores appearing on most systems has reduced the importance of limiting kernel threads.

many-to-many model of threads

Pthreads

Pthreads is a threads extension of the POSIX standard. This may be provided as either a user-level or a kernel-level library. On Linux, it is implemented as a kernel level library.

There are two strategies for creating multiple threads: asynchronous threading and synchronous threading.

Asynchronous threading: Once the parent creates a child thread, the parent resumes its execution, so that the parent and child execute concurrently and independently of one another. Asynchronous threading is commonly used for designing responsive user interfaces.

Synchronous threading: It occurs when the parent thread creates one or more children and then must wait for all of its children to terminate before it resumes. Typically, synchronous threading involves significant data sharing among threads.

Example

Note: To compile programs using pthread.h, we need to add the -lpthread argument to link the threading library when compiling our program.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int sum; /* shared by the threads of this process (data section) */

/* The new thread begins execution in this function. */
void *runner(void *param) {
    int upper = atoi(param); /* upper bound, passed in as a string */
    sum = 0;
    for (int i = 1; i <= upper; i++) {
        sum += i;
    }
    pthread_exit(0);
}

int main(int argc, char *argv[]) {
    pthread_t tid; /* thread identifier */

    /* Create a thread with default attributes; it runs runner(argv[1]). */
    pthread_create(&tid, NULL, runner, argv[1]);

    /* Wait for the thread to finish (synchronous threading). */
    pthread_join(tid, NULL);

    printf("sum = %d\n", sum);
    return 0;
}

Lifetime

Each C program starts with one thread, which corresponds to the main() function. When the main() thread terminates and the process exits, all threads belonging to the same program terminate with it.

Thread Count

Hyperthreading: Hyperthreading is Intel's simultaneous multithreading (SMT) implementation used to improve parallelization of computations performed on x86 microprocessors. With hyperthreading, one physical core appears as two virtual cores to the operating system, allowing concurrent scheduling of two threads per core.

Thread Count: Theoretically, the total number of threads should be equal to the number of cores of the system for optimal performance. On systems that support hyperthreading, the number should be equal to twice the number of cores. However, in practice the optimal number of threads varies with a bunch of parameters, such as the other running programs and the scheduling algorithm. Therefore, it is best to experiment and find the best count.
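
On Linux, one way to get a starting point (my own addition, using the sysconf() call available in glibc) is to ask how many logical cores are currently online; with hyperthreading enabled, this count already includes the virtual cores.

#include <stdio.h>
#include <unistd.h>   /* sysconf() */

int main(void)
{
    /* Number of processors currently online (logical cores, i.e. the
       hyperthreaded "2 x physical" figure where SMT is enabled). */
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    if (cores < 1) {
        cores = 1;   /* fall back to one thread if the query fails */
    }

    printf("suggested starting thread count: %ld\n", cores);
    return 0;
}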

Implicit Threading

Designing multithreaded programs involves many challenges. One way to address these difficulties is to transfer the creation and management of threads from application developers to compilers and run-time libraries. This strategy, termed implicit threading, is an increasingly popular trend.

Implicit threading involves identifying tasks rather than threads. Once identified, the tasks are passed to the library, which decides the optimal number of threads to use. The library handles thread creation and management.

A common example of implicit threading is the thread pool. The general idea behind a thread pool is to create a number of threads at startup and place them into a pool, where they sit and wait for work. When the OS receives a computation request, rather than creating a thread, it instead submits the request to the thread pool and resumes, waiting for additional requests.

Thread pools offer these benefits:

  • Servicing a request with an existing thread is often faster than waiting and thereafter creating a thread.

  • A thread pool limits the number of threads that exist at any point in time. This is particularly important on systems that cannot support a large number of concurrent threads.

  • Thread creation and management is transferred from application developers to compilers and run-time libraries.

The number of threads in the pool can be set heuristically according to factors such as:

  • The number of CPUs in the system
  • The amount of physical memory
  • The expected number of concurrent client requests.

OpenMP

Open Multi-Processing (OpenMP) is an API for programs written in C, C++, or FORTRAN, which provides support for parallel programming in shared-memory environments.

OpenMP identifies parallel regions as blocks of code that may run in parallel.

Application developers insert compiler directives into their code at parallel regions, and these directives instruct the OpenMP run-time library to execute the region in parallel.

Example:

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

#define BUF_SIZE 1000000000

int main(int argc, char *argv[]) {

    int *a = malloc(BUF_SIZE * sizeof(int));
    int *b = malloc(BUF_SIZE * sizeof(int));
    int *c = malloc(BUF_SIZE * sizeof(int));

    // Initialize arrays
    #pragma omp parallel for
    for (int i = 0; i < BUF_SIZE; i++) {
        a[i] = 3;
        b[i] = 7;
        c[i] = 0;
    }

    #pragma omp parallel
    {
        printf("I am a parallel region.\n");
    }

    #pragma omp parallel for
    for (int i = 0; i < BUF_SIZE; i++) {
        c[i] = a[i] + b[i];
    }

    long long sum = 0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < BUF_SIZE; i++) {
        sum += c[i];
    }

    printf("total sum: %lld\n", sum);

    free(a);
    free(b);
    free(c);

    return EXIT_SUCCESS;
}

The output of which is:

I am a parallel region.
I am a parallel region.
I am a parallel region.
I am a parallel region.
I am a parallel region.
I am a parallel region.
I am a parallel region.
I am a parallel region.
I am a parallel region.
I am a parallel region.
I am a parallel region.
I am a parallel region.
I am a parallel region.
I am a parallel region.
I am a parallel region.
I am a parallel region.
total sum: 10000000000

Here we see that the string is printed 16 times. This is because the machine that this code was run on has 16 threads. The for-loops are also executed in parallel for initialization and summation.

In this code, the #pragma omp parallel for reduction(+:sum) directive tells OpenMP to parallelize the loop and use the sum variable to accumulate the total sum. Each thread will have its own local copy of sum, and at the end of the parallel region, OpenMP will automatically sum up all these local copies into the original sum variable. The + in reduction(+:sum) specifies that the reduction operation is summation.

Here is my CPU while the process is running:

cpu-utilization-by-openmp

I wrote a single-threaded version without using OpenMP and compared the running times. The single-threaded version took 22 seconds to execute, whereas the OpenMP version only took 6 seconds. It was also clear that the single-threaded version was only using one core at any given time.

Here we clearly see that all my threads are being used. Notice that the code doesn't contain many details about the parallelization - just a few directives stating the type of parallelization that we require.

CH5: CPU Scheduling

CPU scheduling is the basis of multiprogrammed operating systems. On modern operating systems, it is kernel-level threads (not processes) that are in fact being scheduled by the operating system. However, the terms "process scheduling" and "thread scheduling" are often used interchangeably.

Typically, process execution alternates between two states:

  • CPU burst: Where a process is mainly using the CPU.
  • I/O burst: Where a process is performing I/O operations.

A process alternates between CPU bursts and I/O bursts. Eventually, the final CPU burst ends with a system request to terminate execution.

The durations of CPU bursts vary greatly from process to process. However, most processes tend to have a frequency curve similar to that shown in the following figure:

cpu burst frequency chart

The curve is generally characterized as exponential or hyperexponential, with a large number of short CPU bursts and a small number of long CPU bursts.

Whenever the CPU becomes idle, the operating system must select one of the processes in the ready queue to be executed. The selection process is carried out by the CPU scheduler, which selects a process from the processes in memory that are ready to execute and allocates the CPU to that process.

The ready queue is a list of processes ready to be executed; it is usually maintained as a list of the process control blocks (PCBs) of those processes.

Preemptive VS Nonpreemptive

  • Nonpreemptive schedulers: once the CPU has been allocated to a process, the process keeps the CPU until it releases it either by terminating or by switching to the waiting state.

  • Preemptive schedulers: after the CPU has been allocated to a process, the process could possibly lose the CPU before it terminates or switches to the waiting state.

Dispatcher

The dispatcher is the module that gives control of the CPU's core to the process selected by the CPU scheduler. This function involves the following steps:

  • switching context
  • switching to user mode
  • jumping to the proper location in the user program to restart the program.

The time it takes for the dispatcher to stop one process and start another running is known as the dispatch latency.

Scheduling Algorithms

You'll find all the scheduling algorithms coded in C in the following repository: GitHub.

Evaluation Criteria For CPU Schedulers

Here are some typical criteria used to rank algorithms:

  • CPU utilization: the percentage of the time when CPU is utilized.
  • Throughput: the number of processes that complete their execution per unit of time.
  • Turnaround time: the interval from the time of submission of a process to the time of completion.
  • Waiting time: sum of the time spent waiting in the ready queue.
  • Response time: the time from the submission of a process until the first response is produced.

First-Come First-Served (FCFS)

The simplest CPU-scheduling algorithm is the first-come, first-served (FCFS) scheduling algorithm. With this scheme, the process that requests the CPU first is allocated the CPU first.

The implementation of the FCFS policy is easily managed with a FIFO queue. When a process enters the ready queue, its PCB is linked onto the tail of the queue. When the CPU is free, it is allocated to the process at the head of the queue. The running process is then removed from the queue.

The drawback of this algorithm is that the average waiting time under the FCFS policy is often quite long.

Suppose the processes arrive in the order: P1, P2, P3.

fcfs scheduling Gantt chart

This algorithm suffers from the "convoy effect". The convoy effect is a scheduling phenomenon in which a number of processes wait for one process to get off a core, causing overall device and CPU utilization to be suboptimal.

Overview of FCFS:

  • Simple implementation
  • Nonpreemptive nature
  • Long average waiting time
  • Experiences convoy effect
  • Bad for interactive systems.

Here is the algorithm written for a simulation of CPU scheduler: Github FCFS

Shortest Job First (SJF)

The shortest-job-first (SJF) algorithm schedules the job with the shortest CPU burst first. When the CPU is available, it is assigned to the process that has the smallest next CPU burst. If two processes have the same next CPU burst length, FCFS scheduling is used to break the tie.

sjf scheduling example

The SJF scheduling algorithm is provably optimal, in that it gives the minimum average waiting time for a given set of processes.

Here is the algorithm written for a simulation of CPU scheduler: Github Nonpreemptive SJF

Predicting Next CPU Burst

However, SJF needs to know the length of the next CPU burst, and we can't know the CPU burst of a process before it runs. Therefore, we use a predicted next CPU burst to make the scheduling decision.

The following notations are used in the prediction:

  • \(t_n\): the length of the current CPU burst.
  • \(\tau_n\): the predicted value of the current CPU burst.
  • \(\tau_{n+1}\): the predicted value for the next CPU burst.
  • \(\alpha\): a tuning parameter \((0 \leq \alpha \leq 1)\), typically, \(\alpha = 1/2\).

The following equation can be used to predict the next CPU burst:

\[ \tau_{n+1} = \alpha t_n + (1 - \alpha)\tau_n \]

This equation combines the current CPU burst \(t_n\) with the previous prediction \(\tau_n\) to calculate the predicted value of the next CPU burst. The parameter \(\alpha\) controls the relative weight of the current CPU burst: the higher the value of \(\alpha\), the more \(t_n\) affects the prediction.

The initial \(\tau_0\) can be defined as a constant or as an overall system average.
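
A minimal sketch of this exponential average in C; the burst lengths and the initial guess in the usage comment are hypothetical values:

/* exponential averaging: tau_{n+1} = alpha * t_n + (1 - alpha) * tau_n */
double predict_next_burst(double alpha, double t_n, double tau_n)
{
    return alpha * t_n + (1.0 - alpha) * tau_n;
}

/* example usage (hypothetical numbers):
 *   double tau = 10.0;                          // initial guess, tau_0
 *   tau = predict_next_burst(0.5, 6.0, tau);    // observed burst of 6 ms -> tau_1 = 8.0 ms
 */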

next-cpu-burst-graph

Preemptive SJF

The SJF algorithm can be either preemptive or non-preemptive.

Consider a situation where a process arrives at the ready queue with a shorter CPU burst when a process is currently running. If the scheduler considers removing the current process and scheduling the new one then it is preemptive. If the scheduler will always allow the process running to finish before scheduling the next process then it is non-preemptive.

Here is an example:

preemptive-sjf-example

Here we see that process \(P_1\) was scheduled first, but when \(P_2\) arrived, its CPU burst was shorter than the remaining CPU burst of \(P_1\), so \(P_2\) preempted \(P_1\).

Here is the algorithm written for a simulation of CPU scheduler: Github Preemptive SJF

Round Robin (RR)

The round-robin (RR) scheduling algorithm is similar to FCFS scheduling, but preemption is added to enable the system to switch between processes.

Each process, when scheduled, is given a small unit of time, called a time quantum or time slice, that it can run for. Once the time quantum expires, another process is scheduled. The ready queue is treated as a circular queue.

The CPU scheduler goes around the ready queue, allocating the CPU to each process for a time interval of up to 1 time quantum.

If the CPU burst of the currently running process is longer than 1 time quantum, the timer will go off and cause an interrupt to the operating system. A context switch will be executed, and the process will be put at the tail of the ready queue. The CPU scheduler will then select the next process in the ready queue. Thus, the RR scheduling algorithm is preemptive.
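
As a rough illustration only (not the course simulator linked below), the following C sketch shows how remaining CPU bursts shrink under RR with a fixed time quantum, assuming all processes arrive at time 0 and ignoring context-switch overhead:

#include <stdio.h>

#define QUANTUM 4   /* hypothetical time quantum, in ms */

int main(void)
{
    int remaining[] = {24, 3, 3};   /* hypothetical remaining CPU bursts (ms) */
    int n = sizeof(remaining) / sizeof(remaining[0]);
    int time = 0, done = 0;

    /* cycling through the array approximates the circular ready queue */
    while (done < n) {
        for (int i = 0; i < n; i++) {
            if (remaining[i] == 0)
                continue;
            int slice = remaining[i] < QUANTUM ? remaining[i] : QUANTUM;
            printf("t=%2d: P%d runs for %d ms\n", time, i + 1, slice);
            time += slice;
            remaining[i] -= slice;
            if (remaining[i] == 0)
                done++;
        }
    }
    printf("All processes finished at t=%d ms\n", time);
    return 0;
}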

Here is the algorithm written for a simulation of CPU scheduler: Github Round Robin

The average waiting time is typically high in this policy.

Example:

round-robin-example

The performance of the round-robin algorithm depends heavily on the size of the time quantum. If the time quantum is extremely large, the policy behaves the same as the FCFS policy. In contrast, if the time quantum is extremely small, the policy results in a large number of context switches.

The turnaround time also depends on the size of the time quantum. The following figure shows the average turnaround time for a set of processes for a given time quantum:

round-robin-turnaround-time

We can see that turnaround time does not necessarily improve as the time-quantum size increases.

In general, the average turnaround time can be improved if most processes finish their next CPU burst in a single time quantum. The time quantum should be large compared with the context-switch time, but it should not be too large. A rule of thumb is that 80 percent of the CPU bursts should be shorter than the time quantum.

Priority Scheduling

A priority is associated with each process, and the CPU is allocated to the process with the highest priority. Processes with equal priority are scheduled in FCFS order.

In this course, we assume that a low number represents a high priority. However, this is only a convention.

priority-scheduling-example

  • Priorities can be defined either internally or externally.

  • Internal priorities use some measurable quantity or quantities to compute the priority of the process, for example, memory requirements or the number of open files.

  • External priorities are set by criteria outside the operating system. On Linux you can use the nice command to change the priority of a process.

A major problem with priority scheduling algorithms is indefinite blocking, or starvation: a low-priority process may never be scheduled because higher-priority processes keep arriving.

A solution to this problem is aging. This involves gradually increasing the priority of the processes that wait in the system for a long time.

Priority Scheduling with Round-Robin

Priority scheduling can be combined with the round-robin scheduling so that the system executes the highest-priority process using priority scheduling and runs processes with the same priority using round-robin scheduling.

priority scheduling with round robin

Multilevel Queue Scheduling

With priority scheduling, all processes may be placed in a single queue and the scheduler selects the process with the highest priority to run. Depending on how the queues are managed, an \(O(n)\) search may be necessary to determine the highest-priority process.

In practice, it is often easier to have separate queues for each distinct priority, which is known as a multilevel queue.

multilevel queue scheduling example 1

A multilevel queue could be used to achieve a flexible process scheduling. A multilevel queue scheduling algorithm typically partitions processes into several separate queues based on the process type. For example, separate queues can be maintained for background and foreground processes.

Each queue could have its own scheduling algorithm. For example, the foreground queue could be scheduled by an RR algorithm while the background queue could be scheduled by FCFS algorithm.

multilevel queue scheduling example 2

Another possibility is to time-slice among the queues where each queue gets a certain portion of the CPU time, which it can then schedule among its various processes. For instance, in the foreground-background queue example the foreground queue can be given 80 percent of the CPU time and the background queue gets 20 percent of CPU time.

Feedback Based Scheduling

Normally, when the multilevel queue scheduling algorithm is used, processes are permanently assigned to a queue when they enter the system. This setup has the advantage of low scheduling overhead but it is inflexible.

The multilevel feedback queue scheduling algorithm, in contrast, allows a process to move between queues. The idea is to separate processes according to the characteristics of their CPU bursts.

If a process uses too much CPU time, it will be moved to a lower-priority queue. This scheme leaves I/O-bound and interactive processes in the higher-priority queues.

In addition, a process that waits too long in a lower-priority queue may be moved to a higher-priority queue. This form of aging prevents starvation.

For example, consider a multilevel feedback queue scheduler with three queues, numbered from 0 to 2. The scheduler first executes all processes in queue 0. Only when queue 0 is empty will it execute processes in queue 1. A process in queue 1 will be preempted by a process arriving for queue 0. A process that arrives for queue 1 will preempt processes in queue 2.

multilevel feedback based scheduling

  1. An entering process is put in queue 0.

    • A process in queue 0 is given a time quantum of 8 milliseconds.
    • If it does not finish within this time window, it is moved to the tail of queue 1.
  2. If queue 0 is empty, the process at the head of queue 1 is given a quantum of 16 milliseconds.

    • If it does not complete within this time window, it is moved to the tail of queue 2.
    • If a process arrives in queue 0, then the process running from queue 1 is preempted.

  3. If queues 0 and 1 are both empty, the processes in queue 2 are run on an FCFS basis.

    • If a process arrives in queue 0 or 1, then the process running from queue 2 is preempted.

  4. To prevent starvation, a process that waits too long in a lower-priority queue may gradually be moved to a higher-priority queue.

The definition of a multilevel feedback queue scheduler makes it the most general CPU-scheduling algorithm. However, this also makes it the most complex algorithm.

Thread Scheduling

On most modern operating systems, it is kernel-level threads (not processes) that are being scheduled by the operating system.

For systems using many-to-one models, two steps are required to allocate the CPU to a user-level thread:

  • Process contention scope (PCS): Locally, the threads belonging to the same process are scheduled so that a user-level thread could be mapped to a kernel-level thread.

  • System Contention Scope (SCS): Globally, kernel-level threads are scheduled so that the CPU could be allocated to a kernel-level thread.

For systems using a one-to-one model, such as Windows and Linux, only SCS is required.
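
Pthreads exposes the contention scope through thread attributes. The sketch below uses the standard pthread_attr_setscope() and pthread_attr_getscope() calls to request and then report system contention scope; note that Linux supports only PTHREAD_SCOPE_SYSTEM:

#include <pthread.h>
#include <stdio.h>

int main(void)
{
    pthread_attr_t attr;
    int scope;

    pthread_attr_init(&attr);

    /* request system contention scope (kernel-level scheduling) */
    if (pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM) != 0)
        printf("Unable to set scheduling scope\n");

    pthread_attr_getscope(&attr, &scope);
    printf("Scope: %s\n",
           scope == PTHREAD_SCOPE_SYSTEM ? "SYSTEM" : "PROCESS");

    pthread_attr_destroy(&attr);
    return 0;
}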

Multiple-Processor Scheduling

The scheduling algorithms we have discussed so far assume a single processor. However, most computers today have multiple processing cores; how do we handle such situations?

  • Asymmetric Multiprocessing: A single processor, the master server, handles all scheduling decisions, I/O processing, and other system activities. The other processors only execute user code. This method is simple to implement because only one processor accesses the system data structures, reducing the need for data sharing. However, the master server can become a bottleneck that degrades overall system performance.

  • Symmetric Multiprocessing (SMP): This is the standard scheme for multiprocessor systems. Each processor is self-scheduling: the scheduler for each processor examines the ready queue and selects a process to run.

With SMP, there are two possible strategies to organize the processes eligible to be scheduled. Either we have a single ready queue or each processor is given its own private ready queue.

multiple-processor queue options

Since strategy (b) does not require any synchronization to access the queue, it is the most common approach on systems supporting SMP.

Load Balancing

Load balancing attempts to keep the workload evenly distributed across all processors in an SMP system. This is an important consideration to fully utilize the benefits of having multiple processors.

Load balancing is typically necessary only on systems where each processor has its own private ready queue of eligible processes to execute. On systems with a common queue, load balancing is unnecessary.

There are two general approaches to load balancing: push migration and pull migration.

  • Push Migration: a specific task periodically checks the load on each processor; if it finds an imbalance, it evenly distributes the load by moving processes from overloaded to idle (or less-busy) processors.

  • Pull Migration: in this case, an idle processor pulls a waiting process from a busy processor.

Pull and push migration do not need to be mutually exclusive and are, in fact, often implemented in parallel on load-balancing systems. For example, the Linux process scheduler, Completely Fair Scheduler (CFS), implements both techniques.

Processor Affinity

On most systems, each processor has its own cache memory. When a process has been running on a specific processor the data most recently accessed by the process populate the cache of the processor. As a result, successive memory accesses by the process are often satisfied by cache memory.

If a process migrates to another processor, say due to load balancing, the contents of the cache must be invalidated on the first processor, and the cache of the second processor must be repopulated.

Because of the high cost of invalidating and repopulating caches, most operating systems with SMP support try to avoid migrating a process from one processor to another. This is known as processor affinity: a process has an affinity for the processor on which it is currently running, and the system attempts to keep scheduling a given process on the same processor.

Examples

Process scheduling in Linux:

  • Until Version 2.5: The Linux kernel ran a variation of the traditional UNIX scheduling algorithm. This algorithm was not designed with SMP systems in mind and did not adequately support systems with multiple processors.

  • With Version 2.5: The scheduler was overhauled to include a scheduling algorithm known as \(O(1)\) that runs in constant time regardless of the number of tasks in the system. It also provided increased support for SMP systems, including processor affinity and load balancing between processors. This led to excellent performance on SMP systems, but to poor response times for the interactive processes that are common on desktop computer systems.

  • With Version 2.6: In release 2.6.23 of the kernel, the Completely Fair Scheduler (CFS) became the default Linux scheduling algorithm.

Algorithm Evaluation

  • Deterministic Modeling: an evaluation method that takes a particular predetermined workload and computes the performance of each algorithm for that workload. It requires exact numbers as input, and its answers apply only to those specific cases, but it can be used to indicate trends that can then be analyzed.

  • Simulations: Running a simulation involves programming a model of the computer system. The data to drive the simulation can be generated in several ways. The most common method uses a random-number generator that is programmed to generate processes, CPU burst times, arrivals, departures, and so on.

    • We can also use trace files: we monitor the real system and record the sequence of actual events, and then use this sequence to drive the simulation.

trace-based simulation for algorithm eval

CH6: Synchronization Tools

Processes can be executed concurrently or in parallel. Context switches can be carried out rapidly to provide concurrent execution, which means that one process may be only partially completed before another process is scheduled. On a multicore system, two different processes can also be executed simultaneously on separate processing cores.

We need to solve issues involving the integrity of data shared by several processes.

Race Condition: A race condition occurs when two or more processes can access shared data and try to change it at the same time. Because the scheduling algorithm can switch between processes at any time, we do not know the order in which they will access the shared data. The result of the change therefore depends on the scheduling order, i.e. the processes are "racing" to access and change the data.

Source For Race Condition

Here is a good YouTube Video that explains race conditions and deadlocks.
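
As a minimal (hypothetical) demonstration in C, the following program lets two threads increment a shared counter with no synchronization. Because the increment is a read-modify-write sequence that can be interleaved, the final value is usually less than the expected 2,000,000. Compile it with gcc filename.c -lpthread.

#include <pthread.h>
#include <stdio.h>

long counter = 0;   /* shared data, deliberately unprotected */

void *increment(void *arg)
{
    for (int i = 0; i < 1000000; i++)
        counter++;   /* not atomic: read, add, write back */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* updates are frequently lost, so the result is usually < 2000000 */
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}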

Critical Section Problem

Consider a system with \(n\) processes \(\{p_0, p_1, \dots, p_{n-1}\}\). Each process has a segment of code, called a critical section, which accesses and updates data that is shared with at least one other process.

The important feature of the system is that, when one process is running in its critical section, no other process is allowed to run in its critical section. That is, no two processes run in their critical section at the same time.

The critical-section problem is about designing a protocol that the processes can use to synchronize their activity so as to cooperatively share data. With this protocol, each process must request permission to enter its critical section.

  • Entry section: The section of code implementing the request.
  • Exit section: The critical section may be followed by an exit section.
  • Remainder section: The remaining code is the remainder section.

A solution must satisfy the following three requirements:

  1. Mutual Exclusion: If process \(P_i\) is running in its critical section, then no other processes can run in their critical sections.

  2. Progress: If no process is running in its critical section and some processes wish to enter their critical sections, then only those processes that are not running in their remainder sections can participate in the procedure of selecting a process to enter its critical section next, and this selection cannot be postponed indefinitely.

  3. Bounded Waiting: There exists a bound, or limit, on the number of times that other processes are allowed to enter their critical sections after a process has made a request to enter its critical section and before that request is granted.

Peterson's Solution

A software-based solution to the critical-section problem.

Important Note: However, because of the way modern computer architectures perform basic machine-language instructions, there is no guarantee that Peterson's solution will work correctly with such architectures. This is primarily because to improve system performance, processors and/or compilers may reorder read and write operations that have no dependencies.

Peterson's solution is restricted to two processes that alternate execution between their critical sections and remainder sections. The processes are numbered \(P_0\) and \(P_1\).

Peterson's solution requires the two processes to share two data items:

int turn;
boolean flag[2];

The variable turn indicates whose turn it is to enter its critical section. That is, if turn == i, then process \(P_i\) is allowed to run in its critical section.

The array flag is used to indicate if a process is ready to enter its critical section. If flag[i] is true, \(P_i\) is ready to enter its critical section.

while (true) {
    flag[i] = true;
    turn = j;
    while (flag[j] && turn == j) {
        // wait for j to finish
        ;
    }
    /* critical section */

    flag[i] = false;

    /* remainder section */
}

If both processes try to enter at the same time, turn will be set to both i and j at roughly the same time. Only one of these assignments will last; the other will occur but will be overwritten immediately. The eventual value of turn determines which of the two processes is allowed to enter its critical section first.

Hardware Synchronization

Hardware could be utilized to solve the critical-section problem. These include:

  • Memory barriers
  • Hardware instructions
  • Atomic variables.

These solutions are generally complicated and inaccessible to application programmers.

Mutex Locks

Mutex is short for "mutual exclusion". The operating system designers build higher-level software tools to solve the critical-section problem. We use mutex locks to protect critical sections and thus prevent race conditions. A process must acquire the lock before entering a critical section; it releases the lock when it exits the critical section.

The acquire() function acquires the lock, and the release() function releases the lock, as illustrated in the following code.

while (true) {
    < acquire lock >
    critical section
    < release lock >
    remainder section
}

A mutex lock has a boolean variable available whose value indicates if the lock is available or not. If the lock is available, a call to acquire() succeeds, and the lock is considered unavailable. A process that attempts to acquire an unavailable lock is blocked until the lock is released.

Calls to either acquire() or release() must be performed atomically (an uninterruptible unit). This can be achieved via hardware support.

Spinlock: Mutex locks that just loop while they wait are also called spinlocks. The process "spins" while waiting for the lock to become available.

The advantage of spinlocks is that they do not require a context switch. The disadvantage is that they waste CPU cycles while busy waiting. Nevertheless, in certain circumstances on multicore systems, spinlocks are in fact the preferred choice for locking.
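
Conceptually, the acquire() and release() operations of such a lock can be sketched as follows. This is only a model: as written, the test and the assignment in acquire() are not atomic, so a real implementation relies on an atomic hardware instruction (e.g. test-and-set or compare-and-swap).

/* conceptual sketch only -- the test and the set in acquire() must be made atomic */
int available = 1;   /* 1: lock is free, 0: lock is held */

void acquire(void)
{
    while (available == 0)
        ;                /* busy wait ("spin") until the lock is free */
    available = 0;       /* take the lock */
}

void release(void)
{
    available = 1;       /* make the lock available again */
}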

Mutex Locks In C

Here is an example of using mutex locks (from GeeksforGeeks):

#include <pthread.h> 
#include <stdio.h> 
#include <stdlib.h> 
#include <string.h> 
#include <unistd.h> 

pthread_t tid[2]; 
int counter; 
pthread_mutex_t lock; 

void* trythis(void* arg) 
{ 
    pthread_mutex_lock(&lock); 

    unsigned long i = 0; 
    counter += 1; 
    printf("\n Job %d has started\n", counter); 

    for (i = 0; i < (0xFFFFFFFF); i++) 
        ; 

    printf("\n Job %d has finished\n", counter); 

    pthread_mutex_unlock(&lock); 

    return NULL; 
} 

int main(void) 
{ 
    int i = 0; 
    int error; 

    if (pthread_mutex_init(&lock, NULL) != 0) { 
        printf("\n mutex init has failed\n"); 
        return 1; 
    } 

    while (i < 2) { 
        error = pthread_create(&(tid[i]), NULL, &trythis, NULL); 
        if (error != 0) 
            printf("\nThread can't be created :[%s]", 
                strerror(error)); 
        i++; 
    } 

    pthread_join(tid[0], NULL); 
    pthread_join(tid[1], NULL); 
    pthread_mutex_destroy(&lock); 

    return 0; 
} 

Semaphore

A semaphore is an integer variable that, apart from initialization, is accessed only through two indivisible (atomic) operations: wait() and signal().

The wait and signal operations:

wait(S) {
    while (S <= 0)
        ;  // busy wait
    S--;
}

signal(S) {
    S++;
}

All modifications to the integer value of the semaphore in the wait() and signal() operations must be executed atomically.

  • Binary semaphore: The integer value can only range from 0 to 1. This is similar to a mutex lock.

  • Counting semaphore: The integer value can range from 0 to N.

    • The semaphore is initialized to the number of resources available.
    • Each process that wishes to use a resource performs a wait() operation on the semaphore.
    • When a process releases a resource, it performs a signal() operation.
    • When the semaphore becomes 0, all resources are being used.
    • After that, processes that wish to use a resource will be blocked until the semaphore becomes greater than 0.

Semaphores In C

// C program to demonstrate working of Semaphores 
#include <stdio.h> 
#include <pthread.h> 
#include <semaphore.h> 
#include <unistd.h> 

sem_t mutex; 

void* thread(void* arg) 
{ 
    //wait 
    sem_wait(&mutex); 
    printf("\nEntered..\n"); 

    //critical section 
    sleep(4); 

    //signal 
    printf("\nJust Exiting...\n"); 
    sem_post(&mutex); 
    return NULL;
}


int main() 
{ 
    sem_init(&mutex, 0, 1); 
    pthread_t t1,t2; 
    pthread_create(&t1,NULL,thread,NULL); 
    sleep(2); 
    pthread_create(&t2,NULL,thread,NULL); 
    pthread_join(t1,NULL); 
    pthread_join(t2,NULL); 
    sem_destroy(&mutex); 
    return 0; 
} 

Monitors

Although semaphores provide a convenient and effective mechanism for process synchronization, using semaphores incorrectly can result in errors that are difficult to detect.

Proposed by Hoare in 1974 and Brinch Hansen in 1975, monitors are a language-specific synchronization construct. They provide a fundamental guarantee that only one process may be in a monitor at any time.

Monitors must be implemented at the compiler/language level. The compiler must ensure that the property is preserved. It is up to the compiler/language/system to determine how mutual exclusion is implemented.

A process using a monitor:

  • enters the monitor,
  • executes the critical section, and
  • leaves the monitor.

CH7: Synchronization Examples

The Dining Philosophers Problem

The Dining Philosophers Problem is an interesting and widely studied problem in computer science, particularly in the realm of concurrency and resource allocation.

Understanding the Situation

The problem is set around five philosophers who alternate between thinking and eating. They share a circular table, each with their own designated chair. In the center of the table lies a bowl of rice. Crucially, there are five single chopsticks placed on the table. The philosophers, when deep in thought, do not interact with each other.

philosophers problem

The Rules of Engagement

  • Eating and Thinking: Philosophers occasionally become hungry and attempt to pick up the two chopsticks closest to them - one between them and their left neighbor, and the other between them and their right neighbor.

  • Chopstick Handling: Each philosopher can only pick up one chopstick at a time and cannot use a chopstick that's already in the hand of another philosopher.

  • Eating Process: A philosopher eats only when they hold both chopsticks at the same time, and upon finishing, they put down both chopsticks and resume thinking.

The Technical Framework

From a programming perspective, the shared data is represented as semaphore chopstick[5], with all semaphore elements initially set to 1. A philosopher picks up a chopstick by performing a wait() operation on that semaphore and releases the chopsticks using a signal() operation.

The routine of each philosopher can be described in the following pseudo C-code:

while (true) {
    wait(chopstick[i]);
    wait(chopstick[(i+1) % 5]);

    /* eat for a while */

    signal(chopstick[i]);
    signal(chopstick[(i+1) % 5]);

    /* think for awhile */
}

The Problem: Deadlock

The setup, as described, can lead to a deadlock. Imagine a scenario where all five philosophers become hungry simultaneously and each grabs their left chopstick. All semaphore elements (chopstick) would then be 0. As each philosopher tries to grab their right chopstick, they find themselves unable to proceed, leading to a deadlock where no philosopher can eat.

Solutions to Avoid Deadlock

  1. Limiting the Number of Philosophers: One approach is to allow no more than four philosophers at the table at any given time.

  2. Critical Section for Chopsticks: Another solution is to allow a philosopher to pick up both chopsticks only if both are available, necessitating picking them up in a critical section.

  3. Asymmetric Chopstick Pickup: A third solution proposes an asymmetric method: odd-numbered philosophers first pick up their left chopstick, then their right, while even-numbered philosophers do the opposite.
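
As a rough sketch of the third solution (asymmetric pickup), and assuming chopstick[i] denotes the chopstick to the left of philosopher i, the loop of philosopher i could be written as:

/* asymmetric pickup for philosopher i (0 <= i <= 4) */
while (true) {
    if (i % 2 == 1) {                    /* odd philosopher: left chopstick first */
        wait(chopstick[i]);
        wait(chopstick[(i+1) % 5]);
    } else {                             /* even philosopher: right chopstick first */
        wait(chopstick[(i+1) % 5]);
        wait(chopstick[i]);
    }

    /* eat for a while */

    signal(chopstick[i]);
    signal(chopstick[(i+1) % 5]);

    /* think for a while */
}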

In summary, the Dining Philosophers Problem is more than a theoretical conundrum. It serves as a valuable model for understanding complex issues in concurrent programming and resource allocation, offering insights into solving similar real-world problems.

POSIX Mutex Locks

Pthreads uses the pthread_mutex_t data type for mutex locks. A mutex is created with the pthread_mutex_init() function. The first parameter is the address of the mutex variable. By passing NULL as the second parameter, we initialize the mutex with its default attributes.

#include <pthread.h>

pthread_mutex_t mutex;

/* create and initialize the mutex lock */
pthread_mutex_init(&mutex, NULL);

The mutex is acquired and released with the pthread_mutex_lock() and pthread_mutex_unlock() functions. If the mutex lock is unavailable when pthread_mutex_lock() is invoked, the calling thread is blocked until the owner invokes pthread_mutex_unlock().

/* acquire the mutex lock */
pthread_mutex_lock(&mutex);

/* critical section */

/* release the mutex lock */
pthread_mutex_unlock(&mutex);

When a mutex is no longer needed, pthread_mutex_destroy() can be used to eliminate it.

pthread_mutex_destroy(&mutex);

All pthread mutex functions return a value of 0 in the case of correct operation; if an error occurs, these functions return a non-zero error code.

To compile a program involving pthread mutex lock, you can use the following compiling command:

gcc filename.c -lpthread

POSIX Semaphores

Semaphores are not a part of the POSIX standard and instead belong to the POSIX SEM extension. POSIX specifies two types of semaphores: named and unnamed.

POSIX Named Semaphores

The function sem_open() is used to create and open a POSIX named semaphore:

#include <fcntl.h>      /* for the O_CREAT flag */
#include <semaphore.h>

sem_t *sem;

/* create the semaphore and initialize it to 1 */
sem = sem_open("SEM", O_CREAT, 0666, 1);

  • We are naming the semaphore SEM.
  • The O_CREAT flag indicates that the semaphore will be created if it does not exist.
  • Additionally, the semaphore can be accessed via read and write (via the parameter 0666) and is initialized to 1.

The advantage of named semaphores is that multiple unrelated processes can easily use a common semaphore as a synchronization mechanism by simply referring to the semaphore's name.

In POSIX, the wait() and signal() semaphore operations are implemented as sem_wait() and sem_post(), respectively.

/* acquire the semaphore */
sem_wait(sem);

/* critical section */

/* release the semaphore */
sem_post(sem);

To destroy a named semaphore, you need to close the semaphore and thereafter unlink it:

sem_close(sem);
sem_unlink("SEM");

POSIX Unnamed Semaphores

An unnamed semaphore is created and initialized using the sem_init() function, which involves three parameters:

  • The address of the semaphore variable
  • A flag indicating the level of sharing
  • The semaphore's initial value.

#include <semaphore.h>
sem_t sem;

/* create the semaphore and initialize it to 1 */
sem_init(&sem, 0, 1);

By passing 0 as the second parameter, we indicate that this semaphore can be shared only by threads belonging to the process that created it; this is the most common usage in C programs involving semaphores. If we supplied a nonzero value, we could allow the semaphore to be shared between separate processes by placing it in a region of shared memory.

We acquire and release in the same way:

/* acquire the semaphore */
sem_wait(&sem);

/* critical section */

/* release the semaphore */
sem_post(&sem);

We can destroy the semaphore using the following:

sem_destroy(&sem);

Compiling For Semaphores

When you compile a program involving POSIX semaphore, you need to use the following command:

gcc filename.c -lpthread

Note: if you are using a glibc version older than 2.17, you also need to add -lrt to link the real-time library.

CH8: Deadlocks

A deadlock is a situation in which every process in a set of processes is waiting for an event that can be caused only by another process in the set. Under normal operation, a process may use a resource only in the following sequence:

  1. Request: The process requests the resource. If the request cannot be granted immediately (for example, if a mutex lock is currently held by another process), then the requesting process must wait till it can acquire the resource.

  2. Use: The process can perform operations on the resource (for example, if the resource is a mutex lock, the process can access its critical section).

  3. Release: The process releases the resource.

Characterization

Deadlock could arise if four conditions are satisfied simultaneously.

  • Mutual Exclusion: Only one process can use a resource at a time.

  • Hold and Wait: A process holding at least one resource is waiting to acquire additional resources held by other processes.

  • No Preemption: A resource can only be released voluntarily by the process holding it, after that process has completed its task.

  • Circular Wait: There exists a set of waiting processes \(\{P_0, P_1, \dots, P_n\}\) such that \(P_0\) is waiting for a resource held by \(P_1\), \(P_1\) is waiting for a resource held by \(P_2\), ..., \(P_{n-1}\) is waiting for a resource held by \(P_n\), and \(P_n\) is waiting for a resource held by \(P_0\).

Deadlocks can be described more precisely using a directed graph called a resource-allocation graph.

This graph consists of a set of vertices \(V\) and a set of edges \(E\).

  • \(V\) is partitioned into two subsets:

    • \(P = \{P_1, \dots, P_n\}\): the set consisting of all the active processes in the system.
    • \(R = \{R_1, \dots, R_m\}\): the set consisting of all resource types in the system.
  • In \(E\), there are two types of edges:

    • Request edge: A directed edge \(P_i \to R_j\), which indicates that process \(P_i\) has requested an instance of resource type \(R_j\) and is currently waiting for that resource.
    • Assignment edge: A directed edge \(R_j \to P_i\), which indicates that an instance of resource type \(R_j\) has been allocated to process \(P_i\).

simple deadlock situation

Given the definition of a resource-allocation graph, it can be shown that, if the graph contains no cycles, then no process in the system is deadlocked.

If the graph does contain a cycle, then a deadlock may exist.

  • If each resource type has exactly one instance, then a cycle implies that a deadlock has occurred.

    • If the cycle involves a set of resource types, each of which has only a single instance, then a deadlock has occurred.
    • Each process involved in the cycle is deadlocked.
    • In this case, a cycle in the graph is both a necessary and a sufficient condition for the existence of deadlock.
  • If each resource type has several instances, then a cycle does not necessarily imply that a deadlock has occurred.

    • In this case, a cycle in the graph is a necessary but not a sufficient condition for the existence of deadlock.

deadlock example 2

Solutions

Generally speaking, we can deal with the deadlock problem using one of the following three methods:

  1. Method 1: We can ignore the problem altogether and pretend that deadlocks never occur in the system. If deadlocks occur rarely enough, it may be cheaper to ignore them than to handle them.

  2. Method 2: We can use a protocol to prevent or avoid deadlocks, ensuring that the system will never enter a deadlock state.

  3. Method 3: We can allow the system to enter a deadlocked state, detect it, and recover.

Prevention

By ensuring that at least one of the four conditions for deadlocks cannot hold, we can prevent the occurrence of a deadlock.

Mutual exclusion: Mutual exclusion is not required for shareable resources. It must hold for non-shareable resources. In general, however, we cannot prevent deadlocks by denying the mutual-exclusion condition because some resources are intrinsically non-shareable.

Hold and Wait: To ensure that hold-and-wait never occurs, we must guarantee that whenever a process requests a resource, it does not hold any other resources.

  • All-then-execute: Require process to request and be allowed all its resources before it begins execution.

  • None-then-request: Allow process to request resources only when the process has none allocated to it. A process may request some resources and use them. Before it can request any additional resources, it must release all the resources that it is currently allocated.

This suffers from two disadvantages:

  • Low resource utilization: resources may be allocated but unused for a long period. For example, a process may be allocated a mutex lock for its entire execution, yet only require it for a short duration.

  • Starvation: A process that needs several popular resources may have to wait indefinitely because at least one of the resources that it needs is always allocated to some other process.

No Preemption: To ensure that this condition does not hold, we can use the following protocol:

  • If a process that is holding some resources requests another resource that cannot be immediately allocated to it, then all resources currently being held are released.

  • Preempted resources are added to the list of resources for which the process is waiting.

  • Process will be restarted only when it can regain its old resources, as well as the new ones that it is requesting.

Circular Wait: Require that each process requests resources in an increasing order of enumeration. That is, a process can initially request an instance of a resource, say \(R_i\). After that, the process can request an instance of resource \(R_j\) if and only if \(R_j\) has a higher number than \(R_i\).

Avoidance

Deadlock avoidance involves requiring additional information on how resources will be requested.

The simplest and most useful model requires that each process declares the maximum number of instances of each resource type that may be needed.

In this scenario, a deadlock-avoidance algorithm dynamically examines the resource allocation state to ensure that a circular-wait condition can never exist.

When a process requests an available resource, the system must decide if allocating the resource immediately leaves the system in a safe state. A system is in a safe state only if there exists a safe sequence.

A sequence of processes, \(P_1, \dots, P_n\) is a safe sequence for the current resource allocation state if, for each \(P_i\), the resource requests that \(P_i\) might make can be satisfied by the currently available resources plus the resources held by all \(P_j\) with \(j < i\).

There are two deadlock-avoidance algorithms. For systems with a single instance of each resource type, we use the resource-allocation-graph algorithm. For systems with multiple instances of each resource type, we use the Banker's algorithm.
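
The Banker's algorithm is not covered in detail here, but its core safety check can be sketched in C as follows (hypothetical fixed sizes N and M; Need is assumed to be Max minus Allocation):

#include <stdbool.h>

#define N 5   /* number of processes (hypothetical) */
#define M 3   /* number of resource types (hypothetical) */

/* Returns true if a safe sequence exists for the given state. */
bool is_safe(int available[M], int allocation[N][M], int need[N][M])
{
    int work[M];
    bool finish[N] = {false};

    for (int j = 0; j < M; j++)
        work[j] = available[j];

    for (int count = 0; count < N; count++) {
        bool found = false;
        for (int i = 0; i < N && !found; i++) {
            if (finish[i])
                continue;
            bool can_run = true;
            for (int j = 0; j < M; j++)
                if (need[i][j] > work[j])
                    can_run = false;
            if (can_run) {
                /* P_i can finish; it then returns its allocated resources */
                for (int j = 0; j < M; j++)
                    work[j] += allocation[i][j];
                finish[i] = true;
                found = true;
            }
        }
        if (!found)
            return false;   /* no runnable process left -> unsafe state */
    }
    return true;            /* all processes can finish in some order */
}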

Resource Allocation Graph Algorithm

We introduce a new claim edge \(P_i \to R_j\), which indicates that process \(P_i\) may request resource \(R_j\) at some time in the future. A claim edge is denoted by a dashed line, and it is converted to a request edge when the process actually requests the resource.

With this algorithm, resources must be claimed in advance in the system.

Now, if a process requests a resource, we check whether granting the request could form a cycle. If it could, we do not allow the process to acquire the resource even if it is available, because a cycle would put the system into an unsafe state. We only grant the request once there is no longer the possibility of a cycle.

Detection

If each type of resource has only a single instance, then we can define a deadlock-detection algorithm that uses a variant of the resource-allocation graph, called a wait-for-graph.

We can obtain this graph from the resource-allocation graph by removing the resource nodes and collapsing the appropriate edges. More precisely, an edge from \(P_i\) to \(P_j\) in a wait-for graph implies that process \(P_i\) is waiting for process \(P_j\) to release a resource that \(P_i\) needs. An edge \(P_i \to P_j\) exists in a wait-for graph if and only if the corresponding resource-allocation graph contains two edges \(P_i \to R_q\) and \(R_q \to P_j\) for some resource \(R_q\).

Here is an example of a resource-allocation graph being converted to a wait-for graph:

example of wait-for-graph

As before, a deadlock exists in the system if and only if the wait-for graph contains a cycle. To detect deadlocks, the system needs to maintain the wait-for graph and periodically invoke an algorithm that searches for a cycle in the graph.
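
A minimal sketch of such a cycle search in C, using a depth-first search over an adjacency-matrix representation of the wait-for graph (the size N is a hypothetical constant):

#include <stdbool.h>

#define N 4   /* number of processes (hypothetical) */

/* wait_for[i][j] == true means P_i is waiting for P_j */
static bool dfs(bool wait_for[N][N], int i, bool visited[N], bool on_stack[N])
{
    visited[i] = true;
    on_stack[i] = true;
    for (int j = 0; j < N; j++) {
        if (!wait_for[i][j])
            continue;
        if (on_stack[j])                 /* back edge -> cycle -> deadlock */
            return true;
        if (!visited[j] && dfs(wait_for, j, visited, on_stack))
            return true;
    }
    on_stack[i] = false;
    return false;
}

bool has_deadlock(bool wait_for[N][N])
{
    bool visited[N] = {false}, on_stack[N] = {false};
    for (int i = 0; i < N; i++)
        if (!visited[i] && dfs(wait_for, i, visited, on_stack))
            return true;
    return false;
}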

Recovery

We have two options when we detect a deadlock in the system:

  1. Aborting Processes: the first option is simply to abort one or more processes to break the deadlock.

  2. Resource Preemption: The other is to preempt some resources from one or more of the deadlocked processes.

To eliminate deadlocks by aborting processes, we use one of two methods:

  • Abort all deadlocked processes. This method clearly will break the deadlock cycle, but at a great expense. The deadlocked processes may have computed for a long time, and the results of these partial computations must be discarded and probably will have to be recomputed later.

  • Abort one process at a time until the deadlock cycle is eliminated. This method incurs considerable overhead, since after each process is aborted, a deadlock-detection algorithm must be invoked to determine whether any processes are still deadlocked.

Many factors are involved in choosing which process to terminate - we try to abort the process whose termination incurs the minimum cost:

  • What the priority of the process is.
  • How long the process has computed and how much longer the process will compute.
  • How many and what types of resources the process has used.
  • How many more resources the process needs in order to complete.

To eliminate deadlocks by resource preemption, we successively preempt some resources from processes and give these resources to other processes until the deadlock cycle is broken.

If preemption is required to deal with deadlocks, then three issues need to be addressed:

  1. Selecting a victim: which resources and which processes are to be preempted? As in process termination, we must determine the order of preemption to minimize cost.

  2. Rollback: If we preempt a resource from a process, what should be done with that process? The simplest solution is a total rollback: abort the process and restart it.

  3. Starvation: How do we ensure that starvation will not occur? That is, how can we guarantee that resources will not always be preempted from the same process?

CH9: Main Memory

There are a couple of rules that the operating system enforces on the processes to ensure the proper functioning of the system.

  • The memory allocated to operating system should not be accessed by user processes.

  • The memory allocated to one user process should not be accessed by another user process.

One solution to this problem is to have a separate memory space for each process. To separate memory space, we need to have a range of legal addresses that a process may access and to ensure that the process can access only its legal addresses. We can provide this protection using two registers, usually a base register and a limit register.

memory allocation range

Logical VS Physical Address

Logical Address: The address from the perspective of the user program. It is also known as virtual address. The corresponding memory space is called logical memory space. A user program only uses logical addresses and thinks that the corresponding process runs in a memory space with the address range of 0 to max.

Physical Address: The real address of each storage unit in main memory. The corresponding memory space is called the physical memory space. Logical addresses must be mapped to physical addresses before the data in main memory can be accessed. We use a relocation register (the same as the base register) to accomplish the mapping.

logical being converted to physical memory figure

Contiguous Memory Allocation

With contiguous memory allocation, each process is contained in a single section of memory, which is typically next to the section containing the next process.

One of the simplest methods of allocating memory is to assign variable-sized sections of memory to processes. With this variable-section scheme, the operating system keeps a table indicating which parts of memory are available and which are occupied.

A hole in memory is just a region of memory that is not used.

holes in memory allocation

When a process is loaded into memory, the OS needs to find a suitable hole that is large enough for the process.

Dynamic Storage Allocation: It is a memory allocation method used to satisfy a request of size \(n\) with a list of free holes.

  • First fit: Allocate the first hole that is big enough.
  • Best fit: Allocate the smallest hole that is big enough.
  • Worst fit: Allocate the largest hole.

Simulations have shown that both first fit and best fit are better than worst fit in terms of memory utilization. Neither first fit nor best fit is clearly better than the other in terms of utilization, but first fit is generally faster.
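
For illustration, a first-fit search over a simple linked list of free holes might look like the sketch below (hypothetical struct hole; a real allocator would also split the chosen hole and merge adjacent holes on release):

#include <stddef.h>

struct hole {
    size_t start;        /* starting address of the free region */
    size_t size;         /* size of the free region in bytes */
    struct hole *next;
};

/* First fit: return the first hole large enough for the request,
 * or NULL if no single hole can satisfy it. */
struct hole *first_fit(struct hole *free_list, size_t request)
{
    for (struct hole *h = free_list; h != NULL; h = h->next)
        if (h->size >= request)
            return h;
    return NULL;
}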

External Fragmentation: This is when there is enough total memory space to satisfy a request, but the available spaces are not contiguous.

Internal Fragmentation: If a system employs fixed-size allocation, then a process that requests 40 KB but is given a 64 KB block suffers from internal fragmentation: the unused 24 KB inside the allocated block is wasted.

Paging

Paging is a memory-management scheme that permits a process's physical address space to be non-contiguous. Paging avoids the external fragmentation problem.

The basic method of implementing paging involves:

  • Breaking physical memory into fixed-sized blocks called frames.
  • Breaking logical memory into same size blocks called pages.

Then we map each page to a frame.

When a process is to be executed, its pages are loaded into available memory frames. The page number is used as an index into a per-process page table.

paging example

Every logical address can be divided into two parts: a page number (p) and a page offset (d).

page number and page offset division in address

To translate a logical address to a physical address:

  1. Extract the page number \(p\) and use it as an index into the page table.
  2. Extract the corresponding frame number \(f\) from the page table.
  3. Replace the page number \(p\) in the logical address with the frame number \(f\).

As the offset \(d\) does not change, it is not replaced, and the frame number and offset now form the physical address.

The page size is defined by the hardware. It is usually determined by the processor architecture. The size of a page is a power of 2 bytes, typically varying between 4KB and 1 GB per page, depending on the computer architecture.

If the size of the logical address space is \(2^m\) bytes, the page size is \(2^n\) bytes, and the computer is byte-addressable, then the high-order \(m - n\) bits of a logical address give the page number, and the \(n\) low-order bits give the page offset.
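
For instance, with a hypothetical 32-bit logical address and 4 KB pages (\(n = 12\)), the page number and offset can be extracted with a shift and a mask, as in this sketch:

#include <stdio.h>

#define PAGE_BITS 12                        /* 4 KB pages: 2^12 bytes */
#define OFFSET_MASK ((1u << PAGE_BITS) - 1)

int main(void)
{
    unsigned int logical = 0x00012ABC;            /* hypothetical logical address */
    unsigned int page    = logical >> PAGE_BITS;  /* high-order m - n bits */
    unsigned int offset  = logical & OFFSET_MASK; /* low-order n bits */

    /* hypothetical page-table lookup: assume page 0x12 maps to frame 0x7 */
    unsigned int frame    = 0x7;
    unsigned int physical = (frame << PAGE_BITS) | offset;

    printf("page=0x%x offset=0x%x physical=0x%x\n", page, offset, physical);
    return 0;
}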

page number and offset division 2

Conversion from logical address to physical address example:

conversion from logical address to physical address

When using a paging scheme, we have no external fragmentation: any free frame can be allocated to a process that needs it. However, we may have some internal fragmentation.

The allocation information is generally kept in a single system-wide data structure called a frame table.

Most modern computer systems support a large logical address space. In such environments, the page table itself becomes very large.

Hierarchical Paging

To solve the problem of page tables becoming very large, we use a two-level paging algorithm, in which the page table itself is also paged.

Swapping

We can use some external storage, such as a hard drive or SSD, as additional "pretend" memory. After a process is loaded into memory, it can be temporarily swapped out of memory to a backing store and then brought back into memory for continued execution.

Swapping enables the possibility that the total address space of all processes exceeds the real physical memory of the system.

Standard Swapping involves moving an entire process between main memory and a backing store. When a process is swapped to the backing store, the operating system must also maintain metadata for processes that have been swapped out, so they can be restored when they are swapped back into memory.

This method was used in traditional UNIX systems, but it is generally no longer used in contemporary operating systems, because the amount of time required to move an entire process between memory and the backing store is prohibitive.

Paged Swapping: involves swapping pages of a process rather than the entire process. This strategy still allows physical memory to be oversubscribed, but does not incur the cost of swapping entire processes.

In fact, the term "swapping" now generally refers to "standard swapping", and "paging" refers to "swapping with paging".

A page out operation moves a page from memory to the backing store and the reverse is known as a page in.

swapping with paging illustration

CH10: Virtual Memory

Demand Paging

When a program is loaded into memory, two methods could be used.

  1. Method 1: Load the entire program in physical memory at program execution time. However, we may not initially need the entire program in memory.

  2. Method 2: Load pages only when they are needed. This technique is known as demand paging and is commonly used in virtual memory systems. With demand-paged virtual memory, pages are loaded only when they are demanded during program execution.

With demand paging, while a process is executing, some pages will be in memory, and some will be in secondary storage. Thus, we need some form of hardware support to distinguish these two cases.

We can add a valid-invalid bit to each page-table entry. When the bit is set to valid, the corresponding page is in memory. When the bit is set to invalid, the page is currently in secondary storage.

demand paging figure

When a process tries to access a page that has not been brought into memory, it will cause a page fault, resulting in a trap to the OS. Here is the procedure used to access physical memory when a page fault is generated:

  1. Extract the address from the current instruction.
  2. Use the page table to check whether the corresponding page has been loaded. If the valid-invalid bit is 'invalid', a trap is generated.
  3. Find a free frame in physical memory.
  4. Move the desired page into the newly allocated frame.
  5. Modify the page table to indicate that the page is now in memory.
  6. Restart the instruction that was interrupted by the trap. The process can now access the page as if it had always been in memory.

Demand paging can significantly affect the performance of a computer system.

Assume the memory access time, denoted as \(ma\), is \(200\) nanoseconds.

Let \(p\) be the probability of a page fault. We would expect \(p\) to be close to zero. That is, we could expect to have only a few page faults.

The effective access time can be calculated using the following equation:

\[ \text{effective access time} = (1-p) \times ma + p \times \text{page fault time} \]
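
For example (hypothetical numbers), if the page-fault service time is 8 milliseconds and one access in 1,000 causes a page fault (\(p = 0.001\)), then

\[ \text{effective access time} = 0.999 \times 200 \text{ ns} + 0.001 \times 8{,}000{,}000 \text{ ns} \approx 8{,}200 \text{ ns} \]

That is, memory access is slowed down by a factor of roughly 40 by what looks like a tiny page-fault rate.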

Page Replacement

In the case that a page fault occurs and there is no free frame, the OS has to use one of the "page replacement" algorithms to

  1. Find a victim frame
  2. Move the victim frame to secondary storage
  3. Move the page that caused the page fault into the freed frame.

Note that in this case, two page transfers are required. This situation effectively doubles the page-fault service time and increases the effective access time accordingly.

We can evaluate an algorithm by running it on a particular series of memory references and computing the number of page faults. Such a series of memory references is called a reference string. We can generate reference strings artificially, or we can trace a given system and record the address of each memory reference.

FIFO Page Replacement

A FIFO replacement algorithm associates with each page the time when the page was brought into memory. When a page must be replaced, the oldest page is chosen.
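
A minimal sketch in C that counts page faults under FIFO replacement with a fixed number of frames; the reference string and frame count are hypothetical:

#include <stdio.h>

#define FRAMES 3

int main(void)
{
    int ref[] = {7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2};  /* hypothetical reference string */
    int n = sizeof(ref) / sizeof(ref[0]);
    int frame[FRAMES];
    int next = 0, faults = 0;

    for (int i = 0; i < FRAMES; i++)
        frame[i] = -1;                     /* -1 means the frame is empty */

    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int j = 0; j < FRAMES; j++)
            if (frame[j] == ref[i])
                hit = 1;
        if (!hit) {                        /* page fault: replace the oldest page */
            frame[next] = ref[i];
            next = (next + 1) % FRAMES;
            faults++;
        }
    }

    printf("Page faults: %d\n", faults);
    return 0;
}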

OPT Page Replacement

The optimal page replacement algorithm (OPT) should replace the page that will not be used for the longest period of time. Use of this page-replacement algorithm guarantees the lowest possible page-fault probability for a fixed number of frames.

Unfortunately, the optimal page replacement algorithm is difficult to implement because it requires knowledge about the future (how the memory will be accessed). As a result, the optimal algorithm is used mainly for comparison purposes.

LRU Page Replacement

The Least Recently Used (LRU) replacement algorithm associates with each page the time of that page's last use. When a page must be replaced, LRU chooses the page that has not been used for the longest period of time.

Allocation of Frames

The number of frames that should be allocated to a program initially depends on a couple of factors. There are a few options:

Equal Allocation: The easiest way to allocate \(m\) frames to \(n\) processes is to give each one an equal share, \(m/n\) frames.

Proportional Allocation: The available memory is allocated to each process according to its size.

Let the size of the virtual memory for process \(p_i\) be \(s_i\), and define \(S = \sum s_i\). If the total number of available frames is \(m\), we allocate \(a_i\) frames to process \(p_i\), where \(a_i\) is approximately \((s_i/S) * m\).
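
For example (hypothetical numbers), with \(m = 62\) free frames and two processes of sizes \(s_1 = 10\) pages and \(s_2 = 127\) pages, so that \(S = 137\), proportional allocation gives

\[ a_1 = \frac{10}{137} \times 62 \approx 4, \qquad a_2 = \frac{127}{137} \times 62 \approx 57 \]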

CH11: Mass Storage Systems

Secondary storage for modern computers is provided by hard disk drives (HDD) or nonvolatile memory (NVM) devices.

Hard Disk Drives

  • Each drive has multiple platters.
  • Each platter looks like a CD.
  • Two surfaces of a platter are covered with a magnetic material.
  • We store information by recording it magnetically on the platters.
  • We read information by detecting the magnetic pattern on the platters.

hard-drive image

The surface of a platter is logically divided into circular tracks, which are subdivided into sectors.

Each sector has a fixed size and is the smallest unit of transfer. The sector size was commonly 512 bytes until around 2010. At that point manufacturers started migrating to 4KB sectors.

The set of tracks at a given arm position makes up a cylinder. The disk drive motor spins the platters at high speed.

There are a number of parameters that determine how fast the data on HDDs can be accessed.

  • Seek Time is the time that it takes a disk arm to position itself over the required track.
  • Rotational Delay is the time that it takes the desired sector to position itself under the read/write head.
  • Access Time = Seek Time + Rotational Delay
  • Transfer Time = Access Time + Time To Read Data
  • Transfer Rate gives us the rate at which data can be read from the disk.
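
For instance (hypothetical drive), the average rotational delay is the time for half a revolution, so for a 7,200 RPM disk

\[ \text{average rotational delay} = \frac{1}{2} \times \frac{60}{7200} \text{ s} \approx 4.17 \text{ ms} \]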

Nonvolatile Memory Devices (NVM)

Nonvolatile memory devices are normally based on flash memory technology.

  • Solid State Disk: It is frequently used in a disk-drive-like container, leading to SSD.
  • USB Drive: It can be embedded in a device with a USB interface, leading to USB drive.
  • Surface-mounted Storage: It is also surface-mounted onto motherboards as the main storage in devices like smartphones.

On the positive side:

  • NVM devices can be more reliable than HDDs because they have no moving parts.
  • NVM devices can be faster than HDDs because they do not use a mechanical read/write head to read and write data.
  • In addition, they consume less power.

On the negative side:

  • They are more expensive per megabyte than traditional hard disks and have less capacity than the larger hard disks.
  • NVM devices deteriorate with every write, and stop working after approximately 100,000 writes.

Connection Methods

A secondary storage device is attached to a computer by the system bus or an I/O bus. Several kinds of buses are available, including: Advanced Technology Attachment (ATA), Serial ATA (SATA), eSATA, Serial Attached SCSI (SAS), Universal Serial Bus (USB), and Fibre Channel (FC).

Because NVM devices are much faster than HDDs the industry created a special, fast interface for NVM devices called NVM express (NVMe).

Address Mapping

Storage devices are addressed as large one-dimensional arrays of logical blocks, where the logical block is the smallest unit of transfer.

For HDD, each logical block is mapped to a physical sector. By using this mapping on an HDD, we can convert a logical block number into an old-style disk address that consists of a cylinder number, a track number within that cylinder, and a sector number within that track.

For NVM devices, mapping can be done in a similar manner.

Scheduling

We can use one of the HDD scheduling algorithms to minimize access time and maximize transfer rate.

FCFS Scheduling

The simplest form of disk scheduling is FCFS. This algorithm is intrinsically fair, but generally does not provide the fastest service.

FCFS disk scheduling

SCAN Scheduling

With the SCAN algorithm the disk arm starts at one end of the disk and moves toward the other end, servicing requests as it reaches each cylinder, until it gets to the other end of the disk. At the other end, the direction of the head movement is reversed, and the servicing continues.

The SCAN algorithm is sometimes called the elevator algorithm, since the disk arm behaves like an elevator in a building, first servicing all the requests going up and then reversing to service requests the other way.

SCAN scheduling

C-SCAN Scheduling

Circular SCAN (C-SCAN) scheduling is a variant of SCAN designed to provide a more uniform wait time. When the head reaches the other end of the disk, it immediately returns to the beginning of the disk without servicing any requests on the return trip, and then starts a new scan. The motivation is that, after a full scan in one direction, the heaviest density of pending requests is at the other end of the disk.

C-SCAN scheduling algorithm

NVM Scheduling

The disk-scheduling algorithms apply to mechanical platter-based storage like HDDs. The algorithms focus primarily on minimizing the amount of movement of the disk head.

NVM devices do not contain moving disk heads and commonly use a simple FCFS policy.

As we have seen, I/O can occur sequentially or randomly.

  • Sequential access is good for mechanical devices like HDD and tape because the data to be read or written is close to the read/write head.

  • Random-access I/O, which is measured in input/output operations per second (IOPS), causes more HDD disk movement. Naturally, random access I/O is much faster on NVM.
