Documentation/accounting/psi.txt - kernel/common.git - Git at Google

 ================================
 PSI - Pressure Stall Information
 ================================

 :Date: April, 2018
 :Author: Johannes Weiner <hannes@cmpxchg.org>

 When CPU, memory or IO devices are contended, workloads experience
 latency spikes, throughput losses, and run the risk of OOM kills.

 Without an accurate measure of such contention, users are forced to
 either play it safe and under-utilize their hardware resources, or
 roll the dice and frequently suffer the disruptions resulting from
 excessive overcommit.

 The psi feature identifies and quantifies the disruptions caused by
 such resource crunches and the time impact it has on complex workloads
 or even entire systems.

 Having an accurate measure of productivity losses caused by resource
 scarcity aids users in sizing workloads to hardware--or provisioning
 hardware according to workload demand.

 As psi aggregates this information in realtime, systems can be managed
 dynamically using techniques such as load shedding, migrating jobs to
 other systems or data centers, or strategically pausing or killing low
 priority or restartable batch jobs.

 This allows maximizing hardware utilization without sacrificing
 workload health or risking major disruptions such as OOM kills.

 Pressure interface
 ==================

 Pressure information for each resource is exported through the
 respective file in /proc/pressure/ -- cpu, memory, and io.

 The format for CPU is as such:

 some avg10=0.00 avg60=0.00 avg300=0.00 total=0

 and for memory and IO:

 some avg10=0.00 avg60=0.00 avg300=0.00 total=0
 full avg10=0.00 avg60=0.00 avg300=0.00 total=0

 The "some" line indicates the share of time in which at least some
 tasks are stalled on a given resource.

 The "full" line indicates the share of time in which all non-idle
 tasks are stalled on a given resource simultaneously. In this state
 actual CPU cycles are going to waste, and a workload that spends
 extended time in this state is considered to be thrashing. This has
 severe impact on performance, and it's useful to distinguish this
 situation from a state where some tasks are stalled but the CPU is
 still doing productive work. As such, time spent in this subset of the
 stall state is tracked separately and exported in the "full" averages.

 The ratios are tracked as recent trends over ten, sixty, and three
 hundred second windows, which gives insight into short term events as
 well as medium and long term trends. The total absolute stall time is
 tracked and exported as well, to allow detection of latency spikes
 which wouldn't necessarily make a dent in the time averages, or to
 average trends over custom time frames.

 Monitoring for pressure thresholds
 ==================================

 Users can register triggers and use poll() to be woken up when resource
 pressure exceeds certain thresholds.

 A trigger describes the maximum cumulative stall time over a specific
 time window, e.g. 100ms of total stall time within any 500ms window to
 generate a wakeup event.

 To register a trigger user has to open psi interface file under
 /proc/pressure/ representing the resource to be monitored and write the
 desired threshold and time window. The open file descriptor should be
 used to wait for trigger events using select(), poll() or epoll().
 The following format is used:

 <some|full> <stall amount in us> <time window in us>

 For example writing "some 150000 1000000" into /proc/pressure/memory
 would add 150ms threshold for partial memory stall measured within
 1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
 would add 50ms threshold for full io stall measured within 1sec time window.

 Triggers can be set on more than one psi metric and more than one trigger
 for the same psi metric can be specified. However for each trigger a separate
 file descriptor is required to be able to poll it separately from others,
 therefore for each trigger a separate open() syscall should be made even
 when opening the same psi interface file.

 Monitors activate only when system enters stall state for the monitored
 psi metric and deactivates upon exit from the stall state. While system is
 in the stall state psi signal growth is monitored at a rate of 10 times per
 tracking window.

 The kernel accepts window sizes ranging from 500ms to 10s, therefore min
 monitoring update interval is 50ms and max is 1s. Min limit is set to
 prevent overly frequent polling. Max limit is chosen as a high enough number
 after which monitors are most likely not needed and psi averages can be used
 instead.

 When activated, psi monitor stays active for at least the duration of one
 tracking window to avoid repeated activations/deactivations when system is
 bouncing in and out of the stall state.

 Notifications to the userspace are rate-limited to one per tracking window.

 The trigger will de-register when the file descriptor used to define the
 trigger  is closed.

 Userspace monitor usage example
 ===============================

 #include <errno.h>
 #include <fcntl.h>
 #include <stdio.h>
 #include <poll.h>
 #include <string.h>
 #include <unistd.h>

 /*
  * Monitor memory partial stall with 1s tracking window size
  * and 150ms threshold.
  */
 int main() {
 	const char trig[] = "some 150000 1000000";
 	struct pollfd fds;
 	int n;

 	fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
 	if (fds.fd < 0) {
 		printf("/proc/pressure/memory open error: %s\n",
 			strerror(errno));
 		return 1;
 	}
 	fds.events = POLLPRI;

 	if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
 		printf("/proc/pressure/memory write error: %s\n",
 			strerror(errno));
 		return 1;
 	}

 	printf("waiting for events...\n");
 	while (1) {
 		n = poll(&fds, 1, -1);
 		if (n < 0) {
 			printf("poll error: %s\n", strerror(errno));
 			return 1;
 		}
 		if (fds.revents & POLLERR) {
 			printf("got POLLERR, event source is gone\n");
 			return 0;
 		}
 		if (fds.revents & POLLPRI) {
 			printf("event triggered!\n");
 		} else {
 			printf("unknown event received: 0x%x\n", fds.revents);
 			return 1;
 		}
 	}

 	return 0;
 }

 Cgroup2 interface
 =================

 In a system with a CONFIG_CGROUP=y kernel and the cgroup2 filesystem
 mounted, pressure stall information is also tracked for tasks grouped
 into cgroups. Each subdirectory in the cgroupfs mountpoint contains
 cpu.pressure, memory.pressure, and io.pressure files; the format is
 the same as the /proc/pressure/ files.

 Per-cgroup psi monitors can be specified and used the same way as
 system-wide ones.
	================================
	PSI - Pressure Stall Information
	================================

	:Date: April, 2018
	:Author: Johannes Weiner <hannes@cmpxchg.org>

	When CPU, memory or IO devices are contended, workloads experience
	latency spikes, throughput losses, and run the risk of OOM kills.

	Without an accurate measure of such contention, users are forced to
	either play it safe and under-utilize their hardware resources, or
	roll the dice and frequently suffer the disruptions resulting from
	excessive overcommit.

	The psi feature identifies and quantifies the disruptions caused by
	such resource crunches and the time impact it has on complex workloads
	or even entire systems.

	Having an accurate measure of productivity losses caused by resource
	scarcity aids users in sizing workloads to hardware--or provisioning
	hardware according to workload demand.

	As psi aggregates this information in realtime, systems can be managed
	dynamically using techniques such as load shedding, migrating jobs to
	other systems or data centers, or strategically pausing or killing low
	priority or restartable batch jobs.

	This allows maximizing hardware utilization without sacrificing
	workload health or risking major disruptions such as OOM kills.

	Pressure interface
	==================

	Pressure information for each resource is exported through the
	respective file in /proc/pressure/ -- cpu, memory, and io.

	The format for CPU is as such:

	some avg10=0.00 avg60=0.00 avg300=0.00 total=0

	and for memory and IO:

	some avg10=0.00 avg60=0.00 avg300=0.00 total=0
	full avg10=0.00 avg60=0.00 avg300=0.00 total=0

	The "some" line indicates the share of time in which at least some
	tasks are stalled on a given resource.

	The "full" line indicates the share of time in which all non-idle
	tasks are stalled on a given resource simultaneously. In this state
	actual CPU cycles are going to waste, and a workload that spends
	extended time in this state is considered to be thrashing. This has
	severe impact on performance, and it's useful to distinguish this
	situation from a state where some tasks are stalled but the CPU is
	still doing productive work. As such, time spent in this subset of the
	stall state is tracked separately and exported in the "full" averages.

	The ratios are tracked as recent trends over ten, sixty, and three
	hundred second windows, which gives insight into short term events as
	well as medium and long term trends. The total absolute stall time is
	tracked and exported as well, to allow detection of latency spikes
	which wouldn't necessarily make a dent in the time averages, or to
	average trends over custom time frames.

	Monitoring for pressure thresholds
	==================================

	Users can register triggers and use poll() to be woken up when resource
	pressure exceeds certain thresholds.

	A trigger describes the maximum cumulative stall time over a specific
	time window, e.g. 100ms of total stall time within any 500ms window to
	generate a wakeup event.

	To register a trigger user has to open psi interface file under
	/proc/pressure/ representing the resource to be monitored and write the
	desired threshold and time window. The open file descriptor should be
	used to wait for trigger events using select(), poll() or epoll().
	The following format is used:

	<some\|full> <stall amount in us> <time window in us>

	For example writing "some 150000 1000000" into /proc/pressure/memory
	would add 150ms threshold for partial memory stall measured within
	1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
	would add 50ms threshold for full io stall measured within 1sec time window.

	Triggers can be set on more than one psi metric and more than one trigger
	for the same psi metric can be specified. However for each trigger a separate
	file descriptor is required to be able to poll it separately from others,
	therefore for each trigger a separate open() syscall should be made even
	when opening the same psi interface file.

	Monitors activate only when system enters stall state for the monitored
	psi metric and deactivates upon exit from the stall state. While system is
	in the stall state psi signal growth is monitored at a rate of 10 times per
	tracking window.

	The kernel accepts window sizes ranging from 500ms to 10s, therefore min
	monitoring update interval is 50ms and max is 1s. Min limit is set to
	prevent overly frequent polling. Max limit is chosen as a high enough number
	after which monitors are most likely not needed and psi averages can be used
	instead.

	When activated, psi monitor stays active for at least the duration of one
	tracking window to avoid repeated activations/deactivations when system is
	bouncing in and out of the stall state.

	Notifications to the userspace are rate-limited to one per tracking window.

	The trigger will de-register when the file descriptor used to define the
	trigger is closed.

	Userspace monitor usage example
	===============================

	#include <errno.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <poll.h>
	#include <string.h>
	#include <unistd.h>

	/*
	* Monitor memory partial stall with 1s tracking window size
	* and 150ms threshold.
	*/
	int main() {
	const char trig[] = "some 150000 1000000";
	struct pollfd fds;
	int n;

	fds.fd = open("/proc/pressure/memory", O_RDWR \| O_NONBLOCK);
	if (fds.fd < 0) {
	printf("/proc/pressure/memory open error: %s\n",
	strerror(errno));
	return 1;
	}
	fds.events = POLLPRI;

	if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
	printf("/proc/pressure/memory write error: %s\n",
	strerror(errno));
	return 1;
	}

	printf("waiting for events...\n");
	while (1) {
	n = poll(&fds, 1, -1);
	if (n < 0) {
	printf("poll error: %s\n", strerror(errno));
	return 1;
	}
	if (fds.revents & POLLERR) {
	printf("got POLLERR, event source is gone\n");
	return 0;
	}
	if (fds.revents & POLLPRI) {
	printf("event triggered!\n");
	} else {
	printf("unknown event received: 0x%x\n", fds.revents);
	return 1;
	}
	}

	return 0;
	}

	Cgroup2 interface
	=================

	In a system with a CONFIG_CGROUP=y kernel and the cgroup2 filesystem
	mounted, pressure stall information is also tracked for tasks grouped
	into cgroups. Each subdirectory in the cgroupfs mountpoint contains
	cpu.pressure, memory.pressure, and io.pressure files; the format is
	the same as the /proc/pressure/ files.

	Per-cgroup psi monitors can be specified and used the same way as
	system-wide ones.