What Does It Mean When dmesg Says AMPS Blocked for More Than 120 Seconds?

From time to time, 60East receives reports of situations where the AMPS process becomes unresponsive for a period of time without creating a minidump and logging stuck thread warnings that don't seem to have an obvious cause.

Checking the dmesg output, though, shows messages along the lines of the following:

[Tue Nov 29 07:31:47 2022] INFO: task ampServer:728 blocked for more than 120 seconds.

This indicates that the AMPS process has made a system call that has not returned for more than 120 seconds. During this time (which could be longer than 120 seconds), the system as a whole, including AMPS, can become unresponsive.
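To check a saved copy of the dmesg output for these reports, a short script can help. The following is a minimal sketch in Python, assuming the message format shown above; the function name find_hung_tasks and the dictionary keys are illustrative, not part of AMPS or the kernel.

```python
import re

# Matches the hung-task message shown above, with or without a
# bracketed timestamp at the start of the line.
HUNG_TASK = re.compile(
    r"^(?:\[(?P<stamp>[^\]]+)\]\s*)?INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<seconds>\d+) seconds\."
)


def find_hung_tasks(dmesg_text):
    """Return one dict per hung-task report found in the dmesg text."""
    reports = []
    for line in dmesg_text.splitlines():
        match = HUNG_TASK.match(line.strip())
        if match:
            reports.append({
                "stamp": match.group("stamp"),      # None if no timestamp
                "name": match.group("name"),        # task name, e.g. ampServer
                "pid": int(match.group("pid")),
                "seconds": int(match.group("seconds")),
            })
    return reports


sample = (
    "[Tue Nov 29 07:31:47 2022] INFO: task ampServer:728 "
    "blocked for more than 120 seconds."
)
for report in find_hung_tasks(sample):
    print(report["stamp"], report["name"], report["pid"], report["seconds"])
```

In practice, the text passed to the function would be the captured output of dmesg rather than the hard-coded sample line.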

One common reason for a pause like this is misconfiguration of the vm.dirty_ratio and vm.dirty_background_ratio configuration options.

If the amount of unflushed dirty memory exceeds the vm.dirty_ratio setting (a percentage of memory), the system will block I/O until the pages are written to disk. In a system with a relatively large ratio and relatively slow storage, these writes can take an extended period of time. (For example, imagine a system that has 1TB of memory, is connected to a SAN, and has a vm.dirty_ratio of 40: this system can absorb roughly 400GB of dirty pages, but if the server ever reaches a state where there are more dirty pages than that, the system will halt until those pages are flushed.)
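The arithmetic behind the 1TB example can be sketched as follows. This is a rough estimate only: it treats the ratio as a percentage of total memory, and the 500 MB/s write throughput is an assumed figure chosen for illustration, not a measurement of any particular SAN. The function names are hypothetical.

```python
TB = 10**12
GB = 10**9
MB = 10**6


def dirty_limit_bytes(memory_bytes, ratio_percent):
    """Approximate amount of dirty memory allowed before writers block."""
    return memory_bytes * ratio_percent // 100


def flush_seconds(dirty_bytes, write_bytes_per_second):
    """Approximate time needed to write the given amount of dirty data."""
    return dirty_bytes / write_bytes_per_second


# The example from the text: 1TB of memory with vm.dirty_ratio = 40.
limit = dirty_limit_bytes(1 * TB, 40)

# ASSUMPTION: storage sustains 500 MB/s of writes (illustrative only).
stall = flush_seconds(limit, 500 * MB)

print(limit // GB, "GB of dirty pages,", round(stall), "seconds to flush")
```

Under these assumptions the flush takes 800 seconds, well past the 120-second threshold in the dmesg message. Lowering the ratio to 10 would cut the estimate to 200 seconds at the same throughput.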

The vm.dirty_background_ratio should be set to a relatively small value, so that pages begin to be flushed in the background relatively quickly (a ratio of 5, meaning 5%, is fairly common, although on a system with a large amount of memory a smaller value may be reasonable).

The most common approach for the vm.dirty_ratio is to set this somewhat above the vm.dirty_background_ratio, so that if the background synchronization can't keep up, the forced flush happens while the amount of memory involved is still somewhat reasonable, reducing the amount of time that it takes for the system to flush the pages.
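One way to review a host against this guidance is to read the current values from /proc/sys/vm and compare them. The sketch below is a minimal example, not a 60East tool: the function names are hypothetical, and the max_dirty_ratio default of 20 is an arbitrary threshold picked for illustration, since the right ceiling depends on memory size and storage speed.

```python
from pathlib import Path


def read_vm_setting(name, root="/proc/sys/vm"):
    """Read an integer kernel VM setting such as dirty_ratio."""
    return int(Path(root, name).read_text().strip())


def check_dirty_settings(background_ratio, dirty_ratio, max_dirty_ratio=20):
    """Return a list of observations about the two ratios.

    ASSUMPTION: max_dirty_ratio=20 is an illustrative ceiling only.
    """
    if background_ratio == 0 or dirty_ratio == 0:
        # A ratio reads as 0 when the corresponding *_bytes setting is used.
        return ["a ratio reads as 0; vm.dirty_bytes or "
                "vm.dirty_background_bytes may be in use instead"]
    problems = []
    if dirty_ratio <= background_ratio:
        problems.append("vm.dirty_ratio should be above "
                        "vm.dirty_background_ratio")
    if dirty_ratio > max_dirty_ratio:
        problems.append("vm.dirty_ratio is above the chosen ceiling of "
                        f"{max_dirty_ratio}")
    return problems


if __name__ == "__main__":
    try:
        background = read_vm_setting("dirty_background_ratio")
        ratio = read_vm_setting("dirty_ratio")
    except OSError:
        print("could not read /proc/sys/vm (is this a Linux host?)")
    else:
        findings = check_dirty_settings(background, ratio)
        for finding in findings or ["no issues found by this simple check"]:
            print(finding)
```

For example, check_dirty_settings(5, 10) returns an empty list, while check_dirty_settings(5, 40) flags the large vm.dirty_ratio.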
