user warning: Duplicate entry '827686' for key 1
query: INSERT INTO dr_accesslog (title, path, url, hostname, uid, sid, timer, timestamp) values('Processor Utilization is Meaningless', 'node/73', '', '38.107.191.86', 0, '6embtud40on8btpkpfo7jdofd4', 244, 1268788638) in /usr/local/apache/htdocs/includes/database.mysql.inc on line 172.
Fri, 08/21/2009 - 12:41 by Chad Mynhier
I've written before about achieving a 300x performance improvement. While that sounds impressive, it was effectively in a lab situation, so it doesn't really count. But now I've done even better than that, and in this case, it was a real-world improvement.
The performance problem was seen in the system utilities ps(1) andprstat(1M). For example, on the server in question, “ps -ef” took a little over a full minute to run:
# ptime ps -ef > /dev/null
real 1:04.587
user 0.046
sys 1:04.494
This was obviously a noticeable problem, and it impacted the customer's ability to monitor and investigate problems on the system. What little output was available from prstat(1M) showed these two utilities spending almost 100% of their time in the kernel. This means that the impact went beyond simply having to wait for ps(1) or prstat(1M) to return data. Each instance of one of these commands was monopolizing a CPU, leaving less overhead in system capacity.
Kernel stack profiling showed the most common stack traces being similar to this:
genunix`rm_assize+0xa0
procfs`prgetpsinfo+0x3d0
procfs`pr_read_psinfo+0x38
genunix`fop_read+0x20
genunix`read+0x274
unix`syscall_trap+0xac
What this stack trace shows is that the process (ps(1) or prstat(1M)) is reading /proc/<pid>/psinfo for some other process on the system. It's doing this to get all the information you see in ps(1) output, such as the user ID, group ID, virtual size, resident set size, etc., of the process. The stack trace also shows that the bulk of that time is being spent in the kernel function rm_assize(). Some code-digging showed that this function is used to calculate the virtual size of a process, and it showed that this was being done in an inefficient manner.
The calculation involved getting the size of every segment in the address space for a process and adding them up. Even though the information for these segments is stored in a data structure that is very efficient for insertions, deletions, and lookups, that efficiency doesn't come into play when you need to look at every segment. At this point, you're no better off with the efficient data structure than you are with a linked list. This doesn't make a big difference for most processes, because they tend to have on the order of 100 address space segments. For processes with a large number of segments, however, the time it takes to get the size for every segment starts to add up. The application running on the server in question tended to have on the order of 100,000 segments, and there were hundreds of processes running.
In addition to being inefficient, the calculation was being performed on-the-fly every time the information was requested. If no address space had changed since the last time, CPU cycles were being burned just to get the same result we'd previously gotten.
The solution to the problem uses the fact that we know exactly when the address space for a process changes, and by exactly what amount. Rather than calculating the virtual size of a process on-the-fly, we can simply store that value and make incremental changes to it at these locations. I coded up these changes, and they've been putback into OpenSolaris. We got a preliminary patch with these changes for the customer. After this patch was installed, we saw significant improvement, from about 60 seconds down to about 60 milliseconds:
# ptime ps -ef > /dev/null
real 0.056
user 0.011
sys 0.038
So I achieved a 1000x improvement, or, as Dr. Evil would say: “One hundred thousand percent!”