Wikipedia:Reference desk/Archives/Computing/2016 January 11

= January 11 =

Measuring CPU usage of a process
[I asked a version of this question on stackoverflow but have not received any answers I can use.]

I'm interested in measuring the CPU usage of long-running processes, so that I can make sure they're not using "too much" CPU. Furthermore, I would like to measure usage in units akin to processor cycles, or machine instructions. In particular, I am not interested in merely measuring the percentage of the CPU each process is using.

Measuring the percentage of the CPU a process is using is easy: call something like getrusage, and divide the (user+system) CPU time used by the amount of time that has elapsed. And, conceptually, converting this number to units of processor cycles or machine instructions is also easy: just multiply by the clock frequency or MIPS rating of the CPU.

So the problem reduces to: what's a good way to determine the (perhaps approximate) clock frequency of the CPU in use, or the (perhaps approximate) MIPS rating of the processor in use, in a reasonably portable way? (This is for a process running under Linux, so for now "reasonably portable" can be interpreted as "available on mainstream Linux distros".)

I am not looking for something exact; I understand full well that concepts like clock frequencies and MIPS ratings are notoriously slippery. In a nutshell, my main goal is to arrange that if I've determined that the appropriate CPU usage of program P is, say, 10,000 instructions per second, and if I move it to a machine that has a 2&times; slower processor, as long as it continues to use approximately 10,000 instructions per second there, nothing is wrong (even though its CPU usage in percentage terms will likely double).

I know about PAPI but it feels like overkill for my purposes. Apparently Windows has a QueryProcessCycleTime call which might be exactly what I'm looking for, but I am not running these programs under Windows.

I have considered using the Linux "bogomips" value to do my scaling, but of course bogomips are bogus and should not be used for anything. (Although, in this case, I believe they would do more or less exactly what I want, with quite acceptable accuracy.)

The question obviously gets more complicated for multi-processor machines, and multi-core processors. The question gets even more complicated if the CPU frequency is throttled over time for whatever reason. To keep things simple (and because I'm running on a non-throttled monoprocessor), let's not worry about any of these issues for now.

Thanks for any suggestions. —Steve Summit (talk) 15:45, 11 January 2016 (UTC)


 * There is no good answer - see Megahertz myth. Different cpus will have different performance characteristics for different workloads. In addition to simple clock cycles, the length of the pipeline and the quality of the branch predictor has a major influence. And nowadays, much more often then not the memory subsystem is the bottleneck, so cache size and structure is critical (one reason why BogoMips are bogus, since they don't exercise the memory subsystem to any significant degree). With all these caveats:
 * cpu MHz        : 2593.748
 * I actually used a representative sample workload and benchmarked different CPUs and build a database at one time, so good luck! --Stephan Schulz (talk) 16:38, 11 January 2016 (UTC) (oh my god - I'm giving a 3rd graders answer to a UNIX/C question by Steve Summit ;-).
 * Thanks. I'm probably going to do something with /proc/cpuinfo, but speaking of pipelines, yours might be too short: although as I said my target machine is a monoprocessor, on my development machine grepping cpuinfo like that gives me
 * cpu MHz		: 1200.000
 * cpu MHz		: 1200.000
 * cpu MHz		: 1200.000
 * cpu MHz		: 1200.000
 * <tt>cpu MHz		: 1200.000</tt>
 * <tt>cpu MHz		: 1200.000</tt>
 * <tt>cpu MHz		: 1200.000</tt>
 * <tt>cpu MHz		: 1200.000</tt>
 * So if I go this route, I'll be adding the equivalent of <tt>grep | head -1</tt>, or <tt>grep -m 1</tt>.
 * But I'm surprised that there appear to be no Linux syscalls that deal with processor cycle counts. —Steve Summit (talk) 16:53, 11 January 2016 (UTC) (and worry not about the tenor of your reply: for my part, as James Hogg has said, "It's never too late to learn what 'opsimath' means." :-) )
 * But I'm surprised that there appear to be no Linux syscalls that deal with processor cycle counts. —Steve Summit (talk) 16:53, 11 January 2016 (UTC) (and worry not about the tenor of your reply: for my part, as James Hogg has said, "It's never too late to learn what 'opsimath' means." :-) )


 * (ec) A clarification: Stephan Schulz quite responsibly cited the Megahertz myth, and indeed, most of the time people start fixating on processor clock rates, they're seriously deluding themselves. My issue is that the only "kosher" way of approaching what I'm trying to do would involve simple, relative CPU percentages, which as I hope I've adequately explained, would not give me what I really want.  If I move my program(s) to another processor with a significantly faster or slower processor, I want the numbers to automatically adjust themselves in some more absolute sense.  If I can multiply a relative CPU percentage measurement by some number (any number) that has some kind of roughly linear relationship with the performance of the CPU I'm using, I'll get "absolute" numbers that will be more useful to me than if I were to use relative CPU percentages alone.


 * Oh, and the other thing I meant to mention is that I'm willing to assume that the magic scaling number is in the context of (that is, that I'll only be moving my programs between instances of) a single processor family. I certainly won't expect to be able to move Program P from an Intel x86 to an ARM chip and have my "10,000 instructions/sec" metric preserved. —Steve Summit (talk) 17:08, 11 January 2016 (UTC)


 * If it is your program that you are trying to profile, you can use rdtsc in your source code to get a very good count of cycles being used. Please note that the exact count may change due to process swapping, memory stalls, etc... 199.15.144.250 (talk) 17:03, 11 January 2016 (UTC)


 * Thanks. I was wondering about that.  Our page on rdtsc says that under *nix, it's accessible via <tt>clock_gettime</tt> and <tt>CLOCK_MONOTONIC</tt>, but in my understanding, <tt>clock_gettime</tt> returns values in units of seconds, not counts, so I'd still have a scaling problem. Is there another way to access rdtsc-like functionality (and without resorting to <tt>asm</tt> directives)? —Steve Summit (talk) 17:12, 11 January 2016 (UTC)


 * You need to use asm to run rdtsc directly. Then, you will get the timestamp count, which is a cycle count. Please note that everything else said here is true and the situation is much worse than what has been said. The value you get will likely be invalid for what you are trying to do. If you are trying to reduce cycle count, then reduce the size of your assembly code. If you are trying to reduce runtime, reduce the complexity of your code. If you are trying to reduce CPU usage, alter the niceness/priority of your process. 199.15.144.250 (talk) 17:33, 11 January 2016 (UTC)


 * You don't say why you care about how many instructions are being executed per second? The natural reason in my mind for worrying about things like this is in order to ensure that the program is fast enough to stay responsive to user inputs in real-time.  Is that, or something like it, the goal?  If so, it may be more natural to include timing checkpoints in your code.  In other words, test whether the code runs from point A to point B in less than Z seconds.  If your program is single-threaded and either linear or loop orientated, it shouldn't be too hard to find places in the code to place benchmarks.  (Much more complicated with multi-threaded code, of course.)  Adding internal benchmarks based on the difference in realtime elapsed can help determine if a system is fast enough to execute the code in a way the continues to appear responsive to real-time input.  Dragons flight (talk) 17:23, 11 January 2016 (UTC)


 * Simply stated, I'm trying to set up a CPU budgeting scheme for a system consisting of multiple long-running processes running on one processor. I'd like something akin to a circuit breaker to let me know if one process ever tries to use more of the CPU than it should.
 * In your house, your doorbell draws less current than your air conditioner, so they're on separate branch circuits protected by breakers having significantly different current ratings. Also, if you add up the current ratings of all the breakers in a panel, you should come up with a number which is less than the total rating of that panel (or, perhaps, less than the rating times some derating factor reflecting the fact that you probably won't try to turn on every lamp and appliance in the house all at once).
 * Also, I'd like to be able to predict for a new system whether the CPU requirements of a proposed new assortment of processes will or won't overload the proposed processor for that system. (That is, I'd like to do something analogous to the way an electrical engineer allocates branch circuits and panel ratings for a new building.)
 * For electric circuits protected by breakers, the appropriate unit for measuring electricity usage is the ampere. I'm trying to find an appropriate (and measurable) unit for my CPU budgeting.  (But, again, the appropriate unit is not percentage, for the same reason that circuit breakers aren't rated for some percentage of the panel they're installed in). —Steve Summit (talk) 18:21, 11 January 2016 (UTC)


 * Your goal appears very inefficient. The analogy to home electricity is a faulty one. If I unplug everything, my electricity is still available for use. Using none of the available power is not a problem. In a computer, an idle processor is a bad thing. We don't want the processor just sitting there doing nothing. We want it to work at capacity as much as possible. So, for your home electricity analogy, if I am not using the air conditioner, I want the doorbell to start ringing to use up the excess electricity available. Obviously, the analogy doesn't work.
 * The standard way to handle this problem is with priority settings. Higher priority processes have a greater chance to get CPU time. Lower priority processes have a lower chance to get CPU time. In your scheme, I tell a process that it can have, at most, 2000 CPU cycles a day. It has used them all. Now, nothing is running, but that process has to wait until tomorrow to start running again. That is no good. If nothing is running, there is no harm in having the low priority process run. I want it to run. Therefore, your CPU "budgeting" is better described as CPU "priority" settings. If I set one process to be priority 10 (max priority) and another to be 1 (lowest priority), then I expect the 10 to run most of the time, but the 1 to run a little here and there. Luckily, that is how most scheduling algorithms in modern operating systems work. There are complications due to things like priority inversion, but overall we use priority to set your so-called budget. 199.15.144.250 (talk) 18:45, 11 January 2016 (UTC)
 * Suppose five of those processes perform separate CPU-intensive steps along a pipeline, the outputs of which are needed, newly updated, 100 times per second. If they fail to do so, the system's performance will be seriously, perhaps fatally degraded.  How would I use priority to ensure their performance?  (As it happens, all five run with nice -10 already.) —Steve Summit (talk) 19:00, 11 January 2016 (UTC)
 * P.S. My CPU-limit circuit breakers won't necessarily shut the system or parts of it down, but they will alert me that something is wrong.


 * You are getting into real-time systems. That is an area I used to work in when I wrote programming for aircraft. In a real-time system, if something needs to happen right now, it happens right now or the system is a failure. We didn't monitor CPU cycles to solve that problem. We used an interrupt system. When something needed to take over the CPU, it would interrupt the CPU and then get the work done. There was also parallel systems, redundant systems, etc... It is a completely different mindset than a basic desktop computer. We are slowing getting back into real-time systems with cell phones. If your phone rings while you are playing the latest game, the game needs to stop immediately and let the "you've got a call" process take focus. Those systems don't count CPU cycles either.
 * Trying to find anything relative (and because this trip down aircraft programming goes back many many years), we did instruction counts back in the 8K and 16K days. It was more relaxed in the 64K days. Instruction count is directly related to CPU cycle count. Given a serial (non-looping) set of instructions, you could determine exactly how many cycles it will take in the old days. So (now rewinding to the Atari days), I would know that I had to update the video at precise moments because it was scanning across the screen. If I didn't get it this cycle, I had to wait until the next scan. That would show up as very obvious glitches in the video. So, I knew exactly how many cycles I had (though we didn't call them cycles at that time) between each time I had to update a video element. I had to play with the instructions I wanted to run to work out how to cram everything in just right to get the video to update at exactly the right time. So, long long ago, what you wanted was possible and done by experienced programmers. With modern operating systems, it isn't. With real-time operating systems, the problem is solved in other ways. I don't use real-time systems now, but I assume they still use interrupts to stop the current process and take over the CPU as necessary. 199.15.144.250 (talk) 19:16, 11 January 2016 (UTC)
 * You're absolutely right. What I'm trying to do here is very close to the kind of thing that it's a true real-time system's job to guarantee, although what I've got is not and will never be a true real-time system. —Steve Summit (talk) 12:29, 16 January 2016 (UTC)


 * The heart of your problem here is your statement: "some number (any number) that has some kind of roughly linear relationship with the performance of the CPU I'm using" - because you're inherently assuming that such a number exists and has meaning. "Performance" (interpreted loosely as "How long a program will take to run") is a multidimensional thing.  A program that runs in a single thread will have identical performance to one that is multithreaded on a CPU with only one core - but will run many times slower on a machine with multiple cores.  A program that is very careful about cache-coherency will behave differently than one that thrashes the cache on machines with different cache sizes and different ratios of cache speed to external RAM speed - and on machines with multi-level caches.  Programs which have very predictable jumps and loops will perform differently from programs that don't on CPU's with good branch prediction - but hardly any differently on machines with poor branch prediction.  RISC versus CISC has another effect...floating point performance versus integer performance versus mixed-use parallelism makes another difference.  It goes on and on.  Even if you could characterize your programs needs in this multidimensional situation - and have multidimensional 'scores' for each processor - you still wouldn't get a remotely good answer because of the interactions of those things with each other.


 * Increasingly, very intense pieces of software are starting to use the GPU as well as the CPU for things other than graphics - and that adds another huge pile of complexity to the analysis.


 * You really REALLY can't come up with a single number - or even a collection of numbers without knowing an enormous amount about the particular program that you're running and the architecture of the machine you're running it on.


 * SteveBaker (talk) 18:31, 11 January 2016 (UTC)


 * Thank you for telling me that what I am trying to do is impossible. :-)
 * Lest any other respondents be discouraged, however, let me emphasize that I am still open to suggestions helping me do what I am trying to do. —Steve Summit (talk) 18:49, 11 January 2016 (UTC)
 * Steve (Summit): I know you're really familiar with the subject-matter, so I'm inclined to try harder than normal to help find a solution, without resigning to dismissing the task as "impossible"... but, as you know, this type of measurement is not straightforward on modern computers (e.g. Intel CPUs running Linux). In order to help us help you, can you answer this simple question: why do you want to use units of "machine cycles" as your performance metric?
 * Nimur (talk) 18:58, 11 January 2016 (UTC)
 * (Thanks.) I have no particular desire to use cycles, but I do need a unit other than 'seconds' or 'percent of CPU'. My goal is that once I've determined that the <tt>WeathervaneDriver</tt> module needs, on average, 50 wotzits/minute while it's running normally, I can do things like get automatically-triggered warnings if it ever starts burning 100 wotzits/minute. But having determined that <tt>WeathervaneDriver</tt>'s number is 50, I am hoping to not have to recompute it (nor the numbers for all the other drivers) when we move to a 2&times; faster (but otherwise similar) processor next month. —Steve Summit (talk) 19:10, 11 January 2016 (UTC)
 * Stupid question: Why is "10%" not useful? If your new CPU is twice as fast, it should be 5%, so why the need for an absolute measurement? To avoid problems with spikes, just integrate (i.e. run  every 5 seconds and update your estimate with an Alpha beta filter, or, if you feel the need, a Kalman filter). If your estimate is above a limit, signal the problem. --Stephan Schulz (talk) 20:08, 11 January 2016 (UTC)
 * There are two reasons I'd rather not use percents:
 * Just as a matter of logistics, when that 2&times; faster processor comes along next month, I don't even want to have to do the conversion from 10% to 5%, I want my little metering scheme to take care of itself, without my constant attention. (Only if I move to a different processor family, or make some other architectural change which invalidates the -- hypothetical -- clock-scaling scheme do I want to have to recompute all my CPU budgets.)
 * Doing the metering by percentages feels more or less exactly like having a fusebox where all the fuses or circuit breakers are labeled, not in amps, but in percentages of the maximum rating of the panel. That would be a horrible and completely unworkable scheme for allocating electrical power, and to me it feels just as bad for budgeting CPU.
 * But I may give up and do it by pure percentages anyway, if the scaling idea ends up being completely unworkable. —Steve Summit (talk) 12:29, 16 January 2016 (UTC)
 * Steve, I think there is some unstated background assumption. Back when I was still doing Air Traffic Control systems, we just did a rough estimate (based on historical experience), scaled it to the worst-case traffic assumptions, and then over-designed the system by a factor of 5-10 (which was easy, since servers now are so powerful that even standard configurations of 1U servers from Dell, IBM or HP typically were enough to reach that factor). Is it possible that you not only want to estimate and monitor the behaviour of your one program, but also want to pack as many different instances (or different services) as possible onto one machine and still guarantee adequate performance? On the other hand, if you want to discover unusual excursions of your program, just have two different running averages of cpu usage, one smoothed over a long time (say hours to days), one smoothed over a much short time (minutes), and alert the user if the short time average exceeds the long-term average by more than an acceptable level. This setup will auto-tune on any machine. It might miss very gradual performance degradation, but you can catch these by building in some hard limits (or just determine acceptable percentage from the behaviour the first few appropriate time units and freeze it in). --Stephan Schulz (talk) 14:52, 16 January 2016 (UTC)
 * Good suggestions, thanks. —Steve Summit (talk) 16:42, 17 January 2016 (UTC)
 * With percents, there's always the issue that the base may not be what you think it is. In this case, that would mean that what you think is the available CPU time it's using as 100% is not.  (I'm not sure if this is ever an actual issue or not, but it potentially could be, say if a good chunk of the CPU time is reserved for something else and thus not counted.) StuRat (talk) 20:37, 11 January 2016 (UTC)
 * Okay, then what you want is Intel Performance Counters, available by using the msrmod kernel module or directly calling pcm from your software. Intel publishes a whole bunch of reference code to help you write "CPU resource"-aware scheduling into your program.  See the references section for more information.  If you aren't on Intel, ... talk to your chip and compiler vendor, probably?
 * For just about anyone else on the reference desk, I'd say "don't do it!" I'd encourage them to run away screaming... but I figure Steve Summit knows exactly what he's doing, and if he wants to write a model-specific, performance-aware software program, he knows what he's getting into.
 * For everyone else: don't do it, and run away screaming.
 * Nimur (talk) 21:36, 11 January 2016 (UTC)
 * Thanks for the references to Intel Performance Counters. —Steve Summit (talk) 12:29, 16 January 2016 (UTC)
 * Hmmm - if Steve Summit has some god-like capability in this field, and the rest of us are completely incapable, why the heck is he asking us incompetents for the answer?
 * The truth is...as I've carefully explained above...THERE REALLY, TRULY, IS NO GOOD ANSWER. I've been programming computers for over 40 years now (July 1973!) - and for almost all of that time, I've been programming 'hard' realtime systems where reliable performance is an issue - so I speak what I know to be true.  Truly, honestly.  What is being asked for here was somewhat reasonable 20 years ago, dubious in the extreme 10 years ago - and flat out impossible today.  With modern CPU architectures, all bets are off - it's a multi-dimensional, time-varying, resource-dependent, architecture-sensitive mess.  Whatever number you come up with will be a flaky hack.  Even if you think you've obtained a number - how do you know that you're not running on a laptop that's decided to drop into power-saving mode - or a deskside computer that's decided to slow it's clock down to avoid overheating?  Even on one single computer you can't predict the answer reliably.  SteveBaker (talk) 20:48, 12 January 2016 (UTC)
 * Make fun of me all you like, Steve -- I'm a big guy, I can take it -- but you know, you're making a lot of unwarranted assumptions there. Perhaps I know my code is not and will never be running on a laptop.  Perhaps I know that variable clocking is not an issue.
 * As for my allegedly god-like capabilities, pay no attention, that was just one of my overenthusiastic fans. In point of fact I'm always willing to learn something new; I hope I never get so dogmatic in my thinking that I start thinking I know everything or that someone with a new or different idea is worthy of ridicule. —Steve Summit (talk) 12:29, 16 January 2016 (UTC)
 * I was not razzing you, personally - but rather your "fan". If you have constraints beyond what was stated in your original question, then perhaps an  answer becomes more attainable (although I personally doubt it).  However, since my telepathic abilities are severely overrated, I have been answering the question as asked. :-)   SteveBaker (talk) 17:43, 16 January 2016 (UTC)


 * If cycles are unreliable, why not try scaling the percentages by the results of some benchmark test? Either find one that's widely used enough to cover most of the architechture you'll be using, or design one which mirrors the expected load of your program and run it on your servers to get a baseline value.  This gives you a scale which should give a reasonable estimate of that actual performance of the processor.  Of course, it either requires a lookup table for pre-generated benchmarks, or sufficient spare time to run the benchmark when the program is put onto the system. MChesterMC (talk) 09:31, 12 January 2016 (UTC)
 * I thought of that, and I may well do it, although if that's the right approach, in the end the kernel is already doing that, perhaps better than I can, and I could just bite the bullet and use the already-computed bogomips number. —Steve Summit (talk) 12:29, 16 January 2016 (UTC)


 * That's an approach that actually works...oh - but only if you're very careful that your benchmark realistically depicts things like cache coherency, float-versus-integer overlap, threading and branch prediction from your 'real' program. Actually, the very best benchmark for this is your actual program...so rather than messing around with benchmarks - just measure how fast it goes on a reasonably large set of computers in your user's demographic.
 * FYI: That's how the big boys play...it's called 'alpha testing'.
 * SteveBaker (talk) 20:48, 12 January 2016 (UTC)
 * And it's completely irrelevant to my question and my situation, but thanks again for the extra additional belittlement. —Steve Summit (talk) 12:29, 16 January 2016 (UTC)


 * ,, , , 199.15.144.250: thanks for all the replies. This has certainly been thought-provoking. —Steve Summit (talk) 12:29, 16 January 2016 (UTC)