Windows 7 Heap Performance Analysis
Introduction
Software keeps growing more complex, and running many tasks across multiple threads demands more memory and more CPU power. Allocation performance depends on many factors, including the OS, the compiler, the SDK, and the application's design. Memory management is one SDK component that strongly affects application performance, especially for applications that use a lot of memory. This article analyzes the performance of the default memory manager on Windows 7 and the default memory manager on Linux.
QHeap
QHeap is used as the baseline for comparison. It is a non-blocking memory manager designed to handle heavy allocation traffic from multiple threads. All multithreading synchronization is done with compiler intrinsics. Memory block addresses are guaranteed to be aligned on a 4-byte boundary on 32-bit systems and on an 8-byte boundary on 64-bit systems. QHeap is compiled with VC10 on Windows and with g++ 4.5 on Linux.
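The alignment guarantee above can be checked for any returned block with a small helper. This is a minimal sketch and not part of QHeap: isAligned() and expectedQHeapAlignment() are hypothetical names, and the sketch assumes that the pointer size is 4 bytes on 32-bit builds and 8 bytes on 64-bit builds, which holds on the common Windows and Linux targets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when `p` sits on a multiple of `alignment` bytes.
// `alignment` must be a non-zero power of two.
inline bool isAligned(const void* p, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical helper: the boundary the article states for QHeap blocks,
// 4 bytes on a 32-bit build and 8 bytes on a 64-bit build. Using
// sizeof(void*) for this is an assumption about the target platform.
inline std::size_t expectedQHeapAlignment()
{
    return sizeof(void*);
}
```

A test harness could call isAligned(block, expectedQHeapAlignment()) on every block an allocator returns and count the failures.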
Test Methodology
HeapPerf is a tool that simulates a relatively heavy load by using STL to trigger memory allocations from multiple threads. It uses a custom allocator to sample the cost of each memory allocation and de-allocation, which keeps the cost of STL code, I/O operations, application logic, etc. out of the measurement. The custom allocator is a lightly modified std::allocator that calls testAlloc() instead of operator new and testFree() instead of operator delete. Below is how testAlloc() works:
void*
testAlloc(size_t size)
{
    // get the testing configuration from a TLS slot
    TaskInfo *This = TaskInfo::get();

    // Save performance counter before calling memory allocation API
#if defined(_WIN32)
    LARGE_INTEGER start, end;
    QueryPerformanceCounter(&start);
#else
    timespec start, end;
    clock_gettime(CLOCK_REALTIME, &start);
#endif

    // Invokes the corresponding memory allocation API
    // according to the testing configuration
    void *pv;
    switch (TaskInfo::_heapType)
    {
#if defined(_WIN32)
    case HEAPTYPE_WIN32:
        pv = ::HeapAlloc(GetProcessHeap(), 0, size);
        break;
#endif
    case HEAPTYPE_QHEAP:
        pv = ::QHeapAlloc(size);
        break;
    case HEAPTYPE_LHEAP:
        pv = This ? ::LHeapAlloc(This->_heap, size) : ::malloc(size);
        break;
    case HEAPTYPE_SHEAP:
        pv = ::SHeapAlloc(TaskInfo::_sharedHeap, size);
        break;
    case HEAPTYPE_DEFAULT:
    default:
        pv = ::operator new (size);
        break;
    }

    // only sample the data from the testing threads,
    // the main thread isn't configured for testing
    if (This)
    {
        // Query performance counter, then increase the AllocTime
        // of the thread with the delta:
#if defined(_WIN32)
        QueryPerformanceCounter(&end);
        This->_allocTime += end.QuadPart - start.QuadPart;
#else
        clock_gettime(CLOCK_REALTIME, &end);
        This->_allocTime += (ULONGLONG)(end.tv_sec - start.tv_sec) *
            1000000000L + (end.tv_nsec - start.tv_nsec);
#endif
        // increase the number of allocations
        This->_allocCount++;
    }
    return pv;
}
The testFree() function is similar to testAlloc(); see the source code for full details.
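To illustrate how an allocator can route STL allocations through such hooks, here is a minimal sketch. It is not HeapPerf's allocator: TestAllocator, fillVector(), and the two counters are hypothetical names, and the stand-in testAlloc()/testFree() below only count calls and forward to malloc/free instead of timing anything. The sketch uses the C++11 minimal allocator interface; older standard libraries such as the one shipped with VC10 may require the full C++03 interface (rebind, construct, destroy, and so on).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Stand-ins for the article's testAlloc()/testFree(). These only count
// calls and forward to malloc/free; the real functions also time each call.
static std::size_t g_allocCalls = 0;
static std::size_t g_freeCalls = 0;

inline void* testAlloc(std::size_t size)
{
    ++g_allocCalls;
    return std::malloc(size);
}

inline void testFree(void* p)
{
    ++g_freeCalls;
    std::free(p);
}

// Minimal C++11 allocator that sends every request through the hooks above.
template <typename T>
struct TestAllocator
{
    typedef T value_type;

    TestAllocator() {}
    template <typename U>
    TestAllocator(const TestAllocator<U>&) {}

    T* allocate(std::size_t n)
    {
        void* p = testAlloc(n * sizeof(T));
        if (!p)
            throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t)
    {
        testFree(p);
    }
};

// All instances are interchangeable because the allocator holds no state.
template <typename T, typename U>
bool operator==(const TestAllocator<T>&, const TestAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const TestAllocator<T>&, const TestAllocator<U>&) { return false; }

// Example workload: every reallocation caused by push_back goes through
// testAlloc()/testFree(), and the vector releases its storage on return.
inline std::size_t fillVector(std::size_t count)
{
    std::vector<int, TestAllocator<int> > v;
    for (std::size_t i = 0; i < count; ++i)
        v.push_back(static_cast<int>(i));
    return v.size();
}
```

Because the hooks sit inside the allocator, only the allocation calls themselves are observed; the element copies that the container performs while growing are outside the hooks.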
For comparison purposes, HeapPerf computes the total AllocTime and FreeTime by adding up the per-thread numbers after all threads have completed, and then calculates the average time in milliseconds per 1 million allocations.
LARGE_INTEGER perfFrequency;
QueryPerformanceFrequency(&perfFrequency);
// get the total cost in milliseconds
double allocTime = double(_allocTime) * 1000.0 / perfFrequency.QuadPart;
double freeTime = double(_freeTime) * 1000.0 / perfFrequency.QuadPart;
// get the average cost in milliseconds for 1 million allocations
allocTime = allocTime * 1000000 / _allocCount;
// get the average cost in milliseconds for 1 million de-allocations
freeTime = freeTime * 1000000 / _freeCount;
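The same normalization can be written as one function that first sums the per-thread counters. This is a sketch under stated assumptions rather than HeapPerf's code: ThreadSample and msPerMillion() are hypothetical names. The snippet above uses the Windows counter frequency; on the Linux path testAlloc() accumulates nanoseconds, so the equivalent ticks-per-second value there would be 1,000,000,000.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef unsigned long long Ticks;

// Hypothetical per-thread record: accumulated timer ticks and call count.
struct ThreadSample
{
    Ticks time;
    Ticks count;
};

// Sums all threads, converts ticks to milliseconds, and scales the result
// to the average cost of 1 million calls.
inline double msPerMillion(const std::vector<ThreadSample>& samples,
                           Ticks ticksPerSecond)
{
    assert(ticksPerSecond != 0);

    Ticks totalTime = 0;
    Ticks totalCount = 0;
    for (std::size_t i = 0; i < samples.size(); ++i)
    {
        totalTime += samples[i].time;
        totalCount += samples[i].count;
    }
    if (totalCount == 0)
        return 0.0; // nothing was measured

    double totalMs = double(totalTime) * 1000.0 / double(ticksPerSecond);
    return totalMs * 1000000.0 / double(totalCount);
}
```

For example, two threads that each spent 2,400,000 ticks on 1,000,000 calls, with a counter running at 2,400,000 ticks per second, spent 2,000 ms on 2,000,000 calls in total, which is 1,000 ms per 1 million calls.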
In summary, HeapPerf reports:
AllocTime: The average time in milliseconds spent for 1 million allocations
FreeTime: The average time in milliseconds spent for 1 million de-allocations
Mem: The peak memory usage in KB
Finally, HeapPerf reports:
AllocTime: The average time per thread in milliseconds spent for all allocations
FreeTime: The average time per thread in milliseconds spent for all de-allocations
PeakMem: The peak memory usage in KB
The source code of HeapPerf is provided as a reference for the test methodology.
Conclusions
All numbers are averages of 10 runs on the same machine: Dell XPS 420, Intel Q6600 2.4 GHz, 6 GB RAM.
Too Many Threads are Bad
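Table-1 and Table-2 indicate that too many running threads reduce performance dramatically. On this quad-core machine the cost rises only modestly up to 4 threads and then climbs steeply, often nearly doubling each time the thread count doubles. Applications should limit the number of running threads.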
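One common way to apply this advice is to cap the number of worker threads at the number of hardware threads. The sketch below is an assumption about how an application might do that, not something HeapPerf does: cappedThreadCount() and defaultWorkerCount() are hypothetical names, and the code needs a C++11 standard library for std::thread, which VC10 does not provide.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Limits `requested` worker threads to the number of hardware threads.
// A `hardware` value of 0 means "unknown" and is treated as 1.
inline unsigned cappedThreadCount(unsigned requested, unsigned hardware)
{
    assert(requested > 0);
    if (hardware == 0)
        hardware = 1;
    return std::min(requested, hardware);
}

// Convenience wrapper that asks the standard library for the hardware
// thread count; hardware_concurrency() may return 0 when it cannot tell.
inline unsigned defaultWorkerCount(unsigned requested)
{
    return cappedThreadCount(requested, std::thread::hardware_concurrency());
}
```

Capping at the hardware thread count is a rule of thumb for CPU-bound work like the allocation loops measured here; workloads that spend much of their time blocked on I/O may reasonably use more threads.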
Table-1: Windows 7 Ultimate 64-bit
Heap | Threads | AllocTime(ms) | FreeTime(ms) | Mem(KB) |
default | 1 | 121 | 92 | 2,657 |
default | 2 | 178 | 110 | 5,084 |
default | 4 | 188 | 101 | 7,960 |
default | 8 | 366 | 195 | 8,384 |
default | 16 | 568 | 308 | 9,582 |
default | 32 | 1,030 | 580 | 11,518 |
default | 64 | 1,505 | 1,139 | 14,676 |
default | 128 | 2,717 | 2,018 | 20,842 |
qheap | 1 | 55 | 45 | 2,282 |
qheap | 2 | 61 | 49 | 2,428 |
qheap | 4 | 65 | 52 | 2,782 |
qheap | 8 | 112 | 106 | 3,354 |
qheap | 16 | 196 | 170 | 4,374 |
qheap | 32 | 386 | 389 | 6,598 |
qheap | 64 | 843 | 647 | 10,932 |
qheap | 128 | 1,475 | 1,277 | 19,421 |
Table-2: Ubuntu Server 10.10 64-bit
Heap | Threads | AllocTime(ms) | FreeTime(ms) | RSS(KB) |
default | 1 | 80 | 79 | 5,918 |
default | 2 | 90 | 82 | 6,518 |
default | 4 | 89 | 83 | 8,029 |
default | 8 | 164 | 153 | 9,942 |
default | 16 | 338 | 309 | 14,043 |
default | 32 | 650 | 614 | 22,446 |
default | 64 | 1,418 | 1,211 | 38,566 |
default | 128 | 3,155 | 2,123 | 71,042 |
qheap | 1 | 81 | 80 | 5,918 |
qheap | 2 | 90 | 84 | 6,475 |
qheap | 4 | 90 | 84 | 7,949 |
qheap | 8 | 163 | 156 | 9,931 |
qheap | 16 | 339 | 316 | 14,123 |
qheap | 32 | 673 | 621 | 22,238 |
qheap | 64 | 1,415 | 1,194 | 38,941 |
qheap | 128 | 2,800 | 2,168 | 70,648 |
The Default on Windows 7 isn’t Great
The formula (qheap - default) / default is used to compare QHeap against the default heap on Windows 7 (Table-3) and on Linux (Table-4), so a negative value means that QHeap is faster or uses less memory. Table-3 indicates that QHeap is significantly faster than the default on Windows 7. Table-4 indicates that QHeap is about the same as the default on Linux.
Table-3 QHeap vs. the default on Windows 7
Heap | Threads | AllocTime(ms) | FreeTime(ms) | Mem(KB) |
qheap | 1 | -54.4% | -50.9% | -14.1% |
qheap | 2 | -66.0% | -56.0% | -52.2% |
qheap | 4 | -65.7% | -48.9% | -65.1% |
qheap | 8 | -69.4% | -45.6% | -60.0% |
qheap | 16 | -65.4% | -44.6% | -54.4% |
qheap | 32 | -62.6% | -33.0% | -42.7% |
qheap | 64 | -44.0% | -43.2% | -25.5% |
qheap | 128 | -45.7% | -36.7% | -6.8% |
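The percentages in Table-3 can be reproduced from Table-1 by dividing the difference by the default heap's value. The helper below is a minimal sketch; relativeChangePercent() is a hypothetical name and not part of HeapPerf.

```cpp
#include <cassert>

// Relative change of `candidate` against `baseline`, in percent.
// Negative results mean the candidate is smaller (faster, or less memory).
inline double relativeChangePercent(double candidate, double baseline)
{
    assert(baseline != 0.0);
    return (candidate - baseline) / baseline * 100.0;
}
```

With the single-thread AllocTime values from Table-1 (55 ms for qheap, 121 ms for default) this gives about -54.5%, close to the -54.4% listed in Table-3. The small difference is presumably because the published percentages were computed from unrounded averages.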
Table-4 QHeap vs. the default on Ubuntu Server 10.10
Heap | Threads | AllocTime(ms) | FreeTime(ms) | RSS(KB) |
qheap | 1 | 1.1% | 1.8% | 0.0% |
qheap | 2 | 0.7% | 2.0% | -0.7% |
qheap | 4 | 1.1% | 0.8% | -1.0% |
qheap | 8 | -0.7% | 1.9% | -0.1% |
qheap | 16 | 0.3% | 2.5% | 0.6% |
qheap | 32 | 3.6% | 1.1% | -0.9% |
qheap | 64 | -0.2% | -1.4% | 1.0% |
qheap | 128 | -11.2% | 2.1% | -0.6% |
Windows 7+VC10 is Better than Linux+gcc4.5
QHeap on Windows 7 is significantly faster than on Linux according to Table-5. This suggests that an application compiled with VC10 should perform better on Windows 7 than the same application built with gcc 4.5 on Linux.
Table-5 QHeap on Windows 7 vs. on Linux
Heap | Threads | AllocTime(ms) | FreeTime(ms) |
qheap | 1 | -32.3% | -44.0% |
qheap | 2 | -32.9% | -42.0% |
qheap | 4 | -28.4% | -38.0% |
qheap | 8 | -31.3% | -31.9% |
qheap | 16 | -42.0% | -46.1% |
qheap | 32 | -42.7% | -37.4% |
qheap | 64 | -40.4% | -45.8% |
qheap | 128 | -47.3% | -41.1% |