Introduction

Software keeps growing in complexity, and it demands more memory and more CPU power as it runs more tasks on more threads. Allocation performance depends on many factors, including the OS, the compiler, the SDK, and the application's design. Memory management is one part of the SDK that affects application performance, especially for applications that use a lot of memory. This article analyzes the performance of the default memory manager on Windows 7 and the default memory manager on Linux.

QHeap

QHeap is used as the point of comparison. It is a non-blocking memory manager designed to handle heavy memory requests from multiple threads. All multithreading synchronization is done with compiler intrinsics. Memory block addresses are guaranteed to fall on a 4-byte boundary on 32-bit systems and on an 8-byte boundary on 64-bit systems. QHeap is compiled with VC10 on Windows and with g++ 4.5 on Linux.
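The alignment guarantee can be checked with a small helper. The sketch below is illustrative only: isAligned() and expectedBoundary() are hypothetical names, not part of QHeap, and any pointer returned by an allocator (for example std::malloc) can stand in for a QHeap block.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: returns true when p is a multiple of `alignment`
// bytes. `alignment` must be a power of two (4 or 8 in the text above).
inline bool isAligned(const void* p, std::size_t alignment)
{
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical helper: the boundary stated above for the current build,
// 4 bytes on a 32-bit build and 8 bytes on a 64-bit build.
inline std::size_t expectedBoundary()
{
    return sizeof(void*) == 8 ? 8 : 4;
}
```

Calling isAligned(block, expectedBoundary()) on each returned block is enough to verify the stated boundary in a test build.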

Test Methodology

HeapPerf is a tool that simulates a relatively heavy load by using the STL to drive memory allocations from multiple threads. It uses a custom allocator to sample the cost of each memory allocation and de-allocation, which keeps the costs of the STL itself, I/O operations, and application logic out of the measurement. The custom allocator is a lightly modified std::allocator that calls testAlloc() instead of operator new and testFree() instead of operator delete. Below is how testAlloc() works:

void*
testAlloc(size_t size)
{
    // get the testing configuration from a TLS slot
    TaskInfo *This = TaskInfo::get();

    // Save performance counter before calling memory allocation API
#if defined(_WIN32)
    LARGE_INTEGER start, end;
    QueryPerformanceCounter(&start);
#else
    timespec start, end;
    clock_gettime(CLOCK_REALTIME, &start);
#endif

    // Invokes the corresponding memory allocation API 
    // according to the testing configuration
    void *pv;
    switch (TaskInfo::_heapType)
    {
#if defined(_WIN32)
    case HEAPTYPE_WIN32:
        pv = ::HeapAlloc(GetProcessHeap(), 0, size);
        break;
#endif
    case HEAPTYPE_QHEAP:
        pv = ::QHeapAlloc(size);
        break;

    case HEAPTYPE_LHEAP:
        pv = This ? ::LHeapAlloc(This->_heap, size) : ::malloc(size);
        break;

    case HEAPTYPE_SHEAP:
        pv = ::SHeapAlloc(TaskInfo::_sharedHeap, size);
        break;

    case HEAPTYPE_DEFAULT:
    default:
        pv = ::operator new (size);
        break;
    }

    // only sample the data from the testing threads, 
    // the main thread isn't configured for testing
    if (This)
    {
        // Query performance counter, then increase the AllocTime 
        // of the thread with the delta:
#if defined(_WIN32)
        QueryPerformanceCounter(&end);
        This->_allocTime += end.QuadPart - start.QuadPart;
#else
        clock_gettime(CLOCK_REALTIME, &end);
        This->_allocTime += (ULONGLONG)(end.tv_sec - start.tv_sec) *
                1000000000L + (end.tv_nsec - start.tv_nsec);
#endif
        // increase the number of allocations
        This->_allocCount++;
    }
    return pv;
}

testFree() is similar to testAlloc(); see the source code for full details.
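The custom allocator itself is not listed here. A minimal sketch of how a container allocator could route requests through testAlloc() and testFree() is shown below. It assumes a C++11 compiler, and both functions are replaced by stand-ins that only count calls and forward to malloc/free; TestAllocator and the two counters are hypothetical names, not HeapPerf's actual code.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Stand-ins for HeapPerf's testAlloc()/testFree(). They only count calls
// and forward to malloc/free; the real functions also time each call.
// The counters are not thread-safe and exist only for illustration.
static std::size_t g_allocCalls = 0;
static std::size_t g_freeCalls = 0;

static void* testAlloc(std::size_t size)
{
    ++g_allocCalls;
    return std::malloc(size);
}

static void testFree(void* p)
{
    ++g_freeCalls;
    std::free(p);
}

// Minimal allocator that sends every container allocation through
// testAlloc()/testFree() instead of operator new/delete.
template <typename T>
struct TestAllocator
{
    typedef T value_type;

    TestAllocator() {}
    template <typename U>
    TestAllocator(const TestAllocator<U>&) {}

    T* allocate(std::size_t n)
    {
        void* p = testAlloc(n * sizeof(T));
        if (!p)
            throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t)
    {
        testFree(p);
    }
};

// The allocator is stateless, so all instances compare equal.
template <typename T, typename U>
bool operator==(const TestAllocator<T>&, const TestAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const TestAllocator<T>&, const TestAllocator<U>&) { return false; }
```

A container declared as std::vector<int, TestAllocator<int> > then reports every buffer it acquires or releases to the two functions, so only the heap calls are timed and not the container logic around them.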

For comparison purposes, HeapPerf adds up AllocTime and FreeTime across all threads after they have completed, and then calculates the average time in milliseconds per 1 million allocations.

LARGE_INTEGER perfFrequency;
QueryPerformanceFrequency(&perfFrequency);

// get the total cost in milliseconds
double allocTime = double(_allocTime) * 1000.0 / perfFrequency.QuadPart;
double freeTime = double(_freeTime) * 1000.0 / perfFrequency.QuadPart;

// get the average cost in milliseconds for 1 million allocations
allocTime = allocTime * 1000000 / _allocCount;

// get the average cost in milliseconds for 1 million de-allocations
freeTime = freeTime * 1000000 / _freeCount;
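The same arithmetic can be written as one portable function that first sums the per-thread totals. This is a sketch under assumptions: ThreadTotals and msPerMillionAllocs() are hypothetical names, and the tick frequency is passed in so that the function covers both QueryPerformanceCounter() ticks and clock_gettime() nanoseconds.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-thread totals, mirroring _allocTime and _allocCount.
struct ThreadTotals
{
    std::uint64_t allocTicks;  // accumulated timer ticks
    std::uint64_t allocCount;  // number of allocations sampled
};

// Sums the per-thread totals and returns milliseconds per 1 million
// allocations. ticksPerSecond is the QueryPerformanceFrequency() value on
// Windows, or 1000000000 when the ticks are clock_gettime() nanoseconds.
inline double msPerMillionAllocs(const std::vector<ThreadTotals>& threads,
                                 std::uint64_t ticksPerSecond)
{
    std::uint64_t ticks = 0, count = 0;
    for (std::size_t i = 0; i < threads.size(); ++i)
    {
        ticks += threads[i].allocTicks;
        count += threads[i].allocCount;
    }
    if (count == 0 || ticksPerSecond == 0)
        return 0.0;  // nothing was sampled

    // total cost in milliseconds, then scaled to 1 million allocations
    double totalMs = double(ticks) * 1000.0 / double(ticksPerSecond);
    return totalMs * 1000000.0 / double(count);
}
```

For example, two threads that together spend 1 second on 2 million allocations yield 500 ms per 1 million allocations.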

In summary, HeapPerf reports:

  • AllocTime: The average time in milliseconds spent for 1 million allocations
  • FreeTime: The average time in milliseconds spent for 1 million de-allocations
  • Mem: The peak memory usage in KB (shown as RSS in the Linux tables)

The source code of HeapPerf is provided as a reference for the test methodology.

Conclusions

All numbers are averages over 10 runs on the same machine: a Dell XPS 420 with an Intel Core 2 Quad Q6600 at 2.4 GHz and 6 GB of RAM.

Too Many Threads are Bad

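Table-1 and Table-2 indicate that too many running threads reduce performance dramatically. On both platforms the time per 1 million allocations changes comparatively little up to 4 threads (the Q6600 has four cores), then grows by roughly 1.5 to 2.2 times each time the thread count doubles. Applications should limit the number of running threads.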
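One common way to limit running threads is to cap the worker count at the number of hardware threads. The sketch below assumes a C++11 compiler; cappedWorkerCount() is a hypothetical helper, and the cap itself is a heuristic rather than something HeapPerf measures.

```cpp
#include <algorithm>
#include <thread>

// Caps a requested worker count at the number of hardware threads.
// hardwareThreads == 0 means "unknown" (hardware_concurrency() may
// return 0), in which case a single worker is used.
inline unsigned cappedWorkerCount(unsigned requested, unsigned hardwareThreads)
{
    unsigned limit = hardwareThreads == 0 ? 1u : hardwareThreads;
    return std::max(1u, std::min(requested, limit));
}

// Convenience wrapper that queries the current machine.
inline unsigned cappedWorkerCount(unsigned requested)
{
    return cappedWorkerCount(requested, std::thread::hardware_concurrency());
}
```

On a four-core machine such as the one used for these tests, a request for 128 workers would be reduced to 4.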

Table-1: Windows 7 Ultimate 64-bit

Heap     Threads  AllocTime(ms)  FreeTime(ms)  Mem(KB)
default        1            121            92    2,657
default        2            178           110    5,084
default        4            188           101    7,960
default        8            366           195    8,384
default       16            568           308    9,582
default       32          1,030           580   11,518
default       64          1,505         1,139   14,676
default      128          2,717         2,018   20,842
qheap          1             55            45    2,282
qheap          2             61            49    2,428
qheap          4             65            52    2,782
qheap          8            112           106    3,354
qheap         16            196           170    4,374
qheap         32            386           389    6,598
qheap         64            843           647   10,932
qheap        128          1,475         1,277   19,421

Table-2: Ubuntu Server 10.10 64-bit

Heap     Threads  AllocTime(ms)  FreeTime(ms)  RSS(KB)
default        1             80            79    5,918
default        2             90            82    6,518
default        4             89            83    8,029
default        8            164           153    9,942
default       16            338           309   14,043
default       32            650           614   22,446
default       64          1,418         1,211   38,566
default      128          3,155         2,123   71,042
qheap          1             81            80    5,918
qheap          2             90            84    6,475
qheap          4             90            84    7,949
qheap          8            163           156    9,931
qheap         16            339           316   14,123
qheap         32            673           621   22,238
qheap         64          1,415         1,194   38,941
qheap        128          2,800         2,168   70,648

The Default on Windows 7 isn’t Great

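The formula (qheap - default) / default is used to compare QHeap against the default on Windows 7 (Table-3) and on Linux (Table-4); a negative value means QHeap takes less time or memory. Table-3 indicates that QHeap is significantly faster than the default on Windows 7: allocations take 44% to 69% less time and de-allocations take 33% to 56% less. Table-4 indicates that QHeap is about the same as the default on Linux, within about 4% everywhere except allocation at 128 threads, where QHeap takes 11.2% less time.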
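As a worked check, the published percentages are reproduced when the difference is divided by the default's value. The helper below is a hypothetical illustration, not part of HeapPerf. With the 1-thread Windows numbers from Table-1 (55 ms for qheap, 121 ms for the default) it gives about -54.5%, close to the -54.4% listed in Table-3; the small gap presumably comes from rounding in the published averages.

```cpp
// Relative difference of a measurement against a baseline, in percent.
// A negative result means the measurement is smaller than the baseline
// (less time, or less memory).
inline double relativeDiffPercent(double measured, double baseline)
{
    return (measured - baseline) / baseline * 100.0;
}
```

The same function applies to the memory column: relativeDiffPercent(2282, 2657) is about -14.1%, matching the first row of Table-3.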

Table-3: QHeap vs. the default on Windows 7

Heap   Threads  AllocTime  FreeTime     Mem
qheap        1     -54.4%    -50.9%  -14.1%
qheap        2     -66.0%    -56.0%  -52.2%
qheap        4     -65.7%    -48.9%  -65.1%
qheap        8     -69.4%    -45.6%  -60.0%
qheap       16     -65.4%    -44.6%  -54.4%
qheap       32     -62.6%    -33.0%  -42.7%
qheap       64     -44.0%    -43.2%  -25.5%
qheap      128     -45.7%    -36.7%   -6.8%

Table-4: QHeap vs. the default on Ubuntu Server 10.10

Heap   Threads  AllocTime  FreeTime     RSS
qheap        1       1.1%      1.8%    0.0%
qheap        2       0.7%      2.0%   -0.7%
qheap        4       1.1%      0.8%   -1.0%
qheap        8      -0.7%      1.9%   -0.1%
qheap       16       0.3%      2.5%    0.6%
qheap       32       3.6%      1.1%   -0.9%
qheap       64      -0.2%     -1.4%    1.0%
qheap      128     -11.2%      2.1%   -0.6%

Windows 7+VC10 is Better than Linux+gcc4.5

According to Table-5, QHeap is significantly faster on Windows 7 than on Linux: computed as (Windows - Linux) / Linux, it takes roughly 28% to 47% less time per operation. This indicates that an application should have better performance on Windows 7 than on Linux if the application is compiled with VC10.

Table-5: QHeap on Windows 7 vs. on Linux

Heap   Threads  AllocTime  FreeTime
qheap        1     -32.3%    -44.0%
qheap        2     -32.9%    -42.0%
qheap        4     -28.4%    -38.0%
qheap        8     -31.3%    -31.9%
qheap       16     -42.0%    -46.1%
qheap       32     -42.7%    -37.4%
qheap       64     -40.4%    -45.8%
qheap      128     -47.3%    -41.1%