Windows 7 Heap Performance Analysis

Introduction

Software keeps growing more complex, and running many tasks across many threads demands more memory and more CPU power. Allocation performance depends on many factors, including the OS, the compiler, the SDK, and the application's design. Memory management is one part of the SDK that directly affects application performance, especially for applications that allocate heavily. This article analyzes the performance of the default memory manager on Windows 7 and the default memory manager on Linux.

QHeap

QHeap is used as the basis for comparison. It is a non-blocking memory manager designed to handle heavy allocation traffic from multiple threads. All multithreading synchronization is done with compiler intrinsics. Memory block addresses are guaranteed to be aligned on a 4-byte boundary on 32-bit systems and on an 8-byte boundary on 64-bit systems. QHeap is compiled with VC10 on Windows and with g++ 4.5 on Linux.
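
To illustrate what non-blocking synchronization can look like, here is a minimal sketch of a lock-free free list. It is not QHeap's actual implementation: it uses std::atomic from C++11 rather than compiler intrinsics, and the names FreeBlock and LockFreeFreeList are hypothetical. Blocks are pushed with a compare-and-swap loop, and the whole list is detached with a single atomic exchange, which sidesteps the ABA problem that a per-block pop would have to handle.

```cpp
#include <atomic>

// Hypothetical node type: a freed block reuses its first bytes as a link.
struct FreeBlock {
    FreeBlock* next;
};

// Hypothetical lock-free free list (illustration only, not QHeap's code).
class LockFreeFreeList {
public:
    // Push a block onto the list without taking a lock.
    void push(FreeBlock* block) noexcept {
        FreeBlock* oldHead = head_.load(std::memory_order_relaxed);
        do {
            // On CAS failure oldHead is refreshed, so relink and retry.
            block->next = oldHead;
        } while (!head_.compare_exchange_weak(oldHead, block,
                                              std::memory_order_release,
                                              std::memory_order_relaxed));
    }

    // Detach and return the whole list in one atomic step.
    FreeBlock* takeAll() noexcept {
        return head_.exchange(nullptr, std::memory_order_acquire);
    }

private:
    std::atomic<FreeBlock*> head_{nullptr};
};
```

A thread that frees memory would call push(), and a thread that needs blocks would call takeAll() and walk the returned chain through the next pointers.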

Test Methodology

HeapPerf is a tool that simulates a relatively heavy load by using STL to trigger memory allocations from multiple threads. It uses a custom allocator to sample the cost of each memory allocation and de-allocation, which keeps the cost of STL code, I/O operations, and application logic out of the measurement. The custom allocator is a slightly modified std::allocator that calls testAlloc() instead of operator new and testFree() instead of operator delete. Below is how testAlloc() works:

void*
testAlloc(size_t size)
{
    // get the testing configuration from a TLS slot
    TaskInfo *This = TaskInfo::get();

    // Save performance counter before calling memory allocation API
#if defined(_WIN32)
    LARGE_INTEGER start, end;
    QueryPerformanceCounter(&start);
#else
    timespec start, end;
    clock_gettime(CLOCK_REALTIME, &start);
#endif

    // Invokes the corresponding memory allocation API 
    // according to the testing configuration
    void *pv;
    switch (TaskInfo::_heapType)
    {
#if defined(_WIN32)
    case HEAPTYPE_WIN32:
        pv = ::HeapAlloc(GetProcessHeap(), 0, size);
        break;
#endif
    case HEAPTYPE_QHEAP:
        pv = ::QHeapAlloc(size);
        break;

    case HEAPTYPE_LHEAP:
        pv = This ? ::LHeapAlloc(This->_heap, size) : ::malloc(size);
        break;

    case HEAPTYPE_SHEAP:
        pv = ::SHeapAlloc(TaskInfo::_sharedHeap, size);
        break;

    case HEAPTYPE_DEFAULT:
    default:
        pv = ::operator new (size);
        break;
    }

    // only sample the data from the testing threads, 
    // the main thread isn't configured for testing
    if (This)
    {
        // Query performance counter, then increase the AllocTime 
        // of the thread with the delta:
#if defined(_WIN32)
        QueryPerformanceCounter(&end);
        This->_allocTime += end.QuadPart - start.QuadPart;
#else
        clock_gettime(CLOCK_REALTIME, &end);
        This->_allocTime += (ULONGLONG)(end.tv_sec - start.tv_sec) * 1000000000L
                            + (end.tv_nsec - start.tv_nsec);
#endif
        // increase the number of allocations
        This->_allocCount++;
    }
    return pv;
}

testFree() is similar to testAlloc(); see the source code for full details.
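
The custom allocator itself is not shown in this article. As a rough sketch of how STL containers can be routed through testAlloc() and testFree(), the following assumes a C++11 or later compiler (VC10 would need the older, fuller allocator interface). TestAllocator and fillVector are hypothetical names, and the testAlloc()/testFree() bodies here are plain malloc/free stand-ins for the instrumented versions.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Stand-ins (assumption): the real functions time each call and dispatch
// on the configured heap type, as shown in the article.
inline void* testAlloc(std::size_t size) { return std::malloc(size); }
inline void testFree(void* p) { std::free(p); }

// Hypothetical minimal C++11 allocator that forwards to testAlloc/testFree
// instead of operator new and operator delete.
template <class T>
struct TestAllocator {
    using value_type = T;

    TestAllocator() noexcept = default;
    template <class U>
    TestAllocator(const TestAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        void* p = testAlloc(n * sizeof(T));
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept { testFree(p); }
};

// The allocator is stateless, so all instances compare equal.
template <class T, class U>
bool operator==(const TestAllocator<T>&, const TestAllocator<U>&) noexcept {
    return true;
}
template <class T, class U>
bool operator!=(const TestAllocator<T>&, const TestAllocator<U>&) noexcept {
    return false;
}

// Example workload: every reallocation of the vector goes through
// testAlloc/testFree. Returns the final element count.
inline std::size_t fillVector(std::size_t n) {
    std::vector<int, TestAllocator<int>> v;
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
    }
    return v.size();
}
```

Because only allocate() and deallocate() are redirected, the time spent inside the container's own logic stays outside the sampled interval.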

For comparison, HeapPerf sums AllocTime and FreeTime across all threads after they complete, and then calculates the average time in milliseconds per 1 million allocations and per 1 million de-allocations.

LARGE_INTEGER perfFrequency;
QueryPerformanceFrequency(&perfFrequency);

// get the total cost in milliseconds
double allocTime = double(_allocTime) * 1000.0 / perfFrequency.QuadPart;
double freeTime = double(_freeTime) * 1000.0 / perfFrequency.QuadPart;

// get the average cost in milliseconds for 1 million allocations
allocTime = allocTime * 1000000 / _allocCount;

// get the average cost in milliseconds for 1 million de-allocations
freeTime = freeTime * 1000000 / _freeCount;
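
The snippet above covers the Windows path. On Linux, testAlloc() accumulates nanoseconds from clock_gettime(), so the same arithmetic would presumably use 1,000,000,000 ticks per second in place of the performance counter frequency. A small helper that captures both cases might look like this (msPerMillionOps is a hypothetical name, not part of HeapPerf):

```cpp
#include <cstdint>

// Converts an accumulated tick count into milliseconds per one million
// operations. ticksPerSecond is the QueryPerformanceFrequency value on
// Windows, or 1000000000 when the ticks are nanoseconds.
inline double msPerMillionOps(std::uint64_t ticks,
                              std::uint64_t ticksPerSecond,
                              std::uint64_t opCount) {
    if (ticksPerSecond == 0 || opCount == 0) {
        return 0.0;  // nothing was measured
    }
    // Total cost in milliseconds.
    double totalMs = static_cast<double>(ticks) * 1000.0 /
                     static_cast<double>(ticksPerSecond);
    // Scale to the cost of one million operations.
    return totalMs * 1000000.0 / static_cast<double>(opCount);
}
```

For example, 1,000 ticks at 1,000,000 ticks per second is 1 ms in total; spread over 500,000 allocations, that is 2 ms per million allocations.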

In summary, HeapPerf reports:

AllocTime: The average time in milliseconds spent for 1 million allocations
FreeTime: The average time in milliseconds spent for 1 million de-allocations
Mem: The peak memory usage in KB (labeled RSS in the Linux tables)

The source code of HeapPerf is provided as a reference for the test methodology.

Conclusions

All numbers are averages over 10 runs on the same machine: Dell XPS 420, Intel Q6600 2.4 GHz, 6 GB RAM.

Too Many Threads are Bad

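Table-1 and Table-2 show that too many running threads reduce performance dramatically. On this four-core machine, the cost per million allocations changes little from 2 to 4 threads, then grows by roughly 1.5x to 2x each time the thread count doubles beyond that. Applications should limit the number of running threads.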
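
One simple way to limit running threads is to cap the worker count at the number of hardware threads. The sketch below assumes C++11's std::thread; cappedThreadCount is a hypothetical helper, and whether the hardware thread count is the best cap for a given application is a judgment call.

```cpp
#include <algorithm>
#include <thread>

// Returns a worker count that is at least 1 and never exceeds the number
// of hardware threads reported by the standard library.
inline unsigned cappedThreadCount(unsigned requested) {
    // hardware_concurrency() may return 0 when the value is unknown.
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0) {
        hw = 1;
    }
    return std::max(1u, std::min(requested, hw));
}
```

An application that wants 128 concurrent tasks would create cappedThreadCount(128) worker threads and queue the tasks across them, rather than starting 128 threads.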

Table-1: Windows 7 Ultimate 64-bit

Heap	Threads	AllocTime(ms)	FreeTime(ms)	Mem(KB)
default	1	121	92	2,657
default	2	178	110	5,084
default	4	188	101	7,960
default	8	366	195	8,384
default	16	568	308	9,582
default	32	1,030	580	11,518
default	64	1,505	1,139	14,676
default	128	2,717	2,018	20,842
qheap	1	55	45	2,282
qheap	2	61	49	2,428
qheap	4	65	52	2,782
qheap	8	112	106	3,354
qheap	16	196	170	4,374
qheap	32	386	389	6,598
qheap	64	843	647	10,932
qheap	128	1,475	1,277	19,421

Table-2: Ubuntu Server 10.10 64-bit

Heap	Threads	AllocTime(ms)	FreeTime(ms)	RSS(KB)
default	1	80	79	5,918
default	2	90	82	6,518
default	4	89	83	8,029
default	8	164	153	9,942
default	16	338	309	14,043
default	32	650	614	22,446
default	64	1,418	1,211	38,566
default	128	3,155	2,123	71,042
qheap	1	81	80	5,918
qheap	2	90	84	6,475
qheap	4	90	84	7,949
qheap	8	163	156	9,931
qheap	16	339	316	14,123
qheap	32	673	621	22,238
qheap	64	1,415	1,194	38,941
qheap	128	2,800	2,168	70,648

The Default on Windows 7 isn’t Great

The formula (qheap - default) / default is used to compare QHeap against the default heap on Windows 7 (Table-3) and on Linux (Table-4); negative values mean the QHeap number is lower. Table-3 indicates that QHeap is significantly faster than the default on Windows 7. Table-4 indicates that QHeap performs about the same as the default on Linux.

Table-3: QHeap vs. the default on Windows 7

Heap	Threads	AllocTime	FreeTime	Mem
qheap	1	-54.4%	-50.9%	-14.1%
qheap	2	-66.0%	-56.0%	-52.2%
qheap	4	-65.7%	-48.9%	-65.1%
qheap	8	-69.4%	-45.6%	-60.0%
qheap	16	-65.4%	-44.6%	-54.4%
qheap	32	-62.6%	-33.0%	-42.7%
qheap	64	-44.0%	-43.2%	-25.5%
qheap	128	-45.7%	-36.7%	-6.8%

Table-4: QHeap vs. the default on Ubuntu Server 10.10

Heap	Threads	AllocTime	FreeTime	RSS
qheap	1	1.1%	1.8%	0.0%
qheap	2	0.7%	2.0%	-0.7%
qheap	4	1.1%	0.8%	-1.0%
qheap	8	-0.7%	1.9%	-0.1%
qheap	16	0.3%	2.5%	0.6%
qheap	32	3.6%	1.1%	-0.9%
qheap	64	-0.2%	-1.4%	1.0%
qheap	128	-11.2%	2.1%	-0.6%
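
The percentages in Table-3 and Table-4 are consistent with dividing the difference by the default heap's value. As a quick check, the helper below (relativeDiffPercent is a hypothetical name) reproduces the comparison from the rounded figures in Table-1 and Table-2.

```cpp
// Relative difference of a candidate measurement against a baseline,
// in percent. Negative means the candidate is lower than the baseline.
inline double relativeDiffPercent(double candidate, double baseline) {
    return (candidate - baseline) / baseline * 100.0;
}
```

With the 1-thread Windows 7 AllocTime values (qheap 55, default 121) this gives about -54.5%, close to the -54.4% in Table-3; the small gap presumably comes from the tables being computed on unrounded averages. With the 128-thread Linux values (qheap 2,800, default 3,155) it gives about -11.3%, against -11.2% in Table-4.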

Windows 7+VC10 is Better than Linux+gcc4.5

QHeap is significantly faster on Windows 7 than on Linux according to Table-5, which compares the QHeap numbers from Table-1 against those from Table-2 with Linux as the baseline. This indicates that an application should perform better on Windows 7 than on Linux if the application is compiled with VC10.

Table-5: QHeap on Windows 7 vs. on Linux

Heap	Threads	AllocTime	FreeTime
qheap	1	-32.3%	-44.0%
qheap	2	-32.9%	-42.0%
qheap	4	-28.4%	-38.0%
qheap	8	-31.3%	-31.9%
qheap	16	-42.0%	-46.1%
qheap	32	-42.7%	-37.4%
qheap	64	-40.4%	-45.8%
qheap	128	-47.3%	-41.1%
