TSC: A Powerful Tool for Measuring the CPU

2022-03-22 09:30:44

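Since the Pentium, x86 CPUs have included the TSC (Time Stamp Counter), a 64-bit register that increments automatically with the CPU clock and allows execution time to be measured at instruction granularity.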

CPU instructions

rdtsc: Read Time-Stamp Counter
rdtscp: Read Time-Stamp Counter and Processor ID

Invocation:

Microsoft Visual C++:

unsigned __int64 __rdtsc();
unsigned __int64 __rdtscp( unsigned int * AUX );

Linux & gcc :

extern __inline unsigned long long
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
__rdtsc (void) {
  return __builtin_ia32_rdtsc ();
}
extern __inline unsigned long long
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
__rdtscp (unsigned int *__A)
{
  return __builtin_ia32_rdtscp (__A);
}
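
As a minimal sketch of how these intrinsics are typically used with gcc or clang, the helper below times a short loop with `__rdtscp` from `<x86intrin.h>`. The function name `time_sum_ticks` and its parameters are made up for illustration and do not come from the original code.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc / __rdtscp with gcc and clang */

/* Hypothetical helper: returns the TSC ticks spent summing n bytes of buf. */
static uint64_t time_sum_ticks(const volatile unsigned char *buf, unsigned n,
                               unsigned *sum_out)
{
    unsigned aux;        /* receives IA32_TSC_AUX, written by rdtscp */
    unsigned sum = 0;

    uint64_t start = __rdtscp(&aux);
    for (unsigned i = 0; i < n; i++)
        sum += buf[i];   /* volatile loads, so the loop is not optimized away */
    uint64_t end = __rdtscp(&aux);

    *sum_out = sum;
    return end - start;  /* elapsed TSC ticks */
}
```

With Microsoft Visual C++ the same two intrinsics are declared in `<intrin.h>` instead.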

Examples:

1: Measuring L1 cache and memory latency:

Code:

{
    ......
    /* flush cache line */
    _mm_clflush(&data[0]);

    /* measure cache miss latency */
    ts = rdtscp(&ui);
    m |= data[0];
    te = rdtscp(&ui);
    CALC_MIN(csci[0], ts, te);

    /* measure cache hit latency */
    ts = rdtscp(&ui);
    m &= data[0];
    te = rdtscp(&ui);
    CALC_MIN(csci[1], ts, te);
}
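
The fragment above leaves out its setup, so here is a self-contained sketch of the same idea. The `CALC_MIN` definition, the `data` buffer, the `rounds` parameter and the `_mm_mfence()` after the flush are all assumptions, since the original does not show them; `__rdtscp` stands in for the author's `rdtscp` wrapper.

```c
#include <stdint.h>
#include <x86intrin.h>

/* Assumed definition: keep the smallest (te - ts) seen so far in min. */
#define CALC_MIN(min, ts, te) \
    do { uint64_t d_ = (te) - (ts); if (d_ < (min)) (min) = d_; } while (0)

/* One cache line of data; volatile so that every read is really performed. */
static volatile unsigned char data[64] __attribute__((aligned(64)));

/* Stores the minimum miss latency in csci[0] and the minimum hit latency
 * in csci[1], both in TSC ticks and both including the rdtscp overhead. */
static void measure_cache(uint64_t csci[2], int rounds)
{
    unsigned ui;
    unsigned char m = 0;
    uint64_t ts, te;

    csci[0] = csci[1] = UINT64_MAX;
    for (int i = 0; i < rounds; i++) {
        /* flush the cache line and wait for the flush to complete */
        _mm_clflush((const void *)&data[0]);
        _mm_mfence();

        /* cache miss: the line has to come from memory */
        ts = __rdtscp(&ui);
        m |= data[0];
        te = __rdtscp(&ui);
        CALC_MIN(csci[0], ts, te);

        /* cache hit: the line was just loaded */
        ts = __rdtscp(&ui);
        m &= data[0];
        te = __rdtscp(&ui);
        CALC_MIN(csci[1], ts, te);
    }
    (void)m;
}
```

Taking the minimum over many rounds filters out samples disturbed by interrupts or preemption.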

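Results:

Cost of the rdtscp instruction itself:

Question: reading 1 byte takes the same time as reading 8 bytes. Why?

2: Cycle counts of common integer operations and of multi-instruction sequences:

Code: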

{
    ......
    /* measure mul latency */
    ts = rdtscp(&ui);
    m *= *((U32 *)&data[0]);
    te = rdtscp(&ui);
    CALC_MIN(csci[2], ts, te);

    /* measure div latency */
    ts = rdtscp(&ui);
    m /= *((U32 *)&data[0]);
    te = rdtscp(&ui);
    CALC_MIN(csci[3], ts, te);

    /* measure 2*mul latency */
    ts = rdtscp(&ui);
    m *= *((U32 *)&data[0]);
    m *= *((U32 *)&data[0]);
    te = rdtscp(&ui);
    CALC_MIN(csci2[0], ts, te);

    /* double div */
    ts = rdtscp(&ui);
    m /= *((U32 *)&data[0]);
    m /= *((U32 *)&data[0]);
    te = rdtscp(&ui);
    CALC_MIN(csci2[1], ts, te);

    /* mul + div */
    ts = rdtscp(&ui);
    m *= *((U32 *)&data[0]);
    m /= *((U32 *)&data[0]);
    te = rdtscp(&ui);
    CALC_MIN(csci2[2], ts, te);

    /* measure float mul latency */
    ts = rdtscp(&ui);
    f = f * m;
    te = rdtscp(&ui);
    CALC_MIN(csci[4], ts, te);

    /* measure float div latency */
    while (!m)
        m = rand();
    ts = rdtscp(&ui);
    f = f / m;
    te = rdtscp(&ui);
    CALC_MIN(csci[5], ts, te);
}

Results:

Question: dividing both m and n takes only slightly longer than a single division of m, yet clearly less than two divisions of m. Why?

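Caveats:

1. Out-of-order execution: rdtsc has to be paired with cpuid or lfence so that, at the moment the counter is read, the instructions being measured have already finished executing. Later CPUs added rdtscp, which plays roughly the role of cpuid + rdtsc: it waits until all earlier instructions have executed before reading the counter, and its own latency is more stable than that of cpuid, whose cycle count fluctuates. Note that rdtscp is not a fully serializing instruction, so instructions that follow it may start executing before the counter is read. Not every CPU supports rdtscp; check with cpuid before using it, i.e. CPUID.80000001H:EDX.RDTSCP[bit 27] must be 1.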
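
The check and the fencing described in this item can be sketched as follows, assuming gcc or clang with `<cpuid.h>`. The helper names `has_rdtscp`, `tsc_begin` and `tsc_end` are hypothetical, and the exact fence placement is one common convention rather than the only valid one.

```c
#include <cpuid.h>       /* __get_cpuid (gcc / clang) */
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_lfence */

/* Returns 1 if CPUID.80000001H:EDX[27] (RDTSCP) is set, otherwise 0. */
static int has_rdtscp(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;        /* extended leaf not available */
    return (edx >> 27) & 1;
}

/* Start timestamp: lfence keeps earlier instructions from drifting into
 * the measured region and later ones from starting before the read. */
static inline uint64_t tsc_begin(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* End timestamp: rdtscp waits for earlier instructions to execute; the
 * trailing lfence keeps later instructions out of the measured region.
 * Call this only when has_rdtscp() returned 1. */
static inline uint64_t tsc_end(void)
{
    unsigned aux;
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}
```

On a CPU without rdtscp, calling `tsc_begin()` for the end timestamp as well is a reasonable fallback.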

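2. Multi-core systems: newer CPUs support the Invariant TSC feature, with which the cores see a consistent TSC by default; without it, the code being measured must not be migrated to another core while it runs. The TSC can also be modified through an MSR, which has to be kept in mind as well: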
Invariant TSC: Software can modify the value of the time-stamp counter (TSC) of a logical processor by using the WRMSR instruction to write to the IA32_TIME_STAMP_COUNTER MSR
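
One way to notice a migration during a measurement is to compare the IA32_TSC_AUX value that rdtscp returns with each read. The sketch below assumes the operating system programs a distinct value per logical CPU into that register (Linux does this); `time_on_one_cpu` and `small_work` are hypothetical names used only for illustration.

```c
#include <stdint.h>
#include <x86intrin.h>

/* Hypothetical workload used only for illustration. */
static volatile uint32_t acc = 1;
static void small_work(void) { acc = acc * 3u + 1u; }

/* Times fn() and checks that both TSC reads reported the same TSC_AUX.
 * Returns 1 and stores the tick count on success, or 0 if TSC_AUX
 * changed between the two reads, which means the thread migrated. */
static int time_on_one_cpu(void (*fn)(void), uint64_t *ticks)
{
    unsigned aux_start, aux_end;

    uint64_t ts = __rdtscp(&aux_start);
    fn();
    uint64_t te = __rdtscp(&aux_end);

    if (aux_start != aux_end)
        return 0;        /* sample spans two CPUs: discard it */
    *ticks = te - ts;
    return 1;
}
```

This cannot detect a thread that migrates away and back again between the two reads, so pinning the thread to one core remains the more robust option for longer measurements.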

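3. CPU frequency scaling: the first TSC implementation, the Variant TSC, did not account for frequency scaling, so the counter slows down in low-power states and can even stop. The later Constant TSC fixed the frequency-scaling problem but could still stop counting in deep C-states, which led to the newer Invariant TSC feature: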
The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor’s support for invariant TSC is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run at a constant rate in all ACPI P-, C-. and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource.
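
The CPUID bit named in the quotation can be tested along these lines, again assuming gcc or clang with `<cpuid.h>`; `has_invariant_tsc` is a hypothetical helper name.

```c
#include <cpuid.h>   /* __get_cpuid (gcc / clang) */

/* Returns 1 if CPUID.80000007H:EDX[8] (invariant TSC) is set, otherwise 0. */
static int has_invariant_tsc(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000007u, &eax, &ebx, &ecx, &edx))
        return 0;    /* leaf 80000007H is not available on this CPU */
    return (edx >> 8) & 1;
}
```

Inside a virtual machine the reported bit depends on what the hypervisor exposes, so a result of 0 there does not necessarily describe the host CPU.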

4. Overhead of the instruction itself:
Pentium Gold G5500T: 31 cycles
Core i7-7820HQ: 25 cycles
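
A figure of this kind can be estimated by timing two back-to-back reads and keeping the smallest difference. This is only a sketch of one possible method, not necessarily how the numbers above were obtained; `rdtscp_overhead` and `rounds` are hypothetical names.

```c
#include <stdint.h>
#include <x86intrin.h>

/* Returns the smallest difference observed between two consecutive
 * rdtscp reads over the given number of rounds, in TSC ticks. */
static uint64_t rdtscp_overhead(int rounds)
{
    unsigned aux;
    uint64_t best = UINT64_MAX;

    for (int i = 0; i < rounds; i++) {
        uint64_t ts = __rdtscp(&aux);
        uint64_t te = __rdtscp(&aux);
        if (te - ts < best)
            best = te - ts;
    }
    return best;
}
```

The result is in TSC ticks, which match core clock cycles only while the core runs at the TSC frequency. Subtracting this value from other measurements gives a rough correction for the timing overhead.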

5. Privileges (the instruction can be used in timing attacks such as Meltdown and Spectre):
CR4.TSD: Time Stamp Disable (bit 2 of CR4) — Restricts the execution of the RDTSC instruction to procedures running at privilege level 0 when set; allows RDTSC instruction to be executed at any privilege level when clear. This bit also applies to the RDTSCP instruction if supported (if CPUID.80000001H:EDX[27] = 1).

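6. Counter overflow: the counter is 64 bits wide. Even on a CPU clocked at 4 GHz it takes more than 100 years to overflow, so it can be ignored for our measurements.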
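
The arithmetic behind this estimate is 2^64 ticks divided by the tick rate. The sketch below works it out, assuming the TSC ticks at the nominal clock frequency and a year of 365.25 days; both helper names are hypothetical.

```c
#include <stdint.h>

/* Years until a 64-bit counter ticking at tsc_hz wraps around. */
static double years_until_tsc_wrap(double tsc_hz)
{
    const double ticks = 18446744073709551616.0;              /* 2^64 */
    const double seconds_per_year = 365.25 * 24.0 * 3600.0;
    return ticks / tsc_hz / seconds_per_year;
}

/* Elapsed ticks between two reads. Unsigned subtraction is modulo 2^64,
 * so the result stays correct even across a single wrap-around. */
static uint64_t tsc_elapsed(uint64_t start, uint64_t end)
{
    return end - start;
}
```

For 4 GHz, `years_until_tsc_wrap(4e9)` comes out at roughly 146 years.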

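7. Timing measurements are easily disturbed (thread scheduling, preemption, interrupts, virtualization and so on), so the measured instruction sequence should be as short as possible and the measurement should be repeated many times.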
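
A simple way to follow this advice is to collect many samples and look at robust statistics such as the minimum and the median rather than at a single reading. The sketch below is one possible harness; `measure`, `tsc_stats`, `cmp_u64` and `workload` are hypothetical names, and the caller must pass n >= 1.

```c
#include <stdint.h>
#include <stdlib.h>      /* qsort */
#include <x86intrin.h>

/* Orders two uint64_t values for qsort. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Hypothetical workload: one integer multiply-add on a volatile variable. */
static volatile uint32_t sink = 3;
static void workload(void) { sink = sink * 2654435761u + 1u; }

struct tsc_stats { uint64_t min, median; };

/* Times fn() n times, storing each sample in samples[0..n-1], then sorts
 * the samples and returns their minimum and median in TSC ticks. */
static struct tsc_stats measure(void (*fn)(void), uint64_t *samples, size_t n)
{
    unsigned aux;

    for (size_t i = 0; i < n; i++) {
        uint64_t ts = __rdtscp(&aux);
        fn();
        uint64_t te = __rdtscp(&aux);
        samples[i] = te - ts;
    }
    qsort(samples, n, sizeof samples[0], cmp_u64);

    struct tsc_stats s = { samples[0], samples[n / 2] };
    return s;
}
```

A large gap between the minimum and the median usually means that many samples were disturbed, for example by interrupts or by running inside a virtual machine.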