[FreeBSD] FreeBSD 9.2 ULE thread scheduling: keeping the cpu run queues balanced (the sched_balance function)

71v5

Posted 2014-06-29 23:02

Last edited by 71v5 on 2014-06-30 01:33

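As described in the post "freebsd9.2: when the mini_switch function is called: the thread's time slice runs out", when statclock expires, the statclock_cnt function
calls the scheduler's sched_clock function to do the per-tick processing. One of the jobs of sched_clock is to decide whether
sched_balance must be called to migrate threads between the cpu run queues and keep those queues balanced. The relevant code is quoted again here to make this clearer: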

2174 /*
2175  * Handle a stathz tick. This is really only relevant for timeshare
2176  * threads.
2177  */
2178 void
2179 sched_clock(struct thread *td)
2180 {
2181     struct tdq *tdq;
2182     struct td_sched *ts;
2183     /* 2185: tdq points to the struct tdq object of the current cpu. */
2184     THREAD_LOCK_ASSERT(td, MA_OWNED);
2185     tdq = TDQ_SELF();
2186 #ifdef SMP
2187 /***********************************************************************
2188  * We run the long term load balancer infrequently on the first cpu.
      *
      *     static struct tdq *balance_tdq;
      *     static int balance_ticks;
      *
      * 2190-2193:
      *   While the ULE scheduler is being initialized, the BSP sets balance_tdq
      *   to the address of the struct tdq object that belongs to the BSP.
      *   balance_ticks is the period, in ticks, at which sched_balance is called
      *   to migrate threads between the cpu run queues. It is also set during
      *   ULE initialization; the exact value is not important:
      *     balance_interval defaults to 128.
      *     balance_ticks = max(balance_interval / 2, 1);
      *     balance_ticks += random() % balance_interval;
      *   When the countdown in balance_ticks reaches 0, sched_balance is called
      *   to rebalance the cpu run queues.
      *   This code runs only on the BSP.
2189 *************/
2190     if (balance_tdq == tdq) {
2191         if (balance_ticks && --balance_ticks == 0)
2192             sched_balance();
2193     }
         ......
         ......
     }
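The countdown at lines 2190-2193 can be tried outside the kernel. The following is only a minimal user-space sketch, not the kernel code: `rearm_balance_ticks`, `fake_sched_balance`, `balance_tick` and `balance_runs` are hypothetical names introduced for illustration, and `rand()` stands in for the kernel's `random()`. With balance_interval set to 128, the re-armed value always falls between 64 and 191.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins for the kernel variables quoted above. */
static int balance_interval = 128;
static int balance_ticks;
static int balance_runs;            /* counts simulated sched_balance() calls */

/* Re-arm the countdown with a value in [interval/2, interval/2 + interval - 1]. */
static void rearm_balance_ticks(void)
{
    int half;

    assert(balance_interval > 0);
    half = balance_interval / 2;
    balance_ticks = half > 1 ? half : 1;          /* max(balance_interval / 2, 1) */
    balance_ticks += rand() % balance_interval;   /* the kernel calls random() here */
}

/* Simulated sched_balance(): it only re-arms the countdown and counts the call. */
static void fake_sched_balance(void)
{
    rearm_balance_ticks();
    balance_runs++;
}

/* One stathz tick on the balancing cpu, mirroring lines 2190-2193. */
static void balance_tick(void)
{
    if (balance_ticks && --balance_ticks == 0)
        fake_sched_balance();
}
```

Calling `rearm_balance_ticks()` once and then `balance_tick()` repeatedly triggers `fake_sched_balance()` exactly when the countdown reaches zero. Note that a countdown that starts at 0 never fires, because of the `balance_ticks &&` guard.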

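To simplify the analysis, assume the following:
1: The system has two physical CPUs with four physical cores each, so the logical cpu ids are 0, 1, 2, 3, 4, 5, 6, 7. Below,
these logical cpus are called cpu0, cpu1, cpu2 and so on, and their run queues are written tdq_cpu[0], tdq_cpu[1], tdq_cpu[2],
tdq_cpu[3] and so on.

2: The cpuset_t type is 1 byte wide and its bits are numbered from left to right. So,
for a cpumask variable of type cpuset_t encoded as 11111111, every bit that is set to 1 means the corresponding cpu
is checked. For example, if bit0 is set to 1, cpu0 is checked.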
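The 1-byte mask from assumption 2 can be modelled with a few helpers. This is only a sketch under that assumption: `toy_cpuset_t` and the `toy_cpu_*` functions are hypothetical names for illustration (the real cpuset_t is larger and is handled through the kernel's CPU_* macros), and bit n is simply stored as `1 << n`, since the left-to-right numbering above is only a way of writing the mask down.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t toy_cpuset_t;   /* hypothetical 1-byte stand-in for cpuset_t */

/* Set all eight bits: every cpu is a candidate. */
static void toy_cpu_fill(toy_cpuset_t *m)
{
    *m = 0xff;
}

/* Clear the bit of one cpu so that later searches skip it. */
static void toy_cpu_clr(int cpu, toy_cpuset_t *m)
{
    assert(cpu >= 0 && cpu < 8);
    *m &= (toy_cpuset_t)~(1u << cpu);
}

/* Return 1 when the cpu is still in the set, 0 otherwise. */
static int toy_cpu_isset(int cpu, const toy_cpuset_t *m)
{
    assert(cpu >= 0 && cpu < 8);
    return (*m >> cpu) & 1;
}

/* Return 1 when no cpu is left in the set. */
static int toy_cpu_empty(const toy_cpuset_t *m)
{
    return *m == 0;
}
```

After `toy_cpu_fill(&m)` and `toy_cpu_clr(6, &m)`, cpu6 is no longer a candidate while cpu0 through cpu5 and cpu7 still are.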

First, a brief look at the sched_lowest function. To see how it really works, compare it with the analysis of the sched_highest function.
[Function sched_lowest]:

/******************************************************************************************
 * Find the cpu with the least load via the least loaded path that has a
 * lowpri greater than pri. A pri of -1 indicates any priority is
 * acceptable.
 *
 * Parameters:
 *   cg:      the cpu_group to search.
 *   mask:    the set of cpus to check.
 *   pri:     can be ignored in this discussion.
 *   maxload: the important one. It is usually derived from the load of the most
 *            loaded cpu returned by the previous sched_highest call. When looking
 *            for the least loaded cpu, only cpus whose run-queue load is less than
 *            or equal to maxload are considered.
 *   prefer:  the logical cpu id returned by the previous sched_highest call. When
 *            the cpu being checked equals prefer, a constant is subtracted from
 *            the load of its run queue.
 * Return value:
 *   -1:   typically means that many threads were created in the meantime and added
 *         to the cpu run queues, so every checked cpu has a load above maxload.
 *   >= 0: the logical cpu id of the cpu with the least loaded run queue
 *         (0 is a valid result and means cpu0).
 * Once the meaning of these parameters is clear, the rest is a simple walk-through.
 ************************************/
744 static inline int
745 sched_lowest(const struct cpu_group *cg, cpuset_t mask, int pri, int maxload,
746     int prefer)
747 {
748     struct cpu_search low;
749
750     low.cs_cpu = -1;
751     low.cs_prefer = prefer;
752     low.cs_mask = mask;
753     low.cs_pri = pri;
754     low.cs_limit = maxload;
755     cpu_search_lowest(cg, &low);
756     return low.cs_cpu;
757 }
716 /*
717  * cpu_search instantiations must pass constants to maintain the inline
718  * optimization.
719  */
720 int
721 cpu_search_lowest(const struct cpu_group *cg, struct cpu_search *low)
722 {
723     return cpu_search(cg, low, NULL, CPU_SEARCH_LOWEST);
724 }
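The roles of mask, maxload and prefer are easier to see on a flat model. The sketch below is not cpu_search: it ignores the cpu_group topology and the pri parameter, `toy_sched_lowest` is a hypothetical name, and the weighting (load times 4, minus 1 for the preferred cpu) is an assumption chosen only so that prefer acts as a tie-breaker. It shares only the return convention with the listing: -1 when no cpu qualifies, otherwise a logical cpu id.

```c
#include <assert.h>
#include <stdint.h>

#define NCPU 8

/*
 * Hypothetical flat model of sched_lowest(): one load value per logical cpu.
 * Returns the id of the least loaded cpu that is in mask and whose load is
 * <= maxload, or -1 when no cpu qualifies.
 */
static int toy_sched_lowest(const int load[NCPU], uint8_t mask, int maxload,
    int prefer)
{
    int best = -1;
    int best_weight = 0;

    assert(prefer < NCPU);
    for (int cpu = 0; cpu < NCPU; cpu++) {
        int weight;

        if (!(mask & (1u << cpu)))      /* cpu is not in the search set */
            continue;
        if (load[cpu] > maxload)        /* too loaded to be a target */
            continue;
        weight = load[cpu] * 4;         /* assumed scaling, see the lead-in */
        if (cpu == prefer)
            weight -= 1;                /* small bonus: wins ties only */
        if (best == -1 || weight < best_weight) {
            best = cpu;
            best_weight = weight;
        }
    }
    return best;
}
```

With loads {3, 2, 1, 4, 2, 1, 5, 0}, a full mask and maxload 4, the function picks cpu7. Removing cpu7 from the mask leaves a tie between cpu2 and cpu5, which goes to cpu2 unless prefer is 5.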

[ULE scheduler: the sched_balance function]:

842 static void
843 sched_balance(void)
844 {
845     struct tdq *tdq;

/****************************************************************************************
 * Select a random time between .5 * balance_interval and
 * 1.5 * balance_interval.
 *
 * Variable rebalance: says whether the cpu run queues should be rebalanced at all.
 * Its initial value is 1 and it can be changed through the sysctl system call.
 *     static int rebalance = 1;
 * 851-852: recompute balance_ticks.
 * 853: if SMP has not been started (smp_started is 0) or rebalance is 0, return at once,
 *      because there is no need to rebalance the cpu run queues.
 * 855: only the BSP runs sched_balance and the logical cpu id of the BSP is 0, so
 *      tdq is &tdq_cpu[0] here.
 * 857: call sched_balance_group with the argument cpu_top.
 *      For cpu_top, see the post "freebsd9.2: ULE thread scheduling: creating the data
 *      structures that describe the CPU topology".
 **************************/
851     balance_ticks = max(balance_interval / 2, 1);
852     balance_ticks += random() % balance_interval;
853     if (smp_started == 0 || rebalance == 0)
854         return;
855     tdq = TDQ_SELF();
856     TDQ_UNLOCK(tdq);
857     sched_balance_group(cpu_top);
858     TDQ_LOCK(tdq);
859 }

[Function sched_balance_group]:

798 static void
799 sched_balance_group(struct cpu_group *cg)
800 {
/********************************************************************************
 * Local variables:
 *   hmask: used when sched_highest looks for the cpu with the most loaded run queue.
 *          It is a cpu bitmap; every cpu whose bit is 1 is checked.
 *   lmask: like hmask, but used when looking for the cpu with the least loaded run queue.
 *   high:  the logical cpu id of the cpu with the most loaded run queue.
 *   low:   the logical cpu id of the cpu with the least loaded run queue.
 **************************************/
801     cpuset_t hmask, lmask;
802     int high, low, anylow;
803
/***************************************************************************************************
 * 804: before the for loop, set every bit of hmask to 1, meaning that all cpus in the
 *      system are checked.
 * 805-839: an infinite for loop that ends as soon as one of these conditions holds:
 *      Condition 1: sched_highest returns -1. Normally this means that none of the
 *                   checked cpus has a transferable thread on its run queue.
 *      Condition 2: every bit of lmask is 0, so there is no cpu left to check.
 *      Condition 3: anylow is 1 and sched_lowest returns -1.
 * 810: reaching this line means that sched_highest succeeded. Assume it returned 6
 *      (high is 6), i.e. the run queue tdq_cpu[6] of cpu6 has the highest load. The
 *      CPU_CLR macro clears bit6 in hmask, so the next sched_highest call will not
 *      check cpu6.
 * 811: copy the updated hmask into lmask. sched_lowest checks the run-queue load of
 *      the cpus contained in lmask and picks the least loaded one.
 * 813-815: when lmask is empty (all bits are zero) there is no cpu to check and calling
 *      sched_lowest would be pointless, so leave the for loop. Otherwise set anylow to 1.
 * 817-818: call sched_lowest to find the cpu with the least loaded run queue. Assume it
 *      returned 2 (low is 2), i.e. the run queue tdq_cpu[2] of cpu2 has the lowest load.
 * 820-821: this is condition 3 above.
 * 823-824: sched_lowest returned -1 after at least one candidate had already been
 *      rejected (anylow is 0). Continue the for loop with the next most loaded cpu;
 *      high has already been cleared from hmask. If some threads terminated
 *      unexpectedly or called exit in the meantime, sched_lowest may return a
 *      meaningful value on the next check.
 * 825-828: at this point high is the most loaded cpu and low is the least loaded cpu.
 *      sched_balance_pair now migrates a thread from run queue tdq_cpu[6] to run queue
 *      tdq_cpu[2]. The call has this form:
 *          sched_balance_pair(&tdq_cpu[6], &tdq_cpu[2]);
 * 826-828: if sched_balance_pair returns non-zero, clear bit2 in hmask. As a result the
 *      next iterations of the for loop check neither cpu6 nor cpu2.
 * 830-837: if sched_balance_pair returns 0, clear bit2 in lmask, set anylow to 0 and
 *      jump to nextlow to look for the next least loaded cpu.
 ***********************/
804     CPU_FILL(&hmask);
805     for (;;) {
806         high = sched_highest(cg, hmask, 1);
807         /* Stop if there is no more CPU with transferrable threads. */
808         if (high == -1)
809             break;
810         CPU_CLR(high, &hmask);
811         CPU_COPY(&hmask, &lmask);
812         /* Stop if there is no more CPU left for low. */
813         if (CPU_EMPTY(&lmask))
814             break;
815         anylow = 1;
816 nextlow:
817         low = sched_lowest(cg, lmask, -1,
818             TDQ_CPU(high)->tdq_load - 1, high);
819         /* Stop if we looked well and found no less loaded CPU. */
820         if (anylow && low == -1)
821             break;
822         /* Go to next high if we found no less loaded CPU. */
823         if (low == -1)
824             continue;
825         /* Transfer thread from high to low. */
826         if (sched_balance_pair(TDQ_CPU(high), TDQ_CPU(low))) {
827             /* CPU that got thread can no longer be a donor. */
828             CPU_CLR(low, &hmask);
829         } else {
830             /*
831              * If failed, then there is no threads on high
832              * that can run on this low. Drop low from low
833              * mask and look for different one.
834              */
835             CPU_CLR(low, &lmask);
836             anylow = 0;
837             goto nextlow;
838         }
839     }
840 }
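The control flow of the loop (the high mask, the low mask, anylow and the nextlow retry) can be replayed on the 8-cpu example. This is a simplified user-space sketch, not the kernel routine: `struct toy_tdq`, `pick_highest`, `pick_lowest`, `toy_balance_pair` and `toy_balance_group` are hypothetical names, the topology is flat, locking and IPIs are left out, and the `allowed` field is an assumed, crude stand-in for thread affinity that lets a pair migration fail.

```c
#include <assert.h>
#include <stdint.h>

#define NCPU 8

struct toy_tdq {                /* hypothetical, much reduced run queue */
    int load;                   /* threads on the queue */
    int transferable;           /* threads that may be migrated */
    uint8_t allowed;            /* cpus those threads may run on (assumed model) */
};

/* Most loaded cpu in mask with load >= minload and a transferable thread. */
static int pick_highest(const struct toy_tdq q[NCPU], uint8_t mask, int minload)
{
    int best = -1;

    for (int cpu = 0; cpu < NCPU; cpu++) {
        if (!(mask & (1u << cpu)) || q[cpu].load < minload ||
            q[cpu].transferable == 0)
            continue;
        if (best == -1 || q[cpu].load > q[best].load)
            best = cpu;
    }
    return best;
}

/* Least loaded cpu in mask with load <= maxload, or -1. */
static int pick_lowest(const struct toy_tdq q[NCPU], uint8_t mask, int maxload)
{
    int best = -1;

    for (int cpu = 0; cpu < NCPU; cpu++) {
        if (!(mask & (1u << cpu)) || q[cpu].load > maxload)
            continue;
        if (best == -1 || q[cpu].load < q[best].load)
            best = cpu;
    }
    return best;
}

/* Move one thread from high to low; returns 1 on success, 0 otherwise. */
static int toy_balance_pair(struct toy_tdq q[NCPU], int high, int low)
{
    assert(high != low);
    if (q[high].transferable == 0 || q[high].load <= q[low].load ||
        !(q[high].allowed & (1u << low)))
        return 0;
    q[high].load--;
    q[high].transferable--;
    q[low].load++;
    q[low].transferable++;
    return 1;
}

/* Same loop shape as lines 804-839; returns the number of migrations. */
static int toy_balance_group(struct toy_tdq q[NCPU])
{
    uint8_t hmask = 0xff, lmask;
    int high, low, anylow, moves = 0;

    for (;;) {
        high = pick_highest(q, hmask, 1);
        if (high == -1)
            break;                              /* no donor left */
        hmask &= (uint8_t)~(1u << high);
        lmask = hmask;
        if (lmask == 0)
            break;                              /* no target left */
        anylow = 1;
nextlow:
        low = pick_lowest(q, lmask, q[high].load - 1);
        if (anylow && low == -1)
            break;                              /* nothing is less loaded */
        if (low == -1)
            continue;                           /* try the next donor */
        if (toy_balance_pair(q, high, low)) {
            hmask &= (uint8_t)~(1u << low);     /* receiver cannot donate */
            moves++;
        } else {
            lmask &= (uint8_t)~(1u << low);     /* drop this target, retry */
            anylow = 0;
            goto nextlow;
        }
    }
    return moves;
}
```

With cpu6 at load 5, cpu2 at load 0 and every other cpu at load 2, one pass moves a single thread from cpu6 to cpu2 and then stops, because no remaining cpu is less loaded than the next donor. Each cpu takes part in at most one migration per pass, which is why the kernel repeats the pass periodically instead of looping until the loads are equal.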

[Function sched_balance_pair], which migrates threads between two run queues:

/********************************************************************
 * Transfer load between two imbalanced thread queues.
 *
 * Parameters:
 *   high: the most loaded run queue, here tdq_cpu[6].
 *   low:  the least loaded run queue, here tdq_cpu[2].
 * For the meaning of the return value of sched_balance_pair, see the
 * analysis of the return value of tdq_move below.
 ****************************/
889 static int
890 sched_balance_pair(struct tdq *high, struct tdq *low)
891 {
/*********************************************************************************
 * Local variables:
 *   moved: the number of threads migrated (the return value of tdq_move).
 *   cpu:   the logical cpu id of the cpu that owns run queue low.
 * 901-912: check once more that the migration conditions hold; tdq_move is analysed below.
 *   Normally the first two conditions of the if statement are satisfied. In that case,
 *   when tdq_move returns 1, check whether an inter-processor interrupt (IPI) has to be
 *   sent to the cpu that owns run queue low. If time permits, a later post will cover
 *   how inter-processor interrupts are sent and handled.
 *******************************************/
892     int moved;
893     int cpu;
894
895     tdq_lock_pair(high, low);
896     moved = 0;
897     /*
898      * Determine what the imbalance is and then adjust that to how many
899      * threads we actually have to give up (transferable).
900      */
901     if (high->tdq_transferable != 0 && high->tdq_load > low->tdq_load &&
902         (moved = tdq_move(high, low)) > 0) {
903         /*
904          * In case the target isn't the current cpu IPI it to force a
905          * reschedule with the new workload.
906          */
907         cpu = TDQ_ID(low);
908         sched_pin();
909         if (cpu != PCPU_GET(cpuid))
910             ipi_cpu(cpu, IPI_PREEMPT);
911         sched_unpin();
912     }
913     tdq_unlock_pair(high, low);
914     return (moved);
915 }