Spark的动态资源分配算法

十念 · 2024-7-27 16:32:19

前言

在《Spark RPC框架详解》这篇文章中，介绍了Spark RPC通信的基本流程。可以看到，Spark中的Driver通过Stage调度生成了物理执行计划，这个物理执行计划包罗了所有必要运行的Task，以及，最关键的，这些Task希望运行的节点信息，我们叫做Locality Preference，即本地性偏好。
但是在Yarn的场景下，资源是以Executor(Container)为单位进行调度，整个Container的粒度与Task不一一对应，Container的生命周期与Task不一一对应。在这种场景下，动态资源调度的基本使命就是

根据这些Task以及Task的本地性偏好，向Yarn申请Container作为Executor以执行Task；这里最紧张的，是在资源申请中表达出对应的Task的位置偏好。
在申请完资源以后，根据Task的本地性偏好，将Task调度到申请到的资源上面去。这里最紧张的，是尽量满足Task的本地性偏好。

本文将具体讲解这个过程。
对于资源申请算法的基本流程，以及将Task和资源进行匹配的基本流程，本文都用实际例子进行讲解。
基于使命需求进行资源请求的整体过程

向Yarn请求资源是由客户端向ApplicationMaster申请，然后ApplicationMaster向Yarn发起请求的，而不是客户端直接向Yarn申请的。
资源是为了服务于Task的运行，Task的生成显然是Driver端负责的，Driver会根据物理执行计划生成的Task信息发送给ApplicationMaster，ApplicationMaster根据这些Task的相干信息进行资源申请。
ApplicationMaster启动以后，会有一个独立线程不断通过调用YarnAllocator.allocateResources()进行连续的资源更新(查看ApplicationMaster的launchReporterThread()方法)。这里叫资源更新，而不叫资源申请，由于这里的操纵包罗新的资源的申请，旧的无用的Container的取消，以及Blocklist Node的更新等多种操纵。
总而言之，ApplicationMaster作为客户端和Yarn的中心方，其资源申请的方法allocateResource()在逻辑上的功能为：

粒度转换: 将Task级别的资源请求，转换为Container(Executor级别的资源请求)。这是一个游戏到粗的粒度的转换。
维度转换: Driver发过来的资源请求是资源的最终全局状态，而Yarn 的 API 要求的针对每一个Container进行增量请求。因此，allocateResources()会将Driver发送过来资源请求的最终状态，对比当前系统已经运行、分配未运行、已经发送请求但是还没有分配资源等等已经存在的状态，确定一个发送给Yarn的增量请求状态。这是一个全量到增量的维度的转换。
角度转换: Driver发过来的每个Task都带有各自Task的Locality，而发送给Yarn的Container请求又是带有Locality需求的Container需求。这是一个从Task到Container的角度的转换。

ApplicationMaster端的allocateResources()方法的基本流程在代码YarnAllocator.allocateResources()中:

---------------------------------- YarnAllocator ------------------------------------
def allocateResources(): Unit = synchronized {
updateResourceRequests()
// 与Yarn进行资源交互
val allocateResponse = amClient.allocate(progressIndicator) // 从Yarn端获取资源结果，包括新分配的、已经结束的等等
val allocatedContainers = allocateResponse.getAllocatedContainers()
handleAllocatedContainers(allocatedContainers.asScala) // 处理新分配的Container
val completedContainers = allocateResponse.getCompletedContainersStatuses() // 处理已经结束的Container
processCompletedContainers(completedContainers.asScala)
}
}

复制代码

ApplicationMaster端的allocateResources()的基本流程如下图所示:

生成资源请求，即将Driver发送过来的全量的、Task粒度的资源请求和Host偏好信息，转换为对Yarn的、以Executor为粒度的资源请求
1. updateResourceRequests()
复制代码
将资源请求发送给Yarn并从Yarn上获取分配结果(基于Yarn的异步调度策略，这次获取的记过并非本次资源请求的分配结果)以进行后续处置惩罚：
1. val allocatedContainers = allocateResponse.getAllocatedContainers()
2. handleAllocatedContainers(allocatedContainers.asScala)
3. val completedContainers = allocateResponse.getCompletedContainersStatuses()
4. processCompletedContainers(completedContainers.asScala)
复制代码

可以看到，updateResourceRequests()
是资源请求的核心方法，它会负责同Yarn进行通信以进行资源请求。
在《超时导致SparkContext构造失败的题目探究》中，我们也介绍过，生成资源请求，其决定过程发生在方法updateResourceRequests()
中。我们紧张来看updateResourceRequests()
方法：

---------------------------------- YarnAllocator ------------------------------------ def updateResourceRequests()
: Unit = { // 获取已经发送给Yarn但是待分配的ContainerRequest，计算待分配容器请求的数量 // 这些ContainerRequest是之前通过调用amClient.addContainerRequest 发送出去的 val pendingAllocate = getPendingAllocate val numPendingAllocate = pendingAllocate.size // 还没有发送请求的executor的数量 val missing = targetNumExecutors - numPendingAllocate - numExecutorsStarting.get - numExecutorsRunning.get // 还没有发送给Yarn的资源请求 if (missing > 0) { /** * 将待处置惩罚的container请求分为三组：本地匹配列表、本地不匹配列表和非本地列表。 */ val (localRequests, staleRequests, anyHostRequests) = splitPendingAllocationsByLocality(
hostToLocalTaskCounts, pendingAllocate)
// staleRequests 的意思是，ApplicationMaster已经请求了这个Container， // 但是这个ContainerRequest所要求的hosts里面没有一个是在 hostToLocalTaskCounts (即task所倾向于)中的，因此，必要取消这个Container Request，由于已经没故意义了 // cancel "stale" requests for locations that are no longer needed staleRequests.foreach { stale => amClient.removeContainerRequest(stale) } val cancelledContainers = staleRequests.size // consider the number of new containers and cancelled stale containers available // 将新的container请求，以及刚刚取消的container，作为available container val availableContainers = missing + cancelledContainers // to maximize locality, include requests with no locality preference that can be cancelled // 在availableContainers的基础上，再算上没有任何locality要求的并且还没有分配成功的container val potentialContainers = availableContainers + anyHostRequests.size // LocalityPreferredContainerPlacementStrategy，计算每一个Container 的Node locality和 Rack locality val containerLocalityPreferences = containerPlacementStrategy.localityOfRequestedContainers( potentialContainers, numLocalityAwareTasks, hostToLocalTaskCounts, allocatedHostToContainersMap, localRequests) val newLocalityRequests = new mutable.ArrayBuffer[ContainerRequest] // 遍历ContainerLocalityPreferences数组中的每一个ContainerLocalityPreferences containerLocalityPreferences.foreach { case ContainerLocalityPreferences(nodes, racks) if nodes != null => newLocalityRequests += createContainerRequest(resource, nodes, racks)// 根据获取的locality，重新创建ContainerRequest请求 } // 除了有locality需求的container以外，还有更多的available container必要被请求，因此对这些container请求也发送出去 if (availableContainers >= newLocalityRequests.size) { // more containers are available than needed for locality, fill in requests for any host for (i <- 0 until (availableContainers - newLocalityRequests.size)) { newLocalityRequests += createContainerRequest(resource, null, null) // 构造ContainerRequest对象 } } else { val numToCancel = newLocalityRequests.size - availableContainers // cancel some requests without locality preferences to schedule more local containers anyHostRequests.slice(0, numToCancel).foreach { nonLocal => amClient.removeContainerRequest(nonLocal) } } } else if (numPendingAllocate > 0 && missing < 0) { val numToCancel = math.min(numPendingAllocate, -missing) val matchingRequests = amClient.getMatchingRequests(RM_REQUEST_PRIORITY, ANY_HOST, resource) matchingRequests.iterator().next().asScala .take(numToCancel).foreach(amClient.removeContainerRequest) } }

复制代码

其基本过程为：

获取当前Pending的request(已经发送给Yarn但是还没有分配Container的请求)，并将这些Pending的请求按照本地性的需求进行切分。这里的基本意图是，当前收到了来自Driver的全局的资源状态信息，而在Yarn上还有一部门之前的资源请求还没有分配Container，那么，会不会这些Pending Requewt中有些Request已经不必要了(满足不了任何一个task的locality需求)？
1. val (localRequests, staleRequests, anyHostRequests) = splitPendingAllocationsByLocality(
2. hostToLocalTaskCounts, pendingAllocate)
复制代码
切分的过程，就是查看当前在Yarn这一端pending的所有的Container的locality与我们目前需求的所有(全局)的Task的locality的交集:
1. -------------------------------------- YarnAllocator ----------------------------------
2. private def splitPendingAllocationsByLocality(
3. hostToLocalTaskCount: Map[String, Int], // 每一个host到希望分配上去的task的数量
4. pendingAllocations: Seq[ContainerRequest] // 还没有分配出去的ContainerRequest
5. ): (Seq[ContainerRequest], Seq[ContainerRequest], Seq[ContainerRequest]) = {
6. val localityMatched = ArrayBuffer[ContainerRequest]()
7. val localityUnMatched = ArrayBuffer[ContainerRequest]()
8. val localityFree = ArrayBuffer[ContainerRequest]()
10. val preferredHosts = hostToLocalTaskCount.keySet
11. // 将当前已经发送给Yarn但是还没有分配的Container的请求进行切分
12. pendingAllocations.foreach { cr =>
13. val nodes = cr.getNodes // 这个 ContainerRequest 对节点的要求
14. if (nodes == null) {
15. localityFree += cr // 这个ContainerRequest对nodes没有要求，那么就是对本地性没有要求的Container请求
16. } else if (nodes.asScala.toSet.intersect(preferredHosts).nonEmpty) { // 这个Container的本地性要求和task期望分配的hosts集合有交集
17. localityMatched += cr // 把这个ContainerRequest添加到localityMatched的ContainerRequest集合中去
18. } else { // 这个Container的本地性要求和task期望分配的hosts集合没有交集
19. localityUnMatched += cr // 把这个ContainerRequest添加到localityMatched的ContainerRequest集合中去
20. }
21. }
22. // 切分结果 (localRequests, staleRequests, anyHostRequests)
23. (localityMatched.toSeq, localityUnMatched.toSeq, localityFree.toSeq)
24. }
26. }
复制代码
切分过程的具体流程如下图所示:

切分的具体流程为 :
- 如果这个Pending Container没有任何的locality要求，那么就是localityFree Container，即，其实际分配的位置有大概是当前所有tasks所希望的位置，也大概不是，那么这个container就是localityFree container
  1. if (nodes == null) {
  2. localityFree += cr // 这个ContainerRequest对nodes没有要求，那么就是对本地性没有要求的Container请求
  3. }
  复制代码
- 如果这个Pending Container有locality 要求，并且这个locality的nodes与当前所有tasks有交集，那么这个Pending Container就被划分为localityMatched，显然，这个Pending Container是不应该被取消的；
  1. else if (nodes.asScala.toSet.intersect(preferredHosts).nonEmpty) { // 这个Container的本地性要求和task期望分配的hosts集合有交集
  2. localityMatched += cr // 把这个ContainerRequest添加到localityMatched的ContainerRequest集合中去
  3. }
  复制代码
- 如果这个Pending Container有locality要求，但是这个locality中的nodes不在当前所有tasks的locality中的任何一个节点，即这个Pending Container实际分配的位置不大概是任何一个task所倾向于的位置，那么这个Pending Container就是localityUnMatched，显然，localityUnMatched container目前无法放置任何一个task，必要取消掉；
  1. else { // 这个Container的本地性要求和task期望分配的hosts集合没有交集
  2. localityUnMatched += cr // 把这个ContainerRequest添加到localityMatched的ContainerRequest集合中去
  3. }
  复制代码
对于localityUnmatched container，向Yarn发送请求，取消这种Container，这些被取消的Container在后面会重新申请，以便在申请的资源总量不变的情况下增强资源的本地特性:
1. staleRequests.foreach { stale =>
2. amClient.removeContainerRequest(stale)
3. }
复制代码
计算总的Container的数量，包罗：
- Pending Container中刚刚cancel的container的数量，这些Container刚刚取消了，我们可以再次申请这些Container，但是肯定会增强这些新的资源请求的locality，以最大化我们的Task的locality
- Pending Container中的locality free的Container数量，这些Container大概分配在集群中的任何地方
- 新增(missing)的Container请求，即当前的总的container请求中除去正在运行(已经有task在运行，numExecutorsRunning)和正在启动(已经分配但是还没分配task，numExecutorsStarting)的，再除去所有的pending的container(numPendingAllocate，是从Yarn的API中获取的数量，已经请求但是还没有分配成功的资源)，多出来的Container:
1. // 还没有发送请求的executor的数量
2. val missing = targetNumExecutors - numPendingAllocate -
3. numExecutorsStarting.get - numExecutorsRunning.get
4. .....
5. // consider the number of new containers and cancelled stale containers available
6. // 将新的container请求，以及刚刚取消的container，作为available container
7. val availableContainers = missing + cancelledContainers
9. // to maximize locality, include requests with no locality preference that can be cancelled
10. // 在availableContainers的基础上，再算上没有任何locality要求的并且还没有分配成功的container
11. val potentialContainers = availableContainers + anyHostRequests.size
复制代码
在这里，val availableContainers = missing + cancelledContainers，即available container代表这次可以增量申请的最大的container数量，包罗了这次的额外需求，以及刚刚取消的container(取消的container可以重新申请)
构建Container请求。这里会根据LocalityPreferredContainerPlacementStrategy的localityOfRequestedContainers来构建Container请求，返回Array[ContainerLocalityPreferences]，每一个ContainerLocalityPreferences代表了一个带有对应host和rack信息的Container请求:
1. val containerLocalityPreferences = containerPlacementStrategy.localityOfRequestedContainers(
2. potentialContainers, numLocalityAwareTasks, hostToLocalTaskCounts,
3. allocatedHostToContainersMap, localRequests)
复制代码
根据ContainerLocalityPreferences，转换成Yarn的 ContainerRequest
1. containerLocalityPreferences.foreach {
2. case ContainerLocalityPreferences(nodes, racks) if nodes != null =>
3. newLocalityRequests += createContainerRequest(resource, nodes, racks)// 根据获取的locality，重新创建ContainerRequest请求
4. case _ =>
5. }
复制代码
如果可以申请的Container(available container)的数量大于刚刚计算完locality的Container数量，那么，为了将申请配额用尽，就再申请其相差的部门Container，保证申请的Container的数量不小于Available Container的数量。
1. if (availableContainers >= newLocalityRequests.size) {
2. // more containers are available than needed for locality, fill in requests for any host
3. for (i <- 0 until (availableContainers - newLocalityRequests.size)) {
4. newLocalityRequests += createContainerRequest(resource, null, null) // 构造ContainerRequest对象
5. }
6. }
复制代码
如果可以申请的Container(available container)的数量小于刚刚计算完locality的Container数量，那么必要取消一部门container:
1. else {
2. val numToCancel = newLocalityRequests.size - availableContainers
3. // cancel some requests without locality preferences to schedule more local containers
4. anyHostRequests.slice(0, numToCancel).foreach { nonLocal =>
5. amClient.removeContainerRequest(nonLocal)
6. }
复制代码
调用Yarn的尺度接口addContainerRequest()，将ContainerRequest发送给Yarn(其实这个接口并不会真正将请求发送出去，只会存放在RMAMClient端，真正发送是通过allocate()接口)：
1. newLocalityRequests.foreach { request =>
2. amClient.addContainerRequest(request)
3. } // 在这里发送container的请求，从日志来看，资源请求已经发出来了，Yarn已经处理了
复制代码

所以，从上面可以看到，最关键的方法是LocalityPreferredContainerPlacementStrategy.localityOfRequestedContainers()方法，它根据当前的已有信息(统共的Container需求，有locality需求的task的数量，这些locality分布在每一个task上的数量等)，生成一个Array[ContainerLocalityPreferences]数组，数组中的每一个元素代表了一个Container的需求，并包含了其locality的要求信息，然后基于生成的ContainerLocalityPreferences经过转换成ContainerRequest，发送给Yarn。
资源申请的生成过程详解

资源申请的生成，就是根据当前集群运行的基本情况，Task的基本需求，生成Yarn上的资源请求的过程。
资源申请的生成过程的简单例子

在相识其具体实现以前，我们以具体例子的方式，看一下localityOfRequestedContainers()方法的基本实现逻辑，从而对其动机和达成的效果有一个很好的理解，然后，我们再看其实现细节。

从使命调度去看，看到的是Task以及每个 Task的Locality倾向。比如，现在我们一共必要为30个Task分配资源，其中，20个Task的locality倾向为Host1,Host2,Host3，10个Task的Locality倾向为Host1, Host2, Host4，因此，对应到每个Host上的Task权重如下表所示：
Host 1Host 2Host 3Host 420 Tasks20202010 Tasks101010Sum of Tasks30302010即，20个Task希望分配在Host1,Host2,Host3中的任何一个，有10个Task希望分配在Host1, Host2, Host4中的任何一个。如上表所示，综合来看，所有Task在四台机器上分配的权重是(30, 30, 20,10)。
假设一个Task必要的vCore是1，而一个Container(Executor)有2个vCore，因此，转换成Container以后的结果如下表所示：
Host 1Host 2Host 3Host 420 Tasks20202010 Tasks101010Sum of Tasks30302010Sum of Containers1515105上面的Sum of Containers的数字只是表现一个比例值，并不表现对应的Host上实际必要申请的Container的数量，我们实际必要的总的Container数量才15个。那么，这15个Container需求均匀到每台Host上是多少呢？
比如Host 1的Sum of Container 为15，所有Host的Sum of Container 是45，因此占比是1/3，所以均匀下来分配到Host1上的Container数量应该是 15 * 1/3 = 5。经过向上取整(宁可轻微多分配也不要少分配)以后，每台机器所均匀到的15个Container需求是:
Host 1Host 2Host 3Host 420 Tasks20202010 Tasks101010Sum of Tasks30302010Sum of Containers1515105Allocated Container Target5542
在这里，计算完总的Allocated Container Target以后，必要减去当前已经在该Host上已经存在(正在运行或者在这个Host上pending的Container)，由于我们最终发送给Yarn的Container请求是增量请求。假设现在在每一个Host上已经存在的Container数量都是1，即15个Container中有4个Container是已经分配的，那么，减去已经存在的Container数量以后的结果如下表所示，所以，我们必要新申请12个Container:
Host 1Host 2Host 3Host 420 Tasks20202010 Tasks101010Sum of Tasks30302010Sum of Containers1515126Allocated Container Target5542Newly Allocated Container4431
将每个Host的Newly Allocated Container按照比例进行缩放，保证比例最大的那个Host(这里是Host1 和 Host 2)的比例值是必要新申请的Container的数量。在这里，扩大因子应该是 12(Container的总数量)/4(比例最大的Host的Average Allocated Container) = 3 :
Host 1Host 2Host 3Host 420 Tasks20202010 Tasks101010Sum of Tasks30302010Sum of Containers1515105Allocated Container Target5542Newly Allocated Container4431Round Up121293
开始发起资源请求。每一个Container请求的Locality中包含的Host如下表所示：
Host 1Host 2Host 3Host 33 Containers✔✔✔✔6 Containers✔✔✔3 Containers✔✔其寄义是：
- 3个Container请求的Locality是[Host1, Host2, Host3, Host4]，即请求Yarn分配3个Container，并且尽量将它们分配在这4个Hosts中。此时剩下的Host比例为[9:9:6:0]
- 6个Container请求的Locality是[Host1, Host2, Host3]，即请求Yarn分配6个Container，并且尽量将它们分配在这3个Hosts中，此时剩下的Host中container比例为[3:3:0]
- 3个Container请求的Locality是[Host1, Host2]，即请求Yarn分配3个Container，并且尽量将它们分配在这2个Hosts中
这样，所有Host的Container比例就是12:12:9:3，均匀到12个需分配的Container以后的比例是4:4:3:1，再加上已经分配在每个host上的1个Container，那么总的Container在每个Host上的比例就是5:5:4:2，这个比例和我们直接根据每个Host的task比例折算成的Container的比例15:15:10:5是大抵相近的。
到了这里，我们可以理解了，为什么我们必要在步调5 做Round Up操纵，并且Round Up的目标是将目前比例值最大的Host的比例值扩大为当前Container需求的最大值? 由于在步调6中生成Container请求的时间，比例值最大的Host的比例值肯定是等于必要申请的Container数量的。

资源调度算法的代码解析

上面以实际例子解释了Spark将当前的Task的Locality需求信息转换成Yarn的资源请求的细节。下面，我们结合代码，具体看一下localityOfRequestedContainers()方法的实现细节：

def localityOfRequestedContainers(
numContainer: Int, // 需要进行计算的container的数量，包括missing的，cancel掉的(本地性不符合任何task要求的pending container)，以及对本地性没有要求的pending的container
numLocalityAwareTasks: Int, // 对locality有要求的task的数量，这个是Driver端通过stageIdToExecutorPlacementHints计算然后通过RequestExecutor传递过来的
hostToLocalTaskCount: Map[String, Int], // 在Stage提交了以后，这个map里面保存了从host到期望分配到这个host的task的数量，这个是Driver端通过stageIdToExecutorPlacementHints传递过来的
allocatedHostToContainersMap: HashMap[String, Set[ContainerId]], // 已经launch起来的host -> container的映射关系
localityMatchedPendingAllocations: Seq[ContainerRequest] // 对本地性有要求的pending的container
): Array[ContainerLocalityPreferences] = {
// 预期的从host到期望在上面再launch的新的container数量的映射关系
val updatedHostToContainerCount = expectedHostToContainerCount(
numLocalityAwareTasks, hostToLocalTaskCount, allocatedHostToContainersMap,
localityMatchedPendingAllocations)
// 希望再launch的所有Host上的container的数量之和，在这里的例子中，是15
val updatedLocalityAwareContainerNum = updatedHostToContainerCount.values.sum
// The number of containers to allocate, divided into two groups, one with preferred locality,
// and the other without locality preference.
// 没有locality需求的container的数量
val requiredLocalityFreeContainerNum =
math.max(0, numContainer - updatedLocalityAwareContainerNum)
// 有locality需求的container的数量
val requiredLocalityAwareContainerNum = numContainer - requiredLocalityFreeContainerNum
val containerLocalityPreferences = ArrayBuffer[ContainerLocalityPreferences]()
if (requiredLocalityFreeContainerNum > 0) { // 如果有container是没有locality需求的
for (i <- 0 until requiredLocalityFreeContainerNum) {
containerLocalityPreferences += ContainerLocalityPreferences( // 为这些没有locality需求的container一一创建container需求
null.asInstanceOf[Array[String]], null.asInstanceOf[Array[String]])
}
}
if (requiredLocalityAwareContainerNum > 0) { // 如果有container有locality需求
val largestRatio = updatedHostToContainerCount.values.max // 全局的所有host中最大的container数量
// Round the ratio of preferred locality to the number of locality required container
// number, which is used for locality preferred host calculating.
var preferredLocalityRatio = updatedHostToContainerCount.map { case(k, ratio) =>
val adjustedRatio = ratio.toDouble * requiredLocalityAwareContainerNum / largestRatio
(k, adjustedRatio.ceil.toInt) // 往上取整
}
// 每个有locality需求的的Container request，为他们确定对应的hosts和rack
for (i <- 0 until requiredLocalityAwareContainerNum) {
// Only filter out the ratio which is larger than 0, which means the current host can
// still be allocated with new container request.
val hosts = preferredLocalityRatio.filter(_._2 > 0).keys.toArray // 还有container可以分配的一个或者多个hosts
val racks = hosts.map { h =>
resolver.resolve(yarnConf, h) // 解析这些host所在的rack
}.toSet
// 每一个ContainerLocalityPreferences代表一个Container
containerLocalityPreferences += ContainerLocalityPreferences(hosts, racks.toArray)
// Minus 1 each time when the host is used. When the current ratio is 0,
// which means all the required ratio is satisfied, this host will not be allocated again.
preferredLocalityRatio = preferredLocalityRatio.map { case (k, v) => (k, v - 1) }
}
}
// containerLocalityPreferences中的每一项都会变成一个新的Container Request
containerLocalityPreferences.toArray
}

复制代码

其参数的基本寄义是：

numContainer: Int 必要进行计算的Container的数量，即大概进行分配的Container数量，包罗Miss的container(还没有申请的Container)，Cancel掉的(本地性不符合任何task要求，因此已经从Yarn上取消的pending container)。同时，还包罗Pending Container中对本地性没有要求的Container，这一部门Container也是我们重新申请的对象，以最大化Locality。上文讲到过的updateResourceRequests()
方法中的potentialContainers就是传入到该方法的numContainers参数：
1. // 将新的container请求，以及刚刚取消的container，作为available container
2. val availableContainers = missing + cancelledContainers
4. // to maximize locality, include requests with no locality preference that can be cancelled
5. // 在availableContainers的基础上，再算上没有任何locality要求的并且还没有分配成功的container
6. val potentialContainers = availableContainers + anyHostRequests.size
复制代码
numLocalityAwareTasks: Int 对locality有要求的task的数量，这个是Driver端通过对stageIdToExecutorPlacementHints计算然后通过RequestExecutor通报过来的数值。已经说过，这是此时的全局状态量，而不是一个增量；
hostToLocalTaskCount: Map[String, Int] 在Stage提交了以后，这个map里面保存了从host到期望分配到这个host的task的数量，这个是Driver端通过stageIdToExecutorPlacementHints通报过来的，具体过程是：
- 在Driver端，ExecutorAllocationManager的onStageSubmitted回调中，会将这个Stage的task preference存放在stageIdToExecutorPlacementHints中。
  1. ----------------------------------------- ExecutorAllocationManager ----------------------------------------
  2. override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit = {
  3. .....
  4. // 计算这个stage在每一个host上的task数量
  5. // Compute the number of tasks requested by the stage on each host
  6. var numTasksPending = 0
  7. val hostToLocalTaskCountPerStage = new mutable.HashMap[String, Int]()
  8. stageSubmitted.stageInfo.taskLocalityPreferences.foreach { locality =>
  9. // 对于每一个task的prefered location的list
  10. numTasksPending += 1
  11. // 对于这个task的每一个 preferred location
  12. locality.foreach { location => // 对于这个locality中的每一个location
  13. // 这个host上的task的数量+1
  14. val count = hostToLocalTaskCountPerStage.getOrElse(location.host, 0) + 1
  15. hostToLocalTaskCountPerStage(location.host) = count
  16. }
  17. }
  18. // 这个map的key是stage id，value是一个元组，记录了这个stage的pending的task的数量，以及从host到task count的map信息
  19. stageIdToExecutorPlacementHints.put(stageId,
  20. (numTasksPending, hostToLocalTaskCountPerStage.toMap))
  21. updateExecutorPlacementHints()
  复制代码
- 随后，ExecutorAllocationManager会有线程不断将这些信息通过RequestExecutors发送给长途的ApplicationMaster：
  1. def start(): Unit = {
  2. listenerBus.addToManagementQueue(listener)
  4. val scheduleTask = new Runnable() {
  5. override def run(): Unit = {
  6. schedule() // 这里会根据需要更新numExecutorsTarget的数量，也会调用
  7. }
  8. }
  9. executor.scheduleWithFixedDelay(scheduleTask, 0, intervalMillis, TimeUnit.MILLISECONDS)
  10. client.requestTotalExecutors(numExecutorsTarget, localityAwareTasks, hostToLocalTaskCount)
  11. }
  复制代码
allocatedHostToContainersMap: HashMap[String, Set[ContainerId]] 已经launch起来的host -> container的映射关系。这是updateResource()方法每次通过Yarn的尺度API allocate()向Yarn扣问以后获取的结果。我们说过，allocate()接口用来向Yarn发送本次的资源请求，并返回当前Yarn为这个Application分配的Container的结果。由于Yarn端的资源分配是异步分配，因此allocate()返回的结果并非是这次请求的资源的分配结果，而是两次相邻的allocate()请求发生之间的新产生的资源分配结果；
localityMatchedPendingAllocations: Seq[ContainerRequest] 对本地性有要求的pending的container，其在方法splitPendingAllocationsByLocality()中对Pending的Container的Locality状态进行切分后，那些与当前请求的Task的Locality有交集的Pending Container将作为已经存在的Container，整个资源请求的目标，是使得新申请的Container和已经分配的Container加起来，其资源倾向和所有Task的统计倾向尽量匹配，从而最大程度满足Task的本地性需求。

localityOfRequestedContainers()算法的基本过程为：

计算每一个Host上应该新分配的Container的数量的预期值。由于是新分配的Container的预期值，因此必要先根据每个Host上的预期存在的Container的总的数量，减去该Host上已经存在的Container：
1. val updatedHostToContainerCount = expectedHostToContainerCount(
2. numLocalityAwareTasks, hostToLocalTaskCount, allocatedHostToContainersMap,
3. localityMatchedPendingAllocations)
复制代码
这里的计算，就是完成下表中从Sum of Tasks(每个机器上分配到的Task的比例) 到 Sum of Containers (每个机器上分配的Container的比例)的转换，然后根据Sum of Containers 减去每台机器上已经分配的Container，就得到了Average Allocated Container Total(每台机器上应该新分配的Container的数量)：
Host 1Host 2Host 3Host 420 Tasks20202010 Tasks101010Sum of Tasks30302010Sum of Containers1515126Allocated Container Target5542Newly-Allocated Container4431
根据上面计算的分配结果，统计没有locality需求的Container的总数量和有locality需求的Container数量：
1. // 希望再launch的所有Host上的container的数量之和，在这里的例子中，是15
2. val updatedLocalityAwareContainerNum = updatedHostToContainerCount.values.sum
4. // The number of containers to allocate, divided into two groups, one with preferred locality,
5. // and the other without locality preference.
6. // 没有locality需求的container的数量
7. val requiredLocalityFreeContainerNum =
8. math.max(0, numContainer - updatedLocalityAwareContainerNum)
10. // 有locality需求的container的数量
11. val requiredLocalityAwareContainerNum = numContainer - requiredLocalityFreeContainerNum
复制代码
先为没有locality需求的Container 构造ContainerLocalityPreferences，每一个ContainerLocalityPreferences对象对应了一个Container请求和这个请求的Locality需求。可以看到，这种没有Locality需求的Container的Host 偏好和Rack 偏好都是空的：
1. val containerLocalityPreferences = ArrayBuffer[ContainerLocalityPreferences]()
2. if (requiredLocalityFreeContainerNum > 0) { // 如果有container是没有locality需求的
3. for (i <- 0 until requiredLocalityFreeContainerNum) {
4. containerLocalityPreferences += ContainerLocalityPreferences( // 为这些没有locality需求的container一一创建container需求
5. null.asInstanceOf[Array[String]], null.asInstanceOf[Array[String]])
6. }
7. }
复制代码
为有locality需求的Container 构造ContainerLocalityPreferences对象，每一个ContainerLocalityPreferences对象封装了这个Container请求和这个请求的Locality需求：
4.1 这里必要完成比力难以理解的Container请求的比例放大，保证比例最大的那个Host(这里是Host1 和 Host 2)的比例值是必要申请的Container的数量，以满足随后为每一个Container构造其Locality信息的过程：
1. var preferredLocalityRatio = updatedHostToContainerCount.map { case(k, ratio) =>
2. val adjustedRatio = ratio.toDouble * requiredLocalityAwareContainerNum / largestRatio
3. (k, adjustedRatio.ceil.toInt) // 往上取整
4. }
复制代码
如下图所示，这里就是完成Round Up 这一步调，将必要新分配的Container数量成比例放大，保证Container比例最大的Host（这里是Host1 和 Host 2）放大以后的值刚好等于必要分配的有Locality Preference 的 Container的总数量。
Host 1Host 2Host 3Host 420 Tasks20202010 Tasks101010Sum of Tasks30302010Sum of Containers1515105Allocated Container Target5542Newly Allocated Container4431Round Up1212934.2 放大完成以后，开始进行分配。
当前有12 个Container必要分配，每一个Host的分配比例为(12,12,9,3)。分配过程上文已经经过，其代码如下：
1. // 每个有locality需求的的Container request，为他们确定对应的hosts和rack
2. for (i <- 0 until requiredLocalityAwareContainerNum) {
3. // Only filter out the ratio which is larger than 0, which means the current host can
4. // still be allocated with new container request.
5. val hosts = preferredLocalityRatio.filter(_._2 > 0).keys.toArray // 还有container可以分配的一个或者多个hosts
6. val racks = hosts.map { h =>
7. resolver.resolve(yarnConf, h) // 解析这些host所在的rack
8. }.toSet
9. // 每一个ContainerLocalityPreferences代表一个Container
10. containerLocalityPreferences += ContainerLocalityPreferences(hosts, racks.toArray)
12. // Minus 1 each time when the host is used. When the current ratio is 0,
13. // which means all the required ratio is satisfied, this host will not be allocated again.
14. preferredLocalityRatio = preferredLocalityRatio.map { case (k, v) => (k, v - 1) }
15. }
复制代码

申请资源以后的处置惩罚：Executor的启动或者竣事

上面讲过，资源调度的入口方法allocateResources()会通过updateResourceRequests()
来计算所需资源并向Yarn进行资源的更新，包罗申请新的资源、开释无用的资源等：

def allocateResources(): Unit = synchronized { updateResourceRequests()
// val allocateResponse = amClient.allocate(progressIndicator) val allocatedContainers = allocateResponse.getAllocatedContainers() handleAllocatedContainers(allocatedContainers.asScala) val completedContainers = allocateResponse.getCompletedContainersStatuses() processCompletedContainers(completedContainers.asScala) } }

复制代码

通过调用Yarn的尺度API allocate()，获取了资源分配的结果。再次强调，Yarn这一端的资源调度是异步调度，因此这个资源分配的结果并不是刚刚通过addContainerRequest()进行资源申请的结果，只是调用者在两次调用allocate() API的之间Yarn对于这个Application的新的资源分配结果。拿到了分配的Container，Spark就可以将Executor启动起来了(注意，是启动一个空的Executor，不是启动Task)。启动起来的Executor随后就会向DriverEndpoint注册自己，通信的具体过程参考《超时导致SparkContext构造失败的题目探究》。这里不再赘述。
对分配结果的处置惩罚，紧张是处置惩罚已经分配的Container以及已经运行竣事的Container:

val allocatedContainers = allocateResponse.getAllocatedContainers()
handleAllocatedContainers(allocatedContainers.asScala)
val completedContainers = allocateResponse.getCompletedContainersStatuses()
processCompletedContainers(completedContainers.asScala)

复制代码

对于已经分配的Container，必要从Yarn的AMRMClient中将对应的资源请求删除，制止对同一个资源进行多次重复申请，然后启动对应的Executor。
对于已经完成的Container，必要根据Container的退出状态，记录相干日记，同时，必要向Driver发送RemoveExecutor消息告知Driver这个Container的竣事，Driver端会进行相干状态的维护。

对于新启动的Container的处置惩罚

对于一个刚刚分配成功的Container，其处置惩罚工作紧张包罗两个

一是从AMRMClient中将对应的资源请求删除，制止同一资源请求的Container被重复申请；
然后，在长途的NodeManager节点上启动Container。

这些过程在方法handleAllocatedContainers()中进行：

--------------------------------- YarnAllocator --------------------------------------
def handleAllocatedContainers(allocatedContainers: Seq[Container]): Unit = {
val containersToUse = new ArrayBuffer[Container](allocatedContainers.size)
// 先处理Host Match的Container
val remainingAfterHostMatches = new ArrayBuffer[Container]
for (allocatedContainer <- allocatedContainers) {
matchContainerToRequest(allocatedContainer, allocatedContainer.getNodeId.getHost,
containersToUse, remainingAfterHostMatches)
}
// 处理Host Match以后剩余的Container
val remainingAfterRackMatches = new ArrayBuffer[Container]
for (allocatedContainer <- remainingAfterHostMatches) {
val rack = resolver.resolve(conf, allocatedContainer.getNodeId.getHost)
matchContainerToRequest(allocatedContainer, rack, containersToUse,
remainingAfterRackMatches)
}
// 处理Host Match和Rack Match以后剩余的Container
val remainingAfterOffRackMatches = new ArrayBuffer[Container]
for (allocatedContainer <- remainingAfterRackMatches) {
matchContainerToRequest(allocatedContainer, ANY_HOST, containersToUse,
remainingAfterOffRackMatches)
}
// 在Host Match，Rack Match，以及ANY_HOST Match以后，依然还有剩余的Container，这只能是Bug
if (!remainingAfterOffRackMatches.isEmpty) {
for (container <- remainingAfterOffRackMatches) {
internalReleaseContainer(container)
}
}
/**
* 在这里会打印 Launching container container_1714042499037_5294_01_000002 on host
*/
runAllocatedContainers(containersToUse)
}

复制代码

遍历每一个分配的Container，在AMRMClient端找到跟这个Container地点的机器相匹配的资源请求，将这个资源请求AMRMClient中删除。这个删除操纵并不会和长途的ResourceManager通信，由于这些资源请求都通过addContainerRequest() API被AMRMClient保存在本地然后通过allocate() API发送给长途的RM的。因此对应请求的删除是在AMRMClient的本地进行的:
1. val remainingAfterHostMatches = new ArrayBuffer[Container]
2. for (allocatedContainer <- allocatedContainers) {
3. matchContainerToRequest(allocatedContainer, allocatedContainer.getNodeId.getHost,
4. containersToUse, remainingAfterHostMatches)
5. }
复制代码
其中，matchContainerToRequest()方法就是根据这个分配成功的Container的特性(Priority，VCore， CPU，Host Location)，从AMRMClient中删除对应的ResourceRequest。未删除的Container保存在参数remainingAfterHostMatches中：
1. --------------------------------- YarnAllocator --------------------------------------
2. private def matchContainerToRequest(
3. allocatedContainer: Container,
4. location: String,
5. containersToUse: ArrayBuffer[Container],
6. remaining: ArrayBuffer[Container]): Unit = {
7. // 这个Container的资源特性
8. val matchingResource = Resource.newInstance(allocatedContainer.getResource.getMemory,
9. resource.getVirtualCores)
10. // 以Priority，Resource(VCore, Memory)，location作为ID，删除这个Container对应的资源请求
11. val matchingRequests = amClient.getMatchingRequests(allocatedContainer.getPriority, location,
12. matchingResource)
14. if (!matchingRequests.isEmpty) { // 匹配成功
15. val containerRequest = matchingRequests.get(0).iterator.next
16. amClient.removeContainerRequest(containerRequest) // 从AMRMClient中删除
17. containersToUse += allocatedContainer
18. } else {
19. remaining += allocatedContainer // 未匹配的Container放入remaining，接着进行其他匹配
20. }
21. }
复制代码
遍历剩下的在Host级别没有匹配成功的剩余的Container，在Rack级别进行Container和ResourceRequest 的匹配，并将匹配不成功的Container保存在remainingAfterRackMatches中：
1. val remainingAfterRackMatches = new ArrayBuffer[Container]
2. for (allocatedContainer <- remainingAfterHostMatches) {
3. /**
4. * SparkRackResolver.
5. */
6. val rack = resolver.resolve(conf, allocatedContainer.getNodeId.getHost)
7. matchContainerToRequest(allocatedContainer, rack, containersToUse,
8. remainingAfterRackMatches)
9. }
复制代码
遍历剩下的在Rack级别没有匹配成功的剩余的Container，在ANY_HOST(忽略本地偏好)级别进行Container和ResourceRequest 的匹配(这种ResourceRequest是那种没有Locality需求的ResourceRequest)，并将匹配不成功的Container保存在remainingAfterRackMatches
1. // Assign remaining that are neither node-local nor rack-local
2. val remainingAfterOffRackMatches = new ArrayBuffer[Container]
3. for (allocatedContainer <- remainingAfterRackMatches) {
4. matchContainerToRequest(allocatedContainer, ANY_HOST, containersToUse,
5. remainingAfterOffRackMatches)
6. }
复制代码
如果有的已经分配的Container无论是Host、Rack照旧ANY_HOST偏好都没有在AMRMClient本地找到他们匹配的资源请求(注释中说，这是由于Yarn自己的竞争导致的bug)，那么开释这些Container:
1. if (!remainingAfterOffRackMatches.isEmpty) {
2. for (container <- remainingAfterOffRackMatches) {
3. internalReleaseContainer(container)
4. }
5. }
复制代码
资源请求开释完毕以后，通过方法runAllocatedContainers()逐个启动每一个Container。
1. ---------------------------------- YarnAllocator ----------------------------------------
2. private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): Unit = {
3. for (container <- containersToUse) { // 对于已经allocate并且资源已经匹配的container
4. executorIdCounter += 1
5. val executorHostname = container.getNodeId.getHost
6. val containerId = container.getId
7. val executorId = executorIdCounter.toString // 分配executorId
8. def updateInternalState(): Unit = synchronized {
9. numExecutorsRunning.incrementAndGet()
10. numExecutorsStarting.decrementAndGet()
11. executorIdToContainer(executorId) = container // executor 和 container的映射关系
12. containerIdToExecutorId(container.getId) = executorId // container 和 executor 的映射关系
13. /**
14. * Container launch起来以后，更新allocatedHostToContainersMap
15. */
16. val containerSet = allocatedHostToContainersMap.getOrElseUpdate(executorHostname,
17. new HashSet[ContainerId])
18. containerSet += containerId
19. allocatedContainerToHostMap.put(containerId, executorHostname)
20. }
22. numExecutorsStarting.incrementAndGet()
23. launcherPool.execute(new Runnable {
24. override def run(): Unit = {
25. try {
26. new ExecutorRunnable(
27. ........
28. ).run() // 运行ExecutorRunnable，用来和NodeManager通信来启动Container
29. updateInternalState()
30. }
31. })
32. }
33. }
复制代码
5.1 在线程池中启动一个ExecutorRunnable。ExecutorRunnable会负责和NodeManager进行通信，在对应节点上将Container启动起来；
1. launcherPool.execute(new Runnable {
2. override def run(): Unit = {
3. try {
4. new ExecutorRunnable(
5. Some(container),
6. conf,
7. sparkConf,
8. driverUrl,
9. executorId,
10. executorHostname,
11. executorMemory,
12. executorCores,
13. appAttemptId.getApplicationId.toString,
14. securityMgr,
15. localResources
16. ).run()
复制代码
5.2 在ExecutorRunnable启动以后，由于新的Container的到场，更新相干元数据信息，包罗executor -> container，container -> executor， container -> host， host -> container的映射关系，这是通过内部方法updateInternalState()来负责的：
1. ---------------------------------- YarnAllocator ----------------------------------------
2. def updateInternalState(): Unit = synchronized {
3. numExecutorsRunning.incrementAndGet()
4. numExecutorsStarting.decrementAndGet()
5. executorIdToContainer(executorId) = container // executor 和 container的映射关系
6. containerIdToExecutorId(container.getId) = executorId // container 和 executor 的映射关系
7. /**
8. * Container launch起来以后，更新allocatedHostToContainersMap
9. */
10. val containerSet = allocatedHostToContainersMap.getOrElseUpdate(executorHostname,
11. new HashSet[ContainerId])
12. containerSet += containerId
13. allocatedContainerToHostMap.put(containerId, executorHostname)
14. }
复制代码

对于竣事的Container的处置惩罚

对于竣事的Container的处置惩罚在方法processCompletedContainers()中进行：

---------------------------------- YarnAllocator ----------------------------------------
private[yarn] def processCompletedContainers(completedContainers: Seq[ContainerStatus]): Unit = {
for (completedContainer <- completedContainers) {
val containerId = completedContainer.getContainerId
val alreadyReleased = releasedContainers.remove(containerId) // alreadyReleased记录这个Container是否已经被释放
val hostOpt = allocatedContainerToHostMap.get(containerId)
val onHostStr = hostOpt.map(host => s" on host: $host").getOrElse("")
val exitReason = if (!alreadyReleased) { // 这个Container还没有释放，那么走释放流程
// Decrement the number of executors running. The next iteration of
// the ApplicationMaster's reporting thread will take care of allocating.
numExecutorsRunning.decrementAndGet()
// Hadoop 2.2.X added a ContainerExitStatus we should switch to use
// there are some exit status' we shouldn't necessarily count against us, but for
// now I think its ok as none of the containers are expected to exit.
val exitStatus = completedContainer.getExitStatus
val (exitCausedByApp, containerExitReason) = exitStatus match {
case ContainerExitStatus.SUCCESS =>
.....
case ContainerExitStatus.PREEMPTED =>
....
case VMEM_EXCEEDED_EXIT_CODE =>
....
case PMEM_EXCEEDED_EXIT_CODE =>
....
}
if (exitCausedByApp) {
logWarning(containerExitReason)
} else {
logInfo(containerExitReason)
}
ExecutorExited(exitStatus, exitCausedByApp, containerExitReason)
} else {
// 如果我们释放了这个Container，那么说明一定是Driver直接通过 killExecutor
// 释放掉了这个Container，而不是它自行结束
ExecutorExited(completedContainer.getExitStatus, exitCausedByApp = false,
s"Container $containerId exited from explicit termination request.")
}
// 解除host -> container 以及 container -> host mapping
for {
host <- hostOpt // 这个Container对应的Host
containerSet <- allocatedHostToContainersMap.get(host) // 这个Container对应的Host上的所有container
} {
containerSet.remove(containerId) // 删除这个container
if (containerSet.isEmpty) { // 这个container是这个host上的最后一个container
allocatedHostToContainersMap.remove(host) // 删除host
} else {
allocatedHostToContainersMap.update(host, containerSet)
}
allocatedContainerToHostMap.remove(containerId) // 解除container -> host map
}
// 解除container -> executor mapping
containerIdToExecutorId.remove(containerId).foreach { eid =>
executorIdToContainer.remove(eid)
....
if (!alreadyReleased) {
// The executor could have gone away (like no route to host, node failure, etc)
// Notify backend about the failure of the executor
numUnexpectedContainerRelease += 1
driverRef.send(RemoveExecutor(eid, exitReason))
}
}
}
}

复制代码

可以看到，方法processCompletedContainers()会遍历Yarn返回的每一个Completed(注意，Completed只是代表Container运行竣事，但是运行结果大概是Succeed大概是Fail)，然后逐个处置惩罚：

如果Container此时并没有被开释，说明Container是自行竣事，而不是Driver所杀死的。根据Container的退出状态和退出原因，打印日记：
1. val alreadyReleased = releasedContainers.remove(containerId) // alreadyReleased记录这个Container是否已经被释放
2. val exitReason = if (!alreadyReleased) {
3. val alreadyReleased = releasedContainers.remove(containerId) // alreadyReleased记录这个Container是否已经被释放
4. val exitStatus = completedContainer.getExitStatus
5. val (exitCausedByApp, containerExitReason) = exitStatus match {
6. case ContainerExitStatus.SUCCESS =>
7. .....
8. case ContainerExitStatus.PREEMPTED =>
9. ....
10. case VMEM_EXCEEDED_EXIT_CODE =>
11. ....
12. case PMEM_EXCEEDED_EXIT_CODE =>
13. ....
14. }
15. if (exitCausedByApp) {
16. logWarning(containerExitReason)
17. } else {
18. logInfo(containerExitReason)
19. }
20. }
复制代码
如果我们发现这个Container已经在ReleasedContainer中存在，说明只能是Driver通过KillExecutor的方式将Container给Release了，而不是Container自行退出：
1. // 如果我们释放了这个Container，那么说明一定是Driver直接通过 killExecutor
2. // 释放掉了这个Container，而不是它自行结束
3. ExecutorExited(completedContainer.getExitStatus, exitCausedByApp = false,
4. s"Container $containerId exited from explicit termination request.")
复制代码
如果是Driver杀死了Executor，Driver会向AMEndpoint发送KillExecutor消息，AMEndpoint会将这个Executor从其维护的元数据信息中删除，将这个Kill掉的Executor的Container添加到releasedContainers中，同时通过AMRMClient向Yarn发送开释container的请求：
1. def killExecutor(executorId: String): Unit = synchronized {
2. val container = executorIdToContainer.get(executorId).get
3. internalReleaseContainer(container)
4. numExecutorsRunning.decrementAndGet()
5. }
6. private def internalReleaseContainer(container: Container): Unit = {
7. releasedContainers.add(container.getId()) // 将这个Container从releasedContainer中删除
8. amClient.releaseAssignedContainer(container.getId()) // 向Yarn发送释放Container的请求
9. }
复制代码
删除这个Container和Host之间的映射关系，包罗Host -> Container的映射关系以及反向的Container-> Host的映射关系
1. for {
2. host <- hostOpt // 这个Container所在的Hosts
3. containerSet <- allocatedHostToContainersMap.get(host) // 这个Host上的所有Container
4. } {
5. containerSet.remove(containerId)
6. if (containerSet.isEmpty) {
7. allocatedHostToContainersMap.remove(host)
8. } else {
9. allocatedHostToContainersMap.update(host, containerSet)
10. }
11. allocatedContainerToHostMap.remove(containerId)
12. }
复制代码
删除Container和Executor之间的映射关系，同时，如果不是Driver主动release的这个container，那么会向Driver发送RemoveExecutor消息
1. containerIdToExecutorId.remove(containerId).foreach { eid =>
2. executorIdToContainer.remove(eid)
3. ........
4. if (!alreadyReleased) {
5. // 这个Container不是Driver自行释放的，那么需要像Driver汇报一个RemoveExecutor消息
6. driverRef.send(RemoveExecutor(eid, exitReason))
7. }
复制代码

基于资源分配结果进行使命调度

上面讲到对于新分配的Container的处置惩罚，在收到Yarn返回的分配的Container以后，ApplicationMaster会启动对应的Executor。这些Executor启动以后，会向Driver注册自己以告知Driver自己的存在，Driver进而将Task调度到Executor中。
其实，Task的调度的触发不仅仅是新分配了Container或者新Launch了Executor，基本上在集群的资源大概发生变化的情况下，都会触发Task的调度，因此，Task的调度是一个不断将Pending Task与可用资源进行匹配然后调度出去的过程。
我列举了下面四种可以触发Driver端的CoarseGrainedSchedulerBackend通过运行makeOffers(或者只针对某一个Executor的makeOffer)来进行使命调度：

来自Executor的注册：上面说过Executor启动以后会向Driver注册自己。此时，Driver认为集群中有了新的可用资源，因此尝试进行Task到Executor的调度。Executor的注册是通过Executor启动的时间向DriverEndpoint发送RegisterExecutor消息来触发的。
Driver将Executor能够提供的可用资源(Memory , CPU)叫做Resource Offer。因此，当收到了新的Executor的注册，Driver端会调用makeOffers()方法，为这个Executor生成对应的WorkerOffer，代表这个Executor剩余可用的CPU和Memory:
1. ------------------------------------ DriverEndpoint -------------------------------
2. override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
3. // 收到 RegisterExecutor 请求，这个请求发生在Executor启动以后，向Driver发送的信息
4. case RegisterExecutor(executorId, executorRef, hostname, cores, logUrls) =>
5. ......
6. executorRef.send(RegisteredExecutor)
7. .....
8. makeOffers() // 尝试进行task的调度，针对所有可用的executor
9. }
复制代码
来自Executor的StatusUpdate：在Executor上Task的运行状态发生变化，都会告知Driver，Driver认为此时集群的资源状态发生了变化，因此尝试进行一次task的调度。但是这次的Task的调度是针对这个Executor的，即只会调度适合运行在这个Executor上的Pending Task到这个Exeuctor上：
1. ------------------------------------ DriverEndpoint -----------------------------------
2. override def receive: PartialFunction[Any, Unit] = {
3. case StatusUpdate(executorId, taskId, state, data) =>
4. scheduler.statusUpdate(taskId, state, data.value)
5. if (TaskState.isFinished(state)) {
6. executorDataMap.get(executorId) match {
7. case Some(executorInfo) =>
8. executorInfo.freeCores += scheduler.CPUS_PER_TASK
9. makeOffers(executorId) // 尝试进行Task的调度，但是只针对这一个Executor
10. ...
11. }
复制代码
来自TaskScheduler中的使命的相干变化，比如，Task的提交，Executor的丢失，Task的失败，推测执行等等，由于系统的可用资源发生了变化，因此TaskScheduler都会通过向DriverEndpoint发送ReviveOffers消息，以触发新一轮的Pending Task的调度。
1. ----------------------------------- DriverEndpoint --------------------------------
2. override def receive: PartialFunction[Any, Unit] = {
3. .....
4. case ReviveOffers =>
5. makeOffers()
复制代码
来自SchedulerBackend的自我触发：在DriverEndpoint作为一个RpcEndpoint启动的时间，会启动一个ReviveThread，以固定频率，向自己发送ReviveOffers的本地消息(发送给自己)，以触发Pending 的 Task到Container的调度：
1. -------------------------------- DriverEndpoint ---------------------------------------
2. override def onStart() {
3. // 定期恢复offer，以允许延迟调度工作
4. // Periodically revive offers to allow delay scheduling to work
5. val reviveIntervalMs = conf.getTimeAsMs("spark.scheduler.revive.interval", "1s")
7. reviveThread.scheduleAtFixedRate(new Runnable {
8. override def run(): Unit = Utils.tryLogNonFatalError {
9. Option(self).foreach(_.send(ReviveOffers))
10. }
11. }, 0, reviveIntervalMs, TimeUnit.MILLISECONDS)
12. }
复制代码

在以Locality为思量重点的Task的调度，就是根据locality从高到低(executor-local 优先级最高)，参照当前答应的优先级，取出对应的Task进行调度。如果当前优先级的Task已经调度完毕，或者当前locality的一部门或者全部Task经过了很久还没有完成调度(即当前的系统资源无法完全满足当前的locality需求)，那么就必要低落locality再次尝试进行调度。
下文会具体讲解makeOffers()的具体流程。
PendingTask的生成:

Task的调度的紧张逻辑是满足Task 的Location Preference，即每一个Task对运行位置 (Executor, Host, Rack等等)上的偏好。我们先看看TaskLocation的寄义和生成过程，然后看看Spark的Driver是怎样通过Locality-Aware的调度方式，尽最大大概满足每一个Task的本地性需求。
TaskLocation的构造过程

在Spark中，一个Task的Location归根结底是由这个Task的Split决定的，Split代表了一个切分，比如，对一个HDFS文件的切分。在Hadoop上，一个Split表现为一个InputSplit接口的实现类的一个实例。
一个HDFS文件的一个Split是由FileSplit对象表达，对象中hostInfos存放了一个SplitLocationInfo数组，每一个SplitLocationInfo对象存放了一个这个Split的一个Replica的Location信息(由于在HDFS上文件是多副本的)，包罗具体的Hostname(即DataNode地点的节点)以及这个Host是否在内存中缓存了这个Split的标记位。
当然，在HDFS的维度，我们说一个Split被缓存，其实缓存的是这个Split对应的Replica。在Hadoop中一个文件的Split对应一个FileSplit对象，代码如下：

--------------------------------------- FileSplit ------------------------------------------------
public FileSplit(Path file, long start, long length, String[] hosts,
String[] inMemoryHosts) {
this(file, start, length, hosts);
hostInfos = new SplitLocationInfo[hosts.length];
for (int i = 0; i < hosts.length; i++) {
// because N will be tiny, scanning is probably faster than a HashSet
boolean inMemory = false;
for (String inMemoryHost : inMemoryHosts) {
if (inMemoryHost.equals(hosts[i])) {
inMemory = true;
break;
}
}
hostInfos[i] = new SplitLocationInfo(hosts[i], inMemory);
}
}

复制代码

其中，一个Replica的Location信息对应了一个SplitLocationInfo对象，包含了这个Split是否在对应的Host上缓存了，以及，对应的Host(DataNode) 信息。在多副本的环境下，一个FileSplit对应的是一个数组 SplitLocationInfo[]：

------------------------------------------ SplitLocationInfo -----------------------------------------
public class SplitLocationInfo {
private boolean inMemory;
private String location;
public SplitLocationInfo(String location, boolean inMemory) {
this.location = location;
this.inMemory = inMemory;
}

复制代码

我们看一下Spark在提交Stage和Task的过程中是怎么获取Task的Location信息的，以及在获取以后是如何基于Location调度Task的。
在HDFS的场景下，Spark是将每一个Stage对应的RDD(一个NewHadoopRDD对象)中的Partition(一个NewHadoopPartition对象)的Location信息(如果有)存放在这个NewHadoopPartition的serializableHadoopSplit中，其实是对InputSplit的封装:

--------------------------------------- NewHadoopPartition ------------------------------------
private[spark] class NewHadoopPartition(
rddId: Int,
val index: Int,
rawSplit: InputSplit with Writable) // 这个HadoopPartition对应的InputSplit
extends Partition {
val serializableHadoopSplit = new SerializableWritable(rawSplit)

复制代码

Stage以及Partition的生成是在提交以进步行的，即执行计划的生成阶段形成的。那么，到了Stage的提交阶段，是怎么利用Stage和Partition中的Location信息，生成具有Location Preference的Task信息的呢？我们看看Stage提交的时间是如何利用Partition的Location信息的。
下面的这段代码是一段经典代码，Spark中DagScheduler提交Stage的时间，会首先查抄这个Stage是否有未提交的Parent Stage：

如果有，会首先递归提交Parent Stage，然后把当前Stage放入到waitingStages中。waitingStage指的是那些parentStage还没有完成、因此必要期待的Stage;
如果没有Missing Parent Stage，那么就可以提交当前的Stage：

/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
val jobId = activeJobForStage(stage)
if (jobId.isDefined) {
if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
val missing = getMissingParentStages(stage).sortBy(_.id)
if (missing.isEmpty) { // 没有missing parent stage，那么提交这个Stage的所有task
submitMissingTasks(stage, jobId.get)
} else {
for (parent <- missing) { // 有missing parent stage，那么先提交
submitStage(parent)
}
waitingStages += stage // 将当前stage放入到waitingStages中随后提交
}
}
}
}

复制代码

在提交Stage的时间，会提交这个Stage所有的Task。我们忽略其他细节，从下面的代码可以看到，Spark会通过方法getPreferredLocs()针对这个Stage的每一个Partition获取这个Partition的Location信息，放到一个taskIdToLocations: Map[Int, Seq[TaskLocation]]中，然后，会提交这些已经带有Location信息的Task：

-------------------------------------- DAGScheduler --------------------------------------
/** Called when stage's parents are available and we can now do its task. */
private def submitMissingTasks(stage: Stage, jobId: Int) {
// First figure out the indexes of partition ids to compute.
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
.....
val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
stage match {
case s: ShuffleMapStage =>
// 计算每一个partition(这里的partition就是一个task)的prefered location的list
partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
case s: ResultStage =>
partitionsToCompute.map { id =>
val p = s.partitions(id) // ResultStage.partitions中存放的是对应的partition id
(id, getPreferredLocs(stage.rdd, p))
}.toMap
}
}
// 在获取了这个Stage的每一个task的location preference信息(如果有)以后，创建一个新的Stage Attempt
// 随后，在调度这个Stage的task的时候，这个task中就包含了对应的location preference
stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
// 提交这个Stage。基于listenerBus的事件触发机制，调度方法就会将包含了Location Preference的Task调度出去
listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
.......
val tasks: Seq[Task[_]] = try {
val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
stage match {
case stage: ShuffleMapStage =>
stage.pendingPartitions.clear()
partitionsToCompute.map { id =>
val locs = taskIdToLocations(id)
val part = partitions(id)
stage.pendingPartitions += id
new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
Option(sc.applicationId), sc.applicationAttemptId)
}
case stage: ResultStage =>
partitionsToCompute.map { id =>
val p: Int = stage.partitions(id)
val part = partitions(p)
val locs = taskIdToLocations(id)
new ResultTask(stage.id, stage.latestInfo.attemptNumber,
taskBinary, part, locs, id, properties, serializedTaskMetrics,
Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
}
}
}
if (tasks.size > 0) {
// 提交task
taskScheduler.submitTasks(new TaskSet(
tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))
}
}

复制代码

submitMissingTasks()方法的基本过程为:

获取这个Stage必要进行计算的所有partitions信息
1. // First figure out the indexes of partition ids to compute.
2. val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
复制代码
这里，ResultStage和ShuffledMapStage对于findMissingPartitions()的实现有区别，紧张是ResultStage中有些函数必要利用的Partition数量不等于这个ResultStage对应的rdd的partition的数量，比如我们常见的first()、last()等。所以，ResultsStage有一个partitions数组，数组中的每一个值代表其实际必要利用的partition id:
1. private[spark] class ResultStage(
2. id: Int,
3. rdd: RDD[_],
4. val func: (TaskContext, Iterator[_]) => _,
5. val partitions: Array[Int], // 实际需要的partition，partitions数组中的每一个值是实际使用的partition在rdd.partitions中的partition id
6. parents: List[Stage],
7. firstJobId: Int,
8. callSite: CallSite)
9. extends Stage(id, rdd, partitions.length, parents, firstJobId, callSite) {
复制代码
所以，对于ResultStage.findMissingPartitions()方法，其返回的是ResultStage.partitions的index的聚集，ResultStage.partitions中在索引index处存放的是对应的Partition ID。比如，ResultStage.partitions[0] = 2，意味着Partition ID = 2。而对于ShuffleMapStage，返回的直接就是ShuffleMapStage的RDD的partitions数组，这个数组中，index 和 value是雷同的，即索引0的位置存放的是partition 0，索引1的位置存放的是partition 1.
1. ------------------------------------ ResultStage ------------------------------------------
2. override def findMissingPartitions(): Seq[Int] = {
3. val job = activeJob.get
4. // 如果是ResultStage，那么实际参与计算的numPartitions不一定等于stage.rdd.partitions，
5. // 因为像first() last()这种function不需要所有的partition参与
6. // 但是对于ShuffleMapStage，stage.rdd中所有的partition都需要参与计算
7. (0 until job.numPartitions).filter(id => !job.finished(id))
8. }
复制代码
1. -------------------------------------- ShuffleMapStage -----------------------------------
2. val numPartitions = rdd.partitions.length // 对于ShuffleMapStage，numPartitions就是这个Stage对应的RDD的partitions的长度
3. override def findMissingPartitions(): Seq[Int] = {
4. mapOutputTrackerMaster
5. .findMissingPartitions(shuffleDep.shuffleId)
6. .getOrElse(0 until numPartitions)
7. }
复制代码
1. ---------------------------------- ActiveJob ---------------------------------------
2. /**
3. * 我们需要为这个job计算的partition的数量. 必须注意，ResultStage可能并不需要计算目标RDD中所有的partition，比如first()或者lookup()等函数
4. */
5. val numPartitions = finalStage match {
6. // ResultStage的partiton数量直接使用ResultStage构造的时候所传入的Partition的数量，这个partitions是ResultStage对应的rdd或者是它的子集
7. case r: ResultStage => r.partitions.length
8. // 对于ShuffleMapStage，它的partitino数量就是对应的ShuffleMapStage.RDD.partitions的长度
9. case m: ShuffleMapStage => m.rdd.partitions.length
10. }
复制代码
生成从partitionID 到对应的Prefered Location的映射关系，遍历这个Stage的所有Partition，从Partition中提取出对应的Location Preference，生成一个从Task 到 Seq[TaskLocation]的映射关系。每一个Task对应了多个TaskLocation，每一个 TaskLocation对应了一个位置信息。基于上面讲到的ResultStage的特别性，它和ShuffleMapStage在获取Partition ID的时间差别：
- 如果是ShuffleMapStage，这个index就是Partition ID，所以直接调用了id, getPreferredLocs(stage.rdd, id)；
- 如果是ResultStage，由于上面讲到的原因，必要先通过val p = s.partitions(id)获取这个index对应的Partition ID:
1. val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
2. stage match {
3. case s: ShuffleMapStage =>
4. // 计算每一个partition(这里的partition就是一个task)的prefered location的list
5. partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
6. case s: ResultStage =>
7. partitionsToCompute.map { id =>
8. val p = s.partitions(id) // 这里的id其实只是ResultStage.partitions的索引，对应的值才是Partition ID
9. (id, getPreferredLocs(stage.rdd, p)) // 根据rdd和partitionsID获取对应的preferered location
10. }.toMap
11. }
12. }
复制代码
创建一个新的Stage Attempt。一次Stage Attempt代表的就是一次Stage的运行，如果Stage运行失败必要重试，那么就进入下一个 Stage Attempt。这里紧张是更新Stage中的_latestInfo，代表当前的Stage Attempt的信息:
1. stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
复制代码
1. ----------------------------------------- Stage ------------------------------------------
2. def makeNewStageAttempt(
3. numPartitionsToCompute: Int,
4. taskLocalityPreferences: Seq[Seq[TaskLocation]] = Seq.empty): Unit = {
5. val metrics = new TaskMetrics
6. metrics.register(rdd.sparkContext)
7. // taskLocalityPreferences是这一个Seq[Seq[TaskLocation]]，代表了每一个task的prefered location，
8. // 一个task的prefered location可能不止一个
9. _latestInfo = StageInfo.fromStage(
10. this, nextAttemptId, Some(numPartitionsToCompute), metrics, taskLocalityPreferences)
11. nextAttemptId += 1
12. }
复制代码
为这个Stage生成对应的Task(只是生成，还没有提交):
1. val tasks: Seq[Task[_]] = try {
2. val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
3. stage match {
4. case stage: ShuffleMapStage =>
5. stage.pendingPartitions.clear()
6. partitionsToCompute.map { id =>
7. val locs = taskIdToLocations(id)
8. val part = partitions(id)
9. stage.pendingPartitions += id
10. new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
11. taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
12. Option(sc.applicationId), sc.applicationAttemptId)
13. }
15. case stage: ResultStage =>
16. partitionsToCompute.map { id =>
17. val p: Int = stage.partitions(id)
18. val part = partitions(p)
19. val locs = taskIdToLocations(id)
20. new ResultTask(stage.id, stage.latestInfo.attemptNumber,
21. taskBinary, part, locs, id, properties, serializedTaskMetrics,
22. Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
23. }
24. }
25. }
复制代码
构造Task的时间，对应的Location Preference信息存入了Task对象的locs成员中，通过调用preferredLocs可以获取这个Task的Location Preference，随后，Spark根据Task的preferredLocs进行资源请求，并且在申请到了资源以后，根据preferredLocs将Task调度到符合的Executor上去:
1. private[spark] class ShuffleMapTask(
2. stageId: Int,
3. stageAttemptId: Int,
4. taskBinary: Broadcast[Array[Byte]],
5. partition: Partition, // Partition
6. @transient private var locs: Seq[TaskLocation], // Location Preference
7. ......
8. )
10. // 获取对应Task的Location Preference，用来进行调度决策
11. @transient private val preferredLocs: Seq[TaskLocation] = {
12. if (locs == null) Nil else locs.toSet.toSeq
13. }
复制代码
提交所有的Task
1. if (tasks.size > 0) {
2. // 提交task
3. taskScheduler.submitTasks(new TaskSet(
4. tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))
5. }
复制代码

下面的代码显示了getPreferredLocsInternal()方法的细节，即获取一个RDD中某一个Partition的Location信息(一个Seq[String])，返回对应的TaskLocation接口的某个实现类的Seq：

-------------------------------------- DAGScheduler --------------------------------------
private def getPreferredLocsInternal(
rdd: RDD[_], // 当前Stage的RDD
partition: Int, // 这个Partitoin的index,即在rdd.partitions中的index
visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {
// If the partition has already been visited, no need to re-visit.
// This avoids exponential path exploration. SPARK-695
if (!visited.add((rdd, partition))) {
// Nil has already been returned for previously visited partitions.
return Nil
}
// 如果这个partition已经缓存，那么就返回缓存以后的Partion的Location信息
val cached = getCacheLocs(rdd)(partition)
if (cached.nonEmpty) {
return cached
}
// 如果这个RDD含有一个Location Preference，那么就返回这个Location Preference
val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
if (rddPrefs.nonEmpty) {
return rddPrefs.map(TaskLocation(_)) // 这里会根据getPreferredLocations()的返回值Seq[String]，将List中的每一个值映射到一个TaskLocation的实现类，即一个Seq[TaskLocation]
}

复制代码

TaskLocation的apply()方法会根据传入的参数的类型，返回其对应的差别实现类：

---------------------------------- TaskLocation ------------------------------------
/**
* Create a TaskLocation from a string returned by getPreferredLocations.
* These strings have the form executor_[hostname]_[executorid], [hostname], or
* hdfs_cache_[hostname], depending on whether the location is cached.
*/
def apply(str: String): TaskLocation = {
val hstr = str.stripPrefix(inMemoryLocationTag)
if (hstr.equals(str)) {
if (str.startsWith(executorLocationTag)) {
val hostAndExecutorId = str.stripPrefix(executorLocationTag)
val splits = hostAndExecutorId.split("_", 2)
val Array(host, executorId) = splits
new ExecutorCacheTaskLocation(host, executorId)
} else {
new HostTaskLocation(str)
}
} else {
new HDFSCacheTaskLocation(hstr)
}
}
}

复制代码

可以看到，rdd.partitions(partition)方法根据这个Partition Index，返回了这个Partition对象。对于NewHadoopRDD，这个Partition的实现是NewHadoopPartition。一个NewHadoopPartition对象中存放了对应的InputSplit接口的实现类。比如，对于一个文件，InputSplit的实现是FileSplit，而InputSplit则包含了对应的Location信息:
然后，调用对应的RDD的getPreferredLocations()方法获取Location Preference，即一个NewHadoopRDD中包含了所有的Partition，每一个Partition由一个Partition接口的实现类NewHadoopPartition对象表现，每个NewHadoopPartition中存放了这个Partition的InputSplit信息：

----------------------------------- NewHadoopRDD ---------------------------------------
override def getPreferredLocations(hsplit: Partition): Seq[String] = {
// 返回这个 NewHadoopPartition对应的Hadoop层面的Split信息，比如，一个文件的FileSplit，
// 包含了这个File中这个split的起始位置，长度，replica的位置
val split = hsplit.asInstanceOf[NewHadoopPartition].serializableHadoopSplit.value
// 提取这个FileSplit的多个Replica位置信息
val locs = HadoopRDD.convertSplitLocationInfo(split.getLocationInfo)
locs.getOrElse(split.getLocations.filter(_ != "localhost"))
}

复制代码

在Spark端，在生成每一个RDD中的Partition的信息的时间，其映射关系其实从底向上为 HDFS Split(InputSplit) -> Partition(NewHadoopPartition) -> Task 的关系，因此，在生成RDD Partition的时间，会将对应的Split的Location信息转换成Partition的Location信息，这个Partition的Location信息利用TaskLocation对象的具体实现类来表达的。这个转换是由object方法convertSplitLocationInfo()负责的，该方法的输入是一个Array[SplitLocationInfo]，输出为对应的Seq[String]，即对应的Location的字符串表现，即:

如果Split的是内存缓存，则构造HDFSCacheTaskLocation的location对象，代表这个partition已经被HDFS的对应Host缓存了。很显然，对应的Task如果直接调度到这台机器，则会提高读取效率;
如果Split不是内存缓存，则构造HostTaskLocation，显然，Task调度到这条机器上，读取效率会更高(没有跨机器的网络带宽，如果打开了短路读，还可以利用HDFS的短路读特性)
1. ---------------------------------- HadoopRDD ---------------------------------------
2. private[spark] def convertSplitLocationInfo(
3. infos: Array[SplitLocationInfo]): Option[Seq[String]] = {
4. Option(infos).map(_.flatMap { loc =>
5. val locationStr = loc.getLocation
6. if (loc.isInMemory) {
7. Some(HDFSCacheTaskLocation(locationStr).toString)
8. } else {
9. Some(HostTaskLocation(locationStr).toString)
10. }
11. })
12. }
复制代码

TaskLocation的实现类除了HDFSCacheTaskLocation和HostTaskLocation，还有ExecutorCacheTaskLocation，即这个Task希望运行在这个Executor上，比如，下面两种情况，这个Partition的Locality Preference是希望精确到对应的Executor的:

在Streaming的Task的调度中，对于Receiver的调度，为了均匀调度，会首先将所有的Task均匀调度到Host上，剩下的Task均匀调度到Executor上。这不不在赘述，具体逻辑可以查看ReceiverSchedulingPolicy的scheduleReceivers()方法;
这个Partition在这个Executor上被缓存，因此，对应的Task肯定必要精确运行在对应的Executor上。这里不在赘述，对应的方法为DAGScheduler.getCacheLocs()：

private[scheduler]
def getCacheLocs(rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]] = cacheLocs.synchronized {
// Note: this doesn't use `getOrElse()` because this method is called O(num tasks) times
if (!cacheLocs.contains(rdd.id)) {
// Note: if the storage level is NONE, we don't need to get locations from block manager.
val locs: IndexedSeq[Seq[TaskLocation]] = if (rdd.getStorageLevel == StorageLevel.NONE) {
IndexedSeq.fill(rdd.partitions.length)(Nil) // 如果对应RDD的StorageLevel是NONE，那么返回一个空个的TaskLocation数组
} else { // 如果RDD的StorageLevel不是空的，那么会构造对应的ExecutorCacheTaskLocation
val blockIds =
rdd.partitions.indices.map(index => RDDBlockId(rdd.id, index)).toArray[BlockId]
blockManagerMaster.getLocations(blockIds).map { bms => // 通过BlockManagerMaster获取对应的Block的位置，然后构造对应的ExecutorCacheTaskLocation
bms.map(bm => TaskLocation(bm.host, bm.executorId))
}
}
cacheLocs(rdd.id) = locs
}
cacheLocs(rdd.id)
}

复制代码

对应的TaskLocation接口的实现类的toString()方法就返回了这个TaskLocation的String表现。比如，HostTaskLocation.toString()就返回对应的Hostname:

----------------------------------------- HostTaskLocation ----------------------------------
/**
* A location on a host.
*/
private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation {
override def toString: String = host
}

复制代码

-------------------------------------- HDFSCacheTaskLocation ----------------------------------
/**
* A location on a host that is cached by HDFS.
*/
private [spark] case class HDFSCacheTaskLocation(override val host: String) extends TaskLocation {
override def toString: String = TaskLocation.inMemoryLocationTag + host
}

复制代码

这样，DAGScheduler.submitMissingTasks()就会提交这些带有Location Preference的Task。这些Task会在Driver端经过一些统计计算，让ApplicationMaster根据提供的资源量、Task的位置倾向等，向Yarn申请资源，申请到的资源向Driver注册，Driver再将Task和Container进行最有匹配，最大程度满足Task的位置需求。
根据TaskLocation信息，将Task添加到差别的pendingTask数组中

所有必要运行因此必要相应资源的Task，Driver都会将对应的Task添加到pendingTask中，这些pendingTask是由TaskSetManager维护的，一个TaskSetManager对象是一个Stage的使命聚集，紧张负责这个Stage的Task的管理和调度。
根据每一个Task的资源本地性需求的差别，pendingTask分别维护在下面的Map中，代码如下:

// 希望在某一个Executor上运行的pendingTasks，Key是Executor ID，Value是希望在这个Executor上运行的TaskID的列表
private val pendingTasksForExecutor = new HashMap[String, ArrayBuffer[Int]]
// 希望在某一个Host上运行的pendingTasks,Key是对应的Host，Value是希望在这个Host上运行的TaskID的列表
private val pendingTasksForHost = new HashMap[String, ArrayBuffer[Int]]
// 希望在某个一Rack上运行的pendingTasks，Key是对应的Rack，Value是希望在这个Rack上运行的TaskID的列表
private val pendingTasksForRack = new HashMap[String, ArrayBuffer[Int]]
// 没有任何的本地性需求的pendingTasks的列表
private[scheduler] var pendingTasksWithNoPrefs = new ArrayBuffer[Int]

复制代码

向PendingTask中添加使命的时间，会根据这个task的Locality Preference，将对应的pendingTask放入到上面差别的使命聚会合：

/** 将一个Task添加到它所属的所有的pending-task lists中去 */
private[spark] def addPendingTask(index: Int) {
for (loc <- tasks(index).preferredLocations) {
loc match {
case e: ExecutorCacheTaskLocation => // Executor级别的Location Preference
pendingTasksForExecutor.getOrElseUpdate(e.executorId, new ArrayBuffer) += index
case e: HDFSCacheTaskLocation => // HDFS缓存级别的Location Preference
val exe = sched.getExecutorsAliveOnHost(loc.host) // 获取这个location所在的host上的所有executor id
exe match {
case Some(set) =>
for (e <- set) { // 对于这个host上的每一个executor
// 将这个task添加到这个host上的每一个executor上去
pendingTasksForExecutor.getOrElseUpdate(e, new ArrayBuffer) += index
}
}
case _ =>
}
// 将这个task添加到pendingTasksForHost
pendingTasksForHost.getOrElseUpdate(loc.host, new ArrayBuffer) += index
for (rack <- sched.getRackForHost(loc.host)) {
pendingTasksForRack.getOrElseUpdate(rack, new ArrayBuffer) += index
}
} // end of for loop
if (tasks(index).preferredLocations == Nil) {
pendingTasksWithNoPrefs += index
}
allPendingTasks += index // No point scanning this whole list to find the old task there
}

复制代码

其基本逻辑为:

遍历所有的Task，根据对应Task的Location Preference，将对应的Task添加到对应的pendingLocation数组中:

for (loc <- tasks(index).preferredLocations) {
......
}

复制代码

如果是ExecutorCacheTaskLocation，那么就将Task添加到pendingTasksForExecutor中
1. case e: ExecutorCacheTaskLocation =>
2. pendingTasksForExecutor.getOrElseUpdate(e.executorId, new ArrayBuffer) += index
复制代码
如果是HDFSCacheTaskLocation，并且这个Host上的确有active的Executor，那么就将这个task添加到这个Host上的每一个Executor上，即将这个Task添加到pendingTasksForExecutor中。但是，显然，这个Task最终只会在这个Host中的某一个Executor上运行一次，而不会重复运行：
1. val exe = sched.getExecutorsAliveOnHost(loc.host) // 获取这个Host上所有的alive executors
2. exe match {
3. case Some(set) =>
4. for (e <- set) { // 对于这个host上的每一个executor
5. // 将这个task添加到这个host上的每一个executor上去
6. pendingTasksForExecutor.getOrElseUpdate(e, new ArrayBuffer) += index
7. }
复制代码
无论是ExecutorCacheTaskLocation，照旧HDFSCacheTaskLocation，都将这个Task添加到pendingTasksForHost中，由于这个Task肯定会运行着这个Host上的。
1. pendingTasksForHost.getOrElseUpdate(loc.host, new ArrayBuffer) += index
复制代码
获取这个Host地点的Rack，然后添加到这个对应的pendingTasksForRack中：
1. for (rack <- sched.getRackForHost(loc.host)) {
2. pendingTasksForRack.getOrElseUpdate(rack, new ArrayBuffer) += index
3. }
复制代码
如果这个Task没有任何location preference，那么就添加到pendingTasksWithNoPrefs中：
1. if (tasks(index).preferredLocations == Nil) {
2. pendingTasksWithNoPrefs += index
3. }
复制代码
无论怎样，将这个task添加到allPendingTasks中，这是当所有的locality都无法满足要求一户最坏的Locality选择。

总之，我们必要注意:

区分pendingTasksWithNoPrefs和allPendingTasks：前者意味着Task自己对locality没有任何要求，因此可以随意将Task进行分配。而allPendingTasks中存放了所有的task，意味着当我们无法满足task的localtiy要求以后的不得已的选择。比如一个在pendingTasksForRack中的task的locality要求无法满足，只能退化到选择其他Rack来运行该Task
pendingTasksForExecutor、pendingTasksForHost、pendingTasksForRack之间存在的包含关系，即，如果一个Task是pendingTasksForExecutor，那么也会放入到这个Executor地点的host(pendingTasksForHost)和地点的rack(pendingTasksForRack)中，同理，如果一个Task的pendingTasksForHost，也会放入到这个Host地点的Rack(pendingTasksForRack)中。而且，所有的Task都会放入到allPendingTasks中。我们从后面的locality退化可以看到原因，即如果某个locality要求下无法分配完所有的task，那么会退化到低一级的优先级，而低一级的优先级列表中肯定包含了高一级优先级列表的Task，因此这个Task可以在低一级进行调度。
我们可以看到，pendingTasksForRack中的Task并不是这个task的Locality Preference直接就是就是这个rack，而是这个task的location preference是对应的这个Rack中的某一个Host，因此将这个Host地点的Rack添加为这个Task的Rack Locality。后面会看到，Rack locality是比no preference更低一级的locality，其实就是这个原因，只管表面上看起来比力疑惑，也与比如Yarn的使命调度逻辑好像差别。

可用的LocationLevel的计算

Spark用一个枚举类型TaskLocality来表达差别的本地级别：

object TaskLocality extends Enumeration {
// Process local is expected to be used ONLY within TaskSetManager for now.
val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value

复制代码

PROCESS_LOCAL: 进程本地，即这个Task希望调度到某一个特定的Executor上
NODE_LOCAL：节点本地，即这个Task希望调度到某一个特定的节点
NO_PREF: 没有本地倾向，即这个Task可以任意调度到集群中的任何节点
RACK_LOCAL：机架本地，即这个Task希望调度到某一个特定的rack上
ANY: 无法满足要求，是前面四种本地性倾向都无法满足要求时的最后的选择策略，将有节点倾向的Task任意调度到集群中无法满足其本地性倾向的位置上。
这里必要注意的是:

显然，越靠前，数字越小，Locality级别越高
RACK_LOCAL并不指的是这些使命的locality preference中原来就指明确运行在这个Rack上，而是NODE_LOCAL或者PROCESS_LOCAL的对应退化调度策略，即，如果对应的节点或者Executor由于资源题目无法满足其要求，就会退化到利用RACK_LOCAL，并不是Task自己声明调度到Rack上。
NO_PREF的优先级比RACK_LOCAL的优先级更高，这是由于NO_PREF指的是Task自己明确表明自己没有本地性偏好，因此不是一种退化策略，而是对其Locality Preference的一种完全的满足，而RACK_LOCAL其实是一种退化策略，因此RACK_LOCAL的优先级低于NO_PREF
ANY的优先级最低，和RACK_LOCAL一样，也是一种退化策略。

在一个TaskSet的TaskSetManager构造的时间，以及后来一个新的Executor到场或者丢失的时间，都会重新计算这个TaskSet的Valid Locality Levels。所以，应该注意到：

Valid Locality Levels 指的是根据当前的Pending Task的差别LOCALITY需求(比如，差别locality需求的pending task分别放在了pendingTasksForExecutor、pendingTasksForHost、pendingTasksWithNoPrefs、pendingTasksForRack和allTasks中)，计算我们目前大概必要的是哪些TaskLocality。这样，在对Task进行基于Locality 的分配(下文会讲到)的时间，只必要思量这些Valid Locality Levels。
必须注意，computeValidLocalityLevels()是TaskSetManager的成员方法，即这里是为某一个TaskSet计算对应的Valid Locality Levels

下面的方法显示了计算Valid Locality Levels的基本过程：

------------------------------------- TaskSetManager --------------------------------------
private def computeValidLocalityLevels(): Array[TaskLocality.TaskLocality] = {
import TaskLocality.{PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY}
val levels = new ArrayBuffer[TaskLocality.TaskLocality]
// 如果有任务在等待执行器（executor），并且这些执行器中有活着的执行器，那么就将PROCESS_LOCAL添加到有效级别列表中。
if (!pendingTasksForExecutor.isEmpty &&
pendingTasksForExecutor.keySet.exists(sched.isExecutorAlive(_))) {
levels += PROCESS_LOCAL
}
// 如果有任务在等待主机（host），并且这些主机上有活着的执行器，那么就将NODE_LOCAL添加到有效级别列表中。
if (!pendingTasksForHost.isEmpty &&
pendingTasksForHost.keySet.exists(sched.hasExecutorsAliveOnHost(_))) {
levels += NODE_LOCAL
}
if (!pendingTasksWithNoPrefs.isEmpty) {
levels += NO_PREF
}
// 如果有任务在等待机架（rack），并且这些机架上有活着的主机，那么就将RACK_LOCAL添加到有效级别列表中。
if (!pendingTasksForRack.isEmpty &&
pendingTasksForRack.keySet.exists(sched.hasHostAliveOnRack(_))) {
levels += RACK_LOCAL
}
levels += ANY
levels.toArray
}

复制代码

该方法返回一个TaskLocality数组，即如果对应的pendingTask数组中有Task，那么就将对应的TaskLocality添加到数组中。这里必要注意，往数组中添加TaskLocality是按照TaskLocality的值从低到高(Locality优先级从高到低)的顺序添加的，从下文介绍基于Locality的Task调度可以看到，调度时会遍历这个返回的TaskLocality数组，即调度时按照Locality优先级从高到低进行的。最后的TaskLocality.ANY一定会最后添加到结果中，作为一个优先级的保底操纵。
基于Locality的Task调度

上文讲过，Task的调度的触发，以及其通过makeOffers()方法进行调度。在这里，我们具体讲解这个调度的基本细节。
makeOffers()的基本功能，就是为当前的Executor的资源剩余情况，生成对应的WorkerOffer，代表这些Executor可提供的运行资源，其中包含了对应的executorId，地点的host信息以及可用的vCore信息：

private[spark]
case class WorkerOffer(executorId: String, host: String, cores: Int)

复制代码

然后TaskScheduler会根据这些剩余资源，将对应的pendingTask调度出去，当然，调度过程中必要依赖对应的Task的Locality信息。

-------------------------------------- CoarseGrainedSchedulerBackend -------------------------------
private def makeOffers() {
// 根据集群当前的可用资源状况，生成Task的调度结果
val taskDescs = CoarseGrainedSchedulerBackend.this.synchronized {
// 筛选出alive的executor
val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
val workOffers = activeExecutors.map {
case (id, executorData) =>
new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
}.toIndexedSeq // 构造每一个Executor上的可用资源
scheduler.resourceOffers(workOffers) // 在这里依赖于TaskScheduler来调度对应的task到对应的worker上
}
launchTasks(taskDescs) // 启动这些tasks
}

复制代码

resourceOffers()方法是TaskSchedulerImpl的成员方法，其输入是一系列的WorkerOffer，返回可以进行调度的所有Task(每一个Task由一个TaskDescription表现)。我们后面会看到，TaskSchedulerImpl会遍历当前必要调度的所有TaskSet尝试进行调度，而不是某一个TaskSet。
resourceOffers()代码如下所示:

---------------------------------------- TaskSchedulerImpl ------------------------------
def resourceOffers(offers: IndexedSeq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
// Mark each slave as alive and remember its hostname
// Also track if new executor is added
var newExecAvail = false
for (o <- offers) { // 对于每一个WorkerOffer
if (!hostToExecutors.contains(o.host)) {
hostToExecutors(o.host) = new HashSet[String]() // 构建这个Host的Executor的map
}
// 是否有新增的executor进来。如果有，则维护executor相关的map信息
if (!executorIdToRunningTaskIds.contains(o.executorId)) {
hostToExecutors(o.host) += o.executorId // 将这个executor加到这个host -> executors 的map中
executorAdded(o.executorId, o.host)
executorIdToHost(o.executorId) = o.host // 将这个host加到executor -> host 的毛重
executorIdToRunningTaskIds(o.executorId) = HashSet[Long]() // 构建这个executor -> tasks的map
newExecAvail = true
}
for (rack <- getRackForHost(o.host)) {
hostsByRack.getOrElseUpdate(rack, new HashSet[String]()) += o.host // 构建rack -> hosts的map
}
}
// Before making any offers, remove any nodes from the blacklist whose blacklist has expired. Do
// this here to avoid a separate thread and added synchronization overhead, and also because
// updating the blacklist is only relevant when task offers are being made.
blacklistTrackerOpt.foreach(_.applyBlacklistTimeout())
val filteredOffers = blacklistTrackerOpt.map { blacklistTracker =>
offers.filter { offer =>
!blacklistTracker.isNodeBlacklisted(offer.host) &&
!blacklistTracker.isExecutorBlacklisted(offer.executorId)
}
}.getOrElse(offers)
// 将offer进行随机shuffle，返回打乱顺序以后的IndexedSeq[WorkerOffer]
val shuffledOffers = shuffleOffers(filteredOffers)
// Build a list of tasks to assign to each worker.
// 根据当前的WorkerOffer，预构建一个TaskDescription的二维数组
val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores / CPUS_PER_TASK))
val availableCpus = shuffledOffers.map(o => o.cores).toArray // 构建WorkerOffer -> 可用CPU的对应关系
val sortedTaskSets = rootPool.getSortedTaskSetQueue
for (taskSet <- sortedTaskSets) {
if (newExecAvail) { // 如果有新的Executor加入进来
taskSet.executorAdded() // 重新计算这个TaskSet的locality的相关信息
}
}
// Take each TaskSet in our scheduling order, and then offer it each node in increasing order
// of locality levels so that it gets a chance to launch local tasks on all of them.
// NOTE: the preferredLocality order: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY
for (taskSet <- sortedTaskSets) { // 按照调度顺序，取出每一个TaskSet
var launchedAnyTask = false
var launchedTaskAtCurrentMaxLocality = false
// locality从低到高遍历这个TaskSet中的每一个可用的localityLevels，locality越低代表本地性越好
for (currentMaxLocality <- taskSet.myLocalityLevels) {
do {
launchedTaskAtCurrentMaxLocality = resourceOfferSingleTaskSet(
taskSet, currentMaxLocality, shuffledOffers, availableCpus, tasks)
launchedAnyTask |= launchedTaskAtCurrentMaxLocality
} while (launchedTaskAtCurrentMaxLocality)
}
if (!launchedAnyTask) { // 没有launch任何一个task
taskSet.abortIfCompletelyBlacklisted(hostToExecutors)
}
}
if (tasks.size > 0) {
hasLaunchedTask = true
}
return tasks
}

复制代码

resourceOffers()的基本流程如下图所示：

其基本流程为:

根据生成的每一个WorkerOffer，TaskScheduler会维护executor <-> host <-> rack之间的映射关系。刚刚说过，resourceOffers()调用有大概来自于Executor的注册，因此有必要根据WorkerOffer来更新这个新的Executor和Host以及Rack之间的映射关系
1. for (o <- offers) { // 对于每一个WorkerOffer
2. if (!hostToExecutors.contains(o.host)) {
3. hostToExecutors(o.host) = new HashSet[String]() // 构建这个Host的Executor的map
4. }
5. // 是否有新增的executor进来。如果有，则维护executor相关的map信息
6. if (!executorIdToRunningTaskIds.contains(o.executorId)) {
7. hostToExecutors(o.host) += o.executorId // 将这个executor加到这个host -> executors 的map中
8. executorAdded(o.executorId, o.host)
9. executorIdToHost(o.executorId) = o.host // 将这个host加到executor -> host 的毛重
10. executorIdToRunningTaskIds(o.executorId) = HashSet[Long]() // 构建这个executor -> tasks的map
11. newExecAvail = true
12. }
13. for (rack <- getRackForHost(o.host)) {
14. hostsByRack.getOrElseUpdate(rack, new HashSet[String]()) += o.host // 构建rack -> hosts的map
15. }
16. }
复制代码
过滤掉部门不思量的WorkerOffer，比如，这个WorkerOffer的Node或者Executor在黑名单中
1. val filteredOffers = blacklistTrackerOpt.map { blacklistTracker =>
2. offers.filter { offer =>
3. !blacklistTracker.isNodeBlacklisted(offer.host) &&
4. !blacklistTracker.isExecutorBlacklisted(offer.executorId)
5. }
6. }.getOrElse(offers)
复制代码
为了制止每次都将Task调度到某一个WorkerOffer上(假如这个WorkerOffer资源足够多)，每次调用resourceOffers()的时间都会对WorkerOffer进行一次重新随机排序。注意，这个重新排序只是将WorkerOffer重新排序，没有将Locality重新排序，Locality永远从高到低(值从小到大)进行思量:
1. // 将offer进行随机shuffle，返回打乱顺序以后的IndexedSeq[WorkerOffer]
2. val shuffledOffers = shuffleOffers(filteredOffers)
3. // Build a list of tasks to assign to each worker.
4. // 根据当前的WorkerOffer，预构建一个TaskDescription的二维数组
5. val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores / CPUS_PER_TASK))
6. val availableCpus = shuffledOffers.map(o => o.cores).toArray // 构建WorkerOffer -> 可用CPU的对应关系
复制代码
根据TaskSet的调度策略，获取排序以后的TaskSet列表。对于列表中的每一个TaskSet，如果这个TaskSet刚刚有新的Host到场，必要通过调用TaskSet的executorAdded()方法，这个方法紧张是重新计算locality信息。上文已经介绍过TaskSet的Locality信息的计算。 TaskSet的调度策略不在本文介绍范围内，感兴趣的读者可以自行相识
1. val sortedTaskSets = rootPool.getSortedTaskSetQueue // 根据TaskSet调度策略，返回排序以后的TaskSet数组
2. for (taskSet <- sortedTaskSets) {
3. if (newExecAvail) { // 如果有新的Executor加入进来
4. taskSet.executorAdded() // 重新计算这个TaskSet的locality的相关信息
5. }
6. }
复制代码
根据排序以后的TaskSet，遍历每一个TaskSet，基于这个TaskSet的Locality数组，对这个TaskSet的Tasks进行调度：
1. for (taskSet <- sortedTaskSets) { // 按照调度顺序，取出每一个TaskSet
2. var launchedAnyTask = false
3. var launchedTaskAtCurrentMaxLocality = false
4. // locality从低到高遍历这个TaskSet中的每一个可用的localityLevels，locality越低代表本地性越好
5. for (currentMaxLocality <- taskSet.myLocalityLevels) {
6. do {
7. launchedTaskAtCurrentMaxLocality = resourceOfferSingleTaskSet(
8. taskSet, currentMaxLocality, shuffledOffers, availableCpus, tasks)
9. launchedAnyTask |= launchedTaskAtCurrentMaxLocality
10. } while (launchedTaskAtCurrentMaxLocality)
11. }
12. if (!launchedAnyTask) { // 没有launch任何一个task
13. taskSet.abortIfCompletelyBlacklisted(hostToExecutors)
14. }
15. }
复制代码
上文讲过locality数组的生成过程，可以看到locality数组从前到后的Locality从高到低，因此，这里的逻辑是，对于每一个TaskSet，尝试先调度Locality最高的Task，并且，如果当前要求的Locality在指定的超时时间内无法将所Task调度完毕，将尝试更低一级的locality进行调度。到最后，locality会低落到TaskLocality.ANY，即进行任意调度。

所以，对于一个TaskSet中的task根据当前要求的TaskLocality进行使命调度，发生在方法resourceOfferSingleTaskSet()中。这个方法根据答应的最大的locality(currentMaxLocality，这里的最大指的是最低要求，即，不可以比这个locality更宽松了)，当前可用资源（shuffledOffers），必要调度的TaskSet，返回成功调度的这个TaskSet中的task：

private def resourceOfferSingleTaskSet(
taskSet: TaskSetManager, // 当前的TaskSet
maxLocality: TaskLocality, // 当前最大的locality，最大的意思是最优的locality
shuffledOffers: Seq[WorkerOffer], // 每一个WorkerOffer代表了一个可用资源
availableCpus: Array[Int], // 这个shuffledOffers中的每一个WorkerOffer所代表的可用的VCores
// 一个WorkerOffer按照可用cpu以及每个task的cpu，算出Task的数量
tasks: IndexedSeq[ArrayBuffer[TaskDescription]]) : Boolean = {
var launchedTask = false
// nodes and executors that are blacklisted for the entire application have already been
// filtered out by this point
for (i <- 0 until shuffledOffers.size) { // 遍历每一个WorkerOffer(每一个WorkerOffer对应了一个executor)
val execId = shuffledOffers(i).executorId
val host = shuffledOffers(i).host
for (task <- taskSet.resourceOffer(execId, host, maxLocality)) { // 往这个Offer上调度一个Task
tasks(i) += task // 保存调度结果
val tid = task.taskId
....
launchedTask = true
}
}
return launchedTask
}

复制代码

从上面的代码可以看到，resourceOfferSingleTaskSet()方法会遍历当前所有的可用资源WorkerOffer，尝试按照当前的maxLocality，调用resourceOffer()方法，往这个WorkerOffer上面调度一个task。
resourceOffer()方法是TaskSet的成员方法，其根据executorID， hostname以及最大答应的locality(maxLocality，即locality不可以再差了)，尝试从pendingTask中选出一个满足条件的task调度：

def resourceOffer(
execId: String, // executor id
host: String, // executor所在的host
maxLocality: TaskLocality.TaskLocality) // 所容许的locality，不能比这个locality更宽松
: Option[TaskDescription] =
{
....
val curTime = clock.getTimeMillis()
// allowedLocality 代表当前最宽松的locality是什么，显然，在开始的时候，我们希望allowedLocality严格一点儿，
// 后面如果分配失败了，再逐渐放松要求
var allowedLocality = maxLocality
// 如果 maxLocality == TaskLocality.NO_PREF，那么allowedLocality = maxLocality，
// 进入TaskLocality.NO_PREF本来就是对调度没有任何要求
if (maxLocality != TaskLocality.NO_PREF) {
allowedLocality = getAllowedLocalityLevel(curTime) // 根据当前的时间，更新当前时间节点下的allowedLocality，这个allowedLocality可能小于maxLocality,这时候就会使用这个更小(更严格)的locality
if (allowedLocality > maxLocality) { // 如果，allowedLocality比maxLocality更松弛，那么还是以maxLocality为准
// We're not allowed to search for farther-away tasks
allowedLocality = maxLocality
}
}
// 根据当前计算得到的locality 弹出对应的tasks，然后调度起来
dequeueTask(execId, host, allowedLocality).map { case ((index, taskLocality, speculative)) =>
// Found a task; do some bookkeeping and return a task description
.....
currentLocalityIndex = getLocalityIndex(taskLocality)
lastLaunchTime = curTime // 更新这个TaskSetManager的最后一个task的启动时间
.....
....
addRunningTask(taskId)
sched.dagScheduler.taskStarted(task, info)
new TaskDescription(.....)
}
}

复制代码

这个方法的基本逻辑为:

准备计算allowedLocality，初始值为maxLocality：
1. var allowedLocality = maxLocality
复制代码
尝试更新allowedLocality：
为什么allowedLocality有大概会跟maxLocality差别？原因是maxLocality代表了这个TaskSet的locality数组的外层循环(resourceOffers()方法会按照locality从第到高进行Task的分配)，但是在延迟调度的场景下，在某一个locality level(比如NODE_LOCAL)没有符合资源的情形下，在配置的退化时间到来之前，不急于将locality进行退化处置惩罚，即myLocalityLevels(currentLocalityIndex)不变，依然为NODE_LOCAL。在下一轮调度重新到来的时间(resourceOffers()方法重新运行)，在尝试到maxLocality = ALL的时间，忽然有了可以满足locality = NODE_LOCAL的资源，这时间getAllowedLocalityLevel()就会返回myLocalityLevels(currentLocalityIndex) = NODE_LOCAL，由于这个locality level的超时时间还没到，进而maxLocality更新为NODE_LOCAL，从而在最大运行的locality为NODE_LOCAL(而不是resourceOffers()循环到的位置ALL)当前的束缚下进行使命调度；
1. // 如果 maxLocality == TaskLocality.NO_PREF，那么allowedLocality = maxLocality，
2. // 进入TaskLocality.NO_PREF本来就是对调度没有任何要求
3. if (maxLocality != TaskLocality.NO_PREF) {
4. allowedLocality = getAllowedLocalityLevel(curTime) // 根据当前的时间，更新当前时间节点下的allowedLocality，这个allowedLocality可能小于maxLocality,这时候就会使用这个更小(更严格)的locality
5. if (allowedLocality > maxLocality) { // 如果，allowedLocality比maxLocality更松弛，那么还是以maxLocality为准
6. // We're not allowed to search for farther-away tasks
7. allowedLocality = maxLocality
8. }
9. }
复制代码
根据当前计算得到的allowedLocality，尝试从这个WorkOffer对应的execID,host中弹出一个task并返回结果：
1. // 根据当前计算得到的locality 弹出对应的tasks，然后调度起来
2. dequeueTask(execId, host, allowedLocality).map { case ((index, taskLocality, speculative)) =>
3. // Found a task; do some bookkeeping and return a task description
4. .....
5. currentLocalityIndex = getLocalityIndex(taskLocality)
6. lastLaunchTime = curTime // 更新这个TaskSetManager的最后一个task的启动时间
7. .....
8. ....
9. addRunningTask(taskId)
10. sched.dagScheduler.taskStarted(task, info)
11. new TaskDescription(.....)
12. }
复制代码

可以看到，TaskSetManager.getAllowedLocalityLevel()方法根据当前时间，计算当前真正的答应的locality，其核心原理为，每一个TaskSet都维护了currentLocalityIndex，myLocalityLevels[currentLocalityIndex]就对应了这个TaskSet当前应该利用的Locality级别。只有在某种条件下，currentLocalityIndex索引会通过自增的方式向前移动以移动到下一个Locality(即Locality变得更松懈)，即getAllowedLocalityLevel()方法控制了currentLocalityIndex移动条件，保证currentLocalityIndex只有在某些条件下才气移动到下一级。移动到下一级的条件为：

当前currentLocalityIndex对应的locality的pendingTasks下面已经没有task，天然应该移动到下一个Locality，这个很容易理解，即当前的locality的所有task已经处置惩罚完了。
虽然当前currentLocalityIndex对应的locality的pendingTasks下面有task没处置惩罚完，但是基于该locality已经有很长时间无法launch一个task了(即当前时间已经凌驾了这个TaskSet的上一次成功launch一个task的时间)，没有必要继续再尝试这个Locality了，只好移动到下一个Locality(Locality 降级)。

具体代码如下所示:

------------------------------------ TaskSetManager --------------------------------------------
private def getAllowedLocalityLevel(curTime: Long): TaskLocality.TaskLocality = {
// Remove the scheduled or finished tasks lazily
def tasksNeedToBeScheduledFrom(pendingTaskIds: ArrayBuffer[Int]): Boolean = {
....
}
def moreTasksToRunIn(pendingTasks: HashMap[String, ArrayBuffer[Int]]): Boolean = {
......
}
// 循环遍历所有位置级别，直到达到最后一级，currentLocalityIndex从小到大，本地性由强到弱
while (currentLocalityIndex < myLocalityLevels.length - 1) {
// 查看当前的locality下面是否有task还没有运行
val moreTasks = myLocalityLevels(currentLocalityIndex) match {
case TaskLocality.PROCESS_LOCAL => moreTasksToRunIn(pendingTasksForExecutor)
case TaskLocality.NODE_LOCAL => moreTasksToRunIn(pendingTasksForHost)
case TaskLocality.NO_PREF => pendingTasksWithNoPrefs.nonEmpty
case TaskLocality.RACK_LOCAL => moreTasksToRunIn(pendingTasksForRack)
}
if (!moreTasks) {
// 如果当前locality级别没有更多的任务，将lastLaunchTime更新为当前时间，并将位置级别索引增加1。
lastLaunchTime = curTime
currentLocalityIndex += 1 // 查看下一级locality
} else if (curTime - lastLaunchTime >= localityWaits(currentLocalityIndex)) {
// 尽管这个localityWaits(currentLocalityIndex) 的locality上还有对应的pending task，
// 但是达到等待时间的情况，会进入下一个locality
lastLaunchTime += localityWaits(currentLocalityIndex)
currentLocalityIndex += 1 // 查看下一级locality
} else {
// 在myLocalityLevels(currentLocalityIndex)的locality下还有更多的task，并且没有超时，那么就返回这个locality
return myLocalityLevels(currentLocalityIndex)
}
}
myLocalityLevels(currentLocalityIndex)
}

复制代码

其中，TaskSetManager中的lastLaunchTime维护了这个TaskSet中上一次启动一个Task的时间，即，只要我们在满足allowedLocalityLevel的条件下，在某一个WorkOffer上启动了一个Task，那么就将这个TaskSet的lastLaunchTime更新为当前时间。这样，如果一个TaskSet上长时间无法launch一个Task，那么这个allowedLocalityLevel就会自增，即低落Locality Preference的要求。
通过TaskSetManager中getLocalityWait()方法可以看到我们可以为每一个TaskLocality配置差别的超时时间：

------------------------------------ TaskSetManager -------------------------------------------
private def getLocalityWait(level: TaskLocality.TaskLocality): Long = {
val defaultWait = conf.get(config.LOCALITY_WAIT)
val localityWaitKey = level match {
case TaskLocality.PROCESS_LOCAL => "spark.locality.wait.process"
case TaskLocality.NODE_LOCAL => "spark.locality.wait.node"
case TaskLocality.RACK_LOCAL => "spark.locality.wait.rack"
case _ => null
}
if (localityWaitKey != null) {
conf.getTimeAsMs(localityWaitKey, defaultWait.toString)
} else {
0L
}
}

复制代码

在获取了最后的allowedLocalityLevel以后，就尝试通过dequeueTask(execId, host, allowedLocality)方法，根据当前WorkOffer的信息(execID, host)以及allowedLocalityLevel，从对应的pendingTask中取出符合要求的task，这里的符合要求指的是，这个Task的locality偏好是答应在这个host+executorID上运行的，并且如果在这个host + executorID上运行，其locality是不会比maxLocality更差的。即：
dequeueTask()的逻辑如下所示。它根据当前的WorkerOffer(Executor ID + hostname)和答应的最大的locality(maxLocality)，返回一个locality满足要求的Task准备运行。这里的满足要求指的是，这个Task如果运行在这个WorkerOffer上，能够满足所答应的最差的locality(maxLocality)的要求。

------------------------------------ TaskSetManager --------------------------------------------
private def dequeueTask(execId: String, host: String, maxLocality: TaskLocality.Value)
: Option[(Int, TaskLocality.Value, Boolean)] =
{
// TaskLocality.isAllowed(maxLocality, TaskLocality.PROCESS_LOCAL) 永远返回true，因此不做判断
for (index <- dequeueTaskFromList(execId, host, getPendingTasksForExecutor(execId))) {
return Some((index, TaskLocality.PROCESS_LOCAL, false))
}
// 假如maxLocality允许NODE_LOCAL，那么就从getPendingTasksForHost中取出对应的Host的task返回
if (TaskLocality.isAllowed(maxLocality, TaskLocality.NODE_LOCAL)) {
for (index <- dequeueTaskFromList(execId, host, getPendingTasksForHost(host))) {
return Some((index, TaskLocality.NODE_LOCAL, false))
}
}
// 假如maxLocality允许NO_PREF，那么就从pendingTasksWithNoPrefs中取出对应的task返回
if (TaskLocality.isAllowed(maxLocality, TaskLocality.NO_PREF)) {
// Look for noPref tasks after NODE_LOCAL for minimize cross-rack traffic
for (index <- dequeueTaskFromList(execId, host, pendingTasksWithNoPrefs)) {
return Some((index, TaskLocality.PROCESS_LOCAL, false))
}
}
// 假如maxLocality允许RACK_LOCAL，那么就获取当前的host的rack，从对应的rack的pendingTasks中取出对应的task返回
if (TaskLocality.isAllowed(maxLocality, TaskLocality.RACK_LOCAL)) {
for {
rack <- sched.getRackForHost(host)
index <- dequeueTaskFromList(execId, host, getPendingTasksForRack(rack))
} {
return Some((index, TaskLocality.RACK_LOCAL, false))
}
}
if (TaskLocality.isAllowed(maxLocality, TaskLocality.ANY)) {
for (index <- dequeueTaskFromList(execId, host, allPendingTasks)) {
return Some((index, TaskLocality.ANY, false))
}
}
// find a speculative task if all others tasks have been scheduled
dequeueSpeculativeTask(execId, host, maxLocality).map {
case (taskIndex, allowedLocality) => (taskIndex, allowedLocality, true)}
}

复制代码

示例1：
当前的WorkerOffer信息如下:

Hostname: testhost1.corp.com
ExecutorID: 25536
当前的maxLocality = NODE_LOCAL， testhost1-rack1.corp.com地点的rack为rack1

在当前的TaskSet中pendingTasksForExecutor中有一个Task A，其preference恰好是 Executor 25535。因此，根据上文所讲解的pendingTask的添加过程，Task A会同时存在于pendingTasksForExecutor，pendingTasksForHost，pendingTasksForRack以及allTasks中。
这时间dequeueTask()的运行过程为:

由于pendingTasksForExecutor中没有针对Executor 25536的Task A，因此返回Task A，Task A的实际locality是更好的PROCESS_LOCAL，满足了maxLocality = NODE_LOCAL的要求

示例2：
当前的WorkerOffer信息如下:

Hostname: testhost1.corp.com
ExecutorID: 25536
当前的maxLocality = NODE_LOCAL， testhost1-rack1.corp.com地点的rack为rack1

在当前的TaskSet中pendingTasksForExecutor中没有针对ExecutorID: 25536
的pending Task。但是在pendingTasksForHost中有一个Task B，其 locality preference恰好是Hostname: testhost1.corp.com。根据上文所讲解的pendingTask的添加过程，Task B会同时存在于pendingTasksForHost，pendingTasksForRack以及allTasks中。
这时间dequeueTask()的运行过程为:

由于pendingTasksForExecutor中没有针对Hostname: testhost1.corp.com，因此跳过；
由于pendingTasksForHost中有针对Hostname testhost1.corp.com的Task，因此取出并返回Task B，此时的实际locality恰好等于maxLocality，为 NODE_LOCAL

示例3：
当前的WorkerOffer信息如下:

Hostname: testhost1.corp.com
ExecutorID: 25536
当前的maxLocality = RACK_LOCAL， testhost1-rack1.corp.com地点的rack为rack1

在当前的TaskSet中pendingTasksForExecutor中没有针对ExecutorID: 25536
的pending Task。但是在pendingTasksForHost中有一个Task B，其 locality preference是Hostname: testhost2-rack1.corp.com，即它的locality preference同当前的WorkerOffer不在一个节点但是在一个机架上。根据上文所讲解的pendingTask的添加过程，Task B会同时存在于pendingTasksForHost，pendingTasksForRack以及allTasks中。
这时间dequeueTask()的运行过程为:

由于pendingTasksForExecutor中没有针对Hostname: testhost1.corp.com，因此跳过
由于pendingTasksForHost中有针对Hostname testhost1.corp.com的Task，因此跳过
由于pendingTasksForHost中有针对rack1的Task B，因此取出并返回Task B，此时的实际locality是RACK_LOCAL

示例4：
当前的WorkerOffer信息如下:

Hostname: testhost1.corp.com
ExecutorID: 25536
当前的maxLocality = ANY， testhost1-rack1.corp.com地点的rack为rack1

在当前的TaskSet中pendingTasksForExecutor中没有针对ExecutorID: 25536的的pending Task。pendingTasksWithNoPrefs也没有Task。但是在pendingTasksForHost中有一个Task D，其 locality preference是别的一个Hostname: testhost2-rack2.corp.com(这个testhost2地点的rack为rack2)。根据上文所讲解的pendingTask的添加过程，Task D会同时存在于pendingTasksForHost，pendingTasksForRack以及allTasks中。
这时间dequeueTask()的运行过程为:

由于pendingTasksForExecutor中没有针对ExecutorID: 25536，因此跳过
由于pendingTasksForHost中没有针对Hostname testhost2-rack2.corp.com的Task，因此跳过
由于pendingTasksWithNoPrefs中没有任何Task，因此跳过
由于getPendingTasksForRack中没有针对rack1的任何task，因此跳过
在allPendingTasks中取出Task D，满足maxLocality = NODE_LOCAL 要求，因此取出并返回Task D，此时实际的locality恰好等于maxLocality，为ANY。

结语

Spark在Yarn上的资源管理是粗粒度的资源管理(Coarse Grained)，即资源和Task并不一一对应。ApplicationMaster作为资源请求的署理，充当了细粒度的Task和粗粒度的Yarn Container之间的桥梁，即，根据细粒度的、全量的Task资源需求状态不断生成增量的、粗粒度的资源请求，并将Yarn不断异步返回的资源和当前的Task进行匹配，以最优化Task的放置。
本文全面地讲解了整个资源请求和调度过程。读者从中可以看到基于Yarn的整个资源调度过程，Spark的差别角色之间的通信过程，整个过程虽然是以Yarn为基础，但是其反映的是一个通用的资源调度决定的基本思路，因此很有参考意义。
我以为，感兴趣的读者，可以和Spark on K8S这种细粒度的资源调度过程作比力，将会有更多的收益。

免责声明：如果侵犯了您的权益，请联系站长，我们会及时删除侵权内容，谢谢合作！更多信息从访问主页：qidao123.com:ToB企服之家，中国第一个企服评测及商务社交产业平台。

		自动登录	找回密码
密码			立即注册

Spark的动态资源分配算法

本帖子中包含更多资源

0 个回复

快速回复

楼主热帖

标签云

浏览过的版块