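1. Introduction
2. Migration sequence diagram
A chunk migration request is sent from the config server to the source shard (the donor). The donor then sends a migration request to the target shard (the recipient), and the two cooperate to complete the rest of the migration.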

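As the diagram shows, its five steps correspond to the five steps of the migration flow described in the previous article. The recipient's flow-control code lives in the _migrateDriver method in migration_destination_manager.cpp, and the donor's flow-control code lives in the _runImpl method in move_chunk_command.cpp. The donor code is as follows:
- static void _runImpl(OperationContext* opCtx, const MoveChunkRequest& moveChunkRequest) {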
- const auto writeConcernForRangeDeleter =
- uassertStatusOK(ChunkMoveWriteConcernOptions::getEffectiveWriteConcern(
- opCtx, moveChunkRequest.getSecondaryThrottle()));
- // Resolve the donor and recipient shards and their connection string
- auto const shardRegistry = Grid::get(opCtx)->shardRegistry();
- // Prepare the connections to the donor and the recipient
- const auto donorConnStr =
- uassertStatusOK(shardRegistry->getShard(opCtx, moveChunkRequest.getFromShardId()))
- ->getConnString();
- const auto recipientHost = uassertStatusOK([&] {
- auto recipientShard =
- uassertStatusOK(shardRegistry->getShard(opCtx, moveChunkRequest.getToShardId()));
- return recipientShard->getTargeter()->findHost(
- opCtx, ReadPreferenceSetting{ReadPreference::PrimaryOnly});
- }());
- std::string unusedErrMsg;
- // Records how long each step takes
- MoveTimingHelper moveTimingHelper(opCtx,
- "from",
- moveChunkRequest.getNss().ns(),
- moveChunkRequest.getMinKey(),
- moveChunkRequest.getMaxKey(),
- 6, // Total number of steps
- &unusedErrMsg,
- moveChunkRequest.getToShardId(),
- moveChunkRequest.getFromShardId());
- moveTimingHelper.done(1);
- moveChunkHangAtStep1.pauseWhileSet();
- if (moveChunkRequest.getFromShardId() == moveChunkRequest.getToShardId()) {
- // TODO: SERVER-46669 handle wait for delete.
- return;
- }
- // Construct the migration source manager
- MigrationSourceManager migrationSourceManager(
- opCtx, moveChunkRequest, donorConnStr, recipientHost);
- moveTimingHelper.done(2);
- moveChunkHangAtStep2.pauseWhileSet();
- // Send the migration command to the recipient
- uassertStatusOKWithWarning(migrationSourceManager.startClone());
- moveTimingHelper.done(3);
- moveChunkHangAtStep3.pauseWhileSet();
- // Wait until both the chunk data and the modifications have been copied
- uassertStatusOKWithWarning(migrationSourceManager.awaitToCatchUp());
- moveTimingHelper.done(4);
- moveChunkHangAtStep4.pauseWhileSet();
- // Enter the critical section
- uassertStatusOKWithWarning(migrationSourceManager.enterCriticalSection());
- // Notify the recipient
- uassertStatusOKWithWarning(migrationSourceManager.commitChunkOnRecipient());
- moveTimingHelper.done(5);
- moveChunkHangAtStep5.pauseWhileSet();
- // Commit the chunk metadata on the config server
- uassertStatusOKWithWarning(migrationSourceManager.commitChunkMetadataOnConfig());
- moveTimingHelper.done(6);
- moveChunkHangAtStep6.pauseWhileSet();
- }
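The sections below analyze the code for each step.
3. Source code analysis of each step
3.1 Starting the migration (_recvChunkStart)
1. Validate the parameters. This is done in the MigrationSourceManager constructor and is not covered here.
2. Register a listener that records the changes made to the chunk during the migration.
3. Send the migration command _recvChunkStart to the recipient.
Steps 2 and 3 are implemented in a single method:
- Status MigrationSourceManager::startClone() {
- ... // some code omitted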
- _cloneAndCommitTimer.reset();
- auto replCoord = repl::ReplicationCoordinator::get(_opCtx);
- auto replEnabled = replCoord->isReplEnabled();
- {
- const auto metadata = _getCurrentMetadataAndCheckEpoch();
- // Having the metadata manager registered on the collection sharding state is what indicates
- // that a chunk on that collection is being migrated. With an active migration, write
- // operations require the cloner to be present in order to track changes to the chunk which
- // needs to be transmitted to the recipient.
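- // Register the listener. Besides cloning data, _cloneDriver also records incremental changes made to the chunk during the migration (for example, newly inserted documents).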
- _cloneDriver = std::make_unique<MigrationChunkClonerSourceLegacy>(
- _args, metadata.getKeyPattern(), _donorConnStr, _recipientHost);
- AutoGetCollection autoColl(_opCtx,
- getNss(),
- replEnabled ? MODE_IX : MODE_X,
- AutoGetCollectionViewMode::kViewsForbidden,
- _opCtx->getServiceContext()->getPreciseClockSource()->now() +
- Milliseconds(migrationLockAcquisitionMaxWaitMS.load()));
- auto csr = CollectionShardingRuntime::get(_opCtx, getNss());
- auto lockedCsr = CollectionShardingRuntime::CSRLock::lockExclusive(_opCtx, csr);
- invariant(nullptr == std::exchange(msmForCsr(csr), this));
- _coordinator = std::make_unique<migrationutil::MigrationCoordinator>(
- _cloneDriver->getSessionId(),
- _args.getFromShardId(),
- _args.getToShardId(),
- getNss(),
- *_collectionUUID,
- ChunkRange(_args.getMinKey(), _args.getMaxKey()),
- _chunkVersion,
- _args.getWaitForDelete());
- _state = kCloning;
- }
- if (replEnabled) {
- auto const readConcernArgs = repl::ReadConcernArgs(
- replCoord->getMyLastAppliedOpTime(), repl::ReadConcernLevel::kLocalReadConcern);
- // Check that this node's state satisfies repl::ReadConcernLevel::kLocalReadConcern
- auto waitForReadConcernStatus =
- waitForReadConcern(_opCtx, readConcernArgs, StringData(), false);
- if (!waitForReadConcernStatus.isOK()) {
- return waitForReadConcernStatus;
- }
- setPrepareConflictBehaviorForReadConcern(
- _opCtx, readConcernArgs, PrepareConflictBehavior::kEnforce);
- }
- _coordinator->startMigration(_opCtx);
- // Send the recipient the command that starts the data copy (_recvChunkStart)
- Status startCloneStatus = _cloneDriver->startClone(_opCtx,
- _coordinator->getMigrationId(),
- _coordinator->getLsid(),
- _coordinator->getTxnNumber());
- if (!startCloneStatus.isOK()) {
- return startCloneStatus;
- }
- scopedGuard.dismiss();
- return Status::OK();
- }
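After receiving the migration request, the recipient first checks whether the collection exists locally. If it does not, the recipient creates the collection and then creates its indexes:
- void MigrationDestinationManager::cloneCollectionIndexesAndOptions(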
- OperationContext* opCtx,
- const NamespaceString& nss,
- const CollectionOptionsAndIndexes& collectionOptionsAndIndexes) {
- {
- // 1. Create the collection (if it doesn't already exist) and create any indexes we are
- // missing (auto-heal indexes).
- ... // some code omitted
- {
- AutoGetCollection collection(opCtx, nss, MODE_IS);
- // If the collection exists and no indexes are missing, return
- if (collection) {
- checkUUIDsMatch(collection.getCollection());
- auto indexSpecs =
- checkEmptyOrGetMissingIndexesFromDonor(collection.getCollection());
- if (indexSpecs.empty()) {
- return;
- }
- }
- }
- // Take the exclusive database lock if the collection does not exist or indexes are missing
- // (needs auto-heal).
- // Creating the collection requires an exclusive lock on the database
- AutoGetDb autoDb(opCtx, nss.db(), MODE_X);
- auto db = autoDb.ensureDbExists();
- auto collection = CollectionCatalog::get(opCtx)->lookupCollectionByNamespace(opCtx, nss);
- if (collection) {
- checkUUIDsMatch(collection);
- } else {
- ... // some code omitted
- // We do not have a collection by this name. Create the collection with the donor's
- // options.
- // Create the collection
- OperationShardingState::ScopedAllowImplicitCollectionCreate_UNSAFE
- unsafeCreateCollection(opCtx);
- WriteUnitOfWork wuow(opCtx);
- CollectionOptions collectionOptions = uassertStatusOK(
- CollectionOptions::parse(collectionOptionsAndIndexes.options,
- CollectionOptions::ParseKind::parseForStorage));
- const bool createDefaultIndexes = true;
- uassertStatusOK(db->userCreateNS(opCtx,
- nss,
- collectionOptions,
- createDefaultIndexes,
- collectionOptionsAndIndexes.idIndexSpec));
- wuow.commit();
- collection = CollectionCatalog::get(opCtx)->lookupCollectionByNamespace(opCtx, nss);
- }
- // Create the missing indexes
- auto indexSpecs = checkEmptyOrGetMissingIndexesFromDonor(collection);
- if (!indexSpecs.empty()) {
- WriteUnitOfWork wunit(opCtx);
- auto fromMigrate = true;
- CollectionWriter collWriter(opCtx, collection->uuid());
- IndexBuildsCoordinator::get(opCtx)->createIndexesOnEmptyCollection(
- opCtx, collWriter, indexSpecs, fromMigrate);
- wunit.commit();
- }
- }
- }
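The decision the recipient makes above can be sketched in isolation. This is a simplified stand-in, not the MongoDB implementation: `missingIndexes`, `planRecipientSetup`, and `Action` are hypothetical names, and indexes are compared by name only, whereas the comparison logic of the real checkEmptyOrGetMissingIndexesFromDonor is not shown in this article.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: returns the donor index names that are absent locally,
// preserving the donor's order. Comparing by name only is an assumption.
std::vector<std::string> missingIndexes(const std::vector<std::string>& donorIndexes,
                                        const std::vector<std::string>& localIndexes) {
    std::vector<std::string> missing;
    for (const auto& name : donorIndexes) {
        if (std::find(localIndexes.begin(), localIndexes.end(), name) == localIndexes.end()) {
            missing.push_back(name);
        }
    }
    return missing;
}

// What the recipient has to do before it can start cloning documents.
enum class Action {
    kNothing,           // collection exists and no index is missing: early return
    kCreateIndexes,     // collection exists but some indexes are missing (auto-heal)
    kCreateCollection   // collection is absent: create it, then any missing indexes
};

Action planRecipientSetup(bool collectionExists,
                          const std::vector<std::string>& donorIndexes,
                          const std::vector<std::string>& localIndexes) {
    if (!collectionExists) {
        return Action::kCreateCollection;
    }
    return missingIndexes(donorIndexes, localIndexes).empty() ? Action::kNothing
                                                              : Action::kCreateIndexes;
}
```

Only the first outcome avoids the exclusive database lock, which is why the original code checks for it under a weaker lock before doing anything else.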
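3.2 Recipient pulls the existing data (_migrateClone)
1. Define a function that inserts documents in batches.
2. Define a function that fetches data in batches.
3. Define the producer-consumer queue.
4. Start the inserter thread, which consumes batches from the queue and calls the batch insert function to store the documents locally.
5. Repeatedly send fetch requests to the donor (using the function from step 2) and push the results onto the queue from step 3.
6. When the fetch finishes (an empty batch is pushed onto the queue, which ends the step 5 loop and tells the inserter thread to exit), wait synchronously for the inserter thread from step 4 to finish.
The detailed code is as follows:
- // 1. Define the batch insert function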
- auto insertBatchFn = [&](OperationContext* opCtx, BSONObj arr) {
- auto it = arr.begin();
- while (it != arr.end()) {
- int batchNumCloned = 0;
- int batchClonedBytes = 0;
- const int batchMaxCloned = migrateCloneInsertionBatchSize.load();
- assertNotAborted(opCtx);
- write_ops::InsertCommandRequest insertOp(_nss);
- insertOp.getWriteCommandRequestBase().setOrdered(true);
- insertOp.setDocuments([&] {
- std::vector<BSONObj> toInsert;
- while (it != arr.end() &&
- (batchMaxCloned <= 0 || batchNumCloned < batchMaxCloned)) {
- const auto& doc = *it;
- BSONObj docToClone = doc.Obj();
- toInsert.push_back(docToClone);
- batchNumCloned++;
- batchClonedBytes += docToClone.objsize();
- ++it;
- }
- return toInsert;
- }());
- const auto reply =
- write_ops_exec::performInserts(opCtx, insertOp, OperationSource::kFromMigrate);
- for (unsigned long i = 0; i < reply.results.size(); ++i) {
- uassertStatusOKWithContext(
- reply.results[i],
- str::stream() << "Insert of " << insertOp.getDocuments()[i] << " failed.");
- }
- {
- stdx::lock_guard<Latch> statsLock(_mutex);
- _numCloned += batchNumCloned;
- ShardingStatistics::get(opCtx).countDocsClonedOnRecipient.addAndFetch(
- batchNumCloned);
- _clonedBytes += batchClonedBytes;
- }
- if (_writeConcern.needToWaitForOtherNodes()) {
- runWithoutSession(outerOpCtx, [&] {
- repl::ReplicationCoordinator::StatusAndDuration replStatus =
- repl::ReplicationCoordinator::get(opCtx)->awaitReplication(
- opCtx,
- repl::ReplClientInfo::forClient(opCtx->getClient()).getLastOp(),
- _writeConcern);
- if (replStatus.status.code() == ErrorCodes::WriteConcernFailed) {
- LOGV2_WARNING(22011,
- "secondaryThrottle on, but doc insert timed out; continuing",
- "migrationId"_attr = _migrationId->toBSON());
- } else {
- uassertStatusOK(replStatus.status);
- }
- });
- }
- sleepmillis(migrateCloneInsertionBatchDelayMS.load());
- }
- };
- // 2. Define the batch fetch function
- auto fetchBatchFn = [&](OperationContext* opCtx) {
- auto res = uassertStatusOKWithContext(
- fromShard->runCommand(opCtx,
- ReadPreferenceSetting(ReadPreference::PrimaryOnly),
- "admin",
- migrateCloneRequest,
- Shard::RetryPolicy::kNoRetry),
- "_migrateClone failed: ");
- uassertStatusOKWithContext(Shard::CommandResponse::getEffectiveStatus(res),
- "_migrateClone failed: ");
- return res.response;
- };
- SingleProducerSingleConsumerQueue<BSONObj>::Options options;
- options.maxQueueDepth = 1;
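- // 3. Use a producer-consumer queue to write the fetched data locally
- SingleProducerSingleConsumerQueue<BSONObj> batches(options);
- repl::OpTime lastOpApplied;
- // 4. Define the inserter thread: it reads batches from the queue and writes them to the local node, exiting once there is no more data to sync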
- stdx::thread inserterThread{[&] {
- Client::initThread("chunkInserter", opCtx->getServiceContext(), nullptr);
- auto client = Client::getCurrent();
- {
- stdx::lock_guard lk(*client);
- client->setSystemOperationKillableByStepdown(lk);
- }
- auto executor =
- Grid::get(opCtx->getServiceContext())->getExecutorPool()->getFixedExecutor();
- auto inserterOpCtx = CancelableOperationContext(
- cc().makeOperationContext(), opCtx->getCancellationToken(), executor);
- auto consumerGuard = makeGuard([&] {
- batches.closeConsumerEnd();
- lastOpApplied = repl::ReplClientInfo::forClient(inserterOpCtx->getClient()).getLastOp();
- });
- try {
- while (true) {
- auto nextBatch = batches.pop(inserterOpCtx.get());
- auto arr = nextBatch["objects"].Obj();
- if (arr.isEmpty()) {
- return;
- }
- insertBatchFn(inserterOpCtx.get(), arr);
- }
- } catch (...) {
- stdx::lock_guard<Client> lk(*opCtx->getClient());
- opCtx->getServiceContext()->killOperation(lk, opCtx, ErrorCodes::Error(51008));
- LOGV2(21999,
- "Batch insertion failed: {error}",
- "Batch insertion failed",
- "error"_attr = redact(exceptionToStatus()));
- }
- }};
- {
- // 6. makeGuard defers inserterThread.join() until this scope exits
- auto inserterThreadJoinGuard = makeGuard([&] {
- batches.closeProducerEnd();
- inserterThread.join();
- });
- // 5. Send fetch requests to the donor and push the data onto the queue
- while (true) {
- auto res = fetchBatchFn(opCtx);
- try {
- batches.push(res.getOwned(), opCtx);
- auto arr = res["objects"].Obj();
- if (arr.isEmpty()) {
- break;
- }
- } catch (const ExceptionFor<ErrorCodes::ProducerConsumerQueueEndClosed>&) {
- break;
- }
- }
- } // This scope ensures that the guard is destroyed
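The producer-consumer structure of this step can be reduced to a small standalone sketch. It assumes a hand-written `BoundedQueue` in place of SingleProducerSingleConsumerQueue and a `Batch` of integers in place of a BSON array; `cloneChunk` is a hypothetical name. Error handling (the guard objects and the closed-queue exception in the original) is deliberately left out.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

using Batch = std::vector<int>;  // stand-in for a BSON array of documents

// Hypothetical bounded queue: push blocks while full, pop blocks while empty.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t maxDepth) : _maxDepth(maxDepth) {
        assert(maxDepth > 0);
    }

    void push(Batch batch) {
        std::unique_lock<std::mutex> lk(_mutex);
        _notFull.wait(lk, [&] { return _items.size() < _maxDepth; });
        _items.push(std::move(batch));
        _notEmpty.notify_one();
    }

    Batch pop() {
        std::unique_lock<std::mutex> lk(_mutex);
        _notEmpty.wait(lk, [&] { return !_items.empty(); });
        Batch batch = std::move(_items.front());
        _items.pop();
        _notFull.notify_one();
        return batch;
    }

private:
    const std::size_t _maxDepth;
    std::mutex _mutex;
    std::condition_variable _notFull;
    std::condition_variable _notEmpty;
    std::queue<Batch> _items;
};

// Fetches batches until an empty one arrives, while a second thread inserts them.
// Returns the number of documents handed to insertBatch.
std::size_t cloneChunk(const std::function<Batch()>& fetchBatch,
                       const std::function<void(const Batch&)>& insertBatch) {
    BoundedQueue batches(1);  // depth 1, as in the original options.maxQueueDepth
    std::size_t numCloned = 0;

    std::thread inserter([&] {
        while (true) {
            Batch batch = batches.pop();
            if (batch.empty()) {
                return;  // the empty batch is the end-of-data marker
            }
            insertBatch(batch);
            numCloned += batch.size();
        }
    });

    while (true) {
        Batch batch = fetchBatch();
        const bool last = batch.empty();
        batches.push(std::move(batch));  // the empty batch is forwarded too
        if (last) {
            break;
        }
    }

    inserter.join();   // after the join, numCloned is safe to read
    return numCloned;
}
```

With a queue depth of 1, the fetch of the next batch overlaps with the insert of the current one, but the producer can never run more than one batch ahead of the consumer.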
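3.3 Recipient pulls the modifications (_transferMods)
In this step the recipient pulls the modifications, that is, the inserts, updates, and deletes applied to this chunk on the donor while the earlier steps were running. The code is as follows:
- // Sync the modifications (_transferMods)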
- const BSONObj xferModsRequest = createTransferModsRequest(_nss, *_sessionId);
- {
- // 5. Do bulk of mods
- // 5. Pull modifications in batches, looping until none remain
- _setState(CATCHUP);
- while (true) {
- auto res = uassertStatusOKWithContext(
- fromShard->runCommand(opCtx,
- ReadPreferenceSetting(ReadPreference::PrimaryOnly),
- "admin",
- xferModsRequest,
- Shard::RetryPolicy::kNoRetry),
- "_transferMods failed: ");
- uassertStatusOKWithContext(Shard::CommandResponse::getEffectiveStatus(res),
- "_transferMods failed: ");
- const auto& mods = res.response;
- if (mods["size"].number() == 0) {
- // There are no more pending modifications to be applied. End the catchup phase
- // Stop looping once there are no more modifications
- break;
- }
- // Apply the fetched modifications
- if (!_applyMigrateOp(opCtx, mods, &lastOpApplied)) {
- continue;
- }
- const int maxIterations = 3600 * 50;
- // Wait for the secondaries to finish replicating
- int i;
- for (i = 0; i < maxIterations; i++) {
- opCtx->checkForInterrupt();
- outerOpCtx->checkForInterrupt();
- if (getState() == ABORT) {
- LOGV2(22002,
- "Migration aborted while waiting for replication at catch up stage",
- "migrationId"_attr = _migrationId->toBSON());
- return;
- }
- if (runWithoutSession(outerOpCtx, [&] {
- return opReplicatedEnough(opCtx, lastOpApplied, _writeConcern);
- })) {
- break;
- }
- if (i > 100) {
- LOGV2(22003,
- "secondaries having hard time keeping up with migrate",
- "migrationId"_attr = _migrationId->toBSON());
- }
- sleepmillis(20);
- }
- if (i == maxIterations) {
- _setStateFail("secondary can't keep up with migrate");
- return;
- }
- }
- timing.done(5);
- migrateThreadHangAtStep5.pauseWhileSet();
- }
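The inner loop above is a bounded poll: with maxIterations = 3600 * 50 = 180,000 and a 20 ms sleep per iteration, the recipient waits at least 180,000 × 20 ms = 3,600 s, that is, about one hour, before it gives up. A minimal sketch of the same pattern follows, assuming hypothetical names (`waitForReplication`, `WaitResult`) and plain callbacks in place of opReplicatedEnough and the ABORT state check.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

enum class WaitResult { kReplicated, kAborted, kTimedOut };

// Polls until the write is replicated, the migration is aborted, or the
// iteration budget is used up. All names here are illustrative.
WaitResult waitForReplication(const std::function<bool()>& replicatedEnough,
                              const std::function<bool()>& aborted,
                              int maxIterations,
                              std::chrono::milliseconds pollInterval) {
    assert(maxIterations > 0);
    for (int i = 0; i < maxIterations; ++i) {
        if (aborted()) {
            return WaitResult::kAborted;
        }
        if (replicatedEnough()) {
            return WaitResult::kReplicated;
        }
        std::this_thread::sleep_for(pollInterval);
    }
    // Corresponds to the "secondary can't keep up with migrate" failure above.
    return WaitResult::kTimedOut;
}
```

Checking for abort before checking replication mirrors the order in the original loop, so an aborted migration is reported as aborted even if the secondaries happen to have caught up.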
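Once the modifications have been pulled, the recipient waits for the donor to enter the critical section. The donor blocks write requests only inside the critical section, so until it gets there the recipient must keep pulling modifications:
- // 6. Wait for commit
- // 6. Wait for the donor to enter the critical section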
- _setState(STEADY);
- bool transferAfterCommit = false;
- while (getState() == STEADY || getState() == COMMIT_START) {
- opCtx->checkForInterrupt();
- outerOpCtx->checkForInterrupt();
- // Make sure we do at least one transfer after recv'ing the commit message. If we
- // aren't sure that at least one transfer happens *after* our state changes to
- // COMMIT_START, there could be mods still on the FROM shard that got logged
- // *after* our _transferMods but *before* the critical section.
- if (getState() == COMMIT_START) {
- transferAfterCommit = true;
- }
- auto res = uassertStatusOKWithContext(
- fromShard->runCommand(opCtx,
- ReadPreferenceSetting(ReadPreference::PrimaryOnly),
- "admin",
- xferModsRequest,
- Shard::RetryPolicy::kNoRetry),
- "_transferMods failed in STEADY STATE: ");
- uassertStatusOKWithContext(Shard::CommandResponse::getEffectiveStatus(res),
- "_transferMods failed in STEADY STATE: ");
- auto mods = res.response;
- // If modifications were returned, apply them locally and keep requesting more until all of them have been transferred
- if (mods["size"].number() > 0 && _applyMigrateOp(opCtx, mods, &lastOpApplied)) {
- continue;
- }
- if (getState() == ABORT) {
- LOGV2(22006,
- "Migration aborted while transferring mods",
- "migrationId"_attr = _migrationId->toBSON());
- return;
- }
- // We know we're finished when:
- // 1) The from side has told us that it has locked writes (COMMIT_START)
- // 2) We've checked at least one more time for un-transmitted mods
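- // Why transferAfterCommit is checked: after entering COMMIT_START (the critical section), the modifications must be pulled one more time
- if (getState() == COMMIT_START && transferAfterCommit == true) {
- // The migration flow ends once all data has been replicated to the secondaries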
- if (runWithoutSession(outerOpCtx,
- [&] { return _flushPendingWrites(opCtx, lastOpApplied); })) {
- break;
- }
- }
- // Only sleep if we aren't committing
- if (getState() == STEADY)
- sleepmillis(10);
- }
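The role of transferAfterCommit is easier to see in a stripped-down model of the loop. The sketch below is an illustration under assumptions: `State`, `steadyLoop`, and the callbacks are hypothetical, modifications are plain integers, and the final _flushPendingWrites check and the sleep are omitted.

```cpp
#include <cassert>
#include <functional>
#include <vector>

enum class State { kSteady, kCommitStart, kAbort };

// Keeps pulling modifications until the donor has entered the critical section
// AND at least one pull has been started after that point.
// Returns true on a clean finish, false if the migration was aborted.
bool steadyLoop(const std::function<State()>& getState,
                const std::function<std::vector<int>()>& transferMods,
                std::vector<int>& applied) {
    bool transferAfterCommit = false;
    while (getState() == State::kSteady || getState() == State::kCommitStart) {
        // Set only at the top of an iteration, so the pull below is guaranteed
        // to start after COMMIT_START was observed.
        if (getState() == State::kCommitStart) {
            transferAfterCommit = true;
        }

        std::vector<int> mods = transferMods();
        if (!mods.empty()) {
            applied.insert(applied.end(), mods.begin(), mods.end());
            continue;  // there may be more; pull again before deciding anything
        }

        if (getState() == State::kAbort) {
            return false;
        }
        if (getState() == State::kCommitStart && transferAfterCommit) {
            return true;  // an empty pull that began after the commit message
        }
    }
    return false;
}
```

If the state flips to COMMIT_START while a pull that returns nothing is in flight, the flag is still false when that pull completes, so the loop runs once more and picks up any write that was logged between that pull and the start of the critical section.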
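3.4 Entering the critical section (_recvChunkStatus, _recvChunkCommit)
1. Wait for the recipient to finish syncing data (_recvChunkStatus).
2. Mark this node as being in the critical section, which blocks writes.
3. Tell the recipient to enter the critical section (_recvChunkCommit).
The relevant code is as follows:
- Status MigrationSourceManager::awaitToCatchUp() {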
- invariant(!_opCtx->lockState()->isLocked());
- invariant(_state == kCloning);
- auto scopedGuard = makeGuard([&] { cleanupOnError(); });
- _stats.totalDonorChunkCloneTimeMillis.addAndFetch(_cloneAndCommitTimer.millis());
- _cloneAndCommitTimer.reset();
- // Block until the cloner deems it appropriate to enter the critical section.
- // Wait for the data copy to finish. This sends _recvChunkStatus to the recipient to check whether its state is STEADY.
- Status catchUpStatus = _cloneDriver->awaitUntilCriticalSectionIsAppropriate(
- _opCtx, kMaxWaitToEnterCriticalSectionTimeout);
- if (!catchUpStatus.isOK()) {
- return catchUpStatus;
- }
- _state = kCloneCaughtUp;
- scopedGuard.dismiss();
- return Status::OK();
- }
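- // Enter the critical section
- Status MigrationSourceManager::enterCriticalSection() {
- ... // some code omitted
- // Mark the critical section as entered. Subsequent write operations block (they check this flag through ShardingMigrationCriticalSection::getSignal()).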
- _critSec.emplace(_opCtx, _args.getNss(), _critSecReason);
- _state = kCriticalSection;
- // Persist a signal to secondaries that we've entered the critical section. This will cause
- // secondaries to refresh their routing table when next accessed, which will block behind the
- // critical section. This ensures causal consistency by preventing a stale mongos with a cluster
- // time inclusive of the migration config commit update from accessing secondary data.
- // Note: this write must occur after the critSec flag is set, to ensure the secondary refresh
- // will stall behind the flag.
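- // Tell the secondaries that the primary has entered the critical section, so that they refresh their routing information on the next data access (this preserves causal consistency)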
- Status signalStatus = shardmetadatautil::updateShardCollectionsEntry(
- _opCtx,
- BSON(ShardCollectionType::kNssFieldName << getNss().ns()),
- BSON("$inc" << BSON(ShardCollectionType::kEnterCriticalSectionCounterFieldName << 1)),
- false /*upsert*/);
- if (!signalStatus.isOK()) {
- return {
- ErrorCodes::OperationFailed,
- str::stream() << "Failed to persist critical section signal for secondaries due to: "
- << signalStatus.toString()};
- }
- LOGV2(22017,
- "Migration successfully entered critical section",
- "migrationId"_attr = _coordinator->getMigrationId());
- scopedGuard.dismiss();
- return Status::OK();
- }
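How a flag can block writers until the critical section ends may be sketched with standard library futures. This is a rough model, not the MongoDB class: `CriticalSectionSketch`, `leave`, and `writeWouldBlock` are hypothetical names, and std::shared_future is used only as a stand-in for whatever signal type the real getSignal() returns.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <mutex>
#include <optional>

// Illustrative gate: while the section is held, getSignal() returns a future
// that becomes ready when the section is left.
class CriticalSectionSketch {
public:
    void enter() {
        std::lock_guard<std::mutex> lk(_mutex);
        assert(!_promise.has_value());  // entering twice is a programming error
        _promise.emplace();
        _signal = _promise->get_future().share();
    }

    void leave() {
        std::lock_guard<std::mutex> lk(_mutex);
        if (!_promise.has_value()) {
            return;
        }
        _promise->set_value();  // wakes every writer waiting on the signal
        _promise.reset();
        _signal.reset();
    }

    // Empty when writes may proceed; otherwise a future the writer should wait on.
    std::optional<std::shared_future<void>> getSignal() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return _signal;
    }

private:
    mutable std::mutex _mutex;
    std::optional<std::promise<void>> _promise;
    std::optional<std::shared_future<void>> _signal;
};

// A write path would consult the signal before touching the chunk.
bool writeWouldBlock(const CriticalSectionSketch& cs) {
    return cs.getSignal().has_value();
}
```

A writer that obtained the signal before leave() was called still holds a valid copy of the future, so it is released as soon as the section ends and can then retry the write.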
4. Summary