Kafka Monitoring and Security Considerations for Remote Monitoring Using JMX

吴旭华 · 2024-6-15 01:46:20

Contents
1. Preface
2. Kafka Monitoring
2.1. Overview
2.2. Security Considerations for Remote Monitoring Using JMX

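1. Preface

Kafka's centralized design is known for strong durability and fault tolerance. Because Kafka is a distributed system, topics are partitioned and replicated across multiple nodes. Kafka can also be an attractive choice for data integration when it is paired with meaningful performance monitoring and timely alerting on problems. In practice, when troubleshooting Kafka problems, an application manager collects all performance metrics and routes alerts to the people who need to take corrective action.

2. Kafka Monitoring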

2.1. Overview

Quoted from the original: Kafka uses Yammer Metrics for metrics reporting in the server. The Java clients use Kafka Metrics, a built-in metrics registry that minimizes transitive dependencies pulled into client applications. Both expose metrics via JMX and can be configured to report stats using pluggable stats reporters to hook up to your monitoring system.
Quoted from the original: All Kafka rate metrics have a corresponding cumulative count metric with suffix -total. For example, records-consumed-rate has a corresponding metric named records-consumed-total.
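
The relationship between a rate metric and its -total counterpart can be illustrated with a small sketch. It is not how Kafka computes its own -rate values (those come from Kafka's internal sampling windows); it only shows that an average rate over a polling interval can be derived from two readings of the cumulative counter. The function names and the polling interval are hypothetical.

```python
def total_metric_name(rate_metric):
    """Map a rate metric name such as 'records-consumed-rate'
    to its cumulative counterpart with the -total suffix."""
    suffix = "-rate"
    if not rate_metric.endswith(suffix):
        raise ValueError(f"not a rate metric: {rate_metric!r}")
    return rate_metric[: -len(suffix)] + "-total"


def rate_from_totals(prev_total, curr_total, interval_seconds):
    """Average per-second rate between two readings of a -total metric.

    Returns None when the counter went backwards, which this sketch
    treats as a counter reset (for example, a restarted client).
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_total - prev_total
    if delta < 0:
        return None
    return delta / interval_seconds


if __name__ == "__main__":
    # Hypothetical readings taken 60 seconds apart.
    print(total_metric_name("records-consumed-rate"))
    print(rate_from_totals(1000, 1600, 60))
```

Deriving rates from the cumulative counter is mainly useful when the monitoring system polls at its own interval and should not depend on the window used by the client.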
Quoted from the original: The easiest way to see the available metrics is to fire up jconsole and point it at a running kafka client or server; this will allow browsing all metrics with JMX.

2.2. Security Considerations for Remote Monitoring Using JMX

Quoted from the original: Apache Kafka disables remote JMX by default. You can enable remote monitoring using JMX by setting the environment variable JMX_PORT for processes started using the CLI or standard Java system properties to enable remote JMX programmatically. You must enable security when enabling remote JMX in production scenarios to ensure that unauthorized users cannot monitor or control your broker or application as well as the platform on which these are running. Note that authentication is disabled for JMX by default in Kafka and security configs must be overridden for production deployments by setting the environment variable KAFKA_JMX_OPTS for processes started using the CLI or by setting appropriate Java system properties. See Monitoring and Management Using JMX Technology for details on securing JMX.
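
As a rough illustration of overriding the defaults, the sketch below assembles values for JMX_PORT and KAFKA_JMX_OPTS from the standard com.sun.management.jmxremote system properties. The port number, the file paths, and the helper names are hypothetical, and the sketch assumes paths without spaces. Enabling SSL additionally requires keystore settings that are not shown here; the JMX guide referenced above covers them.

```python
import shlex


def build_kafka_jmx_opts(password_file, access_file, use_ssl=True):
    """Build a KAFKA_JMX_OPTS value that turns JMX authentication on.

    Uses only the standard JDK properties for remote JMX.
    Paths are assumed to contain no spaces.
    """
    flags = [
        "-Dcom.sun.management.jmxremote",
        "-Dcom.sun.management.jmxremote.authenticate=true",
        "-Dcom.sun.management.jmxremote.ssl=" + ("true" if use_ssl else "false"),
        "-Dcom.sun.management.jmxremote.password.file=" + password_file,
        "-Dcom.sun.management.jmxremote.access.file=" + access_file,
    ]
    return " ".join(flags)


def jmx_env(port, password_file, access_file, use_ssl=True):
    """Environment variables to set before starting Kafka from the CLI."""
    return {
        "JMX_PORT": str(port),
        "KAFKA_JMX_OPTS": build_kafka_jmx_opts(password_file, access_file, use_ssl),
    }


if __name__ == "__main__":
    # Hypothetical port and file locations, for illustration only.
    env = jmx_env(9999, "/etc/kafka/jmxremote.password", "/etc/kafka/jmxremote.access")
    for key, value in env.items():
        print(f"export {key}={shlex.quote(value)}")
```

Printing export lines rather than starting the broker keeps the sketch side-effect free; the output can be reviewed before it is placed in a startup script.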

We do graphing and alerting on the following metrics:
Description: Message in rate
MBean name: kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=([-.\w]+)
Normal value: Incoming message rate per topic. Omitting 'topic=(...)' will yield the all-topic rate.

Description: Byte in rate from clients
MBean name: kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=([-.\w]+)
Normal value: Byte in (from the clients) rate per topic. Omitting 'topic=(...)' will yield the all-topic rate.

Description: Byte in rate from other brokers
MBean name: kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesInPerSec
Normal value: Byte in (from the other brokers) rate across all topics.

Description: Controller Request rate from Broker
MBean name: kafka.controller:type=ControllerChannelManager,name=RequestRateAndQueueTimeMs,brokerId=([0-9]+)
Normal value: The rate (requests per second) at which the ControllerChannelManager takes requests from the queue of the given broker. And the time it takes for a request to stay in this queue before it is taken from the queue.

Description: Controller Event queue size
MBean name: kafka.controller:type=ControllerEventManager,name=EventQueueSize
Normal value: Size of the ControllerEventManager's queue.

Description: Controller Event queue time
MBean name: kafka.controller:type=ControllerEventManager,name=EventQueueTimeMs
Normal value: Time that takes for any event (except the Idle event) to wait in the ControllerEventManager's queue before being processed.

Description: Request rate
MBean name: kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower},version=([0-9]+)
Normal value: (not specified)

Description: Error rate
MBean name: kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=([-.\w]+),error=([-.\w]+)
Normal value: Number of errors in responses counted per-request-type, per-error-code. If a response contains multiple errors, all are counted. error=NONE indicates successful responses.

Description: Produce request rate
MBean name: kafka.server:type=BrokerTopicMetrics,name=TotalProduceRequestsPerSec,topic=([-.\w]+)
Normal value: Produce request rate per topic. Omitting 'topic=(...)' will yield the all-topic rate.

Description: Fetch request rate
MBean name: kafka.server:type=BrokerTopicMetrics,name=TotalFetchRequestsPerSec,topic=([-.\w]+)
Normal value: Fetch request (from clients or followers) rate per topic. Omitting 'topic=(...)' will yield the all-topic rate.

Description: Failed produce request rate
MBean name: kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec,topic=([-.\w]+)
Normal value: Failed Produce request rate per topic. Omitting 'topic=(...)' will yield the all-topic rate.

Description: Failed fetch request rate
MBean name: kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec,topic=([-.\w]+)
Normal value: Failed Fetch request (from clients or followers) rate per topic. Omitting 'topic=(...)' will yield the all-topic rate.

Description: Request size in bytes
MBean name: kafka.network:type=RequestMetrics,name=RequestBytes,request=([-.\w]+)
Normal value: Size of requests for each request type.

Description: Temporary memory size in bytes
MBean name: kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request={Produce|Fetch}
Normal value: Temporary memory used for message format conversions and decompression.

Description: Message conversion time
MBean name: kafka.network:type=RequestMetrics,name=MessageConversionsTimeMs,request={Produce|Fetch}
Normal value: Time in milliseconds spent on message format conversions.

Description: Message conversion rate
MBean name: kafka.server:type=BrokerTopicMetrics,name={Produce|Fetch}MessageConversionsPerSec,topic=([-.\w]+)
Normal value: Message format conversion rate, for Produce or Fetch requests, per topic. Omitting 'topic=(...)' will yield the all-topic rate.

Description: Request Queue Size
MBean name: kafka.network:type=RequestChannel,name=RequestQueueSize
Normal value: Size of the request queue.

Description: Byte out rate to clients
MBean name: kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic=([-.\w]+)
Normal value: Byte out (to the clients) rate per topic. Omitting 'topic=(...)' will yield the all-topic rate.

Description: Byte out rate to other brokers
MBean name: kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesOutPerSec
Normal value: Byte out (to the other brokers) rate across all topics.

Description: Rejected byte rate
MBean name: kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec,topic=([-.\w]+)
Normal value: Rejected byte rate per topic, due to the record batch size being greater than max.message.bytes configuration. Omitting 'topic=(...)' will yield the all-topic rate.

Description: Message validation failure rate due to no key specified for compacted topic
MBean name: kafka.server:type=BrokerTopicMetrics,name=NoKeyCompactedTopicRecordsPerSec
Normal value: 0

Description: Message validation failure rate due to invalid magic number
MBean name: kafka.server:type=BrokerTopicMetrics,name=InvalidMagicNumberRecordsPerSec
Normal value: 0

Description: Message validation failure rate due to incorrect crc checksum
MBean name: kafka.server:type=BrokerTopicMetrics,name=InvalidMessageCrcRecordsPerSec
Normal value: 0

Description: Message validation failure rate due to non-continuous offset or sequence number in batch
MBean name: kafka.server:type=BrokerTopicMetrics,name=InvalidOffsetOrSequenceRecordsPerSec
Normal value: 0

Description: Log flush rate and time
MBean name: kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs
Normal value: (not specified)

Description: # of offline log directories
MBean name: kafka.log:type=LogManager,name=OfflineLogDirectoryCount
Normal value: 0

Description: Leader election rate
MBean name: kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs
Normal value: non-zero when there are broker failures

Description: Unclean leader election rate
MBean name: kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec
Normal value: 0

Description: Is controller active on broker
MBean name: kafka.controller:type=KafkaController,name=ActiveControllerCount
Normal value: only one broker in the cluster should have 1

Description: Pending topic deletes
MBean name: kafka.controller:type=KafkaController,name=TopicsToDeleteCount
Normal value: (not specified)

Description: Pending replica deletes
MBean name: kafka.controller:type=KafkaController,name=ReplicasToDeleteCount
Normal value: (not specified)

Description: Ineligible pending topic deletes
MBean name: kafka.controller:type=KafkaController,name=TopicsIneligibleToDeleteCount
Normal value: (not specified)

Description: Ineligible pending replica deletes
MBean name: kafka.controller:type=KafkaController,name=ReplicasIneligibleToDeleteCount
Normal value: (not specified)

Description: # of under replicated partitions (|ISR| < |all replicas|)
MBean name: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
Normal value: 0

Description: # of under minIsr partitions (|ISR| < min.insync.replicas)
MBean name: kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount
Normal value: 0

Description: # of at minIsr partitions (|ISR| = min.insync.replicas)
MBean name: kafka.server:type=ReplicaManager,name=AtMinIsrPartitionCount
Normal value: 0

Description: Producer Id counts
MBean name: kafka.server:type=ReplicaManager,name=ProducerIdCount
Normal value: Count of all producer ids created by transactional and idempotent producers in each replica on the broker

Description: Partition counts
MBean name: kafka.server:type=ReplicaManager,name=PartitionCount
Normal value: mostly even across brokers

Description: Offline Replica counts
MBean name: kafka.server:type=ReplicaManager,name=OfflineReplicaCount
Normal value: 0

Description: Leader replica counts
MBean name: kafka.server:type=ReplicaManager,name=LeaderCount
Normal value: mostly even across brokers

Description: ISR shrink rate
MBean name: kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
Normal value: If a broker goes down, ISR for some of the partitions will shrink. When that broker is up again, ISR will be expanded once the replicas are fully caught up. Other than that, the expected value for both ISR shrink rate and expansion rate is 0.

Description: ISR expansion rate
MBean name: kafka.server:type=ReplicaManager,name=IsrExpandsPerSec
Normal value: See above

Description: Failed ISR update rate
MBean name: kafka.server:type=ReplicaManager,name=FailedIsrUpdatesPerSec
Normal value: 0

Description: Max lag in messages btw follower and leader replicas
MBean name: kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica
Normal value: lag should be proportional to the maximum batch size of a produce request.

Description: Lag in messages per follower replica
MBean name: kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)
Normal value: lag should be proportional to the maximum batch size of a produce request.

Description: Requests waiting in the producer purgatory
MBean name: kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce
Normal value: non-zero if ack=-1 is used

Description: Requests waiting in the fetch purgatory
MBean name: kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Fetch
Normal value: size depends on fetch.wait.max.ms in the consumer

Description: Request total time
MBean name: kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce|FetchConsumer|FetchFollower}
Normal value: broken into queue, local, remote and response send time

Description: Time the request waits in the request queue
MBean name: kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request={Produce|FetchConsumer|FetchFollower}
Normal value: (not specified)

Description: Time the request is processed at the leader
MBean name: kafka.network:type=RequestMetrics,name=LocalTimeMs,request={Produce|FetchConsumer|FetchFollower}
Normal value: (not specified)

Description: Time the request waits for the follower
MBean name: kafka.network:type=RequestMetrics,name=RemoteTimeMs,request={Produce|FetchConsumer|FetchFollower}
Normal value: non-zero for produce requests when ack=-1

Description: Time the request waits in the response queue
MBean name: kafka.network:type=RequestMetrics,name=ResponseQueueTimeMs,request={Produce|FetchConsumer|FetchFollower}
Normal value: (not specified)

Description: Time to send the response
MBean name: kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request={Produce|FetchConsumer|FetchFollower}
Normal value: (not specified)

Description: Number of messages the consumer lags behind the producer by. Published by the consumer, not broker.
MBean name: kafka.consumer:type=consumer-fetch-manager-metrics,client-id={client-id} Attribute: records-lag-max
Normal value: (not specified)

Description: The average fraction of time the network processors are idle
MBean name: kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent
Normal value: between 0 and 1, ideally > 0.3

Description: The number of connections disconnected on a processor due to a client not re-authenticating and then using the connection beyond its expiration time for anything other than re-authentication
MBean name: kafka.server:type=socket-server-metrics,listener=[SASL_PLAINTEXT|SASL_SSL],networkProcessor=<#>,name=expired-connections-killed-count
Normal value: ideally 0 when re-authentication is enabled, implying there are no longer any older, pre-2.2.0 clients connecting to this (listener, processor) combination

Description: The total number of connections disconnected, across all processors, due to a client not re-authenticating and then using the connection beyond its expiration time for anything other than re-authentication
MBean name: kafka.network:type=SocketServer,name=ExpiredConnectionsKilledCount
Normal value: ideally 0 when re-authentication is enabled, implying there are no longer any older, pre-2.2.0 clients connecting to this broker

Description: The average fraction of time the request handler threads are idle
MBean name: kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent
Normal value: between 0 and 1, ideally > 0.3

Description: Bandwidth quota metrics per (user, client-id), user or client-id
MBean name: kafka.server:type={Produce|Fetch},user=([-.\w]+),client-id=([-.\w]+)
Normal value: Two attributes. throttle-time indicates the amount of time in ms the client was throttled. Ideally = 0. byte-rate indicates the data produce/consume rate of the client in bytes/sec. For (user, client-id) quotas, both user and client-id are specified. If per-client-id quota is applied to the client, user is not specified. If per-user quota is applied, client-id is not specified.

Description: Request quota metrics per (user, client-id), user or client-id
MBean name: kafka.server:type=Request,user=([-.\w]+),client-id=([-.\w]+)
Normal value: Two attributes. throttle-time indicates the amount of time in ms the client was throttled. Ideally = 0. request-time indicates the percentage of time spent in broker network and I/O threads to process requests from client group. For (user, client-id) quotas, both user and client-id are specified. If per-client-id quota is applied to the client, user is not specified. If per-user quota is applied, client-id is not specified.

Description: Requests exempt from throttling
MBean name: kafka.server:type=Request
Normal value: exempt-throttle-time indicates the percentage of time spent in broker network and I/O threads to process requests that are exempt from throttling.

Description: ZooKeeper client request latency
MBean name: kafka.server:type=ZooKeeperClientMetrics,name=ZooKeeperRequestLatencyMs
Normal value: Latency in milliseconds for ZooKeeper requests from broker.

Description: ZooKeeper connection status
MBean name: kafka.server:type=SessionExpireListener,name=SessionState
Normal value: Connection status of broker's ZooKeeper session which may be one of Disconnected|SyncConnected|AuthFailed|ConnectedReadOnly|SaslAuthenticated|Expired.

Description: Max time to load group metadata
MBean name: kafka.server:type=group-coordinator-metrics,name=partition-load-time-max
Normal value: maximum time, in milliseconds, it took to load offsets and group metadata from the consumer offset partitions loaded in the last 30 seconds (including time spent waiting for the loading task to be scheduled)

Description: Avg time to load group metadata
MBean name: kafka.server:type=group-coordinator-metrics,name=partition-load-time-avg
Normal value: average time, in milliseconds, it took to load offsets and group metadata from the consumer offset partitions loaded in the last 30 seconds (including time spent waiting for the loading task to be scheduled)

Description: Max time to load transaction metadata
MBean name: kafka.server:type=transaction-coordinator-metrics,name=partition-load-time-max
Normal value: maximum time, in milliseconds, it took to load transaction metadata from the consumer offset partitions loaded in the last 30 seconds (including time spent waiting for the loading task to be scheduled)

Description: Avg time to load transaction metadata
MBean name: kafka.server:type=transaction-coordinator-metrics,name=partition-load-time-avg
Normal value: average time, in milliseconds, it took to load transaction metadata from the consumer offset partitions loaded in the last 30 seconds (including time spent waiting for the loading task to be scheduled)

Description: Rate of transactional verification errors
MBean name: kafka.server:type=AddPartitionsToTxnManager,name=VerificationFailureRate
Normal value: Rate of verifications that returned in failure either from the AddPartitionsToTxn API response or through errors in the AddPartitionsToTxnManager. In steady state 0, but transient errors are expected during rolls and reassignments of the transactional state partition.

Description: Time to verify a transactional request
MBean name: kafka.server:type=AddPartitionsToTxnManager,name=VerificationTimeMs
Normal value: The amount of time queueing while a possible previous request is in-flight plus the round trip to the transaction coordinator to verify (or not verify)

Description: Consumer Group Offset Count
MBean name: kafka.server:type=GroupMetadataManager,name=NumOffsets
Normal value: Total number of committed offsets for Consumer Groups

Description: Consumer Group Count
MBean name: kafka.server:type=GroupMetadataManager,name=NumGroups
Normal value: Total number of Consumer Groups

Description: Consumer Group Count, per State
MBean name: kafka.server:type=GroupMetadataManager,name=NumGroups[PreparingRebalance,CompletingRebalance,Empty,Stable,Dead]
Normal value: The number of Consumer Groups in each state: PreparingRebalance, CompletingRebalance, Empty, Stable, Dead

Description: Number of reassigning partitions
MBean name: kafka.server:type=ReplicaManager,name=ReassigningPartitions
Normal value: The number of reassigning leader partitions on a broker.

Description: Outgoing byte rate of reassignment traffic
MBean name: kafka.server:type=BrokerTopicMetrics,name=ReassignmentBytesOutPerSec
Normal value: 0; non-zero when a partition reassignment is in progress.

Description: Incoming byte rate of reassignment traffic
MBean name: kafka.server:type=BrokerTopicMetrics,name=ReassignmentBytesInPerSec
Normal value: 0; non-zero when a partition reassignment is in progress.

Description: Size of a partition on disk (in bytes)
MBean name: kafka.log:type=Log,name=Size,topic=([-.\w]+),partition=([0-9]+)
Normal value: The size of a partition on disk, measured in bytes.

Description: Number of log segments in a partition
MBean name: kafka.log:type=Log,name=NumLogSegments,topic=([-.\w]+),partition=([0-9]+)
Normal value: The number of log segments in a partition.

Description: First offset in a partition
MBean name: kafka.log:type=Log,name=LogStartOffset,topic=([-.\w]+),partition=([0-9]+)
Normal value: The first offset in a partition.

Description: Last offset in a partition
MBean name: kafka.log:type=Log,name=LogEndOffset,topic=([-.\w]+),partition=([0-9]+)
Normal value: The last offset in a partition.
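
Several of the normal values listed above translate directly into alert conditions. The sketch below is one possible way to express a few of them, assuming the metric values have already been collected into plain dictionaries keyed by the MBean's name attribute. The snapshot structure, the function names, and the choice of which metrics to check are all hypothetical; fetching the values over JMX is not shown.

```python
# Threshold taken from the list above: "between 0 and 1, ideally > 0.3".
IDLE_FLOOR = 0.3

# Metrics whose normal value is listed as 0.
ZERO_EXPECTED = (
    "UnderReplicatedPartitions",
    "UnderMinIsrPartitionCount",
    "OfflineReplicaCount",
    "OfflineLogDirectoryCount",
    "UncleanLeaderElectionsPerSec",
)

# Idle-percent metrics that should stay above IDLE_FLOOR.
IDLE_METRICS = (
    "NetworkProcessorAvgIdlePercent",
    "RequestHandlerAvgIdlePercent",
)


def check_broker(metrics):
    """Return findings for one broker; metrics missing from the dict are skipped."""
    findings = []
    for name in ZERO_EXPECTED:
        value = metrics.get(name)
        if value is not None and value != 0:
            findings.append(f"{name} is {value}, expected 0")
    for name in IDLE_METRICS:
        value = metrics.get(name)
        if value is not None and value <= IDLE_FLOOR:
            findings.append(f"{name} is {value}, ideally > {IDLE_FLOOR}")
    return findings


def check_cluster(snapshots):
    """Check every broker, plus the cluster-wide controller condition.

    snapshots maps a broker id to that broker's metrics dict.
    Returns (per_broker_findings, cluster_findings).
    """
    per_broker = {broker: check_broker(m) for broker, m in snapshots.items()}
    active = sum(m.get("ActiveControllerCount", 0) for m in snapshots.values())
    cluster = []
    if active != 1:
        # "only one broker in the cluster should have 1"
        cluster.append(f"ActiveControllerCount sums to {active}, expected exactly 1")
    return per_broker, cluster


if __name__ == "__main__":
    # Hypothetical two-broker snapshot.
    example = {
        1: {"UnderReplicatedPartitions": 0, "ActiveControllerCount": 1,
            "RequestHandlerAvgIdlePercent": 0.8},
        2: {"UnderReplicatedPartitions": 3, "ActiveControllerCount": 0,
            "RequestHandlerAvgIdlePercent": 0.1},
    }
    print(check_cluster(example))
```

Metrics whose normal value is described only qualitatively (for example "mostly even across brokers") are left out, since turning them into thresholds depends on the cluster.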