内存排查
总体思路(排查路径)
一般的排查思路:先看 processes → 再看 merges → 再看 parts → 最后看 cache/dict
当发现 ClickHouse 内存异常时,可以按这个顺序排查:
- 整体内存分布:
merges / processes / tables / dict - 实时指标(metrics)
- 活跃查询(processes)
- 历史高内存查询(query_log)
- 存储结构(parts / tables)
- 缓存 / 字典 / query cache
实时内存总览
脚本:持续打印内存使用
- Merges:后台 merge 操作内存占用
- Processes:当前 SQL 查询的内存占用
- PrimaryK:主键索引(Primary Key)驻留内存大小
- TempTabs:临时表 / JOIN / SET 等内存消耗
- Dicts:字典结构(dictionaries)的内存占用
echo " Merges Processes PrimaryK TempTabs Dicts"; \
for i in `seq 1 600`; \
do clickhouse-client --empty_result_for_aggregation_by_empty_set=0 \
-q "select \
(select leftPad(formatReadableSize(sum(memory_usage)),15, ' ') from system.merges)||
(select leftPad(formatReadableSize(sum(memory_usage)),15, ' ') from system.processes)||
(select leftPad(formatReadableSize(sum(primary_key_bytes_in_memory_allocated)),15, ' ') from system.parts)|| \
(select leftPad(formatReadableSize(sum(total_bytes)),15, ' ') from system.tables WHERE engine IN ('Memory','Set','Join'))|| \
(select leftPad(formatReadableSize(sum(bytes_allocated)),15, ' ') FROM system.dictionaries)";
sleep 1; \
done
系统指标(Metrics)
常见关键指标:
- MemoryTracking
- UncompressedCache
- MarkCache
- QueryCache
SELECT
*,
formatReadableSize(value)
FROM system.asynchronous_metrics
WHERE (metric LIKE '%Cach%') OR (metric LIKE '%Mem%')
ORDER BY metric
SELECT
event_time,
metric,
value,
formatReadableSize(value)
FROM system.asynchronous_metric_log
WHERE (event_time > (now() - 600))
AND ((metric LIKE '%Cach%') OR (metric LIKE '%Mem%'))
AND (value != 0)
ORDER BY metric, event_time
查询相关内存
当前查询(实时)
SELECT formatReadableSize(sum(memory_usage)) FROM system.processes;
Top 内存查询:
SELECT
initial_query_id,
elapsed,
formatReadableSize(memory_usage),
formatReadableSize(peak_memory_usage),
query
FROM system.processes
ORDER BY peak_memory_usage DESC
LIMIT 10;
历史查询
用途说明:
- 定位具体消耗内存的查询或会话
- 判断是否为某条 SQL 导致异常内存占用
SELECT
type,
event_time,
formatReadableSize(memory_usage),
query
FROM system.query_log
WHERE (event_date >= today())
AND (event_time >= (now() - 7200))
ORDER BY memory_usage DESC
LIMIT 10;
Merge 内存
内存占用较高的常见主要原因包括:
- 大表正在进行数据合并(merge)操作
- 表触发了 mutation(如 UPDATE、DELETE、ALTER 等操作)
可通过以下 SQL 查询 Merge 阶段的总内存占用:
SELECT formatReadableSize(sum(memory_usage)) FROM system.merges;
存储结构(Parts / Tables)
主键内存
主键过大会导致常驻内存增长(Resident Memory)。
SELECT
sumIf(data_uncompressed_bytes, part_type = 'InMemory') as memory_parts,
formatReadableSize(sum(primary_key_bytes_in_memory)) AS primary_key_bytes_in_memory,
formatReadableSize(sum(primary_key_bytes_in_memory_allocated)) AS primary_key_bytes_in_memory_allocated
FROM system.parts;
Part 详情
SELECT
name,
rows,
formatReadableSize(bytes_on_disk),
formatReadableSize(data_compressed_bytes),
formatReadableSize(data_uncompressed_bytes)
FROM system.parts
WHERE table = 'test_table';
表内存(Memory/Join/Set)
SELECT
name,
total_rows,
active_parts,
formatReadableSize(total_bytes)
FROM system.tables
WHERE table = 'test_table'
字典(Dictionaries)
SELECT formatReadableSize(sum(bytes_allocated)) FROM system.dictionaries;
常见问题:
- 字典太大
- 没有限制 lifetime
Query Cache
如果很大可能缓存太多结果,可以查看缓存大小:
SELECT formatReadableSize(sum(result_size)) FROM system.query_cache;
实时监控
for i in `seq 1 600`; \
do clickhouse-client --empty_result_for_aggregation_by_empty_set=0 \
-q "select
(select 'Merges: '||formatReadableSize(sum(memory_usage)) from system.merges),
(select 'Processes: '||formatReadableSize(sum(memory_usage)) from system.processes)"; \
sleep 1; \
done
Trace Log
WITH
now() - INTERVAL 24 HOUR AS min_time, -- start time
now() AS max_time -- end time
SELECT
trace_type,
count(),
topK(20)(query_id)
FROM system.trace_log
WHERE event_time BETWEEN min_time AND max_time
GROUP BY trace_type;