R Language File IO and Parallel Computing Optimization Practice
Background
When handling large-scale file generation tasks, performance is often a critical issue. This article shares a practical case study demonstrating how to improve R program performance through parallel computing and IO optimization.
Initial Problems
In our data processing task, we needed to generate multiple large data files. The initial code had the following issues:
- Low execution efficiency:
  - Serial processing of multiple files
  - A separate file write for each row of data (illustrated in the sketch after this list)
  - Single-file generation speed of about 1 MB/s
- Poor resource utilization:
  - Low CPU utilization
  - Frequent IO operations
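For context, the original pattern looked roughly like the sketch below: the output file is opened, written to, and closed once per data row. This is a simplified reconstruction with synthetic values and file names, not the original script.

# Simplified reconstruction (assumption) of the original per-row write pattern:
# one open/write/close cycle per data row.
outfile <- "example_per_row.txt"
dates   <- seq.Date(as.Date("2000-01-03"), by = "week", length.out = 520)

for (week_idx in seq_along(dates)) {
  row_values <- runif(60)                        # synthetic stand-in for one week's values
  con <- file(outfile, "a")
  writeLines(sprintf("%.4e", row_values), con)   # small write, repeated every iteration
  close(con)
}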
Optimization Process
Step 1: Implementing Parallel Processing
First, we tried using R’s parallel package for parallel processing:
library(parallel)

# Set up the parallel environment; reserve one core for the system
num_cores <- detectCores() - 1
cl <- makeCluster(min(num_cores, length(MODEL_GROUPS)))

# Export the functions and variables that the workers need
# (rc_combi is referenced inside the worker function below, so it is exported too)
clusterExport(cl, c("generate_rch_file", "get_combo_data", "write_row_data",
                    "HEADER_LINES", "MODEL_GROUPS", "CC_SUFFIXES", "PAW_VALUES",
                    "rc_combi"))

# Ensure each worker loads the required libraries
clusterEvalQ(cl, {
  library(tidyverse)
  library(foreign)
})

# Execute the tasks in parallel
parLapply(cl, MODEL_GROUPS, function(model) {
  generate_rch_file(model, "Hindcast", "00", rc_combi, "MODFLOW_recharge_Outputs/Hindcast")
})

# Release the workers when finished
stopCluster(cl)
We encountered our first issue after this optimization:
Error in checkForRemoteErrors(val): 6 nodes produced errors; first error: could not find function "generate_rch_file"
The cause was a function-visibility problem: objects defined in the master session are not automatically available to the worker processes. Exporting the required functions and variables with clusterExport resolved the issue.
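When in doubt, the export can be verified before launching the real job; the one-liner below is a small sanity-check sketch using the same cluster object cl.

# Ask every worker whether it can see the exported function; each node should return TRUE
clusterEvalQ(cl, exists("generate_rch_file"))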
Step 2: Discovering the IO Bottleneck
After implementing parallelization, we found new issues:
- The original single-file write speed was 1 MB+/s
- After parallelization, each file achieved only a few hundred KB/s
- Overall performance did not improve significantly
Analysis revealed this was due to:
- Disk IO contention from multiple processes writing simultaneously
- Excessive disk seek time caused by frequent small writes
- Write operations that were far too frequent, one per data row (see the micro-benchmark below)
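The cost of write frequency alone is easy to demonstrate. The stand-alone micro-benchmark below (not part of the original script) compares repeated open/append/close cycles against a single batched write of the same lines.

# Micro-benchmark: repeated open/append/close vs. one batched write of identical data
lines <- sprintf("%.4e", runif(5e4))

system.time({
  for (l in lines) {                     # one open/write/close cycle per line
    con <- file("per_line.txt", "a")
    writeLines(l, con)
    close(con)
  }
})

system.time({
  writeLines(lines, "batched.txt")       # a single write of the whole vector
})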
Step 3: Optimizing the IO Strategy
We restructured the file writing logic:
- Modified the write_row_data function to return formatted lines instead of writing to the file directly:
write_row_data <- function(values) {
  result <- character()
  for (i in 1:6) {
    # Each data row holds 60 values, written as 6 lines of 10 values each
    start_idx <- (i - 1) * 10 + 1
    line_values <- values[start_idx:(start_idx + 9)]
    formatted_values <- sprintf("%.4e", line_values)
    result <- c(result, paste(" ", paste(formatted_values, collapse = "")))
  }
  return(result)
}
- Used a buffer to accumulate the formatted lines and write them out in batches (a combined sketch of the full loop follows this list):
# Initialize the buffer once, before the weekly loop
buffer <- character()

# Inside the loop over weeks: accumulate formatted lines instead of writing them
buffer <- c(buffer, write_row_data(row_values))

# Flush the buffer every 50 weeks and at the final week
if (week_idx %% 50 == 0 || week_idx == length(dates)) {
  if (week_idx == 50) {
    # First flush: create (or overwrite) the output file
    writeLines(buffer, outfile)
  } else {
    # Later flushes: append to the existing file
    con <- file(outfile, "a")
    writeLines(buffer, con)
    close(con)
  }
  buffer <- character()
}
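An alternative worth considering is to hold a single connection open for the whole loop, so even the flushes avoid reopening the file. The sketch below is a simplified, self-contained version of that pattern; the synthetic data and file name are illustrative, and it reuses the write_row_data function defined above.

# Sketch: buffered writes through one connection that stays open for the whole loop
outfile <- "example_buffered.rch"
dates   <- seq.Date(as.Date("2000-01-03"), by = "week", length.out = 520)

con <- file(outfile, "w")                          # open once
buffer <- character()

for (week_idx in seq_along(dates)) {
  row_values <- runif(60)                          # stand-in for one week's 60 values
  buffer <- c(buffer, write_row_data(row_values))  # accumulate in memory, no IO yet

  # Flush every 50 weeks and at the final week
  if (week_idx %% 50 == 0 || week_idx == length(dates)) {
    writeLines(buffer, con)
    buffer <- character()
  }
}

close(con)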
Optimization Results
The final version achieved significant performance improvements:
- Write speed:
  - The volume of each single write increased to 5 MB+ per file
  - The number of IO operations was greatly reduced
- Resource usage:
  - CPU usage held at around 40%
  - Memory usage stayed reasonable
  - Sufficient resources remained for other tasks
- Code quality:
  - Code readability was maintained
  - Error handling was improved
  - Resource management was improved
Lessons Learned
- Parallelization considerations:
  - Export the required functions and variables correctly
  - Choose an appropriate degree of parallelism
  - Watch out for resource contention
- IO optimization strategies:
  - Reduce the frequency of IO operations
  - Use a buffering mechanism
  - Process data in batches
- Performance tuning tips:
  - Identify the performance bottleneck first (see the profiling sketch below)
  - Optimize step by step and verify each change promptly
  - Balance resource usage
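As a concrete example of locating the bottleneck first, base R's timing and profiling tools are usually sufficient. The snippet below is a generic sketch built around the functions named in this article, not code taken from the original script.

# Coarse timing of one file generation
system.time(
  generate_rch_file(MODEL_GROUPS[[1]], "Hindcast", "00", rc_combi,
                    "MODFLOW_recharge_Outputs/Hindcast")
)

# Finer-grained view with the built-in sampling profiler
Rprof("profile.out")
generate_rch_file(MODEL_GROUPS[[1]], "Hindcast", "00", rc_combi,
                  "MODFLOW_recharge_Outputs/Hindcast")
Rprof(NULL)
summaryRprof("profile.out")$by.self   # time spent in each function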