R Language File IO and Parallel Computing Optimization Practice
### Background
When handling large-scale file generation tasks, performance is often a critical issue. This article shares a practical case study demonstrating how to improve R program performance through parallel computing and IO optimization.
### Initial Problems
In our data processing task, we needed to generate multiple large data files. The initial code had the following issues:
- Low Execution Efficiency:
  - Serial processing of multiple files
  - A separate file write for every row of data (sketched below)
  - Single-file generation speed of about 1 MB/s
- Improper Resource Utilization:
  - Low CPU utilization
  - Frequent IO operations
### Optimization Process
#### Step 1: Implementing Parallel Processing
First, we tried using R’s parallel package for parallel processing:
```r
library(parallel)

# Set up the parallel environment
num_cores <- detectCores() - 1  # reserve one core for the system
cl <- makeCluster(min(num_cores, length(MODEL_GROUPS)))

# Export the necessary functions and variables to the workers
# (rc_combi is referenced inside the worker function, so it must be exported too)
clusterExport(cl, c("generate_rch_file", "get_combo_data", "write_row_data",
                    "HEADER_LINES", "MODEL_GROUPS", "CC_SUFFIXES",
                    "PAW_VALUES", "rc_combi"))

# Ensure each worker loads the required libraries
clusterEvalQ(cl, {
  library(tidyverse)
  library(foreign)
})

# Execute the file-generation tasks in parallel
parLapply(cl, MODEL_GROUPS, function(model) {
  generate_rch_file(model, "Hindcast", "00", rc_combi, "MODFLOW_recharge_Outputs/Hindcast")
})

# Shut down the workers and free their resources
stopCluster(cl)
```
We encountered our first issue after this optimization:

```
Error in checkForRemoteErrors(val): 6 nodes produced errors; first error: could not find function "generate_rch_file"
```

The cause was function visibility in the parallel workers: a PSOCK cluster starts each worker with an empty workspace, so the problem was resolved by correctly exporting the required functions and variables with clusterExport(), as shown above.
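On Linux or macOS, a fork-based approach is another way to avoid the visibility problem entirely: `parallel::mclapply` forks the current session, so workers inherit the parent workspace and no explicit export step is needed (a sketch under that assumption; forking is not available on Windows):

```r
library(parallel)

# Workers forked from this session already see generate_rch_file, rc_combi,
# and the other globals, so no clusterExport() call is required.
results <- mclapply(MODEL_GROUPS, function(model) {
  generate_rch_file(model, "Hindcast", "00", rc_combi,
                    "MODFLOW_recharge_Outputs/Hindcast")
}, mc.cores = max(1, detectCores() - 1))
```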
#### Step 2: Discovering the IO Bottleneck
After implementing parallelization, we found new issues:
- The original single-file write speed was 1 MB+/s
- After parallelization, each file only achieved a few hundred KB/s
- Overall performance didn't improve significantly
Analysis revealed this was due to:
- Disk IO contention from multiple processes writing simultaneously
- Excessive disk seek time from frequent small writes
- Too-frequent write operations (one write per row); the toy benchmark below illustrates the cost
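This toy benchmark (not from the original post; file names are illustrative) contrasts one append per line with a single batched write:

```r
lines <- sprintf("%.4e", runif(50000))  # 50k formatted lines to write

# Pattern 1: open/write/close for every single line
system.time({
  for (l in lines) {
    con <- file("per_line.txt", "a")
    writeLines(l, con)
    close(con)
  }
})

# Pattern 2: one buffered write of all lines at once
system.time(writeLines(lines, "batched.txt"))
```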
#### Step 3: Optimizing the IO Strategy
We restructured the file-writing logic in two steps:

- Modified write_row_data to return formatted strings instead of writing to the file directly:

```r
# Format one row of 60 values as 6 lines of 10 values each, and return
# the lines instead of writing them immediately.
write_row_data <- function(values) {
  result <- character()
  for (i in 1:6) {
    start_idx <- (i - 1) * 10 + 1
    line_values <- values[start_idx:(start_idx + 9)]
    formatted_values <- sprintf("%.4e", line_values)
    result <- c(result, paste(" ", paste(formatted_values, collapse = " ")))
  }
  return(result)
}
```
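A quick illustrative check of the new return value:

```r
out <- write_row_data(runif(60))  # 60 values in, 6 formatted lines out
length(out)                       # 6
cat(out, sep = "\n")              # preview the formatted lines
```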
- Used a buffer to accumulate the formatted rows and flush them in batches:
```r
# Initialize the buffer
buffer <- character()

# Inside the per-week loop: accumulate formatted rows instead of writing
buffer <- c(buffer, write_row_data(row_values))

# Flush every 50 weeks of data, and again at the final week
if (week_idx %% 50 == 0 || week_idx == length(dates)) {
  if (week_idx == 50) {
    writeLines(buffer, outfile)   # first flush creates the file
  } else {
    con <- file(outfile, "a")     # later flushes append to it
    writeLines(buffer, con)
    close(con)
  }
  buffer <- character()
}
```
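A further simplification worth noting: the first-write/append split above can be avoided by opening one connection for the whole run and flushing each batch into it. A minimal sketch, assuming a hypothetical `get_week_values()` accessor in place of the original loop body:

```r
con <- file(outfile, "w")                    # one connection for the run
buffer <- character()
for (week_idx in seq_along(dates)) {
  row_values <- get_week_values(week_idx)    # hypothetical accessor
  buffer <- c(buffer, write_row_data(row_values))
  if (week_idx %% 50 == 0 || week_idx == length(dates)) {
    writeLines(buffer, con)                  # flush the accumulated batch
    buffer <- character()
  }
}
close(con)
```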
### Optimization Results
The final version achieved significant performance improvements:
- Write Speed:
  - Single-write volume increased to 5 MB+ per file
  - Greatly reduced number of IO operations
- Resource Usage:
  - CPU usage maintained around 40%
  - Reasonable memory usage
  - Sufficient resources left for other tasks
- Code Quality:
  - Maintained code readability
  - Improved error handling
  - Better resource management
### Lessons Learned
- Parallelization Considerations:
  - Export functions and variables correctly
  - Choose an appropriate level of parallelism
  - Stay aware of resource contention
- IO Optimization Strategies:
  - Reduce the frequency of IO operations
  - Use a buffering mechanism
  - Process data in batches
- Performance Tuning Tips:
  - Identify the performance bottleneck first (see the profiling sketch after this list)
  - Optimize step by step and verify each change promptly
  - Balance resource usage
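For the "identify bottlenecks first" step, base R's sampling profiler is a simple starting point (a generic sketch, not from the original post):

```r
Rprof("profile.out")                        # start the sampling profiler
generate_rch_file(MODEL_GROUPS[[1]], "Hindcast", "00", rc_combi,
                  "MODFLOW_recharge_Outputs/Hindcast")
Rprof(NULL)                                 # stop profiling
head(summaryRprof("profile.out")$by.self)   # top functions by self time
```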