R File IO and Parallel Computing Optimization in Practice

Background

When handling large-scale file generation tasks, performance is often a critical issue. This article shares a practical case study demonstrating how to improve R program performance through parallel computing and IO optimization.

Initial Problems

In our data processing task, we needed to generate multiple large data files. The initial code had the following issues:

  • Low Execution Efficiency:

    • Multiple files were processed serially
    • A separate file write was issued for every row of data (see the sketch after this list)
    • Single-file generation speed was only about 1 MB/s

  • Improper Resource Utilization:

    • Low CPU utilization
    • Frequent IO operations
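
For reference, the original per-row pattern looked roughly like the following. This is only an illustrative sketch, not the project's actual code: get_row_values stands in for the real data lookup, and the formatting is simplified.

# Illustrative sketch of the original approach: one file write per row
for (week_idx in seq_along(dates)) {
  row_values <- get_row_values(week_idx)              # placeholder for the real data lookup
  line <- paste(sprintf("%.4e", row_values), collapse = "  ")
  write(line, file = outfile, append = TRUE)          # opens, writes, and closes the file every time
}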

Optimization Process

Step 1: Implementing Parallel Processing

First, we tried using R’s parallel package for parallel processing:
# Set up the parallel environment
library(parallel)
num_cores <- detectCores() - 1  # reserve one core for the system
cl <- makeCluster(min(num_cores, length(MODEL_GROUPS)))

# Export the necessary functions and variables to the workers
# (data objects referenced inside the worker function, such as rc_combi, must be exported too)
clusterExport(cl, c("generate_rch_file", "get_combo_data", "write_row_data",
                    "HEADER_LINES", "MODEL_GROUPS", "CC_SUFFIXES", "PAW_VALUES",
                    "rc_combi"))

# Ensure each worker process loads the necessary libraries
clusterEvalQ(cl, {
  library(tidyverse)
  library(foreign)
})

# Execute the tasks in parallel
parLapply(cl, MODEL_GROUPS, function(model) {
  generate_rch_file(model, "Hindcast", "00", rc_combi, "MODFLOW_recharge_Outputs/Hindcast")
})
We encountered our first issue after this optimization:
Error in checkForRemoteErrors(val): 6 nodes produced errors; first error: could not find function "generate_rch_file"
The cause was a function visibility problem: objects defined in the main session are not automatically available in the worker processes, so the required functions and variables had to be exported explicitly with clusterExport(), as shown above.
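
One point worth adding that is not shown above: it is good practice to guarantee that the cluster is shut down even if a worker fails. A minimal sketch, reusing the cl and MODEL_GROUPS objects from the snippet above:

# Run the parallel job, but always release the worker processes afterwards
results <- tryCatch(
  parLapply(cl, MODEL_GROUPS, function(model) {
    generate_rch_file(model, "Hindcast", "00", rc_combi, "MODFLOW_recharge_Outputs/Hindcast")
  }),
  error = function(e) {
    message("Parallel run failed: ", conditionMessage(e))
    NULL
  },
  finally = stopCluster(cl)  # shut the cluster down in all cases
)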

Step 2: Discovering the IO Bottleneck

After implementing parallelization, we found new issues:


  • The original single-file write speed was 1 MB+/s
  • After parallelization, each file only achieved a few hundred KB/s
  • Overall performance did not improve significantly

Analysis revealed this was due to:

  • Disk IO contention from multiple processes writing simultaneously
  • Excessive disk seek time from frequent small data writes
  • Too frequent write operations (one write per row; see the benchmark sketch below)
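
To confirm that write frequency, rather than data volume, was the dominant cost, a quick stand-alone benchmark (generic R, not the project code) can compare per-line appends with a single batched write:

# Compare many small append writes with one batched write of the same data
lines <- sprintf("%.4e", runif(60000))    # ~60k short lines of dummy data
slow_file <- tempfile()
fast_file <- tempfile()

t_per_line <- system.time(
  for (ln in lines) write(ln, file = slow_file, append = TRUE)  # reopens the file for every line
)

t_batched <- system.time(
  writeLines(lines, fast_file)                                  # one buffered write
)

print(rbind(per_line = t_per_line, batched = t_batched))

On most systems the per-line version is dramatically slower, because every call pays for opening the file plus a separate small IO request.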

Step 3: Optimizing the IO Strategy

We restructured the file writing logic:

  • Modified the write_row_data function to return formatted strings instead of writing to the file directly:
# Format one row of 60 values as 6 output lines of 10 values each
write_row_data <- function(values) {
  result <- character()
  for (i in 1:6) {
    start_idx <- (i - 1) * 10 + 1
    line_values <- values[start_idx:(start_idx + 9)]   # the next 10 values
    formatted_values <- sprintf("%.4e", line_values)   # scientific notation, 4 decimals
    result <- c(result, paste(" ", paste(formatted_values, collapse = "  ")))
  }
  return(result)                                       # one element per output line
}

  • Used a buffer to accumulate rows and write them out in batches (a further refinement is sketched after this block):
# Initialize the buffer
buffer <- character()

# Accumulate formatted rows in the buffer instead of writing them immediately
buffer <- c(buffer, write_row_data(row_values))

# Flush the buffer every 50 weeks of data, and again at the very end
if (week_idx %% 50 == 0 || week_idx == length(dates)) {
  if (week_idx == 50) {
    writeLines(buffer, outfile)    # the first flush creates the file
  } else {
    con <- file(outfile, "a")      # later flushes append to it
    writeLines(buffer, con)
    close(con)
  }
  buffer <- character()
}
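
A further refinement, not used in the version above, is to open the output connection once and keep it open across flushes, which removes the repeated open/close in append mode. A rough sketch under that assumption (get_row_values again stands in for the real per-week data lookup):

# Keep one connection open for the whole file and flush the buffer into it
con <- file(outfile, "w")
buffer <- character()
for (week_idx in seq_along(dates)) {
  row_values <- get_row_values(week_idx)    # placeholder for the real data lookup
  buffer <- c(buffer, write_row_data(row_values))
  if (week_idx %% 50 == 0 || week_idx == length(dates)) {
    writeLines(buffer, con)                 # flush the accumulated lines
    buffer <- character()
  }
}
close(con)
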
Optimization Results

The final version achieved significant performance improvements:

  • Write Speed:

    • The amount of data written per write operation increased to 5 MB+ per file
    • The number of IO operations was greatly reduced

  • Resource Usage:

    • CPU usage was maintained at around 40%
    • Reasonable memory usage
    • Sufficient resources for other tasks

  • Code Quality:

    • Maintained code readability
    • Improved error handling
    • Better resource management

Lessons Learned


  • Parallelization Considerations:

    • Correct function and variable export
    • Appropriate parallelization level
    • Resource contention awareness

  • IO Optimization Strategies:

    • Reduce IO operation frequency
    • Use a caching mechanism
    • Batch process data

  • Performance Tuning Tips:

    • Identify the performance bottleneck first (see the profiling sketch after this list)
    • Optimize step by step and verify each change promptly
    • Balance resource usage
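
On the first point, base R's system.time() and Rprof() are usually enough to locate the bottleneck before any optimization is attempted. A minimal sketch using one of the file generation calls from this project:

# Profile a single file generation run to see where the time goes
Rprof("generate_rch.prof")    # start the sampling profiler
generate_rch_file(MODEL_GROUPS[[1]], "Hindcast", "00", rc_combi,
                  "MODFLOW_recharge_Outputs/Hindcast")
Rprof(NULL)                   # stop profiling
print(head(summaryRprof("generate_rch.prof")$by.self))  # functions ranked by self time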

