ToB企服应用市场:ToB评测及商务社交产业平台

标题: MIMIC数据库SCI复现课程Day2: 数据提取到数据洗濯 [打印本页]

作者: 西河刘卡车医    时间: 2024-8-1 00:21
标题: MIMIC数据库SCI复现课程Day2: 数据提取到数据洗濯
 

   诺维医学科研官网:https://www.newboat.top 更新换版中!
  bilibili:文章对应的讲解视频在此。熊大学习社 熊大学习社的个人空间-熊大学习社个人主页-哔哩哔哩视频
  微信公众号:熊大学习社、诺维之舟
  公益网站,首页 | 公益网站 ,内有医学资料库
  诺维之舟AI:https://gpt4.nwzz.xyz 可在线使用GPT4
  课程相干资料:
(1)课程资料包罗[DAY1]SCI论文复现全部代码-基于R、PostgreSql/Navicat等软件、SQL常用下令与批处理脚本、课本;[Day2]MIMIC IV常见数据提取代码-基于sql、数据洗濯-基于R课本;[Day3] 待更新[Day4]待更新等关注公众号“熊大学习社”,复兴“mimic01”,获取全部4天MIMIC复现课程资料链接。
我们对峙学以致用,做有质量的分享。关注B站诺维医学(原名熊大学习社)公众号诺维之舟、熊大学习社。您的一键三连是我最大的动力。
(2)一对一论文引导学员免费获取学习课程和专属答疑。了解咨询扫客服二维码。
0 课程的总体框架


这次直播课程的特点:上手操纵+撸代码,零基础到SCI复现,随时互动互换,快速开启你的医学研究。
一、MIMIC IV物化视图

MIMIC IV物化视图怎样安装,物化视图有哪些,主要有什么用?
1 增补:Navicat中的表显示不出来,怎么办?

Navicat的表显示不出来的标题已解决,是版本过低导致的。升级版本即可,已放课程资料包。
2 建立MIMIC IV数据库物化视图

在G:/program files/MIMIC/mimic-code-main/mimic-iv/concepts_postgres(你的SQL下令文件夹)目次下运行cmd.
  1.  # 访问数据库
  2.  psql -U postgres -p 5432
  3.  ​
  4.  # 进入mimiciv数据库
  5.  \c mimiciv
  6.  ​
  7.  # 生成函数
  8.  \i 'postgres-functions.sql'
  9.  ​
  10.  # 视图
  11.  \i 'postgres-make-concepts.sql'
复制代码
看一下postgres-make-concepts.sql
 

里面用的是相对路径,这就是为什么要在上面指定的目次。
 

mimiciv_derived版块其实是在mimic_hosp及mimic_icu两个模块基础上然后在MIMIC数据库官网下载其提供的可视图化代码,然后运行而得出来的61个视图表。
 

也可通过下令行查看。
  1.  -- 查看视图
  2.  SELECT * FROM information_schema.views;
  3.  ​
  4.  -- 查看物化视图
  5.  SELECT mv.relname AS materialized_view,
  6.         pg_get_viewdef(mv.oid) AS definition
  7.  FROM pg_class mv
  8.  WHERE mv.relkind = 'm';
复制代码
从上面的下令行运行结果发现,有报错信息。最后天生的物化视图58个,还缺3个:icustay_hourly、sofa、sepsis3。
 

报错的标题是UNNEST函数不存在。改写代码,不用UNNEST函数实现。
icustay_hourly.sql的代码改写后如下:
  1.  -- THIS SCRIPT IS AUTOMATICALLY GENERATED. DO NOT EDIT IT DIRECTLY.
  2.  DROP TABLE IF EXISTS mimiciv_derived.icustay_hourly; CREATE TABLE mimiciv_derived.icustay_hourly AS
  3.  /* This query generates a row for every hour the patient is in the ICU. */ /* The hours are based on clock-hours (i.e. 02:00, 03:00). */ /* The hour clock starts 24 hours before the first heart rate measurement. */ /* Note that the time of the first heart rate measurement is ceilinged to */ /* the hour. */ /* this query extracts the cohort and every possible hour they were in the ICU */ /* this table can be to other tables on stay_id and (ENDTIME - 1 hour,ENDTIME] */ /* get first/last measurement time */
  4.  WITH all_hours AS (
  5.    SELECT
  6.      it.stay_id,
  7.      CASE
  8.        WHEN DATE_TRUNC('HOUR', it.intime_hr) = it.intime_hr
  9.        THEN it.intime_hr
  10.        ELSE DATE_TRUNC('HOUR', it.intime_hr) + INTERVAL '1 HOUR'
  11.      END AS endtime,
  12.      GENERATE_SERIES(-24, CAST(CEIL(EXTRACT(EPOCH FROM it.outtime_hr - it.intime_hr) / 3600.0) AS INT)) AS hrs,
  13.      row_number() over (partition by it.stay_id order by it.intime_hr) as rn
  14.    FROM mimiciv_derived.icustay_times AS it
  15.  )
  16.  SELECT
  17.    a.stay_id,
  18.    CAST(a.hrs AS BIGINT) AS hr,
  19.    a.endtime + CAST(b.hrs AS BIGINT) * INTERVAL '1 HOUR' AS endtime
  20.  FROM all_hours a
  21.  JOIN all_hours b ON a.stay_id = b.stay_id AND a.rn = b.rn;
复制代码
进入到三个文件的上一级目次,再运行cmd,然后依次运行sql代码就行。
  1.  -- cmd
  2.  psql -U postgres -p 5432
  3.  ​
  4.  -- 进入mimiciv数据库
  5.  \c mimiciv
  6.  ​
  7.  -- 物化视图:demographics/icustay_hourly.sql
  8.  \i 'demographics/icustay_hourly.sql'
  9.  ​
  10.  ​
  11.  -- 物化视图:score/sofa.sql
  12.  \i 'score/sofa.sql'
  13.  ​
  14.  -- 物化视图:sepsis/sepsis3.sql
  15.  \i 'sepsis/sepsis3.sql'
复制代码
 
3 物化视图

(1)age,年龄

  1.  SELECT  
  2.    ad.subject_id
  3.    , ad.hadm_id
  4.    , ad.admittime
  5.    , pa.anchor_age
  6.    , pa.anchor_year
  7.    , mimiciv_derived.DATETIME_DIFF(ad.admittime, mimiciv_derived.DATETIME(pa.anchor_year, 1, 1, 0, 0,0),'YEAR') + pa.anchor_age AS age
  8.  FROM mimiciv_hosp.admissions  AS ad
  9.  INNER JOIN mimiciv_hosp.patients  AS pa
  10.  ON ad.subject_id = pa.subject_id
  11.  ;
复制代码
 

这里就使用了官方函数 "mimiciv_derived.DATETIME_DIFF" 和 “mimiciv_derived.DATETIME”, 假如您没有安装官方函数就会报错!
(2)weight_duration,体重变革

weight_duration, ICU期间体重的变革,体重是反应患者营养状态的重要因素。具体代码看物化视图天生所调用的sql文件。
 

(3)GCS, 神经体系功能评分

GCS评分, 神经体系功能评分。
 

二、MIMIC IV数据提取

MIMIC IV的关键数据有哪些,常用提取下令是什么?
1 MIMIC IV关键数据与表格

mimiciv_derived模块的61个视图具有非常多的提取信息,基本上一个入住ICU病人的90%以上的信息都能在这提取!!
除了61个视图,另有15个函数。
 

(1)计算患者的真实年龄

   患者的真实年龄 = anchor_age + admittime - anchor_year
  anchor_age,anchor_year可以在mimiciv_hosp.patients表中找到
患者的入院年龄可以在mimiciv_hosp.admissions表中找到。

公式用sql代码体现:
  1.  pa.anchor_age + mimiciv_derived.DATETIME_DIFF(ad.admittime,
  2.      mimiciv_derived.DATETIME(pa.anchor_year, 1, 1, 0, 0, 0)
复制代码
这里就使用了官方函数 "mimiciv_derived.DATETIME_DIFF" 和 “mimiciv_derived.DATETIME”, 假如您没有安装官方函数就会报错!
这个算法算出来的年龄是小数,我们可以使用ROUND函数转成整数,完整SQL如下:
  1.  SELECT  ad.subject_id,  
  2.  MAX (ROUND(pa.anchor_age + mimiciv_derived.datetime_diff ( ad.admittime, mimiciv_derived.DATETIME ( pa.anchor_year, 1, 1, 0, 0, 0 ), 'YEAR' ), 0)) AS Age_real
  3.  FROM  mimiciv_hosp.admissions AS ad,  mimiciv_hosp.patients AS pa
  4.  WHERE  ad.subject_id = pa.subject_id
  5.  GROUP BY  ad.subject_id   LIMIT 200;
复制代码
 

第二种方法:物化视图age表
  1.  SELECT subject_id, ROUND(age,0) as age
  2.  FROM age
  3.  LIMIT 100
复制代码
 

 
(2)排除小于18岁的患者


先使用子查询先查询出来患者的真实年龄,再排撤除小于18岁的患者。
  1. with base as (
  2.     SELECT ad.subject_id,
  3.     MAX(ROUND(pa.anchor_age + mimiciv_derived.DATETIME_DIFF(ad.admittime,  
  4.         mimiciv_derived.DATETIME(pa.anchor_year, 1, 1, 0, 0, 0), 'YEAR'),0)) AS age
  5.     FROM mimiciv_hosp.admissions ad, mimiciv_hosp.patients pa
  6.     WHERE ad.subject_id = pa.subject_id
  7.     GROUP BY ad.subject_id
  8. )
  9. SELECT * FROM base WHERE base.age >=18 limit 100;
复制代码
 

可以根据代码修改筛选出来其他年龄段的患者。

  1. SELECT subject_id, ROUND(age,0) as age
  2. FROM mimic_derived.age
  3. WHERE anchor_age >=18
  4. LIMIT 100
复制代码
 
(3)首次入院记录

以急性胰腺反复入院的患者为例,仅保留首次入院数据。
MIMIV IV数据库记录了29万个病人的43万条入院记录,但是在做数据分析和提取的时候,通常只必要提取某个病人的首次入院记录。

postgres数据库的内置函数ROW_NUMBER,可将患者举行分组,并可以按照入院时间排序。
  1. SELECT ADM.SUBJECT_ID, ADM.HADM_ID, ADM.ADMITTIME,
  2. -- 分组排序
  3. ROW_NUMBER () OVER(PARTITION BY ADM.SUBJECT_ID ORDER BY ADM.ADMITTIME) AS ADMITTIME_RANK
  4. FROM MIMICIV_HOSP.ADMISSIONS AS ADM
  5. LIMIT 100;
复制代码
 


利用子查询,过滤出患者的首次入院记录
  1. WITH base AS (
  2.     SELECT ADM.SUBJECT_ID,  ADM.HADM_ID,  ADM.ADMITTIME,  
  3.     ROW_NUMBER () OVER(PARTITION BY ADM.SUBJECT_ID ORDER BY ADM.ADMITTIME) AS ADMITTIME_RANK
  4.     FROM MIMICIV_HOSP.ADMISSIONS AS ADM
  5. )
  6. SELECT * FROM base WHERE base.ADMITTIME_RANK = 1 LIMIT 100;
复制代码
 

(4)首次进ICU记录


一个患者可以多次住ICU,一个subject_id对应着多个icustay_id。研究通常只选择该患者第一次住ICU的记录。
入ICU记录按照intime排序,这样就可以看到每个icustay_id是对应患者的第频频入ICU。
  1. -- 每个病人按照进icu时间排序
  2. select stay_id,
  3. rank() over (partition by subject_id order by intime) as icu_order
  4. from icustays
复制代码
 

第二种方法:物化视图icustay_detail
  1. -- 每个病人按照进icu时间排序
  2. select subject_id, stay_id
  3. from icustay_detail
  4. where first_icu_stay = 't'
复制代码
 

(5)高血压(hypertension)患者数据

在d_icd_diagnoses查找高血压的ICD编号和版本。
  1. -- 高血压的icd
  2. SELECT * FROM "D_ICD_DIAGNOSES" WHERE long_title like '%hypertension%'
复制代码
结果许多个,具体哪一个呢,照旧哪几个。
 

假如确定高血压的具体种别是Unspecified essential hypertension(原发性高血压),就确定下来了:icd_code=4019, icd_version=9。
  1. -- 查看Unspecified essential hypertension的患者
  2. SELECT DISTINCT (HADM_ID)
  3. FROM MIMICIV_HOSP.DIAGNOSES_ICD
  4. WHERE ICD_CODE = '4019' and ICD_VERSION = 9
  5. LIMIT 100;
复制代码
 

假如不确定,查看这些高血压相干的疾病分别都多少患者手术。一种思路是选最多最广泛的群体。
  1. -- 查看这些高血压相关的疾病分别都多少患者
  2. -- 相关疾病保存到hypertension_diseases
  3. WITH hypertension_diseases AS (
  4.     SELECT icd_code, icd_version FROM D_ICD_DIAGNOSES WHERE long_title like '%hypertension%'
  5. )
  6. -- 然后,我们可以使用这个子查询来联合查询用户表,并计算指定疾病编号和版本的病人手术数量
  7. SELECT d.icd_code, d.icd_version, COUNT(*) AS count
  8. FROM DIAGNOSES_ICD d
  9. JOIN hypertension_diseases h ON d.icd_code = h.icd_code AND d.icd_version = h.icd_version
  10. GROUP BY d.icd_code, d.icd_version ORDER BY count DESC
复制代码

一是结果赋值;
二是别名;
三是统计数count等函数,另有sum等;
四是联合查询join on;
五是分组;
六是排序。
 


提问:有两个表,一个是疾病与其编号和编号版本的表D_ICD_DIAGNOSES。二是患者与疾病编号的表DIAGNOSES_ICD。用sql语言,首先模糊查找高血压hypertension的编号和版本,然后根据查出的多个结果,查询DIAGNOSES_ICD表中这些结果对应的患者数量。
 

 

GPT给出的代码和终极的代码已经很靠近了,稍作调整就可以用!
(6)脓毒症(sepsis)患者数据


在前面的代码基础上加上了疾病名称long_title。
  1. -- 相关疾病信息保存到diseases
  2. WITH diseases AS (
  3.     -- %疾病名称英文%对应修改!
  4.     SELECT icd_code, icd_version, long_title FROM D_ICD_DIAGNOSES WHERE long_title like '%疾病名称英文%'
  5. )
  6. -- 然后,我们可以使用这个子查询来联合查询用户表,并计算指定疾病编号和版本的病人手术数量
  7. SELECT d.icd_code, d.icd_version, h.long_title, COUNT(*) AS count
  8. FROM DIAGNOSES_ICD d
  9. JOIN diseases h ON d.icd_code = h.icd_code AND d.icd_version = h.icd_version
  10. GROUP BY d.icd_code, d.icd_version, h.long_title ORDER BY count DESC
复制代码
以脓毒症(Sepsis)为例。
  1. -- 相关疾病信息保存到diseases
  2. WITH diseases AS (
  3.     -- Sepsis
  4.     SELECT icd_code, icd_version, long_title FROM D_ICD_DIAGNOSES WHERE long_title like '%Sepsis%'
  5. )
  6. -- 然后,我们可以使用这个子查询来联合查询用户表,并计算指定疾病编号和版本的病人手术数量
  7. SELECT d.icd_code, d.icd_version, h.long_title, COUNT(*) AS count
  8. FROM DIAGNOSES_ICD d
  9. JOIN diseases h ON d.icd_code = h.icd_code AND d.icd_version = h.icd_version
  10. GROUP BY d.icd_code, d.icd_version, h.long_title ORDER BY count DESC
复制代码
 


上传截图到百度翻译。
 

(7)脑梗(cerebral infarction)患者数据

脑卒中,英文stroke
  1. -- 相关疾病信息保存到diseases
  2. WITH diseases AS (
  3.     -- stroke
  4.     SELECT icd_code, icd_version, long_title FROM MIMICIV_HOSP.D_ICD_DIAGNOSES WHERE long_title like '%stroke%'
  5. )
  6. -- 然后,我们可以使用这个子查询来联合查询用户表,并计算指定疾病编号和版本的病人手术数量
  7. SELECT d.icd_code, d.icd_version, h.long_title, COUNT(*) AS count
  8. FROM MIMICIV_HOSP.DIAGNOSES_ICD d
  9. JOIN diseases h ON d.icd_code = h.icd_code AND d.icd_version = h.icd_version
  10. GROUP BY d.icd_code, d.icd_version, h.long_title ORDER BY count DESC
复制代码
 

发现这个结果不太好,数据量不多,而且不怎么相干。
试试脑梗死,这是脑卒中的一种,英文,cerebral infarction。
  1. -- 相关疾病信息保存到diseases
  2. WITH diseases AS (
  3.     -- cerebral infarction`
  4.     SELECT icd_code, icd_version, long_title FROM MIMICIV_HOSP.D_ICD_DIAGNOSES WHERE long_title like '%cerebral infarction%'
  5. )
  6. -- 然后,我们可以使用这个子查询来联合查询用户表,并计算指定疾病编号和版本的病人手术数量
  7. SELECT d.icd_code, d.icd_version, h.long_title, COUNT(*) AS count
  8. FROM MIMICIV_HOSP.DIAGNOSES_ICD d
  9. JOIN diseases h ON d.icd_code = h.icd_code AND d.icd_version = h.icd_version
  10. GROUP BY d.icd_code, d.icd_version, h.long_title ORDER BY count DESC
复制代码
 

相干疾病类型挺多,必要筛选一下。前2个V1254、Z8673是个人病史,不相干。最后几个G43501、G43509等是没有脑梗的,不纳入。
 
  1. -- 脑梗患者数据
  2. SELECT d.subject_id,d.hadm_id
  3. from mimiciv_hosp.diagnoses_icd d
  4. WHERE d.icd_code in ('43301','43330','43331','43390','43391','G43609','43411',
  5. 'I630','I63012','I63013','I6302','I63032','I63033','I63039','I6309','I631',
  6. 'I6310','I63111','I63112','I63113','I63139','I632','I6320','I63212',
  7. 'I63213','I63219','I6322','I63231','I63232','I6330','I63311','I63312',
  8. 'I63319','I6332','I63321','I63322','I63339','I6334','I63341','I63342','I63343','I6339','I6341',
  9. 'I63411','I63412','I63413','I63419','I6342','I6343','I63431','I63432','I63433',
  10. 'I63441','I63442','I6350','I6351','I63511','I63512','I63513','I6352','I63521','I63529','I6353','I63531','I63532','I63541','I63542','I63549','I6359','I636','I69312','I69314','I69341','I69342','I69359','I69361','I69362','I69363','I69364')
复制代码
 

(8)生命体征信息(身高、体重、BMI)


  1. SELECT i.subject_id,i.stay_id,i.icu_intime,c.charttime charttime_height,c.height
  2. FROM mimiciv_derived.icustay_detail i
  3. INNER JOIN mimiciv_derived.height c ON  i.subject_id=c.subject_id AND i.stay_id=c.stay_id
  4. WHERE c.charttime BETWEEN i.icu_intime AND mimiciv_derived.DATETIME_ADD(i.icu_intime, INTERVAL '24' HOUR)
  5. AND c.height is not NULL
复制代码
 


  1. SELECT i.subject_id,i.stay_id,i.icu_intime,c.starttime as starttime_weight,c.weight
  2. FROM mimiciv_derived.icustay_detail i
  3. INNER JOIN mimiciv_derived.weight_durations c ON  i.stay_id=c.stay_id
  4. WHERE c.starttime BETWEEN i.icu_intime AND mimiciv_derived.DATETIME_ADD(i.icu_intime, INTERVAL '24' HOUR)
  5. AND c.weight is not NULL
复制代码
 


可用R语言实现。
  1. # BMI: bmi = weight_kg / (height_m ** 2)
  2. d$BMI <- round(d$weight_kg/(d$height_m/100)**2,2)
  3. table(d$weight_kg, useNA = 'ifan')
  4. table(d$height_m , useNA = 'ifan')
  5. table(d$BMI      , useNA = 'ifan')
  6. d1 <- subset(d, is.na(d$BMI))
复制代码
后面联合数据合并举行实操。
(9)GCS神经体系功能评分

第一天入住ICU的GCS评分
  1. with t1 as(
  2. SELECT i.subject_id,i.stay_id,i.icu_intime, g.charttime as charttime_gcs, g.gcs,
  3. ROW_NUMBER () OVER(PARTITION BY g.SUBJECT_ID ORDER BY g.charttime) AS CHARTTIME_RANK
  4. FROM mimiciv_derived.icustay_detail i
  5. INNER JOIN mimiciv_derived.gcs g ON i.stay_id=g.stay_id
  6. WHERE g.charttime BETWEEN i.icu_intime AND mimiciv_derived.DATETIME_ADD(i.icu_intime, INTERVAL '24' HOUR)
  7. AND g.gcs is not NULL
  8. )
  9. select * from t1 where CHARTTIME_RANK = 1
复制代码
 

GCS、APS III、SAPS II、OASIS这四个指标在物化视图mimiciv_derived中都有对应的表格。稍作修改就能提取。
(10)合并症数据

物化视图charlson表为合并症常用数据表,包罗18种疾病和1个指数。
列名称subject_id:病人IDhadm_id: 住院IDage_score:年龄得分myocardial infarct: 心肌梗死congestive_heart_failure: 充血性心力衰竭peripheral_vascular_disease: 外周血管疾病cerebrovascular_disease:脑血管疾病dementia:痴呆症hronic_pulmonary_disease: 慢性肺部疾病heumatic_disease:风湿病charlson entic_ulcer_disease:消化性溃疡病mild liver disease:轻度肝病diabetsithout_.cc:无并发症的糖尿病diabetes with ce:有并发症的糖尿病paraplegia:截瘫renal disease:肾脏疾病malignant_cancer:恶性肿瘤severe_liver_disease: 重度肝病metastatic_solid-tumor:转移性实体痘aids:艾滋病charlson comorbidity index:查尔森共病指数 以Paraplegia(截瘫)、Renal disease(肾脏病)、CCI(charlson_comorbidity_index)为例。
  1. SELECT i.subject_id,i.stay_id,i.icu_intime,c.paraplegia,c.renal_disease,c.charlson_comorbidity_index
  2. FROM mimiciv_derived.icustay_detail i
  3. INNER JOIN mimiciv_derived.charlson c ON i.subject_id=c.subject_id AND i.hadm_id=c.hadm_id
复制代码
 

(11)呼吸衰竭

首先,查询呼吸衰竭的诊断icd-code。呼吸衰竭:respiratory failure。
  1. -- 相关疾病信息保存到diseases
  2. WITH diseases AS (
  3.     -- respiratory failure
  4.     SELECT icd_code, icd_version, long_title FROM MIMICIV_HOSP.D_ICD_DIAGNOSES WHERE long_title like '%respiratory failure%'
  5. )
  6. -- 然后,我们可以使用这个子查询来联合查询用户表,并计算指定疾病编号和版本的病人手术数量
  7. SELECT d.icd_code, d.icd_version, h.long_title, COUNT(*) AS count
  8. FROM MIMICIV_HOSP.DIAGNOSES_ICD d
  9. JOIN diseases h ON d.icd_code = h.icd_code AND d.icd_version = h.icd_version
  10. GROUP BY d.icd_code, d.icd_version, h.long_title ORDER BY count DESC
复制代码
 

接着,选取ICD编号和版本,提取相应的数据。这里不妨以急性呼吸衰竭(Acute respiratory failure)为例。
  1. with t1 as (
  2. SELECT d.subject_id,d.hadm_id,
  3. case when d.icd_code is not null then 1
  4. ELSE 0
  5. END AS Respiratory_failure
  6. from mimiciv_hosp.diagnoses_icd d
  7. WHERE (d.icd_code in ('51851', '51853', '51881', '51883','51884') and d.icd_version=9) OR
  8. (d.icd_code in ('J95821', 'J95822', 'J9600', 'J9601','J9602','J9620','J9621','J9622') and d.icd_version=10)
  9. )
  10. -- 在t1的基础上增加了mimiciv_derived.icustay_detail表中的信息
  11. SELECT i.subject_id,i.stay_id,i.icu_intime,t1.Respiratory_failure
  12. FROM mimiciv_derived.icustay_detail i
  13. LEFT JOIN t1 on t1.subject_id=i.subject_id and t1.hadm_id=i.hadm_id
复制代码
 

(12)肺栓塞

首先,查询肺栓塞的诊断icd-code。肺栓塞:pulmonary embolism。
  1. -- 相关疾病信息保存到diseases
  2. WITH diseases AS (
  3.     -- pulmonary embolism
  4.     SELECT icd_code, icd_version, long_title FROM MIMICIV_HOSP.D_ICD_DIAGNOSES WHERE long_title like '%pulmonary embolism%'
  5. )
  6. -- 然后,我们可以使用这个子查询来联合查询用户表,并计算指定疾病编号和版本的病人手术数量
  7. SELECT d.icd_code, d.icd_version, h.long_title, COUNT(*) AS count
  8. FROM MIMICIV_HOSP.DIAGNOSES_ICD d
  9. JOIN diseases h ON d.icd_code = h.icd_code AND d.icd_version = h.icd_version
  10. GROUP BY d.icd_code, d.icd_version, h.long_title ORDER BY count DESC
复制代码
 

接着,选取ICD编号和版本,提取相应的数据。
('41519' , 'I2699' , '41511', '41512', 'I2692', 'I2698' , '41513' , 'I2609', 'I2694', 'I2602', 'I2693', '67382')
  1. with t1 as (
  2. SELECT d.subject_id,d.hadm_id,
  3. case when d.icd_code is not null then 1
  4. ELSE 0
  5. END AS pulmonary_embolism
  6. from mimiciv_hosp.diagnoses_icd d
  7. WHERE d.icd_code in ('41519' , 'I2699' , '41511', '41512', 'I2692', 'I2698' , '41513' , 'I2609', 'I2694', 'I2602', 'I2693', '67382')
  8. )
  9. -- 在t1的基础上增加了mimiciv_derived.icustay_detail表中的信息
  10. SELECT i.subject_id,i.stay_id,i.icu_intime,t1.pulmonary_embolism
  11. FROM mimiciv_derived.icustay_detail i
  12. LEFT JOIN t1 on t1.subject_id=i.subject_id and t1.hadm_id=i.hadm_id
复制代码
 

(13)实验室检查指标:红细胞rbc、wbc、plt、hb


这几个指标都在mimiciv_derived.complete_blood_count表中。
  1. SELECT i.subject_id,i.stay_id,i.icu_intime,c.charttime,c.rbc, c.wbc, c.platelet, c.hemoglobin
  2. FROM mimiciv_derived.icustay_detail i
  3. LEFT JOIN mimiciv_derived.complete_blood_count c ON i.subject_id=c.subject_id AND i.hadm_id=c.hadm_id
  4. WHERE c.charttime BETWEEN mimiciv_derived.DATETIME_SUB(i.icu_intime, INTERVAL '6' HOUR)
  5. AND mimiciv_derived.DATETIME_ADD(i.icu_intime, INTERVAL '24' HOUR)
  6. AND c.rbc is not null
复制代码
 

 
(14)实验室检查指标:血钠、肌酐


这两个指标在mimiciv_derived.chemistry表中。
  1. SELECT i.subject_id,i.stay_id,i.icu_intime,c.charttime, c.sodium, c.creatinine, c.glucose
  2. FROM mimiciv_derived.icustay_detail i
  3. LEFT JOIN mimiciv_derived.chemistry c ON i.subject_id=c.subject_id AND i.hadm_id=c.hadm_id
  4. WHERE c.charttime BETWEEN mimiciv_derived.DATETIME_SUB(i.icu_intime, INTERVAL '6' HOUR)
  5. AND mimiciv_derived.DATETIME_ADD(i.icu_intime, INTERVAL '24' HOUR)
复制代码
 

(15)甘油三脂

首先,甘油三酯的编号,在MIMICIV_HOSP.D_LABITEMS中查找,为51000。
  1. SELECT itemid, label FROM MIMICIV_HOSP.D_LABITEMS WHERE label like '%Triglyceride%'
复制代码
 

接着,查找查验表labevents,甘油三酯的检测值。
  1. SELECT subject_id, hadm_id, charttime, valuenum
  2. FROM mimiciv_hosp.labevents
  3. WHERE itemid in ('51000') and valuenum is not null
复制代码
 

继续,获取ICU病人的数据信息。
  1. WITH t1 as (SELECT subject_id, hadm_id, charttime, valuenum
  2. FROM mimiciv_hosp.labevents
  3. WHERE itemid in ('51000') and valuenum is not null)SELECT i.subject_id,i.stay_id,i.icu_intime,t1.charttime charttime_Triglyceride,t1.valuenum TriglycerideFROM mimiciv_derived.icustay_detail iINNER JOIN t1 ON i.subject_id= t1.subject_id AND i.hadm_id=t1.hadm_idWHERE t1.charttime  BETWEEN mimiciv_derived.DATETIME_SUB(i.icu_intime, INTERVAL '6' HOUR) AND mimiciv_derived.DATETIME_ADD(i.icu_intime, INTERVAL '24' HOUR)
复制代码
 

 
(16)葡萄糖

glucose这个指标在mimiciv_derived.chemistry表中。参考血钠和肌酐的数据提取。
(17)终末期肾功能不全

AKI:急性肾损伤,在在mimiciv_derived.kdigo_stages表中。
aki_stage :疾病状态,3体现终末期。
  1. SELECT i.subject_id,i.stay_id,i.icu_intime,i.first_icu_stay,
  2. CASE WHEN c.aki_stage is not null THEN 1
  3. ELSE 0
  4. END AS end_stage_renal_disease
  5. FROM mimiciv_derived.icustay_detail i
  6. LEFT JOIN mimiciv_derived.kdigo_stages c ON i.subject_id=c.subject_id AND i.hadm_id=c.hadm_id
  7. WHERE c.charttime BETWEEN i.icu_intime AND i.icu_outtime
  8. AND c.aki_stage in (3)
复制代码
 

(18)肝硬化

首先,查询肝硬化的诊断icd-code。肝硬化:cirrhosis。
  1. -- 相关疾病信息保存到diseases
  2. WITH diseases AS (
  3.     -- cirrhosis
  4.     SELECT icd_code, icd_version, long_title FROM MIMICIV_HOSP.D_ICD_DIAGNOSES WHERE long_title like '%cirrhosis%'
  5. )
  6. -- 然后,我们可以使用这个子查询来联合查询用户表,并计算指定疾病编号和版本的病人手术数量
  7. SELECT d.icd_code, d.icd_version, h.long_title, COUNT(*) AS count
  8. FROM MIMICIV_HOSP.DIAGNOSES_ICD d
  9. JOIN diseases h ON d.icd_code = h.icd_code AND d.icd_version = h.icd_version
  10. GROUP BY d.icd_code, d.icd_version, h.long_title ORDER BY count DESC
复制代码
 

接着,选取ICD编号和版本,提取相应的数据。
  1. with t1 as (
  2. SELECT d.subject_id,d.hadm_id,
  3. case when d.icd_code is not null then 1
  4. ELSE 0
  5. END AS cirrhosis
  6. from mimiciv_hosp.diagnoses_icd d
  7. WHERE d.icd_code in ('5712','5715','5716','K703','K7030','K7031','K717','K74','K741','K742','K743',
  8.                      'K744','K745','K746','K7460','K7469','P7881')
  9. )
  10. -- 在t1的基础上增加了mimiciv_derived.icustay_detail表中的信息
  11. SELECT i.subject_id,i.stay_id,i.icu_intime,t1.cirrhosis
  12. FROM mimiciv_derived.icustay_detail i
  13. LEFT JOIN t1 on t1.subject_id=i.subject_id and t1.hadm_id=i.hadm_id
复制代码
 

(19)癌症(恶性肿瘤)

癌症,恶性肿瘤,malignant_cancer,在mimiciv_derived.charlson表中。
  1. SELECT i.subject_id,i.stay_id,i.icu_intime,c.malignant_cancer
  2. FROM mimiciv_derived.icustay_detail i
  3. INNER JOIN mimiciv_derived.charlson c ON i.subject_id=c.subject_id AND i.hadm_id=c.hadm_id
复制代码
 

 
2 MIMIC IV数据提取实操

从常见的研究主题出发,考虑多个环境的组合。比如高血压患者的血液指标、首次入院时年龄不小于18岁的高血压患者。
(1)首次入院时年龄不小于18岁的患者


  1. WITH BASE0 AS (
  2.     SELECT AD.hadm_id,  MAX(ROUND(pa.anchor_age + mimiciv_derived.datetime_diff(ad.admittime,
  3.     mimiciv_derived.datetime(pa.anchor_year, 1, 1, 0, 0, 0), 'YEAR'), 0)) AS age
  4.     FROM mimiciv_hosp.admissions AS AD, mimiciv_hosp.patients AS PA
  5.     WHERE ad.subject_id = pa.subject_id
  6.     GROUP BY ad.hadm_id
  7.     LIMIT 1000
  8. ),
  9. BASE1 AS (
  10.     SELECT ADM.subject_id, ADM.hadm_id, ADM.admittime,
  11.     ROW_NUMBER() OVER( PARTITION BY ADM.subject_id ORDER BY ADM.admittime) AS admittime_rank
  12.     FROM mimiciv_hosp.admissions AS ADM
  13. )
  14. SELECT * FROM BASE0, BASE1
  15. WHERE BASE0.age >= 18 AND BASE1.admittime_rank = 1 AND BASE0.hadm_id = base1.hadm_id
  16. LIMIT 200;
复制代码
 


  1. WITH base AS (
  2.     SELECT ADM.SUBJECT_ID,  ADM.HADM_ID,  ADM.ADMITTIME,  
  3.     ROW_NUMBER () OVER(PARTITION BY ADM.SUBJECT_ID ORDER BY ADM.ADMITTIME) AS ADMITTIME_RANK
  4.     FROM MIMICIV_HOSP.ADMISSIONS AS ADM
  5. )
  6. SELECT * FROM base, age
  7. WHERE base.ADMITTIME_RANK = 1 AND age.age >= 18 AND base.hadm_id = age.hadm_id
  8. LIMIT 100;
复制代码
 

也可把提取的变量名明确。
  1. WITH base AS (
  2.     SELECT ADM.SUBJECT_ID,  ADM.HADM_ID,  ADM.ADMITTIME,  
  3.     ROW_NUMBER () OVER(PARTITION BY ADM.SUBJECT_ID ORDER BY ADM.ADMITTIME) AS ADMITTIME_RANK
  4.     FROM MIMICIV_HOSP.ADMISSIONS AS ADM
  5. )
  6. SELECT base.subject_id, base.hadm_id, base.admittime, base.admittime_rank, ROUND(age.age, 0) as age
  7. FROM base, age
  8. WHERE base.ADMITTIME_RANK = 1 AND age.age >= 18 AND base.hadm_id = age.hadm_id
  9. LIMIT 100;
复制代码
 

(2)首次进ICU时年龄不小于18岁的患者


  1. WITH base AS (
  2.     SELECT ICU.SUBJECT_ID,  ICU.STAY_ID,  ICU.intime,  
  3.     ROW_NUMBER () OVER(PARTITION BY ICU.SUBJECT_ID ORDER BY ICU.intime) AS ADMITTIME_RANK
  4.     FROM MIMICIV_ICU.ICUSTAYS AS ICU
  5. )
  6. SELECT base.SUBJECT_ID, base.STAY_ID, base.intime, base.ADMITTIME_RANK, ROUND(age.age, 0) as age
  7. FROM base, age
  8. WHERE base.ADMITTIME_RANK = 1 AND age.age >= 18 AND base.hadm_id = age.hadm_id
  9. LIMIT 100;
复制代码
 


  1. -- 用age和icustay_detail两个物化视图
  2. select i.subject_id, i.stay_id, ROUND(a.age,0) as age
  3. from icustay_detail as i
  4. LEFT JOIN age as a
  5. ON i.first_icu_stay = 't'and i.hadm_id = a.hadm_id and a.age>=18
复制代码
 

(3)首次入院时年龄不小于18岁的脓毒症患者

  1. WITH base AS (
  2.     SELECT ADM.SUBJECT_ID,  ADM.HADM_ID,  ADM.ADMITTIME,  
  3.     ROW_NUMBER () OVER(PARTITION BY ADM.SUBJECT_ID ORDER BY ADM.ADMITTIME) AS ADMITTIME_RANK
  4.     FROM MIMICIV_HOSP.ADMISSIONS AS ADM
  5. ),
  6. t1 as (
  7. SELECT base.subject_id, base.hadm_id, base.admittime, ROUND(age.age, 0)
  8. FROM base, age
  9. WHERE base.ADMITTIME_RANK = 1 AND age.age >= 18 AND base.hadm_id = age.hadm_id;
  10. ),
  11. t2 as (
  12. -- 查看脓毒症的患者
  13. SELECT DISTINCT (HADM_ID)
  14. FROM MIMICIV_HOSP.DIAGNOSES_ICD
  15. WHERE ICD_CODE = 'A419' and ICD_VERSION = 10
  16. )
  17. SELECT * from t1,t2
  18. WHERE t1.HADM_ID = t2.HADM_ID
复制代码
 

(4)首次进ICU时年龄不小于18岁的高血压患者

  1. with t1 as(
  2.     -- 用age和icustay_detail两个物化视图
  3.     select i.subject_id, i.hadm_id, i.stay_id, ROUND(a.age,0) as age
  4.     from icustay_detail as i
  5.     LEFT JOIN age as a
  6.     ON i.first_icu_stay = 't'and i.hadm_id = a.hadm_id and a.age>=18
  7. ),
  8. t2 as (
  9. -- 查看高血压的患者
  10. SELECT DISTINCT (HADM_ID)
  11. FROM MIMICIV_HOSP.DIAGNOSES_ICD
  12. WHERE ICD_CODE = '4019' and ICD_VERSION = 9
  13. )
  14. SELECT * from t1,t2
  15. WHERE t1.HADM_ID = t2.HADM_ID
复制代码
 

(5)首次进ICU时年龄不小于18岁的脑梗患者

  1. with t1 as(
  2.     -- 用age和icustay_detail两个物化视图
  3.     select i.subject_id, i.hadm_id, i.stay_id, ROUND(a.age,0) as age
  4.     from icustay_detail as i
  5.     LEFT JOIN age as a
  6.     ON i.first_icu_stay = 't'and i.hadm_id = a.hadm_id and a.age>=18
  7. ),
  8. t2 as (
  9. -- 脑梗患者
  10. SELECT d.subject_id,d.hadm_id
  11. from mimiciv_hosp.diagnoses_icd d
  12. WHERE d.icd_code in ('43301','43330','43331','43390','43391','G43609','43411','I630','I63012','I63013','I6302','I63032','I63033','I63039','I6309','I631','I6310','I63111','I63112','I63113','I63139','I632','I6320','I63212','I63213','I63219','I6322','I63231','I63232','I6330','I63311','I63312','I63319','I6332','I63321','I63322','I63339','I6334','I63341','I63342','I63343','I6339','I6341','I63411','I63412','I63413','I63419','I6342','I6343','I63431','I63432','I63433','I63441','I63442','I6350','I6351','I63511','I63512','I63513','I6352','I63521','I63529','I6353','I63531','I63532','I63541','I63542','I63549','I6359','I636','I69312','I69314','I69341','I69342','I69359','I69361','I69362','I69363','I69364')
  13. )
  14. SELECT t1.* from t1,t2
  15. WHERE t1.HADM_ID = t2.HADM_ID
复制代码
 

保存到一个表里。
  1. -- 创建并保存到新表icu_ci
  2. CREATE TABLE icu_ci as (
  3. with t1 as(
  4.     -- 用age和icustay_detail两个物化视图
  5.     select i.subject_id, i.hadm_id, i.stay_id, ROUND(a.age,0) as age
  6.     from mimiciv_derived.icustay_detail as i
  7.     LEFT JOIN mimiciv_derived.age as a
  8.     ON i.first_icu_stay = 't'and i.hadm_id = a.hadm_id and a.age>=18
  9. ),
  10. t2 as (
  11. -- 脑梗患者
  12. SELECT d.subject_id,d.hadm_id
  13. from mimiciv_hosp.diagnoses_icd d
  14. WHERE d.icd_code in ('43301','43330','43331','43390','43391','G43609','43411','I630','I63012','I63013','I6302','I63032','I63033','I63039','I6309','I631','I6310','I63111','I63112','I63113','I63139','I632','I6320','I63212','I63213','I63219','I6322','I63231','I63232','I6330','I63311','I63312','I63319','I6332','I63321','I63322','I63339','I6334','I63341','I63342','I63343','I6339','I6341','I63411','I63412','I63413','I63419','I6342','I6343','I63431','I63432','I63433','I63441','I63442','I6350','I6351','I63511','I63512','I63513','I6352','I63521','I63529','I6353','I63531','I63532','I63541','I63542','I63549','I6359','I636','I69312','I69314','I69341','I69342','I69359','I69361','I69362','I69363','I69364')
  15. )
  16. SELECT t1.* from t1,t2
  17. WHERE t1.HADM_ID = t2.HADM_ID
  18. )
复制代码
数据表icu_ci在哪儿?看选定的组件。
 

 

 
 
三、MIMIC IV数据合并和洗濯

1 数据变量

研究主题:甘油三脂葡萄糖指数与肾脏疾病的关联性
确定提取哪些数据呢?如下。
变量名称表名subject_id病人编号mimiciv_derived.icustay_detailhadm_id住院编号mimiciv_derived.icustay_detailstay_idICU编号mimiciv_derived.icustay_detailage年龄mimiciv_derived.ageintimeICU进入时间mimiciv_derived.icustay_detailgender性别mimiciv_derived.icustay_detailheight身高mimiciv_derived.heightweight体重mimiciv_derived.weightBMI体质指数-GCS神经体系功能评分mimiciv_derived.gcsrbc红细胞mimiciv_derived.complcreatinine肌酐mimiciv_derived.chemistryrenal_disease肾脏病mimiciv_derived.charlsonTriglyceride甘油三酯mimiciv_hosp.labeventsglucose葡萄糖mimiciv_derived.chemistryTyGTyG指数-- 2 数据合并

新建schema框架,用以保存我们处理的数据。
  1. DROP SCHEMA IF EXISTS work CASCADE;
  2. CREATE SCHEMA work;
复制代码
更多内容和课程资料在公众号!
五 小结


一是做好数据检查,有没有遗漏、有没有重复、有没有多删或少删。基础数据(如首次不小于18岁),然后用left join逐个提取合并。合并前后的数据量变革,一定要做好检查。
二是数据提取用SQL完成,数据洗濯与分析用R。对于提取时间很长的数据,如实验数据(甘油三酯)、药品数据等,可把初始数据提取后保存为一个表,后面必要处理用这个表就行。
三是熟悉研究的基本思路,如数据纳排(按了局变量、自变量、协变量的序次)作为数据选取流程图的依据,是论文的一部门,另有数据格式转化、分类变量等。另外,MIMIC数据分析不涉及权重,与NHANES不同。

 

服务互助见客服二维码。关注B站熊大学习社公众号诺维之舟、熊大学习社。您的一键三连是我最大的动力。

 
 

免责声明:如果侵犯了您的权益,请联系站长,我们会及时删除侵权内容,谢谢合作!更多信息从访问主页:qidao123.com:ToB企服之家,中国第一个企服评测及商务社交产业平台。




欢迎光临 ToB企服应用市场:ToB评测及商务社交产业平台 (https://dis.qidao123.com/) Powered by Discuz! X3.4