🏥 REAL MODEL OUTPUT EVALUATION
================================================================================
📊 ==============================================================================
EVALUATING SPECIALIST ASSISTANCE REPORTS
================================================================================
================================================================================
CREATING COMPREHENSIVE EVALUATION FROM REAL MODEL OUTPUTS
================================================================================
✅ Subgroup analysis by number of previous histories is ENABLED
Loading real model outputs from: /home/work/sj/medgemma/250827_benchmarking/lingshu-7b/outputs/lm_srrg_temporal_findings_ift/inference_test/inference_results.json
Loaded 1459 samples
================================================================================
REPORT CLIPPED - SHOWING ORIGINAL AND CLIPPED VERSIONS
================================================================================
Original length: 2591 characters
Clipped length: 1642 characters
Clipped at position: 1600
----------------------------------------
ORIGINAL REPORT
----------------------------------------
FINDINGS:
Lungs and Airways:
- Worsening of right upper lobe collapse
- Worsening of right lower lobe collapse
- Worsening of left lower lobe collapse
- Worsening of left upper lobe collapse
- Worsening of left lower lobe consolidation
- Worsening of right lower lobe consolidation
- Worsening of right upper lobe consolidation
- Worsening of right middle lobe consolidation
- Worsening of right lower lobe atelectasis
- Worsening of right upper lobe atelectasis
- Worsening of right middle lobe atelectasis
- Worsening of left lower lobe atelectasis
- Worsening of left upper lobe atelectasis
- Worsening of right lower lobe atelectasis
- Worsening of right upper lobe atelectasis
- Worsening of right middle lobe atelectasis
- Worsening of left lower lobe atelectasis
- Worsening of left upper lobe atelectasis
- Worsening of right lower lobe atelectasis
- Worsening of right upper lobe atelectasis
- Worsening of right middle lobe atelectasis
- Worsening of left lower lobe atelectasis
- Worsening of left upper lobe atelectasis
- Worsening of right lower lobe atelectasis
- Worsening of right upper lobe atelectasis
- Worsening of right middle lobe atelectasis
- Worsening of left lower lobe atelectasis
- Worsening of left upper lobe atelectasis
- Worsening of right lower lobe atelectasis
- Worsening of right upper lobe atelectasis
- Worsening of right middle lobe atelectasis
- Worsening of left lower lobe atelectasis
- Worsening of left upper lobe atelectasis
- Worsening of right lower lobe atelectasis
- Worsening of right upper lobe atelectasis
- Worsening of right middle lobe atelectasis
- Worsening of left lower lobe atelectasis
- Worsening of left upper lobe atelectasis
- Worsening of right lower lobe atelectasis
- Worsening of right upper lobe atelectasis
- Worsening of right middle lobe atelectasis
- Worsening of left lower lobe atelectasis
- Worsening of left upper lobe atelectasis
- Worsening of right lower lobe atelectasis
- Worsening of right upper lobe atelectasis
- Worsening of right middle lobe atelectasis
- Worsening of left lower lobe atelectasis
- Worsening of left upper lobe atelectasis
- Worsening of right lower lobe atelectasis
- Worsening of right upper lobe atelectasis
- Worsening of right middle lobe atelectasis
- Worsening of left lower lobe atelectasis
- Worsening of left upper lobe atelectasis
- Worsening of right lower lobe atelectasis
- Worsening of right upper lobe atelectasis
- Worsening of right middle lobe atelectasis
- Worsening of left lower lobe atelectasis
- Worsening of left upper lobe atelectasis
- Worsening of right lower
----------------------------------------
CLIPPED REPORT
----------------------------------------
FINDINGS:
Lungs and Airways:
- Worsening of right upper lobe collapse
- Worsening of right lower lobe collapse
- Worsening of left lower lobe collapse
- Worsening of left upper lobe collapse
- Worsening of left lower lobe consolidation
- Worsening of right lower lobe consolidation
- Worsening of right upper lobe consolidation
- Worsening of right middle lobe consolidation
- Worsening of right lower lobe atelectasis
- Worsening of right upper lobe atelectasis
- Worsening of right middle lobe atelectasis
- Worsening of left lower lobe atelectasis
- Worsening of left upper lobe atelectasis
- Worsening of right lower lobe atelectasis
- Worsening of right upper lobe atelectasis
- Worsening of right middle lobe atelectasis
- Worsening of left lower lobe atelectasis
- Worsening of left upper lobe atelectasis
- Worsening of right lower lobe atelectasis
- Worsening of right upper lobe atelectasis
- Worsening of right middle lobe atelectasis
- Worsening of left lower lobe atelectasis
- Worsening of left upper lobe atelectasis
- Worsening of right lower lobe atelectasis
- Worsening of right upper lobe atelectasis
- Worsening of right middle lobe atelectasis
- Worsening of left lower lobe atelectasis
- Worsening of left upper lobe atelectasis
- Worsening of right lower lobe atelectasis
- Worsening of right upper lobe atelectasis
- Worsening of right middle lobe atelectasis
- Worsening of left lower lobe atelectasis
- Worsening of left upper lobe atelectasis
- Worsening of right lower lobe atelectasis
- Worsening of right upper lobe atelectasis
- Worsening of right middle lobe atelectas

[Note: Report was clipped due to length]
================================================================================
Using device: cuda:0
================================================================================
EVALUATION RESULTS FOR FINDINGS SECTION
Alignment: Unaligned
================================================================================
📊 OVERALL AVERAGE SCORES:
----------------------------------------
radgraph_simple      : 0.2043
radgraph_partial     : 0.1719
radgraph_complete    : 0.1607
bleu                 : 0.0316
bertscore            : 0.3560
rouge1               : 0.2501
rouge2               : 0.1481
rougeL               : 0.2353
samples_avg_precision: 0.4596
samples_avg_recall   : 0.4442
samples_avg_f1-score : 0.4363

🏥 AVERAGE SCORES PER ORGAN:
----------------------------------------
LUNGS AND AIRWAYS::
  radgraph_simple      : 0.2116
  radgraph_partial     : 0.1775
  radgraph_complete    : 0.1671
  bleu                 : 0.0349
  bertscore            : 0.4469
  rouge1               : 0.2848
  rouge2               : 0.1494
  rougeL               : 0.2632
  samples_avg_precision: 0.4389
  samples_avg_recall   : 0.4705
  samples_avg_f1-score : 0.4210

HILA AND MEDIASTINUM::
  radgraph_simple      : 0.1337
  radgraph_partial     : 0.1065
  radgraph_complete    : 0.1003
  bleu                 : 0.0216
  bertscore            : 0.2235
  rouge1               : 0.1618
  rouge2               : 0.0895
  rougeL               : 0.1466
  samples_avg_precision: 0.3047
  samples_avg_recall   : 0.2909
  samples_avg_f1-score : 0.2943

CARDIOVASCULAR::
  radgraph_simple      : 0.2415
  radgraph_partial     : 0.1863
  radgraph_complete    : 0.1701
  bleu                 : 0.0190
  bertscore            : 0.4524
  rouge1               : 0.2869
  rouge2               : 0.1533
  rougeL               : 0.2674
  samples_avg_precision: 0.6747
  samples_avg_recall   : 0.6323
  samples_avg_f1-score : 0.6393

ABDOMINAL::
  radgraph_simple      : 0.0401
  radgraph_partial     : 0.0343
  radgraph_complete    : 0.0281
  bleu                 : 0.0220
  bertscore            : 0.0687
  rouge1               : 0.0628
  rouge2               : 0.0405
  rougeL               : 0.0581
  samples_avg_precision: 0.1093
  samples_avg_recall   : 0.1066
  samples_avg_f1-score : 0.1075

MUSCULOSKELETAL AND CHEST WALL::
  radgraph_simple      : 0.1225
  radgraph_partial     : 0.1057
  radgraph_complete    : 0.1007
  bleu                 : 0.0291
  bertscore            : 0.2349
  rouge1               : 0.1621
  rouge2               : 0.0994
  rougeL               : 0.1520
  samples_avg_precision: 0.4257
  samples_avg_recall   : 0.4081
  samples_avg_f1-score : 0.4108

PLEURA::
  radgraph_simple      : 0.3717
  radgraph_partial     : 0.3385
  radgraph_complete    : 0.3331
  bleu                 : 0.0327
  bertscore            : 0.4896
  rouge1               : 0.3836
  rouge2               : 0.2695
  rougeL               : 0.3763
  samples_avg_precision: 0.5725
  samples_avg_recall   : 0.5064
  samples_avg_f1-score : 0.5179

OTHER::
  radgraph_simple      : 0.0512
  radgraph_partial     : 0.0444
  radgraph_complete    : 0.0429
  bleu                 : 0.0129
  bertscore            : 0.1132
  rouge1               : 0.0661
  rouge2               : 0.0368
  rougeL               : 0.0647
  samples_avg_precision: 0.1155
  samples_avg_recall   : 0.0946
  samples_avg_f1-score : 0.0985

TUBES, CATHETERS, AND SUPPORT DEVICES::
  radgraph_simple      : 0.2270
  radgraph_partial     : 0.1822
  radgraph_complete    : 0.1422
  bleu                 : 0.0849
  bertscore            : 0.4237
  rouge1               : 0.3490
  rouge2               : 0.2055
  rougeL               : 0.3212
  samples_avg_precision: 0.5831
  samples_avg_recall   : 0.6211
  samples_avg_f1-score : 0.5868

📋 SECTION PRESENCE SCORES:
----------------------------------------
section_avg_precision: 0.7980
section_avg_recall   : 0.8320
section_avg_f1-score : 0.7968

************************************************************
SUBGROUP: 0 Previous Histories - FINDINGS
************************************************************
radgraph_simple      : 0.2043
radgraph_partial     : 0.1719
radgraph_complete    : 0.1607
bleu                 : 0.0316
bertscore            : 0.3560
rouge1               : 0.2501
rouge2               : 0.1481
rougeL               : 0.2353
samples_avg_precision: 0.4596
samples_avg_recall   : 0.4442
samples_avg_f1-score : 0.4363

✅ Specialist assistance evaluation completed successfully!
Check the '/home/work/sj/medgemma/250827_benchmarking/lingshu-7b/outputs/lm_srrg_temporal_findings_ift/eval_test' directory for detailed results.
Total samples evaluated: 1
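
Note on the clipping figures reported in the log (clipped at position 1600, clipped length 1642): the numbers are consistent with keeping the first 1600 characters and appending the 40-character note "[Note: Report was clipped due to length]" after a two-character separator. The sketch below reproduces that arithmetic. It is only an illustration: the helper name clip_report, its parameters, the two-newline separator, and the "clip only when longer than the limit" rule are all assumptions, not taken from the evaluation script that produced this output.

```python
# Hypothetical helper: the name, signature, separator, and threshold rule are
# assumptions inferred from the lengths printed in the log, not the actual
# evaluation code.
CLIP_NOTE = "[Note: Report was clipped due to length]"  # 40 characters


def clip_report(text: str, limit: int = 1600, separator: str = "\n\n") -> str:
    """Return text unchanged if it fits, else its first `limit` characters plus a note."""
    if len(text) <= limit:
        return text
    return text[:limit] + separator + CLIP_NOTE


if __name__ == "__main__":
    # Stand-in string with the original length reported in the log (2591).
    report = "x" * 2591
    clipped = clip_report(report)
    # 1600 kept characters + 2 separator characters + 40 note characters
    print(len(clipped))
```

Under these assumptions a 2591-character report comes out at 1642 characters, the clipped length shown in the log, while any report of 1600 characters or fewer passes through unchanged.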