Two Major Bioinformatics Analysis Flaws in Study Linking Microbes to Cancer Types

Introduction

First Bioinformatics Analysis Flaw

Second Bioinformatics Analysis Flaw

Bioinformatics Analysis Blind Spots

Dive Deeper

Introduction

In a bioRxiv preprint posted in July, which Derek Lowe compares to a 'cruise missile' aimed at the data analyses employed in a 2020 paper, Gihawi et al. identify two major data analysis flaws, each of which would have invalidated the 2020 paper's findings.

Poore et al. claimed in 2020 that a large-scale analysis of DNA and RNA samples from 32 cancer types in The Cancer Genome Atlas (TCGA) identified microbial signatures that were highly predictive of cancer type and might even serve as a diagnostic tool.

First Bioinformatics Analysis Flaw

However, when Gihawi et al. re-analyzed the (next generation) sequencing data, they found that the draft bacterial genome databases were contaminated with human sequences. Because some entries labeled as microbial match short regions of human sequence, human reads were counted as microbial, which resulted in "false positives that were inflated by many orders of magnitude". Further, the reanalysis showed that the failure to filter out human or common vector sequences allowed those false positives to persist.
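To make the mechanism concrete, here is a minimal toy sketch in Python. It is not the pipeline used by either group: the sequences, the 8-base k-mer length, and the "any shared k-mer counts as a hit" rule are all made-up assumptions chosen to keep the example small. It shows how a draft genome that carries one human-derived contig causes human reads to be called microbial, and how screening reads against a human reference first removes those calls.

```python
K = 8  # toy k-mer length (assumption); real classifiers use much longer k-mers


def kmers(seq, k=K):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def windows(seq, length, step):
    """Cut seq into overlapping 'reads' of a fixed length."""
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, step)]


# Hypothetical sequences, invented for illustration only.
HUMAN = "ACGTTGCAAGGCTTAACCGGATCGATTACGGC"
MICROBE = "TTGACCGTAGGCATCCGATAGGCTTCAGGTAC"

# A "draft genome" with one genuine contig and one human-derived contaminant contig.
draft_contigs = [MICROBE, HUMAN[8:24]]
microbe_db = set().union(*(kmers(c) for c in draft_contigs))
human_ref = kmers(HUMAN)

# Six human reads and three genuinely microbial reads.
reads = windows(HUMAN, 12, 4) + windows(MICROBE, 12, 10)


def is_hit(read, index):
    """A read 'hits' an index if it shares at least one k-mer with it."""
    return not kmers(read).isdisjoint(index)


# Without human filtering: human reads overlapping the contaminant are called microbial.
naive_hits = sum(is_hit(r, microbe_db) for r in reads)

# With human filtering: drop reads matching the human reference before classifying.
kept = [r for r in reads if not is_hit(r, human_ref)]
filtered_hits = sum(is_hit(r, microbe_db) for r in kept)

print(naive_hits, filtered_hits)  # prints: 7 3
```

In this toy case four of the seven "microbial" calls are human reads. In a real tumor sample, where human reads vastly outnumber microbial ones, the same effect would be far larger.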

Second Bioinformatics Analysis Flaw

The second flaw arose from the use of (Voom-SNM) normalized data in which samples were artificially tagged by cancer type. This produced highly accurate classifiers whose performance did not reflect the raw data.

It is common to use normalization methods to mitigate batch effects in large-scale studies. However, the normalization process applied by Poore et al. artificially tagged cancer samples based on their types. When these 'tags' were used to train machine learning models, they produced near-perfect (95-100%) accuracy.

As an example, the raw (sequencing) read counts for Hepandensovirus in 79 Adrenocortical Carcinoma (ACC) samples were 0. After normalization, 90% of the ACC samples were assigned the (tagged) value of 3.078874655 by Poore et al. Out of 17,625 total samples, only 77 additional samples were assigned values lower than or equal to 3.078874655. This enabled the classifier to accurately identify ACC samples, despite the original raw read counts being 0.
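The Hepandensovirus example suggests a simple sanity check that can be run before training any model: look for features whose raw counts are all zero within a class, yet whose normalized values cluster on a single number that is rare outside that class. The Python sketch below is one hedged way to express that check. The function name, the thresholds, and the toy data (10 "ACC" and 20 "other" samples) are hypothetical; only the value 3.078874655 is borrowed from the example above, and none of this is code from the preprint.

```python
from collections import Counter


def flag_leaky_features(raw, norm, labels, min_share=0.8, max_outside=0.05):
    """Flag (feature, class) pairs where raw counts are all zero in the class
    but most of its samples share one normalized value that is rare elsewhere.

    raw, norm: dict mapping feature name -> list of per-sample values.
    labels:    list of class labels, one per sample.
    """
    flagged = []
    for feature in raw:
        for cls in sorted(set(labels)):
            inside = [i for i, lab in enumerate(labels) if lab == cls]
            outside = [i for i, lab in enumerate(labels) if lab != cls]
            # Only the "no raw signal at all" pattern is of interest here.
            if not outside or any(raw[feature][i] != 0 for i in inside):
                continue
            value, n = Counter(norm[feature][i] for i in inside).most_common(1)[0]
            share_in = n / len(inside)
            share_out = sum(norm[feature][i] == value for i in outside) / len(outside)
            if share_in >= min_share and share_out <= max_outside:
                flagged.append((feature, cls, value, share_in))
    return flagged


# Hypothetical toy data: the taxon has zero raw reads in every sample.
labels = ["ACC"] * 10 + ["other"] * 20
raw = {"Hepandensovirus": [0] * 30}
norm = {
    "Hepandensovirus": [3.078874655] * 9 + [3.5]
    + [4.0 + 0.01 * i for i in range(20)]
}

flagged = flag_leaky_features(raw, norm, labels)
print(flagged)  # prints: [('Hepandensovirus', 'ACC', 3.078874655, 0.9)]
```

A flagged pair does not prove leakage on its own, but it marks a feature that carries class information the raw data cannot support, which is worth investigating before trusting any classifier built on it.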

Bioinformatics Analysis Blind Spots

Putting aside the biological link between microbes and cancer, this preprint provides a glimpse into a few blind spots that may arise from such bioinformatics analyses:

contaminated draft genome databases
inaccurate normalized data used to train machine learning models
use of processed data (aligned sequences in BAM format) instead of raw sequence data
lack of additional controls or tests

Indeed, the preprint authors' conclusion befits a written cruise missile: "Our conclusion after re-analysis is that the near-perfect association between microbes and cancer types reported in the study is, simply put, a fiction."

For help or a second opinion (audit) on bioinformatics analysis, reach out at blindspotbio.

Special Offer

Free Bioinformatics or Research Consultation

Mention BLOG for a free 20-minute bioinformatics or research consultation!


Update (Jul. 19, 2024): the 2020 paper has been officially retracted.

Dive Deeper

The bioRxiv preprint is now published.
Rebuttal from the original authors.
Commentary in Science.
Commentary in the NY Times.
Commentary in STAT News.
Listed in Retraction Watch.
