Introduction

In a bioRxiv preprint posted in July, which Derek Lowe compares to a 'cruise missile' aimed at the data analyses in a 2020 paper, Gihawi et al. identify two major data analysis flaws, each of which would invalidate the 2020 paper's findings.

In 2020, Poore et al. claimed that a large-scale analysis of DNA and RNA samples from 32 cancer types in the Cancer Genome Atlas (TCGA) identified microbial signatures that were highly predictive of cancer type, and that these signatures might even serve as a diagnostic tool.

First Bioinformatics Analysis Flaw

However, when Gihawi et al. re-analyzed the (next-generation) sequencing data, they found that the draft bacterial genome databases were contaminated with human sequences. Because some sequences labeled as microbial in those databases match short regions of human sequence, human reads were counted as microbial, which resulted in "false positives that were inflated by many orders of magnitude". Further, the reanalysis showed that those false positives persisted because human and common vector sequences had not been filtered out.
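To make the mechanism concrete, here is a minimal toy sketch in Python. It is not the pipeline used by either group: the sequences are invented, the k-mer matching is a crude stand-in for a real read classifier, and the names `kmers` and `matches` are hypothetical helpers. It only illustrates how a "microbial" reference that contains a stretch of human DNA attracts human reads, and how filtering reads against a human reference first removes those hits.

```python
def kmers(seq, k=8):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def matches(read, reference_kmers, k=8):
    """A read 'hits' a reference if it shares at least one k-mer with it."""
    return not kmers(read, k).isdisjoint(reference_kmers)


# Toy sequences, invented purely for illustration.
human = "ACGTTGCATGCCGATAGCTAGGCTTAACGGATCC"
microbe_true = "TTGACCGGTTAACCGGTTTAGGCCAATTGGCATA"
# A "draft genome" that accidentally includes a stretch of human DNA.
microbe_draft = microbe_true + human[5:25]

# Three human reads and one genuinely microbial read.
reads = [human[0:16], human[8:24], human[14:30], microbe_true[3:19]]

human_kmers = kmers(human)
draft_kmers = kmers(microbe_draft)

# Without human filtering, every read is counted as microbial.
naive_hits = [r for r in reads if matches(r, draft_kmers)]

# Remove reads that match the human reference, then classify the rest.
non_human_reads = [r for r in reads if not matches(r, human_kmers)]
filtered_hits = [r for r in non_human_reads if matches(r, draft_kmers)]

print(len(naive_hits), "hits without human filtering")
print(len(filtered_hits), "hits after human filtering")
```

In this toy case the three human reads all hit the contaminated draft genome, so three of the four "microbial" hits are false positives until the human filter is applied.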

Second Bioinformatics Analysis Flaw

The second flaw arose from the use of (Voom-SNM) normalized data, in which the normalization inadvertently tagged samples by cancer type. This resulted in highly accurate classifiers whose performance did not reflect the raw data.

It is common to use normalization methods to mitigate batch effects in large-scale studies. However, the normalization process applied by Poore et al. artificially tagged cancer samples based on their types. These 'tags', when used to train machine learning models, produced near-perfect (95-100%) accuracy.

As an example, the raw (sequencing) read counts for Hepandensovirus in 79 Adrenocortical Carcinoma (ACC) samples were 0. After normalization, 90% of the ACC samples were assigned the (tagged) value of 3.078874655 by Poore et al. Out of 17,625 total samples, only 77 additional samples were assigned values lower than or equal to 3.078874655. This enabled the classifier to accurately identify ACC samples, despite the original raw reads being 0.
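A simple sanity check for this kind of leakage can be sketched in Python. This is an illustrative assumption about how one might audit a normalized matrix, not the method used in the preprint: the function name `flag_leaky_features`, the feature names, and the toy values (other than 3.078874655, which comes from the example above) are all hypothetical. The idea is to flag any feature whose raw counts are zero in every sample yet whose normalized values differ between classes, since a classifier can only be learning the normalization artifact there.

```python
from collections import defaultdict


def flag_leaky_features(raw, normalized, labels, tol=1e-9):
    """Return names of features whose raw counts are all zero but whose
    normalized values still differ between classes (by class mean)."""
    flagged = []
    for feature in raw:
        if any(count != 0 for count in raw[feature]):
            continue  # some raw signal exists; this check does not apply
        by_class = defaultdict(list)
        for value, label in zip(normalized[feature], labels):
            by_class[label].append(value)
        means = [sum(values) / len(values) for values in by_class.values()]
        if max(means) - min(means) > tol:
            flagged.append(feature)
    return flagged


# Toy data: 5 "ACC" samples and 5 samples of another type (hypothetical).
labels = ["ACC"] * 5 + ["other"] * 5
raw = {
    "virus_x": [0] * 10,    # no reads in any sample
    "microbe_y": [0] * 10,  # no reads in any sample
}
normalized = {
    # Normalization assigned a class-specific value despite zero raw reads.
    "virus_x": [3.078874655] * 5 + [4.5] * 5,
    # Normalization assigned the same value to every sample.
    "microbe_y": [2.0] * 10,
}

leaky = flag_leaky_features(raw, normalized, labels)
print(leaky)
```

Here only `virus_x` is flagged: its raw counts carry no information, but its normalized values separate the two classes perfectly.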

Bioinformatics Analysis Blind Spots

Putting aside the biological link between microbes and cancer, this preprint provides a glimpse into a few blind spots that may arise in such bioinformatics analyses:

  • contaminated draft genome database
  • inaccurate normalized data that were used to train machine learning models
  • use of processed data (aligned sequences in BAM format) instead of raw sequence data
  • lack of additional controls or tests

Indeed, the preprint authors' conclusion befits a written cruise missile: "Our conclusion after re-analysis is that the near-perfect association between microbes and cancer types reported in the study is, simply put, a fiction."

For help or a second opinion (audit) on a bioinformatics analysis, reach out at blindspotbio.

Updated Jul. 19, 2024: the 2020 paper has been officially retracted.
