The Co-evolution of Bioinformatics and “Big Data” Analytics
Bioinformatics is a multidisciplinary field that combines biology, computer science, and statistics to develop methods for processing and interpreting biological data. It has grown exponentially since the late 1980s, when the first databases of protein sequence motifs emerged. Boosted first by the growth of the internet and later by the increasing popularity of high-throughput biological experimentation, bioinformatics has evolved far beyond “motif finding” in recent years. The industrialization of laboratory techniques to make them “high throughput” has revolutionized many fields of biological inquiry, and bioinformatics has evolved rapidly alongside the “big data” that such techniques produce.
An early application of bioinformatics to process and interpret “big data” was the analysis of microarrays, which allowed the expression levels of thousands of genes to be examined simultaneously. More recently, the development of next-generation sequencing technologies that can determine the sequence of hundreds of millions of short pieces of DNA or RNA per experiment has spawned whole new sub-fields of bioinformatics. As the cost of sequencing has decreased, an explosive increase in the use of whole-genome sequencing techniques has transformed molecular biology.
More importantly, the tremendous increase in the quantity and variety of data generated by high-throughput assays has changed the very nature of hypothesis generation and experimental design. Whereas most experiments used to be designed to test a specific hypothesis (e.g., “that expression of gene A will be altered in response to X”), it has now become more common to design experiments that are “data-driven”. Rather than looking individually at gene A, one can simultaneously examine the expression of every gene in the genome and formulate a hypothesis later based on the results. While “hypothesis-driven” experimentation will always be the cornerstone of scientific inquiry, the ability to perform “data-driven” experiments frees the process of discovery from the confines of expectation. For example, next-generation sequencing studies have demonstrated the existence of thousands of new non-coding RNAs and novel gene isoforms that were never detected by more targeted assays.
At the same time, the quantity and multidimensional nature of all of this new data has also changed the nature and scope of bioinformatics. Because of the statistical rules surrounding “multiple testing”, an expression change detected for a single gene in a hypothesis-driven experiment is far more significant than the same expression change detected in “any” gene in a genome-wide experiment. Bioinformatics tools that are designed to analyze these types of experiments must therefore account for such considerations, and bioinformaticians must accordingly have a strong grasp of biostatistics. In addition, bioinformaticians must increasingly use sophisticated programming and data management skills to create and maintain relational databases that are too large and complex for standard commercial software. The sheer quantity of data generated is also too great to be uploaded, downloaded, or otherwise transferred between computers in a timely manner, so bioinformaticians must become proficient at working remotely on a server to manipulate and use data. Because the skills required to analyze “big data” for bioinformatics are also highly applicable to other kinds of data, such as hospital records, bioinformatics has co-evolved with an array of related specialties, such as health informatics, that serve to further drive demand for skilled analysts. As techniques and systems continue to grow in power, scale, and sophistication, we can expect to see ever-increasing demand for “big data analytics” in bioinformatics and related fields. How can we encourage young professionals to evolve to meet this demand?
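The “multiple testing” point above can be made concrete with a short sketch. It is only an illustration, not a recommended analysis pipeline: the function names, the p-values, and the figure of 20,000 genes are all assumptions chosen for the example. It shows two widely used adjustments, the Bonferroni correction and the Benjamini-Hochberg false discovery rate procedure, applied to the same raw p-value in a single-gene setting and in a genome-wide setting.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (false discovery rate)."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical numbers: gene A has a raw p-value of 0.001.
N_GENES = 20000  # illustrative genome-wide gene count (assumption)

# Hypothesis-driven experiment: gene A is the only test performed.
single_test = bonferroni([0.001])

# Data-driven experiment: gene A is one of N_GENES tests; the remaining
# genes are given an unremarkable p-value of 0.5 for illustration.
genome_wide = bonferroni([0.001] + [0.5] * (N_GENES - 1))

print("single-gene adjusted p:", single_test[0])
print("genome-wide adjusted p:", genome_wide[0])
```

Under these assumed numbers, the raw p-value survives unchanged when gene A is the only gene tested, but it is no longer significant after Bonferroni adjustment across 20,000 genes. Benjamini-Hochberg is less conservative and is commonly preferred for genome-wide expression data, where many genes are expected to change.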