The 21st Century has seen a tremendous increase in the amount of biological data
This has been due to rapid advances in DNA sequencing and other technologies
Developments in scientific research have been accompanied by improvements in computing, enabling scientists to interpret complex biological data using bioinformatics applications
Bioinformatics is an interdisciplinary field that develops methods and software to help further our understanding of life by making sense of this data
Although many new bioinformatics applications are at the forefront of applied computing, most scientific research uses standard tools and databases
Data related to gene sequence, protein structure, gene expression or metabolites is curated, annotated and stored in databases such as GenBank, NCBI, EBI, PDB
A range of open source software tools is available to query this data
Sequence similarity
If a scientist has an unknown DNA sequence, they can determine if it codes for a gene
BLAST (Basic Local Alignment Search Tool) search can compare the unknown DNA sequence to all known gene sequences in a particular database
BLAST finds regions of similarity between sequences
The search returns ‘hits’ which are the sequences most related to the search sequence (depending on the parameters set)
There are many variations of BLAST that can be used for different analyses such as protein sequences or comparing multiple input sequences at once
Genetic variation and evolutionary relationships
Scientists can compare homologous gene sequences between many organisms
Sequences are compared using an alignment tool such as Clustal W (there are many alternatives)
This aligns (stacks) the sequences based on similar regions so that variable regions can be identified
This determines the degree of similarity between organisms which gives an indication of how closely related the organisms are
There may be a common ancestral origin but in some organisms, the gene might have accumulated differences over times from random mutations
Tree-like evolutionary diagrams (phylogenetic trees) can be constructed with software such as PhyloWin to show the degree of relatedness to a recent common ancestor
Phylogenetic analysis is useful for biological classification, conservation studies, forensics or molecular epidemiology which can help dictate public health policy
Variants of highly infectious pathogens such as SARS-CoV-2 (a well-known coronavirus) can be identified using these techniques
Sequencing DNA to determine protein sequences
The genetic code can be used to determine the amino acid sequence within a protein
This primary structure information can be used to predict how proteins will fold into their tertiary structure
This gives a greater level of understanding of how a protein functions or interacts with other proteins or molecules
Such information can be used for a range of applications, such as drug design or novel protein engineering in synthetic biology
Bioinformatics allows for large amounts of biological data to be available instantly to researchers across the globe