Turning publicly available genome data into proteins

I'm a computer scientist who is starting to dabble with biology. My eventual goal is to model different kinds of cells with a computer program. As of right now, I'm just trying to take some smaller steps.

First, I downloaded a complete human genome from UCSC. There is a FASTA file for each chromosome.

Then, I wrote a Java program that converts FASTA DNA sequences into the corresponding amino acid chain.

Next, I made my program look for the "start" code (ATG) and "stop" codes (TAA, TAG, TGA).
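For reference, the procedure described so far can be sketched in a few lines. This is a minimal Python illustration, not the original Java program: the names `translate` and `naive_orfs` are made up for this example, only the forward strand is read, and every ATG is treated as a possible start.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def translate(dna):
    """Translate a DNA string codon by codon, starting at index 0."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons containing N or other ambiguity codes become "X".
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


def naive_orfs(dna):
    """Return (start, end, protein) for each ATG that reaches an in-frame stop.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    """
    dna = dna.upper()
    results = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        for end in range(start, len(dna) - 2, 3):
            if dna[end:end + 3] in STOP_CODONS:
                results.append((start, end + 3, translate(dna[start:end])))
                break
    return results


if __name__ == "__main__":
    print(naive_orfs("CCATGGCTTAAGG"))
```

Run on raw chromosome sequence, a scan like this reports a very large number of candidate stretches, most of which are not real coding sequences.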

So, now I have sequences of amino acids which might theoretically end up folding into proteins. But, before I start diving into protein folding, I wanted to try to verify that the steps I took so far were done correctly. I looked up some important human genes in an online database and found their amino acid sequences. I then searched through my program's data for those sequences and confirmed that they were there. However, the gene was in a different base-pair location than the database said that it should be in.

This led me to some questions that I have so far been unable to answer. Hopefully people here can help shed some light.

  1. I know there are a lot of different publicly available genomes. Maybe the UCSC one that I downloaded is different from the one used by the gene database. How much does each genome vary from each other genome and in what ways do they vary?

  2. In attempting to answer that first question, I was going to download a bunch of genomes from the 1000genomes website and do some comparisons, but I wasn't sure which files to download. Each of the files begins with either ERR or SRR and I'm not sure what that means. This is the folder I'm currently looking in

  3. Let's say I'm trying to model a white blood cell. How do I know which parts of the genome get turned into proteins for that type of cell?

Sorry if anything I said doesn't make sense. As I said, my expertise lies in programming, not biology/genetics.

No, your approach will not work. You are taking a very simplistic view of an extremely complex system. Some of the problems you are ignoring are:

  • Genes (eukaryotic genes anyway) are spliced to produce mRNA, a process that removes introns and leaves only the exons. If you just translate the entire chromosome file you will get noise.

  • Splicing also changes the frame a gene is read in. You don't mention frames at all in your question, but you can't work with coding sequences unless you deal with them.

  • Many genes (most even, in some species) are alternatively spliced. One gene can give rise to multiple protein sequences. Which one is produced at any one time can depend on a multitude of factors ranging from pure chance, through environmental conditions to the cell type where the gene is expressed.

  • Genes can be present on both strands of DNA and a gene on the + strand can overlap with a gene on the - strand. In some cases they can even overlap on the same strand (nested genes). You need to check both strands for coding sequences.

  • You're assuming that all coding sequences start with ATG (most do, not all) and you seem to be assuming that an ATG always starts a coding sequence. A given gene can have dozens or hundreds of ATG codons. How can you know which one is used as the START codon?
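The frame and strand points in the list above can be made concrete with a six-frame scan. This is a minimal sketch, assuming plain A/C/G/T/N input; `reverse_complement` and `orfs_in_six_frames` are illustrative names, and the scan still knows nothing about introns, splicing or non-ATG starts, so it is not a gene predictor.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def orfs_in_six_frames(dna):
    """Return (strand, frame, orf_dna) for each ATG ... stop stretch.

    Frames 0-2 on the '-' strand are offsets into the reverse complement,
    not into the original sequence.
    """
    found = []
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            start = None
            for i in range(offset, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    found.append((strand, offset, seq[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return found


if __name__ == "__main__":
    # The only complete ORF here lies on the minus strand.
    print(orfs_in_six_frames("GGTTACATGG"))
```

In this toy input the forward strand contains an ATG with no in-frame stop, while the reverse complement carries a complete ATG ... TAA stretch, which a forward-only scan would miss.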

The process of identifying the parts of the genome that get translated into protein is not trivial. It is the subject of countless PhD theses, mine for example. There are many programs (gene predictors) that are designed specifically to detect genes in genomic sequences. Having spent many years working with them I can assure you that they're not something you can just whip up one afternoon. They tend to involve very complex models of coding vs. non-coding sequences and are way more sophisticated than simply looking for START and STOP codons. Trying to write one without knowing a lot more about biology than you seem to is just a waste of time.

Your specific questions are basically irrelevant because of the points mentioned above. Nevertheless, the answers are:

  1. They vary, but not much. For well-annotated genomes like the human one, the differences will be negligible. That is not why you have strange results, though, as I explained above.

  2. All public FTP sites tend to have a README file that explains what the files provided are. You should read the relevant README from

  3. Answering that question will get you a Nobel prize. There simply is no way of predicting which genes will be activated in a particular cell. We're not even close to that level of understanding of how a cell works. What I can tell you is that it will not depend on the sequence: you will never be able to predict whether a gene is active in a particular cell based on its DNA sequence. It will depend on various things, including the gene's methylation state, and is largely an emergent property of the cell's complexity (think of various proteins interacting with one another, leading to the activation of a gene). The best you can do is get a list of genes that are known to be active from the literature.

In summary, if you want to do something as complex as modelling a cell, I suggest you first take the time to study some basic biology so you can understand the system you are trying to model a bit better. The cell is not only an extremely complex system that we don't fully understand yet; it is also not wholly deterministic and contains a lot of stochasticity that you seem to be ignoring completely.

Why bother predicting proteins badly from DNA sequence when you could have just as well downloaded the manually curated human proteome?
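Checking a predicted sequence against a curated set comes down to reading a protein FASTA file and doing a substring search. Below is a minimal sketch, assuming the proteome is available as FASTA text; the record names and sequences are toy placeholders, not real proteins, and `read_fasta` and `find_protein` are illustrative helpers.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_protein(query, handle):
    """Return the headers of all records whose sequence contains `query`."""
    return [header for header, seq in read_fasta(handle) if query in seq]


if __name__ == "__main__":
    # Toy stand-in for a downloaded proteome file.
    toy_proteome = ">toy_protein_1\nMKTAIA\nKQRQIS\n>toy_protein_2\nMAAAGG\n"
    print(find_protein("AIAKQ", io.StringIO(toy_proteome)))
```

With a real file, the `io.StringIO` object would be replaced by an open file handle.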

As to your questions:

  1. Are you asking about human genomes or genomes in general? The vast majority of the variance in human genomes is in non-coding sequence. As to genomes in general, they vary in pretty much every imaginable way.

  2. I think those files are quality-filtered Illumina reads. SRA = Sequence Read Archive. SRR = SRA RUN accession. ERA = EMBL SRA. ERR = ERA RUN accession.

  3. You should look into transcriptomics data. Predicting such stuff in silico is currently pretty much undoable.
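The accession prefixes explained in point 2 can be turned into a small lookup when sorting through a directory listing. This is only a sketch based on the two prefixes named above; the function name and the example accessions are made up for illustration.

```python
# Prefix meanings taken from the answer above; other prefixes are left unknown.
RUN_PREFIXES = {
    "SRR": "SRA run accession",
    "ERR": "ERA (EMBL SRA) run accession",
}


def describe_accession(name):
    """Return a short description of a run accession such as 'SRR000001'."""
    prefix = name[:3].upper()
    return RUN_PREFIXES.get(prefix, "unknown prefix")


if __name__ == "__main__":
    for example in ("SRR000001", "ERR000001", "XYZ000001"):
        print(example, "->", describe_accession(example))
```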

MicroProtein-Mediated Recruitment of CONSTANS into a TOPLESS Trimeric Complex Represses Flowering in Arabidopsis

Affiliations Center for Plant Molecular Biology, University of Tübingen, Tübingen, Germany, Copenhagen Plant Science Centre, University of Copenhagen, Copenhagen, Denmark, Department for Plant and Environmental Sciences, University of Copenhagen, Copenhagen, Denmark

Affiliation Leibniz Institute of Plant Genetics and Crop Plant Research, Gatersleben, Germany


Chinese hamster ovary (CHO) cell lines represent the most commonly used mammalian expression system for the production of therapeutic proteins. In this context, detailed knowledge of the CHO cell transcriptome might help to improve biotechnological processes conducted by specific cell lines. Nevertheless, very few assembled cDNA sequences of CHO cells were publicly released until recently, which puts a severe limitation on biotechnological research. Two extended annotation systems and web-based tools, one for browsing eukaryotic genomes (GenDBE) and one for viewing eukaryotic transcriptomes (SAMS), were established as the first step towards a publicly usable CHO cell genome/transcriptome analysis platform. This is complemented by the development of a new strategy to assemble the ca. 100 million reads, sequenced from a broad range of diverse transcripts, to a high quality CHO cell transcript set. The cDNA libraries were constructed from different CHO cell lines grown under various culture conditions and sequenced using Roche/454 and Illumina sequencing technologies in addition to sequencing reads from a previous study. Two pipelines to extend and improve the CHO cell line transcripts were established. First, de novo assemblies were carried out with the Trinity and Oases assemblers, using varying k-mer sizes. The resulting contigs were screened for potential CDS using ESTScan. Redundant contigs were filtered out using cd-hit-est. The remaining CDS contigs were re-assembled with CAP3. Second, a reference-based assembly with the TopHat/Cufflinks pipeline was performed, using the recently published draft genome sequence of CHO-K1 as reference. Additionally, the de novo contigs were mapped to the reference genome using GMAP and merged with the Cufflinks assembly using the cuffmerge software. With this approach 28,874 transcripts located on 16,492 gene loci could be assembled. 
Combining the results of both approaches, 65,561 transcripts were identified for CHO cell lines, which could be clustered by sequence identity into 17,598 gene clusters.
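De novo assemblers such as those named above work on k-mers, the overlapping substrings of length k found in each read, which is why the choice of k matters. The sketch below only illustrates what a k-mer is; it is not how Trinity or Oases are implemented, and the read is a toy sequence.

```python
from collections import Counter


def kmers(read, k):
    """Return all overlapping substrings of length k, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def kmer_counts(reads, k):
    """Count every k-mer across a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


if __name__ == "__main__":
    toy_read = "ATGCATGC"
    for k in (3, 5):
        # Larger k gives fewer, more specific k-mers per read.
        print(k, kmers(toy_read, k))
```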

Citation: Rupp O, Becker J, Brinkrolf K, Timmermann C, Borth N, Pühler A, et al. (2014) Construction of a Public CHO Cell Line Transcript Database Using Versatile Bioinformatics Analysis Pipelines. PLoS ONE 9(1): e85568.

Editor: Christophe Antoniewski, CNRS UMR7622 & University Paris 6 Pierre-et-Marie-Curie, France

Received: October 1, 2013 Accepted: December 3, 2013 Published: January 10, 2014

Copyright: © 2014 Rupp et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: The project is co-funded by the European Union (European Regional Development Fund - Investing in your future) and the German federal state North Rhine-Westphalia (NRW). JB acknowledges the receipt of a scholarship from the CLIB Graduate Cluster Industrial Biotechnology. CT is funded by Ziel2.NRW, the European Regional Development Fund and Ministerium für Innovation, Wissenschaft und Forschung des Landes Nordrhein-Westfalen (MIWF). NB acknowledges funding by ACIB, a COMET K2 center of the Austrian FFG. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

MSACL 2017 US Abstract

Tao Huan (Presenter)
The Scripps Research Institute

Bio: I am a Research Associate in Gary Siuzdak’s lab at the Center for Metabolomics and Mass Spectrometry in The Scripps Research Institute (La Jolla, CA). My research interests focus on the development and application of mass spectrometry based technologies for metabolomics. One important aspect of my research is to invent new bioinformatic tools to provide convenient metabolomic data processing and multi-omics integration. Before I joined the Siuzdak lab, I received my Ph.D. degree in Analytical Chemistry from University of Alberta under the supervision of Dr. Liang Li and my thesis topic was chemical isotope labeling LC-MS based metabolomics.

Authorship: Tao Huan, Duane Rinehart, H Paul Benton, Erica Forsberg, Jose Rafael Montenegro Burke, Mingliang Fang, Aries Aisporna, and Gary Siuzdak
The Scripps Research Institute, La Jolla, CA

Over the last 15 years, metabolomics has emerged as a powerful technology to interrogate cellular biochemistry, perform diagnostic testing, and characterize biochemical mechanisms of disease. Owing to innovative developments in informatics, analytical technologies and integration of orthogonal biological approaches, it is now possible to expand metabolomic analyses into understanding the systems-level effects of metabolites. In this work, we incorporated systems level technologies into XCMS, a widely used metabolomic platform, to gain insight into the mechanisms of disease progression in clinical applications. Our platform allows users to directly map metabolomic data onto metabolic pathways in “one-click” and carry out multi-omic integration with self uploaded and/or database archived epigenome, genetic variations, genome, transcriptome, and proteome data in a user-friendly approach.

While the success of metabolomics has been driven by mass spectrometry and NMR analytical advances, equally important have been developments in bioinformatic resources for data processing. For example, the widely used metabolomic software XCMS Online(1), developed by our lab, has been a cornerstone of the field and is used by thousands of investigators worldwide. Currently, XCMS Online has over 13,000 registered users in 180+ countries and its user base grows daily. These statistics reflect the rapid growth of the metabolomic field and our commitment to developing easy-to-use, intuitive analytic tools for analyzing comprehensive metabolomic data.

In this work, we further extend the capacity of the XCMS Online platform so that it can execute multi-omic integrative analysis. To achieve this goal, we first implemented a metabolic pathway prediction algorithm to allow the direct mapping of metabolomic data onto metabolic pathways prior to the time-consuming metabolite identification. We then incorporated transcriptomic and proteomic databases to allow automatic integrative analysis of the dysregulated metabolic pathways confirmed by the metabolomic results. Further, we constructed libraries covering the epigenome (DNA methylation) and genetic variations (single nucleotide polymorphisms (SNPs) and trait-associated SNPs) within XCMS Online, which allow users to find the association of these gene regulation elements with each specific gene, by pathway, in an interactive format linked to the analysis results in XCMS Online. To demonstrate its performance, we applied this systems biology platform to a colon cancer study to understand how genetic regulation influences the progression of colon cancer and cancer metabolism.

Species-specific pathway information was archived with pathways and genes from Biocyc, proteins from Uniprot, and metabolites from KEGG and METLIN. Over 7600 species are provided in the platform, including human, mouse, and yeast.

With respect to epigenomic data, we have archived DNA methylation data for 26 cancer types from The Cancer Genome Atlas (TCGA) and for human aging from publicly available datasets in the Gene Expression Omnibus (GEO) at NCBI. We are also actively adding DNA methylation data for other common diseases (diabetes, Alzheimer's disease, etc.) and phenotypes (such as drug resistance or addiction) through active searching and user requests.

SNP data were acquired to include all known SNPs in both human and mouse, downloaded via UCSC Genome. The current version of the SNP database contains >120 million entries for human and >81 million entries for mouse. In addition, trait-associated SNP data from Genome Wide Association Studies (GWAS), obtained from NCBI, were included in a separate category. As with the DNA methylation data, we are actively adding these SNPs from the NCBI data repository.

To perform pathway analysis, a metabolic pathway enrichment analysis algorithm, mummichog(2), was modified and implemented in XCMS Online. This tool operates directly on the resulting XCMS feature table to reveal the biological relevance of dysregulated metabolites in the form of metabolic networks and pathways. Further, to perform multi-omic integration, users can upload a list of differentially expressed genes and proteins. The multi-omic analysis tool then performs gene and/or protein matching to map the overlapping genes and/or proteins from the user-uploaded data onto the pathways previously predicted by the metabolic pathway analysis.
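The gene and protein matching step described above is, at its core, a set overlap between an uploaded list and the members of each predicted pathway. The following is a minimal sketch of that idea only; it is not the XCMS Online implementation, and the pathway and gene names are hypothetical placeholders.

```python
def map_to_pathways(pathway_genes, user_genes):
    """Return {pathway: sorted overlapping genes} for pathways with any hit.

    Matching is case-insensitive on the gene symbol.
    """
    uploaded = {gene.upper() for gene in user_genes}
    hits = {}
    for pathway, genes in pathway_genes.items():
        overlap = sorted(uploaded & {gene.upper() for gene in genes})
        if overlap:
            hits[pathway] = overlap
    return hits


if __name__ == "__main__":
    # Hypothetical pathways predicted from the metabolomic data.
    predicted = {
        "pathway_A": ["GENE1", "GENE2", "GENE3"],
        "pathway_B": ["GENE4"],
    }
    # Hypothetical user-uploaded list of differentially expressed genes.
    uploaded = ["gene2", "GENE3", "GENE9"]
    print(map_to_pathways(predicted, uploaded))
```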

Both the epigenome and genetic variations play important roles in gene regulation and significantly influence downstream metabolic pathways. We therefore implemented our epigenetic and genetic variation databases in our multi-omic analysis platform to allow the association of these gene regulation elements with each specific dysregulated gene from user-uploaded data. We further augmented our data visualization tools to graphically display the quantity and identity of these results by pathway in an interactive format linked to the analysis results in XCMS Online. This systems-level integration also provides hyperlinks to additional detailed information about each DNA methylation site and SNP, providing a comprehensive multi-omic analysis.

Paired colon tissue samples (tumor vs. normal) from 60 colon cancer patients were received and stored in a -80 °C freezer. Detailed clinical information is also available for patients and tumors (size, metastases, locations). After metabolites were extracted from tissue samples with organic solvents, comprehensive metabolomic data were acquired using HPLC-MS in ESI positive mode and HILIC-MS in ESI negative mode. Metabolomic data were processed in XCMS Online. Comprehensive transcriptomic and proteomic data were downloaded from The Cancer Genome Atlas (TCGA) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC), respectively. Multi-omic integrative analysis was performed in XCMS Online.

Traditionally, in a metabolomics study, significant metabolites are selected from the entire metabolomic dataset using subjectively defined fold change, p-value, and signal intensity thresholds, followed by manual identity confirmation. The pathways in which the dysregulated metabolites are involved are then determined and compared with differentially expressed genes and proteins using either bioinformatic tools or manual examination, which is overall a tedious and time-consuming process. In our strategy, we developed a one-step approach within XCMS Online to conveniently link metabolomic data directly to their biological context in the form of metabolic pathways. Further, integrative analysis of these metabolic pathways is achieved by correlating the details in metabolic pathways with epigenetic, genetic variation, transcriptomic and proteomic data to decipher the metabolic network at the systems level.

We demonstrated this platform using a colon cancer study to examine the metabolic differences between patient-derived samples of colon cancer and normal tissues (paired analyses with n=30). Over 7,000 metabolic features were detected (XCMS Online public job ID# 1100254) and among them, 10% had statistical significance with p-values less than 0.01. These features were then used to predict associated metabolic pathways with the mummichog algorithm. Comprehensive RNAseq transcriptomic and shotgun proteomic data were acquired from The Cancer Genome Atlas (TCGA) and Clinical Proteomic Tumor Analysis Consortium (CPTAC) on separate samples (n=44). In total, over 10,000 significantly differentiated mRNAs (fold change ≥ 1.2, p-value ≤ 0.01) and over 2,500 statistically significant proteins (fold change ≥ 1.2, p-value ≤ 0.01) were used to correlate genes and proteins with metabolites. In total, ten metabolic pathways were identified with statistical significance (p-value ≤ 0.01). Among them, five of the pathways have been previously implicated in the progression of cancer. Specifically, we noticed that the spermine and spermidine degradation pathway was dysregulated not only in metabolite concentrations, but also at the gene expression and protein synthesis levels. This demonstrates the power of performing integrative analysis with real clinical samples, which allows us to have a systems-level view of cancer metabolism. More importantly, integrative analysis with colon cancer specific epigenetic and genetic variation data archived in XCMS Online reveals several important DNA methylation and colon cancer associated SNP sites that have not been reported before. The detailed study of their biological and clinical importance is still ongoing.
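The thresholds quoted above (fold change ≥ 1.2, p-value ≤ 0.01) amount to a simple filter over a results table. A minimal sketch follows, with one loudly stated assumption: the fold change is treated symmetrically, so a 1.2-fold decrease counts as well as a 1.2-fold increase. The abstract does not say whether that is how the cutoff was applied, and the feature names and values are invented.

```python
def is_significant(fold_change, p_value, min_fold=1.2, max_p=0.01):
    """Apply a fold-change and p-value cutoff to one feature.

    Assumption: fold_change is a positive tumor/normal ratio, and both
    increases and decreases of at least `min_fold` qualify.
    """
    if fold_change <= 0:
        raise ValueError("fold_change must be a positive ratio")
    magnitude = max(fold_change, 1.0 / fold_change)
    return magnitude >= min_fold and p_value <= max_p


def filter_features(rows):
    """Keep the names of (name, fold_change, p_value) rows that pass."""
    return [name for name, fc, p in rows if is_significant(fc, p)]


if __name__ == "__main__":
    toy_rows = [
        ("feature_1", 1.5, 0.001),  # up, significant
        ("feature_2", 0.5, 0.005),  # down two-fold, significant
        ("feature_3", 1.1, 0.001),  # change too small
        ("feature_4", 2.0, 0.05),   # p-value too large
    ]
    print(filter_features(toy_rows))
```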

In this study, we developed a metabolomics-guided systems biology platform and implemented it within the XCMS interactive interfaces to address the need for new bioinformatics developments in pathway mapping and integrative multi-omic analysis for clinical applications. This interface streamlines interpretation of metabolomic data to provide results that can immediately be put into biological context. The platform is designed as a free cloud-based resource and is readily usable by the online community, which now includes over 13,000 registered researchers. In the meantime, this systems biology platform is being tested in an ongoing colon cancer research project that attempts to address the biological function of genetic regulation in colon cancer progression. This application allows us to systematically understand how cancer progression and cancer metabolism are affected by genetic regulation factors, such as DNA methylation and SNPs occurring in gene promoter regions.

References & Acknowledgements:

1. R. Tautenhahn, G. J. Patti, D. Rinehart, G. Siuzdak, XCMS Online: a web-based platform to process untargeted metabolomic data. Analytical chemistry 84, 5035-5039 (2012).

2. S. Li et al., Predicting network activity from high throughput metabolomics. PLoS Comput. Biol 9, e1003123 (2013).


Investigating how lung cancer develops from a non-fatal subtype, such as AIS, to the invasive stage provides insight into the mechanisms responsible for the deterioration of the disease. We combined the two independent datasets to infer invasive-specific subnetworks. Gene expression alteration patterns tend to be more robust than somatic mutations across patient groups. Almost 98% of DEGs were the same in GSE52248 and TCGA LUAD patients. However, the putative somatic driver genes overlap by only about 13.4%, reflecting the high genetic heterogeneity of the disease. Two genes, TRIM9 and CYP4F3, have opposite expression patterns in the two datasets, which may be explained by diverse isoform expression patterns, as seen for HNF4A. Karthikeyani Chellappa et al. found that the diverse isoforms of HNF4A, especially P2-HNF4α, showed different expression patterns in various tissue samples [19]. As a tumor suppressor, HNF4A is usually downregulated in tumor samples. Interestingly, this gene was over-expressed in invasive lung tumor samples relative to normal samples in both the GSE52248 and TCGA data.

The size of the GA chromosome affects the optimal solution that the algorithm is able to find. Here, the size of the chromosome equals the number of candidate genes that directly or indirectly interact with the seed genes. The maximum search distance from the seed gene was three for our subnetwork construction. At the outermost layer of the subnetworks, the total number of candidate genes often reached 18,000, which covers the majority of human protein-coding genes (23,000). Compared to the greedy algorithm, GA can identify globally optimal subnetworks associated with the disease. The fitness function is an important factor in the GA search. Here, we used mutual information to calculate the fitness score, which was estimated using discrete expression bins derived from continuous expression values. When the sample size is small, the number of final subnetworks can increase rapidly, with less stability. Thus, for a small sample size, GA-based network construction may need a different fitness function to guide the search. In general, we found that a larger sample size could lead to more stable optimal gene groups.
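A mutual-information score over binned expression values can be sketched as follows. This is an illustration of the general technique, not the authors' fitness function: equal-width binning, the number of bins, and scoring an expression profile against sample labels are all assumptions made for the example, and the data are invented.

```python
import math
from collections import Counter


def discretize(values, n_bins=3):
    """Map continuous values to equal-width bin indices 0 .. n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(xs, ys):
    """Mutual information, in bits, between two discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    total = 0.0
    for (x, y), count in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with counts.
        total += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return total


if __name__ == "__main__":
    expression = [1.0, 2.0, 9.0, 10.0]  # toy expression values
    labels = [0, 0, 1, 1]               # toy normal/invasive labels
    bins = discretize(expression, n_bins=2)
    print(bins, mutual_information(bins, labels))
```

With few samples, many bins end up nearly empty and the estimate becomes noisy, which is consistent with the instability for small sample sizes noted above.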

The correlation of genome size and DNA methylation rate in metazoans

Total DNA methylation rates are well known to vary widely between different metazoans. The phylogenetic distribution of this variation, however, has not been investigated systematically. Here we combine publicly available data on methylcytosine content with an analysis of the nucleotide composition of genomes and transcriptomes of 78 metazoan species to trace the evolution of the abundance and distribution of DNA methylation. The depletion of CpG and the associated enrichment of TpG and CpA dinucleotides are used to infer the intensity and localization of germline CpG methylation and to estimate its evolutionary dynamics. We observe a positive correlation of the relative methylation of CpG motifs with genome size. We tested this trend successfully by measuring total DNA methylation with LC/MS in orthopteran insects with very different genome sizes: house crickets, migratory locusts and meadow grasshoppers. We hypothesize that the observed correlation between methylation rate and genome size is due to a dependence of both variables on long-term effective population size and is driven by the accumulation of repetitive sequences that are typically methylated during periods of small population sizes. This process may result in generally methylated, large genomes such as those of jawed vertebrates. In this case, the emergence of a novel demethylation pathway and of novel reader proteins for methylcytosine may have enabled the usage of cytosine methylation for promoter-based gene regulation. On the other hand, persistently large populations may lead to a compression of the genome and to the loss of the DNA methylation machinery, as observed, e.g., in nematodes.
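CpG depletion is commonly summarized as an observed/expected ratio: the number of CG dinucleotides divided by the number expected from the C and G frequencies alone. The sketch below uses one widely used form of that ratio, (CpG count × length) / (C count × G count); whether the study used exactly this normalization is an assumption, and the sequences are toy inputs.

```python
def cpg_observed_expected(seq):
    """Return the CpG observed/expected ratio, or None if undefined.

    Values well below 1 indicate CpG depletion; the ratio is undefined
    when the sequence contains no C or no G.
    """
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return None
    cpg_count = seq.count("CG")  # CG cannot overlap itself
    return cpg_count * len(seq) / (c_count * g_count)


if __name__ == "__main__":
    for toy in ("CGCG", "CCGG", "CATGCATG"):
        print(toy, cpg_observed_expected(toy))
```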

Genome Analysis Of 426 Africans Finds Over 3 Million New Variants

An international group of researchers carried out DNA sequencing analyses of 426 individuals to explore genomic diversity across Africa.

In a new Nature study, an international group of researchers sequenced and analyzed DNA from 426 individuals across Africa, finding more than three million previously undescribed variants (i.e. changes in the genome).

DNA is made up of four bases: adenine (A), cytosine (C), guanine (G) and thymine (T). These four bases make up the genes in our DNA, similar to how letters in the alphabet form words and collectively build sentences.

When it comes to exploring DNA, scientists have many different tools available at their fingertips. For example, they can sequence specific genes through gene panels, examine genes which code for proteins through whole exome sequencing, or analyze all the regions within our DNA using whole genome sequencing. In each case, scientists will map the sequenced DNA to a reference genome in order to identify any differences in base pairs, referred to as variants. These variants may play a role in causing diseases or may have no known effect.

However, the reference genome isn’t representative of the diversity seen across the world. In addition, in the past, there have been few large-scale genomics studies in Africa in part due to limited research infrastructure. To address some of these issues, the H3Africa (Human, Heredity and Health in Africa) Initiative was launched to facilitate research and build capacity with the “goal of improving the health of African populations.”

In this study, the researchers analyzed genome sequencing data from 426 individuals across Africa, who were recruited from the H3Africa Consortium, the Southern African Human Genome Programme and the Trypanosomiasis Genomics Network of the H3Africa Consortium (TrypanoGEN). I spoke to two co-authors, Zané Lombard and Adebowale A. Adeyemo, to learn more about this study.

“This [study] was really born out of a need to have more African reference genome data,” says Lombard, who is a principal medical scientist at the National Health Laboratory Service’s Division of Human Genetics, and an associate professor at the Faculty of Health Sciences’ School of Pathology at the University of the Witwatersrand. “We are both part of the H3Africa Consortium that is funded by the NIH and the Wellcome Trust. Most of the studies, if not all, need some kind of reference genomic data because we’re all working in African populations looking at diseases and traits that are pertinent to African populations. So it really was from that point of view, that it would be good to add additional whole-genome sequencing data just to the public domain.”

“When we talk about certain parts of the world, like Africa, being under-represented in genomics studies, we really don’t appreciate how really bad it is,” says Adeyemo, who is a physician-scientist and also serves as the Deputy Director at the NIH National Human Genome Research Institute’s (NHGRI) Center for Research on Genomics and Global Health. “If you look at genome-wide association studies, which are considered [to be] one of the most common higher quality genetic studies that you can do, only a tiny fraction are actually done in non-Europeans, and most of those are done in East Asians. When you even talk about African ancestry, most are African-Americans, not Africans in Africa. [...] so really, we’re trying to break these gaps so that we have better data, and also, better tools.”

Adeyemo also adds that “Africa is large — it’s the home of mankind. It has over 2,000 ethnolinguistic groups. Only a few have made it into any large-scale sequencing projects.”

In this study, the researchers chose to focus only on single nucleotide variants (i.e. changes in the DNA that differ by a single base pair from the reference genome). They conducted a number of analyses, including identifying variants with clinical relevance to African populations.
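A single nucleotide variant is simply a position where the sample's base differs from the reference. The toy sketch below assumes the two sequences are already aligned and of equal length, which real variant calling from sequencing reads does not get for free; the function name and sequences are illustrative only.

```python
def single_nucleotide_variants(reference, sample):
    """Return (position, ref_base, alt_base) for each mismatch.

    Positions are 1-based. Both sequences must already be aligned and
    have the same length; insertions and deletions are not handled.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    variants = []
    for index, (ref_base, alt_base) in enumerate(zip(reference.upper(), sample.upper())):
        if ref_base != alt_base:
            variants.append((index + 1, ref_base, alt_base))
    return variants


if __name__ == "__main__":
    toy_reference = "ACGTAC"
    toy_sample = "ACGAAC"
    print(single_nucleotide_variants(toy_reference, toy_sample))
```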

“Even by adding just a couple of hundred additional genomes from populations that hadn’t been studied before, we discovered more than three million new genomic variants that hadn’t been described previously,” says Lombard. “We were able to use some of these variants to then look at things like migration across the African continent [...] One of the things that we could see is that there most probably was some migration route taken through Angola, across Zambia, and then out to the rest of eastern and southern Africa.”

In addition, the researchers found that damaging (i.e. “pathogenic”) variants impacting genes considered to be medically relevant were uncommon in the 426 sequenced individuals, but that instead, variants categorized as “likely pathogenic” were commonly observed in genes not considered medically relevant.

“A critical point here is that if a variant is really deleterious, and is really bad in terms of survival, then you expect it not to be common. In other words, it would be rare and not common at all because most people who carry that variant will die before they have children,” says Adeyemo. “What our data showed is that some variants that are supposedly very bad were really quite common in African populations, which suggests that they really cannot be that bad, and cannot be deleterious to survival if they are that common. In genetics-speak, you say that you improve your yield because you were able to show that most likely, some of these variants were misclassified and ought to be classified differently.”

Findings from this study were initially shared by co-author Neil Hanchard, an assistant professor at the Baylor College of Medicine, at the opening plenary session at the 2019 American Society of Human Genetics’ (ASHG) Annual Meeting. The publication of these findings in Nature this week was accompanied by an editorial calling for more funding from national and regional sources across Africa to support such research.

Looking forward, Lombard and Adeyemo say that this study is the first of such analyses, and that there are ongoing efforts to analyze different types of genetic variation in African populations, including structural variants and repeat sequences.

“We’re very proud that this study was very much driven, and performed, on the African continent, and with our partners in the US and the UK. We have African researchers from more than 24 different institutions across Africa that participated in this study,” says Lombard. “I think it’s really a large feat to see this kind of large-scale study driven [by] the African continent.”

Marine Metagenomics Portal (MMP) publicly available

On June 26th, the Research Council of Norway announced its continued funding to ELIXIR Norway – and 18 other infrastructure funding proposals submitted in October 2016. ELIXIR Norway is led by Professor Inge Jonassen, head of the Computational Biology Unit at the University of Bergen, and includes the University of Oslo, NTNU, NMBU and the University of Tromsø as partners.

The funding is for the period 2017-2022 and will allow us to continue and strengthen our infrastructure solutions – including NeLS, our national help desk and training activities, and not least our role as a Node in the European ELIXIR Infrastructure, says Jonassen.

Read more about all 19 funded infrastructure projects here, and the UiB projects here (NB Norwegian).

The marine databases MarRef, MarDb, and MarCat were presented at the ELIXIR All-Hands meeting in Rome on 21 March 2017 and are now publicly available. This project is an international deliverable in the ELIXIR-EXCELERATE project.

The marine resources, which have been implemented in the Marine Metagenomics Portal (MMP), are a collection of richly annotated and manually curated contextual (metadata) and sequence databases representing three tiers of accuracy.

MarRef is a reference database of completely sequenced marine prokaryotic genomes, whereas MarDb includes all sequenced marine prokaryotic genomes regardless of level of completeness. MarCat represents a gene (protein) catalogue of uncultivable (and cultivable) marine genes and proteins derived from metagenomics samples.

The first versions of MarRef and MarDb contain 484 and 2557 entries, respectively. Each record is built up of 104 metadata fields including attributes for sampling, sequencing, assembly and annotation in addition to organism and taxonomic information.

The contextual and sequence Mar databases are available at

ELIXIR Finland and ELIXIR Estonia held a successful two-day ELIXIR Innovation and SME Forum in Helsinki on 27-28 March 2017.

The two-day event presented some of the bioinformatics resources in Genomics and Health available through ELIXIR and showcased several companies that are already using public data resources in their business.

Fifty attendees from companies, academic institutions and ELIXIR partners had a chance to hear introductions to ELIXIR activities in Europe, particularly in Finland and Estonia. Talks were given by two SMEs, Blueprint Genetics (Finland) and Protobios (Estonia), both active users of public data. Representatives of ELIXIR Nodes and companies also showcased their technologies, services and products in a series of flash talks.

Scientists across the world can now discover and query data from genomics projects in six different countries in Europe. This has been a success of the ELIXIR collaboration with the Global Alliance for Genomics and Health (GA4GH) on the Beacon project which has been recently expanded and extended into 2017.

The Beacon Project is developing an open sharing platform that helps genomic data centres to make their data discoverable. Beacons allow researchers to query individual datasets to determine whether they contain a specific genetic variant of interest. For example, researchers can ask Beacons simple questions like, ‘Do your data resources have genomes with this allele at that position?’
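The kind of yes/no question a Beacon answers can be illustrated with a toy in-memory version. This sketch does not use or reproduce the GA4GH Beacon API; the class `ToyBeacon`, its method name and the coordinates are hypothetical and exist only to show the idea that the response reveals presence or absence and nothing else.

```python
class ToyBeacon:
    """In-memory stand-in for a Beacon: it answers only yes or no."""

    def __init__(self, observed_variants):
        # Each variant is (chromosome, 1-based position, alternate base).
        self._variants = {
            (str(chrom), int(pos), alt.upper())
            for chrom, pos, alt in observed_variants
        }

    def has_allele(self, chromosome, position, alternate_base):
        """Return True if any genome in the dataset carries this allele at this position."""
        key = (str(chromosome), int(position), alternate_base.upper())
        return key in self._variants


if __name__ == "__main__":
    # Made-up coordinates for illustration only.
    beacon = ToyBeacon([("17", 43045712, "T"), ("2", 1000, "G")])
    print(beacon.has_allele("17", 43045712, "T"))
    print(beacon.has_allele("17", 43045712, "C"))
```

Returning a bare boolean, rather than the matching records, is what lets a data centre make a dataset discoverable without exposing the underlying genomes.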

The first stage of the project (2015-2016) resulted in lighting Beacons in five ELIXIR Nodes (Sweden, Finland, France, Switzerland and Belgium) and in the European Genome-phenome Archive (EGA, a joint project of EMBL-EBI and the Center for Genomic Regulation in Barcelona, ELIXIR Spain). Another Beacon will soon be launched in ELIXIR Netherlands. Each ELIXIR Beacon makes one or more genomics datasets discoverable to the international research community.

Guidelines recently published by Google for the discovery of science datasets help data providers to describe their datasets in a structured way, enabling internet search engines to find and index rich metadata and to better present scientific datasets. The published guidelines draw on the metadata specifications for life-science datasets developed by BioSchemas. One of the early adopters of the specifications is the Omics Discovery Index (OmicsDI), which has been presented as a good-practice example in a recent Google Research Blog post. OmicsDI has been developed by EMBL-EBI and supported by BD2K, and is an active member of the BioSchemas community. It provides a dataset discovery service across a heterogeneous, distributed group of -omics data from eight repositories across the world.

BioSchemas is an open community initiative driven by ELIXIR to improve interoperability of life-science data. Building on and extending the markup, Bioschemas develops a collection of specifications that provide guidelines for describing metadata about life science information. Besides life science datasets, BioSchemas is working on specifications for samples, phenotypes, data repositories and protein sequences.

To support the work of Bioschemas, ELIXIR has recently launched the BioSchemas Implementation study. The main partners in the study are BBMRI, BD2K and FORCE11; however, it has the support of over 40 stakeholders. The BioSchemas group for life science datasets includes representatives from PDBe, UniProt, Pfam, DataMed and DATS, Repositive, OmicsDI, Intermine and Google.

Hungary has become the 21st Member to join ELIXIR, following the signature of the ELIXIR Consortium Agreement by Dr József Pálinkás, President of the National Research, Development and Innovation Office in Hungary.

The ELIXIR Node in Hungary is currently under development. It will be led by the MTA Research Centre for Natural Sciences and coordinated by Professor Laszlo Patthy of the Institute of Enzymology within the Research Centre for Natural Sciences of the Hungarian Academy of Sciences. The focus of the Hungarian Node will be on novel tools, services and databases in the field of protein sequence and structure investigation, DNA sequence analysis and translational medicine.

“Our membership in ELIXIR will help us sustain and safeguard our national investments in life sciences by linking our research community and resources to the ELIXIR infrastructure,” said Professor Laszlo Patthy. In turn, Europe and existing ELIXIR Nodes will benefit from Hungarian expertise and resources in systems and computational biology.

Dr Niklas Blomberg, ELIXIR Director, said: “In just three years since its launch in December 2013, ELIXIR membership has skyrocketed from the six founding members to the current 21. I am delighted to welcome our Hungarian colleagues to ELIXIR and look forward to our collaboration. Hungarian Membership of ELIXIR will open up opportunities for new collaborations and will benefit both the Hungarian as well as European life-science and bioinformatics community.”

In response to the European Union's public consultation on the interim evaluation of Horizon 2020, ELIXIR has published its Position Paper on Horizon 2020, the European Union's funding programme for research and development.

The Paper includes recommendations that could be adopted for the remainder of Horizon 2020, as well as several longer-term suggestions to consider for a successor programme.

The news posted on this page are reports on bioinformatics activities in general, and on the Norwegian bioinformatics platform in particular. If you have any suggestions for news that you think is relevant for users of this portal, please let us know.

Send news heading and text to:

Thank you for contributing to the Norwegian bioinformatics platform portal.


Our primary goal is to identify the putative circRNA-related genetic SNPs and INDELs at the genome level. In the current release, we did not explore structural variants, which may encompass multiple circRNAs and therefore make it difficult to evaluate the functional effect on any single circRNA among the hundreds affected. In sum, these pre-calculated genetic variants of circRNAs provide a comprehensive resource for discovering the commonality or uniqueness of genetic changes for all reported circRNAs. For example, previous studies have indicated that protein-coding genes are unevenly distributed across the 24 chromosomes, among which the densities of genes on chromosomes 1, 11 and 19 are particularly high [17]. Our circRNA distribution confirmed the high density of circRNAs on chromosome 19. Interestingly, we also found more clustered circRNAs on chromosome 17, which differs from the density of protein-coding genes.

The current version of circVAR contains: i) 93,708 annotated genetic variants with phenotype information from genome-wide association studies (GWAS data from GWASCatalog); ii) 1,858,343 well-classified genetic variants with clinical applications from the ClinVar database; iii) 2,597,987 somatic variants in cancer tissues from the COSMIC database; and iv) 26,361,367 common variants from the 1000 Genomes Project data. Our web interface also allows users to perform text queries and browse circRNAs based on their mapped genes and data sources. For advanced bioinformatics analysis, we have provided bulk downloadable files for all the circRNAs in the two most popular genomic coordinate systems (GRCh37 and GRCh38). In addition, over 30 Gb of genetic variant annotation files are provided for the majority of the circRNAs.

Although the extensive integration and mapping of circRNA variants provides a blueprint for general genetic features, more circRNA data are being generated from various tissues. Our goal is to incorporate more human circRNAs in the future by curating circRNAs from RNA-seq data. With the potential clinical and therapeutic applications of circRNAs, genetic diversity across human populations will become one of the keys to evaluating their risk. In addition, we may also conduct a more extensive meta-analysis of circRNA-related variants with clinical phenotypes, because the majority of GWAS hits map to non-coding regions such as lncRNAs or circRNAs.

Welcome to!

The official bioinformatics tool list for the Journal of Integrative Bioinformatics (JIB).

All bioinformatics tools published in JIB are automatically added to with first authors having the ability to edit their entries and directly import the tools information to

To better understand the dynamic behavior of metabolic networks in a wide variety of conditions, the field of Systems Biology has increased its interest in the use of kinetic models. The databases currently available do not contain enough data on this topic. Given that a significant part of the relevant information for the development of such models is still widely scattered across the literature, it becomes essential to develop specific and powerful text mining tools to collect these data. In this context, this work has as its main objective the development of a text mining tool to extract, from scientific literature, kinetic parameters, their respective values and their relations with enzymes and metabolites. The proposed approach integrates the development of a novel plug-in over the text mining framework @Note2. In the end, the pipeline developed was validated with a case study on Kluyveromyces lactis, spanning the analysis and results of 20 full text documents.

  • Alão Freitas A, Costa H, Rocha I. Extracting kinetic information from literature with KineticRE. J Integr Bioinform. 2015;12(4). doi: 10.2390/biecoll-jib-2015-282. PubMed: 26673933
  • Castellanos-Garzón JA, Díaz F. An evolutionary and visual framework for clustering of DNA microarray data. J Integr Bioinform. 2013;10(3):232. doi: 10.2390/biecoll-jib-2013-232. PubMed: 24231146

Desktop application
Sequence analysis
Maximum-likelihood methods based on models of codon substitution have been widely used to infer positively selected amino acid sites that are responsible for adaptive changes. Nevertheless, in order to use such an approach, software applications are required to align protein and DNA sequences, infer a phylogenetic tree and run the maximum-likelihood models. Therefore, a significant effort is made in order to prepare input files for the different software applications and in the analysis of the output of every analysis. In this paper we present the ADOPS (Automatic Detection Of Positively Selected Sites) software. It was developed with the goal of providing an automatic and flexible tool for detecting positively selected sites given a set of unaligned nucleotide sequence data. An example of the usefulness of such a pipeline is given by showing, under different conditions, positively selected amino acid sites in a set of 54 Coffea putative S-RNase sequences. ADOPS software is freely available and can be downloaded from

  • Reboiro-Jato D, Reboiro-Jato M, Fdez-Riverola F, Vieira CP, Fonseca NA, Vieira J. ADOPS--Automatic Detection Of Positively Selected Sites. J Integr Bioinform. 2012;9(3):200. doi: 10.2390/biecoll-jib-2012-200. PubMed: 22829571

In this demo paper, we sketch B-Fabric, an all-in-one solution for management of life sciences data. B-Fabric has two major purposes. First, it is a system for the integrated management of experimental data and scientific annotations. Second, it is a system infrastructure supporting on-the-fly coupling of user applications, and thus serving as an extensible platform for fast-paced, cutting-edge, collaborative research.

  • Türker C, Akal F, Schlapbach R. Life Sciences Data and Application Integration with B-Fabric. J Integr Bioinform. 2011;8(2). doi: 10.2390/biecoll-jib-2011-159. PubMed: 21772064

BacillOndex is an extension of the Ondex data integration system, providing a semantically annotated, integrated knowledge base for the model Gram-positive bacterium Bacillus subtilis. This application allows a user to mine a variety of B. subtilis data sources, and analyse the resulting integrated dataset, which contains data about genes, gene products and their interactions. The data can be analysed either manually, by browsing using Ondex, or computationally via a Web services interface. We describe the process of creating a BacillOndex instance, and describe the use of the system for the analysis of single nucleotide polymorphisms in B. subtilis Marburg. The Marburg strain is the progenitor of the widely-used laboratory strain B. subtilis 168. We identified 27 SNPs with predictable phenotypic effects, including genetic traits for known phenotypes. We conclude that BacillOndex is a valuable tool for the systems-level investigation of, and hypothesis generation about, this important biotechnology workhorse. Such understanding contributes to our ability to construct synthetic genetic circuits in this organism.

  • Misirli G, Wipat A, Mullen J, et al. BacillOndex: an integrated data resource for systems and synthetic biology. J Integr Bioinform. 2013;10(2):224. doi: 10.2390/biecoll-jib-2013-224. PubMed: 23571273

As high-throughput technologies become cheaper and easier to use, raw sequence data and corresponding annotations for many organisms are becoming available. However, sequence data alone is not sufficient to explain the biological behaviour of organisms, which arises largely from complex molecular interactions. There is a need to develop new platform technologies that can be applied to the investigation of whole-genome datasets in an efficient and cost-effective manner. One such approach is the transfer of existing knowledge from well-studied organisms to closely-related organisms. In this paper, we describe a system, BacillusRegNet, for the use of a model organism, Bacillus subtilis, to infer genome-wide regulatory networks in less well-studied close relatives. The putative transcription factors, their binding sequences and predicted promoter sequences along with annotations are available from the associated BacillusRegNet website (

  • Misirli G, Hallinan J, Röttger R, Baumbach J, Wipat A. BacillusRegNet: A transcriptional regulation database and analysis platform for Bacillus species. J Integr Bioinform. 2014;11(2). doi: 10.2390/biecoll-jib-2014-244. PubMed: 25001169
  • Carreiro AV, Anunciação O, Carriço JA, Madeira SC. Prognostic Prediction through Biclustering-Based Classification of Clinical Gene Expression Time Series. J Integr Bioinform. 2011;8(3). doi: 10.2390/biecoll-jib-2011-175. PubMed: 21926438

This paper presents a novel bioinformatics data warehouse software kit that integrates biological information from multiple public life science data sources into a local database management system. It stands out from other approaches by providing up-to-date integrated knowledge, platform and database independence as well as high usability and customization. This open source software can be used as a general infrastructure for integrative bioinformatics research and development. The advantages of the approach are realized by using a Java-based system architecture and object-relational mapping (ORM) technology. Finally, a practical application of the system is presented within the emerging area of medical bioinformatics to show the usefulness of the approach. The BioDWH data warehouse software is available for the scientific community at

  • Töpel T, Kormeier B, Klassen A, Hofestädt R. BioDWH: A Data Warehouse Kit for Life Science Data Integration. J Integr Bioinform. 2008;5(2). doi: 10.2390/biecoll-jib-2008-93. PubMed: 20134070

Command-line tool Workflow
Data integration and warehousing
As research projects require multiple data sources, mapping between these sources becomes necessary. Utilized workflow systems and integration tools therefore need to process large amounts of heterogeneous data formats, check for data source updates, and find suitable mapping methods to cross-reference entities from different databases. BioDWH2 is an open-source, graph-based data warehouse and mapping tool, capable of helping researchers with these issues. A workspace centered approach allows project-specific data source selections and Neo4j or GraphQL server tools enable quick access to the database for analysis. The BioDWH2 tools are available to the scientific community at

  • Friedrichs M. BioDWH2: an automated graph-based data warehouse and mapping tool. J Integr Bioinform. 2021. doi: 10.1515/jib-2020-0033. PubMed: 33618440

The study of microorganism consortia, also known as biofilms, is associated with a number of applications in biotechnology, ecotechnology and clinical domains. Nowadays, biofilm studies are heterogeneous and data-intensive, encompassing different levels of analysis. Computational modelling of biofilm studies has thus become a requirement to make sense of these vast and ever-expanding biofilm data volumes. The rationale of the present work is a machine-readable format for representing biofilm studies and supporting biofilm data interchange and data integration. This format is supported by the Biofilm Science Ontology (BSO), the first ontology on biofilms information. The ontology is decomposed into a number of areas of interest, namely: the Experimental Procedure Ontology (EPO), which describes biofilm experimental procedures; the Colony Morphology Ontology (CMO), which morphologically characterises microorganism colonies; and other modules concerning biofilm phenotype, antimicrobial susceptibility and virulence traits. The overall objective behind BSO is to develop semantic resources to capture, represent and share data on biofilms and related experiments in a regularized manner. Furthermore, the present work also introduces a framework in assistance of biofilm data interchange and analysis - BiofOmics ( - and a public repository on colony morphology signatures - MorphoCol (

  • Sousa AM, Ferreira A, Azevedo NF, Pereira MO, Lourenço A. Computational approaches to standard-compliant biofilm data for reliable analysis and integration. J Integr Bioinform. 2012;9(3). doi: 10.2390/biecoll-jib-2012-203. PubMed: 22829574
  • Loyek C, Bunkowski A, Vautz W, Nattkemper TW. Web2.0 paves new ways for collaborative and exploratory analysis of chemical compounds in spectrometry data. J Integr Bioinform. 2011;8(2):158. doi: 10.2390/biecoll-jib-2011-158. PubMed: 21768655

The speed and accuracy of new scientific discoveries - be it by humans or artificial intelligence - depends on the quality of the underlying data and on the technology to connect, search and share the data efficiently. In recent years, we have seen the rise of graph databases and semi-formal data models such as knowledge graphs to facilitate software approaches to scientific discovery. These approaches extend work based on formalised models, such as the Semantic Web. In this paper, we present our developments to connect, search and share data about genome-scale knowledge networks (GSKN). We have developed a simple application ontology based on OWL/RDF with mappings to standard schemas. We are employing the ontology to power data access services like resolvable URIs, SPARQL endpoints, JSON-LD web APIs and Neo4j-based knowledge graphs. We demonstrate how the proposed ontology and graph databases considerably improve search and access to interoperable and reusable biological knowledge (i.e. the FAIRness data principles).

  • Brandizi M, Singh A, Rawlings C, Hassani-Pak K. Towards FAIRer Biological Knowledge Networks Using a Hybrid Linked Data and Graph Database Approach. J Integr Bioinform. 2018;15(3). doi: 10.1515/jib-2018-0023. PubMed: 30085931

Given the great potential impact of the growing number of complete genome-scale metabolic network reconstructions of microorganisms, bioinformatics tools are needed to simplify and accelerate the course of knowledge in this field. One essential component of a genome-scale metabolic model is its biomass equation, whose maximization is one of the most common objective functions used in Flux Balance Analysis formulations. Some components of biomass, such as amino acids and nucleotides, can be estimated from genome information, providing reliable data without the need to perform lab experiments. In this work a Java tool is proposed that estimates microbial biomass composition in amino acids and nucleotides, from genome and transcriptomic information, using as input files sequences in FASTA format and files with transcriptomic data in CSV format. This application allows results to be obtained rapidly and is also a user-friendly tool for users with little or no background in informatics ( The results obtained using this tool are fairly close to experimental data, showing that the estimation of amino acid and nucleotide compositions from genome information and from transcriptomic data is a good alternative when no experimental data is available.

  • Santos S, Rocha I. Estimation of biomass composition from genomic and transcriptomic information. J Integr Bioinform. 2016;13(2):285. doi: 10.2390/biecoll-jib-2016-285. PubMed: 28187415
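The idea of estimating amino acid composition from FASTA sequences, optionally weighted by expression, can be sketched as follows. This is a Python illustration under stated assumptions, not the Java tool described in the abstract: the function names, the use of the full header line as the record name, and the `expression` mapping (protein name to abundance) are all hypothetical.

```python
from collections import Counter


def parse_fasta(text):
    """Parse FASTA text into {record name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip() or "unnamed"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def amino_acid_fractions(proteins, expression=None):
    """Return the fraction of each amino acid, weighting each protein by its expression if given."""
    totals = Counter()
    for name, sequence in proteins.items():
        # Without expression data every protein counts equally.
        weight = 1.0 if expression is None else expression.get(name, 0.0)
        residues = Counter(sequence.upper().replace("*", ""))  # drop stop symbols
        for residue, count in residues.items():
            totals[residue] += weight * count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {residue: totals[residue] / grand_total for residue in sorted(totals)}


if __name__ == "__main__":
    proteins = parse_fasta(">p1\nMKK\n>p2\nMA\n")
    print(amino_acid_fractions(proteins))
    print(amino_acid_fractions(proteins, expression={"p1": 1.0, "p2": 3.0}))
```

Weighting by expression matters because a highly expressed protein contributes far more residues to biomass than a rarely expressed one of the same length.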

While high-throughput technology, advanced techniques in biochemistry and molecular biology have become increasingly powerful, the coherent interpretation of experimental results in an integrative context is still a challenge. BioModelKit (BMK) approaches this challenge by offering an integrative and versatile framework for biomodel-engineering based on a modular modelling concept with the purpose: (i) to represent knowledge about molecular mechanisms by consistent executable sub-models (modules) given as Petri nets equipped with defined interfaces facilitating their reuse and recombination; (ii) to compose complex and integrative models from an ad hoc chosen set of modules including different omic and abstraction levels with the option to integrate spatial aspects; (iii) to promote the construction of alternative models by either the exchange of competing module versions or the algorithmic mutation of the composed model; and (iv) to offer concepts for (omic) data integration and integration of existing resources, and thus facilitate their reuse. BMK is accessible through a public web interface (, where users can interact with the modules stored in a database, and make use of the model composition features. BMK facilitates and encourages multi-scale model-driven predictions and hypotheses supporting experimental research in a multilateral exchange.

  • Blätke MA. BioModelKit - An Integrative Framework for Multi-Scale Biomodel-Engineering. J Integr Bioinform. 2018;15(3). doi: 10.1515/jib-2018-0021. PubMed: 30205646

The visualization of biological data has gained increasing importance in recent years. There is a large number of methods and software tools available that visualize biological data including the combination of measured experimental data and biological networks. With the growing size of networks, their handling and exploration become a challenging task for the user. In addition, scientists are interested not just in investigating a single kind of network, but in the combination of different types of networks, such as metabolic, gene regulatory and protein interaction networks. Therefore, fast access, abstract and dynamic views, and intuitive exploratory methods should be provided to search and extract information from the networks. This paper will introduce a conceptual framework for handling and combining multiple network sources that enables abstract viewing and exploration of large data sets including additional experimental data. It will introduce a three-tier structure that links network data to multiple network views, discuss a proof-of-concept implementation, and show a specific visualization method for combining metabolic and gene regulatory networks in an example.

  • Klapperstück M, Schreiber F. BioNetLink - An Architecture for Working with Network Data. J Integr Bioinform. 2014;11(2). doi: 10.2390/biecoll-jib-2014-241. PubMed: 24980619

BIOchemical PathwaY DataBase is developed as a manually curated, readily updatable, dynamic resource of human cell specific pathway information along with an integrated computational platform to perform various pathway analyses. Presently, it comprises 46 pathways, 3189 molecules, 5742 reactions and 6897 different types of diseases linked with pathway proteins, which are referenced by 520 literature sources and 17 other pathway databases. With its repertoire of biochemical pathway data, and computational tools for performing Topological, Logical and Dynamic analyses, BIOPYDB enables both experimental and computational biologists to acquire a comprehensive understanding of signaling cascades in cells. Automated pathway image reconstruction, cross referencing of pathway molecules and interactions with other databases and literature sources, complex search operations to extract information from other similar resources, integrated platform for pathway data sharing and computation, etc. are the novel and useful features included in this database to make it more acceptable and attractive to users in the pathway research community. The RESTful API service is also made available to advanced users and developers for accessing this database more conveniently through their own computer programmes.

  • Chowdhury S, Sinha N, Ganguli P, Bhowmick R, Singh V, Nandi S, Sarkar RR. BIOPYDB: A Dynamic Human Cell Specific Biochemical Pathway Database with Advanced Computational Analyses Platform. J Integr Bioinform. 2018;15(3). doi: 10.1515/jib-2017-0072. PubMed: 29547394

Metagenomics provides quantitative measurements for microbial species over time. To obtain a global overview of an experiment and to explore the full potential of a given dataset, intuitive and interactive visualization tools are needed. Therefore, we established BioSankey to visualize microbial species in microbiome studies over time as a Sankey diagram. These diagrams are embedded into a project-specific webpage which depends only on JavaScript and Google API to allow searches of interesting species without requiring a web server or connection to a database. BioSankey is a valuable tool to visualize different data elements from single or dual RNA-seq datasets and additionally enables a straightforward exchange of results among collaboration partners.

  • Platzer A, Polzin J, Rembart K, Han PP, Rauer D, Nussbaumer T. BioSankey: Visualization of Microbial Communities Over Time. J Integr Bioinform. 2018;15(4). doi: 10.1515/jib-2017-0063. PubMed: 29897884
  • Borowski K, Soh J, Sensen CW. Visual Comparison of Multiple Gene Expression Datasets in a Genomic Context. J Integr Bioinform. 2008;5(2). doi: 10.2390/biecoll-jib-2008-97. PubMed: 20134066

One of the major challenges in bioinformatics is to integrate and manage data from different sources as well as experimental microarray data and present them in a user-friendly format. Therefore, we present CardioVINEdb, a data warehouse approach developed to interact with and explore life science data. The data warehouse architecture provides a platform independent web interface that can be used with any common web browser. A monitor component controls and updates the data from the different sources to guarantee up-to-dateness. In addition, the system provides a "static" and "dynamic" visualization component for interactive graphical exploration of the data.

  • Kormeier B, Hippe K, Töpel T, Hofestädt R. CardioVINEdb: a data warehouse approach for integration of life science data in cardiovascular diseases. J Integr Bioinform. 2010;7(1):142. doi: 10.2390/biecoll-jib-2010-142. PubMed: 20585146
  • Guo D, Li X, Zhu P, Feng Y, Yang J, Zheng Z, Yang W, Zhang E, Zhou S, Wang H. Online High-throughput Mutagenesis Designer Using Scoring Matrix of Sequence-specific Endonucleases. J Integr Bioinform. 2015;12(1). doi: 10.1515/jib-2015-283. PubMed: 29220955

Desktop application
ChIP-seq Mapping Molecular interactions, pathways and networks Data architecture, analysis and design
The mapping of DNA-protein interactions is crucial for a full understanding of transcriptional regulation. Chromatin-immunoprecipitation followed by massively parallel sequencing (ChIP-seq) has become the standard technique for analyzing these interactions on a genome-wide scale. We have developed a software system called CASSys (ChIP-seq data Analysis Software System) spanning all steps of ChIP-seq data analysis. It supersedes the laborious application of several single command line tools. CASSys provides functionality ranging from quality assessment and -control of short reads, over the mapping of reads against a reference genome (readmapping) and the detection of enriched regions (peakdetection) to various follow-up analyses. The latter are accessible via a state-of-the-art web interface and can be performed interactively by the user. The follow-up analyses allow for flexible user defined association of putative interaction sites with genes, visualization of their genomic context with an integrated genome browser, the detection of putative binding motifs, the identification of over-represented Gene Ontology-terms, pathway analysis and the visualization of interaction networks. The system is client-server based, accessible via a web browser and does not require any software installation on the client side. To demonstrate CASSys's functionality we used the system for the complete data analysis of a publicly available Chip-seq study that investigated the role of the transcription factor estrogen receptor-α in breast cancer cells.

  • Alawi M, Kurtz S, Beckstette M. CASSys: an integrated software-system for the interactive analysis of ChIP-seq data. J Integr Bioinform. 2011;8(2):155. doi 10.2390/biecoll-jib-2011-155 PubMed 21690655

Command-line tool Desktop application
Computational biology
Using the lac operon as a paradigmatic example for a gene regulatory system in prokaryotes, we demonstrate how qualitative knowledge can be initially captured using simple discrete (Boolean) models and then stepwise refined to multivalued logical models and finally to continuous (ODE) models. At all stages, signal transduction and transcriptional regulation are integrated in the model description. We first show the potential benefit of a discrete binary approach and then discuss problems and limitations due to indeterminacy arising in cyclic networks. These limitations can be partially circumvented by using multilevel logic as a generalization of the Boolean framework, enabling one to formulate a more realistic model of the lac operon. Ultimately a dynamic description is needed to fully appreciate the potential dynamic behavior that can be induced by regulatory feedback loops. As a very promising method we show how the use of multivariate polynomial interpolation allows transformation of the logical network into a system of ordinary differential equations (ODEs), which then enables the analysis of key features of the dynamic behavior.
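As a rough illustration of the first, Boolean stage described above, the following Python sketch encodes a toy three-node feedback loop. It is not the model from the paper: the node names (`mrna`, `enzyme`, `lactose_in`), the update rules and the synchronous update scheme are all assumptions chosen for brevity.

```python
from itertools import product

def step(state, lactose_ext, glucose_ext):
    """One synchronous update of a toy Boolean lac operon (assumed rules)."""
    mrna, enzyme, lactose_in = state
    return (
        lactose_in and not glucose_ext,  # transcription: inducer present, no glucose
        mrna,                            # enzyme/permease follows mRNA
        lactose_ext and enzyme,          # uptake needs external lactose and permease
    )

def fixed_points(lactose_ext, glucose_ext):
    """Enumerate all 8 states and keep those that map onto themselves."""
    return [
        s for s in product([False, True], repeat=3)
        if step(s, lactose_ext, glucose_ext) == s
    ]

def trajectory(state, lactose_ext, glucose_ext, n_steps):
    """Return the list of states visited, starting with the initial state."""
    states = [state]
    for _ in range(n_steps):
        state = step(state, lactose_ext, glucose_ext)
        states.append(state)
    return states

if __name__ == "__main__":
    print(fixed_points(lactose_ext=True, glucose_ext=False))
    print(trajectory((True, False, False), True, False, 3))
```

With external lactose and no glucose, this toy network has two fixed points (all off and all on), while a state such as `(True, False, False)` returns to itself after three synchronous steps. That kind of oscillation depends on the update scheme, which is one way to picture the indeterminacy in cyclic networks that the abstract mentions.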

  • Franke R, Theis FJ, Klamt S. From Binary to Multivalued to Continuous Models: The lac Operon as a Case Study. J Integr Bioinform. 2010;7(1). doi 10.2390/biecoll-jib-2010-151 PubMed 21200084

With the advent of modern day high-throughput technologies, the bottleneck in biological discovery has shifted from the cost of doing experiments to that of analyzing results. clubber is our automated cluster-load balancing system developed for optimizing these "big data" analyses. Its plug-and-play framework encourages re-use of existing solutions for bioinformatics problems. clubber's goals are to reduce computation times and to facilitate use of cluster computing. The first goal is achieved by automating the balance of parallel submissions across available high performance computing (HPC) resources. Notably, the latter can be added on demand, including cloud-based resources, and/or featuring heterogeneous environments. The second goal of making HPCs user-friendly is facilitated by an interactive web interface and a RESTful API, allowing for job monitoring and result retrieval. We used clubber to speed up our pipeline for annotating molecular functionality of metagenomes. Here, we analyzed the Deepwater Horizon oil-spill study data to quantitatively show that the beach sands have not yet entirely recovered. Further, our analysis of the CAMI-challenge data revealed that microbiome taxonomic shifts do not necessarily correlate with functional shifts. These examples (21 metagenomes processed in 172 min) clearly illustrate the importance of clubber in the everyday computational biology environment.

  • Miller M, Zhu C, Bromberg Y. clubber: removing the bioinformatics bottleneck in big data analyses. J Integr Bioinform. 2017;14(2). doi 10.1515/jib-2017-0020 PubMed 28609295

Desktop application
Molecular interactions, pathways and networks Bioinformatics Cell biology Computer science Structural biology
Detailed investigation of socially important diseases with modern experimental methods has resulted in the generation of large volume of valuable data. However, analysis and interpretation of this data needs application of efficient computational techniques and systems biology approaches. In particular, the techniques allowing the reconstruction of associative networks of various biological objects and events can be useful. In this publication, the combination of different techniques to create such a network associated with an abstract cell environment is discussed in order to gain insights into the functional as well as spatial interrelationships. It is shown that experimentally gained knowledge enriched with data warehouse content and text mining data can be used for the reconstruction and localization of a cardiovascular disease developing network beginning with MUPP1/MPDZ (multi-PDZ domain protein).

  • Sommer B, Tiys ES, Kormeier B, et al. Visualization and Analysis of a Cardio Vascular Disease- and MUPP1-related Biological Network combining Text Mining and Data Warehouse Approaches. J Integr Bioinform. 2010;7(1). doi 10.2390/biecoll-jib-2010-148 PubMed 21068463
  • Kovanci G, Ghaffar M, Sommer B. Web-based hybrid-dimensional Visualization and Exploration of Cytological Localization Scenarios. J Integr Bioinform. 2016;13(4):47–58. doi 10.2390/biecoll-jib-2016-298 PubMed 28187414
  • Sommer B. The CELLmicrocosmos Tools: A Small History of Java-Based Cell and Membrane Modelling Open Source Software Development. J Integr Bioinform. 2019. doi 10.1515/jib-2019-0057 PubMed 31560649

The CELLmicrocosmos 4.2 PathwayIntegration (CmPI) is a tool which provides hybrid-dimensional visualization and analysis of intracellular protein and gene localizations in the context of a virtual 3D environment. This tool is developed based on Java/Java3D/JOGL and provides a standalone application compatible to all relevant operating systems. However, it requires Java and the local installation of the software. Here we present the prototype of an alternative web-based visualization approach, using Three.js and D3.js. In this way it is possible to visualize and explore CmPI-generated localization scenarios including networks mapped to 3D cell components by just providing a URL to a collaboration partner. This publication describes the integration of the different technologies – Three.js, D3.js and PHP – as well as an application case: a localization scenario of the citrate cycle. The CmPI web viewer is available at:

  • Kovanci G, Ghaffar M, Sommer B. Web-based hybrid-dimensional Visualization and Exploration of Cytological Localization Scenarios. J Integr Bioinform. 2016;13(4):47–58. doi 10.2390/biecoll-jib-2016-298 PubMed 28187414
  • Sommer B. The CELLmicrocosmos Tools: A Small History of Java-Based Cell and Membrane Modelling Open Source Software Development. J Integr Bioinform. 2019. doi 10.1515/jib-2019-0057 PubMed 31560649

Comparative analysis of biological networks is a major problem in computational integrative systems biology. By computing the maximum common edge subgraph between a set of networks, one is able to detect conserved substructures between them and quantify their topological similarity. To aid such analyses we have developed CytoMCS, a Cytoscape app for computing inexact solutions to the maximum common edge subgraph problem for two or more graphs. Our algorithm uses an iterative local search heuristic for computing conserved subgraphs, optimizing a squared edge conservation score that is able to detect not only fully conserved edges but also partially conserved edges. It can be applied to any set of directed or undirected, simple graphs loaded as networks into Cytoscape, e.g. protein-protein interaction networks or gene regulatory networks. CytoMCS is available as a Cytoscape app at

  • Larsen SJ, Baumbach J. CytoMCS: A Multiple Maximum Common Subgraph Detection Tool for Cytoscape. J Integr Bioinform. 2017;14(2). doi 10.1515/jib-2017-0014 PubMed 28731857

This work presents DaTo, a semi-automatically generated world atlas of biological databases and tools. It extracts raw information from all PubMed articles which contain exact URLs in their abstract section, followed by a manual curation of the abstract and the URL accessibility. DaTo features a user-friendly query interface, providing extensible URL-related annotations, such as the status, the location and the country of the URL. A graphical interaction network browser has also been integrated into the DaTo web interface to facilitate exploration of the relationship between different tools and databases with respect to their ontology-based semantic similarity. Using DaTo, the geographical locations, the health statuses, as well as the journal associations were evaluated with respect to the historical development of bioinformatics tools and databases over the last 20 years. We hope it will inspire the biological community to gain a systematic insight into bioinformatics resources. DaTo is accessible via

  • Li Q, Zhou Y, Jiao Y, et al. DaTo: an atlas of biological databases and tools. J Integr Bioinform. 2016;13(4):297. doi 10.2390/biecoll-jib-2016-297 PubMed 28187413
  • Mehlhorn H, Schreiber F. DBE2 - management of experimental data for the VANTED system. J Integr Bioinform. 2011;8(2):162. doi 10.2390/biecoll-jib-2011-162 PubMed 21788680

Web service
Sequence analysis
During the last years several new tools applicable to protein analysis have been made available on the IBIVU web site. Recently, a number of tools, ranging from multiple sequence alignment construction to domain prediction, have been updated and/or extended with services for programmatic access using SOAP. We provide an overview of these tools and their application.

  • Brandt BW, Heringa J. Protein analysis tools and services at IBIVU. J Integr Bioinform. 2011;8(2). doi 10.2390/biecoll-jib-2011-168 PubMed 21900709

Proteins and their interactions are essential for the functioning of all organisms and for understanding biological processes. Alternative splicing is an important molecular mechanism for increasing the protein diversity in eukaryotic cells. Splicing events that alter the protein structure and the domain composition can be responsible for the regulation of protein interactions and the functional diversity of different tissues. Discovering the occurrence of splicing events and studying protein isoforms have become feasible using Affymetrix Exon Arrays. Therefore, we have developed the versatile Cytoscape plugin DomainGraph that allows for the visual analysis of protein domain interaction networks and their integration with exon expression data. Protein domains affected by alternative splicing are highlighted and splicing patterns can be compared.

  • Emig D, Cline MS, Klein K, et al. Integrative visual analysis of the effects of alternative splicing on protein domain interaction networks. J Integr Bioinform. 2008;5(2). doi 10.2390/biecoll-jib-2008-101 PubMed 20134061

In this paper we present two case studies of Proteomics applications development using the AIBench framework, a Java desktop application framework mainly focused on scientific software development. The applications presented in this work are Decision Peptide-Driven, for rapid and accurate protein quantification, and Bacterial Identification, for Tuberculosis biomarker search and diagnosis. Both tools work with mass spectrometry data, specifically with MALDI-TOF spectra, minimizing the time required to process and analyze the experimental data.

  • López-Fernández H, Reboiro-Jato M, Glez-Peña D, et al. Rapid development of Proteomic applications with the AIBench framework. J Integr Bioinform. 2011;8(3):171. doi 10.2390/biecoll-jib-2011-171 PubMed 21926434

Expression efficiency is one of the major characteristics describing genes in various modern investigations. Expression efficiency of genes is regulated at various stages: transcription, translation, posttranslational protein modification and others. In this study, the EloE (Elongation Efficiency) web application is described. EloE sorts an organism's genes in descending order of their theoretical rate of the elongation stage of translation, based on the analysis of their nucleotide sequences. The theoretical data obtained correlate significantly with available experimental data on gene expression in various organisms. In addition, the program identifies preferential codons in the organism's genes and determines the distribution of potential secondary structure energy in the 5′ and 3′ regions of mRNA. EloE can be useful for a preliminary estimation of translation elongation efficiency for genes for which experimental data are not yet available. Some results can be used, for instance, in other programs modeling artificial genetic structures in genetic engineering experiments.

  • Sokolov V, Zuraev B, Lashin S, Matushkin Y. Web application for automatic prediction of gene translation elongation efficiency. J Integr Bioinform. 2015;12(1). doi 10.2390/biecoll-jib-2015-256 PubMed 26527190

The prevalence of comorbid diseases poses a major health issue for millions of people worldwide and an enormous socio-economic burden for society. The molecular mechanisms for the development of comorbidities need to be investigated. For this purpose, a workflow system was developed to aggregate data on biomedical entities from heterogeneous data sources. The process of integrating and merging all data sources of the workflow system was implemented as a semi-automatic pipeline that provides the import, fusion, and analysis of the highly connected biomedical data in a Neo4j database GenCoNet. As a starting point, data on the common comorbid diseases essential hypertension and bronchial asthma was integrated. GenCoNet is a curated database that provides a better understanding of hereditary bases of comorbidities.

  • Shoshi A, Hofestädt R, Zolotareva O, Friedrichs M, Maier A, Ivanisenko VA, Dosenko VE, Bragina EY. GenCoNet - A Graph Database for the Analysis of Comorbidities by Gene Networks. J Integr Bioinform. 2018;15(4). doi 10.1515/jib-2018-0049 PubMed 30864352

Script Library
Bioinformatics DNA Gene regulation
The interconversion of sequences that constitute the genome and the proteome is becoming increasingly important due to the generation of large amounts of DNA sequence data. Following mapping of DNA segments to the genome, one fundamentally important task is to find the amino acid sequences which are coded within a list of genomic sections. Conversely, given a series of protein segments, an important task is to find the genomic loci which code for a list of protein regions. To perform these tasks on a region by region basis is extremely laborious when a large number of regions are being studied. We have therefore implemented an R package geno2proteo which performs the two mapping tasks and subsequent sequence retrieval in a batch fashion. In order to make the tool more accessible to users, we have created a web interface of the R package which allows the users to perform the mapping tasks by going to the web page and using the web service.
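The genome-to-protein direction of this mapping can be pictured with a small Python sketch. It does not use or reproduce the geno2proteo R package: the function names, the toy genome and the exon coordinates are hypothetical, and the sketch assumes a forward-strand gene whose coding exons are given as 1-based inclusive genomic intervals.

```python
# Standard genetic code, built from the usual TCAG ordering of codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(cds):
    """Translate a coding DNA string codon by codon; '*' marks a stop codon."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))

def extract_cds(genome, exons):
    """Concatenate coding exons given as 1-based inclusive (start, end) pairs."""
    return "".join(genome[start - 1:end] for start, end in exons)

def genomic_to_residue(exons, pos):
    """Map a genomic position to a 1-based residue index, or None if non-coding."""
    offset = 0
    for start, end in exons:
        if start <= pos <= end:
            return (offset + pos - start) // 3 + 1
        offset += end - start + 1
    return None

if __name__ == "__main__":
    # Hypothetical toy genome: two coding exons separated by an intron of Ns.
    genome = "N" * 100 + "ATGG" + "N" * 96 + "CCTAA"
    exons = [(101, 104), (201, 205)]
    print(translate(extract_cds(genome, exons)))
    print(genomic_to_residue(exons, 203))
```

Because exon coordinates depend on the reference assembly, the same protein sequence can sit at different base-pair positions in different genome builds. A real tool also has to handle reverse-strand genes and annotation files, which this sketch leaves out.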

  • Li Y, Aguilar-Martinez E, Sharrocks AD. Geno2proteo, a Tool for Batch Retrieval of DNA and Protein Sequences from Any Genomic or Protein Regions. J Integr Bioinform. 2019. doi 10.1515/jib-2018-0090 PubMed 31301672

The need to process large quantities of data generated from genomic sequencing has resulted in a difficult task for life scientists who are not familiar with the use of command-line operations or developments in high performance computing and parallelization. This knowledge gap, along with unfamiliarity with necessary processes, can hinder the execution of data processing tasks. Furthermore, many of the commonly used bioinformatics tools for the scientific community are presented as isolated, unrelated entities that do not provide an integrated, guided, and assisted interaction with the scheduling facilities of computational resources or distribution, processing and mapping with runtime analysis. This paper presents the first approximation of a Web Services platform-based architecture (GITIRBio) that acts as a distributed front-end system for autonomous and assisted processing of parallel bioinformatics pipelines that has been validated using multiple sequences. Additionally, this platform allows integration with semantic repositories of genes for search annotations. GITIRBio is available at:

  • Castillo LF, López-Gartner G, Isaza GA, et al. GITIRBio: A Semantic and Distributed Service Oriented- Architecture for Bioinformatics Pipeline. J Integr Bioinform. 2015;12(1):1–15. doi 10.2390/biecoll-jib-2015-255 PubMed 26527189
  • Taha K, Elmasri R. GMB: an efficient query processor for biological data. J Integr Bioinform. 2011;8(2):165. doi 10.2390/biecoll-jib-2011-165 PubMed 21881166

Web application
Mapping Ontology and terminology Proteins Sequence analysis Sequencing
The functional annotation of genomic data has become a major task for the ever-growing number of sequencing projects. In order to address this challenge, we recently developed GOblet, a free web service for the annotation of anonymous sequences with Gene Ontology (GO) terms. However, to overcome limitations of the GO terminology, and to aid in understanding not only single components but also systemic interactions between the individual components, we have now extended the GOblet web service to integrate pathway annotations as well. Furthermore, we extended and upgraded the data analysis pipeline with improved summaries, and added term enrichment and clustering algorithms. Finally, we are now making GOblet available as a stand-alone application for high-throughput processing on local machines. The advantages of this frequently requested feature are that a) the user can avoid restrictions of our web service for uploading and processing large amounts of data, and that b) confidential data can be analysed without insecure transfer to a public web server. The stand-alone version of the web service has been implemented using platform-independent Tcl scripts, which can be run with just a single runtime file utilizing the Starkit technology. The GOblet web service and the stand-alone application are freely available at

  • Groth D, Hartmann S, Panopoulou G, Poustka AJ, Hennig S. GOblet: annotation of anonymous sequence data with gene ontology and pathway terms. J Integr Bioinform. 2008;5(2). doi 10.2390/biecoll-jib-2008-104 PubMed 20134064

Despite the large number of software tools developed to address different areas of microarray data analysis, very few offer an all-in-one solution with little learning curve. For microarray core labs, there are even fewer software packages available to help with their routine but critical tasks, such as data quality control (QC) and inventory management. We have developed a simple-to-use web portal to allow bench biologists to analyze and query complicated microarray data and related biological pathways without prior training. Both experiment-based and gene-based analysis can be easily performed, even for the first-time user, through the intuitive multi-layer design and interactive graphic links. While being friendly to inexperienced users, most parameters in Goober can be easily adjusted via drop-down menus to allow advanced users to tailor their needs and perform more complicated analysis. Moreover, we have integrated graphic pathway analysis into the website to help users examine microarray data within the relevant biological content. Goober also contains features that cover most of the common tasks in microarray core labs, such as real time array QC, data loading, array usage and inventory tracking. Overall, Goober is a complete microarray solution to help biologists instantly discover valuable information from a microarray experiment and enhance the quality and productivity of microarray core labs. The whole package is freely available at A demo web server is available at

  • Luo W, Gudipati M, Jung K, Chen M, Marschke KB. Goober: a fully integrated and user-friendly microarray data management and analysis solution for core labs and bench biologists. J Integr Bioinform. 2009;6(1):108. doi 10.2390/biecoll-jib-2009-108 PubMed 20134074

Detecting sources of bias in transcriptomic data is essential to determine signals of biological significance. We outline a novel method to detect sequence-specific bias in short read Next Generation Sequencing data. This is based on determining intra-exon correlations between specific motifs. This requires a mild assumption that short reads sampled from specific regions from the same exon will be correlated with each other. This has been implemented on Apache Spark and used to analyse two D. melanogaster eye-antennal disc data sets generated at the same laboratory. The wild type data set in Drosophila indicates a variation due to motif GC content that is more significant than that found due to exon GC content. The software is available online and could be applied for cross-experiment transcriptome data analysis in eukaryotes.

  • Alnasir J, Shanahan HP. A Novel Method to Detect Bias in Short Read NGS Data. J Integr Bioinform. 2017;14(3). doi 10.1515/jib-2017-0025 PubMed 28941355

This work presents a sophisticated information system, the Integrated Analysis Platform (IAP), an approach supporting large-scale image analysis for different species and imaging systems. In its current form, IAP supports the investigation of Maize, Barley and Arabidopsis plants based on images obtained in different spectra. Several components of the IAP system, which are described in this work, cover the complete end-to-end pipeline, starting with the image transfer from the imaging infrastructure, (grid distributed) image analysis, data management for raw data and analysis results, to the automated generation of experiment reports.

  • Klukas C, Pape JM, Entzian A. Analysis of high-throughput plant image data with the information system IAP. J Integr Bioinform. 2012;9(2):191. doi 10.2390/biecoll-jib-2012-191 PubMed 22745177
  • Lamurias A, Ferreira JD, Couto FM. Identifying interactions between chemical entities in biomedical text. J Integr Bioinform. 2014;11(3):247. doi 10.2390/biecoll-jib-2014-247 PubMed 25339081

Knowledge found in biomedical databases, in particular in Web information systems, is a major bioinformatics resource. In general, this biological knowledge is represented worldwide in a network of databases. These data are spread among thousands of databases, which overlap in content, but differ substantially with respect to content detail, interface, formats and data structure. To support a functional annotation of lab data, such as protein sequences, metabolites or DNA sequences, as well as a semi-automated data exploration in information retrieval environments, an integrated view of databases is essential. Search engines have the potential of assisting in data retrieval from these structured sources, but fall short of providing a comprehensive knowledge excerpt from the interlinked databases. A prerequisite of supporting the concept of an integrated data view is to acquire insights into cross-references among database entities. This is hampered by the fact that only a fraction of all possible cross-references are explicitly tagged in the particular biomedical information systems. In this work, we investigate to what extent an automated construction of an integrated data network is possible. We propose a method that predicts and extracts cross-references from multiple life science databases and possible referenced data targets. We study the retrieval quality of our method and report on first, promising results. The method is implemented as the tool IDPredictor, which is published under the DOI 10.5447/IPK/2012/4 and is freely available using the URL:

  • Mehlhorn H, Lange M, Scholz U, Schreiber F. IDPredictor: predict database links in biomedical database. J Integr Bioinform. 2012;9(2):1–15. doi 10.2390/biecoll-jib-2012-190 PubMed 22736059
  • Camacho R, Pereira M, Costa VS, et al. A relational learning approach to Structure-Activity Relationships in drug design toxicity studies. J Integr Bioinform. 2011;8(3):182. doi 10.2390/biecoll-jib-2011-182 PubMed 21926445

Metabolomics Data mining Data management
Over the last decade the evaluation of odors and vapors in human breath has gained more and more attention, particularly in the diagnostics of pulmonary diseases. Ion mobility spectrometry coupled with multi-capillary columns (MCC/IMS), is a well known technology for detecting volatile organic compounds (VOCs) in air. It is a comparatively inexpensive, non-invasive, high-throughput method, which is able to handle the moisture that comes with human exhaled air, and allows for characterizing of VOCs in very low concentrations. To identify discriminating compounds as biomarkers, it is necessary to have a clear understanding of the detailed composition of human breath. Therefore, in addition to the clinical studies, there is a need for a flexible and comprehensive centralized data repository, which is capable of gathering all kinds of related information. Moreover, there is a demand for automated data integration and semi-automated data analysis, in particular with regard to the rapid data accumulation, emerging from the high-throughput nature of the MCC/IMS technology. Here, we present a comprehensive database application and analysis platform, which combines metabolic maps with heterogeneous biomedical data in a well-structured manner. The design of the database is based on a hybrid of the entity-attribute-value (EAV) model and the EAV-CR, which incorporates the concepts of classes and relationships. Additionally it offers an intuitive user interface that provides easy and quick access to the platform’s functionality: automated data integration and integrity validation, versioning and roll-back strategy, data retrieval as well as semi-automatic data mining and machine learning capabilities. The platform will support MCC/IMS-based biomarker identification and validation. The software, schemata, data sets and further information is publicly available at
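The entity-attribute-value (EAV) idea mentioned above can be sketched in a few lines of Python using the standard `sqlite3` module. This is only an illustration of the general pattern, not the platform's schema: the table name `eav`, the column names and the sample rows are hypothetical, and the class and relationship layer of EAV/CR is not modelled.

```python
import sqlite3

def build_store(rows):
    """Create an in-memory EAV table and fill it with (entity, attribute, value) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE eav (entity TEXT, attribute TEXT, value TEXT)")
    conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", rows)
    return conn

def load_entity(conn, entity):
    """Collect all attribute/value pairs of one entity into a dictionary."""
    cur = conn.execute(
        "SELECT attribute, value FROM eav WHERE entity = ?", (entity,)
    )
    return dict(cur.fetchall())

if __name__ == "__main__":
    # Hypothetical measurements; new attributes need no schema change.
    rows = [
        ("sample_1", "patient_group", "control"),
        ("sample_1", "peak_count", "42"),
        ("sample_2", "patient_group", "case"),
    ]
    conn = build_store(rows)
    print(load_entity(conn, "sample_1"))
```

The appeal of this layout for heterogeneous data is that a new kind of measurement becomes a new row rather than a new column. The cost is that values lose their native types and queries need a pivot step such as `load_entity`.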

  • Schneider T, Hauschild A-C, Baumbach JI, Baumbach J. An Integrative Clinical Database and Diagnostics Platform for Biomarker Identification and Analysis in Ion Mobility Spectra of Human Exhaled Air. J Integr Bioinform. 2013;10(2). doi 10.2390/biecoll-jib-2013-218 PubMed 23545212

Coding sequences (CDS) continue to be discovered, and larger CDS are being revealed frequently. Approaches and related tools have been developed and upgraded concurrently, especially for phylogenetic tree analysis. This paper proposes an integrated automatic Taverna workflow for phylogenetic tree inference using public access web services at the European Bioinformatics Institute (EMBL-EBI) and the Swiss Institute of Bioinformatics (SIB), and our own deployed local web services. The workflow input is a set of CDS in the FASTA format. The workflow supports 1,000 to 20,000 bootstrap replicates. The workflow performs tree inference with the Parsimony (PARS), Distance Matrix - Neighbor Joining (DIST-NJ), and Maximum Likelihood (ML) algorithms of the EMBOSS PHYLIPNEW package, based on our proposed Multiple Sequence Alignment (MSA) similarity score. The local web services are implemented and deployed in two ways, using Soaplab2 and Apache Axis2. They are SOAP and Java Web Service (JWS) services providing WSDL endpoints to Taverna Workbench, a workflow manager. The workflow has been validated, its performance has been measured, and its results have been verified. Our workflow's execution time is less than ten minutes for inferring a tree with 10,000 bootstrap replicates. This new integrated automatic workflow will be beneficial to bioinformaticians with an intermediate level of knowledge and experience. All local services have been deployed at our portal
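The bootstrapping step mentioned in this abstract can be pictured with a short Python sketch. It shows the generic technique of resampling alignment columns with replacement, not the workflow's own services: the function name, the toy alignment and the seed are hypothetical, and tree inference on each replicate is left out.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping rows aligned.

    `alignment` maps sequence names to equal-length aligned strings.
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }

if __name__ == "__main__":
    # Hypothetical toy alignment of three short sequences.
    alignment = {"seq_a": "ACGT", "seq_b": "ACGA", "seq_c": "TCGA"}
    rng = random.Random(0)  # fixed seed so the run is repeatable
    replicates = [bootstrap_replicate(alignment, rng) for _ in range(3)]
    for rep in replicates:
        print(rep)
```

Each replicate has the same number of sequences and columns as the input. A tree is inferred per replicate, and the fraction of replicates that contain a given branch serves as its support value.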

  • Damkliang K, Tandayya P, Sangket U, Pasomsub E. Integrated Automatic Workflow for Phylogenetic Tree Analysis Using Public Access and Local Web Services. J Integr Bioinform. 2016;13(1):287. doi 10.2390/biecoll-jib-2016-287 PubMed 28187423

AstraZeneca’s Oncology in vivo data integration platform brings multidimensional data from animal model efficacy, pharmacokinetic and pharmacodynamic data to animal model profiling data and public in vivo studies. Using this platform, scientists can cluster model efficacy and model profiling data together, quickly identify responder profiles and correlate molecular characteristics to pharmacological response. Through meta-analysis, scientists can compare pharmacology between single and combination treatments, between different drug scheduling and administration routes.

  • Wei J, Chen M. Oncology In Vivo Data Integration for Hypothesis Generation. J Integr Bioinform. 2012;9(2). doi 10.2390/biecoll-jib-2012-193 PubMed 22773158

Genetic variance within the genotypes of a population, and its mapping to phenotypic variance in a systematic and high-throughput manner, is of interest for biodiversity and breeding research. Besides the established and efficient high-throughput genotyping technologies, phenotyping capabilities have received increased focus in the last decade. This results in an increasing amount of phenotype data from well-scaling, automated sensor platforms. Thus, data stewardship is a central component in making experimental data from multiple domains interoperable and re-usable. To ensure standard and comprehensive sharing of scientific and experimental data among domain experts, the FAIR data principles are utilized for machine readability and scalability. In this context, the BrAPI consortium provides comprehensive and commonly agreed FAIRed guidelines to offer BrAPI-layered scientific data in a RESTful manner. This paper presents the concepts, best practices and implementations to meet these challenges. As one of the world's leading plant research institutes, it is of vital interest for the IPK-Gatersleben to transform legacy data infrastructures into a bio-digital resource center for plant genetic resources (PGR). This paper also demonstrates the benefits of integrated database back-ends, established data stewardship processes, and FAIR data exposition through machine-readable, highly scalable programmatic interfaces.

  • Ghaffar M, Schüler D, König P, Arend D, Junker A, Scholz U, Lange M. Programmatic Access to FAIRified Digital Plant Genetic Resources. J Integr Bioinform. 2020;16(4). doi 10.1515/jib-2019-0060 PubMed 31913851

2.0 is a new approach to more closely embed the curation process in the publication process. This website hosts the tools, software applications, databases and workflow systems published in the Journal of Integrative Bioinformatics (JIB). As soon as a new tool-related publication is published in JIB, the tool is posted to and can afterwards be easily transferred to, a large information repository of software tools, databases and services for bioinformatics and the life sciences. In this way, an easily accessible list of tools is provided which were published in JIB, as well as status information regarding the underlying service. With newer registries like providing this information on a bigger scale, 2.0 closes the gap between journal publications and registry publication. (Reference:

  • Friedrichs M, Shoshi A, Chmura PJ, Ison J, Schwämmle V, Schreiber F, Hofestädt R, Sommer B. 2.0 - A Bioinformatics Registry for Journal Published Tools with Interoperability to J Integr Bioinform. 2020;16(4). doi 10.1515/jib-2019-0059 PubMed 31913853

Measuring differential DNA methylation is currently the most common approach to linking epigenetic modifications to diseases (in so-called epigenome-wide association studies, EWAS). Owing to their low cost, efficiency and easy handling, the Illumina HumanMethylation450 BeadChip and its successor, the Infinium MethylationEPIC BeadChip, are by far the most popular techniques for conducting EWAS in large patient cohorts. Despite the popularity of this chip technology, raw data processing and statistical analysis of the array data remain far from trivial and still lack dedicated software libraries enabling high-quality and statistically sound downstream analyses. As of yet, only R-based solutions are freely available for low-level processing of the Illumina chip data. However, the lack of alternative libraries poses a hurdle for the development of new bioinformatic tools, in particular when it comes to web services or applications where run time and memory consumption matter, or where EWAS data analysis is an integral part of a bigger framework or data analysis pipeline. We have therefore developed and implemented Jllumina, an open-source Java library for raw data manipulation of Illumina Infinium HumanMethylation450 and Infinium MethylationEPIC BeadChip data, supporting the developer with Java functions covering reading and preprocessing the raw data, down to statistical assessment, permutation tests, and identification of differentially methylated loci. Jllumina is fully parallelizable and publicly available at

  • Almeida D, Skov I, Lund J, et al. Jllumina - A comprehensive Java-based API for statistical Illumina Infinium HumanMethylation450 and MethylationEPIC data processing. J Integr Bioinform. 2016;13(4):294. doi 10.2390/biecoll-jib-2016-294 PubMed 28187410
  • Pürzer A, Grassmann F, Birzer D, Merkl R. Key2Ann: a tool to process sequence sets by replacing database identifiers with a human-readable annotation. J Integr Bioinform. 2011;8(1). doi 10.2390/biecoll-jib-2011-153 PubMed 21372341

Web application
Plant biology Genomics
Search engines and retrieval systems are popular tools on the life science desktop. Manually inspecting hundreds of database entries that reflect a life science concept or fact is time-intensive daily work. Here, it is not the number of query results that matters but their relevance. In this paper, we present the LAILAPS search engine for life science databases. The concept is to combine a novel feature model for relevance ranking, a machine learning approach to model user relevance profiles, ranking improvement by user feedback tracking, and an intuitive and slim web user interface that estimates relevance rank by tracking user interactions. Queries are formulated as simple keyword lists and are expanded by synonyms. Supporting a flexible text index and a simple data import format, LAILAPS can easily be used both as a search engine for comprehensive integrated life science databases and for small in-house project databases. With a set of features extracted from each database hit, in combination with user relevance preferences, a neural network predicts user-specific relevance scores. Using expert knowledge as training data for a predefined neural network, or using users' own relevance training sets, a reliable relevance ranking of database hits has been implemented. In this paper, we present the LAILAPS system, the concepts, benchmarks and use cases. LAILAPS is publicly available for SWISSPROT data at

  • Lange M, Spies K, Bargsten J, et al. The LAILAPS Search Engine: Relevance Ranking in Life Science Databases. J Integr Bioinform. 2010;7(2):1–11. doi 10.2390/biecoll-jib-2010-110 PubMed 20134080
  • Lange M, Spies K, Colmsee C, Flemming S, Klapperstück M, Scholz U. The LAILAPS Search Engine: A Feature Model for Relevance Ranking in Life Science Databases. J Integr Bioinform. 2010;7(3). doi 10.2390/biecoll-jib-2010-118 PubMed 20375444
  • Esch M, Chen J, Weise S, Hassani-Pak K, Scholz U, Lange M. A query suggestion workflow for life science IR-systems. J Integr Bioinform. 2014;11(2):237. doi 10.2390/biecoll-jib-2014-237 PubMed 24953306

Distinct bacteria are able to cope with highly diverse lifestyles; for instance, they can be free-living or host-associated. Thus, these organisms must possess a large and varied genomic arsenal to withstand different environmental conditions. To facilitate the identification of genomic features that might influence bacterial adaptation to a specific niche, we introduce LifeStyle-Specific-Islands (LiSSI). LiSSI combines evolutionary sequence analysis with statistical learning (Random Forest with feature selection, model tuning and robustness analysis). In summary, our strategy aims to identify conserved consecutive homology sequences (islands) in genomes and to identify the most discriminant islands for each lifestyle.

  • Barbosa E, Röttger R, Hauschild A-C, et al. LifeStyle-Specific-Islands (LiSSI): Integrated Bioinformatics Platform for Genomic Island Analysis. J Integr Bioinform. 2017;14(2). doi 10.1515/jib-2017-0010 PubMed 28678736
  • Srinivas V, Gopal S. LmTDRM Database: A Comprehensive Database on Thiol Metabolic Gene/Gene Products in Listeria monocytogenes EGDe. J Integr Bioinform. 2014;11(1). doi 10.2390/biecoll-jib-2014-245 PubMed 25228549

Omics datasets generated by microarray, mass spectrometry and next generation sequencing technologies require an integrated platform that can combine results from different omics datasets to provide novel insights in the understanding of biological systems. MADMAX is designed to provide a solution for storage and analysis of complex omics datasets. In addition, analysis results (such as lists of genes) will be merged to reveal candidate genes supported by all datasets. The system constitutes an ISA-Tab compliant LIMS part which is independent of different analysis pipelines. A pilot study of different types of omics data in Brassica rapa demonstrates the possible use of MADMAX. The web-based user interface provides easy access to data and analysis tools on top of the database.

  • Lin K, Kools H, de Groot PJ, et al. MADMAX - Management and analysis database for multiple

  • Hildebrandt C, Wolf S, Neumann S. Database supported candidate search for metabolite identification. J Integr Bioinform. 2011;8(2):157. doi 10.2390/biecoll-jib-2011-157 PubMed 21734330

In recent years the amount of biological data has exploded to the point where much useful information can only be extracted by complex computational analyses. Such analyses are greatly facilitated by metadata standards, both in terms of the ability to compare data originating from different sources, and in terms of exchanging data in standard forms, e.g. when running processes on a distributed computing infrastructure. However, standards thrive on stability whereas science tends to constantly move, with new methods being developed and old ones modified. Therefore maintaining both metadata standards, and all the code that is required to make them useful, is a non-trivial problem. Memops is a framework that uses an abstract definition of the metadata (described in UML) to generate internal data structures and subroutine libraries for data access (application programming interfaces--APIs--currently in Python, C and Java) and data storage (in XML files or databases). For the individual project these libraries obviate the need for writing code for input parsing, validity checking or output. Memops also ensures that the code is always internally consistent, massively reducing the need for code reorganisation. Across a scientific domain a Memops-supported data model makes it easier to support complex standards that can capture all the data produced in a scientific area, share them among all programs in a complex software pipeline, and carry them forward to deposition in an archive. The principles behind the Memops generation code will be presented, along with example applications in Nuclear Magnetic Resonance (NMR) spectroscopy and structural biology.

  • Fogh RH, Boucher W, Ionides JMC, Vranken WF, Stevens TJ, Laue ED. MEMOPS: Data modelling and automatic code generation. J Integr Bioinform. 2010;7(3). doi 10.2390/biecoll-jib-2010-123 PubMed 20375445

Helicobacter pylori is a pathogenic bacterium that colonizes the human epithelia, causing duodenal and gastric ulcers, and gastric cancer. The genome of H. pylori 26695 has been previously sequenced and annotated. In addition, two genome-scale metabolic models have been developed. In order to maintain accurate and relevant information on coding sequences (CDS) and to retrieve new information, new functions were assigned to the genes of Helicobacter pylori 26695 in this work. The use of software tools, on-line databases and an annotation pipeline for inspecting each gene allowed the attribution of validated EC numbers and TC numbers to metabolic genes encoding enzymes and transport proteins, respectively. In this annotation, 1212 protein-encoding genes were identified, of which 712 are metabolic and 500 non-metabolic, while 191 new functions were assigned to the CDS of this bacterium. This provides relevant biological information for the scientific community dealing with this organism and can be used as the basis for a new metabolic model reconstruction.

  • Resende T, Correia DM, Rocha M, Rocha I. Re-annotation of the genome sequence of Helicobacter pylori 26695. J Integr Bioinform. 2013;10(3):233. doi 10.2390/biecoll-jib-2013-233 PubMed 24231147

Database portal
Endocrinology and metabolism Plant biology Molecular interactions, pathways and networks Enzymes
Crop plants play a major role in human and animal nutrition and increasingly contribute to chemical or pharmaceutical industry and renewable resources. In order to achieve important goals, such as the improvement of growth or yield, it is indispensable to understand biological processes on a detailed level. Therefore, the well-structured management of fine-grained information about metabolic pathways is of high interest. Thus, we developed the MetaCrop information system, a manually curated repository of high quality information concerning the metabolism of crop plants. However, the data access to and flexible export of information of MetaCrop in standard exchange formats had to be improved. To automate and accelerate the data access we designed a set of web services to be integrated into external software. These web services have already been used by an add-on for the visualisation toolkit VANTED. Furthermore, we developed an export feature for the MetaCrop web interface, thus enabling the user to compose individual metabolic models using SBML.

  • Hippe K, Colmsee C, Czauderna T, et al. Novel Developments of the MetaCrop Information System for Facilitating Systems Biological Approaches. J Integr Bioinform. 2010;7(3). doi 10.2390/biecoll-jib-2010-125 PubMed 20375443
  • Üstünkar G, Son YA. METU-SNP: An Integrated Software System for SNP-Complex Disease Association Analysis. J Integr Bioinform. 2011;8(2). doi 10.2390/biecoll-jib-2011-187 PubMed 22156365
  • Flanagan K, Nakjang S, Hallinan J, et al. Microbase2.0: A Generic Framework for Computationally Intensive Bioinformatics Workflows in the Cloud. J Integr Bioinform. 2012;9(2). doi 10.2390/biecoll-jib-2012-212 PubMed 23001322

Proteomic and transcriptomic technologies have resulted in massive biological datasets whose interpretation requires sophisticated computational strategies. Efficient and intuitive real-time analysis remains challenging. We use proteomic data on 1417 proteins of the green microalga Chlamydomonas reinhardtii to investigate physicochemical parameters governing selectivity of three cysteine-based redox post-translational modifications (PTM): glutathionylation (SSG), nitrosylation (SNO) and disulphide bonds (SS) reduced by thioredoxins. We aim to understand underlying molecular mechanisms and structural determinants through integration of redox proteome data from the gene level to the structural level. Our interactive visual analytics approach on an 8.3 m² display wall of 25 MPixel resolution features stereoscopic three-dimensional (3D) representation performed by UnityMol WebGL. Virtual reality headsets complement the range of usage configurations for fully immersive tasks. Our experiments confirm that fast access to a rich cross-linked database is necessary for immersive analysis of structural data. We emphasize the possibility to display complex data structures and relationships in 3D, intrinsic to molecular structure visualization, but less common for omics-network analysis. Our setup is powered by MinOmics, an integrated analysis pipeline and visualization framework dedicated to multi-omics analysis. MinOmics integrates data from various sources into a materialized physical repository. We evaluate its performance, a design criterion for the framework.

  • Maes A, Martinez X, Druart K, Laurent B, Guégan S, Marchand CH, Lemaire SD, Baaden M. MinOmics, an Integrative and Immersive Tool for Multi-Omics Analysis. J Integr Bioinform. 2018;15(2). doi 10.1515/jib-2018-0006 PubMed 29927748
  • Busato M, Distefano R, Bates F, Karim K, Bossi AM, López Vilariño JM, Piletsky S, Bombieri N, Giorgetti A. MIRATE: MIps RATional dEsign Science Gateway. J Integr Bioinform. 2018;15(4). doi 10.1515/jib-2017-0075 PubMed 29897885

MicroRNAs (miRNAs/miRs) are important cellular components that regulate gene expression at the posttranscriptional level. Various upstream components regulate miR expression, and any deregulation causes disease conditions. Therefore, understanding the miR regulatory network at both the upstream and downstream levels is crucial, and a resource on this aspect will be helpful. Currently available miR databases are mostly related to downstream targets, sequences, or diseases, and as of now no database provides a complete picture of miR regulation in a specific condition. Our miR regulation web resource (miReg) is manually curated and represents validated upstream regulators (transcription factor, drug, physical, and chemical) along with downstream targets, associated biological process, experimental condition or disease state, up or down regulation of the miR in that condition, and corresponding PubMed references in a graphical and user-friendly manner, browsable through 5 browsing options. We have presented the exact facts described in the corresponding literature in relation to a given miR, whether it is a feed-back/feed-forward loop or inhibition/activation. Moreover, we have given various links to integrate data and to get a complete picture of any miR listed. The current version (Version 1.0) of miReg contains 47 important human miRs with 295 relations using 190 absolute references. We have also provided an example of the usefulness of miReg in establishing signalling pathways involved in cardiomyopathy. We believe that miReg, with its continuous upgrade and data enrichment, will be an essential miRNA knowledge base for the research community. This HTML based miReg can be accessed from: or

  • Barh D, Bhat D, Viero C. miReg: a resource for microRNA regulation. J Integr Bioinform. 2010;7(1). doi 10.2390/biecoll-jib-2010-144 PubMed 20693604

Identification of microRNA (miRNA) precursors has seen increased efforts in recent years. The difficulty of detecting pre-miRNAs experimentally has increased the use of computational approaches. Most of these approaches rely on machine learning, especially classification. In order to achieve successful classification, many parameters need to be considered, such as data quality, choice of classifier settings, and feature selection. For the latter, we developed a distributed genetic algorithm on HTCondor to perform feature selection. Moreover, we employed two widely used classification algorithms, libSVM and random forest, with different settings to analyze the influence on the overall classification performance. In this study we analyzed 5 human retro virus genomes: Human endogenous retrovirus K113, Hepatitis B virus (strain ayw), Human T lymphotropic virus 1, Human T lymphotropic virus 2, Human immunodeficiency virus 2, and Human immunodeficiency virus 1. We then predicted pre-miRNAs by using the information from known virus and human pre-miRNAs. Our results indicate that these viruses produce novel unknown miRNA precursors which warrant further experimental validation.

  • Saçar Demirci MD, Toprak M, Allmer J. A Machine Learning Approach for MicroRNA Precursor Prediction in Retro-transcribing Virus Genomes. J Integr Bioinform. 2016;13(5):303. doi 10.2390/biecoll-jib-2016-303 PubMed 28187417

Small non-coding RNAs, in particular microRNAs, are critical for normal physiology and are candidate biomarkers, regulators, and therapeutic targets for a wide variety of diseases. There is an ever-growing interest in the comprehensive and accurate annotation of microRNAs across diverse cell types, conditions, species, and disease states. High-throughput sequencing technology has emerged as the method of choice for profiling microRNAs. Specialized bioinformatic strategies are required to mine as much meaningful information as possible from the sequencing data to provide a comprehensive view of the microRNA landscape. Here we present miRquant 2.0, an expanded bioinformatics tool for accurate annotation and quantification of microRNAs and their isoforms (termed isomiRs) from small RNA-sequencing data. We anticipate that miRquant 2.0 will be useful for researchers interested not only in quantifying known microRNAs but also in mining the rich well of additional information embedded in small RNA-sequencing data.

  • Kanke M, Baran-Gale J, Villanueva J, Sethupathy P. miRquant 2.0: an Expanded Tool for Accurate Annotation and Quantification of MicroRNAs and their isomiRs from Small RNA-Sequencing Data. J Integr Bioinform. 2016;13(5). doi 10.2390/biecoll-jib-2016-307 PubMed 28187421
  • Baumbach J, Wittkop T, Weile J, Kohl T, Rahmann S. MoRAine--a web server for fast computational transcription factor binding motif re-annotation. J Integr Bioinform. 2008;5(2). doi 10.2390/biecoll-jib-2008-91 PubMed 20134062
  • Wittkop T, Rahmann S, Baumbach J. Efficient online transcription factor binding site adjustment by integrating transitive graph projection with MoRAine 2.0. J Integr Bioinform. 2010;7(3). doi 10.2390/biecoll-jib-2010-117 PubMed 20375458

Web application
Sequence analysis Proteins Molecular interactions, pathways and networks Sequencing Protein interactions
In recent years, several new tools applicable to protein analysis have been made available on the IBIVU web site. Recently, a number of tools, ranging from multiple sequence alignment construction to domain prediction, have been updated and/or extended with services for programmatic access using SOAP. We provide an overview of these tools and their application.

  • Brandt BW, Heringa J. Protein analysis tools and services at IBIVU. J Integr Bioinform. 2011;8(2). doi 10.2390/biecoll-jib-2011-168 PubMed 21900709

Command-line tool
DNA Mobile genetic elements Sequence analysis
Background: A miniature inverted repeat transposable element (MITE) is a short transposable element carrying no protein-coding regions. However, its high proliferation rate and sequence-specific insertion preference make it a good genetic tool for both natural evolution and experimental insertion mutagenesis. Recently active MITE copies are those that show clear signals of Terminal Inverted Repeats (TIRs) and Direct Repeats (DRs) and were recently translocated into their current sites. Their proliferation ability makes them good candidates for the investigation of genomic evolution. Results: This study optimizes the C++ code and running pipeline of the MITE Uncovering SysTem (MUST) by assuming that users have no prior knowledge of MITEs, and the current version, MUSTv2, shows significantly increased detection accuracy for recently active MITEs compared with similar programs. The running speed is also significantly increased compared with MUSTv1. We prepared a benchmark dataset, a simulated genome with 150 MITE copies, for researchers who may be interested. Conclusions: MUSTv2 is an accurate detection program for recently active MITE copies and is complementary to the existing template-based MITE mapping programs. We believe that the release of MUSTv2 will greatly facilitate genome annotation and structural analysis for researchers working with bioOMIC big data.
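
To make the TIR and DR signals mentioned above concrete, here is a minimal Python sketch. It is not the MUSTv2 algorithm, which the abstract does not describe in detail; the function names, the repeat lengths and the mismatch tolerance are all illustrative assumptions.

```python
# Illustrative sketch only: NOT the MUSTv2 implementation.
# tir_len, max_mismatches and dr_len are hypothetical parameters.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def has_tir(element: str, tir_len: int = 10, max_mismatches: int = 1) -> bool:
    """Check whether the two ends of `element` form a terminal inverted repeat.

    The first `tir_len` bases are compared with the reverse complement
    of the last `tir_len` bases, tolerating a few mismatches.
    """
    if len(element) < 2 * tir_len:
        return False
    left = element[:tir_len].upper()
    right_rc = reverse_complement(element[-tir_len:])
    mismatches = sum(a != b for a, b in zip(left, right_rc))
    return mismatches <= max_mismatches


def has_direct_repeat(left_flank: str, right_flank: str, dr_len: int = 2) -> bool:
    """Check whether the bases flanking the element are identical (a direct repeat)."""
    return left_flank[-dr_len:].upper() == right_flank[:dr_len].upper()


# Build a toy element whose right end is the reverse complement of its left end.
tir = "GGCCAGTCAC"
toy_element = tir + "TTTTTTTT" + reverse_complement(tir)
print(has_tir(toy_element), has_direct_repeat("ACGTA", "TACCG"))
```

A real detector would additionally have to scan a whole genome for candidate end pairs and score them, which this sketch leaves out.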

  • Ge R, Mai G, Zhang R, Wu X, Wu Q, Zhou F. MUSTv2: An Improved De Novo Detection Program for Recently Active Miniature Inverted Repeat Transposable Elements (MITEs). J Integr Bioinform. 2017;14(3). doi 10.1515/jib-2017-0029 PubMed 28796642

Database portal Web application
Biodiversity Data integration and warehousing
Fungi have crucial roles in ecosystems and are important associates for many organisms. They are adapted to a wide variety of habitats; however, their global distribution and diversity remain poorly documented. The exponential growth of DNA barcode information retrieved from the environment considerably assists the traditional ways of unraveling fungal diversity and detection. The raw DNA data, in association with environmental descriptors of metabarcoding studies, are made available in public sequence read archives. While this is potentially a valuable source of information for the investigation of Fungi across diverse environmental conditions, the annotation used to describe the environment is heterogeneous. Moreover, a uniform processing pipeline still needs to be applied to the available raw DNA data. Hence, a comprehensive framework to analyse these data in a large context is still lacking. We introduce the MycoDiversity DataBase, a database which includes public fungal metabarcoding data of environmental samples for the study of biodiversity patterns of Fungi. The framework we propose will contribute to our understanding of fungal biodiversity and aims to become a valuable source for large-scale analyses of patterns in space and time, in addition to assisting evolutionary and ecological research on Fungi.

  • Martorelli I, Helwerda LS, Kerkvliet J, Gomes SIF, Nuytinck J, van der Werff CRA, Ramackers GJ, Gultyaev AP, Merckx VSFT, Verbeek FJ. Fungal metabarcoding data integration framework for the MycoDiversity DataBase (MDDB). J Integr Bioinform. 2020. doi 10.1515/jib-2019-0046 PubMed 32463383

Biological networks can be large and complex, often consisting of different sub-networks or parts. Separating networks into parts, network partitioning, and layouts of overview graphs and sub-graphs are important for understandable visualisations of those networks. This article presents NetPartVis, which visualises non-overlapping clusters or partitions of graphs in the Vanted framework, based on a method for laying out an overview graph and several sub-graphs (partitions) in a coordinated, mental-map preserving way.

  • Garkov D, Klein K, Klukas C, Schreiber F. Mental-Map Preserving Visualisation of Partitioned Networks in Vanted. J Integr Bioinform. 2019. doi 10.1515/jib-2019-0026 PubMed 31199771

Organisms try to maintain homeostasis by balanced uptake of nutrients from their environment. From an atomic perspective this means that, for example, carbon:nitrogen:sulfur ratios are kept within given limits. Upon limitation of, for example, sulfur, its acquisition is triggered. For yeast it was shown that transporters and enzymes involved in sulfur uptake are encoded as paralogous genes that express different isoforms. Sulfur deprivation leads to up-regulation of isoforms that are poor in sulfur-containing amino acids, that is, methionine and cysteine. Accordingly, sulfur-rich isoforms are down-regulated. We developed a web-based software tool, dubbed Nutrilyzer, that extracts paralogous protein coding sequences from an annotated genome sequence and evaluates their atomic composition. When fed with gene-expression data for nutrient-limited and normal conditions, Nutrilyzer provides a list of genes that are significantly differentially expressed and simultaneously contain significantly different amounts of the limited nutrient in their atomic composition. Its intended use is in the field of ecological stoichiometry. Nutrilyzer is available at Here we describe the work flow and results with an example from a whole-genome Arabidopsis thaliana gene-expression analysis upon oxygen deprivation. 43 paralogs distributed over 37 homology clusters were found to be significantly differentially expressed while containing significantly different amounts of oxygen.
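
As a rough illustration of what "atomic composition" of a protein means here, the following Python sketch counts C, H, N, O and S atoms in an amino acid sequence. It is not Nutrilyzer's code; the function names are hypothetical, and the sketch assumes an unmodified linear peptide of the 20 standard amino acids, with one water molecule lost per peptide bond.

```python
# Illustrative sketch only: NOT the Nutrilyzer implementation.
# Atom counts (C, H, N, O, S) of the free standard amino acids.
ELEMENTS = ("C", "H", "N", "O", "S")
AA_ATOMS = {
    "A": (3, 7, 1, 2, 0),   "R": (6, 14, 4, 2, 0), "N": (4, 8, 2, 3, 0),
    "D": (4, 7, 1, 4, 0),   "C": (3, 7, 1, 2, 1),  "E": (5, 9, 1, 4, 0),
    "Q": (5, 10, 2, 3, 0),  "G": (2, 5, 1, 2, 0),  "H": (6, 9, 3, 2, 0),
    "I": (6, 13, 1, 2, 0),  "L": (6, 13, 1, 2, 0), "K": (6, 14, 2, 2, 0),
    "M": (5, 11, 1, 2, 1),  "F": (9, 11, 1, 2, 0), "P": (5, 9, 1, 2, 0),
    "S": (3, 7, 1, 3, 0),   "T": (4, 9, 1, 3, 0),  "W": (11, 12, 2, 2, 0),
    "Y": (9, 11, 1, 3, 0),  "V": (5, 11, 1, 2, 0),
}


def atomic_composition(protein: str) -> dict:
    """Count the atoms of each element in a linear, unmodified peptide."""
    totals = dict.fromkeys(ELEMENTS, 0)
    for residue in protein.upper():
        for element, count in zip(ELEMENTS, AA_ATOMS[residue]):
            totals[element] += count
    # Each peptide bond releases one water molecule (H2O).
    bonds = max(len(protein) - 1, 0)
    totals["H"] -= 2 * bonds
    totals["O"] -= bonds
    return totals


def element_fraction(protein: str, element: str) -> float:
    """Fraction of all counted atoms that belong to `element`."""
    composition = atomic_composition(protein)
    return composition[element] / sum(composition.values())


# Compare the sulfur content of two made-up isoform fragments.
print(element_fraction("MCMCGA", "S"), element_fraction("GAGAGA", "S"))
```

Comparing such fractions between paralogs, alongside expression data, is the kind of analysis the abstract describes; the statistical testing is not reproduced here.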

  • Lotz K, Schreiber F, Wünschiers R. Nutrilyzer: A Tool for Deciphering Atomic Stoichiometry of Differentially Expressed Paralogous Proteins. J Integr Bioinform. 2012;9(2). doi 10.2390/biecoll-jib-2012-196 PubMed 22796635

We present Omics Fusion, a new web-based platform for integrative analysis of omics data. Omics Fusion provides a collection of new and established tools and visualization methods to support researchers in exploring omics data, validating results or understanding how to adjust experiments in order to make new discoveries. It is easily extendible and new visualization methods are added continuously. It is available for free under:

  • Brink BG, Seidel A, Kleinbölting N, Nattkemper TW, Albaum SP. Omics Fusion - A Platform for Integrative Analysis of Omics Data. J Integr Bioinform. 2016;13(4):296. doi 10.2390/biecoll-jib-2016-296 PubMed 28187412

High throughput genomic studies can identify large numbers of potential candidate genes, which must be interpreted and filtered by investigators to select the best ones for further analysis. Prioritization is generally based on evidence that supports the role of a gene product in the biological process being investigated. The two most important bodies of information providing such evidence are bioinformatics databases and the scientific literature. In this paper we present an extension to the Ondex data integration framework that uses text mining techniques over Medline abstracts as a method for accessing both these bodies of evidence in a consistent way. In an example use case, we apply our method to create a knowledge base of Arabidopsis proteins implicated in plant stress response and use various scoring metrics to identify key protein-stress associations. In conclusion, we show that the additional text mining features are able to highlight proteins using the scientific literature that would not have been seen using data integration alone. Ondex is an open-source software project and can be downloaded, together with the text mining features described here, from

  • Hassani-Pak K, Legaie R, Canevet C, van den Berg HA, Moore JD, Rawlings CJ. Enhancing data integration with text analysis to find proteins implicated in plant stress response. J Integr Bioinform. 2010;7(3). doi 10.2390/biecoll-jib-2010-121 PubMed 20375451

The automated annotation of data from high throughput sequencing and genomics experiments is a significant challenge for bioinformatics. Most current approaches rely on sequential pipelines of gene finding and gene function prediction methods that annotate a gene with information from different reference data sources. Each function prediction method contributes evidence supporting a functional assignment. Such approaches generally ignore the links between the information in the reference datasets. These links, however, are valuable for assessing the plausibility of a function assignment and can be used to evaluate the confidence in a prediction. We are working towards a novel annotation system that uses the network of information supporting the function assignment to enrich the annotation process for use by expert curators and predicting the function of previously unannotated genes. In this paper we describe our success in the first stages of this development. We present the data integration steps that are needed to create the core database of integrated reference databases (UniProt, PFAM, PDB, GO and the pathway database Ara-Cyc) which has been established in the ONDEX data integration system. We also present a comparison between different methods for integration of GO terms as part of the function assignment pipeline and discuss the consequences of this analysis for improving the accuracy of gene function annotation. The methods and algorithms presented in this publication are an integral part of the ONDEX system which is freely available from

  • Pesch R, Lysenko A, Hindle M, et al. Graph-based sequence annotation using a data integration approach. J Integr Bioinform. 2008;5(2). doi 10.2390/biecoll-jib-2008-94 PubMed 20134069

Electronic laboratory notebooks (ELNs) are more accessible and reliable than their paper-based alternatives and thus find widespread adoption. While a large number of commercial products are available, small- to mid-sized laboratories often cannot afford the costs or are concerned about the longevity of the providers. Turning towards free alternatives, however, raises questions about data protection, which are not sufficiently addressed by available solutions. To serve as legal documents, ELNs must prevent scientific fraud through technical means such as digital signatures. It would also be advantageous if an ELN were integrated with a laboratory information management system to allow for comprehensive documentation of experimental work, including the location of samples that were used in a particular experiment. Here, we present OpenLabNotes, which adds state-of-the-art ELN capabilities to OpenLabFramework, a powerful and flexible laboratory information management system. In contrast to comparable solutions, it protects the intellectual property of its users by offering data protection with digital signatures. OpenLabNotes effectively closes the gap between research documentation and sample management, thus making OpenLabFramework more attractive for laboratories that seek to increase productivity through electronic data management.

  • List M, Franz M, Tan Q, Mollenhauer J, Baumbach J. OpenLabNotes--An Electronic Laboratory Notebook Extension for OpenLabFramework. J Integr Bioinform. 2015;12(3):274. doi 10.2390/biecoll-jib-2015-274 PubMed 26673790

Command-line tool Library
Microarray experiment Gene expression
Correlation analysis assuming coexpression of the genes is a widely used method for gene expression analysis in molecular biology. Yet the growing extent, quality and dimensionality of molecular biological data permit emerging, more sophisticated approaches like Boolean implications. We present an approach which combines the SOM (self-organizing maps) machine learning method with Boolean implication analysis to identify relations between genes, metagenes and similarly behaving metagene groups (spots). Our method provides a way to assign Boolean states to genes/metagenes/spots and offers a functional view over significantly variant elements of gene expression data on these three different levels. While being able to cover relations between weakly correlated entities, the Boolean implication method also decomposes these relations into six implication classes. Our method allows one to validate or identify potential relationships between genes and functional modules of interest and to assess their switching behaviour. Furthermore, the output of the method makes it possible to construct and study the network of genes. By providing logical implications as updating rules for the network, it can also serve to aid modelling approaches.
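
The idea of a Boolean implication between two genes can be sketched in a few lines of Python. This is not the authors' method: it assumes the six classes are the four one-directional implications plus equivalence and opposition, and it replaces any statistical test with a plain fraction cut-off. The thresholds, the `max_sparse_fraction` parameter and the function name are hypothetical.

```python
# Illustrative sketch only: a simplified quadrant-counting version of
# Boolean implication analysis, with hypothetical parameters.

def boolean_implication(x, y, x_threshold, y_threshold, max_sparse_fraction=0.05):
    """Classify the relation between two expression profiles.

    Each sample is mapped to a (x is high, y is high) quadrant. A quadrant is
    called sparse when it holds at most `max_sparse_fraction` of the samples.
    One sparse quadrant gives a one-directional implication; two diagonal
    sparse quadrants give equivalence or opposition. Otherwise return None.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")

    counts = {(a, b): 0 for a in (False, True) for b in (False, True)}
    for x_value, y_value in zip(x, y):
        counts[(x_value > x_threshold, y_value > y_threshold)] += 1

    n = len(x)
    sparse = {quadrant for quadrant, c in counts.items() if c / n <= max_sparse_fraction}

    if sparse == {(True, False), (False, True)}:
        return "equivalent"
    if sparse == {(True, True), (False, False)}:
        return "opposite"

    # The implication that holds when exactly one quadrant is (nearly) empty.
    single = {
        (True, False): "x high => y high",
        (True, True): "x high => y low",
        (False, False): "x low => y high",
        (False, True): "x low => y low",
    }
    if len(sparse) == 1:
        return single[next(iter(sparse))]
    return None


gene_a = [1, 2, 8, 9, 1, 9]  # made-up expression values
gene_b = [1, 9, 9, 8, 2, 9]
print(boolean_implication(gene_a, gene_b, x_threshold=5, y_threshold=5))
```

Note that such a relation can hold even when the overall correlation between the two profiles is weak, which is the point the abstract makes.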

  • Çakır MV, Binder H, Wirth H. Profiling of Genetic Switches using Boolean Implications in Expression Data. J Integr Bioinform. 2014;11(1). doi 10.2390/biecoll-jib-2014-246 PubMed 25318120
  • Silva FJM da, Pérez JMS, Pulido JAG, Rodríguez MAV. Parallel Niche Pareto AlineaGA – an Evolutionary Multiobjective approach on Multiple Sequence Alignment. J Integr Bioinform. 2011;8(3). doi 10.2390/biecoll-jib-2011-174 PubMed 21926437

Molecular interactions, pathways and networks Data management
Biological pathways are crucial to much of today's scientific research, including the study of specific biological processes related to human diseases. PathJam is a new comprehensive and freely accessible web-server application integrating scattered human pathway annotation from several public sources. The tool has been designed both (i) to be intuitive for wet-lab users, providing statistical enrichment analysis of pathway annotations, and (ii) to support the development of new integrative pathway applications. PathJam’s unique features and advantages include interactive graphs linking pathways and genes of interest, downloadable results in fully compatible formats, GSEA compatible output files and a standardized RESTful API.

  • Glez-Peña D, Reboiro-Jato M, Domínguez R, Gómez-López G, Pisano DG, Fdez-Riverola F. PathJam: a new service for integrating biological pathway information. J Integr Bioinform. 2010;7(1). doi 10.2390/biecoll-jib-2010-147 PubMed 20980714

Desktop application
Systems biology Molecular interactions, pathways and networks
Our understanding of complex biological processes can be enhanced by combining different kinds of high-throughput experimental data, but the use of incompatible identifiers makes data integration a challenge. We aimed to improve methods for integrating and visualizing different types of omics data. To validate these methods, we applied them to two previous studies on starvation in mice, one using proteomics and the other using transcriptomics technology. We extended the PathVisio software with new plugins to link proteins, transcripts and pathways. A low overall correlation between proteome and transcriptome data was detected (Spearman rank correlation: 0.21). At the level of individual genes, correlation was highly variable. Many mRNA/protein pairs, such as fructose biphosphate aldolase B and ATP Synthase, show good correlation. For other pairs, such as ferritin and elongation factor 2, an interesting effect is observed, where mRNA and protein levels change in opposite directions, suggesting they are not primarily regulated at the transcriptional level. We used pathway diagrams to visualize the integrated datasets and found it encouraging that transcriptomics and proteomics data supported each other at the pathway level. Visualization of the integrated dataset on pathways led to new observations on gene-regulation in the response of the gut to starvation. Our methods are generic and can be applied to any multi-omics study. The PathVisio software can be obtained at Supplemental data are available at , including instructions on reproducing the pathway visualizations of this manuscript.
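
The Spearman rank correlation quoted above can be computed as the Pearson correlation of rank-transformed values. The Python sketch below shows that calculation in plain standard-library code; it is not taken from PathVisio or its plugins, and the two input lists are made-up numbers standing in for mRNA and protein changes.

```python
# Illustrative sketch only: Spearman rank correlation from first principles.
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return covariance / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Hypothetical log fold changes for the same genes at mRNA and protein level.
mrna_change = [0.5, -1.2, 2.0, 0.1, -0.4]
protein_change = [0.2, 0.3, 1.1, -0.6, -0.5]
print(round(spearman(mrna_change, protein_change), 3))
```

In practice a library routine such as SciPy's `scipy.stats.spearmanr` would normally be used; the hand-written version only spells out what the coefficient measures.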

  • van Iersel MP, Sokolović M, Lenaerts K, et al. Integrated visualization of a multi-omics study of starvation in mouse intestine. J Integr Bioinform. 2014;11(1):235. doi 10.2390/biecoll-jib-2014-235 PubMed 24675236
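The multi-omics abstract above summarises proteome and transcriptome agreement with a Spearman rank correlation. As a rough illustration of what that number measures, here is a minimal standard-library sketch: rank both series (ties share their average rank) and take the Pearson correlation of the ranks. The helper names are hypothetical, and this is not the authors' PathVisio plugin code.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```

Here `x` and `y` would hold, for example, the mRNA and protein log fold changes of the same genes in the same order. Because only ranks are used, the measure is insensitive to the different scales of the two platforms.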

MicroRNAs (miRs) are known to interfere with mRNA expression, and much work has been put into predicting and inferring miR-mRNA interactions. Both sequence-based interaction predictions and interaction inference based on expression data have proven somewhat successful; furthermore, models that combine the two methods have had even more success. In this paper, I further refine and enrich the methods of miR-mRNA interaction discovery by integrating a Bayesian clustering algorithm into a model of prediction-enhanced miR-mRNA target inference, creating an algorithm called PEACOAT, which is written in the R language. I show that PEACOAT improves the inference of miR-mRNA target interactions using both simulated data and a data set of microarrays from samples of multiple myeloma patients. In simulated networks of 25 miRs and mRNAs, our methods using clustering can improve inference in roughly two-thirds of cases, and in the multiple myeloma data set, KEGG pathway enrichment was found to be more significant with clustering than without. Our findings are consistent with previous work in clustering of non-miR genetic networks and indicate that there could be a significant advantage to clustering of miR and mRNA expression data as a part of interaction inference.

  • Godsey B. Discovery of miR-mRNA interactions via simultaneous Bayesian inference of gene networks and clusters using sequence-based predictions and expression data. J Integr Bioinform. 2013;10(1). doi 10.2390/biecoll-jib-2013-227 PubMed 23846182

Systems biology plays a central role in biological network analysis in the post-genomic era. Cytoscape is the standard bioinformatics tool offering the community an extensible platform for computational analysis of the emerging cellular network together with experimental omics data sets. However, only a few apps, plugins or tools are available for simulating network dynamics in Cytoscape 3. Many approaches of varying complexity exist, but none of them has yet been integrated into Cytoscape as an app or plugin. Here, we introduce PetriScape, the first Petri net simulator for Cytoscape. Although discrete Petri nets are quite simplistic models, they are capable of modeling global network properties and simulating their behaviour. In addition, they are easy to understand and to visualize. PetriScape comes with the following main functionalities: (1) import of biological networks in SBML format, (2) conversion into a Petri net, (3) visualization as a Petri net, and (4) simulation and visualization of the token flow in Cytoscape. It allows straightforward Petri net model creation, simulation and visualization with Cytoscape, providing clues about the activity of key components in biological networks.

  • Almeida D, Azevedo V, Silva A, Baumbach J. PetriScape - A plugin for discrete Petri net simulations in Cytoscape. J Integr Bioinform. 2016;13(1):284. doi 10.2390/biecoll-jib-2016-284 PubMed 27402693
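To make the "token flow" of a discrete Petri net mentioned in the PetriScape abstract concrete, here is a minimal sketch, not PetriScape's implementation. Places hold integer token counts, a transition is enabled when every input place holds enough tokens, and firing moves tokens from inputs to outputs. The `Transition` class, the random choice among enabled transitions and the enzyme example are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Transition:
    name: str
    inputs: dict[str, int]   # place -> tokens consumed per firing
    outputs: dict[str, int]  # place -> tokens produced per firing


def enabled(marking: dict[str, int], t: Transition) -> bool:
    """A transition is enabled when every input place holds enough tokens."""
    return all(marking.get(place, 0) >= n for place, n in t.inputs.items())


def fire(marking: dict[str, int], t: Transition) -> dict[str, int]:
    """Return the marking reached by firing `t`; the input marking is not changed."""
    if not enabled(marking, t):
        raise ValueError(f"transition {t.name!r} is not enabled")
    new = dict(marking)
    for place, n in t.inputs.items():
        new[place] = new.get(place, 0) - n
    for place, n in t.outputs.items():
        new[place] = new.get(place, 0) + n
    return new


def simulate(marking: dict[str, int], transitions: list[Transition],
             steps: int, rng: random.Random) -> list[dict[str, int]]:
    """Fire one randomly chosen enabled transition per step; stop at a deadlock."""
    history = [dict(marking)]
    for _ in range(steps):
        candidates = [t for t in transitions if enabled(marking, t)]
        if not candidates:
            break
        marking = fire(marking, rng.choice(candidates))
        history.append(marking)
    return history


# Hypothetical example: substrate S binds enzyme E, the complex ES releases product P.
bind = Transition("bind", {"S": 1, "E": 1}, {"ES": 1})
catalyse = Transition("catalyse", {"ES": 1}, {"P": 1, "E": 1})
trace = simulate({"S": 2, "E": 1, "ES": 0, "P": 0}, [bind, catalyse], 10, random.Random(1))
```

Each entry of `trace` is one marking, so plotting the token count of a place over the entries gives the kind of activity profile the abstract refers to.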

Improvements in genome sequencing technology have increased the availability of full genomes and transcriptomes of many organisms. However, the major benefit of massively parallel sequencing is a better understanding of the organization and function of genes, which in turn leads to an understanding of phenotypes. Several tools are currently available for interpreting genomic data through automated gene annotation. Even though the accuracy of computational gene annotation is increasing, multiple lines of experimental evidence should still be gathered. Mass spectrometry allows the identification and sequencing of proteins as major gene products, and it is only these proteins that conclusively show whether or not a part of a genome is a coding region that contributes to phenotypes. Therefore, in the field of proteogenomics, computational methods are validated by exploiting mass spectrometric data. As a result, identification of novel protein-coding regions, validation of current gene models, and determination of upstream and downstream regions of genes can be achieved. In this paper, we present new functionality for our proteogenomic tool, PGMiner, which performs all proteogenomic steps, such as acquisition of mass spectrometric data, peptide identification against preprocessed sequence databases, assignment of statistical confidence to identified peptides, mapping confident peptides to gene models, and result visualization. The extensions cover determining proteotypic peptides and thus unambiguous protein identification. Furthermore, peptides conflicting with gene models can now be automatically assessed within the context of predicted alternative open reading frames.

  • Has C, Lashin SA, Kochetov A, Allmer J. PGMiner reloaded, fully automated proteogenomic annotation tool linking genomes to proteomes. J Integr Bioinform. 2016;13(4):16–23. doi 10.2390/biecoll-jib-2016-293 PubMed 28187409
  • Thiele H, Glandorf J, Hufnagel P. Bioinformatics strategies in life sciences: from data processing and data warehousing to biological knowledge extraction. J Integr Bioinform. 2010;7(1):141. doi 10.2390/biecoll-jib-2010-141 PubMed 20508300
  • Mallika V, Sivakumar KC, Jaichand S, Soniya EV. Kernel based machine learning algorithm for the efficient prediction of type III polyketide synthase family of proteins. J Integr Bioinform. 2010;7(1). doi 10.2390/biecoll-jib-2010-143 PubMed 20625199
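The PGMiner abstract describes mapping identified peptides back to a genome, including alternative open reading frames. The sketch below shows the simplest version of that idea, an exact search for a peptide in all six reading frames of a DNA string using the standard genetic code. It is an illustration only and makes no claim about how PGMiner itself works; `six_frame_hits` and the other names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for the first, second and third base.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate from position 0; stops become '*', unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_hits(dna: str, peptide: str) -> list[tuple[str, int, int]]:
    """Return (strand, frame, amino-acid offset) for every exact peptide match."""
    dna = dna.upper()
    hits = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            start = protein.find(peptide)
            while start != -1:
                hits.append((strand, frame, start))
                start = protein.find(peptide, start + 1)
    return hits
```

A search like this runs on unspliced genomic DNA, so a peptide that spans an exon-exon junction will not be found this way, and genes on the minus strand are reported with offsets relative to the reverse complement.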

Web application
Structure prediction Protein secondary structure Sequence analysis Protein folds and structural domains Nucleic acid structure analysis
In recent years, several new tools applicable to protein analysis have been made available on the IBIVU web site. Recently, a number of tools, ranging from multiple sequence alignment construction to domain prediction, have been updated and/or extended with services for programmatic access using SOAP. We provide an overview of these tools and their application.

  • Brandt BW, Heringa J. Protein analysis tools and services at IBIVU. J Integr Bioinform. 2011;8(2). doi 10.2390/biecoll-jib-2011-168 PubMed 21900709

MicroRNAs are short non-coding RNA transcripts that act as master cellular regulators with roles in orchestrating virtually all biological functions. The recent affordability and widespread use of high-throughput microRNA profiling technologies has grown along with the advancement of bioinformatics tools available for analysis of the mounting data flow. While there are many computational resources available for the management of data from genome-sequenced animals, researchers are often faced with the challenge of identifying the biological implications of the daunting amount of data generated from these high-throughput technologies. In this article, we review the current state of high-throughput microRNA expression profiling platforms, data analysis processes, and computational tools in the context of comparative molecular physiology. We also present RBioMIR and RBioFS, our R package implementations for differential expression analysis and random forest-based gene selection. Detailed installation guides are available online.

  • Zhang J, Hadj-Moussa H, Storey KB. Current Progress of High-Throughput MicroRNA Differential Expression Analysis and Random Forest Gene Selection for Model and Non-Model Systems: an R Implementation. J Integr Bioinform. 2016;13(5). doi 10.2390/biecoll-jib-2016-306 PubMed 28187420

Desktop application
Systems biology Biochemistry Chemical biology Simulation experiment
Reaction-diffusion systems are mathematical models that describe how the concentrations of substances distributed in space change under the influence of local chemical reactions and of diffusion, which causes the substances to spread out in space. The classical representation of a reaction-diffusion system is given by semi-linear parabolic partial differential equations, whose solution predicts how diffusion causes the concentration field to change with time. This change is proportional to the diffusion coefficient. If the solute moves in a homogeneous system in thermal equilibrium, the diffusion coefficients are constants that do not depend on the local concentration of solvent and solute. However, in nonhomogeneous and structured media the assumption of a constant intracellular diffusion coefficient is not necessarily valid, and, consequently, the diffusion coefficient is a function of the local concentration of solvent and solutes. In this paper we propose a stochastic model of reaction-diffusion systems in which the diffusion coefficients are functions of the local concentration, viscosity and frictional forces. We then describe the software tool Redi (REaction-DIffusion simulator), which we have developed in order to implement this model in a Gillespie-like stochastic simulation algorithm. Finally, we show the ability of our model implemented in the Redi tool to reproduce the observed gradient of the bicoid protein in the Drosophila melanogaster embryo. With Redi, we were able to simulate with an accuracy of 1% the experimental spatio-temporal dynamics of the bicoid protein, as recorded in time-lapse experiments obtained by direct measurements of transgenic bicoid-enhanced green fluorescent protein.

  • Lecca P, Ihekwaba AEC, Dematté L, Priami C. Stochastic simulation of the spatio-temporal dynamics of reaction-diffusion systems: the case for the bicoid gradient. J Integr Bioinform. 2010;7(1). doi 10.2390/biecoll-jib-2010-150 PubMed 21098882
  • Pitkänen E, Åkerlund A, Rantanen A, Jouhten P, Ukkonen E. ReMatch: a web-based tool to construct, store and share stoichiometric metabolic models with carbon maps for metabolic flux analysis. J Integr Bioinform. 2008;5(2). doi 10.2390/biecoll-jib-2008-102 PubMed 20134058
  • Ameline de Cadeville B, Loréal O, Moussouni-Marzolf F. RetroMine, or how to provide in-depth retrospective studies from Medline in a glance: the hepcidin use-case. J Integr Bioinform. 2015;12(3):275. doi 10.2390/biecoll-jib-2015-275 PubMed 26673791
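The Redi abstract above refers to a Gillespie-like stochastic simulation of reaction and diffusion. The sketch below shows the basic Gillespie direct method on a one-dimensional row of boxes, with molecules produced in the first box, hopping between neighbouring boxes and decaying. It deliberately uses a constant hopping rate, which is a simplifying assumption; Redi's defining feature, diffusion coefficients that depend on local concentration, is not reproduced here, and every name and rate is hypothetical.

```python
import random


def gillespie_1d(n_boxes: int, production: float, hop: float, decay: float,
                 t_end: float, rng: random.Random) -> list[int]:
    """Direct-method SSA on a row of boxes; returns molecule counts at t_end."""
    counts = [0] * n_boxes
    t = 0.0
    while True:
        # Enumerate every event that can currently happen, with its propensity.
        events = []
        if production > 0:
            events.append((production, "produce", 0))
        for i, n in enumerate(counts):
            if n == 0:
                continue
            if hop > 0 and i + 1 < n_boxes:
                events.append((hop * n, "right", i))
            if hop > 0 and i > 0:
                events.append((hop * n, "left", i))
            if decay > 0:
                events.append((decay * n, "decay", i))
        total = sum(rate for rate, _, _ in events)
        if total == 0:
            return counts  # nothing can happen any more
        # The waiting time to the next event is exponential with rate `total`.
        t += rng.expovariate(total)
        if t > t_end:
            return counts
        # Pick one event with probability proportional to its propensity.
        threshold = rng.random() * total
        running = 0.0
        for rate, kind, i in events:
            running += rate
            if threshold < running:
                break
        if kind == "produce":
            counts[0] += 1
        elif kind == "right":
            counts[i] -= 1
            counts[i + 1] += 1
        elif kind == "left":
            counts[i] -= 1
            counts[i - 1] += 1
        else:  # decay
            counts[i] -= 1
```

A call such as `gillespie_1d(20, 50.0, 2.0, 0.1, 30.0, random.Random(0))` returns one stochastic realisation; averaging many runs with different seeds gives a concentration profile along the row that can be compared against a measured gradient.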

Understanding how metabolic reactions translate the genome of an organism into its phenotype is a grand challenge in biology. Genome-wide association studies (GWAS) statistically connect genotypes to phenotypes, without any recourse to known molecular interactions, whereas a molecular mechanistic description ties gene function to phenotype through gene regulatory networks (GRNs), protein-protein interactions (PPIs) and molecular pathways. Integration of different regulatory information levels of an organism is expected to provide a good way of mapping genotypes to phenotypes. However, the lack of a curated metabolic model of rice is blocking the exploration of genome-scale multi-level network reconstruction. Here, we have merged GRN, PPI and genome-scale metabolic network (GSMN) approaches into a single framework for rice via reconstruction and integration of omics regulatory information. Firstly, we reconstructed a genome-scale metabolic model containing 4,462 function genes and 2,986 metabolites involved in 3,316 reactions, compartmentalized into ten subcellular locations. Furthermore, 90,358 pairs of protein-protein interactions, 662,936 pairs of gene regulations and 1,763 microRNA-target interactions were integrated into the metabolic model. Finally, a database was developed for systematically storing and retrieving the genome-scale multi-level network of rice. This provides a reference for understanding the genotype-phenotype relationship of rice, and for analysis of its molecular regulatory network.

  • Liu L, Mei Q, Yu Z, Sun T, Zhang Z, Chen M. An integrative bioinformatics framework for genome-scale multiple level network reconstruction of rice. J Integr Bioinform. 2013;10(2):223. doi 10.2390/biecoll-jib-2013-223 PubMed 23563093
  • Lee HM, Dietz KJ, Hofestädt R. Prediction of thioredoxin and glutaredoxin target proteins by identifying reversibly oxidized cysteinyl residues. J Integr Bioinform. 2010;7(3). doi 10.2390/biecoll-jib-2010-130 PubMed 20375441

SAD_BaSe is a blood bank data analysis software, created to assist in the management of blood donations and the blood production chain in blood establishments. In particular, the system keeps track of several collection and production indicators, enables the definition of collection and production strategies, and the measurement of quality indicators required by the Quality Management System regulating the general operation of blood establishments. This paper describes the general scenario of blood establishments and its main requirements in terms of data management and analysis. It presents the architecture of SAD_BaSe and identifies its main contributions. Specifically, it brings forward the generation of customized reports driven by decision making needs and the use of data mining techniques in the analysis of donor suspensions and donation discards.

  • Ramoa A, Maia S, Lourenço A. A rational framework for production decision making in blood establishments. J Integr Bioinform. 2012;9(3):204. doi 10.2390/biecoll-jib-2012-204 PubMed 22829575

Advances in bioinformatics have contributed to a significant increase in available information. Analysing this information requires distributed computing systems. This study proposes a multiagent system that incorporates grid technology to facilitate distributed data analysis by dynamically incorporating the roles associated with each specific case study. The system was applied to genetic sequencing data to extract relevant information about insertions, deletions or polymorphisms.

  • González R, Zato C, Benito R, et al. Automatic knowledge extraction in sequencing analysis with multiagent system and grid computing. J Integr Bioinform. 2012;9(3):206. doi 10.2390/biecoll-jib-2012-206 PubMed 22829577

The identification of genes and SNPs involved in human diseases remains a challenge. Many public resources, databases and applications collect biological data and perform annotations, increasing global biological knowledge. The need for SNP prioritization is emerging with the development of new high-throughput genotyping technologies, which allow the development of customized disease-oriented chips. Therefore, given a list of genes related to a specific biological process or disease as input, a crucial issue is finding the most relevant SNPs to analyse. The selection of these SNPs may rely on relevant a priori knowledge of the biomolecular features characterising all the annotated SNPs and genes of the provided list. The bioinformatics approach described here retrieves a ranked list of significant SNPs from a set of input genes, such as candidate genes associated with a specific disease. The system enriches the gene set by including other genes, associated with the original ones by ontological similarity evaluation. The proposed method relies on the integration of data from public resources in a vertical perspective (from genomics to systems biology data), the evaluation of features from biomolecular knowledge, the computation of partial scores for SNPs and finally their ranking by global score. Our approach has been implemented in a web-based tool called SNPRanker, which is accessible online. An interesting application of the presented system is the prioritisation of SNPs related to genes involved in specific pathologies, in order to produce custom arrays.

  • Calabria A, Mosca E, Viti F, Merelli I, Milanesi L. SNPRanker: a tool for identification and scoring of SNPs associated to target genes. J Integr Bioinform. 2010;7(3). doi 10.2390/biecoll-jib-2010-138 PubMed 20375450
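The SNPRanker abstract mentions computing partial scores per SNP and ranking by a global score, but it does not say how the partial scores are combined. The sketch below assumes a plain weighted sum, which is only one possible choice; the feature names, weights and the function `rank_snps` are hypothetical.

```python
def rank_snps(partial_scores: dict[str, dict[str, float]],
              weights: dict[str, float]) -> list[tuple[str, float]]:
    """Combine per-feature partial scores into a global score and sort descending."""
    ranked = []
    for snp_id, scores in partial_scores.items():
        # A feature missing for a SNP contributes nothing to its global score.
        total = sum(weights[f] * scores.get(f, 0.0) for f in weights)
        ranked.append((snp_id, total))
    # Highest global score first; ties are broken by SNP identifier.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Hypothetical partial scores for three SNPs and two features.
example_scores = {
    "snp_a": {"coding": 1.0, "conserved": 0.5},
    "snp_b": {"coding": 0.0, "conserved": 1.0},
    "snp_c": {"coding": 1.0},
}
example_weights = {"coding": 2.0, "conserved": 1.0}
ranking = rank_snps(example_scores, example_weights)
```

Keeping the partial scores separate until the last step makes it easy to see which feature pushed a SNP up the list, and to re-rank with different weights.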

Command-line tool
Systems biology Molecular interactions, pathways and networks Genomics
The generation and use of metabolic network reconstructions has increased over recent years. The development of such reconstructions has typically involved a time-consuming, manual process. Recent work has shown that steps undertaken in reconstructing such metabolic networks are amenable to automation. The SuBliMinaL Toolbox facilitates the reconstruction process by providing a number of independent modules to perform common tasks, such as generating draft reconstructions, determining metabolite protonation state, mass and charge balancing reactions, suggesting intracellular compartmentalisation, adding transport reactions and a biomass function, and formatting the reconstruction to be used in third-party analysis packages. The individual modules manipulate reconstructions encoded in Systems Biology Markup Language (SBML), and can be chained to generate a reconstruction pipeline, or used individually during a manual curation process. This work describes the individual modules themselves, and a study in which the modules were used to develop a metabolic reconstruction of Saccharomyces cerevisiae from the existing data resources KEGG and MetaCyc. The automatically generated reconstruction is analysed for blocked reactions, and suggestions for future improvements to the toolbox are discussed.

  • Swainston N, Smallbone K, Mendes P, Kell DB, Paton NW. The SuBliMinaL Toolbox: automating steps in the reconstruction of metabolic networks. J Integr Bioinform. 2011;8(2). doi 10.2390/biecoll-jib-2011-186 PubMed 22095399

Visualization is pivotal for gaining insight in systems biology data. As the size and complexity of datasets and supplemental information increases, an efficient, integrated framework for general and specialized views is necessary. MAYDAY is an application for analysis and visualization of general 'omics' data. It follows a trifold approach for data visualization, consisting of flexible data preprocessing, highly customizable data perspective plots for general purpose visualization and systems based plots. Here, we introduce two new systems biology visualization tools for MAYDAY. Efficiently implemented genomic viewers allow the display of variables associated with genomic locations. Multiple variables can be viewed using our new track-based ChromeTracks tool. A functional perspective is provided by visualizing metabolic pathways either in KEGG or BioPax format. Multiple options of displaying pathway components are available, including Systems Biology Graphical Notation (SBGN) glyphs. Furthermore, pathways can be viewed together with gene expression data either as heatmaps or profiles. We apply our tools to two 'omics' datasets of Pseudomonas aeruginosa. The general analysis and visualization tools of MAYDAY as well as our ChromeTracks viewer are applied to a transcriptome dataset. We furthermore integrate this dataset with a metabolome dataset and compare the activity of amino acid degradation pathways between these two datasets, by visually enhancing the pathway diagrams produced by MAYDAY.

  • Symons S, Zipplies C, Battke F, Nieselt K. Integrative Systems Biology Visualization with MAYDAY. J Integr Bioinform. 2010;7(3):1–14. doi 10.2390/biecoll-jib-2010-115 PubMed 20375461
  • Bartocci E, Cacciagrano D, Di Berardini MR, Merelli E, Vito L. UBioLab: a web-laboratory for ubiquitous in-silico experiments. J Integr Bioinform. 2012;9(1):192. doi 10.2390/biecoll-jib-2012-192 PubMed 22773116

Phylogenetics Genomics Sequence analysis Protein structure analysis
Unipro UGENE is an open-source bioinformatics toolkit that integrates popular tools along with original instruments for molecular biologists within a unified user interface. Nowadays, most bioinformatics desktop applications, including UGENE, make use of a local data model while processing different types of data. Such an approach is inconvenient for scientists working cooperatively and relying on the same data, because every workplace needs its own copies of certain files and those copies must be kept synchronized whenever they are modified. Therefore, we focused on bringing collaborative work into the UGENE user experience. Currently, several UGENE installations can be connected to a designated shared database and users can interact with it simultaneously. Such databases can be created by UGENE users and be used at their discretion. Objects of each data type supported by UGENE, such as sequences, annotations and multiple alignments, can now be easily imported from or exported to remote storage. One of the main advantages of this system, compared to existing ones, is the almost simultaneous access of client applications to shared data regardless of their volume. Moreover, the system is capable of storing millions of objects. The storage itself is a regular database server, so even an inexpert user is able to deploy it. Thus, UGENE may provide access to shared data for users located, for example, in the same laboratory or institution. UGENE is available online.

  • Protsyuk IV, Grekhov GA, Tiunov AV, Fursov MY. Shared bioinformatics databases within the Unipro UGENE platform. J Integr Bioinform. 2015;12(1):257. doi 10.2390/biecoll-jib-2015-257 PubMed 26527191

Desktop application
Systems biology Molecular interactions, pathways and networks Data visualisation Biomedical science
VANESA is modeling software for the automatic reconstruction and analysis of biological networks based on life-science database information. Using VANESA, scientists are able to model any kind of biological process or system as a biological network. Important molecular systems can be reconstructed automatically with information from the databases KEGG, MINT, IntAct, HPRD, and BRENDA. Additionally, experimental results can be expanded with database information to better analyze the investigated elements and processes in an overall context. Users can also apply graph-theoretical approaches in VANESA to identify regulatory structures and significant actors within the modeled systems. These structures can then be further investigated in the Petri net environment of VANESA. It is platform-independent, free of charge, and available online.

  • Brinkrolf C, Janowski SJ, Kormeier B, et al. VANESA - a software application for the visualization and analysis of networks in system biology applications. J Integr Bioinform. 2014;11(2):239. doi 10.2390/biecoll-jib-2014-239 PubMed 24953454
  • Brinkrolf C, Henke NA, Ochel L, Pucker B, Kruse O, Lutter P. Modeling and Simulating the Aerobic Carbon Metabolism of a Green Microalga Using Petri Nets and New Concepts of VANESA. J Integr Bioinform. 2018;15(3). doi 10.1515/jib-2018-0018 PubMed 30218605
  • Kormeier B, Hippe K, Arrigo P, Töpel T, Janowski S, Hofestädt R. Reconstruction of biological networks based on life science data integration. J Integr Bioinform. 2010;7(2). doi 10.2390/biecoll-jib-2010-146 PubMed 20978286
  • Hamzeiy H, Suluyayla R, Brinkrolf C, Janowski SJ, Hofestaedt R, Allmer J. Visualization and Analysis of MicroRNAs within KEGG Pathways using VANESA. J Integr Bioinform. 2017;14(1). doi 10.1515/jib-2016-0004 PubMed 28609293
  • Soh J, Xiao M, Do T, Meruvia-Pastor O, Sensen CW. Integrative visualization of temporally varying medical image patterns. J Integr Bioinform. 2011;8(2):161. doi 10.2390/biecoll-jib-2011-161 PubMed 21778531

Web application
Sequence sites, features and motifs Sequence analysis Database management Gene and protein families Molecular modelling
In recent years, several new tools applicable to protein analysis have been made available on the IBIVU web site. Recently, a number of tools, ranging from multiple sequence alignment construction to domain prediction, have been updated and/or extended with services for programmatic access using SOAP. We provide an overview of these tools and their application.

  • Brandt BW, Heringa J. Protein analysis tools and services at IBIVU. J Integr Bioinform. 2011;8(2). doi 10.2390/biecoll-jib-2011-168 PubMed 21900709

Structure is a widely used software tool for investigating population genetic structure with multi-locus genotyping data. The software uses an iterative algorithm to group individuals into "K" clusters, representing possibly K genetically distinct subpopulations. The serial implementation of this programme is processor-intensive even with small datasets. We describe an implementation of the program within a parallel framework. Speedup was achieved by running different replicates and values of K on each node of the cluster. A web-based, user-oriented GUI has been implemented in PHP, through which the user can specify input parameters for the programme. The number of processors to be used can be specified in the background command. A web-based visualization tool, "Visualstruct", written in PHP (with embedded HTML and JavaScript), allows for the graphical display of population clusters output from Structure, where each individual may be visualized as a line segment with K colors defining its possible genomic composition with respect to the K genetic sub-populations. The advantage over available programs is the increased number of individuals that can be visualized. Analyses of real datasets indicate a speedup of up to four when comparing the speed of execution on clusters of eight processors with the speed of execution on one desktop. The software package is freely available to interested users upon request.
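The parallel scheme described for Structure, running different values of K and different replicates independently, can be sketched as a job grid handed to a worker pool. The code below is an illustration under stated assumptions, not the authors' implementation: `run_one` is a hypothetical stand-in that only builds an output label, and a real worker would launch the Structure executable instead.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_one(job: tuple[int, int]) -> tuple[int, int, str]:
    """Hypothetical stand-in for a single Structure run at one (K, replicate)."""
    k, replicate = job
    # A real worker would start the external program here (for example via
    # subprocess) and return the location of its output file.
    return k, replicate, f"results_K{k}_rep{replicate}"


def run_grid(k_values, replicates: int, workers: int) -> list[tuple[int, int, str]]:
    """Run every (K, replicate) combination, at most `workers` at a time."""
    jobs = list(product(k_values, range(1, replicates + 1)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves job order, so results line up with `jobs`.
        return list(pool.map(run_one, jobs))


results = run_grid(range(1, 4), replicates=2, workers=8)
```

Because each run is independent, the achievable speedup is bounded by the number of jobs and by the slowest single run, which is consistent with the abstract reporting a speedup below the processor count.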