FuncZyme Database

A revolution in genomics over the last decade has largely streamlined genome assembly and gene identification, leading to the deposition of hundreds of plant genomes in databases. However, predicting and validating the functions of the identified genes remains a major challenge. In most plant genomes, genes, especially those involved in metabolism, belong to large families of dozens of members and are poorly annotated. For example, >80 of the 101 BAHD acyltransferases in cultivated tomato are annotated with just their domain architecture, which is not informative for dissecting the genetics of metabolic traits such as lignin/cuticular wax production, fruit ripening, stress response, and mutualist interactions.

There are two critical bottlenecks here:

  • although thousands of enzyme activities have been published, only a minuscule fraction is captured in function databases like UniProt and Gene Ontology, so the rest remain hidden from powerful function prediction programs and machine learning approaches, and
  • vocabularies and tools for function transfer are not based on substrate chemistry and do not take enzyme promiscuity into account.

The FuncZyme project seeks to address these challenges by:

  • curating thousands of enzyme activities of 10 large plant enzyme families from previously published papers
  • developing computational models to predict activities of uncharacterized enzymes from hundreds of plant genomes
  • discovering and validating new functions for uncharacterized enzymes in agriculturally important species such as tomato and soybean using novel SynBio methods
  • facilitating the deposition of these curated activities into global genome/proteome databases including UniProt, Gene Ontology and others.

This project is funded by the National Science Foundation IOS-Plant Genome Research Program Award #2310395.

Curated Data

Enzyme activity data curated from research publications, including various outputs of the FuncFetch pipeline.

Predictions

We will develop enzyme function prediction models based on the Curated Data. These models will be applied to high-quality sequenced plant genomes, and the resulting predictions will be stored here.