User talk:Putmantime/Archive 1: Difference between revisions

Revision as of 22:24, 19 August 2015

Proposal for bringing microbial genome, gene, and protein items to Wikidata

Greetings Project Molecular Biology!

I am new to the community, working with Andrew Su and collaborators, and am here to build a centralized model organism database for microbial genome, gene, and gene product information. This project will create a centralized, consistent, and structured database that will include current genetic information for all microbial organisms. To do this, pending community support, we will modify the previously established ProteinBoxBot infrastructure to aggregate the wealth of knowledge of microbial genetics into the Wikidata project, starting with bacteria.

The project will consist of two major stages:

1) Develop ProteinBoxBot to populate Wikidata with microbial gene models/annotations, genome features and gene product information, from reliable public repositories.

2) Create a generic genome browser for all microbial organisms that will take advantage of the consistent and computer readable format of Wikidata genetic items.


It is a great time to bring microbes to Wikidata! In addition to the reliable and robust MyGene.info, the May 17, 2015 release of the updated NCBI bacterial RefSeq repository, Release 70, provides an excellent initial data source. From the ~6400 re-annotated or newly annotated bacterial genome assemblies included, 3268 high-quality reference and representative genomes (i.e. those that represent strain groupings for a species) have been selected to populate the bacterial genome database maintained by NCBI Bacterial RefSeq. This greatly reduces the noise caused by the redundant nature of microbial genomics by presenting a representative set of genomes for each species of bacteria, and it provides an ideal initial framework for our project.

Gene item development

The first task has been designing what microbial gene and protein items will look like. The task page for the project PBB/Microbial gene and protein items presents the structure of our model for a microbial gene item in Figure 1. This diagram displays the properties and QIDs that define an item, along with the linkage between the gene item, the organism it is found in, and the product it encodes. This basic structure creates a solid and consistent framework that experts in the research community will be able to add to.

Links to the prototype gene item displayed in Figure 1 (a gene in the bacterial species Chlamydia trachomatis) and to the related items in the structure are as follows:

Related items

(organism) Chlamydia trachomatis Q131065,

(representative strain A) C. trachomatis L2/434/BU Q20800254,

(representative strain B) C. trachomatis D/UW-3/CX Q20800373,

(gene item) translocated actin-recruiting phosphoprotein Q20797449,

(protein item) translocated actin-recruiting phosphoprotein Q17126483,

Properties

This structure takes advantage of existing gene properties and (at this point) requires no new properties to be created. However, because a microbial species comprises multiple strains, a few properties that normally hold a single value will need multiple values. Attached to each value will be the qualifier “found in taxon” that points to the representative strain item that the gene or protein item originates from.

The properties that require multiple values include:

Gene items

Entrez Gene ID P351,

genomic start P644,

genomic end P645,

Protein items

UniProt ID P352
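
To make the multi-value idea concrete, here is a minimal sketch in Python of how one gene item could carry one value per strain, each tagged with a “found in taxon” qualifier. It is only an illustration of the model, not ProteinBoxBot code: the `Statement` class and the two helper functions are hypothetical, the property ID P703 for “found in taxon” is my assumption (it is not listed above), and the values are placeholders rather than real identifiers or coordinates.

```python
from dataclasses import dataclass, field

# Assumption: P703 is the Wikidata property ID for "found in taxon".
FOUND_IN_TAXON = "P703"

# Representative strain items listed above.
STRAIN_A = "Q20800254"  # C. trachomatis L2/434/BU
STRAIN_B = "Q20800373"  # C. trachomatis D/UW-3/CX


@dataclass
class Statement:
    """Hypothetical stand-in for one Wikidata statement with qualifiers."""
    prop: str
    value: str
    qualifiers: dict = field(default_factory=dict)


def strain_statements(prop, values_by_strain):
    """Build one statement per strain, each qualified with 'found in taxon'."""
    return [
        Statement(prop, value, {FOUND_IN_TAXON: strain_qid})
        for strain_qid, value in values_by_strain.items()
    ]


def values_for_strain(statements, prop, strain_qid):
    """Return the values of `prop` whose qualifier points at `strain_qid`."""
    return [
        s.value
        for s in statements
        if s.prop == prop and s.qualifiers.get(FOUND_IN_TAXON) == strain_qid
    ]


# Placeholder values only; real IDs and coordinates would come from the data source.
gene_statements = (
    strain_statements("P351", {STRAIN_A: "ENTREZ_ID_A", STRAIN_B: "ENTREZ_ID_B"})
    + strain_statements("P644", {STRAIN_A: "START_A", STRAIN_B: "START_B"})
    + strain_statements("P645", {STRAIN_A: "END_A", STRAIN_B: "END_B"})
)

print(values_for_strain(gene_statements, "P644", STRAIN_A))
```

With this layout, a consumer such as a genome browser can pick out the coordinates for a single strain by filtering on the qualifier, while the gene item itself stays shared across strains.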

I am really curious to see what the community thinks of our model, so please express your support, opposition, suggestions or comments below. 


Cheers,

--Putmantime

Support

Oppose

Comments/Suggestions