User:Kyleschutter/Sandbox

Difficulties in Automating Module Annotation

Paper: Easy to find. Author et al., "First part of name with ellipsis...", year.

Species: Varies from paper to paper; it's more difficult to find in some papers. Sometimes the species is not specified in the paper at all, and you have to look in the methods to see what cell line was used, then Google the cell line to see what species it comes from. These cases are often human, mouse or rat. Sometimes the species is in the abstract itself.

Stage/Location/Cell Type: Sometimes in the abstract. It is not always clear whether embryos or cell lines were tested. If the term "dpc" (days post coitum) appears, it is probably embryonic.

Gene Name: Always in the title or abstract, unless the paper is about a TF that regulates many genes or unknown genes.

Function: Often in the abstract or in the first section (e.g., the introduction). Sometimes it has to be found on the NCBI website, on the gene page below the common names.

Gene ID/Accession #/Start and End sites: Easy to find on the NCBI website. Sometimes the Gene ID is incorrect (!) and the accession number has to be used with the start and end sites.

Module Name: Sometimes found in the paper. Sometimes the author gives a name or refers to it as the "enhancer." The author may not state a module name; in those cases I do not fill in this field.

Module Start and End: Rarely stated. If there is nothing else to go by, I take the region spanning the two furthest-apart binding sites as the module. Sometimes a group of sites is discussed as a module but no coordinates or sequences are offered.

Conserved in: If a module with about the same binding sites in the same order is present in other organisms, I say it is conserved even if the sequences don't show 100% identity. This information can be found when the site is aligned with one or more other organisms in a figure with identical sequences highlighted.
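The fallback rule for Module Start and End (use the region spanning the two furthest-apart binding sites) is one of the few steps here that is mechanical enough to automate. Below is a minimal sketch of that rule only. The function name `module_span` and the representation of a site as a `(start, end)` pair are assumptions made for illustration, not part of any existing annotation tool.

```python
from typing import List, Optional, Tuple

Site = Tuple[int, int]  # hypothetical representation: (start, end) of one binding site


def module_span(sites: List[Site]) -> Optional[Tuple[int, int]]:
    """Return the smallest region that contains every annotated binding site."""
    if not sites:
        return None  # nothing to go by, so the field stays blank
    # Normalise each site so start <= end, in case a site was entered in reverse order.
    starts = [min(s, e) for s, e in sites]
    ends = [max(s, e) for s, e in sites]
    return min(starts), max(ends)


print(module_span([(120, 128), (45, 52), (300, 310)]))  # (45, 310)
```

This only covers the case where site coordinates are already known; it does nothing for papers that discuss a group of sites without giving coordinates or sequences.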

Inputs:

Factor Name: Usually in the title and/or abstract, though it may not be known or there may be many binding factors. Sometimes the authors say only that a certain family of TFs binds.

Start/End: The boxed region in a sequence diagram comparing species. If only a sequence is given (e.g., a diagram says the putative binding site NNNNNNN was tested, and later the authors say that binding site x was found to be involved), then that site is used. Caution: sometimes putative binding sites are reported in both of the above cases, but it later turns out that the site didn't bind or that there is not enough evidence for it.

Conserved in: Kyle says a site is conserved only if there is 100% identity with the site in another species; he chose 100% for consistency. Tamar will sometimes put "conserved" if the author says so.

Function: Kyle: Often the whole paper has to be read, and even then only scant conclusions can be made. Sometimes the conclusions or the concluding sentences of paragraphs hold function information, but often this does not give enough context. Kyle and Tamar disagree on how to specify this: Tamar thinks the function is simple (activator, repressor, etc.), whereas Kyle also specifies whether it is an input into AND/OR logic, whether it is a strong or weak activator, whether it is only seen with another factor, etc.
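Kyle's 100% identity rule for the "Conserved in" field of an input site can be written down directly. The sketch below assumes the two site sequences have already been extracted from an alignment as plain strings; the function name `is_site_conserved` is hypothetical. It does not capture Tamar's practice of accepting the author's own statement that a site is conserved.

```python
def is_site_conserved(site: str, ortholog_site: str) -> bool:
    """True only when the two aligned site sequences match at every position."""
    a = site.strip().upper()
    b = ortholog_site.strip().upper()
    # An empty sequence is treated as "no evidence", not as a match.
    return bool(a) and a == b


print(is_site_conserved("TGACGTCA", "tgacgtca"))  # True
print(is_site_conserved("TGACGTCA", "TGACGTTA"))  # False
```

Comparison is case-insensitive because figures often mix upper and lower case; any single mismatch makes the site count as not conserved under this rule.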

Groups: If there are logical, additive or synergistic relationships between sites, then I fill in this field with the corresponding function. If there are many sites for a single factor but it isn't clear what kind of combinatorial outcome is evoked, then I leave the group section blank.