User:Jerzyo/FileQuirks

FileQuirks is a bioinformatic web server for recognizing biological data types, developed in the Laboratory of Bioinformatics and Protein Engineering at IIMCB Warsaw (GeneSilico). It allows users to quickly check the format of a file containing biological data.

Background
We currently observe an explosion of publicly available bioinformatic tools and data. In parallel, the number of data formats in use keeps increasing; examples include the FASTA format, the mass spectrometry data format, the European Data Format and the Protein Data Bank file format. Despite several unification attempts, more than 20 formats for biological sequences are in use, and the number is still growing. Although standardized XML, CSV or tabular formats are promoted by various initiatives, most commonly used file formats are raw text files with no characteristic features that could be used to identify or distinguish them. As a result, users of bioinformatic software spend a significant amount of time checking which file formats they have and assessing whether these are compatible with the input or output formats of the tools they would like to use.

Algorithm
FileQuirks checks the format of the data file using an extremely simple and data-driven algorithm.

Example files for each of the file formats are stored in the database. Adding a new file format to recognize requires only providing example files of this format.

The system calculates a set of descriptors (hundreds or more), each with a value of 0 or 1, which are evaluated for each of the stored files. The descriptors currently used are regular expressions, designed to recognize patterns common in biological data, such as the word "BLAST", present in every BLAST report, or the ">" sign at the beginning of a line in sequence formats. If a regular expression matches a given file, the value of the descriptor is 1; otherwise it is 0. The matching is performed by the Python module re, with the multiline flag enabled.
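The descriptor calculation can be sketched in a few lines of Python. This is only an illustration: the pattern list and the function name descriptor_vector are assumptions made for the example, not the actual FileQuirks descriptor set, which contains hundreds of expressions.

```python
import re

# Hypothetical descriptor patterns; the real FileQuirks set is much larger
# and is not listed in this article.
DESCRIPTOR_PATTERNS = [
    r"BLAST",        # the word "BLAST", present in BLAST reports
    r"^>",           # ">" at the beginning of a line (sequence formats)
    r"^ATOM\s+\d+",  # assumed example: an atom record in a PDB-like file
]

# The multiline flag makes "^" match at the start of every line.
COMPILED = [re.compile(p, re.MULTILINE) for p in DESCRIPTOR_PATTERNS]


def descriptor_vector(text):
    """Return one 0/1 value per pattern: 1 if it matches anywhere in text."""
    return [1 if rx.search(text) else 0 for rx in COMPILED]


print(descriptor_vector(">seq1\nACGTACGT\n"))  # [0, 1, 0]
```

The same function would be applied both to the stored example files and to the user query, so that all files are described by vectors of equal length.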

The user query is evaluated against all regular expressions in the database. Afterwards, the data formats whose example files match similar regular expressions are presented to the user.
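The article does not specify how "similar" is measured. The sketch below assumes one simple choice, the fraction of descriptors on which two 0/1 vectors agree, and ranks each format by its best-matching example file. The names similarity and rank_formats and the sample vectors are hypothetical.

```python
def similarity(a, b):
    """Fraction of positions on which two 0/1 descriptor vectors agree."""
    if len(a) != len(b):
        raise ValueError("descriptor vectors must have equal length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def rank_formats(query_vec, examples):
    """Rank formats by the best similarity among their example files.

    examples maps a format name to the descriptor vectors of its stored
    example files. Returns (name, score) pairs, best score first.
    """
    scores = {
        name: max(similarity(query_vec, vec) for vec in vectors)
        for name, vectors in examples.items()
        if vectors
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up descriptor vectors for three formats (three descriptors each).
EXAMPLES = {
    "FASTA": [[0, 1, 0], [0, 1, 0]],
    "BLAST report": [[1, 0, 0], [1, 1, 0]],
    "PDB": [[0, 0, 1]],
}

for name, score in rank_formats([0, 1, 0], EXAMPLES):
    print(name, round(score, 2))
```

Other measures, such as the Jaccard index over matched expressions, would fit the same structure; only the similarity function would change.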

To improve the results, a set of "expert" regular expressions is also present, each designed to recognize only one specific format. An example of such an expression is "(>([^\t\n\r\f\v]*)\r?\n\r?([ANCTGUanctgu\n\r]{20,})){2,}", which (believe it or not) matches only files with more than one nucleic acid sequence in FASTA format. Expert expressions are evaluated against every user query, and the matching data types are presented.
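The expert expression quoted above can be tried directly with the Python re module. In this sketch the pattern is taken verbatim from the text, while the function name is_multi_nucleotide_fasta and the test strings are invented for illustration.

```python
import re

# Expert expression quoted in the text: at least two consecutive records,
# each a ">" header line followed by 20 or more nucleotide characters
# (line breaks count towards the 20).
MULTI_NUCLEOTIDE_FASTA = re.compile(
    r"(>([^\t\n\r\f\v]*)\r?\n\r?([ANCTGUanctgu\n\r]{20,})){2,}",
    re.MULTILINE,
)


def is_multi_nucleotide_fasta(text):
    """Return True if the expert expression matches somewhere in text."""
    return MULTI_NUCLEOTIDE_FASTA.search(text) is not None


two_records = (
    ">seq1\nACGTACGTACGTACGTACGTACGT\n"
    ">seq2\nTTGGCCAATTGGCCAATTGGCCAA\n"
)
one_record = ">seq1\nACGTACGTACGTACGTACGTACGT\n"
protein = ">p1\nMKVLAAGIVGLLLAQWERTY\n>p2\nMKVLAAGIVGLLLAQWERTY\n"

print(is_multi_nucleotide_fasta(two_records))  # True
print(is_multi_nucleotide_fasta(one_record))   # False
print(is_multi_nucleotide_fasta(protein))      # False
```

The single-record file fails because of the trailing {2,} quantifier, and the protein file fails because its residues fall outside the nucleotide character class.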

Input
A text file encoded in ASCII, UTF-8 or UTF-16.
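How the server decides between these encodings is not described. A minimal sketch, assuming that UTF-16 input carries a byte order mark and that everything else is UTF-8 (of which ASCII is a subset), could look as follows. The function name decode_input is hypothetical.

```python
import codecs


def decode_input(data):
    """Decode uploaded bytes as UTF-16 (if a BOM is present) or UTF-8.

    Assumption: UTF-16 files start with a byte order mark. ASCII needs no
    special handling because it is a subset of UTF-8.
    """
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order.
        return data.decode("utf-16")
    try:
        # "utf-8-sig" also strips an optional UTF-8 BOM.
        return data.decode("utf-8-sig")
    except UnicodeDecodeError as exc:
        raise ValueError("input is not ASCII, UTF-8 or UTF-16 text") from exc


print(decode_input(">seq1\nACGT\n".encode("utf-16")) == ">seq1\nACGT\n")
```

Decoding to a Python string first means the regular expressions always operate on characters and never on raw bytes.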

Output
A list of biological data formats, each with the probability that it matches the query.