Talk:FASTQ format

Untitled
Technically, FASTQ is a multi-line format, but its use in short-read sequencing, where each record usually fits on four lines, tends to hide this.

Hence sequences may be line-wrapped, and quality strings too. Because @ is a legal quality character and may appear immediately after a newline in a line-wrapped quality string, a parser cannot treat every line beginning with @ as a record header. The robust solution is to count the bases in the sequence lines and then read quality lines until the same number of quality characters has been consumed. (If a new sequence header, or the end of the file, does not follow immediately after the quality, the file is malformed.)

Unfortunately, many people have implemented broken parsers, so you'll sometimes see ghastly messes where the first quality value on each line has been changed to zero (ASCII '!'). This is just a bug!

193.62.203.214 (talk) 15:36, 16 April 2009 (UTC) jkb
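
For anyone implementing this, the base-counting approach described above can be sketched roughly as follows. This is only a minimal illustration in Python, not a reference implementation: the function name read_fastq is made up for the example, and it assumes the '+' separator sits on its own line and that blank lines between records are harmless.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a possibly line-wrapped FASTQ stream."""
    line = handle.readline()
    while line:
        line = line.rstrip("\r\n")
        if not line:
            # Tolerate blank lines between records (an assumption of this sketch).
            line = handle.readline()
            continue
        if not line.startswith("@"):
            raise ValueError(f"expected '@' header, got {line!r}")
        name = line[1:]

        # Sequence lines run until the '+' separator line.
        seq_parts = []
        line = handle.readline()
        while line and not line.startswith("+"):
            seq_parts.append(line.strip())
            line = handle.readline()
        if not line:
            raise ValueError(f"record {name!r} has no '+' separator")
        seq = "".join(seq_parts)

        # Quality lines: keep reading until we have one character per base.
        # A leading '@' is NOT treated as a header while we are still short.
        qual_parts = []
        n = 0
        while n < len(seq):
            line = handle.readline()
            if not line:
                raise ValueError(f"record {name!r}: quality is truncated")
            chunk = line.rstrip("\r\n")
            qual_parts.append(chunk)
            n += len(chunk)
        if n != len(seq):
            raise ValueError(f"record {name!r}: quality longer than sequence")

        yield name, seq, "".join(qual_parts)
        line = handle.readline()


if __name__ == "__main__":
    # The second quality line starts with '@' but belongs to record r1.
    demo = "@r1\nACGT\nACGT\n+\nIIII\n@III\n@r2\nGG\n+\n!!\n"
    for record in read_fastq(io.StringIO(demo)):
        print(record)
```

Because the header check only happens once the quality length matches the sequence length, a stray line after a complete record is reported as an error rather than silently absorbed.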

The Celera Assembler implements yet another quality format based on this theme...
The input for the Celera Assembler is a 'frg' file

Apparently they take the (presumably Phred-style) quality score and add 48 before converting it to ASCII for storage in the frg file, i.e. "chr(ord('0')+$qual)", since ord('0') is 48.

--Dan|(talk) 15:27, 30 July 2009 (UTC)
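
To make the offset arithmetic concrete, here is a small Python sketch of offset-based quality encoding. The +48 value is taken purely from the description above and has not been checked against the Celera Assembler itself; the names encode_quals, decode_quals and FRG_OFFSET are invented for this illustration.

```python
SANGER_OFFSET = 33  # Sanger FASTQ: Phred+33
FRG_OFFSET = 48     # assumption, from the comment above: chr(ord('0') + qual)


def encode_quals(scores, offset):
    """Turn a list of integer quality scores into an ASCII string."""
    return "".join(chr(q + offset) for q in scores)


def decode_quals(text, offset):
    """Turn an ASCII quality string back into integer scores."""
    return [ord(c) - offset for c in text]


if __name__ == "__main__":
    scores = [0, 10, 40]
    print(encode_quals(scores, SANGER_OFFSET))  # Sanger-style string
    print(encode_quals(scores, FRG_OFFSET))     # frg-style string, if +48 is right
```

With an offset of 48, a score of 0 is stored as the character '0', which is presumably where the ord('0') in the expression above comes from.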

IonTorrent quality range
I've seen some IonTorrent quality values and they seem to have a different range from Sanger or Illumina. However, I don't have access to such a machine or its output, so I can't be sure. Can anyone with the machine confirm and put the range up? — Preceding unsigned comment added by Hena wp (talk • contribs) 18:25, 30 April 2013 (UTC)

Would adding color to the FASTQ versions text make it clearer?
  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
  ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
  ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
  .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
  LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
 33                        59   64       73                            104                   126
  0........................26...31.......40
                           -5....0........9.............................40
                                 0........9.............................40
                                    3.....9.............................40
  0........................26...31........41

 S - Sanger        Phred+33,  raw reads typically (0, 40)
 X - Solexa        Solexa+64, raw reads typically (-5, 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3, 40)
     with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold)
     (Note: See discussion above).
 L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)

Colors picked at random, and I don't absolutely guarantee that the alignment is correct. And there appears to be a problem with the J alignment in the original figure.

Tnabtaf (talk) 02:17, 22 October 2012 (UTC)

Got no comments; posting to page.

Tnabtaf (talk) 05:59, 22 January 2013 (UTC)
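
As a rough illustration of how the figure's legend can be read, the Python sketch below lists which variants are consistent with the characters seen in a quality string. It is a heuristic only: the ranges are the "typical raw read" ranges from the legend, not limits of the formats themselves, and the names VARIANTS and compatible_variants are invented for this example. For Illumina 1.5+ the lower bound is set to 2 so that the indicator character 'B' is accepted.

```python
# name -> (ASCII offset, lowest typical score, highest typical score)
# Values follow the figure's legend; they describe typical output, not format limits.
VARIANTS = {
    "Sanger": (33, 0, 40),
    "Solexa": (64, -5, 40),
    "Illumina 1.3+": (64, 0, 40),
    "Illumina 1.5+": (64, 2, 40),  # 2 ('B') is the Read Segment Quality Control Indicator
    "Illumina 1.8+": (33, 0, 41),
}


def compatible_variants(quality):
    """Return the variants whose typical ASCII range covers every character."""
    codes = [ord(c) for c in quality]
    lowest, highest = min(codes), max(codes)
    return [
        name
        for name, (offset, qmin, qmax) in VARIANTS.items()
        if offset + qmin <= lowest and highest <= offset + qmax
    ]


if __name__ == "__main__":
    for q in ("!5I", "Bhh", "J"):
        print(repr(q), compatible_variants(q))
```

Several variants overlap heavily, so a single short string often matches more than one of them; in practice many reads would need to be examined before drawing a conclusion.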

The Sanger FASTQ format does not cap the range at 40: it goes all the way up to ~ (93). After all, there is no upper limit on either the Phred or the Solexa quality scale. The same is probably true of the Solexa/Illumina <1.8 versions too, although the sequencing machines never gave a value above X because they could never be *that* confident. It is unlikely that X is 40 for all of these tools. Moreover, it's incorrect to say that the FORMAT doesn't support values larger than 40 just because the tools that produced the files do not emit them. — Preceding unsigned comment added by 2A02:8071:B1C0:C01:84E:7023:C079:6527 (talk) 17:45, 11 December 2016 (UTC)
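
The ceiling of 93 follows from simple arithmetic on printable ASCII, which a short Python sketch can show. It assumes the last usable character is '~' (ASCII 126) and uses the standard Phred definition Q = -10 log10(p); the function names max_score and phred_to_error are made up for this example.

```python
def max_score(offset, last_printable=126):
    """Largest score an offset-based encoding can store in printable ASCII."""
    return last_printable - offset


def phred_to_error(q):
    """Error probability implied by a Phred score, from Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for label, offset in (("Phred+33", 33), ("Phred+64", 64)):
        top = max_score(offset)
        print(label, "tops out at", top, "=", repr(chr(offset + top)))
    print("Phred 40 implies an error probability of", phred_to_error(40))
```

By the same arithmetic, an offset of 64 leaves room for scores up to 62, so values above 40 fit in the +64 encodings as well, whether or not any instrument emits them.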

The alignment of the figure is extremely unclear: it suggests that "I" represents Phred scores of both 40 and 41 in the two different Phred+33 lines — Preceding unsigned comment added by 138.253.68.174 (talk) 14:28, 18 October 2018 (UTC)

Sequence letter definitions?
I'm writing a FASTQ parser for Illumina exome data, and I found this article very useful! Thanks for writing it. The only data I see missing from this article that would aid me in completing the parser is sequence letter definitions. I see ACTG throughout the Illumina data, which makes sense, but I don't know what 'N' stands for. I'll figure it out, but it would be cool if sequence letters were documented here. WaywardGeek (talk) 12:00, 5 August 2013 (UTC)

External links modified
Hello fellow Wikipedians,

I have just added archive links to one external link on FASTQ format. Please take a moment to review my edit. If necessary, add {{cbignore}} after the link to keep me from modifying it. Alternatively, you can add {{nobots|deny=InternetArchiveBot}} to keep me off the page altogether. I made the following changes:
 * Added archive https://web.archive.org/20100610232559/http://genomecenter.ucdavis.edu/dna_technologies/documents/pipeline_1_4.pdf to http://genomecenter.ucdavis.edu/dna_technologies/documents/pipeline_1_4.pdf

When you have finished reviewing my changes, please set the checked parameter below to true to let others know.

Cheers.—cyberbot II (Talk to my owner: Online) 19:21, 27 January 2016 (UTC)