Transcription (music)



In music, transcription is the practice of notating a piece or a sound that was previously unnotated, or that rarely circulates as written music, for example a jazz improvisation or a video game soundtrack. When a musician is tasked with creating sheet music from a recording and writes down the notes that make up the piece in music notation, they are said to have created a musical transcription of that recording. Transcription may also mean rewriting a piece of music, either solo or ensemble, for an instrument or instruments other than those for which it was originally intended. The Beethoven symphonies transcribed for solo piano by Franz Liszt are an example. Transcription in this sense is sometimes called arrangement, although strictly speaking transcriptions are faithful adaptations, whereas arrangements change significant aspects of the original piece. Further examples of music transcription include ethnomusicological notation of oral traditions of folk music, such as Béla Bartók's and Ralph Vaughan Williams' collections of the national folk music of Hungary and England respectively. The French composer Olivier Messiaen transcribed birdsong in the wild and incorporated it into many of his compositions, for example his Catalogue d'oiseaux for solo piano. Transcription of this nature involves scale degree recognition and harmonic analysis, both of which require the transcriber to have relative or perfect pitch.

In popular music and rock, there are two forms of transcription. Individual performers copy a guitar solo or other melodic line note for note. In addition, music publishers transcribe entire recordings of guitar solos and bass lines and sell the sheet music in bound books. Music publishers also publish PVG (piano/vocal/guitar) transcriptions of popular music, in which the melody line is transcribed and the accompaniment on the recording is arranged as a piano part. The guitar aspect of the PVG label is achieved through guitar chords written above the melody. Lyrics are included below the melody.

Adaptation
Some composers have rendered homage to other composers by creating "identical" versions of the earlier composers' pieces while adding their own creativity through the use of completely new sounds arising from the difference in instrumentation. The most widely known example of this is Ravel's arrangement for orchestra of Mussorgsky's piano piece Pictures at an Exhibition. Webern used his transcription for orchestra of the six-part ricercar from Bach's The Musical Offering to analyze the structure of the Bach piece, by using different instruments to play different subordinate motifs of Bach's themes and melodies.

In transcription of this form, the new piece can simultaneously imitate the original sounds while recomposing them with all the technical skills of an expert composer, in such a way that the piece seems to have been originally written for the new medium. But some transcriptions and arrangements have been done for purely pragmatic or contextual reasons. For example, in Mozart's time, the overtures and songs from his popular operas were transcribed for small wind ensemble simply because such ensembles were common ways of providing popular entertainment in public places. Mozart himself did this in his opera Don Giovanni, transcribing for small wind ensemble several arias from other operas, including one from his own opera The Marriage of Figaro. A more recent example is Stravinsky's transcription for piano four hands of The Rite of Spring, made for use in the ballet's rehearsals. Today musicians who play in cafes or restaurants will sometimes play transcriptions or arrangements of pieces written for a larger group of instruments.

Other examples of this type of transcription include Bach's arrangement of Vivaldi's concerto for four violins for four keyboard instruments and orchestra; Mozart's arrangement of some Bach fugues from The Well-Tempered Clavier for string trio; Beethoven's arrangement of his Große Fuge, originally written for string quartet, for piano duet, and his arrangement of his Violin Concerto as a piano concerto; Franz Liszt's piano arrangements of the works of many composers, including the symphonies of Beethoven; Tchaikovsky's arrangement of four Mozart piano pieces into an orchestral suite called "Mozartiana"; Mahler's re-orchestration of Schumann symphonies; and Schoenberg's arrangement for orchestra of Brahms's piano quartet and Bach's "St. Anne" Prelude and Fugue for organ.

Since the piano became a popular instrument, a large literature has sprung up of transcriptions and arrangements for piano of works for orchestra or chamber music ensemble. These are sometimes called "piano reductions", because the multiplicity of orchestral parts—in an orchestral piece there may be as many as two dozen separate instrumental parts being played simultaneously—has to be reduced to what a single pianist (or occasionally two pianists, on one or two pianos, such as the different arrangements for George Gershwin's Rhapsody in Blue) can manage to play.

Piano reductions are frequently made of orchestral accompaniments to choral works, for the purposes of rehearsal or of performance with keyboard alone.

Many orchestral pieces have been transcribed for concert band.

Notation software
Since the advent of desktop publishing, musicians can acquire music notation software, which lets the user enter the notes they have worked out by ear and then stores and formats those notes into standard music notation for personal printing or professional publishing of sheet music. Some notation software can accept a Standard MIDI File (SMF) or MIDI performance as input instead of manual note entry. These notation applications can export their scores in a variety of formats such as EPS, PNG, and SVG. Often the software contains a sound library that allows the application to play the user's score aloud for verification.

Slow-down software
Prior to the invention of digital transcription aids, musicians would slow down a record or a tape recording to be able to hear the melodic lines and chords at a slower, more digestible pace. The problem with this approach was that it also changed the pitches, so once a piece was transcribed, it would then have to be transposed into the correct key. Software designed to slow down the tempo of music without changing its pitch can be very helpful for recognizing pitches, melodies, chords, rhythms and lyrics when transcribing music. Unlike the slow-down effect of a record player, such software keeps the pitch and original octave of the notes the same rather than lowering them. This technology is simple enough that it is available in many free software applications.

The software generally goes through a two-step process to accomplish this. First, the audio file is played back at a lower sample rate than that of the original file. This has the same effect as playing a tape or vinyl record at a slower speed: the pitch is lowered, so the music can sound as if it is in a different key. The second step is to use digital signal processing (DSP) to shift the pitch back up to the original pitch level or musical key.
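
The size of the pitch change in the first step follows from the playback rate alone, which also tells the second step how far to shift back. The sketch below shows only that arithmetic, not the signal processing itself; the function names are illustrative and assume equal temperament (12 semitones per octave).

```python
import math


def semitone_shift(playback_rate: float) -> float:
    """Pitch change, in semitones, caused by replaying audio at
    `playback_rate` times its original speed (0.5 = half speed)."""
    if playback_rate <= 0:
        raise ValueError("playback_rate must be positive")
    # Frequencies scale with the playback rate; one octave (a factor
    # of 2) corresponds to 12 semitones.
    return 12.0 * math.log2(playback_rate)


def correction_semitones(playback_rate: float) -> float:
    """Shift the second step must apply to restore the original pitch."""
    return -semitone_shift(playback_rate)


print(semitone_shift(0.5))        # half speed lowers pitch by one octave: -12.0
print(correction_semitones(0.5))  # so the pitch shifter must raise it by 12.0
```

At half speed the music drops by exactly one octave, so the key is unchanged but the register is not; at other rates, such as 0.75, the shift is not a whole number of semitones.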

Pitch tracking software
As described in the Automatic music transcription section below, some commercial software can roughly track the pitch of dominant melodies in polyphonic musical recordings. The note scans are not exact, and often need to be manually edited by the user before being saved to file in either a proprietary file format or Standard MIDI File format. Some pitch tracking software also allows the scanned note lists to be animated during audio playback.

Automatic music transcription
The term "automatic music transcription" was first used by audio researchers James A. Moorer, Martin Piszczalski, and Bernard Galler in 1977. With their knowledge of digital audio engineering, these researchers believed that a computer could be programmed to analyze a digital recording of music such that the pitches of melody lines and chord patterns could be detected, along with the rhythmic accents of percussion instruments. The task of automatic music transcription concerns two separate activities: making an analysis of a musical piece, and printing out a score from that analysis.

This was not a simple goal, but one that would encourage academic research for at least another three decades. Because of the close scientific relationship of speech to music, much academic and commercial research that was directed toward the better-funded speech recognition technology would be recycled into research about music recognition technology. While many musicians and educators insist that manually doing transcriptions is a valuable exercise for developing musicians, the motivation for automatic music transcription remains the same as the motivation for sheet music: musicians who do not have intuitive transcription skills will search for sheet music or a chord chart, so that they may quickly learn how to play a song. A collection of tools created by this ongoing research could be of great aid to musicians. Since much recorded music does not have available sheet music, an automatic transcription device could also offer transcriptions that are otherwise unavailable. To date, no software application can completely fulfill James Moorer’s definition of automatic music transcription. However, the pursuit of automatic music transcription has spawned the creation of many software applications that can aid in manual transcription. Some can slow down music while maintaining original pitch and octave, some can track the pitch of melodies, some can track the chord changes, and others can track the beat of music.

Automatic transcription most fundamentally involves identifying the pitch and duration of the performed notes. This entails tracking pitch and identifying note onsets. After capturing those physical measurements, this information is mapped into traditional music notation, i.e., the sheet music.

Digital signal processing is the branch of engineering that provides software engineers with the tools and algorithms needed to analyze a digital recording in terms of pitch (note detection of melodic instruments) and the energy content of unpitched sounds (detection of percussion instruments). Musical recordings are sampled at a given sampling rate, and the resulting sample data are stored in a digital wave format on the computer. Such a format represents sound by digital sampling.

Pitch detection
Pitch detection is often the detection of individual notes that might make up a melody in music, or the notes in a chord. When a single key is pressed upon a piano, what we hear is not just one frequency of sound vibration, but a composite of multiple sound vibrations occurring at different mathematically related frequencies. The elements of this composite of vibrations at differing frequencies are referred to as harmonics or partials.

For instance, if we press the Middle C key on the piano, the composite's harmonics start at the fundamental frequency of 261.6 Hz; the 2nd harmonic is at about 523 Hz, the 3rd at about 785 Hz, the 4th at about 1046 Hz, and so on. The later harmonics are integer multiples of the fundamental frequency (for example, 2 × 261.6 ≈ 523, 3 × 261.6 ≈ 785, 4 × 261.6 ≈ 1046). While only about eight harmonics are really needed to audibly recreate the note, the total number of harmonics in this mathematical series can be large, although the higher the harmonic's number, the weaker its magnitude and contribution. Contrary to intuition, a musical recording at its lowest physical level is not a collection of individual notes, but is really a collection of individual harmonics. That is why very similar-sounding recordings can be created with differing collections of instruments and their assigned notes. As long as the total harmonics of the recording are recreated to some degree, it does not really matter which instruments or which notes were used.
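
The harmonic series in the Middle C example can be reproduced with a few lines of code. This is a minimal sketch that assumes ideal harmonics at exact integer multiples of the fundamental; the function name is illustrative.

```python
def harmonic_series(fundamental_hz: float, count: int) -> list[float]:
    """Frequencies of the first `count` harmonics of a note, where the
    1st harmonic is the fundamental itself."""
    return [n * fundamental_hz for n in range(1, count + 1)]


# First four harmonics of Middle C, rounded to the nearest hertz.
print([round(f) for f in harmonic_series(261.6, 4)])  # [262, 523, 785, 1046]
```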

A first step in the detection of notes is the transformation of the sound file's digital data from the time domain into the frequency domain, which enables the measurement of various frequencies over time. The graphic image of an audio recording in the frequency domain is called a spectrogram or sonogram. A musical note, as a composite of various harmonics, appears in a spectrogram like a vertically placed comb, with the individual teeth of the comb representing the various harmonics and their differing frequency values. A Fourier Transform is the mathematical procedure that is used to create the spectrogram from the sound file’s digital data.
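
As a rough illustration of this time-to-frequency transformation, the sketch below builds a small spectrogram from a synthetic tone using only the Python standard library. It uses a naive discrete Fourier transform rather than a fast one (same result, slower), and the sample rate, frame size, hop size and Hann window are assumptions chosen for the example.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of frequency bins 0..N/2 of one frame (naive DFT)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        mags.append(abs(total))
    return mags


def spectrogram(samples, frame_size, hop):
    """One magnitude spectrum per overlapping, Hann-windowed frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_size)
              for i in range(frame_size)]
    spectra = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_size)]
        spectra.append(dft_magnitudes(frame))
    return spectra


# Synthetic test signal: a 1000 Hz sine sampled at 8000 Hz.
sample_rate = 8000
frame_size = 128
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]

spec = spectrogram(tone, frame_size=frame_size, hop=64)

# The strongest bin of the first frame, converted back to hertz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / frame_size
print(len(spec), peak_bin, peak_hz)  # 7 16 1000.0
```

Each row of `spec` is one vertical slice of the spectrogram. A real note would show several peaks per slice, one per harmonic, rather than the single peak of this pure tone.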

The task of many note detection algorithms is to search the spectrogram for the occurrence of such comb patterns (a composite of harmonics) caused by individual notes. Once the pattern of a note's particular comb shape of harmonics is detected, the note's pitch can be measured by the vertical position of the comb pattern upon the spectrogram.

There are two basic types of music that place very different demands on a pitch detection algorithm: monophonic music and polyphonic music. Monophonic music is a passage with only one instrument playing one note at a time, while polyphonic music can have multiple instruments and vocals playing at once. Pitch detection upon a monophonic recording is a relatively simple task, and its technology enabled the invention of guitar tuners in the 1970s. However, pitch detection upon polyphonic music is a much more difficult task because the image of its spectrogram now appears as a vague cloud due to a multitude of overlapping comb patterns, caused by each note's multiple harmonics.

Another method of pitch detection was invented by Martin Piszczalski in conjunction with Bernard Galler in the 1970s and has since been widely followed. It targets monophonic music. Central to this method is how pitch is determined by the human ear. The process attempts to roughly mimic the biology of the human inner ear by finding only a few of the loudest harmonics at a given instant. That small set of found harmonics is in turn compared against the harmonic sets of all the possible resultant pitches, to hypothesize what the most probable pitch could be given that particular set of harmonics. To date, the complete note detection of polyphonic recordings remains an unsolved problem for audio engineers, although they continue to make progress by inventing algorithms which can partially detect some of the notes of a polyphonic recording, such as a melody or bass line.
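
The idea of comparing a few loud harmonics against the harmonic sets of candidate pitches can be sketched as follows. This is a simplified illustration of the general idea, not the published Piszczalski and Galler algorithm: the candidate list (equal-tempered notes), the 1% tolerance, the limit of ten harmonics and the tie-breaking rule are all assumptions made for the example.

```python
def equal_tempered_candidates(low_midi=36, high_midi=84):
    """Candidate fundamentals: equal-tempered notes from C2 to C6 (A4 = 440 Hz)."""
    return [440.0 * 2 ** ((m - 69) / 12) for m in range(low_midi, high_midi + 1)]


def match_count(partials, f0, tolerance=0.01, max_harmonic=10):
    """Number of measured partials lying close to a harmonic of f0."""
    count = 0
    for p in partials:
        n = round(p / f0)  # nearest harmonic number
        if 1 <= n <= max_harmonic and abs(p - n * f0) <= tolerance * n * f0:
            count += 1
    return count


def most_probable_pitch(partials, candidates):
    """Candidate explaining the most partials; ties go to the higher
    fundamental, which avoids picking a sub-octave of the true pitch."""
    return max(candidates, key=lambda f0: (match_count(partials, f0), f0))


candidates = equal_tempered_candidates()

# Three loudest partials of a Middle C.
estimate = most_probable_pitch([261.6, 523.2, 784.8], candidates)
print(round(estimate, 1))  # 261.6

# The same pitch is chosen when the fundamental itself is absent.
no_fundamental = most_probable_pitch([523.2, 784.8, 1046.4], candidates)
print(round(no_fundamental, 1))  # 261.6
```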

Beat detection
Beat tracking is the determination of a repeating time interval between perceived pulses in music. Beat can also be described as 'foot tapping' or 'hand clapping' in time with the music. The beat is often a predictable basic unit in time for the musical piece, and may only vary slightly during the performance. Songs are frequently measured for their Beats Per Minute (BPM) in determining the tempo of the music, whether it be fast or slow.

Since notes frequently begin on a beat, or a simple subdivision of the beat's time interval, beat tracking software has the potential to better resolve note onsets that may have been detected in a crude fashion. Beat tracking is often the first step in the detection of percussion instruments.

Despite the intuitive nature of 'foot tapping', of which most humans are capable, developing an algorithm to detect those beats is difficult. Most current software algorithms for beat detection use a group of competing hypotheses for beats per minute, as the algorithm progressively finds and resolves local peaks in volume, roughly corresponding to the foot-taps of the music.
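
A toy version of the competing-hypotheses idea is sketched below. It assumes the local peaks in volume have already been found and are given as onset times in seconds; each tempo hypothesis is scored by how far the onsets fall from an evenly spaced beat grid. The scoring rule and the 60 to 180 BPM search range are assumptions for illustration, not a description of any particular product.

```python
def grid_error(onsets, bpm):
    """Mean distance of the onsets from a beat grid at `bpm`, measured
    as a fraction of one beat (0.0 = every onset exactly on a beat)."""
    period = 60.0 / bpm
    start = onsets[0]  # anchor the grid on the first onset
    total = 0.0
    for t in onsets:
        phase = ((t - start) / period) % 1.0
        total += min(phase, 1.0 - phase)
    return total / len(onsets)


def best_tempo(onsets, hypotheses):
    """The tempo hypothesis whose beat grid fits the onsets best."""
    return min(hypotheses, key=lambda bpm: grid_error(onsets, bpm))


# Volume peaks spaced half a second apart.
onsets = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
tempo = best_tempo(onsets, range(60, 181))
print(tempo)  # 120
```

Limiting the search range matters: any multiple of the true tempo, such as 240 BPM here, would fit the same onsets equally well.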

How automatic music transcription works
To transcribe music automatically, several problems must be solved:

1. Notes must be recognized – this is typically done by changing from the time domain into the frequency domain. This can be accomplished through the Fourier transform. Computer algorithms for doing this are common. The fast Fourier transform algorithm computes the frequency content of a signal, and is useful in processing musical excerpts.

2. A beat and tempo need to be detected (beat detection) – this is a difficult, many-faceted problem.

The method proposed in Costantini et al. 2009 focuses on note events and their main characteristics: the attack instant, the pitch and the final instant. Onset detection exploits a binary time-frequency representation of the audio signal. Note classification and offset detection are based on constant Q transform (CQT) and support vector machines (SVMs).

Pitch detection in turn leads to a “pitch contour”, namely a continuously time-varying line that corresponds to what humans refer to as melody. The next step is to segment this continuous melodic stream to identify the beginning and end of each note. After that, each “note unit” is expressed in physical terms (e.g., 442 Hz, 0.52 seconds). The final step is then to map this physical information into familiar music-notation-like terms for each note (e.g., an A4 quarter note).
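
The final mapping step can be sketched for the example values above. The code assumes equal temperament with A4 = 440 Hz, a quarter note equal to one beat, and a tempo of 115 beats per minute; the tempo is invented for the example, since the text does not give one.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Note values expressed in beats, assuming a quarter note is one beat.
NOTE_VALUES = {"whole": 4.0, "half": 2.0, "quarter": 1.0,
               "eighth": 0.5, "sixteenth": 0.25}


def nearest_note(frequency_hz, a4_hz=440.0):
    """Name of the equal-tempered note closest to a measured frequency."""
    midi = round(69 + 12 * math.log2(frequency_hz / a4_hz))
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"


def nearest_note_value(duration_s, tempo_bpm):
    """Note value closest (on a ratio scale) to a measured duration."""
    beats = duration_s * tempo_bpm / 60.0
    return min(NOTE_VALUES,
               key=lambda name: abs(math.log2(beats / NOTE_VALUES[name])))


print(nearest_note(442.0))            # A4
print(nearest_note_value(0.52, 115))  # quarter
```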

Detailed computer steps behind automatic music transcription
In terms of actual computer processing, the principal steps are to 1) digitize the performed, analog music, 2) do successive short-term fast Fourier transforms (FFTs) to obtain the time-varying spectra, 3) identify the peaks in each spectrum, 4) analyze the spectral peaks to get pitch candidates, 5) connect the strongest individual pitch candidates to get the most likely time-varying pitch contour, and 6) map this physical data into the closest music-notation terms. These fundamental steps, originated by Piszczalski in the 1970s, became the foundation of automatic music transcription.

The most controversial and difficult step in this process is detecting pitch. The most successful pitch methods operate in the frequency domain, not the time domain. While time-domain methods have been proposed, they can break down for real-world musical instruments played in typically reverberant rooms.

The pitch-detection method invented by Piszczalski again mimics human hearing. It follows how only certain sets of partials “fuse” together in human listening. These are the sets that create the perception of a single pitch only. Fusion occurs only when two partials are within 1.5% of being a perfect harmonic pair (i.e., their frequencies approximate a low-integer ratio such as 1:2 or 5:8). This near-harmonic match is required of all the partials in order for a human to hear them as only a single pitch.
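
The 1.5% fusion criterion for a single pair of partials can be expressed as a small test. This is a sketch of the stated rule only: the upper limit of 8 on the integers considered "low" is an assumption (chosen so that the 5:8 example qualifies), and the function name is illustrative.

```python
def harmonic_pair(f_low, f_high, tolerance=0.015, max_integer=8):
    """Return the low-integer ratio (a, b) that f_low:f_high approximates
    to within `tolerance`, or None if the two partials would not fuse."""
    ratio = f_high / f_low
    best = None
    for a in range(1, max_integer + 1):
        for b in range(a + 1, max_integer + 1):
            target = b / a
            deviation = abs(ratio - target) / target
            # Keep the closest ratio; on ties the first one found wins,
            # which is the one in lowest terms (1:2 before 2:4).
            if deviation <= tolerance and (best is None or deviation < best[0]):
                best = (deviation, a, b)
    return None if best is None else (best[1], best[2])


print(harmonic_pair(200.0, 401.0))  # (1, 2): 0.25% away from a perfect octave
print(harmonic_pair(500.0, 800.0))  # (5, 8)
print(harmonic_pair(200.0, 290.0))  # None: no low-integer ratio within 1.5%
```

Applying this test to every pair in a set of partials gives a simple check of whether the whole set would be heard as one pitch.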