OMR engine output file format

Disclaimer: I cannot guarantee the correctness or completeness of the information here. I cannot guarantee that the format will not change in the future, though I have made a serious attempt to make it future-proof.

Overview

The format is text-based and human readable (with difficulty). It is not intended to be edited by hand.

It is extendible, so that programs that read the format should be able to read newer versions and skip the parts they don't understand. I doubt that I have achieved this aim entirely, but I hope the changes needed to keep reading programs up to date will be minimised.

It is only intended that information will be stored in this format as a transition into another notation format. It is too bulky and limited as a general purpose format.

SharpEye is written in C and the format reflects that. It is intended to be read using fscanf() and its structure closely reflects the C structs and arrays I use.

The output from the OMR engine (Liszt) is less interpreted than the output from SharpEye. The same format is used for both kinds of file.

The current interpretations done by SharpEye: SharpEye unifies time signatures (makes all time signatures occurring at the same time the same) and makes the score 'rectangular', ie the same number of staves per system. SharpEye attaches slurs/ties to notes where it can, and decides if they are ties or not. SharpEye attaches lyric syllables to notes where it can. SharpEye does rhythm analysis when it loads, and this includes guessing which notes belong to a triplet.

For these files I use the file extension .mro on Windows, and a file of type 'SharpEye' or 0x183 on RISC OS.

General Preliminaries

Syntactical structure

The first thing in the file is an identifier. The rest consists of

[name] [value]

pairs. Each [name] is a text string made of printable non-whitespace ASCII characters. It is a 'slot name' or 'field name'. The [value] can be of two main types: simple or complex.

Simple values are of two types. (1) They can be strings containing printable non-whitespace ASCII characters, usually representing numeric, boolean, or 'type' information. (2) They can be strings enclosed in double quotes (as in a CSV file) representing textual information. Non-ASCII characters may appear between the double quotes but nowhere else in the file.

A complex value is a list of [name] [value] pairs enclosed in curly brackets. Eg: { [name] [value] [name] [value] [name] [value] }

It is important to note that names, values, '{', and '}' are always separated from one another by whitespace.

The name preceding a textual value always ends with a '$'. Other names never do.

Many items in the file are in arrays or lists. In order to conform with the "name-value" structure, arrays look like:

arrayname { nof 2 elementname {...} elementname {...} }

(The 'nof' is literal, arrayname and elementname will vary.)

When reading this format, you should not assume that parts of any structure occur in any particular order. You should assume that you will find [name] tokens you don't understand.
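The skipping rule can be sketched as follows. This is only an illustration, not SharpEye's own reader: it works on a string in memory rather than using fscanf(), and the names next_token, skip_value and find_int_field are made up for the example. It reads whitespace-separated tokens, treats a double-quoted string as one token, and skips the value of any name it does not recognise, whether that value is simple or complex.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Append c to tok unless the buffer is full (one byte is kept for '\0'). */
static void tok_put(char *tok, size_t size, size_t *n, char c)
{
    if (*n + 1 < size)
        tok[(*n)++] = c;
}

/* Read the next token from *cur into tok and advance *cur past it.
   A token is a run of non-whitespace characters, or a double-quoted string
   (returned with its quotes; "" inside it is an escaped quote).
   Over-long tokens are truncated in tok but still consumed in full.
   Returns 1 on success, 0 at end of input. */
int next_token(const char **cur, char *tok, size_t size)
{
    const char *p = *cur;
    size_t n = 0;

    assert(size > 0);
    while (*p != '\0' && isspace((unsigned char)*p))
        p++;
    if (*p == '\0') {
        *cur = p;
        return 0;
    }
    if (*p == '"') {
        tok_put(tok, size, &n, *p++);              /* opening quote */
        while (*p != '\0') {
            if (p[0] == '"' && p[1] == '"') {      /* escaped quote */
                tok_put(tok, size, &n, '"');
                tok_put(tok, size, &n, '"');
                p += 2;
            } else {
                int closing = (*p == '"');
                tok_put(tok, size, &n, *p++);
                if (closing)
                    break;
            }
        }
    } else {
        while (*p != '\0' && !isspace((unsigned char)*p))
            tok_put(tok, size, &n, *p++);
    }
    tok[n] = '\0';
    *cur = p;
    return 1;
}

/* Skip one value: a simple token, or a whole { name value ... } list.
   Returns 1 on success, 0 if the input ends too early. */
int skip_value(const char **cur)
{
    char tok[256];

    if (!next_token(cur, tok, sizeof tok))
        return 0;
    if (strcmp(tok, "{") != 0)
        return 1;                     /* simple value, already consumed */
    for (;;) {
        if (!next_token(cur, tok, sizeof tok))
            return 0;                 /* unbalanced braces */
        if (strcmp(tok, "}") == 0)
            return 1;
        if (!skip_value(cur))         /* tok was a name: skip its value */
            return 0;
    }
}

/* Look through the name-value pairs in text for an integer field called
   wanted, skipping every other pair. Returns 1 and sets *out if found. */
int find_int_field(const char *text, const char *wanted, int *out)
{
    char name[256], value[256];
    const char *cur = text;

    while (next_token(&cur, name, sizeof name)) {
        if (strcmp(name, wanted) == 0) {
            if (!next_token(&cur, value, sizeof value))
                return 0;
            *out = (int)strtol(value, NULL, 10);
            return 1;
        }
        if (!skip_value(&cur))
            return 0;
    }
    return 0;
}
```

Because a quoted string is kept as a single token, a '{' or '}' inside a textual value is never mistaken for structure. The reserved names 'comment' and 'comment$' need no special handling: they are skipped like any other unknown name.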

Within reason, you should not count on finding things within a structure. For example, you won't find a list of clefs in a bar with no clefs in it, or a list of lyric lines for a stave if there are no lyrics. In most cases, the bottom level fields will all be present. It would be absurd to have a note with no pitch, for example. Mostly, information will be explicitly present even when there are obvious defaults, eg the note head structure will say "accid None", but it is safest for future compatibility to construct defaults, then overwrite them with what you find in the file (if you find anything).
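A minimal sketch of the "defaults first, then overwrite" approach for a note head. The struct layout, the function names and the default values chosen here (a Solid head on the midline with no accidental) are assumptions made for the illustration, not something the format prescribes. The field names are those of the note structure described later in this document.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char shape[16];
    int  staveoffset;
    int  p;
    char accid[16];
    int  accid_dc;
    int  normalside;              /* 1 for True, 0 for False */
} NoteHead;

/* Fill in every field before reading anything from the file. */
void note_set_defaults(NoteHead *n)
{
    assert(n != NULL);
    strcpy(n->shape, "Solid");    /* assumed default */
    n->staveoffset = 0;
    n->p = 0;
    strcpy(n->accid, "None");
    n->accid_dc = 0;
    n->normalside = 1;
}

/* Overwrite one field from a name-value pair already read from the file.
   Returns 1 if the name was recognised, 0 if it was ignored. */
int note_apply_field(NoteHead *n, const char *name, const char *value)
{
    assert(n != NULL && name != NULL && value != NULL);
    if (strcmp(name, "shape") == 0)
        snprintf(n->shape, sizeof n->shape, "%s", value);
    else if (strcmp(name, "staveoffset") == 0)
        n->staveoffset = atoi(value);
    else if (strcmp(name, "p") == 0)
        n->p = atoi(value);
    else if (strcmp(name, "accid") == 0)
        snprintf(n->accid, sizeof n->accid, "%s", value);
    else if (strcmp(name, "accid_dc") == 0)
        n->accid_dc = atoi(value);
    else if (strcmp(name, "normalside") == 0)
        n->normalside = (strcmp(value, "True") == 0);
    else
        return 0;                 /* unknown name: leave the struct alone */
    return 1;
}
```

With this arrangement a missing field and an unknown field are both harmless: the first leaves the default in place, the second is reported to the caller and otherwise ignored.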

Low level interpretations

The value 'True' means true or present, 'False' for false or absent. Eg:

staccato True

- this note has a staccato dot.

Integer values are represented in decimal form with an optional minus sign. Eg:

nofpages 3

- there are 3 pages in this score.

A pair of integer values is represented by two decimal numbers with a comma between, often to represent a position. Eg:

flagposn 66,86

- the flag position of this chord is 66 units from the top and 86 units from the left of the stave's top-left.

A rational number is represented by two decimal numbers with a forward slash '/' between, often to represent a time. Eg:

tuplettransform 2/3

- this note is to last 2/3 of its normal value.

There are no floating point numbers in the current version.
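The simple value forms above can be decoded with sscanf() once a value token has been isolated. This is a sketch only; the function names are invented for the example. The trailing %c in each format string is there to reject tokens with extra characters after the number.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* 'True' or 'False' (case-sensitive). Returns 1 on success. */
int parse_bool(const char *s, int *out)
{
    assert(s != NULL && out != NULL);
    if (strcmp(s, "True") == 0)  { *out = 1; return 1; }
    if (strcmp(s, "False") == 0) { *out = 0; return 1; }
    return 0;
}

/* A pair such as 66,86 (row first, then column). Returns 1 on success. */
int parse_pair(const char *s, int *row, int *col)
{
    char extra;

    assert(s != NULL && row != NULL && col != NULL);
    return sscanf(s, "%d,%d%c", row, col, &extra) == 2;
}

/* A rational such as 2/3. The 'unknown' value 0/0 is accepted as it
   stands, so check the denominator before dividing. Returns 1 on success. */
int parse_rational(const char *s, int *num, int *den)
{
    char extra;

    assert(s != NULL && num != NULL && den != NULL);
    return sscanf(s, "%d/%d%c", num, den, &extra) == 2;
}
```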

Text strings are enclosed in double quotes, using "" within the string if a double quote is needed. The text may be encoded in ASCII or ISO8859-1, and possibly UTF8 or others in the future. Note that ISO8859-1 and UTF8 are both extensions of ASCII, so a string in ASCII is the same in all three encodings.
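Decoding a quoted value is a matter of stripping the outer quotes and collapsing each doubled quote. A sketch, with a made-up function name, assuming the whole quoted token (quotes included) has already been isolated:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Decode a quoted token such as "say ""hi""" into out. Bytes are copied
   unchanged, so the result stays in whatever encoding the file uses.
   Returns 1 on success, 0 if the token is malformed or out is too small. */
int unquote(const char *tok, char *out, size_t outsize)
{
    size_t len, i, n = 0;

    assert(tok != NULL && out != NULL);
    len = strlen(tok);
    if (outsize == 0 || len < 2 || tok[0] != '"' || tok[len - 1] != '"')
        return 0;
    for (i = 1; i < len - 1; i++) {
        if (tok[i] == '"') {
            if (i + 1 >= len - 1 || tok[i + 1] != '"')
                return 0;          /* lone quote inside the string */
            i++;                   /* skip the first quote of the pair */
        }
        if (n + 1 >= outsize)
            return 0;              /* no room left for this byte and '\0' */
        out[n++] = tok[i];
    }
    out[n] = '\0';
    return 1;
}
```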

Some general conventions

These are not strictly adhered to. They are to aid readability.

* Use all lower case for slots (field names).

* Use upper case initials for types/shapes.

* For integer values which are normally nonnegative, use -1 for impossible/nonexistent/unknown. Use 0/0 for a similar purpose for rationals.

Comments

The names 'comment' and 'comment$' are reserved. They will never be used to represent any musical element. Therefore, as long as the values following them obey the syntax, they will be skipped by a reading program.

The most generally useful form is:

comment$ "This is a comment"

Units

All graphical coordinates increase to right and down. They are written as row,column pairs, ie y,x.

There are 16 units between stave lines in the output, at least for the current version. Nearly all coordinates are in these units. Exceptions will be pointed out.

Some values are in 'pitch units' where the midline of the stave is zero, with values going up towards the bottom. So the note B on the midline of a treble stave is 0, C is -1, D is -2, etc.

Input units are in image pixels.

String values

Here is a list of strings to represent types and shapes of things.

boolean type: True False

accidental type: None Sharp Flat Natural DoubleSharp DoubleFlat NaturalSharp NaturalFlat

note/rest shape: Breve Sbreve Minim Solid Breverest Sbreverest Minimrest Crotchetrest Quaverrest Squaverrest DSquaverrest HDSquaverrest MultiBarRest

clef shape: Treble Bass Alto TrebleUp8 TrebleDown8

barline shape: Single Double Leftrepeat Rightrepeat Backtobackrepeat

Main structure

The top level structure is

fileheader {...} score {...}

Since the format is extendible, there could be other structure in later versions. Things will be added within the score structure, so what follows is a minimum you can expect to find.

A score has some information to itself, plus a list of pages.

A page has some information to itself, plus a list of systems.

A system has some information to itself, plus a list of staves, plus a list of slurs/ties.

A stave has some information to itself, plus a list of bars, plus a list of lyric lines, plus a list of dynamics (ppp...fff and hairpins).

A bar has some information to itself, plus a list of clefs, a list of keysigs, a list of chords, a bar line, and possibly a timesig.

A chord is fairly complex, and is used to represent single notes and rests as well as proper chords.

A slur represents a slur or tie or phrase mark.

A lyric line has some information to itself, plus a list of elements (syllables).

fileheader {...} is

version [N] characterencoding [E]

[N] is the version number of the file as an integer. It is 1000-1999 for version 1 of SharpEye (currently only 1000 used). It is 2000-??? for version 2 of SharpEye. Currently 2000 (SharpEye 2.00-2.10), 2011 (SharpEye 2.11-2.30), 3000 (SharpEye 2.31-2.49), 3100 (SharpEye 2.50-??) are possible.

[E] is "ASCII" or "ISO88591" or "UTF8". It is ASCII in v1000, and ISO88591 in v2000,v2011,v3000,v3100.

score {...} is

score
{
title$ [T] unitsperstavespacing [U]  preedit [P] pages { nof [N] page {...} ... page {...} }
}

[T] is the title of the piece of music. It may be the empty string, ie "". [U] is the number of units per stave spacing. In the current version (SharpEye 1 and 2) this is 16. Thus a normal 5-line staff is 64 units high. The positions of objects are stored in these units.

Note that the file contains some positions relating to the input image. These are in pixels. When generating another format you would skip these, and they are ignored here.

[P] is for use by SharpEye.

[N] is the number of pages in the score.

page {...} is

page
{
width [W] height [H] origwidth [IW] origheight [IH] 
skewangle [K] rowoffset [R] coloffset [C] 	[S]
imagefpath$ [PATH] 
systems { nof [N] system {...} ... system {...} }
}

[W] is page width, [H] is page height in output units.

[IW] is image width, [IH] is image height in pixels.

[K], [R], [C], are for mapping between output and input coordinates by SharpEye.

[S] is the input staff spacing in units of 1024 per pixel, for the 'dominant' size of staff on the page. In general that means the most common size of staff, but don't count on that: since the scale is estimated before staves are found, it is even possible that no staves of the dominant size will be found. This number can be used together with unitsperstavespacing in the score structure to relate the dimensions in the output to the original image.

[PATH] is the file path of the input image.

[N] is the number of systems in the page.

system {...} is

system
{ 
top [T] left [L] width [W] height [H]
staves { nof [N] stave {...} ... stave {...} }
slurs { nof [N] slur {...} ... slur {...} }
}

[T] is the distance between top of page and top of system. [L] is the distance between left of page and left of system. [W] is the width of the system, and [H] is its height, from top of top stave to bottom of bottom stave.

stave {...} is

stave
{
top [T] left [L] width [W] size [Z] voicessplit [VS] joinedtobelow [JB]
bars { nof [N] bar {...} ... bar {...} }
lyriclines { nof [N] lyricline {...} ... lyricline {...} }
texts { nof [N] text {...} ... text {...} }
dynamics { nof [N] dynamic {...} ... dynamic {...} }
}

[T] is the distance between top of page and top of stave. [L] is the distance between left of page and left of stave. [W] is the width of the stave. In SharpEye version 1 and 2 at least [L] and [W] will be identical to the values for the system.

[Z] is the (vertical) size of the stave. It will normally be very close to 64 since the spacing between lines is 16 units. For a stave which is not the dominant size on the page, it may be bigger or smaller. Also see the 'spacing' field in the page structure.

[VS] is 'True' or 'False'. It will always be False in an mro file directly from the recognition engine, but can be set by a user of SharpEye. [JB] is 'True' or 'False'. From 2.50, stave braces such as those that join left and right hand piano staves are recognised and will result in this being True. Both [VS] and [JB] affect export from SharpEye as NIFF, MusicXML, MIDI.

Version 2000: dynamics are new. Version 2011: texts are new.

bar {...} is

bar
{
clefs { nof [N] clef {...} ... clef {...} }
keysigs { nof [N] keysig {...} ... keysig {...} }
timesig {...}
chords { nof [N] chord {...} ... chord {...} }
barline {...}
}

Note that a 'bar' is a physical/graphical bar, not always a musical/logical bar. A bar ends at a barline, but that barline may be a double bar line, or a repeat sign and does not always mean the end of a musical bar.

Note also that symbols in the bar are stored by type, and not left to right. They need to be sorted in order to make musical sense of them. The symbols have position information, and this can be used for sorting.
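One way to recover the left-to-right order is to gather the symbols of a bar into one array and sort it by column. The sketch below assumes the column is the c part of each symbol's position (centre for clefs, keysigs and timesigs, flagposn for chords); the type names and the tie-break rule for equal columns are choices made for the example.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { SYM_CLEF, SYM_KEYSIG, SYM_TIMESIG, SYM_CHORD } SymbolKind;

typedef struct {
    SymbolKind kind;
    int col;      /* the c part of the symbol's r,c position */
    int index;    /* index into the per-type list the symbol came from */
} BarSymbol;

/* Order by column; for equal columns use clef, keysig, timesig, chord. */
static int compare_by_col(const void *a, const void *b)
{
    const BarSymbol *x = a;
    const BarSymbol *y = b;

    if (x->col != y->col)
        return (x->col > y->col) - (x->col < y->col);
    return (int)x->kind - (int)y->kind;
}

void sort_bar_symbols(BarSymbol *syms, size_t n)
{
    assert(syms != NULL || n == 0);
    if (n > 1)
        qsort(syms, n, sizeof syms[0], compare_by_col);
}
```

The index field lets the sorted array refer back to the original clef, keysig or chord structures without copying them.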

clef {...} is

clef { shape [S] centre [r,c] pitchposn [P] }

[S] is one of 'Treble' 'Bass' or 'Alto' (G clef, F clef, C clef).

[r,c] is the position relative to stave top-left of the centre of the clef.

[P] is the 'pitch position' of the clef. It is in pitch units. For a standard treble clef it will be 2, bass clef -2, alto 0, tenor -2. Currently you won't see any other values. It doesn't make much odds for now, but the [P] value should be used in preference to the r value ready for the day when eg baritone clefs are recognised.

Version 2000: [S] can now be 'TrebleUp8' 'TrebleDown8' as well as the above, meaning a treble clef with a little 8 top or bottom to indicate an octave shift up or down.

Version 2000: [P] can now be any of -4,-2,0,2,4 for Alto clefs.
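As an illustration of how the clef's [P] value and a note's pitch position combine, the sketch below turns them into a note name. It rests on an assumption the text above implies but does not state outright: that [P] is the pitch position of the clef's reference note (G4 for a treble clef, F3 for a bass clef, C4 for a C clef, with C4 as middle C), shifted by an octave for TrebleUp8 and TrebleDown8. Accidentals and key signatures are ignored, and the function names are invented.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Diatonic step number: octave * 7 + letter, with C = 0 ... B = 6. */
#define STEP(letter, octave) ((octave) * 7 + (letter))

/* Step number of the note a clef shape marks, or -1 if unknown. */
int clef_reference_step(const char *shape)
{
    assert(shape != NULL);
    if (strcmp(shape, "Treble") == 0)      return STEP(4, 4);  /* G4 */
    if (strcmp(shape, "TrebleUp8") == 0)   return STEP(4, 5);  /* G5 */
    if (strcmp(shape, "TrebleDown8") == 0) return STEP(4, 3);  /* G3 */
    if (strcmp(shape, "Bass") == 0)        return STEP(3, 3);  /* F3 */
    if (strcmp(shape, "Alto") == 0)        return STEP(0, 4);  /* C4 */
    return -1;
}

/* Write the name of the note at pitch position p (eg "B4") into out.
   Pitch units grow downwards, so a note above the clef's reference
   line has a smaller p than clef_posn. Returns 1 on success. */
int note_name(const char *clef_shape, int clef_posn, int p,
              char *out, size_t outsize)
{
    int ref, step;

    assert(out != NULL);
    ref = clef_reference_step(clef_shape);
    if (ref < 0)
        return 0;
    step = ref + (clef_posn - p);
    if (step < 0)
        return 0;
    snprintf(out, outsize, "%c%d", "CDEFGAB"[step % 7], step / 7);
    return 1;
}
```

Under that assumption a treble clef with [P] 2 gives B for pitch position 0 and C for -1, which agrees with the definition of pitch units, and an Alto shape with [P] -2 behaves as a tenor clef.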

keysig {...} is

keysig { key [K] centre [r,c] }

[K] is an integer in the range -7 to 7. Negative numbers count flats, and positive ones count sharps.

[r,c] is the position relative to stave top-left of the centre of the keysig.
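A small sketch of interpreting [K]. The file only gives the count of sharps or flats, so the major key name returned here is just one possible reading (the relative minor has the same signature). The function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Name of the major key with this signature, or NULL if out of range. */
const char *major_key_name(int key)
{
    static const char *const names[15] = {
        "Cb", "Gb", "Db", "Ab", "Eb", "Bb", "F",   /* -7 .. -1 (flats)  */
        "C",                                       /*  0                */
        "G", "D", "A", "E", "B", "F#", "C#"        /*  1 ..  7 (sharps) */
    };

    if (key < -7 || key > 7)
        return NULL;
    return names[key + 7];
}

/* Split [K] into separate counts of sharps and flats. Returns 1 on
   success, 0 if key is outside -7..7. */
int keysig_counts(int key, int *sharps, int *flats)
{
    assert(sharps != NULL && flats != NULL);
    if (key < -7 || key > 7)
        return 0;
    *sharps = key > 0 ? key : 0;
    *flats  = key < 0 ? -key : 0;
    return 1;
}
```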

timesig {...} is

timesig
{
showasalpha [F] top [T] bottom [B] centre [r,c] timeslice [p/q]
}

[F] is either 'True' or 'False'. If True it means the time signature is displayed as a single symbol for common or alla breve time. Otherwise it is displayed as two numbers.

[T] is the top number, and [B] is the bottom. These values are valid even if [F] is True (in which case they would be 2 and 2, or 4 and 4).

[r,c] is the position relative to stave top-left of the centre of the timesig.

chord {...} is

chord
{
virtualstem [V] stemup [U] stemslash [SS]
tuplettransform [p/q] tupletcount [TC]
nofmmrestbars [MMR]
accent [E] staccato [E] marcato [E] staccatissimo [E] tenuto [E] pause [E]
upbow [E] downbow [E] trill [E] mordent [E] invmordent [E]
staccato_dr [SDR] tenuto_dr [TDR] pause_dr [PDR] accent_dr [ADR]
naugdots [R] nflags [F] flagposn [r,c] headend [H]
beam {...}
notes { nof [N] note {...} ... note {...} }
}

From version 2000, tupletcount [TC] is replaced by tupletID [TI].

nofmmrestbars is new in version 3100. marcato, staccatissimo, upbow, downbow, trill, mordent, invmordent are new in version 3100.

Like NIFF's stem, and ENIGMA's entry, this structure represents chords, single notes, and rests. Single notes are regarded as 'degenerate' chords, and rests as silent chords.

[V] is 'True' or 'False'. If True it means there is no stem, ie the chord is a breve, semi-breve or rest. (It is redundant but convenient.)

[U] is 'True' or 'False'. If True it means the stem points up from the note(s).

[SS] is 'True' or 'False'. If True it means the stem has a slash, as in an acciaccatura grace note.

[p/q] is the multiplier applied to the time to deal with tuplets. It is 1/1 for most notes, and 2/3 for notes in triplets.

[TC] is 0 for notes not in a tuplet, otherwise a count from 1. This is used when editing. Tuplets are an area that needs reworking, and you should ignore this. (SharpEye version 1)

[TI] is -1 for notes not in a tuplet, otherwise an integer >= 0 that uniquely identifies the tuplet within a bar. (Version 2000 onwards)

[MMR] is the number of bars (measures) in the multi-measure rest.

[E] is 'True' or 'False', and signifies the presence or absence of various expression marks, articulations and ornaments. 'pause' means a fermata sign. 'invmordent' is an inverted mordent. The others should be clear. It is likely that these will not be present in later versions if the value is False. Eg. there will either be "staccato True" or nothing.

[SDR], [TDR], [PDR], [ADR] are not yet implemented. They are the vertical offset of the centre of the expression mark from the chord's flag position. They are therefore positive if the expression is below the flag.

[R] is the number of augmentation dots following the chord. It is 0,1,2 or 3.

[F] is the number of flags on a chord which is not a rest, or part of a beamed group. It is 1 for a quaver, 2 for a semi-quaver, etc. NB: This applies to grace notes as well as normal notes. An earlier version of this documentation said otherwise.

Also note that flags on grace notes are not currently counted (August 2001) so this field will be 1 for all grace notes for the time being.

[SS] is True or False. If True it means the stem has a slash, eg for acciaccatura. This field will not always be present, so assume a default of False when reading. Stem slashes are not currently (August 2001) recognised by the engine. (New in version 2000).

[r,c] is the position of the flag or beam end of the stem on this chord. In the case of a stemless note (rest, breve, semi-breve) the c value is still valid, and is the centre of the note, chord or rest.

[H] is the position of the head that is furthest from the flag or beam. It is in 'pitch' units, which means the midline of the stave is zero, with values going up towards the bottom. So the note B on the midline of a treble stave is 0, C is -1, D is -2, etc.

Note that there will always be at least one note in the note list, which has further information.
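To show how [R] and the tuplet multiplier [p/q] act on a chord's length, here is a sketch using exact rational arithmetic. It takes the undotted written value as an input (as a fraction of a semibreve, eg 1/4 for a crotchet) rather than deriving it from the note shape, flags and beams, which is left out. The type and function names are invented for the example.

```c
#include <assert.h>

typedef struct { int num, den; } Rational;

static int gcd(int a, int b)
{
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a < 0 ? -a : a;
}

/* Duration of a chord as a fraction of a semibreve.
   base     - undotted written value, eg 1/4 for a crotchet
   naugdots - number of augmentation dots, 0 to 3
   tuplet   - the tuplettransform value, 1/1 outside tuplets
   Each dot adds half of the previous value, so n dots multiply the
   base by (2^(n+1) - 1) / 2^n. */
Rational chord_duration(Rational base, int naugdots, Rational tuplet)
{
    Rational r;
    int pow2, g;

    assert(base.den != 0 && tuplet.den != 0);
    assert(naugdots >= 0 && naugdots <= 3);
    pow2 = 1 << naugdots;
    r.num = base.num * (2 * pow2 - 1) * tuplet.num;
    r.den = base.den * pow2 * tuplet.den;
    g = gcd(r.num, r.den);
    if (g > 1) {
        r.num /= g;
        r.den /= g;
    }
    return r;
}
```

For example, a dotted crotchet comes out as 3/8 and a quaver in a triplet (tuplettransform 2/3) as 1/12.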

beam {...} is

beam
{ id [I] nofnodes [N] nofleft [L] nofright [R] }

Like NIFF, beams are made of 'nodes'. There is one node for each chord that the beam joins.

[I] is an integer which uniquely identifies the beam in the bar.

[N] is the number of chords joined by the beam.

[L] is the number of beam-parts that point left from the chord.

[R] is the number of beam-parts that point right from the chord.

note {...} is

note 
{
shape [S] staveoffset [O] p [P] accid [A] accid_dc [DC] normalside [N]
}

[S] is one of 'Breve' 'SBreve' 'Minim' 'Solid' 'Grace' 'BreveRest' 'SBreveRest' 'MinimRest' 'CrotchetRest' 'QuaverRest' 'SQuaverRest' 'DSQuaverRest' 'HDSQuaverRest' 'MultiBarRest'

Version 2000: Grace notes are new. Version 3100: multi-bar rests are new.

[O] is the stave offset of this notehead. It is usually zero, meaning that the notehead belongs to the same stave as the chord structure. However, when a chord or beamed group spans more than one stave, it is regarded as belonging logically to the uppermost stave on which it has any noteheads, and any noteheads which belong to staves below this will have a positive stave offset. The engine currently doesn't recognise multi-stave objects like this, so [O] will be 0.

Version 2000: The engine does now recognise multi-stave objects, but it chops them up into single-stave objects, so [O] remains at zero.

[P] is the pitch position, using the same encoding as the [H] field in the chord structure. It also gives the vertical position of a rest in the case where a chord structure is used for a rest. In this case, [P] is the position of the centre of the rest in most cases, but the top of the rest for a semibreve rest, and the bottom for a minim rest.

[A] is one of 'None' 'Sharp' 'Flat' 'Natural' 'DoubleSharp' 'DoubleFlat' 'NaturalSharp' 'NaturalFlat'

[DC] is the horizontal offset of the accidental if any. It is measured from the left edge of the notehead to the centre of the accidental. It is negative (unless a recognition error has occurred).

[N] is 'True' or 'False'. If False, it means that the head goes the wrong way, as used for chords with second intervals in them.

barline {...} is 

barline
{
type [T] leftlinex [L] rightlinex [R] trueend [E] invented [I]
}

[T] is one of 'Single' 'Double' 'Leftrepeat' 'Rightrepeat' 'Backtobackrepeat' 'ThinThick'

[L] is the x-posn of the centre of the leftmost vertical line in the barline, relative to the stave left.

[R] is the x-posn of the centre of the rightmost vertical line in the barline, relative to the stave left.

[I] is 'True' or 'False'. If True, it means the barline was invented by the recognition engine. This sometimes happens at the end of a stave.

slur {...} is

slur
{
leftpt [LR,LC] rightpt [RR,RC] radius [RAD]
partner [P]
}

Slurs, ties, phrase marks, any other curves found are approximated by an arc, and assigned to a system, but not interpreted further by the OMR engine.

The coordinates are relative to the system top left. [RAD] is a signed value: a negative value means the slur is above the centre of the arc, ie it is like /^\, and a positive value means \_/. The absolute value of [RAD] is the radius.

lyricline {...} is

lyricline
{
abot [A] height [H] style [S] elements { nof [N] lyricelement {...} ... lyricelement {...} }
}

[A] is the vertical coordinate of the line relative to the top of the stave. It is the position of the baseline of the text, the bottom of an 'a', not a 'g'.

[H] is the height of the text. This is the point size of the text (height of 'g'). Not properly implemented until version 2011 of this format (version 2.11 of SharpEye). If reading an earlier version than 2011, you should ignore this and make up a default. 36 is about right, ie 2.25 stave spacings. From 2011 this is based on the scan. It is probably best to average these values for all the lyrics in the score.

[S] is new in v2011. It is the font style.

0     sanserif (Arial)         plain
1     sanserif (Arial)         bold
2     sanserif (Arial)         italic
3     sanserif (Arial)         bold italic
4     serif (Times)            plain
5     serif (Times)            bold
6     serif (Times)            italic
7     serif (Times)            bold italic
8     monospaced (Courier)     plain
9     monospaced (Courier)     bold
10    monospaced (Courier)     italic
11    monospaced (Courier)     bold italic
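The table follows a regular pattern: the family changes every four values, and within a family the two low bits select bold and italic. A sketch of decoding it that way follows. The pattern is read off the table above rather than stated as a rule of the format, so values outside 0 to 11 are rejected rather than guessed. The names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { FONT_SANSERIF = 0, FONT_SERIF = 1, FONT_MONOSPACED = 2 } FontFamily;

typedef struct {
    FontFamily family;
    int bold;      /* 1 if bold */
    int italic;    /* 1 if italic */
} FontStyle;

/* Decode a style number from the table. Returns 1 on success, 0 if the
   value is outside the documented range 0 to 11. */
int decode_font_style(int style, FontStyle *out)
{
    assert(out != NULL);
    if (style < 0 || style > 11)
        return 0;
    out->family = (FontFamily)(style / 4);
    out->bold   = (style & 1) != 0;
    out->italic = (style & 2) != 0;
    return 1;
}
```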

lyricelement {...} is

lyricelement
{
extender [E] c0 [L] c1 [R] text$ [TEXT] midc [M] bar [B] symbol [S]
}

[E] is 'True' or 'False'. If true it means this lyricelement is an extender line like this_______ and the text$ field should be ignored. Currently, [E] will always be False as extender lines are not recognised.

[L], [R], [M] are the left, right, and middle x-posns of the element. I intend using [L] and [R] for extender lines and [M] for syllables. Currently you should only rely on [M].

If [E] is False, [TEXT] is the text of the syllable, in double quotes. Syllables with one or more hyphens following are represented by making the last character in the syllable a hyphen.
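When converting to another format it is often useful to separate the trailing hyphen from the syllable itself. A minimal sketch, with an invented function name; it works on the decoded text (quotes already removed) and leaves a text consisting of a single hyphen untouched, which is a choice made for the example.

```c
#include <assert.h>
#include <string.h>

/* If text ends with a hyphen, remove it in place and return 1, meaning
   the word continues in a later syllable. Otherwise return 0. */
int strip_syllable_hyphen(char *text)
{
    size_t len;

    assert(text != NULL);
    len = strlen(text);
    if (len > 1 && text[len - 1] == '-') {
        text[len - 1] = '\0';
        return 1;
    }
    return 0;
}
```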

text {...} is

text
{
abot [A] c0 [L] height [H] style [S] type [T] text$ [TEXT]
}

[A] is the baseline of the text, relative to the top of the stave.

[L] is the left of the text, relative to the left of the stave.

[H] is the height of the text.

[S] is the font style.

[T] is the type (function) of the text. 0 is musical direction, 1 is chord. Currently, 0 means anything that isn't a lyric or a chord, since the recognition does not distinguish further. Version 3000: type is new.

[TEXT] is the text itself.

See lyricline for details of [H] and [S].

dynamic {...} is

dynamic
{
type [T] c [C] c0 [C0] c1 [C1] r [R]
}

[T] is one of 'Hairpindim' 'Hairpincres' 'Dyn_pppp' 'Dyn_ppp' 'Dyn_pp' 'Dyn_p' 'Dyn_mp' 'Dyn_mf' 'Dyn_f' 'Dyn_ff' 'Dyn_fff' 'Dyn_ffff' 'Dyn_fp' 'Dyn_sf' 'Dyn_fz' 'Dyn_sfz' 'Dyn_sffz' 'Dyn_sfp' 'Dyn_sfpp'

For Hairpindim, Hairpincres, [R] is the vertical centre, [C0] is the left, [C1] is the right.

For the others, [R] is the baseline of the text, and [C] is the horizontal centre.

All positions are relative to the stave top left.

Version 3100: Dyn_pppp, Dyn_ffff, Dyn_fp, Dyn_sf, Dyn_fz, Dyn_sfz, Dyn_sffz, Dyn_sfp, Dyn_sfpp are new.