Skip to main content

Grammar-Based Parser

Versionator uses a formal EBNF grammar to parse version strings. This page explains the approach, its benefits, and trade-offs.

Why a Formal Grammar?

Most version parsers use ad-hoc string manipulation or regular expressions. Versionator takes a different approach: we define a formal grammar based on the SemVer 2.0.0 specification and generate a parser from it.

The Problem with Ad-Hoc Parsing

Traditional regex-based version parsing has issues:

// Common regex approach - fragile and hard to maintain
var semverRegex = regexp.MustCompile(
`^v?(\d+)\.(\d+)\.(\d+)(-[0-9A-Za-z-.]+)?(\+[0-9A-Za-z-.]+)?$`)

Problems:

  • Hard to extend - Adding support for new formats requires regex surgery
  • Difficult to validate - Edge cases (leading zeros, empty identifiers) need separate checks
  • Poor error messages - Regex failures give no context about what went wrong
  • Specification drift - Easy to deviate from the SemVer spec without noticing

The Grammar Approach

Instead, we define the grammar formally:

version = [ prefix ], version-core ;
prefix = "v" | "V" ;

version-core = semver | partial-version ;

semver = major, ".", minor, ".", patch, [ pre-release ], [ build-metadata ] ;

pre-release = "-", pre-release-identifier, { ".", pre-release-identifier } ;
pre-release-identifier = alphanumeric-identifier | numeric-identifier ;

numeric-identifier = "0" | positive-digit, { digit } ;

This grammar is:

  • Readable - Maps directly to the SemVer specification
  • Verifiable - Can be reviewed against the spec
  • Self-documenting - The grammar IS the documentation

Implementation

Versionator uses Participle, a parser generator for Go that builds parsers from struct tags.

Lexer

The lexer tokenizes input into meaningful units:

var VersionLexer = lexer.MustSimple([]lexer.SimpleRule{
{Name: "Prefix", Pattern: `[vV]`},
{Name: "Number", Pattern: `[0-9]+`},
{Name: "Dot", Pattern: `\.`},
{Name: "Dash", Pattern: `-`},
{Name: "Plus", Pattern: `\+`},
{Name: "Ident", Pattern: `[a-zA-Z][a-zA-Z0-9-]*`},
{Name: "Mixed", Pattern: `[0-9]+[a-zA-Z][a-zA-Z0-9-]*`},
})

AST

The grammar maps to Go structs:

type Version struct {
Prefix string `parser:"@Prefix?"`
Core *VersionCore `parser:"@@"`
PreRelease []*Identifier `parser:"('-' @@ ('.' @@)* )?"`
BuildMetadata []*Identifier `parser:"('+' @@ ('.' @@)* )?"`
Raw string
}

type VersionCore struct {
Major *int `parser:"@Number"`
Minor *int `parser:"('.' @Number)?"`
Patch *int `parser:"('.' @Number)?"`
Revision *int `parser:"('.' @Number)?"`
}

Validation

After parsing, semantic validation ensures SemVer compliance:

  • No leading zeros in numeric identifiers (except 0 itself)
  • Pre-release identifiers contain only [0-9A-Za-z-]
  • Build metadata identifiers contain only [0-9A-Za-z-]
  • Only v or V prefixes allowed

Benefits

1. Correctness by Construction

The grammar enforces structure that regex can't:

1.2.3-alpha.1      ✓ Valid
1.2.3-alpha..1 ✗ Empty identifier (caught by grammar)
1.2.3-alpha.01 ✗ Leading zero (caught by validation)

2. Better Error Messages

Grammar-based parsing knows exactly where parsing failed:

parse error: 1:6: unexpected token "."
^
1.2..3

3. Easy Extension

Adding support for new formats requires adding grammar rules, not debugging regex:

(* Add 4-component .NET assembly versions *)
assembly-version = major, ".", minor, ".", build, ".", revision ;

4. Testability

Grammar rules can be tested in isolation:

func TestParse_PreRelease(t *testing.T) {
tests := []struct{
input string
expected []string
}{
{"1.0.0-alpha", []string{"alpha"}},
{"1.0.0-alpha.1", []string{"alpha", "1"}},
{"1.0.0-alpha.beta.1", []string{"alpha", "beta", "1"}},
}
// ...
}

5. Documentation as Code

The grammar file serves as executable documentation:

  • docs/grammar/version.ebnf - Complete formal grammar
  • docs/grammar/railroad.html - Visual railroad diagrams

Trade-offs

Complexity

A grammar-based parser is more complex than a simple regex:

ApproachLines of CodeDependencies
Regex~50None
Grammar~300participle

Build-Time Cost

The parser is constructed at init time, adding ~5ms to startup. This is negligible for CLI usage but measurable.

Learning Curve

Contributors need to understand:

  • EBNF grammar notation
  • Participle's struct tag syntax
  • Lexer token definitions

When This Matters

The grammar approach pays off when:

  1. Parsing complex formats - SemVer has many edge cases
  2. Strict compliance needed - We claim SemVer 2.0.0 compliance
  3. Multiple formats supported - Go pseudo-versions, assembly versions, partial versions
  4. Good error messages matter - CLI users need actionable feedback

Philosophy: The Joy of Excess

Let's be honest: a regex would do.

var semverRegex = regexp.MustCompile(
`^v?(\d+)\.(\d+)\.(\d+)(-[0-9A-Za-z-.]+)?(\+[0-9A-Za-z-.]+)?$`)

That's 80 characters. It would parse 99% of version strings correctly. Ship it.

Instead, versionator has a 370-line formal EBNF grammar covering SemVer, Go pseudo-versions, .NET assembly versions, NuGet, npm, CalVer, and template variables. Complete with ISO/IEC 14977 notation, proper CC BY 3.0 licensing attribution for the SemVer spec, and railroad diagram generation.

This is, by any reasonable measure, excessive.

And that's exactly the point.

Versionator is a hobby project solving a real problem. The problem could be solved with less. But hobby projects aren't about less. They're about:

  • Doing it right - Not "right enough," but right. The SemVer spec is prose; a formal grammar removes ambiguity.
  • Learning by building - Writing an EBNF grammar teaches you things a regex never will.
  • Joy in craft - There's satisfaction in a grammar that could serve as a community reference, even if it only serves one CLI tool.
  • Permissible digressions - In professional work, you ship the regex. In passion projects, you write the grammar, generate the railroad diagrams, and document the precedence rules.

The grammar exists because it was fun to write, correct to use, and excessive in exactly the right way for a project where the journey matters as much as the destination.

If you're evaluating versionator for production use: the grammar works, it's well-tested, and it handles edge cases gracefully. If you're evaluating versionator as a codebase to learn from: welcome, fellow enthusiast.

Grammar Reference

The complete grammar is in docs/grammar/version.ebnf.

Key rules:

RuleDescription
versionTop-level: optional prefix + version-core
prefixOnly v or V per SemVer convention
semverFull Major.Minor.Patch with optional pre-release/metadata
partial-versionMajor only or Major.Minor (defaults missing to 0)
pre-releaseDash-prefixed, dot-separated identifiers
build-metadataPlus-prefixed, dot-separated identifiers
numeric-identifierNo leading zeros (except "0")
alphanumeric-identifierContains at least one letter

Community Use

The EBNF grammar in docs/grammar/version.ebnf is available for the broader SemVer community.

As Implemented by the Community

This grammar represents SemVer as implemented by the community, not just the formal specification. We've tried to capture:

  • SemVer 2.0.0 core - The official specification
  • Common extensions - Go module pseudo-versions, npm conventions
  • Practical variations - Partial versions (1.2), optional prefix
  • Adjacent ecosystems - Microsoft Assembly versions (4-component)

We explicitly chose not to make unilateral modifications to the SemVer core. Where the community has established patterns (like the v prefix), we document and support them. Where ecosystems have their own conventions (like .NET's 4-component versions), we include them as separate grammar rules rather than mixing them with SemVer.

Why This Matters

The SemVer specification uses prose descriptions that leave room for interpretation. A formal grammar removes ambiguity and provides a reference that can be:

  • Ported to other languages - The EBNF notation is parser-generator agnostic
  • Used as a test oracle - Validate your parser against the grammar
  • Extended for your needs - Add custom rules while maintaining SemVer compliance
  • Referenced in discussions - Point to specific grammar rules when debating edge cases

Contributing

If you find the grammar useful, consider adopting it in your project. If you find bugs, ambiguities, or community conventions we've missed, we welcome issues and pull requests.

Licensing

The SemVer 2.0.0 specification by Tom Preston-Werner is licensed under Creative Commons Attribution 3.0 (CC BY 3.0).

Our EBNF grammar is a formalization of that specification. The grammar file includes:

  • SemVer rules - Derived from the CC BY 3.0 specification (attribution preserved)
  • Extensions - Go pseudo-versions, Assembly versions, partial versions (BSD 3-Clause)
  • Original formalization work - The EBNF encoding itself (BSD 3-Clause)

Important: Our release of this grammar under BSD 3-Clause does not remove or supersede the CC BY 3.0 attribution requirements of the underlying SemVer specification. When using the grammar, you must retain attribution to Tom Preston-Werner and the SemVer project as required by CC BY 3.0.

See Also