Grammar-Based Parser
Versionator uses a formal EBNF grammar to parse version strings. This page explains the approach, its benefits, and trade-offs.
Why a Formal Grammar?
Most version parsers use ad-hoc string manipulation or regular expressions. Versionator takes a different approach: we define a formal grammar based on the SemVer 2.0.0 specification and generate a parser from it.
The Problem with Ad-Hoc Parsing
Traditional regex-based version parsing has issues:
// Common regex approach - fragile and hard to maintain
var semverRegex = regexp.MustCompile(
`^v?(\d+)\.(\d+)\.(\d+)(-[0-9A-Za-z-.]+)?(\+[0-9A-Za-z-.]+)?$`)
Problems:
- Hard to extend - Adding support for new formats requires regex surgery
- Difficult to validate - Edge cases (leading zeros, empty identifiers) need separate checks
- Poor error messages - Regex failures give no context about what went wrong
- Specification drift - Easy to deviate from the SemVer spec without noticing
The Grammar Approach
Instead, we define the grammar formally:
version = [ prefix ], version-core ;
prefix = "v" | "V" ;
version-core = semver | partial-version ;
semver = major, ".", minor, ".", patch, [ pre-release ], [ build-metadata ] ;
pre-release = "-", pre-release-identifier, { ".", pre-release-identifier } ;
pre-release-identifier = alphanumeric-identifier | numeric-identifier ;
numeric-identifier = "0" | positive-digit, { digit } ;
This grammar is:
- Readable - Maps directly to the SemVer specification
- Verifiable - Can be reviewed against the spec
- Self-documenting - The grammar IS the documentation
Implementation
Versionator uses Participle, a parser generator for Go that builds parsers from struct tags.
Lexer
The lexer tokenizes input into meaningful units:
var VersionLexer = lexer.MustSimple([]lexer.SimpleRule{
{Name: "Prefix", Pattern: `[vV]`},
{Name: "Number", Pattern: `[0-9]+`},
{Name: "Dot", Pattern: `\.`},
{Name: "Dash", Pattern: `-`},
{Name: "Plus", Pattern: `\+`},
{Name: "Ident", Pattern: `[a-zA-Z][a-zA-Z0-9-]*`},
{Name: "Mixed", Pattern: `[0-9]+[a-zA-Z][a-zA-Z0-9-]*`},
})
AST
The grammar maps to Go structs:
type Version struct {
Prefix string `parser:"@Prefix?"`
Core *VersionCore `parser:"@@"`
PreRelease []*Identifier `parser:"('-' @@ ('.' @@)* )?"`
BuildMetadata []*Identifier `parser:"('+' @@ ('.' @@)* )?"`
Raw string
}
type VersionCore struct {
Major *int `parser:"@Number"`
Minor *int `parser:"('.' @Number)?"`
Patch *int `parser:"('.' @Number)?"`
Revision *int `parser:"('.' @Number)?"`
}
Validation
After parsing, semantic validation ensures SemVer compliance:
- No leading zeros in numeric identifiers (except
0itself) - Pre-release identifiers contain only
[0-9A-Za-z-] - Build metadata identifiers contain only
[0-9A-Za-z-] - Only
vorVprefixes allowed
Benefits
1. Correctness by Construction
The grammar enforces structure that regex can't:
1.2.3-alpha.1 ✓ Valid
1.2.3-alpha..1 ✗ Empty identifier (caught by grammar)
1.2.3-alpha.01 ✗ Leading zero (caught by validation)
2. Better Error Messages
Grammar-based parsing knows exactly where parsing failed:
parse error: 1:6: unexpected token "."
^
1.2..3
3. Easy Extension
Adding support for new formats requires adding grammar rules, not debugging regex:
(* Add 4-component .NET assembly versions *)
assembly-version = major, ".", minor, ".", build, ".", revision ;
4. Testability
Grammar rules can be tested in isolation:
func TestParse_PreRelease(t *testing.T) {
tests := []struct{
input string
expected []string
}{
{"1.0.0-alpha", []string{"alpha"}},
{"1.0.0-alpha.1", []string{"alpha", "1"}},
{"1.0.0-alpha.beta.1", []string{"alpha", "beta", "1"}},
}
// ...
}
5. Documentation as Code
The grammar file serves as executable documentation:
docs/grammar/version.ebnf- Complete formal grammardocs/grammar/railroad.html- Visual railroad diagrams
Trade-offs
Complexity
A grammar-based parser is more complex than a simple regex:
| Approach | Lines of Code | Dependencies |
|---|---|---|
| Regex | ~50 | None |
| Grammar | ~300 | participle |
Build-Time Cost
The parser is constructed at init time, adding ~5ms to startup. This is negligible for CLI usage but measurable.
Learning Curve
Contributors need to understand:
- EBNF grammar notation
- Participle's struct tag syntax
- Lexer token definitions
When This Matters
The grammar approach pays off when:
- Parsing complex formats - SemVer has many edge cases
- Strict compliance needed - We claim SemVer 2.0.0 compliance
- Multiple formats supported - Go pseudo-versions, assembly versions, partial versions
- Good error messages matter - CLI users need actionable feedback
Philosophy: The Joy of Excess
Let's be honest: a regex would do.
var semverRegex = regexp.MustCompile(
`^v?(\d+)\.(\d+)\.(\d+)(-[0-9A-Za-z-.]+)?(\+[0-9A-Za-z-.]+)?$`)
That's 80 characters. It would parse 99% of version strings correctly. Ship it.
Instead, versionator has a 370-line formal EBNF grammar covering SemVer, Go pseudo-versions, .NET assembly versions, NuGet, npm, CalVer, and template variables. Complete with ISO/IEC 14977 notation, proper CC BY 3.0 licensing attribution for the SemVer spec, and railroad diagram generation.
This is, by any reasonable measure, excessive.
And that's exactly the point.
Versionator is a hobby project solving a real problem. The problem could be solved with less. But hobby projects aren't about less. They're about:
- Doing it right - Not "right enough," but right. The SemVer spec is prose; a formal grammar removes ambiguity.
- Learning by building - Writing an EBNF grammar teaches you things a regex never will.
- Joy in craft - There's satisfaction in a grammar that could serve as a community reference, even if it only serves one CLI tool.
- Permissible digressions - In professional work, you ship the regex. In passion projects, you write the grammar, generate the railroad diagrams, and document the precedence rules.
The grammar exists because it was fun to write, correct to use, and excessive in exactly the right way for a project where the journey matters as much as the destination.
If you're evaluating versionator for production use: the grammar works, it's well-tested, and it handles edge cases gracefully. If you're evaluating versionator as a codebase to learn from: welcome, fellow enthusiast.
Grammar Reference
The complete grammar is in docs/grammar/version.ebnf.
Key rules:
| Rule | Description |
|---|---|
version | Top-level: optional prefix + version-core |
prefix | Only v or V per SemVer convention |
semver | Full Major.Minor.Patch with optional pre-release/metadata |
partial-version | Major only or Major.Minor (defaults missing to 0) |
pre-release | Dash-prefixed, dot-separated identifiers |
build-metadata | Plus-prefixed, dot-separated identifiers |
numeric-identifier | No leading zeros (except "0") |
alphanumeric-identifier | Contains at least one letter |
Community Use
The EBNF grammar in docs/grammar/version.ebnf is available for the broader SemVer community.
As Implemented by the Community
This grammar represents SemVer as implemented by the community, not just the formal specification. We've tried to capture:
- SemVer 2.0.0 core - The official specification
- Common extensions - Go module pseudo-versions, npm conventions
- Practical variations - Partial versions (
1.2), optional prefix - Adjacent ecosystems - Microsoft Assembly versions (4-component)
We explicitly chose not to make unilateral modifications to the SemVer core. Where the community has established patterns (like the v prefix), we document and support them. Where ecosystems have their own conventions (like .NET's 4-component versions), we include them as separate grammar rules rather than mixing them with SemVer.
Why This Matters
The SemVer specification uses prose descriptions that leave room for interpretation. A formal grammar removes ambiguity and provides a reference that can be:
- Ported to other languages - The EBNF notation is parser-generator agnostic
- Used as a test oracle - Validate your parser against the grammar
- Extended for your needs - Add custom rules while maintaining SemVer compliance
- Referenced in discussions - Point to specific grammar rules when debating edge cases
Contributing
If you find the grammar useful, consider adopting it in your project. If you find bugs, ambiguities, or community conventions we've missed, we welcome issues and pull requests.
Licensing
The SemVer 2.0.0 specification by Tom Preston-Werner is licensed under Creative Commons Attribution 3.0 (CC BY 3.0).
Our EBNF grammar is a formalization of that specification. The grammar file includes:
- SemVer rules - Derived from the CC BY 3.0 specification (attribution preserved)
- Extensions - Go pseudo-versions, Assembly versions, partial versions (BSD 3-Clause)
- Original formalization work - The EBNF encoding itself (BSD 3-Clause)
Important: Our release of this grammar under BSD 3-Clause does not remove or supersede the CC BY 3.0 attribution requirements of the underlying SemVer specification. When using the grammar, you must retain attribution to Tom Preston-Werner and the SemVer project as required by CC BY 3.0.
See Also
- Version Grammar Explained - Plain English guide to version string syntax
- SemVer 2.0.0 Specification
- VERSION File Format
- Semantic Versioning Concepts