[JavaScript] New structure for expressions #1009

Thom1729 · 2017-05-29T17:33:34Z

Background

In order to correctly parse JavaScript expressions, you must always know whether you are at the beginning or end of an expression. Examples:

/abc/ /def/ /ghi/;
// <- string.regexp
//    ^ keyword.operator.arithmetic
//     ^^^ variable.other.readwrite
//          ^ keyword.operator.arithmetic
//          ^^^^^ string.regexp

x // expression continues
/abc/ /def/ /ghi/ y;
// <- keyword.operator.arithmetic
//^^ variable.other.readwrite
//  ^ keyword.operator.arithmetic
//    ^^^^^ string.regexp
//          ^ keyword.operator.arithmetic
//           ^^^ variable.other.readwrite
//              ^ keyword.operator.arithmetic

{} // semicolon not required
// <- meta.block

+ // expression continues
{};
// <- meta.object-literal

function foo() {} // semicolon not required after function declaration
/42/;
// <- string.regexp

+ // expression continues
function foo() {} // expression continues
/42/ z;
// <- keyword.operator.arithmetic

All of these cases can be handled exactly correctly by sublime's parsing engine. This is not a case like arrow functions where perfect parsing is impossible due to engine limitations. The premise for this work is that the JavaScript syntax therefore should handle these distinctions perfectly.

Current state

The existing syntax tries to maintain these distinctions, but it does not do so systematically. For instance, the distinction between division and regexps is in many cases determined using a lookahead. This approach is buggy (#874, #885, #894).

In addition, the distinction is not made in some cases where it may be desirable. For instance, the syntax does not distinguish between array literals and bracketed property accesses (#324):

[item];
// <- meta.brackets

x
[prop];
// <- meta.brackets

As a result, it is impossible to distinguish the rare comma operator from the more common comma separator punctuation (#831). An implementation of the distinction between arrays and property accesses has been attempted, but that implementation is vulnerable to the same sorts of bugs as the regex/division implementation and for the same reason (https://github.com/jcberquist/Packages/commit/3240bcb5ea4162e5acc2a4a8c0efbc93adde9088).

Since the current syntax was adopted for the core package, there have been a number of tweaks and patches to this behavior. Most notable is the addition of the after-operator and after-identifier contexts (62d142a). These contexts attempt to maintain the critical distinction between the beginning and end of an expression. They do so in a way that is relatively noninvasive. Unfortunately, because they are noninvasive, they do not and cannot solve the problem in the general case. While these contexts are a substantial improvement, bugs remain, and so do various patches that attempt to handle regexp corner cases.

The solution

The core principle of the solution is that there must be separate states for the beginning and end of an expression. That is, when parsing x / y;, the parser must move back and forth between a state that expects constructs like numbers and regexps and a state that expects constructs like binary operators. When the parser is in one state, a slash will be interpreted as the beginning of a regexp, and when it is in the other state, it will be interpreted as the division operator.

The after-operator and after-identifier contexts represent such a system, but as implemented they are insufficient. The entire expression syntax must be written in this tick-tock fashion; it cannot be bolted on or it will not work in the general case. On the other hand, if expressions are designed in this way from the ground up, then these knotty syntax problems become simple and no patches or special cases will be needed.

After much trial and error, I have come up with a framework that handles this efficiently and effectively. To demonstrate this framework, here is a toy example:

%YAML 1.2
---
name: Toy Syntax
scope: source.toy

contexts:
  main:
    - match: ';'
      scope: punctuation.terminator
    - match: (?=\S)
      push: [ expression-end, expression-begin ]

  expression-begin:
    - match: \d+
      scope: constant.numeric
      pop: true

    - match: /
      set:
        - meta_scope: string.regexp
        - match: /
          pop: true

  expression-end:
    - match: (?=;)
      pop: true
    - match: '[-+*/]'
      scope: keyword.operator.arithmetic
      push: expression-begin

This example has numeric literals, regexp literals, and the four common arithmetic operators. It will perfectly distinguish between division and regexps. (Try /abc/ /def/ /ghi/;, and add line breaks wherever you please.)

There are several key points that deserve explanation.

First, the expression contexts parse a single expression. The contexts are highly stack-oriented, pushing and popping after nearly every match. In the end, when a complete expression has been consumed, the contexts will both pop off the stack. This means that whatever parent context uses expressions must push them rather than including them.

Second, they parse a single expression. Terminating semicolons are not part of an expression. Nor are JavaScript constructs like throw and return. Nor, in JavaScript, are function declarations. This is critical: a function declaration is a complete statement that does not need to be terminated by a semicolon, but an expression does need termination. The examples at the top include an illustration of this distinction.

Third, we are pushing and popping rather than setting back and forth between expression-begin and expression-end. There are several reasons for this. One is that in this scheme, expression-begin does not have to refer explicitly to expression-end; it merely pops to get there. This is very important, because it allows us to have specialized versions of expression-end. In JavaScript, I use one of these for expressions that may not contain commas at top level (such as for function arguments, list items, or lambda concise body expressions). Another is used to implement the automatic semicolon insertion algorithm for expression statements. If set were used instead of push and pop, then we would need duplicate copies of expression-begin. Compare Ecmascript Sublime, which uses many redundant almost-copies of contexts and in doing so exceeds 5,000 lines.

Fourth, every contexts that wants expressions does have to push both expression-end and expression-begin in the correct order. However, this can be easily shortcut for convenience:

  main:
    - match: ';'
      scope: punctuation.terminator
    - match: (?=\S)
      push: expression

  expression:
    - match: (?=\S)
      set: [ expression-end, expression-begin ]

Fifth, we don't have any error handling. This is easily remedied in practice; each of expression-begin and expression-end should pop if no match succeeds. In practice, I have found pushing and popping in this manner to be far more resilient to invalid input than setting.

This toy example does little, but I hope that it can be seen how this framework can trivially distinguish blocks from objects, arrays from property accesses, function invocations from parenthesized expressions.

Implementation

In order to implement the above approach in the existing JavaScript syntax, I made a number of changes.

Zeroth, I moved comment logic to a prototype (#1007). This cleaned up the other scopes and fixed at least one bug.

First, I separated statements from expressions by handling semicolons in statements and popping expressions whenever one was encountered.

Second, I moved non-expression constructs like variable declarations, throw, and so forth from expressions to statements. In the process, I implemented restricted productions in a general way and removed the existing implementations (that failed in some cases).

Third, I implemented "expression statements", which are basically just expressions that handle automatic semicolon insertion. I added class and function declarations to statements.

Fourth, I made the large change of implementing the two-level expression framework. I:

Created the expression-begin and expression-end contexts, and made expressions set them.
Split the former contents of expressions between them as appropriate.
Caused those contents to push or pop as appropriate.
Broke out unary-operators into prefix-operators and postfix-operators.
Broke out round-brackets into round-brackets and parenthesized-expression.
Broke out square-brackets into square-brackets and bracketed-property-access.
Added expression-no-comma for various uses.
Removed after-identifier and after-operator.
Changed various places that had included expressions to instead push it.
Fixed various tests that specified incorrect behavior.

Fifth, I renamed expressions to singular expression, reflecting its new behavior.

As a result of these changes, expressions universally use the two-context framework. All tests currently pass, including some that needed modification because the previous behavior was incorrect. Issues #874, #885, and #894 are fixed.

Next steps

I write this PR early, as soon as the change is fundamentally implemented and before refactoring clutters the diffs. More work should be done to clean up the syntax and take advantage of the new expressions framework:

Most of the match anything/push expressions can and should be refactored to something cleaner, generally by using the stack more.
The match anything/pop behavior should probably be factored out and included.
Much of the lookahead logic for constructs like x = function() {}; should be refactored for simplicity. I would not be surprised if bugs were hiding there after the big change.
Arrays should be rescoped ([JavaScript] possible enhancement: distinguish uses of square brackets #324). This is a trivial change.
The line continuation lookahead probably has bugs and should be carefully reviewed.

I anticipate that these changes will decrease the size and complexity of the syntax as well as likely fixing bugs. In addition, I see opportunities for future improvement:

Better scoping for the inside of an object literal.
Specific-enough meta scopes that we may be able to provide lexically-aware completions (e.g. all variables in lexical scope).
Easy JSX support. (It would be simple enough to provide versions both with and without JSX. The key difference would be a single include in expression-begin.)
([JavaScript] Should the comma operator be highlighted as an operator? #831) Marking the comma operator as an operator (leaving comma punctuation as punctuation).

For now, I would appreciate feedback on the general approach and the specific implementation. I am aware that this is a major change to the structure of the syntax and that we will certainly want to add more tests. I will be available to fix any bugs that may come out of it in the forseeable future.

Merge in upstream changes

… a bug that was fixed.

…29/Packages into javascript-fix-statements

wbond · 2017-06-08T20:50:31Z

It seems in the process of merging your other PRs, this one got conflicted. Would you mind resolving the conflict for me?

Thom1729 · 2017-06-08T21:10:45Z

I'll try to resolve the conflicts and wrap this up by the end of Saturday. All tests pass (at least before the recent merges), but there are a couple of oddities in the test file that aren't covered.

Thom1729 · 2017-06-10T16:39:19Z

That's probably correct; I'll verify after lunch. The only new change is fixing a test that had a missing semicolon.

…ecause the change broke it.

Thom1729 · 2017-06-10T23:22:24Z

That should be it, I think. I've fixed the remaining weirdness; it turns out there was a tiny regex typo that interacted badly with the new structure.

…hitecture.

wbond · 2017-10-18T00:55:27Z

Can you check and see what the filesize of the {data}/Cache/JavaScript/JavaScript.sublime-syntax.rcache file is on your machine?

wbond · 2017-10-18T00:58:27Z

There appear to be 29 anonymous nested contexts. I don't think that is too much to maintain. I'd much rather that than layer a macro system over the whole thing.

Thom1729 · 2017-10-18T01:40:18Z

31 KB (versus 78 KB for the original). The new version also takes about 58 ms on my machine to parse the complete jQuery source, down from 79 ms.

Myself, I find the anonymous contexts to be easier to read, but then I've been staring at this file for an awfully long time and become accustomed to its quirks. If you think it looks better with named contexts instead, I can certainly do that.

Thom1729 · 2017-10-18T16:28:50Z

I've removed all nested lists from the syntax.

rwols · 2017-10-19T22:01:21Z

The new version also takes about 58 ms on my machine to parse the complete jQuery source, down from 79 ms.

I've removed all nested lists from the syntax.

What can you report about the performance now?

wbond · 2017-10-20T00:51:26Z

@rwols The change was strictly syntax, so there should be a 0% change in performance.

Thom1729 · 2017-11-08T15:18:40Z

@wbond Have you had a chance to take a look at the syntax now that the nested lists are gone? I'm anxious to get to work on #1269, but I don't want to pile yet more changes on top of an unmerged PR.

wbond

Overall things look good for a merge. I'm sure there are some things that could be tweaked, but I'm confident we'll get feedback if so. I like it that the tests changed minimally.

wbond · 2017-11-08T15:33:09Z

JavaScript/syntax_test_js.js

@@ -198,8 +234,12 @@ not_a_comment;
    '// /* not a comment'() {},
 //  ^^^^^^^^^^^^^^^^^^^^^^^^^^^ -comment() {}

+    '// /* not a comment'() {},
+//  ^^^^^^^^^^^^^^^^^^^^^^^^^^^ -comment() {}
+


How is this different than the assertions right above?

It almost certainly isn't. Looks like it was added four months ago, probably by mistake.

wbond · 2017-11-08T16:14:51Z

Thanks for all of your work on this @Thom1729, and thanks for the reviews @rwols and @jcberquist.

…ents [JavaScript] New structure for expressions

Thom1729 and others added 20 commits May 27, 2017 18:37

Merge pull request #1 from sublimehq/master

d643d59

Merge in upstream changes

Moved comments to prototype. Added missing tests for comments and for…

8b71483

… a bug that was fixed.

Added another meta_include_prototype that I missed due to other bugs.

918a535

Expression scope is now pushed.

29d9336

Moved statement keywords to statements context.

0495ab1

Added class and function statements. Implemented semicolon insertion.

5ad77af

Implemented the two-level expression scheme. All tests pass, somehow.

6d3cf3e

Renamed "expressions" to singular "expression".

d06206c

Moved comment contexts to the top of the file.

5114ff9

Expression scope is now pushed.

51d6019

Moved statement keywords to statements context.

edf9547

Added class and function statements. Implemented semicolon insertion.

8bf34b8

Implemented the two-level expression scheme. All tests pass, somehow.

e522556

Renamed "expressions" to singular "expression".

9cd5d1a

Minor cleanup.

bb77569

Scope arrays as meta.sequence per sublimehq#790. Fixes sublimehq#324.

b18022f

Improved line continuation lookahead.

35bd19b

Merge branch 'javascript-fix-statements' of https://github.com/Thom17…

c6b0264

…29/Packages into javascript-fix-statements

Cleanup in object literal.

a993ab8

Minor simplifications. Also, postfix decrement is scoped properly.

6365568

Merge branch 'master' into javascript-fix-statements

15a0601

Fixed variable names to pop correctly. Then fixed constructor stuff b…

2bcf9e0

…ecause the change broke it.

This was referenced Jun 11, 2017

[JavaScript] Issue with regex empty character class #873

Closed

[JavaScript] Newline between as and alias breaks import/export parsing. #1027

Closed

Thomas Smith added 2 commits June 11, 2017 15:16

Fixed and unified import/export aliases.

2bc6eb4

Removed various *-continue scopes that aren't needed with the new arc…

477c785

…hitecture.

thom added 2 commits October 18, 2017 09:42

Merge branch 'master' into javascript-fix-statements

1c56319

Refactored anonymous contexts to avoid nested lists.

983b2e4

Thom1729 mentioned this pull request Oct 31, 2017

[JavaScript] ECMAScript Omnibus #1269

Closed

Merge branch 'master' into javascript-fix-statements

78a2b75

wbond requested changes Nov 8, 2017

View reviewed changes

thom added 2 commits November 8, 2017 11:05

Removed redundant test.

24c4755

Merge branch 'master' into javascript-fix-statements

1645983

wbond approved these changes Nov 8, 2017

View reviewed changes

wbond merged commit 00305fc into sublimehq:master Nov 8, 2017

Thom1729 deleted the javascript-fix-statements branch November 8, 2017 18:53

This was referenced Nov 11, 2017

[Vue] Broken highlighting in Build 3154 #1302

Closed

Invalid scope in JS export statement #1309

Closed

This was referenced Dec 3, 2017

wrong pull... borela/naomi#27

Closed

[FJSX15] Expression and statement prediction enhancement; Scope normalization; Candyman support borela/naomi#28

Merged

Thom1729 mentioned this pull request Mar 3, 2018

JSCustom breaks vue syntax highlighting Thom1729/Sublime-JS-Custom#20

Closed

This was referenced Mar 21, 2018

[JavaScript] Refactor statements. #1469

Merged

The Future of Babel-sublime babel/babel-sublime#355

Closed

Thom1729 mentioned this pull request May 9, 2018

[JavaScript] Distinguish the comma operator from comma punctuation. #1551

Merged

This was referenced May 22, 2018

[JavaScript] Highlighting issues #1600

Closed

[JavaScript] Improve parsing of class names. #1614

Merged

Thom1729 mentioned this pull request Mar 17, 2019

JS Custom takes too long to set syntax on views Thom1729/Sublime-JS-Custom#58

Closed

deathaxe pushed a commit to deathaxe/sublime-packages that referenced this pull request Jun 9, 2019

Merge pull request sublimehq#1009 from Thom1729/javascript-fix-statem…

cd7a9fb

…ents [JavaScript] New structure for expressions

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[JavaScript] New structure for expressions #1009

[JavaScript] New structure for expressions #1009

Thom1729 commented May 29, 2017

wbond commented Jun 8, 2017

Thom1729 commented Jun 8, 2017

Thom1729 commented Jun 10, 2017

Thom1729 commented Jun 10, 2017

wbond commented Oct 18, 2017

wbond commented Oct 18, 2017

Thom1729 commented Oct 18, 2017

Thom1729 commented Oct 18, 2017

rwols commented Oct 19, 2017

wbond commented Oct 20, 2017

Thom1729 commented Nov 8, 2017

wbond left a comment

wbond Nov 8, 2017

Thom1729 Nov 8, 2017

wbond commented Nov 8, 2017

[JavaScript] New structure for expressions #1009

[JavaScript] New structure for expressions #1009

Conversation

Thom1729 commented May 29, 2017

Background

Current state

The solution

Implementation

Next steps

wbond commented Jun 8, 2017

Thom1729 commented Jun 8, 2017

Thom1729 commented Jun 10, 2017

Thom1729 commented Jun 10, 2017

wbond commented Oct 18, 2017

wbond commented Oct 18, 2017

Thom1729 commented Oct 18, 2017

Thom1729 commented Oct 18, 2017

rwols commented Oct 19, 2017

wbond commented Oct 20, 2017

Thom1729 commented Nov 8, 2017

wbond left a comment

Choose a reason for hiding this comment

wbond Nov 8, 2017

Choose a reason for hiding this comment

Thom1729 Nov 8, 2017

Choose a reason for hiding this comment

wbond commented Nov 8, 2017