-
Notifications
You must be signed in to change notification settings - Fork 586
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JavaScript] New structure for expressions #1009
Conversation
It seems in the process of merging your other PRs, this one got conflicted. Would you mind resolving the conflict for me? |
I'll try to resolve the conflicts and wrap this up by the end of Saturday. All tests pass (at least before the recent merges), but there are a couple of oddities in the test file that aren't covered. |
That's probably correct; I'll verify after lunch. The only new change is fixing a test that had a missing semicolon. |
…ecause the change broke it.
That should be it, I think. I've fixed the remaining weirdness; it turns out there was a tiny regex typo that interacted badly with the new structure. |
Can you check and see what the filesize of the |
There appear to be 29 anonymous nested contexts. I don't think that is too much to maintain. I'd much rather that than layer a macro system over the whole thing. |
31 KB (versus 78 KB for the original). The new version also takes about 58 ms on my machine to parse the complete jQuery source, down from 79 ms. Myself, I find the anonymous contexts to be easier to read, but then I've been staring at this file for an awfully long time and become accustomed to its quirks. If you think it looks better with named contexts instead, I can certainly do that. |
I've removed all nested lists from the syntax. |
What can you report about the performance now? |
@rwols The change was strictly syntax, so there should be a 0% change in performance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall things look good for a merge. I'm sure there are some things that could be tweaked, but I'm confident we'll get feedback if so. I like it that the tests changed minimally.
JavaScript/syntax_test_js.js
Outdated
@@ -198,8 +234,12 @@ not_a_comment; | |||
'// /* not a comment'() {}, | |||
// ^^^^^^^^^^^^^^^^^^^^^^^^^^^ -comment() {} | |||
|
|||
'// /* not a comment'() {}, | |||
// ^^^^^^^^^^^^^^^^^^^^^^^^^^^ -comment() {} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this different than the assertions right above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It almost certainly isn't. Looks like it was added four months ago, probably by mistake.
Thanks for all of your work on this @Thom1729, and thanks for the reviews @rwols and @jcberquist. |
…ents [JavaScript] New structure for expressions
Background
In order to correctly parse JavaScript expressions, you must always know whether you are at the beginning or end of an expression. Examples:
All of these cases can be handled exactly correctly by sublime's parsing engine. This is not a case like arrow functions where perfect parsing is impossible due to engine limitations. The premise for this work is that the JavaScript syntax therefore should handle these distinctions perfectly.
Current state
The existing syntax tries to maintain these distinctions, but it does not do so systematically. For instance, the distinction between division and regexps is in many cases determined using a lookahead. This approach is buggy (#874, #885, #894).
In addition, the distinction is not made in some cases where it may be desirable. For instance, the syntax does not distinguish between array literals and bracketed property accesses (#324):
As a result, it is impossible to distinguish the rare comma operator from the more common comma separator punctuation (#831). An implementation of the distinction between arrays and property accesses has been attempted, but that implementation is vulnerable to the same sorts of bugs as the regex/division implementation and for the same reason (https://github.com/jcberquist/Packages/commit/3240bcb5ea4162e5acc2a4a8c0efbc93adde9088).
Since the current syntax was adopted for the core package, there have been a number of tweaks and patches to this behavior. Most notable is the addition of the
after-operator
andafter-identifier
contexts (62d142a). These contexts attempt to maintain the critical distinction between the beginning and end of an expression. They do so in a way that is relatively noninvasive. Unfortunately, because they are noninvasive, they do not and cannot solve the problem in the general case. While these contexts are a substantial improvement, bugs remain, and so do various patches that attempt to handle regexp corner cases.The solution
The core principle of the solution is that there must be separate states for the beginning and end of an expression. That is, when parsing
x / y;
, the parser must move back and forth between a state that expects constructs like numbers and regexps and a state that expects constructs like binary operators. When the parser is in one state, a slash will be interpreted as the beginning of a regexp, and when it is in the other state, it will be interpreted as the division operator.The
after-operator
andafter-identifier
contexts represent such a system, but as implemented they are insufficient. The entire expression syntax must be written in this tick-tock fashion; it cannot be bolted on or it will not work in the general case. On the other hand, if expressions are designed in this way from the ground up, then these knotty syntax problems become simple and no patches or special cases will be needed.After much trial and error, I have come up with a framework that handles this efficiently and effectively. To demonstrate this framework, here is a toy example:
This example has numeric literals, regexp literals, and the four common arithmetic operators. It will perfectly distinguish between division and regexps. (Try
/abc/ /def/ /ghi/;
, and add line breaks wherever you please.)There are several key points that deserve explanation.
First, the expression contexts parse a single expression. The contexts are highly stack-oriented, pushing and popping after nearly every match. In the end, when a complete expression has been consumed, the contexts will both pop off the stack. This means that whatever parent context uses expressions must push them rather than including them.
Second, they parse a single expression. Terminating semicolons are not part of an expression. Nor are JavaScript constructs like
throw
andreturn
. Nor, in JavaScript, are function declarations. This is critical: a function declaration is a complete statement that does not need to be terminated by a semicolon, but an expression does need termination. The examples at the top include an illustration of this distinction.Third, we are pushing and popping rather than setting back and forth between
expression-begin
andexpression-end
. There are several reasons for this. One is that in this scheme,expression-begin
does not have to refer explicitly toexpression-end
; it merely pops to get there. This is very important, because it allows us to have specialized versions ofexpression-end
. In JavaScript, I use one of these for expressions that may not contain commas at top level (such as for function arguments, list items, or lambda concise body expressions). Another is used to implement the automatic semicolon insertion algorithm for expression statements. Ifset
were used instead ofpush
andpop
, then we would need duplicate copies ofexpression-begin
. Compare Ecmascript Sublime, which uses many redundant almost-copies of contexts and in doing so exceeds 5,000 lines.Fourth, every contexts that wants expressions does have to push both
expression-end
andexpression-begin
in the correct order. However, this can be easily shortcut for convenience:Fifth, we don't have any error handling. This is easily remedied in practice; each of
expression-begin
andexpression-end
should pop if no match succeeds. In practice, I have found pushing and popping in this manner to be far more resilient to invalid input than setting.This toy example does little, but I hope that it can be seen how this framework can trivially distinguish blocks from objects, arrays from property accesses, function invocations from parenthesized expressions.
Implementation
In order to implement the above approach in the existing JavaScript syntax, I made a number of changes.
Zeroth, I moved comment logic to a prototype (#1007). This cleaned up the other scopes and fixed at least one bug.
First, I separated
statements
fromexpressions
by handling semicolons instatements
and poppingexpressions
whenever one was encountered.Second, I moved non-expression constructs like variable declarations,
throw
, and so forth fromexpressions
tostatements
. In the process, I implemented restricted productions in a general way and removed the existing implementations (that failed in some cases).Third, I implemented "expression statements", which are basically just expressions that handle automatic semicolon insertion. I added class and function declarations to
statements
.Fourth, I made the large change of implementing the two-level expression framework. I:
expression-begin
andexpression-end
contexts, and madeexpressions
set them.expressions
between them as appropriate.unary-operators
intoprefix-operators
andpostfix-operators
.round-brackets
intoround-brackets
andparenthesized-expression
.square-brackets
intosquare-brackets
andbracketed-property-access
.expression-no-comma
for various uses.after-identifier
andafter-operator
.expressions
to instead push it.Fifth, I renamed
expressions
to singularexpression
, reflecting its new behavior.As a result of these changes, expressions universally use the two-context framework. All tests currently pass, including some that needed modification because the previous behavior was incorrect. Issues #874, #885, and #894 are fixed.
Next steps
I write this PR early, as soon as the change is fundamentally implemented and before refactoring clutters the diffs. More work should be done to clean up the syntax and take advantage of the new expressions framework:
x = function() {};
should be refactored for simplicity. I would not be surprised if bugs were hiding there after the big change.I anticipate that these changes will decrease the size and complexity of the syntax as well as likely fixing bugs. In addition, I see opportunities for future improvement:
include
inexpression-begin
.)For now, I would appreciate feedback on the general approach and the specific implementation. I am aware that this is a major change to the structure of the syntax and that we will certainly want to add more tests. I will be available to fix any bugs that may come out of it in the forseeable future.