feat: hand written parser#40717
Conversation
|
Looks like this PR is not ready to merge, because of the following issues:
Please fix the issues and try again If you have any trouble, please check the PR guidelines |
|
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Hey @amitkumarashutosh , good work getting this done. The class is clean and readable overall. I do have a few thoughts on the OOP design. The core issue is that methods like advance(), save(), restore(), and pos being public are exposing the internal mechanics of how the scanner works, rather than what the scanner does from the caller's perspective. This breaks encapsulation — callers shouldn't need to know or care how the scanner moves through the input internally. Let me use an example. Imagine you order a car from Factory A. Its public interface is start(), drive(), brake(), and getSpeed(). Now Factory B delivers a car with those same methods plus increaseSpeed(), decreaseSpeed(), and setRPM(). Most people would prefer Factory A's car — not because Factory B's car is broken, but because Factory A made a decision about what the driver should and shouldn't need to think about. The engine is managing speed under the hood, the driver just presses the pedal. Factory B is forcing the driver to manage internals that should be invisible, which makes the experience more complex and error-prone. Expose too much, and the driver causes a crash not because they were reckless, but because the interface invited the mistake. The same applies here — the Scanner class is our car, and the caller is our driver. advance(n) is a raw cursor movement that exposes internal position management directly to the caller. Instead, the scanner should move itself forward as a side effect of higher-level operations. Methods like consume(), consumeWhile(predicate), or expect(literal) would advance internally and return meaningful results — the caller should be asking what to consume, not how many characters to move. save() and restore() expose the backtracking mechanism as a manual two-step process. This is fragile because a caller can save and forget to restore. A pattern like tryConsume(literal) or a scoped attempt(() => ...) callback would hide the save/restore cycle entirely and make misuse much harder. pos being public lets anyone outside the class read and reason about the raw cursor index. Position is an implementation detail. If the caller needs location info for something like error reporting, a dedicated method like getLocation() returning something meaningful would serve that need without leaking the internals. sliceFrom(savedPos) requires the caller to hold a raw position integer and pass it back in, which ties the caller directly to internal state. A mark() method returning an opaque token, paired with getSliceSinceMarked(token), would achieve the same result without exposing the position type at all.
A well-designed class should make the right usage easy and the wrong usage hard. Right now the public API of Scanner requires callers to know a lot about how the internals work in order to use it correctly. The goal is to push that knowledge inward — hide the cursor, hide the backtracking, and expose only the operations that make sense at the scanning level.
| if (scanner.matches('mailto:')) { | ||
| scanner.advance(7); | ||
| } |
There was a problem hiding this comment.
Looks like a magic number slipped through here! Could you please fix this one and double-check the rest of the code for any others?
37e61da to
c673fba
Compare
|
Hey @amitkumarashutosh , I'm getting a build failure in |
c673fba to
65fd782
Compare
@ahmed-n-abdeltwab, I resolved the unused variable ( |
Thanks. It's always a good idea to double-check everything before pushing a commit. It's okay to make these kinds of mistakes; I do it all the time and they're hard to catch. That's why I usually run the build to ensure everything compiles properly, and then run the tests and lint checks to make sure I don't mess up the code or the project. It's like evolving a third eye that's looking out for those mistakes! :) |
ac46e14 to
c67061d
Compare
| if (scanner.matches('mailto:')) { | ||
| scanner.consume(7); | ||
| } |
There was a problem hiding this comment.
here is a magic number
Handwritten Parser Foundation
This PR introduces a handwritten TypeScript parser implementation to replace the
PeggyJS-generated parser, building it up incrementally while maintaining AST
compatibility with the existing grammar.
What's Added
parser.ts— core handwritten parser with block dispatch and inline parsing loopscanner.ts— cursor abstraction withposition()/backtrack()for backtrackingchars.ts— character classification helpers (isAlpha, isDigit, isPlainChar, etc.) and the emoticon tableutils.tsfor node generation, and types fromdefinitions.tsindex.tsParsing Features Implemented
paragraphnodelineBreaknodes; trailing newlines are swallowed\*,\_,\~,\`,\#,\.codewith no markup parsing inside*text*and**text**with whitespace-only and triple-asterisk edge cases_text_and__text__with word-boundary guards to protectsnake_case~text~and~~text~~with whitespace-only and triple-tilde edge casesskipBold,skipItalic,skipStrike) mirroring PeggyJS skip flags#through####prefix with required space, inline content inside```with optional language tag, raw content (no markup inside)@usernamesupporting special characters like.-_:@in names#channelwith word-boundary guard>prefixed lines with inline markup support inside||text||inline and multi-line||\n...\n||block support[title](url)and<url|title>-/*and1.markers with inline content per item; asterisk bullets respect the grammar's trailing-*guard (so* *,* Hello*stay inline)tldts:emoji:parsing, plus big-emoji handling for shortcode-only messageslocal@domain→mailto:link, with TLD validation andmailto:prefix parts+number→tel:link viaphoneChecker, with prefix/grouping variants<t:...>supporting Unixtime, ISO-8601 (with/without milliseconds and timezone), relativeHH:MM[:SS]times, and thet/T/d/D/f/F/Rformat specifiers:),:D, etc. via a symbol→shortcode table with longest-match and word-boundary dispatch, integrated into big-emoji detectionTest Progress
Goal
Notes