JSON parsing

Published at May 9, 2024

Okay, I know I said I would write soon. That was obviously a lie. Now that I have fully admitted my war crimes to a single reader, let’s start off with something a little light.

Back around the start of this year I was working on a configuration language, Azoth. Think Pkl, but Turing complete. I still have a tendency to not complete projects, but this time the cause isn’t incompetence or laziness: I have been yak shaving on this language to no end, in terms of both implementation and syntax. I’m still deciding on what to do going forward.

Anyway, the language in question handles a superset of JSON. JSON might seem boring at first glance, but trying to teach regexes to recognise JSON as a token stream has made me both older and wiser. Even then, JSON isn’t as cursed as XML, which is why I have decided to start off with the simplest topic I know.

You can refer to this for a more concise explanation on JSON. You now no longer have to read the rest of this blog post.

Note: I have to write regexes in a special sort of way since the lexer library I’m using does not support backtracking.

Strings

A valid JSON string:

Is surrounded by double quotes
Can contain escapes for characters like double quotes, backslashes, and forward slashes

Hold up, did you say forward slashes?

Escaping it is not mandatory, but the escape exists for a reason, so let’s turn back the clock to the early 2000s. Back then, JavaScript was busy being JScript and Douglas Crockford had just released the JSON specification to the world. Most JSON parsers at that time treated JSON as valid JavaScript, something that could be inlined in script tags, and therein lies the problem:

var lyrics = {"atrocity": "It's okay to leave your dog in a hot car</script>"};

As you can see, the HTML parser knows nothing about JavaScript strings, so it sees the closing tag inside the string and immediately closes the entire script tag. To prevent that from happening, people wrote <\/script> instead and nobody batted an eye. Nowadays, if you see this in the wild, it is probably there for compatibility reasons.
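
Here is a minimal sketch of that trick in Python, using the standard json module. The helper name escape_for_script_tag is made up for illustration, and real code that inlines JSON into HTML would likely need to handle more than this one sequence.

```python
import json

# "\/" is a legal escape, so both spellings decode to the same string.
assert json.loads(r'"<\/script>"') == json.loads('"</script>"') == "</script>"

def escape_for_script_tag(payload: str) -> str:
    """Hypothetical helper: hide "</" from the HTML parser before inlining JSON."""
    return payload.replace("</", "<\\/")

lyrics = {"atrocity": "It's okay to leave your dog in a hot car</script>"}
inlined = escape_for_script_tag(json.dumps(lyrics))

# The closing tag is gone from the markup, but the data round-trips unchanged.
assert "</script>" not in inlined
assert json.loads(inlined) == lyrics
print(inlined)
```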

Other weird escapes include:

\b
\f
\u

\b represents a backspace character, which moves the cursor back one position so that whatever is printed next overwrites the last character. Does any programming language actually do that? Try typing print("Hi\b\b\b there") in your Python REPL.

\f represents a form feed character. On a typewriter-style printer, a form feed tells the device to advance the paper to the top of the next page. In the early days of computing, not everyone had a monitor. They were bulky, expensive, and of poor quality. Teletypewriters (TTYs) can be thought of as tiny thin clients that send and receive data electronically but output it on a printed piece of paper, like traditional typewriters. They were considered “dumb terminals” with no processing logic of their own, though some TTYs did come close to being actual computers. Nowadays, terminal emulators don’t even bother trying to emulate this behaviour, for obvious reasons.

These escapes are not unique to JSON, but most programmers, including me, never use them, so I might as well cover them here.
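
To see what a parser actually does with these escapes, here is a quick sketch using Python's standard json module. The sample string is my own. One thing it also shows: since \u only takes four hex digits, a character outside the BMP is written as two \u escapes (a UTF-16 surrogate pair), which the parser joins back into one character.

```python
import json

# Raw string, so Python leaves the backslashes alone for the JSON parser.
raw = r'"\b \f \u0041 \ud83d\ude03"'
decoded = json.loads(raw)

# Each escape collapses into a single character.
assert decoded == "\x08 \x0c A \U0001F603"
print([hex(ord(ch)) for ch in decoded])
```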

This is also the reason why serde_json cannot always deserialise to a &str: when the JSON string contains escapes, the parser has to add and remove characters, so the result can no longer be borrowed straight from the input. However, deserialising to a Cow<str> is possible, as it can be owned or borrowed on a case-by-case basis.
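
The underlying problem is not specific to Rust. Here is a rough sketch in Python (not serde's actual logic, and needs_new_buffer is a made-up name) that checks whether the decoded value is still identical to the text between the quotes. It assumes the input has no whitespace around the quotes.

```python
import json

def needs_new_buffer(raw: str) -> bool:
    """Hypothetical check: does decoding change the text between the quotes?"""
    return json.loads(raw) != raw[1:-1]

assert not needs_new_buffer('"plain text"')   # could be handed out as-is
assert needs_new_buffer(r'"line\nbreak"')     # the escape forces a rewrite
```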

Now for the valid characters that make up a JSON string. Each unescaped character must be in the range 0x0020 to 0x10FFFF, excluding the double quote and the backslash, which have to be escaped. Characters lower than 0x0020 are control characters like carriage return, which can cause headaches when handled raw. JSON includes characters that are within the Basic Multilingual Plane (BMP) (0x0020 - 0xFFFF) and the Supplementary Planes (0x10000 - 0x10FFFF) as part of its supported character set.
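
You can watch the lower bound being enforced with Python's json module, which rejects raw control characters in its default strict mode. This is just a sketch with strings I made up:

```python
import json

# A raw line feed (0x000A) inside the quotes is below 0x0020, so it is rejected.
try:
    json.loads('"line one\nline two"')
    rejected = False
except json.JSONDecodeError as err:
    rejected = True
    print("rejected:", err.msg)

assert rejected

# The escaped form is fine and decodes to the same control character.
assert json.loads(r'"line one\nline two"') == "line one\nline two"
```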

The BMP includes common characters from scripts like Latin and Cyrillic, and even Chinese characters. The Supplementary Planes (SP) contain less used characters. Plane 1 contains rare and historical characters, and most emojis are included in this plane as well. Plane 2 contains even more Chinese, Japanese, and Korean characters that are less frequently used. Plane 14 includes language tags and variation selectors, which are specialized characters in Unicode that help specify additional details about how characters should be interpreted or presented. For instance, a character might look the same in different languages, and these characters help to tell them apart. For example, the smiling face emoji 😃 (0x1F603) can have a text variant (using 0xFE0E) or an emoji variant (using 0xFE0F), although these two particular selectors sit in the BMP rather than Plane 14. Planes 15 and 16 are reserved for private use by individuals and organizations.

Did you know: Some emojis can be formed by combining multiple Unicode code points, often to create a more specific or complex representation. For example, combining the man (0x1F468) and the rocket (0x1F680) emoji with a zero width joiner (ZWJ) (0x200D) forms the “man astronaut” emoji (👨‍🚀).
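
Both the planes and the ZWJ trick are easy to poke at from Python. In this small sketch, the plane helper is my own name; it relies on each plane spanning 0x10000 code points, so the plane number is the code point shifted right by 16 bits.

```python
def plane(ch: str) -> int:
    """Return the Unicode plane of a single character (code point >> 16)."""
    return ord(ch) >> 16

assert plane("A") == 0             # Latin, BMP
assert plane("\u4e2d") == 0        # a common Chinese character, BMP
assert plane("\U0001F603") == 1    # smiling face emoji, Plane 1
assert plane("\U0010FFFF") == 16   # last code point, Plane 16

# man + zero width joiner + rocket: three code points, one visible emoji
astronaut = "\U0001F468\u200D\U0001F680"
assert len(astronaut) == 3
print(astronaut)
```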

Anyway, here’s the regex I used:

"([\u0020-\u0021\u0023-\u005B\u005D-\U0010FFFF]|(\\(["\\/bfnrt]|(u[\dA-Fa-f]{4}))))*"

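As a sanity check, the same pattern can be dropped into Python's re module, which happens to accept the \uXXXX and \UXXXXXXXX escapes as written. This is only a sketch: the test strings are mine, and passing re.ASCII (so that \d means 0-9 only) is an assumption about how the original lexer treats \d.

```python
import re

# The regex from the post, verbatim.
JSON_STRING = re.compile(
    r'"([\u0020-\u0021\u0023-\u005B\u005D-\U0010FFFF]|(\\(["\\/bfnrt]|(u[\dA-Fa-f]{4}))))*"',
    re.ASCII,
)

def is_json_string(text: str) -> bool:
    """True if the whole input is exactly one JSON string token."""
    return JSON_STRING.fullmatch(text) is not None

assert is_json_string('"hello"')
assert is_json_string(r'"a \"quoted\" word"')
assert is_json_string(r'"<\/script> \u00e9"')
assert is_json_string('"\U0001F603"')            # Plane 1 is in range
assert not is_json_string('"raw\ttab"')          # 0x0009 is below 0x0020
assert not is_json_string(r'"bad \x escape"')    # \x is not a JSON escape
assert not is_json_string(r'"\u12"')             # \u needs four hex digits
assert not is_json_string('"unterminated')
```
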
Numbers

Numbers are simple, right? 123, this is a number. Unfortunately, JSON has to handle more than just positive whole numbers. A number may start with a negative sign. -0 is a valid JSON number, but once the integer part has more than a single digit, it can no longer start with a 0. A number can also have a decimal point, which must be followed by at least one digit. The trouble comes when JSON chose to support scientific notation, or more specifically scientific notation written in the form of E notation.

E notation

You’ve probably heard of E notation before if you have ever messed with a scientific calculator. In JSON, both “E” and “e” are considered valid. After the E part comes an optional sign; both the + and - characters are valid. Lastly comes the exponent, which is one or more digits. It can start off with the digit 0 for some reason.

-?(([1-9]\d+)|\d)(\.\d+)?([Ee][+-]?\d+)?
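
Here is the same pattern run through Python's re module against a handful of inputs I picked to cover the rules above. Again, this is a sketch: re.ASCII is my assumption so that \d only means 0-9, and fullmatch stands in for a lexer that must consume the whole token.

```python
import re

# The regex from the post, verbatim.
JSON_NUMBER = re.compile(r'-?(([1-9]\d+)|\d)(\.\d+)?([Ee][+-]?\d+)?', re.ASCII)

def is_json_number(text: str) -> bool:
    """True if the whole input is exactly one JSON number token."""
    return JSON_NUMBER.fullmatch(text) is not None

valid = ["0", "-0", "123", "-12.5", "1e10", "1E+05", "6.02e-23"]
invalid = ["01", "+1", ".5", "1.", "1e", "--1", "0x10"]

assert all(is_json_number(text) for text in valid)
assert not any(is_json_number(text) for text in invalid)
```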

Whitespace

Lastly, we have whitespace. But how do we define whitespace? In JSON, it is basically made up of these four characters:

0x0020
0x000A
0x000D
0x0009

The first character is the space character. There’s nothing much to say about it.

The next character is the line feed (or newline). You use it to move the cursor to the next line.

The third character is the carriage return. It originally meant moving the cursor to the start of the line without moving to the next line, but these days it’s mostly a relic that survives on operating systems like Windows, where it is used in conjunction with the line feed. This is mostly due to historical reasons dating back to the early days of computing, when designers felt very strongly about not deviating significantly from the workflow of TTYs.

The last character is the tab character (or horizontal tab character). It is used to move the cursor a certain number of spaces to the right. It is useful for indenting documents and whatnot, and the spacing depends on application settings.

[\u0020\u000A\u000D\u0009]+

See the Unicode code points in the regex? You use the format \uXXXX to specify a four-digit code point and \UXXXXXXXX for code points that need more than four digits.
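
One last sketch in Python, whose re module accepts the same escape format. It checks that the class matches the four characters and nothing else that merely looks like whitespace, and that Python's json module skips those same four characters around a value.

```python
import json
import re

# The regex from the post, verbatim.
JSON_WHITESPACE = re.compile(r'[\u0020\u000A\u000D\u0009]+')

assert JSON_WHITESPACE.fullmatch(" \n\r\t")
assert not JSON_WHITESPACE.fullmatch("\u00A0")   # no-break space
assert not JSON_WHITESPACE.fullmatch("\x0c")     # form feed

# The json module skips the same four characters around values...
assert json.loads(" \n\r\t[1, 2]\t\r\n ") == [1, 2]

# ...and refuses anything else, such as a form feed.
try:
    json.loads("\x0c[1, 2]")
    form_feed_ok = True
except json.JSONDecodeError:
    form_feed_ok = False

assert not form_feed_ok
```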

Conclusion

Like I said, this was going to be a simple and short blog post. Writing it gave the markdown renderer for this site a chance to stretch its legs, since I had to implement a few styling rules for certain markdown components like headers and hyperlinks. It all worked out beautifully in the end, so I’m pretty glad about that. I have a few more topics lined up for this blog, so stay tuned if you want to learn about the NES architecture that goes wildly into the weeds so often you’ll feel like you have ADHD, a basic introduction to image recognition and some matrix math, some weird Java facts you might have never heard of, and much more!

As for what I’m going to do during the holidays: probably some Minecraft Java modding (which should be done at the time of writing), learning music, learning some electrical engineering so I can build a cool retro sound card, writing a Rust crate that implements the signals primitive found in a lot of JavaScript web frameworks, learning more about the NES and digital signal processing, and maybe starting my second attempt at implementing a synthesizer framework in Rust. I definitely won’t be able to finish all of that by the end of the holidays, but a man can try. Hopefully I can write about more stuff once I have improved in the art of creating things, cuz this is what this blog is all about. I’ll see y’all next time.