
Since a string doesn't have any font rendering metrics (in that it lacks a font or size), I'm not sure how you expect a language's String implementation to take it into account. Similarly, bytes will change based on encodings, which most people would expect a language's String type to abstract over. Do you really want UTF8 and UTF16 strings to behave differently and introduce even more complexity to a very complex system?

There are languages whose orthographies don't fit the Unicode grapheme cluster specification, but they're complex enough that I doubt there's any way to deal with them properly other than having someone proficient in them looking over your text processing or pawning it off to a library. At least with grapheme clusters your code won't choke on something as simple as unnormalized Latin text.



It's not about encodings at all, actually. It's about the API that is presented to the programmer.

And the way you take it all into account is by refusing to accept any defaults. So, for example, a string type should not have a "length" operation at all. It should have "length in code points", "length in graphemes" etc operations. And maybe, if you do decide to expose UTF-8 (which I think is a bad idea) - "length in bytes". But every time someone talks about the length, they should be forced to specify what they want (and hence think about why they actually need it).

Similarly, strings shouldn't support simple indexing - at all. They should support operations like "nth codepoint", "nth grapheme" etc. Again, forcing the programmer to decide every time, and to think about the implications of those decisions.
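A minimal sketch of what such an API could look like (Python, with a hypothetical `ExplicitStr` wrapper; grapheme-level operations are omitted because they would need a segmentation library the standard library doesn't provide):

```python
# Hypothetical wrapper illustrating the "no default length/indexing" idea.
class ExplicitStr:
    def __init__(self, text):
        self._text = text

    def len_code_points(self):
        # Python 3 strings are sequences of code points
        return len(self._text)

    def len_utf8_bytes(self):
        return len(self._text.encode("utf-8"))

    def nth_code_point(self, n):
        return self._text[n]

s = ExplicitStr("caf\u00e9")   # "café": 4 code points, 5 UTF-8 bytes
print(s.len_code_points())     # 4
print(s.len_utf8_bytes())      # 5
```

There is deliberately no `len()` or `[]`: every caller has to say which unit they mean.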

It wouldn't solve all problems, of course. But I bet it would reduce them significantly, because wrong assumptions about strings are the most common source of problems.


What do you mean by "expose UTF-8"? Because nothing about UTF-8 requires that you give byte access to the string.

As for indexing, strings shouldn't require indexing, period. That's the ASCII way of thinking, especially with fixed-width columns and the like. You should be thinking relatively. For example: find the first space, then check that the character after that point is a letter. When you build your code that way, you don't fall into the trap of byte indexing, or take the performance hit of code point indexing (UTF-8) or grapheme indexing (all encodings).
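That "find the first space, then look at what follows" style can be sketched with a plain character iterator (Python; the function name is illustrative):

```python
def letter_follows_first_space(text):
    # Relative navigation: locate the first space, then inspect the
    # character immediately after it -- no absolute indexing anywhere.
    it = iter(text)
    for ch in it:
        if ch == " ":
            nxt = next(it, None)   # character after the space, if any
            return nxt is not None and nxt.isalpha()
    return False                    # no space found

print(letter_follows_first_space("hello world"))   # True
print(letter_follows_first_space("hello 9world"))  # False
```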


There are real-world textual data types for which your idealized approach simply does not work. As in, it would be impossible or impossibly unwieldy to validate conformance to the type using your approach, because they require indexing to specific locations, or determining length, or both.

For example, I work for a company that does business in the (US) Medicare space. Every Medicare beneficiary has a HICN -- Health Insurance Claim Number -- and HICNs come in different types which need to be identified. Want to know how to identify them? By looking at prefix and suffix characters in specific positions, and the length of what comes between them. For example, the prefix 'A' followed by six digits means the person identified is the primary beneficiary and was first covered under the Railroad Retirement Board benefit program. Doing this without indexing and length operations is madness.

These data types can and should be subjected first to some basic checks to ensure they're not nonsense (e.g., something expected to be a numeric value probably should not contain Linear B code points). It's probably a good idea to at least throw a regex at the input first, though applying regexes to Unicode also has quirks people don't often expect at first.
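A hedged sketch of such a first-pass check (Python; the 'A' + six digits pattern is the one shape described above, simplified, and `re.ASCII` guards against one of those Unicode quirks, namely `\d` matching non-ASCII digits):

```python
import re

# Simplified first-pass check for one HICN shape: prefix 'A' followed
# by exactly six digits.  re.ASCII keeps \d from matching non-ASCII
# digits such as the fullwidth forms.
RRB_PRIMARY = re.compile(r"A\d{6}", re.ASCII)

def looks_like_rrb_primary(hicn):
    return RRB_PRIMARY.fullmatch(hicn) is not None

print(looks_like_rrb_primary("A123456"))   # True
print(looks_like_rrb_primary("A12345"))    # False: only five digits
# Fullwidth digits would slip through \d without re.ASCII:
print(looks_like_rrb_primary("A\uff11\uff12\uff13\uff14\uff15\uff16"))  # False
```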


I don't see why this would be hard with iterators. You have an iterator to the start of the HICN, either at the start of the string or deep within it. Take a second iterator and set it to the first. Loop six times, advancing that iterator and checking that each character is a digit. Then check whether the next position is a space.

For the prefix and suffix and how many characters between them you do the above but use the second iterator to find the suffix. Then you either keep track of how many characters you advanced or ask for how many characters between the two.

It's very easy to think about it this way, as that's how a normal (non-programmer) human would do it. Basically, the code literally does what you wrote in English above.

My point is that iterators are much faster than indexing when the underlying string system uses graphemes. You can do pretty much anything just as easily, or more easily, with iterators as with indexing. The big exception is fixed-width columnar text files. I've seen a lot of these in financial settings, but fortunately those systems are ASCII-based, so it's not an issue.
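The walk described above can be sketched with a Python character iterator (the HICN rules are simplified to the 'A' + six digits + following space shape from upthread):

```python
def match_rrb_primary(it):
    # Consume 'A', then six ASCII digits, then require a space or end
    # of input.  `it` is any character iterator, possibly already
    # advanced deep into a larger string by earlier scanning.
    if next(it, None) != "A":
        return False
    for _ in range(6):
        ch = next(it, None)
        if ch is None or not ("0" <= ch <= "9"):
            return False
    nxt = next(it, None)
    return nxt is None or nxt == " "

print(match_rrb_primary(iter("A123456 rest")))  # True
print(match_rrb_primary(iter("A12345X rest")))  # False
```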


You're not really changing anything, though; you're basically saying that instead of indexing to position N, you're going to take an iterator and advance it N positions, and somehow say that's a completely different operation. It isn't a different operation, and doesn't change anything about what you're doing.

If you want to argue that there should be ways to iterate over graphemes and index based on graphemes, then that is a genuine difference, but splitting semantic hairs over whether you're indexing or iterating doesn't get you a solution.


If the string is stored as ASCII characters or fixed-width Unicode code points (UCS-2 or UTF-32), then you are correct that not much changes. But if the string is in UTF-8 or UTF-16, or the string system uses graphemes, then indexing goes from O(1) to O(N). Every index operation would have to start a linear scan from the beginning of the string to get to the correct spot. With an iterator, it's quick both to access what it points to and to advance it.

My argument is that iterators are far superior to indexing when using graphemes (or code points stored as UTF-8 but grapheme support is superior). And they don't hurt when used on ASCII or fixed width strings either so the code will work with either string format. No hairs, split or otherwise here.
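Why code point indexing over UTF-8 is O(N) is easy to see in a sketch: there is no way to reach code point n without walking every byte before it. Here the scan is done by hand over a raw bytes buffer (Python; a real string type would do this internally):

```python
def nth_code_point_utf8(buf, n):
    # Walk the UTF-8 bytes counting lead bytes (anything that is not a
    # 0b10xxxxxx continuation byte starts a new code point).
    count = -1
    for i, b in enumerate(buf):
        if b & 0xC0 != 0x80:            # lead byte: new code point
            count += 1
            if count == n:
                j = i + 1               # gather this code point's bytes
                while j < len(buf) and buf[j] & 0xC0 == 0x80:
                    j += 1
                return buf[i:j].decode("utf-8")
    raise IndexError(n)

data = "na\u00efve".encode("utf-8")     # 'ï' is two bytes in UTF-8
print(nth_code_point_utf8(data, 2))     # ï
```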


I agree that iterators generally make more sense with strings. But sometimes, you really do want to operate on code points - for example, because you're writing a lexer, and the spec that you're implementing defines lexemes as sequences of code points.


That's why the search functions need to be more intelligent. If you pass the search function a grapheme, it will do more work. If it notices that the grapheme you passed in is just a single code point, it can do a code point scan. And if the internal representation is UTF-8 and it sees you passed in an ASCII character (very common in lexing/parsing), it can just do a fast byte scan for it.

Now if the spec thinks identifiers are just a collection of code points then it's being imprecise. But things would still work if the lexer/parser you wrote returns identifiers as a bunch of graphemes because ultimately they're just a bunch of code points strung together.

It's only in situations where you need to truncate identifiers to a certain length that graphemes become important. Also normalizing them when matching identifiers would also probably be a good idea.
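The normalization point can be made concrete with Python's `unicodedata` (NFC is one reasonable choice of normalization form for identifier matching):

```python
import unicodedata

def identifiers_match(a, b):
    # Compare identifiers on a normalized form (NFC here), so that
    # precomposed U+00E9 and 'e' + combining acute (U+0301) compare
    # equal even though their raw code point sequences differ.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(identifiers_match("caf\u00e9", "cafe\u0301"))  # True
print("caf\u00e9" == "cafe\u0301")                   # False: raw sequences differ
```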


int_19h's approach is still valid for this; you're asking for whole displayed characters, each composed of some number (which you don't need to know) of code units within the memory segment(s) that hold the string.

Based on your description, the correct solution is probably to use a structure or class of a more regular format to store the decoded HICN in pre-parsed form. If those runs of text really only allow digits, you might save space and speed up comparison/indexing by doing this.


It's more that I get tired of people declaring that indexing and length operations need to be completely and utterly and permanently forbidden and removed, and then proposing that they be replaced by operations which are equivalent to indexing and length operations.

Doing these operations on sequences of code points can be perfectly safe and correct, and in 99.99%+ of real-world cases probably will be perfectly safe and correct. My preference is for people to know what the rare failure cases are, and to teach how to watch out for and handle those cases; the other approach is to forbid the 99.99% case to shut down the risk of mis-handling the 0.01% case.


When people say they should be removed they mean primitive operations (like a standard 'length' attribute/function, or an array index operator) shouldn't exist for that type.

Just as it is better to have something like .nth(X) as a function for stepping to a numbered node, so too does a language's string type demand operations like .nth_printing(X), .nth_rune(X), and .nth_octet(X), to make the intent clear to any programmer working with that code.


Semantically equivalent, yes; access-time equivalent for variable-width strings, no. One of the reasons for Python 3's odd internal string format is that they wanted to keep indexing and have it be O(1). The reason I favor replacing indexing with iterators is that it removes this restriction: they could have made the internal format UTF-8 and/or easily added support for graphemes.

I prefer to have a system where 100% of the cases are valid and teaching people corner cases is not required. We all know how well teaching people about surrogate pairs went. And we're not forbidding the 99.99% case, but providing an alternative way to accomplish the exact same thing. The vast majority of code uses index variables as a form of iterator anyway, so it's not that big of a change.

The main reason people keep clinging to indexing strings is that's all they know. Most high level languages don't provide another way of doing it. People who program in C quickly switch from indexing to pointers into strings. Give a C programmer an iterator into strings and they'll easily handle it.


By "expose UTF-8" I mean exposing the underlying UTF-8 representation of the string directly on the object itself, instead of going through a separate byte array (or byte array view, to avoid copying).


Ah, I see. I agree that it would be a bad idea to give access to the UTF-8 representation.

As for length in bytes, a good way to handle most use cases there is to have a function that truncates the string to fit into a certain number of bytes. That way you can make sure it fits into whatever fixed-size buffer you have, and the truncation happens at the grapheme level.
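A minimal sketch of that truncation (Python; this version cuts at code point boundaries rather than grapheme boundaries, since grapheme segmentation would need a library the standard library lacks):

```python
def truncate_to_bytes(text, max_bytes):
    # Encode, cut at the byte budget, and decode with errors="ignore":
    # since the buffer is valid UTF-8, the only damage the cut can do
    # is leave an incomplete trailing sequence, which "ignore" drops.
    return text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")

print(truncate_to_bytes("caf\u00e9", 4))  # caf   (cutting mid-'é' drops it)
print(truncate_to_bytes("caf\u00e9", 5))  # café  (fits exactly)
```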


This is exactly how Swift’s Strings work.



