Names are hard. Or not. (aka: TL;DR)

When I worked at the Journal of Asian Studies and redeveloped their books and book reviews database as a Django application, what took me the longest (perhaps discounting procrastination) was not the actual coding or testing, both of which went rather quickly once I was in a ‘zone,’ but rather the modeling of certain pieces of data.

These included but were not limited to ‘Books’ and publishers as well as lots of ‘utility containers/relationships,’ things more or less ontologically vapid on their own but necessary for modeling or business logic. My pet peeve and personal project, what became the monkey on my back, what I over-analyzed, etc., was how to represent names.

(Aside: Generally when speaking of tables and fields below, I’m thinking in terms of a relational database. Class and model will refer to Python classes in a Django project and representations of those database tables in Django … or any similar web application that uses an ORM.)

I. Names are hard.

The way of doing names you see across the internet and elsewhere is a matter of two form fields on a web page: First Name, Last Name. If you are U.S. American you’ll certainly see First Name, Last Name, and either Middle Name or Middle Initial on government forms.

Let us assume for the moment that this model works, regardless of whether it is correct. How long should these fields be? Eight characters is certainly too short (just think of Heimerdinger), and I wouldn’t be happy with 16; what if, after all, someone has a hyphenated last name made up of two 8+ character names? And what encoding scheme should you use? Is ASCII enough? UTF-8? Perhaps we should settle for 32 characters per field, Unicode …
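To make the length question a little more concrete, here is a minimal sketch in plain Python (the helper name lengths and the sample list are mine, purely for illustration). It shows that a limit counted in characters and a limit counted in bytes stop being the same thing the moment a name leaves ASCII.

```python
# Compare a name's length in characters with its length in UTF-8 bytes.
# The sample names are taken from this post; the helper is hypothetical.
names = ["Heimerdinger", "Björk Guðmundsdóttir", "Liszt Ferenc"]


def lengths(name: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes)."""
    return len(name), len(name.encode("utf-8"))


for name in names:
    chars, utf8_bytes = lengths(name)
    print(f"{name}: {chars} characters, {utf8_bytes} bytes in UTF-8")
```

Whether a field limit of 32 means characters or bytes depends on the database and the column type, so it is worth checking which one you are actually getting.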

Patrick McKenzie’s post, “Falsehoods Programmers Believe About Names,” is one of the better internet texts out there on the topic. It has inspired both examples and thoughtful contributions as well as rather ignorant and arrogant responses, and it begins with simple words of wisdom that demonstrate why names are hard:

John Graham-Cumming wrote an article today complaining about how a computer system he was working with described his last name as having invalid characters. It of course does not, because anything someone tells you is their name is — by definition — an appropriate identifier for them.

This is a theoretical limitation, less so a practical one, though it does still throw up a number of impediments. McKenzie goes over a number of those problems, the things most just do not think about … encoding, length … things that are more or less ‘assumptions’ that can be overcome through better design. “Anything someone tells you is their name is […] an appropriate identifier for them” is, however, not actually the case … there are legal roadblocks to all sorts of names, and some countries keep registries of allowed or disallowed names.

A given legal or cultural framework could guarantee that all personal names aligned with a narrow dictionary or system, so that a computer could algorithmically deal with all input, etc. Only so long, only so many parts, fixed semantics, fixed spelling, etc. And perhaps you’re writing an application for such a system. You’re fine. Then someone from outside that system comes along and wants to join. “Change your name to fit my rules!” you say, though [1] what is your authority for doing so and [2] if this person is a ‘customer’ do you want to lose them if they refuse? What if you are modeling “reality,” not the rules you want reality to follow?

Des June 17, 2010 at 8:26 pm #

Dude, you’re an idiot. Most of your steps are redundant with other steps. And hey, guess what! I’m writing US, English software. That means you have to abide by our rules if you want to be in our database.

And rules change.

We haven’t even gotten to little Bobby Drop Tables.

II. Our Needs

McKenzie is right: there is probably [1] no computer system currently that does names ‘correctly’ and [2] you’re unlikely to find one, mainly because of the false assumptions that go into designing them. But it’s quite possible that “First Name, Last Name” is perfectly good for a given application, and perhaps it was for ours.

As others have said … don’t let the perfect be the enemy of the good.

We dealt with [a] authors, [b] editors, and occasionally [c] translators. All of these people were adults, most were alive, and most were involved in publishing or academia, so they had already made the decision to use a “Name” that other people (and computer systems) could deal with. But because we were dealing with books/publications, we often had no freedom. A book reviewer? He or she could say, “please print my name as XYZ,” and we could say “we publish names as First name, last name, as well as any initials, titles, or suffixes you want to give us, Prof. Dr. Primo Secondo Tertius IV.” But a book? This was bibliographic work: you printed a name as it was published. Alfred Baxter Cooper sometimes published as A. B. Cooper, other times as Alfred B. Cooper, and in his earliest works as Alfred Cooper.

Don’t assume that Peter Singer, or even Peter A. Singer, is a Princeton ethicist … there’s nothing unique about names; someone might use multiple names; and you might not have the choice of picking one and only one of them for your system.

And so we needed at least two models: Name and Person.

At my previous journal job our authors and the like were almost always Anglo-American or German. First name, last name worked rather well. Furthermore, even when we got a ‘foreign’ name that none of us on the staff really recognized, we at least could say, “This is the first name, this is the last name.” I’d lived in Hungary and I already knew that first-last was not the only way to go, that if my landlord, for example, was Farkás Balint, then Balint was his given name and Farkás his last/family name. Last name, first name was common in a lot of cultures, many of which I was not that familiar with. I could not just look at a name and parse it properly.

III. There is an easy way to do names.

None of this might matter … ignore McKenzie’s last “assumption,” that people have names; most of the rest can be suitably worked around by providing a large enough Unicode character or text field. Forget name order, culture, and the like: just provide a single form field and put “the name” in there. Is it perfect? No … not at all.

And it would not be suitable for us, as a bibliographic database would require alphabetization of names, and would we want to, on a case-by-case basis, inspect a full name, and decide whether “Liszt Ferenc” should be under “L” or “F” (answer: L)? Perhaps we would add a heuristic to the database table: what’s the “culture” of the given name? Hungarian, Traditional Chinese, “Default/Western” (here we show our … privilege), and so on …

Python could quite nicely parse a string for us, tokenize it, and based on the heuristic tell us whether the first or last word was the family name. We still wouldn’t have the vons, vans, des, and similars under control. And what about Mary Bishop Carter? Bishop the middle name? Maiden name? A nice ‘southern’ double first name (“Mary Bishop” being the first name), or a double last name … see also: Alexander McCall Smith.
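A rough sketch of what that tokenize-and-guess approach might look like (the culture tags, the FAMILY_NAME_FIRST table, and the function name split_name are all invented here for illustration; this is not the code we ran):

```python
# Hypothetical "culture" tags mapped to whether the family name comes first.
FAMILY_NAME_FIRST = {
    "hungarian": True,
    "chinese": True,
    "western": False,  # the "default" mentioned above
}


def split_name(full_name: str, culture: str = "western") -> tuple[str, str]:
    """Return (family_name, given_names), judging only by position and culture."""
    tokens = full_name.split()
    if len(tokens) < 2:
        # One token (or none): there is nothing to split.
        return full_name.strip(), ""
    if FAMILY_NAME_FIRST.get(culture, False):
        return tokens[0], " ".join(tokens[1:])
    return tokens[-1], " ".join(tokens[:-1])


print(split_name("Liszt Ferenc", "hungarian"))  # ('Liszt', 'Ferenc')
print(split_name("Mary Bishop Carter"))         # ('Carter', 'Mary Bishop')
print(split_name("Alexander McCall Smith"))     # ('Smith', 'Alexander McCall')
```

The last two calls return an answer, but the function has no way of knowing whether it is the right one: position alone cannot tell a middle name from the first half of a double last name.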

It has become almost a game with me … when I peruse a bookstore I try to find whether he has been placed under ‘S’ next to the other Smiths or under ‘M’.

There would always be exceptions, but it would work pretty well, wouldn’t it? But exceptions would pile up. And by handling this at the Python level as business logic, what about my queries … the database would only know of the “Name” as a whole, so I could not alphabetize a set of results until I’d pulled them from the database and run them through my application layer … that didn’t seem right.

But … but … if you want a user’s or person’s “name” and you’re going to do little or nothing with it except address them, then forget first name and last name, and please leave out middle name or initial … just provide a general “Name”/“Preferred Name” field. You do not need to alphabetize, you do not need to parse deeply. You just may, however, be asking why you need “Name” at all.

IV. Or not …

We needed:

  • names broken down enough to alphabetize them
  • multiple names per person
  • a way to tell apart family-name-first and given-name-first schema

Regarding the first, perhaps we could have a single “Name” field and an auxiliary “alphabetized name” field, but this seemed a bit too de-normalized. Instead of an entire alphabetized name field, perhaps just an “alphabetize by” field with the last name. This would overcome the Hungarian-vs-German issue, let’s say. But what about Björk?

Björk Guðmundsdóttir, you ask? Yes, I reply. And of course you realize that “Guðmundsdóttir” is not her last name … it’s a patronymic … the whole “daughter of …” type thing. Back to rules and regulations: yes, history and culture are interesting, and after all when we look at all those at-one-point-roughly-Scandinavian “-sons” running around, we realize that at some point those were not “family names” but rather patronymics.
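Patronymics aside, the “alphabetize by” idea is easy enough to sketch. The example below uses Python’s built-in sqlite3 module rather than Django, and the table and column names are made up for illustration. It assumes we decide, case by case, what goes into the sort column, here filing Björk under her given name.

```python
import sqlite3

# In-memory database; schema and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE name ("
    "id INTEGER PRIMARY KEY, "
    "full_name TEXT NOT NULL, "
    "alphabetize_by TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO name (full_name, alphabetize_by) VALUES (?, ?)",
    [
        ("Peter Singer", "Singer"),
        ("Liszt Ferenc", "Liszt"),
        ("Alfred B. Cooper", "Cooper"),
        ("Björk Guðmundsdóttir", "Björk"),  # assumption: filed under the given name
    ],
)

# The database does the ordering, so no post-processing in the application layer.
# Note: SQLite's built-in NOCASE collation only folds ASCII letters.
rows = conn.execute(
    "SELECT full_name FROM name "
    "ORDER BY alphabetize_by COLLATE NOCASE, full_name"
).fetchall()
sorted_names = [row[0] for row in rows]
print(sorted_names)
```

The sort column answers “where do I file this?” and nothing more; it says nothing about what kind of name part it holds.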

Yet the perfect could not be the enemy of the good. Other people besides me would have to use this “system.” And I still did not have an answer as to what the “right” system was. Was there even one? So we settled: [1] first name (including multiple first names, middle names or initials, etc.), [2] last name(s) (however we should alphabetize), [3] suffixes (we didn’t care about Dr. and Prof. and the like, but Jr., III, and Esq. might pop up as relevant), and [4] a “name style,” telling us how we should present the name in address: first last, or last first. It was really no different than the “standard system” except that it aided in our multi-culti issue. And by separating people and names we could deal with collective names, pseudonyms, multiple spellings and the like.
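For the curious, the four fields we settled on look roughly like this when written as a plain Python dataclass. This is a sketch, not our actual Django model; the class, field, and method names are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class NameStyle(Enum):
    FIRST_LAST = "first last"
    LAST_FIRST = "last first"


@dataclass(frozen=True)
class Name:
    first: str                              # given names, middle names, initials
    last: str                               # whatever we alphabetize by
    suffix: str = ""                        # Jr., III, Esq.
    style: NameStyle = NameStyle.FIRST_LAST

    def display(self) -> str:
        """Present the name in the order its style asks for."""
        if self.style is NameStyle.LAST_FIRST:
            parts = [self.last, self.first]
        else:
            parts = [self.first, self.last]
        parts.append(self.suffix)
        return " ".join(p for p in parts if p)

    def sort_key(self) -> tuple[str, str]:
        """Alphabetize by last name, then first, ignoring case."""
        return (self.last.casefold(), self.first.casefold())


names = [
    Name("Primo Secondo", "Tertius", suffix="IV"),
    Name("Ferenc", "Liszt", style=NameStyle.LAST_FIRST),
    Name("Alfred B.", "Cooper"),
]
for n in sorted(names, key=Name.sort_key):
    print(n.display())
```

Note that the style only changes presentation; alphabetizing always goes through the last-name field, which is what made the Hungarian-vs-German case manageable.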

It was a mess, but it was our mess.

VI. And yet …

From time to time I would think about it some more. The “traditional” way was simply wrong; the simplified way (single field) was better for most uses … by which something had to be admitted: most uses were not about use at all. The name was being used only as a tag, a label. It had no grammar, and little functionality. It contained no knowledge. It was just an attribute … and it was precisely because most applications didn’t really care about the semantics or syntax of names that we could model 99% of cases so easily.

I thought, we need to keep the person-name divide. It was a truly many-to-many issue. “Stephen King” and “Richard Bachman” referred to the same man. When it came to genealogy all you really had were records, not people, and you were dealing with names on records, not people. Often there would be a corporate name used by all the “people” who wrote on a given series of books … one name, multiple people. And of course a “name” as sequence of words was not the same as “name” as identifier; while all the Nancy Drew titles were ghostwritten under the name Carolyn Keene, and so that Carolyn Keene name links a number of different people, not all Carolyn Keenes are linked to the Nancy Drew books. The database must have multiple John Smith entries, and so on.
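The person-name divide can be sketched with three tables. Again this is sqlite3 from the standard library, not the Django models, and every table, column, and helper name here is hypothetical. The ghostwriters are placeholders, not real people.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY, note TEXT);
    CREATE TABLE name (id INTEGER PRIMARY KEY, value TEXT NOT NULL);
    -- Junction table: one row per (person, name) pairing.
    CREATE TABLE person_name (
        person_id INTEGER NOT NULL REFERENCES person(id),
        name_id   INTEGER NOT NULL REFERENCES name(id),
        PRIMARY KEY (person_id, name_id)
    );
    """
)
conn.executemany(
    "INSERT INTO person (id, note) VALUES (?, ?)",
    [(1, "novelist"), (2, "ghostwriter A"), (3, "ghostwriter B"), (4, "unrelated person")],
)
conn.executemany(
    "INSERT INTO name (id, value) VALUES (?, ?)",
    [
        (1, "Stephen King"),
        (2, "Richard Bachman"),
        (3, "Carolyn Keene"),  # the collective pen name
        (4, "Carolyn Keene"),  # same words, different identifier
    ],
)
conn.executemany(
    "INSERT INTO person_name (person_id, name_id) VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 3), (3, 3), (4, 4)],
)


def names_of(person_id: int) -> list[str]:
    """All names attached to one person."""
    rows = conn.execute(
        "SELECT n.value FROM name n "
        "JOIN person_name pn ON pn.name_id = n.id "
        "WHERE pn.person_id = ? ORDER BY n.id",
        (person_id,),
    )
    return [r[0] for r in rows]


def people_behind(name_id: int) -> list[str]:
    """All people attached to one name record."""
    rows = conn.execute(
        "SELECT p.note FROM person p "
        "JOIN person_name pn ON pn.person_id = p.id "
        "WHERE pn.name_id = ? ORDER BY p.id",
        (name_id,),
    )
    return [r[0] for r in rows]


print(names_of(1))       # ['Stephen King', 'Richard Bachman']
print(people_behind(3))  # ['ghostwriter A', 'ghostwriter B']
print(people_behind(4))  # ['unrelated person']
```

The name table deliberately has no uniqueness constraint on value, which is what lets two different “Carolyn Keene” rows (or many John Smiths) coexist.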

But back to names …

A given person might have multiple names, a “legal” name, perhaps, a “nickname,” a “preferred name,” aliases, pseudonyms, collective names, etc. But a given name could consist of so many more parts than first, middle and last. First we’d have to add matronymics and patronymics. Various prefixes and suffixes. Multiple of most of these. Instead of a set number of “fields” for a given “Name” object, we’d use a Name object as a container. Name Components would then have [a] a foreign key back to the Name, [b] a value (the “name” part), [c] a “name type” (first, last, middle, patronymic, etc.), and [d] a rank or order … to present the “Name,” collect all the “Components” and order them by rank, first to last. The name as a whole would also have a “style” attribute telling us the cultural background; this could be a foreign key to a “Style” object that had [i] the style, [ii] a description of it, and [iii] a boolean value as to whether the family name was presented first or last … (admittedly, the “rank” of the components might obviate the need for this).
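As a sketch of the container idea, here is an in-memory version using dataclasses. In a database the Component would carry a foreign key to its Name; in this example the Name simply holds a list. The type labels ("given", "family", "patronymic") and the alphabetize_by fallback rule are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    value: str  # the name part itself
    kind: str   # hypothetical labels: "given", "family", "patronymic", ...
    rank: int   # position when the whole name is presented


@dataclass
class Name:
    # Stands in for the "Style" lookup; unused below because rank already
    # encodes the order, which is the redundancy suspected in the text.
    family_name_first: bool = False
    components: list[Component] = field(default_factory=list)

    def _ordered(self) -> list[Component]:
        return sorted(self.components, key=lambda c: c.rank)

    def presented(self) -> str:
        """Collect the components and join them by rank, first to last."""
        return " ".join(c.value for c in self._ordered())

    def alphabetize_by(self) -> str:
        """Use the family name if there is one, else the first-ranked part."""
        ordered = self._ordered()
        family = [c.value for c in ordered if c.kind == "family"]
        if family:
            return " ".join(family)
        return ordered[0].value if ordered else ""


liszt = Name(
    family_name_first=True,
    components=[Component("Ferenc", "given", 2), Component("Liszt", "family", 1)],
)
bjork = Name(
    components=[
        Component("Björk", "given", 1),
        Component("Guðmundsdóttir", "patronymic", 2),
    ],
)
print(liszt.presented(), "->", liszt.alphabetize_by())
print(bjork.presented(), "->", bjork.alphabetize_by())
```

A name with no family component falls back to whatever is presented first, which is one possible rule among several, not a claim about how any particular culture files its names.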

Business logic could tell us how to alphabetize, e.g. “if there is a surname, present surname, first name ….; else […]” Prefixes and suffixes were a separate table (so that there could be multiples of each). But then you have the issue (back to ‘von’ and ‘van,’ for example, and more!) of “particles” that are not themselves “names” (not a last name, not a first name, not a patronymic, nickname, etc.), but have a syntactic or semantic function. Take the various Semitic patronymic/matronymic particles, such as Arabic “ibn” or “ben” (etc.) or Hebrew “ben” or “bat,” or the Aramaic “bar.” They function like the Scandinavian “-son,” but are not part of the name, but a separate “word” preceding the name like a prefix. So a “prefix” or “title” is not something you would only put before the name as a whole, just as a “suffix” may not be tied to the name as a whole, but a part of the name.

VII. A Way Out … By Going In?

And this matter of prefix and suffix particles hints at a solution, if not the solution: recursion and nesting.

A “Name” is given; it has a value (one or more words/tokens in series), a type and style (as above, more or less). By way of an intermediate table it is a parent to zero or more children, each of which is a Name as well. [1] The parent-child relationship gets a rank/order (as above) and [2] since a child is a Name, it may have children as well. Now prefixes and suffixes can be treated, if desired, merely as Name values themselves (of the type prefix or suffix … or title for some of the prefixes, and so on). The depth one wishes to follow a long “Name” to is somewhat arbitrary; all Names could have depth of one with no internal structure. The vast majority of “Western” names would suffice with one parent and two to four children at a single level (first name and last, or first-middle-last …). Only a small number would require extra levels of depth. This is just a variation on the previous model.
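A minimal sketch of the nested version, again as a plain dataclass. The list of (rank, child) pairs stands in for the intermediate table, the type labels are hypothetical, and “Yosef ben Shimon” is an invented example, not anyone from our database.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Name:
    value: str = ""       # tokens for a leaf; may stay empty on a parent
    kind: str = "full"    # hypothetical labels: "given", "particle", ...
    # (rank, child) pairs play the role of the intermediate table.
    children: list[tuple[int, Name]] = field(default_factory=list)

    def add(self, rank: int, child: Name) -> Name:
        """Attach a child at the given rank; returns self so calls can chain."""
        self.children.append((rank, child))
        return self

    def render(self) -> str:
        """A leaf renders its own value; a parent renders its children by rank."""
        if not self.children:
            return self.value
        ordered = sorted(self.children, key=lambda pair: pair[0])
        return " ".join(child.render() for _, child in ordered)

    def depth(self) -> int:
        """1 for a name with no internal structure, plus 1 per nested level."""
        if not self.children:
            return 1
        return 1 + max(child.depth() for _, child in self.children)


flat = Name(value="Peter Singer")

simple = (
    Name()
    .add(1, Name("Peter", "given"))
    .add(2, Name("Singer", "family"))
)

patronymic = (
    Name(kind="patronymic")
    .add(1, Name("ben", "particle"))
    .add(2, Name("Shimon", "given"))
)
nested = Name().add(1, Name("Yosef", "given")).add(2, patronymic)

print(flat.render(), flat.depth())      # Peter Singer 1
print(simple.render(), simple.depth())  # Peter Singer 2
print(nested.render(), nested.depth())  # Yosef ben Shimon 3
```

The particle sits inside the patronymic rather than in front of the name as a whole, which is the point of nesting: a prefix can belong to a part, not just to the full name.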

But why propose it?

Answer: a kind of ontological elegance and simplicity. The only main object we have is “Name,” and parent and child are of the same type. The auxiliary models/tables—Type and Style—inform Name at every level. Furthermore their values could be drawn from a list of possible/allowed values, so that instead of separate models, they could just represent options in the database. The only other atomic structure to be added, if desired, would be treating “name value” as a foreign key to a “Signifier” (since “Name” was already taken); all values could/would be kept track of in a separate table, allowing us to add a text field describing a given signifier (name etymology, for example), but this part is not necessary.

About Steve

47 and counting.