Classification - BCP 47 (IETF Language Tag)
List of Languages BCP 47 IETF Language Tag
Ownership
Published and maintained by the Internet Engineering Task Force (IETF), which describes itself as the “premier Internet standards organization. It follows open and well-documented processes for setting these standards. Once published, those standards are made freely available.” (IETF | Internet standards)
Governance Structure
(IETF | Standards process ; IETF Structure and Internet Standards Process)
- Volunteer based. The organization consists mostly of volunteers, with some positions that are paid (secretary, etc.) or semi-paid. Expenses are sometimes covered through donations, etc.
- Peer reviews with representatives from different entities and individuals
- No formal membership and no dues; funded mostly through donations.
- Positions are filled by a nominating committee
- Contributors submit proposed standards to be reviewed by Area Directors
- Mode of communication = mostly mailing lists, with opportunities for input from industry groups and internet users.
Resource Specified
This example concerns BCP 47 language tags (Language localisation – Wikipedia; Information on BCP 47 » RFC Editor (rfc-editor.org))
It is a best practice, published as a guideline for the internet community, for classifying the language of web content. Its audience is, namely
- Entities publishing software and services related to web documents
- Content creators publishing web content
- Content consumers consuming web content
Examples
- Canadian English = en-CA (ISO language code for English; ISO country code for Canada)
- Canadian French = fr-CA
- Chinese used in Hong Kong = zh-Hant (Traditional Chinese) or zh-HK (Hong Kong Chinese, implying it uses the traditional character set); I have seen the use of zh-Hant-HK as well
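To make the structure of these example tags concrete, here is a minimal sketch in Python that splits a tag into its language, script, and region parts. The function name `split_tag` and the simplified pattern are my own assumptions for illustration; the full BCP 47 syntax allows more subtag types than this handles.

```python
import re

# Simplified pattern (an assumption for illustration): a language subtag,
# then an optional 4-letter script, then an optional region.
# The full BCP 47 grammar also allows other subtag types not handled here.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)

def split_tag(tag):
    """Return a dict of subtags, or None if the tag does not fit the simplified pattern."""
    match = _TAG.match(tag)
    if match is None:
        return None
    parts = match.groupdict()
    # Conventional casing: language lower case, script title case, region upper case.
    return {
        "language": parts["language"].lower(),
        "script": parts["script"].title() if parts["script"] else None,
        "region": parts["region"].upper() if parts["region"] else None,
    }

for example in ["en-CA", "fr-CA", "zh-Hant", "zh-HK", "zh-Hant-HK"]:
    print(example, split_tag(example))
```

With this sketch, zh-Hant yields a script but no region, zh-HK yields a region but no script, and zh-Hant-HK yields both, which is the difference between the three Hong Kong options listed above.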
Reusing the Examples
- It is the same example as the previous exercise but it describes a different aspect of the BCP 47 specification.
- This example is concerned with the use of the controlled vocabulary (the available valid options) in the specification.
Analysis
Strengths
- Simple to understand and use
- Built upon other widely used standards maintained by ISO
- Widely adopted with strong support from communities
- Country + Language combo works perhaps 90% of the time
- 2nd level is optional; en = English without a country specified
Weaknesses
- It is used mostly for written language; it does not deal well with spoken language in rich media (e.g., video), and it does not represent dialects or accents well, especially for countries with fragmented populations of minority cultures and dialects
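- In common use, only 2 levels: language and country. The specification does define further subtags (such as script and variant), but they are less familiar and less consistently supported, so more complicated hierarchical / regional structures are awkward to express; workarounds exist but sometimes create ambiguity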
- Example: Hong Kong and Taiwan both use the traditional Chinese character set; it is acceptable to set the content to zh-Hant to indicate the content is in Traditional Chinese (and suitable for viewing by readers from both Hong Kong and Taiwan), but it doesn’t indicate the culture. One could use zh-Hant-HK, but that would break the simple, easy-to-understand two-part pattern. These workarounds are also sometimes not universally supported
- Less frequently used languages, including indigenous languages, dialects, and sub-country regions, are hard to express in the common language + country form
- Non-binding nature
- It is a recommendation that most will follow in good faith, but not in all cases
- Implementation and usages are optional at the discretion of users
- Competing or obsolete standards
- Some companies / content creators might opt for different / older standards
- Standards and best practices evolve; it is costly for companies to keep up to date; old content might not be re-tagged based on new standards
- Loose enforcement of standards
- For web content, it is up to the content creator to self-tag the content with the proper language tag
- For example, tagging Canadian French content as British English might confuse users / external software, but there is no process to ensure accuracy or consistency
Likeliness to Encounter
- Very likely, as any web page is a candidate for classification
- Browsers use the language code to detect the language of content to offer services such as translation, etc.
- I will very likely be implementing these as a content publisher and software developer
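As a content publisher or developer, one way software might consume a tag is to fall back from a specific tag to a more general one when no exact match exists. The following is a minimal sketch of that idea, assuming a simple "drop the last subtag" rule; the names `fallback_chain`, `pick_content`, and `site_versions` are hypothetical, and this is not a claim about how any particular browser behaves.

```python
def fallback_chain(tag):
    """List progressively shorter tags, e.g. zh-Hant-HK, zh-Hant, zh."""
    subtags = tag.split("-")
    return ["-".join(subtags[:i]) for i in range(len(subtags), 0, -1)]

def pick_content(requested, available):
    """Return the first tag in the fallback chain found in `available`, else None."""
    # Compare case-insensitively, but return the tag as the site spelled it.
    available_lower = {a.lower(): a for a in available}
    for candidate in fallback_chain(requested):
        hit = available_lower.get(candidate.lower())
        if hit is not None:
            return hit
    return None

# Hypothetical set of language versions a site offers.
site_versions = ["en", "fr-CA", "zh-Hant"]

print(pick_content("zh-Hant-HK", site_versions))  # zh-Hant
print(pick_content("zh-HK", site_versions))       # None
```

Under this simplified rule, a request for zh-HK never reaches the zh-Hant content, because truncating zh-HK gives zh rather than zh-Hant. This is one way the ambiguity between the zh-HK and zh-Hant-HK workarounds can show up in practice.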
I think there is some confusion about this being a discussion of multiple standards. BCP 47 is specifically designed to work with other published standards, such as the HTML specification and the ISO 3166 country codes, and it inherits many of the limitations of these standards. For this particular example, we are looking only at the controlled vocabulary aspect of the standard.
For instance, a valid tag must combine a recognized language code with a recognized ISO country code. As one may see, the controlled vocabulary enjoys many of the strengths of the specification but also suffers from many of its weaknesses.
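The controlled vocabulary idea can be sketched as a simple membership check. The sets below are tiny illustrative subsets that I made up for this sketch, not the real vocabularies, and the function name `is_valid_language_region` is hypothetical; the sketch deliberately covers only the two-level language + country form discussed here.

```python
# Tiny illustrative subsets (assumptions); the real vocabularies are far larger.
KNOWN_LANGUAGES = {"en", "fr", "zh"}
KNOWN_REGIONS = {"CA", "GB", "HK", "TW"}

def is_valid_language_region(tag):
    """Check a tag of the form `language` or `language-REGION` against the sample sets."""
    parts = tag.split("-")
    if len(parts) == 1:
        return parts[0].lower() in KNOWN_LANGUAGES
    if len(parts) == 2:
        language, region = parts
        return language.lower() in KNOWN_LANGUAGES and region.upper() in KNOWN_REGIONS
    # Longer tags (e.g. with a script subtag) are outside this two-level sketch.
    return False

print(is_valid_language_region("en-CA"))  # True
print(is_valid_language_region("en"))     # True
print(is_valid_language_region("xx-CA"))  # False
print(is_valid_language_region("en-ZZ"))  # False
```

A check like this only confirms that the values come from the vocabulary. It cannot tell whether the tag actually matches the content, so Canadian French content tagged en-GB would still pass.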