Single language vs. multi-language
Introduction
During my daily forum visits, I came across a thread about how to create a game in more than one language.
This reminded me of the first multi-language game I worked on. We ran into several issues back then and I want to share my experience with you, so you don’t make the same mistakes.
When you create a multi-language game, several things change that would otherwise be static. Here is a list of a few things to consider:
- You can’t hard-code every text, because you’re usually not the person who translates text.
- You can’t assume texts have the same length in different languages.
- Prepare your GUI to word-/line-break and squeeze text.
- Different languages might imply different character ranges (char, wchar, utf-8, choose which one to support before you start)
- Different languages might imply different font definitions
- Different languages might imply different graphics: title logo, textures with baked-in text, etc
- Different languages might imply different sounds: voice output
- Being able to change language in the options menu, in case it is required.
- Being able to change language at runtime during development.
- Testing time increases with more languages
- You might need to communicate with people who don’t speak your language (bug reports etc.)
- Several people involved in development have more work to do: programmer, translator, quality assurance (and maybe artist and sound artist)
Localization from a programmer’s perspective
If text is available in more than one language, it must not be hard-coded in the game code.
Functionality to retrieve texts depending on the current language setting is essential. It can be something as simple as a function that expects an identifier representing the text in question and returns its text, or a database query.
We decided to edit texts in Microsoft Excel, because it’s standard in many offices and we didn’t want to force translators to use some unknown custom in-house tool that they might not understand and that we would have to support.
The spreadsheet structure looked like:
    ID         | EN          | DE         | DESC
    -----------+-------------+------------+------------------------------------------------------
    helloworld | HELLO WORLD | HALLO WELT | Displayed in the welcome screen
    goodday    | GOOD DAY    | GUTEN TAG  | Displayed when the player confirms the welcome screen
    ...        |             |            |
ID represents the text identifier that is used to query the translated text.
EN and DE are the English and German translations.
DESC is an optional field that describes the purpose of this text. It can be quite challenging to find a good translation when you don’t know the context of the text, that’s what DESC should solve.
Localization, 1st iteration
Our first language system iteration worked like this:
- Create texts in Microsoft Excel and save as XML
- Custom tool to convert XML to our text format
- Custom tool outputs .cpp file with texts as array
- Custom tool outputs .h file with identifier #define’s that index into the text arrays
In order to get the translated text, we had a function that expected one of the generated identifiers and returned the corresponding text. The text system looked like this:
language.h file:
    #ifndef __language_h__
    #define __language_h__

    #define TID_helloworld 0
    #define TID_goodday    1
    #define TID__MAX       2

    #define LANG_EN 0
    #define LANG_DE 1

    const char* GetText(unsigned int textId);

    #endif // __language_h__
language.cpp file:
    #include <assert.h>
    #include "language.h"

    // initial language is english
    int CurrentLanguage = LANG_EN;

    // german texts
    const char* const TEXT_DE[] =
    {
        "HALLO WELT",
        "GUTEN TAG"
    };

    // english texts
    const char* const TEXT_EN[] =
    {
        "HELLO WORLD",
        "GOOD DAY"
    };

    // get text of the specified text identifier
    const char* GetText(unsigned int textId)
    {
        assert(textId < TID__MAX); // invalid text id

        switch(CurrentLanguage)
        {
            case LANG_DE: return TEXT_DE[textId];
            default: break;
        }

        // english is the fallback language
        return TEXT_EN[textId];
    }
Once we had worked with it for a while, we realised that updating the language file was a time killer.
Every time we modified text and exported, the header file changed, so all source files that include language.h were recompiled due to header dependencies. This was a huge problem for us: nobody wanted to wait several minutes for a recompile only because someone had fixed a typo in a translation or added a new text.
Localization, 2nd iteration
What we learned from the 1st approach is that no matter whether we change, add or even remove text, it must not have a significant influence on compile times.
Rather than using generated identifiers, we used CRC32 checksums of string literals to identify texts. This completely removed the header dependency / recompile problem! Our text system now worked like this:
- Create texts in Microsoft Excel and save as XML
- Custom tool to convert XML to our text format, verify that all text identifiers generate unique checksums
- Custom tool outputs .cpp file with texts as array and checksum lookup table
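The verification step in the custom tool can be quite small. The following is only a sketch of how such a check might look, not the original tool: HashedId and SortAndVerifyUnique are hypothetical names, and the hashes are assumed to have been computed already. It sorts the entries by hash, which the binary search at runtime needs anyway, and reports whether two ids share a checksum.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical tool-side record: a text id and its precomputed checksum.
struct HashedId
{
    unsigned int hash;
    std::string  id;
};

// Sorts the entries by hash in ascending order and returns true
// if no two entries share the same hash.
bool SortAndVerifyUnique(std::vector<HashedId>& entries)
{
    std::sort(entries.begin(), entries.end(),
              [](const HashedId& a, const HashedId& b) { return a.hash < b.hash; });

    // after sorting, colliding hashes are direct neighbours
    auto duplicate = std::adjacent_find(entries.begin(), entries.end(),
              [](const HashedId& a, const HashedId& b) { return a.hash == b.hash; });

    return duplicate == entries.end();
}
```

If the check fails, the tool would presumably refuse to export and ask for one of the two ids to be renamed.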
In order to get a translated text, GetText() now expects a string literal as id, generates a CRC32 of the incoming id, performs a binary search on the checksum table and then uses the lookup position to index into the language array.
This even allowed us to return the text id when the translated text was not found, so the tester could add the id of the missing text to the bug report. It also allowed us to switch at runtime to displaying ids rather than texts (“uh, what text id is displayed here?” belongs to the past).
The new text system was a bit slower than the 1st iteration, but it had no significant influence on the overall runtime performance.
We worried more about typos in text ids, as those are now strings located in game code, but I think this was never really a problem.
We added text ids to the Excel file first, then always copied and pasted them from Excel into the game code. However, we couldn’t be 100% certain that the QA team hunted down every missing text. We still needed a verification system that does this automatically and 100% reliably, but more on this later.
Due to the additional checksum lookup table and the string literal ids in the code itself, the game also requires more memory.
Most development systems feature some kind of debug memory (e.g. no$gba has an option to emulate 8MB rather than 4MB main memory), so at least during development this is not a problem. More on this later, again.
language.h looked like this:
    #pragma once

    enum Languages
    {
        LANG_EN,
        LANG_DE,
    };

    const char* GetText(const char* textId);
We also changed from #defines to enums, because those are much more debugger friendly.
language.cpp looked like this:
    #include "language.h"

    // CRC32 of a null-terminated string, implemented elsewhere
    unsigned int CalcCRC32(const char* text);

    // initial language is english
    Languages CurrentLanguage = LANG_EN;

    // crc32 checksums / hashes of text id's
    // in sorted ascending order
    const unsigned int TEXT_IDs[] =
    {
        12345, // text id: helloworld
        23456, // text id: goodday
    };

    // german texts, same order as TEXT_IDs
    const char* const TEXT_DE[] =
    {
        "HALLO WELT",
        "GUTEN TAG"
    };

    // english texts, same order as TEXT_IDs
    const char* const TEXT_EN[] =
    {
        "HELLO WORLD",
        "GOOD DAY"
    };

    // performs a binary search on the TEXT_IDs array
    // and returns the index where the hash is located, or
    // -1 when it could not be found.
    int FindTextIndex(unsigned int hash)
    {
        int left = 0;
        int right = (sizeof(TEXT_IDs) / sizeof(TEXT_IDs[0])) - 1;

        while (left <= right)
        {
            int index = (left + right) / 2;

            if (hash == TEXT_IDs[index])
                return index; // hash found, leave!

            if (hash > TEXT_IDs[index])
                left = index + 1;
            else
                right = index - 1;
        }

        return -1; // hash not found
    }

    // get text of the specified text identifier,
    // returns the textId if text could not be found
    const char* GetText(const char* textId)
    {
        // generate checksum / hash of incoming text id
        unsigned int hash = CalcCRC32(textId);

        // search for the checksum / hash in our TEXT_IDs array
        int index = FindTextIndex(hash);
        if (index == -1)
        {
            // text not found, return the id instead!
            return textId;
        }

        switch(CurrentLanguage)
        {
            case LANG_DE: return TEXT_DE[index];
            default: break;
        }

        // english is the fallback language
        return TEXT_EN[index];
    }
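CalcCRC32() itself is not shown here. A minimal sketch follows, assuming the common reflected CRC-32 variant (polynomial 0xEDB88320, as used by zip and PNG). Our actual implementation may have used a different variant, and a table-driven version would be faster. Whatever variant is chosen, the game and the text tool must use the same one.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of CalcCRC32(): bitwise CRC-32 over a null-terminated string.
// Assumption: standard reflected CRC-32, polynomial 0xEDB88320.
unsigned int CalcCRC32(const char* text)
{
    std::uint32_t crc = 0xFFFFFFFFu;

    for (const unsigned char* p = reinterpret_cast<const unsigned char*>(text); *p != 0; ++p)
    {
        crc ^= *p;

        for (int bit = 0; bit < 8; ++bit)
        {
            // shift out the lowest bit and apply the polynomial if it was set
            const std::uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }

    return crc ^ 0xFFFFFFFFu;
}
```

With this variant, the well-known check string "123456789" hashes to 0xCBF43926.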
Localization, 3rd iteration
The 2nd iteration was not bad, but as the project progressed, new requirements popped up.
We not only needed to display texts in different languages, we also had to display different title logos. At first we just hacked the game code to load different resources depending on the language setting.
But we should have known that this would not be enough to fulfil the artists’ vision. Before cluttering the whole game code with such special cases, we decided that it should be handled automatically, without any action from a programmer, and this was easier than we thought.
We already used a kind of file archive, where all game content is stored, to load files from. Think of it as a zip archive. In order to load language-dependent resources, all we had to do was support more than one file archive and prioritize them. When the game requests a file, the file archives are searched by priority.
We added an additional file archive with German content and gave it a higher priority. When the game requested “title.bmp”, the German archive was searched first. If the resource could not be found there, the next archive was consulted. This allowed us to add language-dependent resources without any programmer work!
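The priority search described above fits in a few lines. This is only a sketch: Archive and FileSystem are hypothetical stand-ins, and a std::map of file names to contents replaces our real archive format.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a file archive: file name -> file content.
struct Archive
{
    std::map<std::string, std::string> files;
};

// Holds the mounted archives, ordered by priority (highest first).
struct FileSystem
{
    std::vector<const Archive*> archives;

    // Returns the content from the first archive that contains the file,
    // or nullptr if no archive has it.
    const std::string* Find(const std::string& name) const
    {
        for (const Archive* archive : archives)
        {
            auto it = archive->files.find(name);
            if (it != archive->files.end())
                return &it->second;
        }
        return nullptr;
    }
};
```

Mounting the German archive in front of the base archive makes "title.bmp" resolve to the German logo, while every file that exists only in the base archive is still found there.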
Localization, 4th iteration
Having all languages in main memory is quite a waste, at least on systems that don’t feature hundreds of megabytes. In my experience, non-text-heavy games contain about 700 texts, whereas edutainment games can contain thousands.
If every text were 50 chars long and there are 700 texts, that’s 50*700 = 35000 chars, which in ASCII is about 34 KB for one language! 5 languages sum up to 34 KB*5 = 170 KB.
This is more than 4% of the 4 MB Nintendo DS main memory, only for text! Spending that much precious memory on text is not really an option if you could spend the 136 KB wasted on the four inactive languages on a larger level, more sounds or more textures instead.
On memory limited systems it makes sense to have one language in memory only, namely the current language. However, this comes with several things to consider:
- Memory consumption is different for different languages.
- GetText() returns different pointers for different languages.
Different memory consumption is a huge problem. It’s unacceptable if some levels don’t load when the German language is activated, because the German texts consume 2 KB more memory than the English ones. It also makes it impossible to replace entire language files on the fly, for example when the language setting is changed in the options, without a delete/new mechanism.
Furthermore, it’s awkward when GetText() returns a different pointer for every language: GUI widgets can no longer store pointers to texts, because those would point to arbitrary memory once the language setting changes.
The secret is to always try to keep memory consumption as static as possible! Our custom text tool compared the texts of every language and padded the shorter translations with zero-bytes so that they consume as much space as the longest one, for example:
You won.000000000
Du hast gewonnen.
Hello World
Hallo Welt0
Where “0” represents the 0x00 padding byte.
This makes sure that:
- every language file has the exact same size.
- offsets to texts inside the language file are always the same.
This approach makes it possible to allocate memory for the language file once and keep working with that buffer, because the size is the same for all language files of the same category. You can load another language file into this buffer and the text system still works.
When text is located in a language file, we also no longer have the const char* overhead from our text arrays. Just make sure to null-terminate every text!
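To make the padding idea concrete, here is a sketch of the tool side. TextLayout, BuildLayout and BuildBlob are hypothetical names, and the real file format surely differed. Every text gets a slot as large as its longest translation plus the terminating zero, so the offsets are shared by all languages.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Layout shared by all languages: where each text starts inside the file.
struct TextLayout
{
    std::vector<std::size_t> offsets; // offsets[i] = start of text i
    std::size_t totalSize = 0;        // size of every language file
};

// languages[l][i] is text i in language l; every language must have the same number of texts.
TextLayout BuildLayout(const std::vector<std::vector<std::string>>& languages)
{
    TextLayout layout;
    const std::size_t textCount = languages.empty() ? 0 : languages[0].size();

    for (std::size_t i = 0; i < textCount; ++i)
    {
        std::size_t longest = 0;
        for (const auto& texts : languages)
            if (texts[i].size() > longest)
                longest = texts[i].size();

        layout.offsets.push_back(layout.totalSize);
        layout.totalSize += longest + 1; // +1 for the terminating 0x00
    }
    return layout;
}

// Writes one language into a zero-filled buffer of the shared size;
// the untouched zero bytes are the padding.
std::vector<char> BuildBlob(const TextLayout& layout, const std::vector<std::string>& texts)
{
    std::vector<char> blob(layout.totalSize, '\0');
    for (std::size_t i = 0; i < texts.size(); ++i)
        std::memcpy(blob.data() + layout.offsets[i], texts[i].data(), texts[i].size());
    return blob;
}
```

At runtime the game allocates totalSize bytes once. Switching the language just copies another blob over the same buffer, so a pointer such as buffer + offsets[0] stays valid and simply reads the new translation.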
Localization, 5th iteration
The additional memory footprint introduced with the 2nd iteration bothered us. We would rather spend that memory on textures than on text, and fixing this was very simple again.
We supported a hybrid system of the 1st and 2nd iteration. The 2nd iteration was perfect for development purposes, as it does not require many recompiles, but it comes with a higher memory footprint.
The 1st iteration on the other hand is horrible during development, because of the recompile times, but it does not require any additional data (CRC table) and is lightning-fast.
Instead of using string literals for the text identifier directly, we wrapped them in a TXT macro. The debug build stringified the incoming parameter, whereas the release build concatenated it with a prefix to create an identifier:
    #if _DEBUG
    #define TXT(id) #id
    #else
    #define TXT(id) TID_##id
    #endif
It was used like this:
const char* text = GetText(TXT(helloworld));
The debug build replaced it with:
const char* text = GetText("helloworld");
and the release build with:
const char* text = GetText(TID_helloworld);
where TID_helloworld is the generated #define identifier of our custom text tool, as shown in 1st iteration.
We used the 2nd approach for debug builds and the 1st approach for release builds. When you switch between debug and release builds, you need to do a full recompile anyway, so using the 1st approach there does not hurt.
At this point I can also resolve the “more on this later” notes from the 2nd iteration.
Because we used the 1st approach in release builds, we could catch all invalid text identifiers at compile time and had no memory overhead anymore, yay! Supporting both systems is also not really problematic in my opinion, since they’re pretty similar and not complicated anyway.
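Put together, the hybrid might be sketched as below. This is a condensed illustration, not our actual code: TEXT_IDS_AS_STRINGS is a hypothetical switch standing in for _DEBUG, and a linear strcmp search stands in for the CRC32 lookup of the 2nd iteration to keep the example short.

```cpp
#include <cassert>
#include <cstring>

#if defined(TEXT_IDS_AS_STRINGS) // hypothetical switch; the real code keys off _DEBUG

// Development build: ids are string literals, resolved at runtime.
#define TXT(id) #id

const char* GetText(const char* textId)
{
    // stand-in for the CRC32 + binary search lookup
    static const char* const ids[]   = { "helloworld", "goodday" };
    static const char* const texts[] = { "HELLO WORLD", "GOOD DAY" };

    for (unsigned int i = 0; i < 2; ++i)
        if (std::strcmp(ids[i], textId) == 0)
            return texts[i];

    return textId; // unknown id: show the id itself
}

#else

// Release build: ids are generated constants, checked by the compiler.
#define TID_helloworld 0
#define TID_goodday    1
#define TID__MAX       2
#define TXT(id) TID_##id

const char* GetText(unsigned int textId)
{
    static const char* const texts[] = { "HELLO WORLD", "GOOD DAY" };
    assert(textId < TID__MAX); // invalid text id
    return texts[textId];
}

#endif
```

Game code calls GetText(TXT(helloworld)) in both configurations. A misspelled id such as TXT(hellowrold) shows up on screen as "hellowrold" in the development build and fails to compile in the release build, because TID_hellowrold does not exist.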
Conclusion
Creating a multi-language game comes with a couple of new tasks. Don’t underestimate them!
