Converting .dsl files to Kindle dictionaries

In this post, I will describe how to convert dictionaries in ABBYY Lingvo’s format to mobi dictionaries that can work with Kindle.

I will assume that you already have the appropriate .dsl files. The first step is to make sure that the .dsl files use UTF-8 encoding. We can check a file's encoding with the file command:

$ file dict.dsl 
dict.dsl: Little-endian UTF-16 Unicode text, with CRLF line terminators

If you see anything other than UTF-8 Unicode (with BOM), as in the example above, you have to convert the file to UTF-8 first. We can use iconv for this:

iconv -f UTF-16LE -t UTF-8 dict.dsl -o dict-utf8.dsl
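
The check and the conversion can also be combined into one small helper. This is only a sketch: the function name to_utf8 is made up for illustration, and it assumes that a file starting with the bytes FF FE is UTF-16LE and that anything else is already UTF-8.

```shell
# to_utf8 INPUT OUTPUT
# Hypothetical helper: converts INPUT to UTF-8 if it starts with the
# UTF-16LE byte order mark (FF FE), otherwise copies it unchanged.
to_utf8() {
  bom=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
  if [ "$bom" = "fffe" ]; then
    iconv -f UTF-16LE -t UTF-8 "$1" > "$2"
  else
    cp "$1" "$2"
  fi
}
```

For the file from the example it would be called as: to_utf8 dict.dsl dict-utf8.dsl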

Next, we need to make sure that the .dsl files do not contain metadata (lines starting with # at the beginning of the file):

#NAME "Foo Dictionary"
#INDEX_LANGUAGE "Russian"
#CONTENTS_LANGUAGE "Polish"
а
...

If you see lines starting with # as in the above example, please remove them.
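
For large files it may be easier to strip the header with a command than in an editor. The sketch below uses a made-up function name, strip_dsl_header, and assumes that the header consists only of the # lines at the very top of the file. If the file starts with a UTF-8 BOM, the BOM is kept on the first remaining line; later lines that happen to start with # are left alone.

```shell
# strip_dsl_header FILE
# Hypothetical helper: prints FILE without the leading block of lines that
# start with "#". A UTF-8 BOM in front of the first line is preserved.
strip_dsl_header() {
  LC_ALL=C awk -v bom="$(printf '\357\273\277')" '
    body { print; next }
    {
      line = $0
      # Look past the BOM when deciding whether line 1 is a header line.
      if (NR == 1 && index(line, bom) == 1) {
        has_bom = 1
        line = substr(line, length(bom) + 1)
      }
      if (line ~ /^#/) next
      body = 1
      printf "%s%s\n", (has_bom ? bom : ""), line
    }
  ' "$1"
}
```

Example: strip_dsl_header dict-utf8.dsl > dict-clean.dsl (the output file name is arbitrary).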

Next we need to grab dsl2mobi:

git clone https://github.com/Tvangeste/dsl2mobi.git

# Check whether Ruby is already installed
ruby -v
# If it is not, install it (Debian/Ubuntu)
sudo apt install ruby

You need Ruby installed on your machine for the script to work. I don’t like running someone else’s code directly on my machine, so I ran the script inside a virtual machine, and for security reasons I recommend that you do the same.

Now we can execute the script:

cd dsl2mobi
chmod +x ./dsl2mobi.rb

./dsl2mobi.rb -w ./wordforms/forms-RU.txt \
 -i ~/path-to/dict-utf8.dsl \
 -o ~/output-dir

In many languages the same word occurs in different forms; in English, for example, the word “write” also appears as wrote, written, and writes. We want our dictionary to recognize all these variations, and for this we need so-called wordforms. Fortunately, dsl2mobi comes with built-in wordforms files for several languages. Use the wordforms of the source language: to create a dictionary from e.g. Russian to Polish you need Russian wordforms (as in our example), to create one from English to Russian you need English wordforms, and so on.
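
If you convert dictionaries for several source languages, the choice of the wordforms file can be scripted. The helper below is hypothetical, and it assumes that every bundled file follows the forms-XX.txt naming seen in forms-RU.txt; check the wordforms directory of your dsl2mobi checkout to see which files actually exist.

```shell
# wordforms_path LANG_CODE
# Hypothetical helper: builds the path of the wordforms file for the
# source (headword) language, assuming the forms-XX.txt naming pattern.
wordforms_path() {
  code=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  printf './wordforms/forms-%s.txt\n' "$code"
}
```

It could then be used as: ./dsl2mobi.rb -w "$(wordforms_path ru)" -i ~/path-to/dict-utf8.dsl -o ~/output-dir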

dsl2mobi should create at least two files in the output-dir, one with .html extension (containing actual content) and one with .opf extension (containing metadata).
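
Before moving on, it can be worth confirming that both kinds of file were really written. A minimal sketch, with made-up function names, that only checks for the presence of the two extensions and says nothing about whether the content is valid:

```shell
# has_ext DIR EXT: succeeds if DIR contains at least one *.EXT file.
has_ext() {
  for f in "$1"/*."$2"; do
    [ -e "$f" ] && return 0
  done
  return 1
}

# check_dsl2mobi_output DIR: succeeds only if DIR holds both an .html
# file and an .opf file.
check_dsl2mobi_output() {
  has_ext "$1" html && has_ext "$1" opf
}
```

Example: check_dsl2mobi_output ~/output-dir || echo "dsl2mobi output is incomplete"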

Next we need to grab KindleGen from Amazon to actually generate mobi files:

./kindlegen  ~/output-dir/dict.opf -o dict.mobi -verbose -c2

We use the -c2 option to compress the dictionary.

Unfortunately, in my case kindlegen refused to convert the .opf file generated by dsl2mobi. To make it work, I had to edit my .opf file so that it looks like this:

<?xml version="1.0"?><!DOCTYPE package SYSTEM "oeb1.ent">
<package unique-identifier="uid">
  <metadata>
    <dc-metadata xmlns:dc="http://purl.org/metadata/dublin_core" xmlns:oebpackage="http://openebook.org/namespaces/oeb-package/1.0/">
      <dc:Identifier id="uid">dic</dc:Identifier>
      <!-- Title of the document -->
      <dc:Title>Ru-Pl Dictionary</dc:Title>
      <dc:Language>ru</dc:Language>
      <dc:Subject BASICCode="REF008000">Dictionaries</dc:Subject>
      <dc:Creator>linuxboy</dc:Creator>
      <dc:Publisher>pinguin</dc:Publisher>
      <dc:Description>Generated by Dsl2Mobi-1.2-dev on 2019-11-28.</dc:Description>
    </dc-metadata>
    <x-metadata>
      <output encoding="utf-8" content-type="text/x-oeb1-document" />
      <DictionaryInLanguage>ru</DictionaryInLanguage>
      <DictionaryOutLanguage>pl</DictionaryOutLanguage>
    </x-metadata>
  </metadata>

  <!-- list of all the files needed to produce the .mobi file -->
  <manifest>
    <item id="item1" media-type="text/x-oeb1-document" href="dict.html"></item>
  </manifest>

  <!-- list of the html files in the correct order  -->
  <spine>
    <itemref idref="item1"/>
  </spine>

  <tours/>
  <guide>
   <reference type="toc" title="Table of Contents" href="dict.html#toc"></reference>
  </guide>
</package>

Also make sure that the DictionaryInLanguage and DictionaryOutLanguage tags have proper values, otherwise your dictionary may not work on Kindle.
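
A quick way to see which values an .opf file currently declares is to print just these two tags. The function name opf_languages is made up for this sketch, and it assumes each tag sits on a single line, as in the listing above.

```shell
# opf_languages FILE
# Hypothetical helper: prints the dictionary languages declared in an
# .opf file as "In=xx" and "Out=yy", one per line.
opf_languages() {
  sed -n -E 's:.*<Dictionary(In|Out)Language>([^<]*)</Dictionary.*:\1=\2:p' "$1"
}
```

For the .opf file shown above, opf_languages dict.opf should print In=ru followed by Out=pl.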

I also had to change the beginning of the dict.html file to:

<?xml version="1.0" encoding="utf-8"?>
<html xmlns:idx="www.mobipocket.com" xmlns:mbp="www.mobipocket.com" xmlns:xlink="http://www.w3.org/1999/xlink">
  <link rel="stylesheet" type="text/css" href="dic.css"/>
  <head>
    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
    <title></title>
  </head>
  <body>
    <center>
      <h1>Generated by Dsl2Mobi-1.2-dev</h1>
      <hr />
    </center>
    <mbp:pagebreak />
    <a name="toc"></a>
    <mbp:pagebreak />

<!-- DICTIONARY ENTRIES -->
<a name="#а"/>
<idx:entry name="word" scriptable="yes">
<font size="6" color="#002984"><b><idx:orth>а
</idx:orth></b></font>
<idx:orth value="a"/>
<div class="dsl_m0"><span class="dsl_p"><i><font color="green">Spójnik</font></i></span></div>
...

After these changes I was able to generate a .mobi file that worked perfectly with my Kindle.

If your dictionary is really huge (the .html file is bigger than 20MB), KindleGen may either take a lot of time (a few hours) or not finish at all. In this case I advise you to split the single .html file into three or four smaller files (each should be less than 20MB), and then add them as “chapters” to the .opf file:

  <!-- list of all the files needed to produce the .mobi file -->
  <manifest>
    <item id="item1" media-type="text/x-oeb1-document" href="dict-1.html"></item>
    <item id="item2" media-type="text/x-oeb1-document" href="dict-2.html"></item>
    <item id="item3" media-type="text/x-oeb1-document" href="dict-3.html"></item>
    <item id="item4" media-type="text/x-oeb1-document" href="dict-4.html"></item>
  </manifest>

  <!-- list of the html files in the correct order  -->
  <spine>
    <itemref idref="item1"/>
    <itemref idref="item2"/>
    <itemref idref="item3"/>
    <itemref idref="item4"/>
  </spine>
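
With more than a few parts, typing these entries by hand gets error-prone. The following sketch prints both blocks for any list of files; the function name opf_manifest is made up, and the output still has to be pasted into the .opf file manually.

```shell
# opf_manifest FILE...
# Hypothetical helper: prints <manifest> and <spine> blocks that list the
# given .html files in the order they were passed.
opf_manifest() {
  echo '  <manifest>'
  n=0
  for f in "$@"; do
    n=$((n + 1))
    printf '    <item id="item%d" media-type="text/x-oeb1-document" href="%s"></item>\n' "$n" "$f"
  done
  echo '  </manifest>'
  echo '  <spine>'
  i=1
  while [ "$i" -le "$n" ]; do
    printf '    <itemref idref="item%d"/>\n' "$i"
    i=$((i + 1))
  done
  echo '  </spine>'
}
```

Example: opf_manifest dict-1.html dict-2.html dict-3.html dict-4.html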

You can use wc, head and tail for the splitting:

$ wc -l < dict.html
1792747
$ echo $((1792747 / 2))
896373
$ head -n 896373 dict.html > dict-1-2.html
# tail -n +N starts at line N, so use 896374 to avoid duplicating a line
$ tail -n +896374 dict.html > dict-3-4.html
# Split each half again the same way to get four parts

Then use vim or another editor to make sure that every file has a proper <head> section and ends with </body></html>. You also have to make sure that dictionary entries are not split across files. Entries are quite easy to recognize, as they usually start with an <a> tag followed by an <idx:entry> tag.
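
The manual clean-up can be avoided by splitting on entry boundaries in the first place. The awk-based sketch below is not part of dsl2mobi, and the name split_dict is made up. It relies on three assumptions taken from the dict.html listing above: the header ends with the DICTIONARY ENTRIES comment, every entry begins with a line starting with <a name=, and the file ends with a line containing </body>. Check the resulting parts before feeding them to KindleGen.

```shell
# split_dict INPUT MIN_LINES PREFIX
# Sketch: writes PREFIX-1.html, PREFIX-2.html, ... Each part gets a copy of
# the header and its own closing tags, and a new part starts only on an
# entry boundary once the current part holds at least MIN_LINES lines.
split_dict() {
  awk -v min="$2" -v prefix="$3" '
    function open_part() {
      part++
      out = prefix "-" part ".html"
      printf "%s", header > out
      count = 0
    }
    function close_part() {
      print "</body>\n</html>" > out
      close(out)
    }
    # Everything up to and including the marker comment is the header.
    !in_body {
      header = header $0 "\n"
      if ($0 ~ /DICTIONARY ENTRIES/) { in_body = 1; open_part() }
      next
    }
    # Stop copying at the original closing tag; each part gets its own.
    /<\/body>/ { done = 1 }
    done { next }
    # Assumption: every entry begins with a line starting with <a name=
    /^<a name=/ && count >= min { close_part(); open_part() }
    { print > out; count++ }
    END { if (in_body) close_part() }
  ' "$1"
}
```

For the file from the example, split_dict dict.html 450000 dict would produce roughly four parts named dict-1.html, dict-2.html, and so on.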

KindleGen needed around an hour to convert an 80MB dictionary split into four parts, so be prepared to wait a bit.