How the EPUB E-book Format Works

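EPUB is an e-book format specification whose files use the ".epub" extension. Its internal files are written in XML-based formats, and it is supported by almost all hardware e-book readers. An EPUB file can also be handled as an ordinary ZIP archive; the directory structure after extraction is shown below.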

.epub

--ZIP Container--
mimetype
META-INF/
  container.xml
OEBPS/
  content.opf
  chapter1.xhtml  // book content; may also be .html
  chapter2.xhtml
  images/         // illustrations
    cover.png
    ch1-pic.png
  css/            // fonts and layout styles
    style.css
    myfont.otf
  toc.ncx         // table of contents (navigation) file
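
Because an EPUB is a ZIP archive, its entries can be listed with any ZIP library. The following is a minimal sketch using Python's standard `zipfile` module. The helper names `build_sample_epub` and `list_epub_entries` are made up for illustration, and the sample archive holds placeholder files rather than a valid book.

```python
import io
import zipfile


def build_sample_epub() -> bytes:
    """Build a tiny in-memory archive that mimics the layout shown above.

    The file contents are placeholders, not a valid EPUB publication.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry is written first and stored without compression.
        archive.writestr(
            "mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED
        )
        archive.writestr("META-INF/container.xml", "<container/>")
        archive.writestr("OEBPS/content.opf", "<package/>")
        archive.writestr("OEBPS/chapter1.xhtml", "<html/>")
    return buffer.getvalue()


def list_epub_entries(epub_bytes: bytes) -> list[str]:
    """Return the names of all entries in the archive, in archive order."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        return archive.namelist()


if __name__ == "__main__":
    for name in list_epub_entries(build_sample_epub()):
        print(name)
```

For a real book on disk, `zipfile.ZipFile("book.epub")` can be opened directly instead of going through `io.BytesIO`.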

The mimetype file contains a single line:

mimetype

`application/epub+zip`

Its purpose is to tell applications that the file is an EPUB publication packaged in a ZIP container.
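
A reader can use this entry as a quick sanity check before parsing anything else. The sketch below assumes the usual EPUB container convention that `mimetype` is the first entry in the archive and is stored uncompressed; the function name `has_valid_mimetype` is hypothetical.

```python
import io
import zipfile

EPUB_MIMETYPE = b"application/epub+zip"


def has_valid_mimetype(epub_bytes: bytes) -> bool:
    """Check that the first archive entry is an uncompressed `mimetype` file
    whose content is exactly `application/epub+zip`."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        entries = archive.infolist()
        if not entries:
            return False
        first = entries[0]
        return (
            first.filename == "mimetype"
            and first.compress_type == zipfile.ZIP_STORED
            and archive.read(first) == EPUB_MIMETYPE
        )
```

The check is deliberately strict: a trailing newline in the file or a different entry order makes it return `False`, even though some reading applications may tolerate such files.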

META-INF/container.xml records the location of the e-book's root file, here OEBPS/content.opf.

META-INF/container.xml

<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
     <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
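
Locating the root file is a single namespaced lookup. Here is a minimal sketch with Python's `xml.etree.ElementTree`; the function name `find_rootfile_path` is an assumption made for this example, and the embedded XML is a copy of the listing above.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the container.xml listing above.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

CONTAINER_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
     <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def find_rootfile_path(container_xml: bytes) -> str:
    """Return the full-path attribute of the first <rootfile> element."""
    root = ET.fromstring(container_xml)
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no <rootfile> element")
    return rootfile.attrib["full-path"]


if __name__ == "__main__":
    print(find_rootfile_path(CONTAINER_XML))
```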

content.opf is itself an XML file. It contains three main sections: metadata, manifest, and spine.

metadata holds basic information about the book, such as its identifier, cover, author, publisher, title, publication date, and language.

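manifest lists every file under OEBPS other than content.opf itself, such as the XHTML, image, and CSS files.
spine records the reading order of the book: each itemref points to a manifest item by id, and the toc attribute points to the manifest item for the NCX table of contents.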

The content of content.opf is shown below.

OEBPS/content.opf

<?xml version="1.0" encoding="UTF-8"?>
<package xmlns:opf="http://www.idpf.org/2007/opf" unique-identifier="bookid" xmlns="http://www.idpf.org/2007/opf" version="2.0">
   <metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:identifier id="bookid">urn:uuid:273fd756-62f2-4858-8d67-99e08f24bba9</dc:identifier>
        <meta name="cover" content="cover-image"/>
        <dc:creator opf:file-as="Taylor, John" opf:role="aut">Taylor, John</dc:creator>
        <dc:date opf:event="publication">2020-09-29</dc:date>
        <dc:language>en</dc:language>
        <dc:publisher>UNKNOWN</dc:publisher>
        <dc:title>Website Building Made Simple: How to Create Your Own Website Using Wordpress</dc:title>
   </metadata>
   <manifest>
        <item href="chapter1.xhtml" id="id_1" media-type="application/xhtml+xml"/>
        <item href="chapter2.xhtml" id="id_2" media-type="application/xhtml+xml"/>
        <item href="images/cover.png" id="cover-image" media-type="image/png" properties="cover-image"/>
        <item href="images/ch1-pic.png" id="id_3" media-type="image/png"/>
        <item href="css/style.css" id="id_4" media-type="text/css"/>
        <item href="css/myfont.otf" id="id_5" media-type="application/vnd.ms-opentype"/>
        <item href="toc.ncx" id="ncx" media-type="application/x-dtbncx+xml"/>
   </manifest>
   <spine toc="ncx">
        <itemref idref="id_1"/>
        <itemref idref="id_2"/>
   </spine>
</package>
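
To see how the three sections fit together, the sketch below reads the title from metadata, builds an id-to-href map from manifest, and resolves the spine into an ordered list of files. It is an illustration under assumptions: `read_package` is a hypothetical helper, and `SAMPLE_OPF` is a shortened package document with a placeholder title, not the full listing above.

```python
import xml.etree.ElementTree as ET

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Shortened sample; the title is a placeholder.
SAMPLE_OPF = b"""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="bookid" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="bookid">urn:uuid:273fd756-62f2-4858-8d67-99e08f24bba9</dc:identifier>
    <dc:title>Sample Book</dc:title>
    <dc:language>en</dc:language>
  </metadata>
  <manifest>
    <item href="chapter1.xhtml" id="id_1" media-type="application/xhtml+xml"/>
    <item href="chapter2.xhtml" id="id_2" media-type="application/xhtml+xml"/>
    <item href="toc.ncx" id="ncx" media-type="application/x-dtbncx+xml"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="id_1"/>
    <itemref idref="id_2"/>
  </spine>
</package>
"""


def read_package(opf_xml: bytes) -> dict:
    """Extract the title, the manifest (id -> href), and the reading order."""
    root = ET.fromstring(opf_xml)
    title = root.findtext("opf:metadata/dc:title", default="", namespaces=NS)
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in root.findall("opf:manifest/opf:item", NS)
    }
    # Each spine itemref names a manifest id; look up its href in order.
    reading_order = [
        manifest[ref.attrib["idref"]]
        for ref in root.findall("opf:spine/opf:itemref", NS)
    ]
    return {"title": title, "manifest": manifest, "reading_order": reading_order}


if __name__ == "__main__":
    print(read_package(SAMPLE_OPF)["reading_order"])
```

The href values are relative to the directory that holds content.opf, so a caller would join them with "OEBPS/" before reading the entries from the archive.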

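OEBPS holds the book's content and its other resource files. The .xhtml/.html files carry the actual text of the book. We can store each chapter in its own XHTML file, which makes it easy to index individual chapters; how finely you split the content depends on your needs. If you are familiar with HTML, you will by now have a good idea of how EPUB works, so I will not dwell on styling text or inserting images.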

Finally, the toc.ncx file deserves a mention. It records the table of contents of the whole book, as shown below.

OEBPS/toc.ncx

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<ncx:ncx xmlns:ncx="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <ncx:head>
    <ncx:meta name="cover" content="cover"/>
    <ncx:meta name="dtb:depth" content="-1"/>
    <ncx:meta name="dtb:totalPageCount" content="0"/>
    <ncx:meta name="dtb:maxPageNumber" content="0"/>
  </ncx:head>
  <ncx:docTitle>
    <ncx:text>BOOK TITLE</ncx:text>
  </ncx:docTitle>
  <ncx:navMap>
    <ncx:navPoint id="id286448480" playOrder="1">
      <ncx:navLabel>
        <ncx:text>BOOK TITLE</ncx:text>
      </ncx:navLabel>
      <ncx:content src="index.html"/>
      <ncx:navPoint playOrder="2" id="toc">
        <ncx:navLabel>
          <ncx:text>Table of Contents</ncx:text>
        </ncx:navLabel>
        <ncx:content src="toc.html"/>
      </ncx:navPoint>
      <ncx:navPoint id="id286448503" playOrder="3">
        <ncx:navLabel>
          <ncx:text>CHAPTER 1</ncx:text>
        </ncx:navLabel>
        <ncx:content src="chapter1.xhtml"/>
      </ncx:navPoint>
      <ncx:navPoint id="id286458153" playOrder="4">
        <ncx:navLabel>
          <ncx:text>CHAPTER 2</ncx:text>
        </ncx:navLabel>
        <ncx:content src="chapter2.xhtml"/>
      </ncx:navPoint>
      ...
    </ncx:navPoint>
    ...
  </ncx:navMap>
</ncx:ncx>
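
Since navPoint elements can nest, reading the table of contents is naturally recursive. The following sketch flattens an NCX document into (depth, label, src) tuples. The names `walk_nav_points` and `read_toc` are hypothetical, and `SAMPLE_NCX` is a reduced version of the listing above that uses a default namespace instead of the `ncx:` prefix, which makes no difference to the parser.

```python
import xml.etree.ElementTree as ET

NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}

# Reduced sample with made-up ids.
SAMPLE_NCX = b"""<?xml version="1.0" encoding="utf-8"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <navMap>
    <navPoint id="book" playOrder="1">
      <navLabel><text>BOOK TITLE</text></navLabel>
      <content src="index.html"/>
      <navPoint id="ch1" playOrder="2">
        <navLabel><text>CHAPTER 1</text></navLabel>
        <content src="chapter1.xhtml"/>
      </navPoint>
      <navPoint id="ch2" playOrder="3">
        <navLabel><text>CHAPTER 2</text></navLabel>
        <content src="chapter2.xhtml"/>
      </navPoint>
    </navPoint>
  </navMap>
</ncx>
"""


def walk_nav_points(parent, depth=0):
    """Yield (depth, label, src) for every navPoint below parent, in document order."""
    for point in parent.findall("ncx:navPoint", NCX_NS):
        label = point.findtext("ncx:navLabel/ncx:text", default="", namespaces=NCX_NS)
        content = point.find("ncx:content", NCX_NS)
        src = content.attrib.get("src", "") if content is not None else ""
        yield depth, label.strip(), src
        # Recurse into nested navPoint elements one level deeper.
        yield from walk_nav_points(point, depth + 1)


def read_toc(ncx_xml: bytes) -> list:
    """Return the flattened table of contents of an NCX document."""
    root = ET.fromstring(ncx_xml)
    nav_map = root.find("ncx:navMap", NCX_NS)
    if nav_map is None:
        return []
    return list(walk_nav_points(nav_map))


if __name__ == "__main__":
    for depth, label, src in read_toc(SAMPLE_NCX):
        print("  " * depth + f"{label} -> {src}")
```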

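That concludes this brief introduction to the EPUB format. The tools we often see for converting EPUB to PDF or TXT are, at their core, just parsing and processing these XML files. Once we understand how an EPUB file is put together, implementing a format conversion is not particularly hard.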
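
As a rough illustration of the conversion idea, the sketch below turns one XHTML chapter into plain text by collecting the text of each top-level element in the body. It is a simplification under stated assumptions: `chapter_to_text` is a hypothetical helper, the chapter must be well-formed XML, and named HTML entities such as `&nbsp;` or deeply nested layouts would need extra handling that real converters provide.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Made-up chapter content for the example.
SAMPLE_CHAPTER = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Chapter 1</title></head>
  <body>
    <h1>Chapter 1</h1>
    <p>First <em>paragraph</em> of text.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""


def chapter_to_text(xhtml: bytes) -> str:
    """Return the text of each direct child of <body>, separated by blank lines."""
    root = ET.fromstring(xhtml)
    body = root.find(f"{XHTML_NS}body")
    if body is None:
        return ""
    blocks = []
    for element in body:
        # itertext() also visits nested inline tags such as <em>;
        # split/join collapses runs of whitespace into single spaces.
        text = " ".join("".join(element.itertext()).split())
        if text:
            blocks.append(text)
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(chapter_to_text(SAMPLE_CHAPTER))
```

A full EPUB-to-TXT converter would apply a function like this to every file in the spine, in reading order, and concatenate the results.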

Technical Notes

#epub

https://r-future.github.io/post/电子书epub原理剖析/

Author

Future

Posted on

March 19, 2023
