Make an Ebook from Online Web Pages


Accessing a document online is often too slow, so it really helps to mirror it and make an ebook.

I frequently access the Lemur Project and Indri Search Engine Wiki, and I found reading it online really frustrating...


wget is enough.

  1. First try
wget -r -p -k -I /p/lemur/wiki

Restricting the crawl to the given directory works fine!
But the wiki has too many extra pages: versions, History, Feeds.

  2. Check the log

    Checking the log turns up these kinds of URLs:

    ==> They are all navigational hyperlinks outside the main text, which are not needed.

    --2014-10-24 11:01:27--
    Connecting to (||:80... connected.
    HTTP request sent, awaiting response... 302 Found
    Location: [following]
    --2014-10-24 11:01:28--
    Connecting to (||:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 34905 (34K) [text/html]
    Saving to: “”

    ==> Such a redirect means the two URLs actually serve the same content. How frustrating.
    wget does not distinguish between a directory and a file of the same name: if Indri/ already exists, the file is saved as Indri.1.
    Tried adding "-nc", but wget says:
    "Both --no-clobber and --convert-links were specified, only --convert-links will be used."

    The log also shows that wget saves the result to a file named "RankLib" rather than to index.html under /RankLib.

    ==> Try "-nd".

  3. At last

wget -r -p -E -c -nd -k --max-redirect=3 -R 'history,feed*,*version=*' -I /p/lemur/wiki -X /p/lemur/wiki/browse_pages,/p/lemur/wiki/search,/p/lemur/wiki/browse_tags -o lemur.log

This works fine.
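
For the log check in step 2, a small shell helper can tally the HTTP status codes in the log instead of reading it by eye. This is only a sketch: status_summary is a name made up here, and it assumes the log was written with -o (lemur.log above) and that each response line contains "... <code> <reason>", as in the excerpt.

```shell
# Tally the HTTP status codes found in a wget log written with -o.
# Usage: status_summary LOGFILE
# Prints one "<code> <count>" line per status code, sorted by code.
status_summary() {
  # wget prints the server status after "... ", e.g. "... 302 Found".
  grep -oE '\.\.\. [0-9]{3} ' "$1" |
    tr -d '. ' |          # keep only the three digits
    sort |
    uniq -c |             # "<count> <code>"
    awk '{print $2, $1}'  # "<code> <count>"
}
```

Run it as status_summary lemur.log. A high count for 302 hints that many of the crawled URLs only redirect to pages that are fetched anyway.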

2. Ebook maker

  1. Publish online.
    Just upload the directory to my online document repository.

  2. Publish as an ebook.
    Convert it to an ebook with Calibre.
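
The Calibre step can also be scripted with its command-line tool ebook-convert. A minimal sketch, assuming Calibre is installed and ebook-convert is on PATH; make_ebook, the EBOOK_CONVERT variable, and the file names in the usage line are placeholders invented for this example, not part of the original setup.

```shell
# Convert a mirrored site into an ebook with Calibre's ebook-convert.
# Usage: make_ebook START_HTML OUTPUT_FILE TITLE [MAX_LEVELS]
# EBOOK_CONVERT may point at another binary, or at "echo" for a dry run.
make_ebook() {
  # --max-levels limits how deep links are followed from START_HTML
  # (5 is Calibre's documented default for HTML input).
  "${EBOOK_CONVERT:-ebook-convert}" "$1" "$2" \
    --title "$3" \
    --max-levels "${4:-5}"
}
```

For example: make_ebook index.html lemur.epub "Lemur Wiki". The name of the starting page depends on what wget actually saved, so check the mirror directory first.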


Useful parameters of wget

   -r
       Turn on recursive retrieving.  The default maximum depth is 5.

   -l depth
       Specify recursion maximum depth level depth.

   -p
       This option causes Wget to download all the files that are
       necessary to properly display a given HTML page.  This includes
       such things as inlined images, sounds, and referenced stylesheets.

   -L
       Follow relative links only.  Useful for retrieving a specific home
       page without any distractions, not even those from the same hosts.

   -I list
       Specify a comma-separated list of directories you wish to follow
       when downloading.  Elements of list may contain wildcards.

   -nc
       If a file is downloaded more than once in the same directory,
       Wget's behavior depends on a few options, including -nc.  In
       certain cases, the local file will be clobbered, or overwritten,
       upon repeated download.  In other cases it will be preserved.

   -o logfile
       Log all messages to logfile.  The messages are normally reported to
       standard error.

   -nd
       Do not create a hierarchy of directories when retrieving
       recursively.  With this option turned on, all files will get saved
       to the current directory, without clobbering (if a name shows up
       more than once, the filenames will get extensions .n).

   -E
       If a file of type application/xhtml+xml or text/html is downloaded
       and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this
       option will cause the suffix .html to be appended to the local
       filename.  This is useful, for instance, when you're mirroring a
       remote site that uses .asp pages, but you want the mirrored pages
       to be viewable on your stock Apache server.  Another good use for
       this is when you're downloading CGI-generated materials.  A URL
       like will be saved as

   --max-redirect=number
       Specifies the maximum number of redirections to follow for a
       resource.  The default is 20, which is usually far more than
       necessary. However, on those occasions where you want to allow more
       (or fewer), this is the option to use.

   -U agent-string
       Identify as agent-string to the HTTP server.
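
To make the .html suffix rule above concrete (the one driven by the regexp \.[Hh][Tt][Mm][Ll]?), here is a rough shell sketch of the naming decision it describes. adjusted_name is a hypothetical helper, not part of wget, and it assumes the page really is text/html or application/xhtml+xml; it ignores everything else wget does when choosing a file name.

```shell
# Predict the local name after the ".html" suffix rule is applied.
# Usage: adjusted_name FILENAME
# Assumption: the content type is text/html or application/xhtml+xml.
adjusted_name() {
  if printf '%s\n' "$1" | grep -Eq '\.[Hh][Tt][Mm][Ll]?$'; then
    # Already ends in .htm or .html (any letter case): keep it.
    printf '%s\n' "$1"
  else
    # Anything else gets ".html" appended.
    printf '%s.html\n' "$1"
  fi
}
```

Under this rule a page saved as "RankLib" would become "RankLib.html", which makes it open directly in a browser.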

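I used to use OneNote. After switching to an Ubuntu desktop I spent a while looking for a replacement and found none, so I started using Evernote under Wine. Not long ago Gerry gave me a 5-inch phone, and only then did I discover how good note syncing across devices is: I can check my notes anytime, anywhere, which is really convenient. As for the pad, I still use it rarely.

At first I organized notes by category tags, hoping to keep my information in order: the slides for a course, for example, or MyinfoBook, or a project.

As I kept using Evernote, that habit gradually changed. I found that recording and aggregating information task by task became the more common way of working. You could say Evernote became the ideal tool for recording and organizing information while solving a problem. Take the birth of this note: it started from one question. How do I read papers, take notes, keep files and references well managed, and still read on a phone or pad? So search + read, the whole process of searching for, recording, and organizing information, all came together in this one Evernote note. Once the problem was solved I summarized it, and the result is the text I am typing now.

Each of these micro-tasks is a real scenario of information access in our lives, and also the basic unit of information organization and management. Buying groceries, traveling, shopping, reading, writing, research: all kinds of activities share similar information access patterns, and Evernote is a convenient aid. If everyone used it this way, the notes recorded in Evernote would be very valuable information. Think about it: it turns out that a lot of the high-quality content on the web is hidden inside Evernote. How could it deliver more value? I will open another note to look into that...

I should also mention Evernote webclip, which addresses the need to save good information while browsing the web (saved content is very rarely read again, but it does satisfy people's wish to keep information). To really use this information, though, I think you have to go back to the micro-task mode: reading and organizing with a question in mind is what works.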


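When researching a technical problem, Evernote helps record and organize information from the web, but anything slightly deeper involves academic research papers. For web content, a URL reference or a link to an Evernote webclip is enough to get back to it, but a research paper needs a link to a PDF file.

Managing research papers is a hassle. Usually I create directories by the problems and fields a project involves, then download the papers and save them there. But the problem I looked into today involves a paper I read in the past, and perhaps even annotated, so where is it now? Managing papers with the file system simply does not work.

That made me think of zotero. I had not used it for a long time; I was not writing papers, so I did not care much about references. I remember that long ago I wrote my graduation thesis with EndNote. That was my first contact with reference management software, and I was amazed at how useful it was. Later fankai introduced me to zotero, which runs inside Firefox and keeps the library on the server, so there is no need to worry about backup and storage. Very nice. Now that I am doing 'research' again, I am back to zotero.

"zotero, zandy, greader, evernote and me" mentions using zotfile to sync papers between zotero and a pad, which is exactly what I want. After installing zotfile, when you save a reference item from a web page with zotero, it magically saves the PDF file automatically (or watches the download directory for new PDF files), attaches it to the item, renames it, and moves it into the library directory you configured. The syncing mentioned in that article still relies on separate PC-to-mobile sync software.

I use Baidu Cloud, so I found a Python client for Baidu Cloud / Baidu Netdisk and tried it. It works reasonably well, and it also ends my long-standing hassle of switching to a Windows machine to sync the cloud drive. duokan does not support Baidu Netdisk directly, but the channel is finally open, and I do not need the PDF send-back and annotation extraction features mentioned in the article, because duokan can export annotations directly to Evernote.

"Connecting Zotero and Evernote" describes adding links to zotero items inside Evernote through the zotero link translator, which connects notes and annotations to the zotero library.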


Now I have a complete toolchain for reading papers, taking notes, and managing references: zotero+zotfile -> baiduyun -> duokan reader -> evernote -> zotero. :)




  • Supports markdown
  • Easy to write and publish


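  • GitHub Pages seems a perfect answer!
  • jekyll sites: look for a good template with clean pages, navigation support, and comment support. Found codepiano, and I am going to use it.
  • After fiddling for a long time, I found that GitHub Pages requires a specific username/repo naming. retry...
  • Fixed a small error in categories.html. (btw, the code pasted below cost me most of a day: github kept reporting "Page build failure", saying "...that is a symlink or does not exist in your _includes directory")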
< \{\% include site/setup %\}
> \{\% include codepiano/setup %\}
  • Still no comment support? > Checked: in _config.yml I had changed site to myblog. Changed it back:
site :
  version : 0.3.0
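
When chasing this kind of "does not exist in your _includes directory" build failure, it can help to list every include target that has no matching file. A rough sketch, assuming a standard Jekyll layout with an _includes directory at the site root; missing_includes is a made-up helper name, and it only understands the plain {% include path %} form (no variables, no whitespace-control dashes).

```shell
# List include targets that are referenced but missing from _includes/.
# Usage: missing_includes SITE_DIR
missing_includes() {
  # Collect every "{% include <target>" in .html and .md files.
  grep -rhoE --include='*.html' --include='*.md' \
      -e '\{% *include +[^ %]+' "$1" |
    sed -E 's/^\{% *include +//' |   # keep only the target path
    sort -u |
    while IFS= read -r target; do
      # Print targets that have no file under _includes/.
      [ -f "$1/_includes/$target" ] || printf '%s\n' "$target"
    done
}
```

Running missing_includes . from the site root prints nothing when every include resolves. With the mistake above it would be expected to print site/setup.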


Welcome! Thanks, github, jekyll, and codepiano!


  • 1: "Building a free blog with unlimited traffic: getting started with github Pages and Jekyll"
  • 2: "Personal blog, powered by jekyll && bootstrap"
  • 3: "jekyll sites"

Original work license: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)