Make an Ebook from Online Web Pages

Problem

Accessing a document online is often too slow, so it is really helpful to mirror it locally and turn it into an ebook.

I frequently access the Lemur Project and Indri Search Engine wiki, and the slow access is really frustrating...

Crawling

wget is enough.

  1. first try
wget -r -p -k -I /p/lemur/wiki http://sourceforge.net/p/lemur/wiki/Home/

Restricting the crawl to the directory works fine!
But the wiki content includes far too many extra pages: old versions, History, and Feeds.

  2. check the log

    Check the log and find these kinds of URLs:
    http://sourceforge.net/p/lemur/wiki/search/?q=labelst%3A%22command-line%22&parser=standard&sort=score+desc
    http://sourceforge.net/p/lemur/wiki/browse
    pages/?sort=alpha&page=0
    http://sourceforge.net/p/lemur/wiki/browse_tags/

    ==> These are all navigational hyperlinks outside the main text, which are not needed.

    Check the log again and find this kind of redirect:

    --2014-10-24 11:01:27--  http://sourceforge.net/p/lemur/wiki/Indri
    Connecting to sourceforge.net (sourceforge.net)|216.34.181.60|:80... connected.
    HTTP request sent, awaiting response... 302 Found
    Location: http://sourceforge.net/p/lemur/wiki/Indri/ [following]
    --2014-10-24 11:01:28--  http://sourceforge.net/p/lemur/wiki/Indri/
    Connecting to sourceforge.net (sourceforge.net)|216.34.181.60|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 34905 (34K) [text/html]
    Saving to: "sourceforge.net/p/lemur/wiki/Indri.1"

    ==> Such a redirect means both URLs actually point to the same content. Annoying...
    wget doesn't distinguish a directory from a file of the same name: Indri/ already exists, so the page is saved as Indri.1.
    Tried adding -nc, but wget says:
    "Both --no-clobber and --convert-links were specified, only --convert-links will be used."

    Check the log and find:
    for http://sourceforge.net/p/lemur/wiki/RankLib/ the result is saved to a file named "RankLib" instead of index.html under /RankLib...

    ==>try "-nd"
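These log inspections can also be scripted. A small sketch, where the two log lines are fabricated stand-ins for the real lemur.log, that lists every unique URL wget requested so unwanted families like search/ or browse_tags/ stand out:

```shell
# Fake a couple of wget log lines so the sketch runs without a real crawl.
cat > lemur.log <<'EOF'
--2014-10-24 11:01:27--  http://sourceforge.net/p/lemur/wiki/Indri
--2014-10-24 11:01:29--  http://sourceforge.net/p/lemur/wiki/search/?q=indexing
EOF
# List every unique URL requested (wget writes one "--date--  URL" line per request).
grep -oE 'http://[^ ]+' lemur.log | sort -u
```

Each unwanted URL family found this way becomes an -R or -X argument in the final command.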

  3. at last

wget -r -p -E -c -nd -k --max-redirect=3 -R history,feed*,*version=* -I /p/lemur/wiki -X /p/lemur/wiki/browse_pages,/p/lemur/wiki/search,/p/lemur/wiki/browse_tags -o lemur.log http://sourceforge.net/p/lemur/wiki/Home/

This works fine~

Ebook Maker

  1. Publish online.
    Just upload the directory to my online document repository.

  2. Publish an ebook.
    Convert it to an ebook with Calibre~
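Calibre also ships a command-line converter, ebook-convert, so this step can be scripted. A sketch under assumptions: Home.html stands in for the mirrored front page (adjust to the real filename), and --max-levels is Calibre's HTML-input option controlling how many levels of local links get pulled into the book:

```shell
# Minimal stand-in page so the sketch runs even without the real mirror:
printf '<html><head><title>Lemur Wiki</title></head><body>Home</body></html>' > Home.html

if command -v ebook-convert >/dev/null 2>&1; then
    # Follow local links up to 2 levels deep and bundle them into one EPUB.
    ebook-convert Home.html lemur-wiki.epub --title "Lemur Project Wiki" --max-levels 2
else
    echo "Calibre not installed; skipping conversion"
fi
```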

Reference

Useful parameters of wget

   -r
   --recursive
       Turn on recursive retrieving.    The default maximum depth is 5.

   -l depth
   --level=depth
       Specify recursion maximum depth level depth.

   -k
   --convert-links
       After the download is complete, convert the links in the document
       to make them suitable for local viewing.  This affects not only the
       visible hyperlinks, but any part of the document that links to
       external content, such as embedded images, links to style sheets,
       hyperlinks to non-HTML content, etc.

   --mirror
       Turn on options suitable for mirroring.  This option turns on
       recursion and time-stamping, sets infinite recursion depth and
       keeps FTP directory listings.  It is currently equivalent to -r -N
       -l inf --no-remove-listing.

   -p
   --page-requisites
       This option causes Wget to download all the files that are
       necessary to properly display a given HTML page.  This includes
       such things as inlined images, sounds, and referenced stylesheets.


   -L
   --relative
       Follow relative links only.  Useful for retrieving a specific home
       page without any distractions, not even those from the same hosts.

   -I list
   --include-directories=list
       Specify a comma-separated list of directories you wish to follow
       when downloading.  Elements of list may contain wildcards.

   -nc
   --no-clobber
       If a file is downloaded more than once in the same directory,
       Wget's behavior depends on a few options, including -nc.  In
       certain cases, the local file will be clobbered, or overwritten,
       upon repeated download.  In other cases it will be preserved.

   -o logfile
   --output-file=logfile
       Log all messages to logfile.  The messages are normally reported to
       standard error.

   -nd
   --no-directories
       Do not create a hierarchy of directories when retrieving
       recursively.  With this option turned on, all files will get saved
       to the current directory, without clobbering (if a name shows up
       more than once, the filenames will get extensions .n).

   -E
   --adjust-extension
       If a file of type application/xhtml+xml or text/html is downloaded
       and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this
       option will cause the suffix .html to be appended to the local
       filename.  This is useful, for instance, when you're mirroring a
       remote site that uses .asp pages, but you want the mirrored pages
       to be viewable on your stock Apache server.  Another good use for
       this is when you're downloading CGI-generated materials.  A URL
       like http://site.com/article.cgi?25 will be saved as
       article.cgi?25.html.

   --max-redirect=number
       Specifies the maximum number of redirections to follow for a
       resource.  The default is 20, which is usually far more than
       necessary. However, on those occasions where you want to allow more
       (or fewer), this is the option to use.

   -U agent-string
   --user-agent=agent-string
       Identify as agent-string to the HTTP server.


Paper Reading Annotation + Reference Management (zotero+evernote+duokan)

duokan:

I recently discovered that reading books on a phone's 5-inch screen is surprisingly convenient; my reading volume even exceeds what I manage on the Kindle/pad, simply because a phone is small, easy to carry, and can be pulled out in any spare moment, which the Kindle/pad is not suited for. After surveying the options, from Zhiyue to KindleReader and finally to Duokan, my conclusion is that Duokan is the most convenient: strong PDF support (including reflow), automatic notes and annotations, reading-progress sync, and cloud space for syncing books. It also has one key feature: it can export notes and annotations to Evernote.

evernote:

I used OneNote before; after switching to an Ubuntu desktop I spent a while looking for a replacement, found none, and ended up running Evernote under Wine. Not long ago gerry gave me a 5-inch phone, and only then did I realize how good cross-device note syncing is: notes can be checked anywhere, anytime, which is really convenient. For me, the pad still sees little use.

At first I managed notes by category tags, hoping to keep my information organized: the slides of a course, say, or MyinfoBook, or a Project. Using Evernote gradually changed that habit. I found that recording and aggregating information along with a task became the more common pattern. Evernote has become the ideal tool for recording and organizing information while solving a problem. This very note, for example, was born from a question: how do I read papers, take notes, manage files and references, and still be able to read on a phone/pad? So, search + read: the whole process of finding, recording, and organizing information converged into this one Evernote note. Once the problem is solved, the write-up is the text I am typing now~. These micro-tasks are the actual scenarios of information access in our lives, and also the basic unit of information organization and management. Grocery shopping, travel, shopping, reading, writing, research: all kinds of tasks share similar information-access patterns, and Evernote is a handy assistant for them. If everyone used it this way, the notes accumulated in Evernote would be extremely valuable information. Think about it: the high-quality content of the web would end up hidden inside Evernote. How could that deliver more value? I'll open another note to investigate......
Evernote Web Clipper also deserves a mention: it answers the need to save information while browsing the web (even though the saved content is rarely read again, it does satisfy the urge to keep things). But to actually use that information, I think one has to return to the micro-task mode: reading and organizing with a question in mind is what works.

zotero:

When investigating technical problems, Evernote helps record and organize information from the web, but anything slightly deeper involves academic research papers. For web content, a plain URL reference or an Evernote Web Clipper link is enough, but a research paper needs to link to a PDF file. Managing research papers is a hassle. Usually I create directories for the problems and areas a project involves and save the downloaded papers there. But today's investigation touched a paper I had read before, perhaps even annotated, and where is it now? Managing this with the file system simply does not work. That brought Zotero to mind. I had not used it for ages; since I was not writing papers, I did not care much about references. Long ago I wrote my thesis with EndNote, my first contact with reference-management software, and marveled at how useful it was. Later fankai introduced me to Zotero, used inside Firefox with the library kept on the server, so backup and storage were no longer a concern. Very nice. Now that I am doing 'research' again, back to Zotero. "zotero, zandy, greader, evernote and me" mentions syncing papers between Zotero and a pad with ZotFile, which is exactly what I wanted. With ZotFile installed, when you save a reference entry from a web page with Zotero, it magically saves the PDF automatically (or watches the download directory for new PDFs), attaches it, renames it automatically, and files it under the library directory you configured. The syncing mentioned in that article still depends on other pc-mobile sync software. I use Baidu Cloud, so I found a Python client for Baidu Cloud / Baidu Netdisk; I tried it and it works well, which also resolves the long-standing annoyance of having to switch to a Windows machine for cloud-disk syncing. Although Duokan does not support Baidu Cloud directly, the channel is now open, and I do not need the PDF write-back and annotation-extraction features mentioned in the article, because Duokan can export them straight to Evernote~~~ "Connecting Zotero and Evernote" describes adding Zotero item links inside Evernote via the Zotero link translator, which ties notes and annotations to the Zotero library.

Summary

I now have a complete toolchain for paper reading, note-taking, and reference management: zotero+zotfile -> baiduyun -> duokan reader -> evernote -> zotero. :)


Setting Up a Personal Blog (First Post~)

Task

  • Support markdown
  • Easy to write and publish

Process

  • github pages seems a perfect answer!
  • jekyll sites: look for a good template with a clean layout, navigation, and comment support. Found codepiano and decided to use it~
  • After much fiddling, discovered that GitHub Pages requires a specific repo name, username.github.io; retry...
  • Fixed a small error in categories.html. (btw, the snippet below cost me half a day: GitHub kept reporting Page build failure, saying "...that is a symlink or does not exist in your _includes directory")
< {% include site/setup %}
---
> {% include codepiano/setup %}
  • Still no comment support? > Checked: I had changed site to myblog in _config.yml; changed it back:
site :
  version : 0.3.0
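The categories.html fix above can be applied with a single sed command. A sketch, where the printf line fabricates the offending include so it runs standalone:

```shell
# Recreate the line that made GitHub Pages fail ("...does not exist in your
# _includes directory"): the template's includes live under codepiano/, not site/.
printf '{%% include site/setup %%}\n' > categories.html
# Point the include at the directory that actually exists:
sed -i 's|site/setup|codepiano/setup|' categories.html
cat categories.html
```

After the edit, cat should show the corrected include pointing at codepiano/setup.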

Summary

Welcome~ http://kyhhdm.github.io/ Thanks, github, jekyll, and codepiano~

References

  • 1: http://www.ruanyifeng.com/blog/2012/08/bloggingwithjekyll.html "Setting up a free, unlimited-traffic blog: an introduction to GitHub Pages and Jekyll"
  • 2: http://codepiano.github.io/ "Personal blog, powered by jekyll && bootstrap"
  • 3: https://github.com/jekyll/jekyll/wiki/Sites "jekyll sites"

—  Original work license — Attribution-NonCommercial-NoDerivs 3.0 Unported — CC BY-NC-ND 3.0   —