Writing articles with markdown

For a few months now I have been embracing Markdown as my note-taking tool, and recently I decided to take things further - write a full report in Markdown. Interestingly, the impulse to do so came from a post discussing the shortcomings of Markdown. Adam has a very good point about the non-standardised format, yet I felt he was mistaken on the HTML front - I never expected Markdown to replace web design. Then it struck me - I did expect Markdown to replace LaTeX for smaller documents.

Perils of Latex

LaTeX is an amazing tool for writing and managing complex documents. Like other people producing a PhD thesis or any academic work, I can't imagine working with anything else. I create most of my documents in LaTeX, as well as all of my presentations (beamer) and a lot of visuals (tikz).
Yet it is not without some major shortcomings:

  • The learning curve is steep, and using dedicated processors like LyX just hides the complexity and severely cripples your learning.
  • Probably everybody agrees that its initial settings are like those of Windows - they need to be changed to produce something that looks nice. And it does take a while to do it.
  • Missing libraries are a major pain.
  • There is probably a whole section of Hell where you look for a missing bracket or a misspelled command in a large LaTeX document.

Online editors such as ShareLaTeX and Overleaf will make any work easier. For presentations, just use metropolis - amazing visuals straight out of the box. Just check my uni teaching slides.

Back on the track - initial approach with RStudio

So why do we discuss Markdown if I prefer LaTeX? I am not alone in thinking that the best way to produce a document is to work in pure ASCII, yet problems with LaTeX syntax do prevent proper deep focus. Another nudge came from reading Deep Work by Cal Newport - I decided to focus on writing my next report in a distraction-free environment.

Testing with RStudio

With limited time (a week to go) I decided to use RStudio, as it has a good description of R Markdown. Two things I needed in my document that are not part of Markdown were:

  • references - the report was on my teaching practice, which required heavy citation
  • cross references within the document (I like things tidy)

Cross references

To create cross references, use an HTML anchor tag; for example, the top header would be #<a name="Introduction"></a> Introduction. I can then refer to it using As discussed in [Introduction](#Introduction).

References

The RStudio description of R Markdown includes a discussion about adding references.

I followed their approach quite literally, with a BibTeX file created using JabRef and a CSL style downloaded from https://www.zotero.org/styles. All you need now is:

  • Add citations in the text itself
As covered in previous research [@Nicometo2010]. @Nicometo2010 discussed that as well. 
Another aspect has been covered already [@Felder2012;@Taylor2011;@Fry2008].
  • Amend the R Markdown (rmd) header
---
title: "Efficient teaching to the large class of engineering student"
author: "Lukasz K Bonenberg"
output:
  html_document:
    highlight: pygments
    theme: cerulean
    toc: yes
bibliography: refs.bib
csl: emerald-harvard.csl
---
  • Knit it in RStudio
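
Before knitting, it can be handy to check that every citation key used in the text actually exists in refs.bib. The following is only a minimal Python sketch, assuming pandoc-style `@key` citations and BibTeX entries of the form `@type{key,`. The function names are hypothetical, the regular expressions are deliberately simplified, and the sample text and bibliography are made up for illustration.

```python
import re

# Simplified pattern for pandoc-style citation keys such as [@Nicometo2010].
# The look-behind skips things like e-mail addresses (name@host).
CITE_KEY = re.compile(r"(?<!\w)@([A-Za-z]\w*)")
# Simplified pattern for BibTeX entry headers such as "@book{Felder2012,".
BIB_ENTRY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def cited_keys(markdown_text):
    """Return the set of citation keys used in the Markdown text."""
    return set(CITE_KEY.findall(markdown_text))


def bib_keys(bib_text):
    """Return the set of entry keys defined in the BibTeX text."""
    return set(BIB_ENTRY.findall(bib_text))


def missing_keys(markdown_text, bib_text):
    """Return cited keys that have no matching BibTeX entry, sorted."""
    return sorted(cited_keys(markdown_text) - bib_keys(bib_text))


# Made-up sample data, for illustration only.
text = ("As covered in previous research [@Nicometo2010]. "
        "Another aspect [@Felder2012;@Taylor2011].")
bib = "@article{Nicometo2010,\n  title={...}\n}\n@book{Felder2012,\n  title={...}\n}"
print(missing_keys(text, bib))  # → ['Taylor2011']
```

In practice you would read the rmd and bib files from disk and pass their contents to `missing_keys`.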

Going for dark side (hey, they have cookies…)

The next stage was to create a LaTeX template and knit to LaTeX. This is when I decided to take a short-cut via the dark side. With this setup it is extremely easy to output to Word. This output can be modified by the use of templates. All you need to do is amend the header:

---
title: "Efficient teaching to the large class of engineering student"
author: "Lukasz K Bonenberg"
output:
  word_document:
    fig_caption: yes
    highlight: pygments
bibliography: refs.bib
csl: emerald-harvard.csl
---

And create the output Word document. You can then edit that document to your liking, rename it template.docx, point reference_docx at it in the header, and knit nice-looking output. It will also compile any LaTeX equations or included graphics properly.

---
title: "Efficient teaching to the large class of engineering student"
author: "Lukasz K Bonenberg"
output:
  word_document:
    fig_caption: yes
    highlight: pygments
    reference_docx: template.docx
bibliography: refs.bib
csl: emerald-harvard.csl
---

Take two - pandoc

Pandoc is the conversion engine behind R Markdown and knit. Just to see the full picture, I decided to compile my document without RStudio, directly in pandoc.

Pandoc recognises citations and cross references and can use the docx template as well. To compile my document you need to use the following command (see footnote 1):

pandoc ATPreport.rmd --filter=pandoc-citeproc --biblio=refs.bib --csl=emerald-harvard.csl --reference-docx=template.docx --highlight-style=pygments --output=report.docx
  

The results are the same as with RStudio, apart from R code being reduced to static text. I would strongly recommend playing with pandoc. The number of different formats it can output is simply amazing. For anybody experimenting with CSL styles this repo should be very helpful.
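
If you want to drive the same conversion from a script, the command can be assembled as an argument list. This is only a sketch in Python, assuming pandoc and pandoc-citeproc are installed and on the PATH. The helper names `build_pandoc_command` and `convert` are hypothetical, and the options are the ones from the command above (with `--bibliography` spelled out in full). Newer pandoc releases have renamed some of these options, so check `pandoc --help` for your version.

```python
import subprocess


def build_pandoc_command(source, output, bibliography, csl, reference_docx,
                         highlight_style="pygments"):
    """Return the pandoc argument list used in this post (hypothetical helper)."""
    return [
        "pandoc", source,
        "--filter=pandoc-citeproc",
        "--bibliography=" + bibliography,
        "--csl=" + csl,
        "--reference-docx=" + reference_docx,
        "--highlight-style=" + highlight_style,
        "--output=" + output,
    ]


def convert(cmd):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(cmd, check=True)


cmd = build_pandoc_command("ATPreport.rmd", "report.docx", "refs.bib",
                           "emerald-harvard.csl", "template.docx")
print(" ".join(cmd))
# convert(cmd)  # uncomment to run the conversion (requires pandoc)
```

Every option is written consistently as `--name=value`, which avoids mixing the `=` and space forms.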

Summary

This approach worked really well, allowing me to focus directly on writing text (with references) without being distracted by compile errors or missing brackets. As I was using RStudio I could also generate a few simple plots from data.

What I want to do next is to generate similar output in PDF using LaTeX. No shortcuts this time.

  1. You can replace = with a space (' ') but you can't mix the two forms - pandoc will throw a strange error.

Making git repos small again

I have recently been struggling with a large git repo - it contained large data files I no longer needed. Up to this point I preferred not to add data files to a repo, to avoid problems like this. This can't be avoided if you share your code with other users who do not have the datasets, or if you fork existing repos. I was also looking for something that is not destructive - at the end of the day you want your git to contain all the history so you can always replicate previous code.

To cut the story short, I found an excellent tool for the job: BFG, written in Scala. It is lightning fast (according to its author it is 100x faster than git-filter-branch). After downloading the Java package, if you want to remove all files bigger than 30M from your git repo just type

cd your-repo-dir
java -jar bfg.jar --strip-blobs-bigger-than 30M

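This will not physically remove the files; it merely marks them for deletion by rewriting the history so that they are no longer referenced. To actually shrink the repository you need to expire the reflog and prune the object database yourself: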

git reflog expire --expire=now --all && git gc --prune=now --aggressive
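
Before picking a threshold it helps to see which blobs in the history are actually large. The following is only a sketch in Python and is not part of BFG. It assumes git is on the PATH and combines `git rev-list --objects --all` with `git cat-file --batch-check`. The function names are hypothetical, the sample output is made up, and treating 30M as 30 x 1024 x 1024 bytes is my assumption.

```python
import subprocess


def parse_batch_check(output, min_bytes):
    """Parse 'type sha size path' lines and keep blobs of at least min_bytes."""
    large = []
    for line in output.splitlines():
        parts = line.split(" ", 3)
        if len(parts) < 3 or parts[0] != "blob":
            continue  # skip commits, trees and malformed lines
        size = int(parts[2])
        path = parts[3] if len(parts) > 3 else ""
        if size >= min_bytes:
            large.append((size, path))
    return sorted(large, reverse=True)  # biggest first


def large_blobs(repo_dir, min_bytes):
    """List (size, path) for every large blob reachable from any ref."""
    objects = subprocess.run(
        ["git", "rev-list", "--objects", "--all"],
        cwd=repo_dir, check=True, capture_output=True, text=True).stdout
    details = subprocess.run(
        ["git", "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        cwd=repo_dir, input=objects, check=True,
        capture_output=True, text=True).stdout
    return parse_batch_check(details, min_bytes)


# Made-up sample of the batch-check output, for illustration only.
sample = ("commit 1111 250 \n"
          "blob 2222 41943040 data/big.csv\n"
          "blob 3333 120 README.md\n")
print(parse_batch_check(sample, 30 * 1024 * 1024))  # → [(41943040, 'data/big.csv')]
```

Calling `large_blobs("your-repo-dir", 30 * 1024 * 1024)` on a real repository would list the candidates for removal.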

Aliases in DOS

As a side note, the above code can be simplified using aliases. Following this post, all you need to do is

doskey bfg=java -jar d:\DIR-to-SOFT\bfg-1.12.8.jar $*
doskey /MACROS:ALL

And then the above code becomes

bfg --strip-blobs-bigger-than 30M

For more info check doskey /?.

Running it against GitHub

This approach will not work if the history you are changing is already on a remote repo (for example GitHub). As the above approach changes history, you won't be able to push, and a pull will just restore the files you removed.
Instead we need to make a local bare repo, pack it, change it and then push. A bit more work indeed, as in the code below:

git clone --mirror git://example.com/some-big-repo.git
cd some-big-repo.git
git gc --auto
git repack -d -l
bfg -b 30M
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push

This will force-push the rewritten history (all refs, since this is a mirror clone) and shrink both the local and the remote copy.

Machine Learning Case Study, Week1

Recently I found yet another machine learning course on Coursera - Machine Learning Foundations: A Case Study Approach, which seems to focus on case studies rather than actual techniques. This is similar to the Foundations of strategic business analytics course, the repo for which you can find on my GitHub, yet it seems to offer a more solid background in ML. I do like this approach, as it lets you see the big picture and better understand the end product.

Getting started

The course is based on a commercial Python package by https://dato.com/ called GraphLab Create - free for one year for academic and learning purposes. It is similar to pandas, with more support for online platforms.

Installation

You can install dato on your machine using their launcher. Since I have a working copy of Python 2.7.x, I decided to install GraphLab into my existing Anaconda using this setup, which can be summarised as creating a new environment and using pip to install GraphLab-Create into it. For those not familiar with the environment concept, it is a directory that contains a specific collection of Python packages, allowing you to separate and sanitise your code - new versions of specific packages can be installed and tested without affecting your currently working setup.

So to sum it up, all we need to do is:

conda create -n dato-env python=2.7 anaconda
activate dato-env
pip install --upgrade --no-cache-dir https://get.dato.com/GraphLab-Create/1.7.1/email/LicenseNo/GraphLab-Create-License.tar.gz
conda update ipython ipython-notebook

Running ipython notebooks

This approach slightly changes the way we run the notebooks. We need to first switch to the proper environment and then call ipython. From the command line it will be

activate dato-env
ipython notebook

Apart from that, week 1 of the course has been very basic and I hope for more action next week.
I will keep you posted on how the course goes.

How to synchronise your R libraries between different PCs

I use R on a few computers (sometimes at the same time) and I find it annoying when libraries I installed on one are missing on another. My usual solution is to use Dropbox and synchronise between my accounts. With R I had a problem, as defining the library location required changes directly to R, hidden behind RStudio. I was in a hurry so I gave up.

Recently I found a Stack Overflow solution and this blog post, which made me revisit the problem. The workflow below shows my current solution:

  • used .libPaths() inside R to check the current library paths;
  • identified which paths to keep. In my case I kept the original R library but removed the link to my documents;
  • found the R-Home path using R.home() or Sys.getenv("R_HOME");
    • R-Home\R-3.2.2\etc\Rprofile.site is read every time the R kernel starts, so any modification will persist across every run of R
  • edited R-Home\R-3.2.2\etc\Rprofile.site by adding[^]
# set library paths
.libPaths(.libPaths()[2])
  • restarted R (Ctrl+Shift+F10).

[^]: note that I use Unix path notation despite using Windows. R always uses Unix notation, regardless of operating system. Also, don't add a trailing path separator.

Where are my commits?

In August I was very busy coding as part of S2DS, but none of my hard work showed on my GitHub contribution panel. Initially I assumed this was because I used a private repo. Then I realised that Lin's commits were showing on hers. I needed to investigate.

The error was on my command-line side. I commit using the command line, as I find it much faster. For those commits to be linked to your GitHub account, the email address in your git config must match one registered with the account. If you followed GitHub's advice on keeping your email private, then all you need to do, on the command line, is:

git config --global user.email "your_email@users.noreply.github.com"
git config --global user.email

You can also set up a specific email for a single repository, according to the help page.
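
To see which addresses your existing commits actually carry, you can tally the author emails in a repository. This is only a sketch in Python, assuming git is on the PATH. It uses `git log --all --format=%ae`, the function names are hypothetical, and the sample addresses are made up. Note that changing `user.email` only affects new commits, so older ones keep whatever address they were made with.

```python
import subprocess
from collections import Counter


def count_author_emails(log_output):
    """Count how many commits carry each author email (one email per line)."""
    return Counter(line.strip() for line in log_output.splitlines() if line.strip())


def author_emails(repo_dir="."):
    """Tally author emails across all refs of the repository in repo_dir."""
    out = subprocess.run(
        ["git", "log", "--all", "--format=%ae"],
        cwd=repo_dir, check=True, capture_output=True, text=True).stdout
    return count_author_emails(out)


# Made-up sample of the git log output, for illustration only.
sample_log = ("me@users.noreply.github.com\n"
              "me@work.example.com\n"
              "me@users.noreply.github.com\n")
counts = count_author_emails(sample_log)
print(counts["me@users.noreply.github.com"])  # → 2
```

Any address in the tally that is not registered with your GitHub account marks commits that will not show up on the contribution panel.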

Side note about Git

Git is one of the most amazing tools I know. Lin has already reported on her experience with it. I had used it for a while, yet only proper group experience made me truly appreciate its power. Recently I found two good blogs explaining the basics:

  • if you have never used it, try this page
  • if you want to learn more about team work in git, try this intro

I can only recommend trying it. You will love it. And remember: commit before you pull, or at least stash.