Tutorial 17.3 - Markdown and Pandoc

17 Nov 2018

About

LaTeX and HTML are both markup languages. Each has a standardized short-hand syntax that specifies how content should be styled and formatted. Markdown is another markup language with its own syntax, yet it is far simpler and less verbose than either LaTeX or HTML.

The goal of markup languages is to provide simple styling rules and syntax so that the author can concentrate on the content. To this end, the highly simplified syntax of markdown makes it one of the briefest and most content-rich formats. Unlike many other markup languages (such as LaTeX and HTML), carriage returns and spaces form an important part of the language structure and thus influence the formatting of the final document.

Simplicity also makes markdown an ideal language for acting as a base source from which other formats (such as PDF, HTML, Presentations, Ebooks) can be created as well as a sort of conduit language through which other formats are converted.

Pandoc is a universal document converter that converts between one markup language and another. Specifically, Pandoc can read markdown and subsets of the following formats:

  • HTML
  • LaTeX
  • Textile
  • reStructuredText
  • MediaWiki markup
  • DocBook XML
Pandoc can write the following formats:
  • plain text
  • markdown
  • HTML (XHTML, HTML5)
  • LaTeX
  • PDF (when LaTeX installed)
  • Various HTML/JavaScript-based slide shows (Slidy, Slideous, DZSlides, S5)
  • EPUB
  • Emacs org-mode
  • Rich Text Format (RTF)
  • OpenDocument XML
  • LibreOffice (Open Document Format, ODT)
  • Microsoft Word DOCX
  • MediaWiki markup
  • FictionBook2
  • Textile
  • groff man pages
  • AsciiDoc

Many of the above markup languages feature extensive definitions for styling and formatting rules that do not have direct equivalents within other languages. For example, Cascading Style Sheets and Javascript within HTML provide advanced styling and dynamic presentation of content that cannot be easily translated into other languages. Similarly, there are many macros available for LaTeX that enhance the styling and formatting of content relevant to PDF. Consequently, not all of the more advanced features of each of the languages are supported by Pandoc for conversion.

Pandoc fully supports markdown as an input language, making markdown a popular base language to create content from which other formats can be generated. For example, contents authored in markdown can then be converted into PDF, HTML, HTML presentations, eBooks and others. There are currently numerous dialects of the markdown language. Pandoc has its own enhanced dialect of markdown, which includes syntax for bibliographies and citations, footnotes, code blocks, tables, enhanced lists, tables of contents and embedded LaTeX math.

This tutorial will focus on markdown as a base source language from which PDF, HTML, presentations and eBooks are created. As a result, the tutorial will focus on Pandoc's enhanced markdown.

Using Pandoc

Pandoc is a command line application with the basic use of:

  pandoc -o output.file input.file
where output.file is the name of the output file and input.file is the name of the input file or files (space separated). Pandoc uses the file extensions to determine the input and output formats. If multiple input files are specified, Pandoc will first concatenate (join) them together before parsing the combined input file.
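
When conversions are scripted, the same command line can be assembled programmatically. The following is a minimal Python sketch, not part of pandoc itself: the helper name pandoc_command and all file names are hypothetical, and it assumes only the -o, -s/--standalone and -t options of the pandoc command line. The conversion is run only if a pandoc executable is found on the PATH and the input file exists.

```python
import os
import shutil
import subprocess


def pandoc_command(output, inputs, standalone=False, to=None):
    """Build the argument list for: pandoc [options] -o output inputs..."""
    args = ["pandoc"]
    if standalone:
        args.append("--standalone")  # same as -s: emit a complete document
    if to is not None:
        args += ["-t", to]  # explicit output format, e.g. "dzslides"
    args += ["-o", output]
    args += list(inputs)  # multiple inputs are concatenated by pandoc
    return args


# Hypothetical file names, used for illustration only.
html_cmd = pandoc_command("example.html", ["example.md"], standalone=True)
slides_cmd = pandoc_command("slides.html", ["example.md"], standalone=True, to="dzslides")
pdf_cmd = pandoc_command("example.pdf", ["intro.md", "methods.md"])

for cmd in (html_cmd, slides_cmd, pdf_cmd):
    print(" ".join(cmd))

# Run one conversion only when pandoc is installed and the input exists.
if shutil.which("pandoc") and os.path.exists("example.md"):
    subprocess.run(html_cmd, check=True)
```

Passing the arguments as a list (rather than a single string) avoids shell quoting problems when file names contain spaces.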

Before going through the specifics of the Pandoc markdown syntax and the Pandoc options, I will illustrate a very basic example of Pandoc markdown conversion into a PDF, an HTML page and a DZSlides presentation. Note, in the case of the PDF, the default is to produce an A4-size page, and therefore the font in the example below is going to look small.

Markdown (*.md)
  ---
  title: This is the title
  author: D. Author
  date: 14-02-2013
  ...

  # Section 1
  Some text

[PDF (*.pdf), HTML (*.html) and DZSlides (*.html) results: rendered output not reproduced]

Pandoc markdown

The basic philosophy of markdown is that a document formatted in markdown is completely readable in plain text without any obvious tags or formatting instructions. That is, the formatting rules are derived almost entirely from the structure of the plain text document. Furthermore, any additional "markup" (such as underlined text) should not appear out of place in a plain text document.

Pandoc's dialect of markdown, whilst retaining some of this philosophy for most elements, is nevertheless guided by its aims to support multiple input and output formats (not just markdown -> HTML).

Paragraphs

Paragraphs of text are specified by separating blocks of text by one or more blank lines. Hard line breaks within paragraphs can be specified by placing two or more spaces at the end of a line, followed by a line break.

Markdown (*.md)
  ---
  title: This is the title
  author: D. Author
  date: 14-02-2013
  ...

  # Section 1
  Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent a velit quis ante dignissim 
  dignissim eget vitae tellus. Duis eget neque tellus, eu elementum leo. Nullam quis velit in 
  magna bibendum dictum. Curabitur tincidunt cursus tellus, in egestas augue porta ut. Phasellus 
  facilisis porttitor elit, vel pretium felis volutpat in. Praesent euismod sagittis tortor, eget 
  varius nisi consequat eget. Sed facilisis aliquet accumsan. Maecenas aliquam, dolor id 
  hendrerit viverra, lacus tortor elementum nunc, quis commodo ligula orci vel augue. 

  Suspendisse dolor purus, volutpat vel viverra vitae, laoreet blandit nulla. In eros ligula, 
  scelerisque id tempus nec, pulvinar vitae felis. Morbi tempor viverra orci, quis elementum metus 
  lobortis sed. Curabitur sit amet ante massa.


The metadata block

You may have noticed in the examples above that at the start of the markdown there was a block of lines starting with three hyphens (---) and ending with three dots (...). When processed via pandoc, these lines define the document's metadata (such as the title, author and creation date).

The metadata are a set of key-value pairs in YAML format. The list of useful metadata depends on the intended output.

Note however, the metadata block only appears in the output when the --standalone (or -s) switch is used.

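The YAML block used throughout this tutorial takes one key: value pair per line. The fields can appear in any order, and a field that is not needed is simply left out:

  ---
  title: This is the title
  author: D. Author
  date: 14-02-2013
  ...

Multiple authors are defined as a YAML list, with each author on a separate, indented line that starts with a - character:

  ---
  title: This is the title
  author:
    - D. Author
    - D. Other
  date: 14-02-2013
  ...

Pandoc also accepts an older style of title block in which each line starts with a % character. The following rules apply to that style only:

  • The three fields must be in order of title, author(s), date with each on a separate line
  • When omitting a field, the field must be left as a line just containing the % character
  • Multiple authors can be defined by either:
    • separating each author by a ; (semicolon) character
    • placing each author on a separate line (indented by a single space)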

Text formatting

Brief changes to font styles within a block of text can be effective at emphasizing or applying different meanings to characters. Common text modifier styles are: italic, bold and strikethrough.

The following table indicates the markdown used to achieve these text formats.

Markdown                          Font type
*Italic text* or _Italic text_    Italic text
**bold text** or __bold text__    bold text
~~strikethrough~~                 strikethrough text
`courier` or ``courier``          courier or typewriter font shape

Note, underlined text is not defined in any dialect of markdown (including pandoc markdown) as the developers believe that the underline style is a relic of the days of typewriters when there were few alternatives for emphasizing words. Furthermore, underlining of regular words within a sentence tends to break the aesthetic spacing of lines.

Subscript and Superscript

  • subscripts are supported by surrounding the content to be lowered by ~ characters
  • superscripts are supported by surrounding the content to be raised by ^ characters

If the content to be raised or lowered contains spaces, then they must be escaped by preceding each space with a \ character.

Markdown (*.md)
  ---
  title: This is the title
  author: D. Author
  date: 14-02-2013
  ...
  
  The rate of oxygen consumption (O~2~ per min^-1^.mg^2^) ...
  
  Effect~Oxygen\ concentration~

Additional mathematical and symbol notation will be illustrated in a later section.

Section headings

Pandoc markdown supports two heading formats (pandoc markdown headings must be preceded by a blank line):

  • Setext-style headings. Level 1 headings are specified by underlining the heading with a row of = characters and level 2 headings are specified by underlining with a row of - characters.
    Markdown (*.md)
    ---
    title: This is the title
    author: D. Author
    date: 14-02-2013
    ...
    
    Section 1
    ============
    
    Subsection 
    -----------
    
    Section 2
    ===========
      

    Setext-style headings only support level 1 and level 2 headings.
  • Atx-style headings. Level 1 to 6 headings comprise one to six # characters followed by the heading text.
    Markdown (*.md)
    ---
    title: This is the title
    author: D. Author
    date: 14-02-2013
    ...
    
    # Section 1
    
    ## Subsection 
    
    ### Subsubsection
      

Table of contents

A table of contents can be included by issuing the --toc command line switch to pandoc. For some output formats (such as HTML), a block of links to section headings is created, whilst for others (such as LaTeX), an instruction (\tableofcontents) for the external driver to create the table of contents is generated.

Markdown (*.md)
---
title: This is the title
author: D. Author
date: 14-02-2013
...

# Section 1

## Subsection 

### Subsubsection
  

Block quotations

Block quotations in pandoc markdown follow email conventions - that is, each line is preceded by a > character.

Markdown (*.md)
---
title: This is the title
author: D. Author
date: 14-02-2013
...

# Section 1
> This is a block quotation.  Block quotations are specified by
> preceding each line with a > character.  The quotation block
> will be indented.
>
> To have paragraphs in block quotations, separate paragraphs
> with a line containing only the block quotation mark character.

  

Verbatim (code) blocks

Verbatim blocks are typically used to represent blocks of code syntax. The text within the verbatim block is rendered literally as it is typed (retaining all spaces and line breaks) and in a monospaced font (typically courier). In pandoc markdown, verbatim text blocks are specified by indenting a block of text by either four spaces or a tab character. Within verbatim text, regular pandoc markdown formatting rules (due to spaces etc.) are ignored.

Markdown (*.md)
---
title: This is the title
author: D. Author
date: 14-02-2013
...

# Section 1

    a = rnorm(10,5,2)
    for (i in 1:10) {
        print(a[i])
    }
  

Alternatively, verbatim blocks can be specified without indentation if the text block is surrounded by a row of three or more ~ characters. This format is often referred to as fenced code.

Markdown (*.md)
  ---
  title: This is the title
  author: D. Author
  date: 14-02-2013
  ...
  
  # Section 1
  ~~~~
  a = rnorm(10,5,2)
  for (i in 1:10) {
  print(a[i])
  }
  ~~~~

Lists

There are three basic list environments available within pandoc markdown:

  • Bullet lists - un-numbered itemized lists
  • Ordered lists - enumerated lists
  • Definition lists - descriptive lists

Bullet lists

A bullet list item begins with either a *, + or - character followed by a single space. Bullets can also be indented.

Markdown (*.md)
  ---
  title: This is the title
  author: D. Author
  date: 14-02-2013
  ...
  
  # Section 1
  * This is the first bullet item
  * This is the second.  
    To indent this sentence on the next line,
    the previous line ended in two spaces and
    this sentence is indented by four spaces.
  * This is the third item

Ordered lists

An ordered list item begins with an enumerator followed by a space. The enumerator can be a decimal number or a roman numeral. In addition to the enumerator, other formatting characters (such as a trailing period or surrounding parentheses) can be used to further define the format of the list numbering.

Markdown (*.md)
  ---
  title: This is the title
  author: D. Author
  date: 14-02-2013
  ...
  
  # Section 1
  1. This is the first numbered item.
  2. This is the second.
  1. This is the third item.  Note that the number I supplied is ignored
  
  # Section 2
  (i) This is list with roman numeral enumerators
  (ii) Another item
 

Note that only the value of the number used for the first item is considered. For subsequent list items, the values of the numbers themselves are ignored; they are merely used to confirm that the list items have the same sort of enumerator.
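
The numbering rule can be illustrated with a small worked example. This Python sketch only restates the rule described above; it is not pandoc's implementation, and the function name rendered_numbers is hypothetical.

```python
def rendered_numbers(written):
    """Return the numbers a rendered list shows, given the numbers typed.

    Only the first typed number matters: it sets the starting value, and
    every later item simply counts up from it.
    """
    if not written:
        return []
    start = written[0]
    return [start + offset for offset in range(len(written))]


print(rendered_numbers([1, 2, 1]))  # the numbers typed in the example above
print(rendered_numbers([4, 9, 2]))  # a list whose first item is numbered 4
```

Under this rule the first call yields 1, 2, 3 and the second yields 4, 5, 6.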

Definition lists

The term (word or phrase) must fit on a single line. The definition must start with either a colon (:) or tilde (~), which may itself be indented by one or two spaces, and the definition text should be indented by four spaces.

Markdown (*.md)
  ---
  title: This is the title
  author: D. Author
  date: 14-02-2013
  ...
  
  # Section 1
  Term 1
   :  This is the definition of this term

  This is a phrase
   :  This is the definition of the phrase

Nesting and the four space rule

To include multiple paragraphs (or other blocked content) within a list item or nested lists, the content must be indented by four or more spaces from the main list item.

Markdown (*.md)
  ---
  title: This is the title
  author: D. Author
  date: 14-02-2013
  ...
  
  # Section 1
  1. This is the first numbered item.
  2. This is the second.
      i) this is a sub-point
      ii) and another sub-point
  1. This is the third item.  Note that the number I supplied is ignored

List ends

Normally, pandoc considers a list as complete when a blank line is followed by non-indented text (as markdown does not have starting and ending tags). However, if you wish to place indented text directly after a list, it is necessary to provide an explicit indication that the list is complete. This is done with the <!-- end of list --> marker.

Similarly, if you wish to place one list directly following on from another list, a <!-- --> marker must be used between the two lists so as to explicitly separate them.

Markdown (*.md)
  ---
  title: This is the title
  author: D. Author
  date: 14-02-2013
  ...
  
  # Section 1
  1. This is the first numbered item.
  2. This is the second.
  1. This is the third item.  Note that the number I supplied is ignored
  
  <!-- -->
  
  1. Another list.
  2. With more points

Horizontal lines (rules)

Horizontal lines are indicated by a row of three or more *, - or _ characters (optionally separated by spaces) with a blank row either side.

Markdown (*.md)
  ---
  title: This is the title
  author: D. Author
  date: 14-02-2013
  ...
  
  # Section 1
  Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
  Praesent a velit quis ante dignissim dignissim eget vitae tellus. 
  Duis eget neque tellus, eu elementum leo. Nullam quis velit 
  in magna bibendum dictum. Curabitur tincidunt cursus tellus, 
  in egestas augue porta ut. 
  
  * * * *
  
  Phasellus facilisis porttitor elit, vel pretium felis volutpat in. 
  Praesent euismod sagittis tortor, eget varius nisi consequat eget. 
  Sed facilisis aliquet accumsan. Maecenas aliquam, dolor id hendrerit viverra, 
  lacus tortor elementum nunc, quis commodo ligula orci vel augue. Suspendisse 
  dolor purus, volutpat vel viverra vitae, laoreet blandit nulla.
  
  ---------
  
  In eros ligula, scelerisque id tempus nec, pulvinar vitae felis. Morbi
  tempor viverra orci, quis elementum metus lobortis sed. Curabitur sit amet ante massa.
  
 

Tables

As markdown is a very minimalist markup language that aims to be reasonably well formatted even when read as plain text, table formatting must be defined by layout features that have meaning in plain text.

Simple tables

The number of columns as well as column alignment are determined by the relative positions of the table headings and dashed row underneath:
  • if the dashed line is flush with the end of the column header, yet extends to the left of the start of the header text, then the column will be right aligned
  • if the dashed line is flush with the start of the column header, yet extends to the right of the end of the header text, then the column will be left aligned
  • if the dashed line extends to the left of the start and right of the end of the header text, then the column will be center aligned
  • if the dashed line is flush with the start and end of the header text, then the column will follow the default justification (typically left justified)

The table must finish in either a blank line or a row of dashes mirroring those below the header followed by a blank row.
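
The alignment rules above can be expressed as a short Python sketch. It illustrates the rules only and is not pandoc's parser; the function name column_alignments is hypothetical, and it assumes the header and the dashed line are supplied without any leading indentation. The header and dashed line used here are those of the example table in this section.

```python
import re


def column_alignments(header, dashes):
    """Infer each column's alignment from a simple-table header and dash row."""
    header = header.ljust(len(dashes))  # pad so slicing never runs short
    alignments = []
    for run in re.finditer(r"-+", dashes):
        cell = header[run.start():run.end()]
        extends_left = cell.startswith(" ")   # dashes start before the text
        extends_right = cell.endswith(" ")    # dashes continue past the text
        if extends_left and extends_right:
            alignments.append("center")
        elif extends_left:
            alignments.append("right")
        elif extends_right:
            alignments.append("left")
        else:
            alignments.append("default")
    return alignments


header = "Column A    Column B    Column C"
dashes = "---------  ----------  ---------"
print(column_alignments(header, dashes))
```

For this header the sketch reports the three columns as left, center and right aligned respectively.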

Markdown (*.md)
  ---
  title: This is the title
  author: D. Author
  date: 14-02-2013
  ...
  
  # Section 1
  Column A    Column B    Column C
  ---------  ----------  ---------
  Category 1    High        100.00
  Category 2    High         80.50
  ---------  ----------  ---------
  

Multiline tables

Simple tables can be extended to allow cell contents to span multiple lines. This imposes the following additional layout requirements:

  • The table must start with a row of dashes that spans the full width of the table
  • The table must end with a row of dashes that spans the full width of the table followed by a blank line
  • Each table row must be separated by a blank line

Markdown (*.md)
  ---
  title: This is the title
  author: D. Author
  date: 14-02-2013
  ...
  
  # Section 1
  --------------------------------
  Column A    Column B      Column 
                                 C
  ---------  ----------  ---------
  Category 1    High        100.00
                High         95.00
  
  Category 2    High         80.50
                High         82.50
  --------------------------------
  

Grid tables

Grid tables have a little more adornment in that they use characters to mark all the cell boundaries. However, by explicitly defining the bounds of a cell, grid tables permit more complex cell contents. A grid table for example, can contain a list or a code block etc.

Cell corners are marked by + characters, and the table header and main body are separated by a row of = characters.

Markdown (*.md)
  ---
  title: This is the title
  author: D. Author
  date: 14-02-2013
  ...
  
  # Section 1
  +---------------+---------------+--------------------+
  | Fruit         | Price         | Advantages         |
  +===============+===============+====================+
  | Bananas       | $1.34         | - built-in wrapper |
  |               |               | - bright color     |
  +---------------+---------------+--------------------+
  | Oranges       | $2.10         | - cures scurvy     |
  |               |               | - tasty            |
  +---------------+---------------+--------------------+

  +-----------+----------+-----------+
  |Column A   |Column B  |   Column C|
  +===========+==========+===========+
  |Category 1 |100.00    | - point A |
  |           |          | - point B |
  +-----------+----------+-----------+
  |Category 2 | 85.00    | - point C |
  |           |          | - point D |
  +-----------+----------+-----------+

Although grid tables require substantially more setup, Emacs users will welcome that grid tables are compatible with Emacs table mode.

Pipe tables

Finally, there are also pipe tables. These are somewhat similar to grid tables in requiring a little more explicit specification of cell boundaries. However, unlike grid tables, they have a means to configure column alignment, and it is not necessary to indicate cell corners. Column alignment is specified via the use of : characters in the row beneath the header (see example below).

Markdown (*.md)
  ---
  title: This is the title
  author: D. Author
  date: 14-02-2013
  ...
  
  # Section 1
  | Default | left  | Center | Right  |
  |---------|:------|:------:|-------:|
  |   High  | Cat 1 | A      | 100.00 |
  |   High  | Cat 2 | B      |  85.50 |
  |   Low   | Cat 3 | C      |  80.00 |

Again, emacs users will appreciate that pipe tables are compatible with orgtbl-mode.
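
Because the layout of a pipe table is so regular, it is easy to generate from data. The following Python sketch builds the separator row from a list of alignments using the : convention shown in the example above. The function name pipe_table and the alignment keywords are hypothetical; nothing here is part of pandoc.

```python
def pipe_table(headers, alignments, rows):
    """Format rows as a pipe table.

    alignments holds one of 'default', 'left', 'center' or 'right' per column.
    """
    separators = {
        "default": "---",
        "left": ":--",
        "center": ":-:",
        "right": "--:",
    }
    lines = ["| " + " | ".join(headers) + " |"]
    lines.append("|" + "|".join(separators[a] for a in alignments) + "|")
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


table = pipe_table(
    ["Default", "Left", "Center", "Right"],
    ["default", "left", "center", "right"],
    [["High", "Cat 1", "A", "100.00"], ["Low", "Cat 3", "C", "80.00"]],
)
print(table)
```

The generated cells are not padded to equal widths; pipe tables do not require the pipe characters to line up vertically, although aligned columns are easier to read in plain text.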

Math

Markdown leverages TeX math processing. Whilst this does technically break the rules that promote source documents that are readable in text only mode, the payoff is that math is rendered nicely in the various derivative documents (such as PDF or HTML). In fact, math is passed straight through to the derivative document, allowing that document (or its reader) to handle TeX math as appropriate.

Inline math is defined as anything within a pair of $ characters and for math in its own environment (paragraph), use a pair of $$ characters.

Markdown (*.md)
---
title: This is the title
author: D. Author
date: 14-02-2013
...

# Section 1
The formula, $y=mx+c$, is displayed inline. 
Some symbols and equations (such as 
$\sum{x}$ or $\frac{1}{2}$) are rescaled 
to prevent disruptions to the regular 
line spacing.
For more voluminous equations (such as 
$\sum{\frac{(\mu - \bar{x})^2}{n-1}}$), 
some line spacing disruptions are unavoidable.  
Math should then be displayed in displayed mode.
$$\sum{\frac{(\mu - \bar{x})^2}{n-1}}$$

  

Referencing

Referencing is the linking of information and content between different parts of a document or even between documents. These are alternatively referred to as links (particularly in the context of web documents).

Internal links

Internal links make use of the section identifiers that pandoc generates automatically from the section headings. An identifier is the heading text converted to lower case, with most punctuation removed and spaces replaced by hyphens (so the heading "Section 2" receives the identifier section-2). To reference (link to) a section, use its identifier as the reference label in the following form:

[in text label](#reference-label)

The in text label is a word or phrase that should appear as the link in the text, and reference-label is the identifier of the section you wish to link to. Note, there should not be any spaces between the square braces and the brackets.
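
Pandoc's documentation describes how an automatic identifier is derived from heading text: formatting is removed, punctuation other than underscores, hyphens and periods is dropped, spaces are replaced by hyphens, letters are lower-cased, and anything before the first letter is discarded. The Python sketch below approximates those steps for plain-text headings. It is not pandoc's own code, the function name heading_identifier is hypothetical, and details (such as the numbering of duplicate headings) are not covered and may differ between pandoc versions.

```python
def heading_identifier(heading):
    """Approximate pandoc's automatic identifier for a plain-text heading."""
    # Keep letters, digits, underscores, hyphens, periods and whitespace.
    kept = "".join(
        ch for ch in heading if ch.isalnum() or ch in "_-." or ch.isspace()
    )
    # Join the words with hyphens and lower-case the result.
    text = "-".join(kept.split()).lower()
    # Identifiers may not begin with a digit or punctuation mark.
    for index, ch in enumerate(text):
        if ch.isalpha():
            return text[index:]
    return "section"  # fallback when no letters remain


for title in ["Introduction", "Section 2", "3. Applications"]:
    print(title, "->", "#" + heading_identifier(title))
```

A link to a heading titled Introduction would therefore be written as [introduction](#introduction).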

Markdown (*.md)
---
title: This is the title
author: D. Author
date: 14-02-2013
...

# Introduction
Bla Bla Bla

# Section 2
See the [introduction](#introduction).
  

To link to arbitrary parts of the document (non sections), it is necessary to include a point marker with a reference label, so that there is something to link to.

Markdown (*.md)
  ---
  title: This is the title
  author: D. Author
  date: 14-02-2013
  ...

  # Introduction
  Bla Bla Bla
  
  Table: this is the table caption [cap]: #cap
  
  | Default | left  | Center | Right  |
  |---------|:------|:------:|-------:|
  |   High  | Cat 1 | A      | 100.00 |
  |   High  | Cat 2 | B      |  85.50 |
  |   Low   | Cat 3 | C      |  80.00 |
  
  # Section 2
  See the [cap].

External links

Linking to external documents follows a similar format:
[in text label](url)
  
Markdown (*.md)
  ---
  title: This is the title
  author: D. Author
  date: 14-02-2013
  ...
  
  Go to the [Google search engine](http://www.google.com)

Images

Images are not displayed in plain text (obviously). However, an image link in pandoc markdown will insert the image into the various derivative document types (if appropriate). Image links are defined in a similar manner to other links, yet preceded immediately by a ! character.

![in text label](image.jpg)

#OR 

![label]
[label]: image.jpg
  
Markdown (*.md)
  ---
  title: This is the title
  author: D. Author
  date: 14-02-2013
  ...
  
  # Introduction
  Include the png figure
  ![j](images/ws9.2aQ1.1.png) 

Footnotes

Footnotes consist of a placemarker and the footnote text:

To create a footnote[^note1]

[^note1]: A footnote marker cannot contain any spaces.
  
Markdown (*.md)
  ---
  title: This is the title
  author: D. Author
  date: 14-02-2013
  ...
  
  # Introduction
  To create a footnote[^note1]
  
  [^note1]: A footnote marker cannot contain any spaces.

Further reading