Simple Markdown Parsing using Python

Convert Posts into Gist-Friendly HTML and JSON

Dealing with messy Markdown rendering can be frustrating. This guide walks through converting Markdown into clean HTML and structured JSON, with special attention to GitHub Gists and content blocks. Whether the target is a frontend app or a custom blog engine, this approach gives more control and flexibility over how content is handled and rendered.

This approach offers a reliable way to convert any Markdown file into clean HTML and structured JSON, as long as the content follows standard Markdown syntax. The process is straightforward:

  • Use the markdown Python package to convert Markdown into HTML
  • Detect and embed GitHub Gists as script tags
  • Parse the HTML with BeautifulSoup to separate text, images, and scripts into structured JSON
  • Automate the entire flow to process multiple Markdown files at once using Python

Why Structure Markdown Output?

A single unstructured block of HTML is hard to work with. For example, a React app might render a Gist in one component and text in another, or text, embeds, and media might need to be broken into separate visual blocks. Structuring the content as JSON provides that control: each block can be rendered, styled, or loaded on its own, which helps frontend performance and maintainability by organizing content in a way that suits the platform's needs.

Step 1: Convert Gist Links to Embed Scripts

When a Gist URL such as https://gist.github.com/username/gistid appears on a line by itself, it is converted into an embedded script tag:

<script id="gist" src="username/gistid"></script>

Here’s the Python function to handle this:
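The embedded snippet is not reproduced here, so what follows is a minimal sketch rather than the original function. It assumes the regex described in this section, applied one line at a time so that blank lines around the URL are left alone. The names GIST_LINE and convert_gist_links are hypothetical.

```python
import re

# Pattern from this section: a line that contains only a Gist URL,
# with optional surrounding whitespace. Group 2 is "username/gistid".
GIST_LINE = re.compile(
    r"^\s*(https://gist\.github\.com/([a-zA-Z0-9\-]+/[a-zA-Z0-9]+))\s*$"
)


def convert_gist_links(markdown_text: str) -> str:
    """Replace standalone Gist URLs with the script tag used in this guide."""
    converted = []
    for line in markdown_text.split("\n"):
        match = GIST_LINE.match(line)
        if match:
            # Assumption: the frontend resolves src="username/gistid" itself.
            converted.append(f'<script id="gist" src="{match.group(2)}"></script>')
        else:
            converted.append(line)
    return "\n".join(converted)
```

Matching line by line means a Gist URL that appears inside a sentence is left untouched, since the pattern requires the URL to be the only content on the line.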

How the Regex Works

  • ^\s*: Matches optional whitespace at the start of the line
  • (https://gist\.github\.com/([a-zA-Z0-9\-]+/[a-zA-Z0-9]+)): Captures the Gist URL and the username/gistid part
  • \s*$: Matches optional whitespace at the end of the line

A Gist URL like https://gist.github.com/user/abc123 becomes:

<script id="gist" src="user/abc123"></script>

Step 2: Convert Markdown to HTML

The markdown package is used to convert the Markdown into HTML:
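The embedded snippet is not reproduced here. A minimal sketch, assuming the Python-Markdown package (installed with pip install markdown) and its built-in fenced_code extension, could look like this. The function name markdown_to_html is hypothetical.

```python
import markdown


def markdown_to_html(markdown_text: str) -> str:
    """Convert Markdown text to an HTML string.

    The fenced_code extension is an assumption here: it lets triple-backtick
    code blocks render as <pre><code> elements.
    """
    return markdown.markdown(markdown_text, extensions=["fenced_code"])


if __name__ == "__main__":
    print(markdown_to_html("# Title\n\nSome **bold** text."))
```

Raw block-level HTML, such as the Gist script tag from Step 1, is passed through to the output rather than escaped, which is what lets the later parsing step find it.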

The markdown package is simple and handles standard formatting, like headers, bold, and code blocks, reliably.

Step 3: Break HTML into Components

Next, the HTML is parsed and broken into structured components: text, images, and embedded Gists. BeautifulSoup handles the task:
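The embedded snippet is not reproduced here. The sketch below shows one way the split could work, assuming BeautifulSoup 4 (pip install beautifulsoup4) and the built-in html.parser backend. The function name html_to_blocks and the block keys ("type", "html", "src", "alt") are illustrative assumptions, not a schema defined by this guide.

```python
from bs4 import BeautifulSoup


def html_to_blocks(html: str) -> list:
    """Split HTML into a flat list of text, image, and gist blocks."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    # Walk only the top-level tags; whitespace between them is ignored.
    for node in soup.find_all(True, recursive=False):
        if node.name == "script" and node.get("id") == "gist":
            blocks.append({"type": "gist", "src": node.get("src", "")})
            continue
        # Pull images out of the element so they become their own blocks.
        images = [node] if node.name == "img" else node.find_all("img")
        for img in images:
            blocks.append(
                {"type": "image", "src": img.get("src", ""), "alt": img.get("alt", "")}
            )
            img.extract()
        # Whatever still carries text is kept as an HTML text block.
        if node.get_text(strip=True):
            blocks.append({"type": "text", "html": str(node)})
    return blocks
```

As a simplification, when a paragraph mixes text and images, the image blocks are emitted before the remaining text, so the order inside that paragraph is not preserved.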

Instead of a large block of HTML, this process results in structured JSON like:
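The original sample output is not reproduced here. As a hypothetical illustration, a post with one paragraph, one Gist, and one image might be represented as the list below, printed as JSON. The keys and values are assumptions chosen for this example.

```python
import json

# Hypothetical block list: the keys ("type", "html", "src", "alt") are
# illustrative assumptions, not a schema defined by the article.
example_blocks = [
    {"type": "text", "html": "<p>Intro paragraph.</p>"},
    {"type": "gist", "src": "user/abc123"},
    {"type": "image", "src": "diagram.png", "alt": "Diagram"},
]

# Serialize the blocks the way they would be stored or sent to a frontend.
print(json.dumps(example_blocks, indent=2))
```

Each entry carries a type field, so a frontend can switch on it to pick the right component for the block.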

This makes rendering on the frontend easier.

Step 4: Automate Markdown Processing

To handle multiple Markdown files, a loop can be used for automation. To run the batch process, create a DataFrame with the Markdown file links:
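The embedded snippet is not reproduced here. The sketch below assumes pandas, a DataFrame with a hypothetical "path" column pointing at local Markdown files, and a convert callable that stands in for the Markdown-to-blocks pipeline from the earlier steps. The function name process_markdown_files and the output layout (one JSON file per input) are also assumptions.

```python
import json
from pathlib import Path
from typing import Callable

import pandas as pd


def process_markdown_files(
    files: pd.DataFrame,
    convert: Callable[[str], list],
    output_dir: str,
) -> list:
    """Convert every Markdown file listed in files["path"] to a JSON file."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for source in files["path"]:
        source_path = Path(source)
        # convert() takes Markdown text and returns a list of blocks.
        blocks = convert(source_path.read_text(encoding="utf-8"))
        target = out / f"{source_path.stem}.json"
        target.write_text(json.dumps(blocks, indent=2), encoding="utf-8")
        written.append(target)
    return written


if __name__ == "__main__":
    # Hypothetical file names; replace with real paths before running.
    files = pd.DataFrame({"path": ["posts/first.md", "posts/second.md"]})
    # Placeholder converter: wraps the raw text in a single text block.
    print(process_markdown_files(files, lambda text: [{"type": "text", "html": text}], "output"))
```

Passing the converter in as an argument keeps the batch loop independent of how each file is parsed, so the same loop works if the pipeline changes.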

This automates the process of converting Markdown into structured JSON.

For Markdown rendering, especially with embedded Gists or images, this approach offers clean, flexible output. It provides more control over frontend rendering while avoiding unnecessary complexity. Structured content can make websites faster, more flexible, and easier to maintain. It also allows content reuse across platforms, such as newsletters or content feeds.