Visualizing Bookworm

Mix.install([
  {:req, "~> 0.3.0"},
  {:floki, "~> 0.34.0"},
  {:kino_explorer, "~> 0.1.4"},
  {:libgraph, "~> 0.16.0"}
])

🤔 Why?

Bookworm is one of my favorite podcasts. Like many listeners I use it as a filter to find quality books to add to my queue. I love how the hosts don't just read but engage with the material, holding each other accountable with takeaways and challenges.

A recurring joke in Bookworm is how many times they refer to Ep. 42, How To Read a Book by Mortimer Adler. The last few times they've made that joke I've wanted to actually visualize that graph. I know Mike loves Obsidian, I love Elixir and have been learning Livebook… let's play!

(Livebook is a Markdown-based workbook system (similar to Jupyter notebooks) that uses Erlang/Elixir. This blog post was mostly written and run in Livebook. I exported the content and output, added some screenshots, and here we are! You could actually take the markdown source of this post and open it in Livebook. Pretty cool.)

🎣 Scraping

The first thing is to scrape the pages for processing and evaluation. One lovely thing about Livebook is that if you structure each code block separately, it intelligently memoizes each variable and only re-evaluates them if the structure changes.

Previously, I had used HTTPoison to do some downloading, but my friend Seth has recommended Req, so let's give that a whirl.

Adding the following to the setup:

Mix.install([
  {:req, "~> 0.3.0"}
])

Now I can see if I can hit the opening page:

Req.get!("https://bookworm.fm/").body
"<!DOCTYPE html>\n<html lang=\"en-US\" xmlns:og=\"http://ogp.me/ns#\"" <>

Success! Let's be a little more intentional.

Bookworm's website is structured logically. Every episode will be in the form https://bookworm.fm/1/ up to https://bookworm.fm/<current>/. The most recent post is the top <h2> on the page.

Now I can hone the CSS selector I need to get the most current link. What's really nice (again) about Livebook is that as I rerun the code over and over… I'm not pummeling my internet friends' server with requests. Livebook's cells have localized and memoized all the variables.

most_recent_episode_number =
  Req.get!(url: "https://bookworm.fm/").body
  |> Floki.parse_document!()
  |> Floki.find("h2 a")
  |> List.first()
  |> Floki.attribute("href")
  |> List.first()
  |> String.split("/", trim: true)
  |> List.last()
  |> String.to_integer()
176

☝️ This is a super gross way to do this, but I can fix it later.
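One way I could de-gross this later: pull the number out of the URL path with URI.parse/1 and Integer.parse/1, both from the standard library. This is only a sketch. `EpisodeUrl` is a made-up module name, and it assumes episode links always look like https://bookworm.fm/<number>/.

```elixir
defmodule EpisodeUrl do
  # Hypothetical helper, not part of the original notebook.
  # Returns {:ok, number} for URLs whose path is a single integer segment,
  # e.g. "https://bookworm.fm/176/", and :error for anything else.
  def number(url) when is_binary(url) do
    with %URI{path: path} when is_binary(path) <- URI.parse(url),
         [segment] <- String.split(path, "/", trim: true),
         {number, ""} <- Integer.parse(segment) do
      {:ok, number}
    else
      _ -> :error
    end
  end

  # nil hrefs (anchors without an href attribute) end up here.
  def number(_), do: :error
end

EpisodeUrl.number("https://bookworm.fm/176/")
# => {:ok, 176}
```

The nice part is that a non-episode link (or a missing href) comes back as :error instead of blowing up in String.to_integer/1.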

Anyway. Excelsior!

raw_episodes =
  Enum.map(1..most_recent_episode_number, fn num ->
    url = "https://bookworm.fm/#{num}/"

    %{
      # going to be important later! (basically the primary key for the episode)
      url: url,
      request: Req.get!(url: url)
    }
  end)
[
  %{
    request: %Req.Response{
      status: 200,
      headers: [
        {"connection", "keep-alive"},
        {"content-length", "13177"},
        …
        {"x-fw-type", "FLYWHEEL_BOT"}
      ],
      body: "<!DOCTYPE html>\n<html lang=\"en-US\"" <> …,
      private: %{}
    },
    url: "https://bookworm.fm/1/"
  },
  %{…},
  …
]

The first time I tried this, I forgot the trailing slash at the end of the URL and was getting tons of redirects (I'm guessing to the URI with the trailing slash). 🤷

Let's do some cleanup on this… we don't care about all the HTML, just the bit inside the main area.

episodes =
  raw_episodes
  |> Enum.map(fn %{request: request} = episode ->
    floki_article =
      request.body
      |> Floki.parse_fragment!()
      |> Floki.find("article")

    links =
      floki_article
      |> Floki.find("a")
      |> Enum.map(fn link ->
        link
        |> Floki.attribute("href")
        |> List.first()
      end)

    title =
      floki_article
      |> Floki.find("h1")
      |> Floki.text()

    published =
      floki_article
      |> Floki.find(".entry-time")
      |> Floki.attribute("datetime")
      |> List.first()
      |> DateTime.from_iso8601()
      # TODO: this should be a `with` block
      |> elem(1)

    episode
    |> Map.put(:floki_article, floki_article)
    |> Map.put(:article, Floki.text(floki_article, sep: " "))
    |> Map.put(:links, links)
    |> Map.put(:title, title)
    |> Map.put(:published, published)
  end)
[
  %{
    article: "1: Getting Things Done by David Allen July 7, 2016  • 1 hour 08 minutes Welcome to Bookworm! To kick things off, we’ve reread a book we both rely on heavily in our lives, Getting Things Done by David Allen. This is the best discussion either of us have had about GTD in a long time. Joe Buhlig Mike Schmitz 33: Analog vs. Digital with Mike Schmitz TPS57: Advanced OmniFocus Setup + Workflow w/ Joe Buhlig Getting Things Done by David Allen Defer Dates in OmniFocus 43 Folders – Merlin Mann SaneBox – Email Management for Any Inbox JIRA Software – Atlassian Confluence – Atlassian The Organized Mind by Daniel J. Levitin Deep Work by Cal Newport Mike's Rating: 4\n Joe's Rating: 5 http://traffic.libsyn.com/bookworm/BW001.mp3 Podcast:  Play in new window  |  Download  (Duration: 1:08:39 — 63.0MB) Subscribe:  Apple Podcasts  |  RSS",
    floki_article: [
      {"article",
       [
         {"class",
         …
       ]}
    ],
    links: ["http://joebuhlig.com", "http://mikeschmitz.me/", "http://joebuhlig.com/33/",
     "http://www.asianefficiency.com/podcast/057-joe-buhlig/",
     "https://www.amazon.com/Getting-Things-Done-Stress-Free-Productivity/dp/0143126563/ref=sr_1_1?tag=bookwormfm-20",
     "https://discourse.omnigroup.com/t/defer-date-how-to-use/1046", "http://www.43folders.com/",
     "http://www.sanebox.com/", "https://www.atlassian.com/software/jira",
     "https://www.atlassian.com/software/confluence",
     "https://www.amazon.com/Organized-Mind-Thinking-Straight-Information/dp/0147516315/ref=sr_1_1?tag=bookwormfm-20",
     "https://www.amazon.com/Deep-Work-Focused-Success-Distracted/dp/1455586692/ref=sr_1_1?tag=bookwormfm-20",
     "http://traffic.libsyn.com/bookworm/BW001.mp3", "http://traffic.libsyn.com/bookworm/BW001.mp3",
     "http://traffic.libsyn.com/bookworm/BW001.mp3",
     "https://itunes.apple.com/us/podcast/bookworm/id1132102092?mt=2&ls=1#episodeGuid=http%3A%2F%2Fbookworm.fm%2F%3Fp%3D12",
     "https://bookworm.fm/feed/podcast/"],
    request: %Req.Response{
        …
  },
  %{…},
  …
]
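About that TODO in the published step: here's roughly what the `with` version could look like. `Published` is a hypothetical module name, and I'm assuming the datetime attribute holds an ISO 8601 string with an offset. It takes the list that Floki.attribute/2 returns, so an empty list or a junk value turns into :error instead of a crash.

```elixir
defmodule Published do
  # Hypothetical helper, not part of the original notebook.
  # `datetime_attrs` is the list returned by Floki.attribute(node, "datetime").
  def parse(datetime_attrs) do
    with [raw | _] when is_binary(raw) <- datetime_attrs,
         {:ok, datetime, _utc_offset} <- DateTime.from_iso8601(raw) do
      {:ok, datetime}
    else
      # No attribute found, a nil value, or a string that is not ISO 8601.
      _ -> :error
    end
  end
end
```

Note that DateTime.from_iso8601/1 normalizes to UTC, so the original offset is dropped here.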

Every time I create one of these tiny blocks, it feels wrong to my DRY-trained programming brain, but I know from experience with Livebooks that it rewards little iterative steps.

I've thought of several different things I want to do with this:

  • a table of "most linked to episodes"
  • a graph showing all the links
  • transformation to markdown notes that could be stuck in someone's (cough Mike cough) Obsidian.

Let's go!

🗺️ Graphs and Maps

I thought that the first thing I'm going to need to do is to transform this list of raw bodies into something that we can actually work with and display as a table.

I started to add a Smart Cell Data Explorer, which will automatically add dependencies for all its plugins and things… but I'm running into it freaking out over the EasyHTML/Floki structs… let's go for the throat and try the graph.

I've done this before for Advent of Code, so excuse the magic… 🧙

graph =
  episodes
  |> Enum.reduce(Graph.new(), fn %{url: url, links: links}, g ->
    links
    |> Enum.reduce(g, fn link, g ->
      Graph.add_edge(g, url, link)
    end)
  end)

graph =
  episodes
  |> Enum.reduce(graph, fn %{url: url, title: title}, g ->
    Graph.label_vertex(g, url, title)
  end)
#Graph<type: directed, num_vertices: 2201, num_edges: 4844>

Hey presto, we have a Graph object! We can now run lots of algorithms on the graph, see strongly connected components, find backlinks, all that good stuff.

But now we need to display it… the naive way with Graphviz!

LibGraph supplies a Graph.to_dot/1 function whose output Graphviz can read:

dot_output = elem(Graph.to_dot(graph), 1)

IO.puts(dot_output)
…
    1209938012 -> 3153079849 [weight=1]
    1209938012 -> 3305474943 [weight=1]
    1209938012 -> 3354501824 [weight=1]
    1209938012 -> 3484679997 [weight=1]
    …
}

:ok

Ordinarily we could paste that into something like Graphviz Online and we'd be done, but Mike and Joe have linked a lot of links.

Let's output a file for safekeeping.

File.write!("#{__DIR__}/images/output.dot", dot_output)
:ok

The output.dot file is understandable by Graphviz tooling:

System.shell(
  "/opt/homebrew/bin/sfdp -Tpng #{__DIR__}/images/output.dot > #{__DIR__}/images/sfdp.png"
)
{"", 0}

This whole bit takes a long time… there are a lot of edges.

I spent some time staring at the initial visualization and realized a few things would need to be straightened up above:
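If I wanted the episode titles to show up as readable node labels, one option would be to write the DOT text by hand instead of going through Graph.to_dot/1. This is a sketch, not what the notebook does: `DotSketch`, the `digraph bookworm` name, and the `shape=box` styling are all my own assumptions. It takes a list of {from, to} URL pairs and a map of URL to title.

```elixir
defmodule DotSketch do
  # Hypothetical helper, not part of the original notebook.
  # edges:  list of {from_url, to_url} tuples
  # labels: map of url => display title (only these nodes get a label and a box)
  def render(edges, labels) do
    node_lines =
      Enum.map(labels, fn {vertex, label} ->
        ~s|  "#{escape(vertex)}" [label="#{escape(label)}", shape=box];|
      end)

    edge_lines =
      Enum.map(edges, fn {from, to} ->
        ~s|  "#{escape(from)}" -> "#{escape(to)}";|
      end)

    Enum.join(["digraph bookworm {"] ++ node_lines ++ edge_lines ++ ["}"], "\n")
  end

  # Inside a DOT double-quoted string, the double quote is the character to escape.
  defp escape(text), do: String.replace(text, "\"", "\\\"")
end
```

The output can be written with File.write!/2 and fed to sfdp the same way as output.dot.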

  • [ ] I need to filter out some really common noisy URLs, like libsyn and iTunes.
  • [X] I needed to get the episode title or name of the book to label those nodes, maybe set them apart visually.
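For the unchecked item, a host denylist would probably do. A minimal sketch, assuming the two noisy hosts are the ones visible in the link output earlier (traffic.libsyn.com and itunes.apple.com); `LinkFilter` is a hypothetical name and the list would need tuning against the real data.

```elixir
defmodule LinkFilter do
  # Hypothetical helper, not part of the original notebook.
  # Assumed denylist, taken from the hosts seen in the episode 1 links.
  @noisy_hosts ["traffic.libsyn.com", "itunes.apple.com"]

  # A link is noisy when its host is on the denylist.
  def noisy?(link) when is_binary(link), do: URI.parse(link).host in @noisy_hosts

  # Anchors without an href come through as nil; treat them as noise too.
  def noisy?(_), do: true

  def clean(links), do: Enum.reject(links, &noisy?/1)
end
```

Matching on the parsed host rather than a regex over the whole URL avoids accidentally dropping a link that merely mentions one of those domains in its query string.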

Normally I'd just keep working "down" the workbook, but to save on network calls it's easier to get the title when I'm doing the Req.get!.

filtered_graph =
  episodes
  |> Enum.map(fn %{links: links} = ep ->
    Map.put(
      ep,
      :links,
      links
      # one domain
      |> Enum.filter(&String.match?(&1, ~r/^https:\/\/bookworm\.fm/))
    )
  end)
  |> Enum.reduce(Graph.new(), fn %{url: url, links: links, title: title}, g ->
    links
    |> Enum.reduce(g, fn link, g ->
      Graph.add_edge(g, url, link)
    end)
    |> Graph.label_vertex(url, title)
  end)

Let's see if that filtering creates a better graphviz image…

File.write!("#{__DIR__}/images/output2.dot", elem(Graph.to_dot(filtered_graph), 1))

System.shell(
  "/opt/homebrew/bin/sfdp -Tpng #{__DIR__}/images/output2.dot > #{__DIR__}/images/filtered.png"
)

Better… somewhere I had to cut the code for labeling the bookworm vertices, unsure why that broke. Once again, I'm sure I could pass some better options to graphviz to emulate Obsidian… orrrrr…

Markdown for Obsidian

I decided to shift course and try a straightforward and explorable visualization: use Obsidian to visualize the connections.

# Start from a clean output directory on every run
File.rm_rf!("#{__DIR__}/episodes")
File.mkdir_p!("#{__DIR__}/episodes")

map =
  episodes
  |> Enum.map(fn %{url: url, title: title} -> {url, String.replace(title, ":", "-")} end)
  |> Map.new()

episodes
|> Enum.map(fn %{title: title, links: links, article: article} ->
  safe_title = String.replace(title, ":", "-")

  content = """
  # #{title}

  #{article}

  ## Mentions:
  #{Enum.reduce(links, "", fn link, str -> "#{str}\n- [[#{Map.get(map, link, link)}]]" end)}
  """

  File.write!("#{__DIR__}/episodes/#{safe_title}.md", content)
end)
[:ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok,
 :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok,
 :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, :ok, …]
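If I tidy this up, the title cleanup and the mentions list are the two bits worth pulling out. Here's a sketch using only the standard library. `ObsidianNote` is a hypothetical module, and also replacing "/" in titles is my own assumption (a slash in a title would otherwise end up in the file path).

```elixir
defmodule ObsidianNote do
  # Hypothetical helper, not part of the original notebook.
  # Replace characters that cause trouble in file names.
  def safe_title(title), do: String.replace(title, [":", "/"], "-")

  # Build the "Mentions" bullet list. Links that match a known episode URL
  # become a wikilink to that note title; anything else keeps its raw URL.
  def mentions(links, titles_by_url) do
    links
    |> Enum.reject(&is_nil/1)
    |> Enum.uniq()
    |> Enum.map_join("\n", fn link ->
      "- [[#{Map.get(titles_by_url, link, link)}]]"
    end)
  end
end
```

Enum.uniq/1 keeps the three identical libsyn links from showing up three times in each note.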

This creates a folder called episodes wherever the .livemd file lives. Open that folder as a vault in Obsidian, and you'll get something like this:

That's pretty good… but let's set up some data science-y stuff.

⚗️ Data

Let's quickly build a list of Map objects that Livebook's Explorer DataFrames can understand. We'll use a Graph function on the graph object we built earlier in the workbook to find the number of vertices (links) that have an edge to the current episode… essentially how many "backlinks."

backlinks =
  episodes
  |> Enum.map(fn %{title: title, url: url, published: published} ->
    %{
      title: title,
      published: DateTime.to_iso8601(published),
      url: url,
      backlinks: Graph.in_edges(graph, url) |> Enum.count()
    }
  end)
[
  %{backlinks: 13, title: "1: Getting Things Done by David Allen", url: "https://bookworm.fm/1/"},
  %{
    backlinks: 10,
    title: "2: The Willpower Instinct by Kelly McGonigal",
    url: "https://bookworm.fm/2/"
  },
  %{backlinks: 7, title: "3: The War Of Art by Steven Pressfield", url: "https://bookworm.fm/3/"},
  %{…},
  …
]
require Explorer.DataFrame
backlinks |> Explorer.DataFrame.new()

👏👏👏

And there's the answer: they've mentioned (linked to) How To Read a Book 41 times.
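As a sanity check on numbers like that, the same kind of count can be done without the graph at all. This is a sketch under my own assumptions: `Backlinks` is a hypothetical module, and it counts each linking episode once per target, which may not match the Graph.in_edges/2 numbers exactly.

```elixir
defmodule Backlinks do
  # Hypothetical helper, not part of the original notebook.
  # episodes: list of maps with :url and :links keys.
  # Returns a map of target URL => number of distinct episodes linking to it.
  def count(episodes) do
    episodes
    |> Enum.flat_map(fn %{url: url, links: links} ->
      # Deduplicate so one episode linking twice only counts once.
      links |> Enum.uniq() |> Enum.map(fn link -> {url, link} end)
    end)
    |> Enum.frequencies_by(fn {_from, to} -> to end)
  end
end
```

Looking up an episode URL in the resulting map gives a figure to compare against the backlinks column.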

Already some interesting insights on there… but there's so much to explore:

  • Could we see this plotted over a timeline?
  • Do references have a half-life, or do some books ring eternal in Mike and Joe's ears?
  • How does the episode book score relate to the number of backlinks in the future?
  • Could we use a transcription to find book mentions that aren't in shownotes?

That's for another day…

Check out the repo if you are interested.


🔖
Changelog
  • 2023-08-20 17:39:36 -0500
    Add link to repo

  • 2023-08-19 17:31:00 -0500
    Add published and missing pieces

  • 2023-08-19 15:17:43 -0500
    Wrap up

  • 2023-08-19 15:16:11 -0500
    Post: Visualizing Bookworm