Generating wordclouds from massive PDF libraries

My friend Dani posed an interesting question on Twitter yesterday:

Research pals, esp. those that use #zotero - do you know if there is a way I can visualise data from my library, e.g. word clouds? Paperplanes plugin no longer exists :( what I'd really like to do is see what words are coming up after a particular word is featured in the papers.
— Dr Daniella Trimboli (she/her) (@djtrimboli) June 24, 2022

I used Zotero to manage my bibliography during and after my PhD, and like any recovering academic, I have a giant folder full of PDFs that I am weirdly attached to and can’t bring myself to delete.

So I thought to myself: This must be possible, and I certainly don’t have anything better to do with my time.

We are going to make use of two Python packages: pdftotext and wordcloud. The following was done on a Mac but it should work anywhere Python works.

Note: There’s nothing Zotero-specific here; the following works on any large directory of PDFs.

A note on Zotero’s filesystem

Zotero’s local database consists of its SQLite bibliography database and linked files. Linked files are stored in a directory tree: lots of directories with a few files in each. So we will need to traverse these folders to find all the PDFs.
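If you would rather do the traversal in Python than on the command line, here is a minimal sketch. The function name find_pdfs is my own invention for illustration, and it assumes nothing about Zotero beyond "PDFs live somewhere under one root folder".

```python
from pathlib import Path


def find_pdfs(root):
    """Return every PDF under root, searching all subdirectories."""
    root = Path(root)
    # rglob("*") walks the whole tree; comparing the lowercased suffix
    # means files ending in .PDF are picked up as well.
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Calling find_pdfs with the path to your PDF directory gives you a sorted list of paths that you can loop over.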

Installing Packages

Assuming you have Python installed, we need to install the command line tools we will use. These instructions are for Mac:

brew install pkg-config poppler python
pip install wordcloud pdftotext

Hopefully that all goes smoothly. I can’t help you install Python, sorry.

Converting a library of PDFs to a giant text file

The Wordcloud tool builds wordcloud images from a text file. We have a big set of PDFs. So the first step is to merge the content of the PDFs into one giant text file.

find "$THE_DIRECTORY_WHERE_MY_PDFS_ARE" -name '*.pdf' -exec pdftotext "{}" - \; >> combined_text.txt

We now have a single text file containing all of the text content of all the PDFs in it. Note: This won’t work for scanned PDFs.
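The same merge can be sketched in Python using the pdftotext package we installed with pip. The function names here (extract_with_pdftotext, combine_pdfs) are hypothetical, and skipping PDFs that fail to parse is my own choice rather than anything the tools require.

```python
def extract_with_pdftotext(pdf_path):
    """Extract the text of one PDF using the pdftotext Python package."""
    import pdftotext  # imported lazily so the rest works without it

    with open(pdf_path, "rb") as handle:
        pages = pdftotext.PDF(handle)
        # A PDF object behaves like a sequence of page strings.
        return "\n\n".join(pages)


def combine_pdfs(pdf_paths, output_path, extract=extract_with_pdftotext):
    """Write the text of every PDF to output_path; return paths that failed."""
    failed = []
    with open(output_path, "w", encoding="utf-8") as out:
        for pdf_path in pdf_paths:
            try:
                text = extract(pdf_path)
            except Exception:
                # Damaged or encrypted PDFs should not stop the whole run.
                failed.append(pdf_path)
                continue
            out.write(text)
            out.write("\n")
    return failed
```

The extract argument is only there so you can swap in a different extractor (an OCR step for scanned PDFs, say) without touching the loop.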

Converting the text to a wordcloud

Now that we have a single text file containing all our text, we can feed that into wordcloud.

wordcloud_cli --text combined_text.txt --imagefile wordcloud.png

Depending on how big your library is, this could take a little while. But you should end up with a nice Wordcloud:

Guess my research topic

The Wordcloud package has lots of options for customising the output.

Excluding Words

You can provide a list of words you want excluded. In the example above, ‘use’, ‘using’, and ‘used’ aren’t particularly useful. To exclude words, create a new text file with each excluded word on its own line, and provide it to wordcloud. This is great for excluding common words that are not particularly interesting for your topic.

Wordcloud has a built-in list of stop words, and providing your own overrides the built-in ones. You can start with the default list, which you can find here, and add your own words to it.

wordcloud_cli --text combined_text.txt --imagefile wordcloud.png --stopwords excluded.txt

And we end up with this:

Excluded a few words
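If you don’t fancy copying the default stop word list by hand, a short script can build the file for you. This is a sketch under a couple of assumptions: write_stopwords is a made-up helper name, and it relies on the wordcloud package exposing its default list as STOPWORDS.

```python
def write_stopwords(path, extra_words, base_words=None):
    """Write base_words plus extra_words to path, one word per line."""
    if base_words is None:
        # STOPWORDS is the default stop word set shipped with wordcloud.
        from wordcloud import STOPWORDS
        base_words = STOPWORDS
    combined = set(base_words) | set(extra_words)
    # Normalise to lowercase and drop blank entries before writing.
    words = sorted({word.strip().lower() for word in combined if word.strip()})
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(words) + "\n")
    return words
```

For example, write_stopwords("excluded.txt", ["use", "using", "used"]) should produce a file you can pass to --stopwords.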

Excluding short words

In addition to the stopwords list, you can also tell Wordcloud to just not include words shorter than a specific length:

wordcloud_cli --text combined_text.txt --imagefile wordcloud.png --stopwords excluded.txt --min_word_length 8

Things start to get a bit more interesting:

Only the big words

Customising the output

There are many ways to customise the resulting image. Run wordcloud_cli --help for a list of all options. For example:

wordcloud_cli --text combined_text.txt \
              --imagefile wordcloud.png \
              --background white \
              --color purple \
              --min_word_length 8 \
              --width 1280 \
              --height 720 \
              --fontfile intro.otf

Which gives you this:

It's beautiful!

You get the idea.
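For reference, here is roughly how the same settings look when you drive the wordcloud package from Python instead of the CLI. Treat it as a sketch: cloud_options and render are names I made up, intro.otf is just the font from my example, and using get_single_color_func for the colour is my understanding of how a single colour is applied.

```python
def cloud_options(stopwords=None):
    """Keyword arguments for wordcloud.WordCloud mirroring the CLI flags."""
    options = {
        "background_color": "white",  # --background white
        "min_word_length": 8,         # --min_word_length 8
        "width": 1280,                # --width 1280
        "height": 720,                # --height 720
        "font_path": "intro.otf",     # --fontfile intro.otf
    }
    if stopwords is not None:
        options["stopwords"] = set(stopwords)
    return options


def render(text_path, image_path, color="purple"):
    """Render a text file to an image using a single colour."""
    from wordcloud import WordCloud, get_single_color_func  # lazy import

    with open(text_path, encoding="utf-8") as handle:
        text = handle.read()
    cloud = WordCloud(color_func=get_single_color_func(color), **cloud_options())
    cloud.generate(text).to_file(image_path)
```

Calling render("combined_text.txt", "wordcloud.png") should give a result close to the command above.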

Bonus: Work on a subset of my Zotero Library

If you have a big giant sprawling Zotero library, or different libraries for different projects, you might want to generate wordclouds from just some of your documents. You can export files from Zotero to accomplish this.

Select the publications you want to use
Right click and select Export Items
On the export options, make sure Export Files is checked
Choose where to save the export

You’ll end up with a new folder containing just the PDFs you selected.

Going Further

The wordcloud_cli is just a frontend to the wordcloud Python package. If the command line options don’t give you the customisation you are looking for then you can write code to accomplish what you want.
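For example, Dani’s original question was about which words come up after a particular word. The CLI can’t do that, but a few lines of Python can. This is a minimal sketch: words_after and render_frequencies are hypothetical helper names, and the tokenising is deliberately crude (lowercase letters and apostrophes only).

```python
import re
from collections import Counter


def words_after(text, target, min_length=1):
    """Count the words that appear immediately after target in text."""
    words = re.findall(r"[a-z']+", text.lower())
    target = target.lower()
    counts = Counter()
    # Pair each word with the one that follows it.
    for current, following in zip(words, words[1:]):
        if current == target and len(following) >= min_length:
            counts[following] += 1
    return counts


def render_frequencies(counts, image_path):
    """Draw a wordcloud from a word -> count mapping."""
    from wordcloud import WordCloud  # lazy import

    # generate_from_frequencies sizes each word by its count.
    WordCloud().generate_from_frequencies(counts).to_file(image_path)
```

Read combined_text.txt into a string, pass it to words_after along with your word of interest, and hand the result to render_frequencies. If the word never appears, the counts will be empty and there is nothing to draw.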

Join the conversation!

Hit me up on Twitter or send me an email.

This is a Gatsby Stan Page Now

Welcome to my new website… That looks exactly like the old one. Over the last few weekends I’ve rebuilt this website from Wordpress into a static site built with Gatsby.

Why?

I’ve wanted to migrate off Wordpress for a while. Some reasons:

I have less patience for managing servers in my spare time than I used to.
I’m paying for a VM that I don’t really need. It could be just a couple of cents/month for CloudFront, or host on Netlify for free.
I really don’t want to do this sysadmin work in my spare time

Doing a half-assed job of running a server is a security risk. It’s also costing me money unnecessarily.

Now, don’t get me wrong - I really like Wordpress. It’s my recommendation for anybody who wants to build a website and have a nice admin panel and editor. But more than most people, I’m happy writing in Markdown (I wrote my PhD thesis in LaTeX), and I’m happy working with Git.

A static site just makes sense.

Why Gatsby?

I had been messing with Hugo. I liked that builds were super fast. However, it didn’t click:

Extending Hugo through plugins isn’t really a thing
Image processing in Hugo is fairly basic (I really just wanted to do BlurHash)
I couldn’t get into Hugo’s templating system

I gave Gatsby a go and have found it really enjoyable to work with. Gatsby is more flexible, meaning in theory I had to do more to get a basic blog, but in reality, that work went really fast. In a few nights I’ve totally rebuilt the Wordpress theme, set up really great image processing, and migrated all the articles from Wordpress.

Maybe it’s because I’m fairly familiar with the Node ecosystem and Typescript, but I just found Gatsby so much easier to work with.

No Comment

The one thing that hasn’t been transferred over is comments, because all the commenting solutions for static sites suck. That doesn’t mean I don’t want to talk to you though - please talk to me on Twitter or send me an email.

But the commenting solutions for static sites suck.

The open source ones generally require a server. The main reason for me converting to a static site was so I didn’t have to run a server. Maybe someone will come up with a really great serverless solution.
The paid hosted options are expensive for a personal site like this.
The free hosted options are gross: ugly, ad-filled, privacy-invading.

So comments are gone. But most of the comments on Wordpress were out of date anyway. So going forward, discussions can happen elsewhere.

Thank you for reading my webzone

I’m going to try and post more about Flutter and other programming stuff, to justify the effort I’ve just gone to.

Oh, and it’s Open Source now, by the way.

Improving Flutter's iOS build times on CI

CareApp uses CI to test every commit of our mobile app, and build and deploy every merge to master, for iOS and Android. A few weeks ago I felt very seen by this tweet:

A build/deploy time of 2m50s (including infrastructure) makes me antsy as hell. I have no idea how people deal with builds that take double digit minutes and longer
— JT Official (@jtango18) October 6, 2020

What follows was an adventure to cut our CI times by 12 minutes.

Read all about it on the new CareApp Engineering Blog

Remote Redux Debugging in Flutter

Connect your Flutter app’s Redux Store to the Redux Devtools from the web!

I really like Flutter, and I like using Redux when building mobile apps. There’s a great Redux implementation for Dart and Flutter, and a time travel capable debug store.

The Javascript world is spoilt with the fantastic Redux DevTools plugin. It allows you to inspect actions and application state in a web browser, time travel, and play back changes to your app’s state. There is an on-screen time travel widget for Flutter, but that means sacrificing screen space for the UI.

So why not combine the Redux DevTools from the Javascript world with Redux.dart? Now you can, with the redux_remote_devtools package!

Debug your Redux Store with Flutter and Remote DevTools

This article gives a quick overview of how to get set up. The Git repository contains examples to help get you started.

Getting Started

Add the library to your app’s pubspec.yaml:

dependencies:
  redux_remote_devtools: ^0.0.4

And add the middleware to your app, and provide it a reference to your store so time travel actions from the remote can be dispatched:

var remoteDevtools = RemoteDevToolsMiddleware('YOUR_HOST_IP:8000');
await remoteDevtools.connect();
final store = new DevToolsStore<AppState>(searchReducer,
  middleware: [
    remoteDevtools,
  ]);
remoteDevtools.store = store;

Start up the remotedev server, and then run your Flutter app:

npm install -g remotedev-server
remotedev --port 8000

You can then browse to http://localhost:8000 and start using Remote DevTools to debug your Flutter app!

Encoding Actions and State

In the Javascript world, Redux follows a convention that your redux state is a plain Javascript Object, and actions are also Javascript objects that have a type property. The JS Redux Devtools expect this. However, Redux.dart tries to take advantage of the strong typing available in Dart. To make Redux.dart work with the JS devtools, we need to convert actions and state instances to JSON before sending.

Remember that the primary reason for using devtools is to allow the developer to reason about what the app is doing. Therefore, exact conversion is not strictly necessary – it’s more important for what appears in devtools to be meaningful to the developer.

To make your actions and state JSON encodable, you have two options. Either add a toJson method to all your classes, or use a package like json_serializable to generate the serialisation code at build time. The GitHub search example demonstrates both approaches.

If your store is simple then you may be using enums for actions. These encode just fine without any extra effort.
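To make the idea concrete, here is a language-neutral sketch of the conversion, written in Python rather than Dart so it runs outside a Flutter project. It is not the package’s actual encoder: encode_action, SearchAction and CounterAction are hypothetical names, and the "payload" key is my own choice. The only thing taken from the convention described above is that the result is a plain object with a type property.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass
from enum import Enum


def encode_action(action):
    """Turn a typed action into a JSON-friendly dict with a 'type' key."""
    if isinstance(action, Enum):
        # Enum actions carry no data, so the name alone is enough.
        return {"type": action.name}
    if is_dataclass(action):
        # Use the class name as the type and the fields as the payload.
        return {"type": type(action).__name__, "payload": asdict(action)}
    # Fallback: something readable is better than nothing in devtools.
    return {"type": type(action).__name__, "payload": str(action)}


@dataclass
class SearchAction:  # hypothetical typed action
    query: str


class CounterAction(Enum):  # hypothetical enum-style action
    INCREMENT = 1


encoded = json.dumps(encode_action(SearchAction("flutter")), sort_keys=True)
```

The point is the same as in Dart: the encoded form only has to be meaningful to a human reading devtools, not a lossless copy of the original object.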

Time Travel

If you have configured your app to use the DevToolsStore from redux_devtools, then you can time travel through your app state using the UI.

Remember that there are limitations to time travel, especially if you are using epics or other asynchronous processing with your Redux store.

This is a new library, so there are still things to work out. PRs are welcome if you’re up for helping out.

Now go build something cool with Flutter!

* Photo by Tim Mossholder on Unsplash

Production Error Handling in Ionic

Nobody likes apps that crash or stop working properly. Handling and recovering from errors is obviously an important task for any developer; we should not assume that everything will run smoothly.

In this post we’re talking about what to do on top of your regular error handling — the last resort.

Read on the NextFaze Blog

* Photo by Kris Mikael Krister on Unsplash
