From Inkscape to poppler
What's next? Well... poppler is the library Inkscape uses to import PDFs.
Yes, the name comes from Futurama.
Turns out, poppler comes with a bunch of CLI tools, including pdftocairo!
Halfway through this article, I realized the "regular weight" on my system was in fact Iosevka SS01 (Andale Mono Style) (see Releases), but the "bold weight" was the default Iosevka.
So, I removed both and reinstalled them from the official distribution, which explains visual and size changes after that point.
So, with a few more CLI incantations...
    $ pdftocairo /tmp/export.pdf -svg /tmp/export.svg
    $ ls -lhA /tmp/export*
    -rw-r--r--. 1 amos amos 159K Nov 19 10:14 /tmp/export.pdf
    -rw-r--r--. 1 amos amos 739K Nov 19 10:14 /tmp/export.svg
We've got an SVG file! And it's a bit large, I wonder if it embeds part of a font, like the PDF does?
Well... it's a bit more complicated.
As it turns out, individual non-bold ("regular weight") letters actually refer to other paths:
But words made up of bold letters are a single, very lengthy path:
I wonder if that's because I've only installed the "Regular" weight for the Iosevka font... let's find out.
After installing the "Bold" weight, renaming /tmp/export.EXT to /tmp/export.regular.EXT, and running both steps again, the PDF export is smaller - and so is the SVG!
    $ ls -lhAt /tmp/export.*
    -rw-r--r--. 1 amos amos 436K Nov 19 10:40 /tmp/export.svg
    -rw-r--r--. 1 amos amos  68K Nov 19 10:40 /tmp/export.pdf
    -rw-r--r--. 1 amos amos 739K Nov 19 10:39 /tmp/export.regular.svg
    -rw-r--r--. 1 amos amos 159K Nov 19 10:39 /tmp/export.regular.pdf
The PDF file now contains two partial embedded fonts:
    %% Original object ID: 4 0
    9 0 obj
    <<
      /BaseFont /Iosevka-Bold
      /DescendantFonts [ 11 0 R ]
      /Encoding /Identity-H
      /Subtype /Type0
      /ToUnicode 12 0 R
      /Type /Font
    >>
    endobj

    %% Original object ID: 5 0
    10 0 obj
    <<
      /BaseFont /Iosevka
      /DescendantFonts [ 14 0 R ]
      /Encoding /Identity-H
      /Subtype /Type0
      /ToUnicode 15 0 R
      /Type /Font
    >>
    endobj
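If you'd rather not eyeball the dump, here's a minimal sketch that pulls the font names out of a textual PDF listing like the one above. The base_fonts helper is made up for this example, and it assumes the dump is uncompressed and whitespace-separated (it won't work on a raw, compressed PDF).

```rust
/// Collect the names that follow each `/BaseFont` key in a textual PDF dump.
/// Assumption: tokens are separated by whitespace, as in the listing above.
fn base_fonts(pdf_text: &str) -> Vec<&str> {
    let mut tokens = pdf_text.split_whitespace();
    let mut fonts = Vec::new();
    while let Some(token) = tokens.next() {
        if token == "/BaseFont" {
            if let Some(name) = tokens.next() {
                // PDF names start with a slash; strip it for readability.
                fonts.push(name.trim_start_matches('/'));
            }
        }
    }
    fonts
}
```

Fed the two objects above, it should report Iosevka-Bold and Iosevka, in that order.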
And we can see in the SVG file that bold characters now also take advantage of the SVG use tag.
So what happened with bold in the first export then? How did we even get bold letters, if we didn't have the corresponding font? Let's look at them both:
The bottom version is what Iosevka is supposed to look like. The top version is Chrome's font renderer (freetype?) doing its best to turn a regular font into a bold font, by just... embiggening stuff.
So anyway, now we have a reasonable SVG. It:
- Should look the same on any machine, no matter what fonts are installed
- Thus, can be downloaded and printed easily
- Does some deduplication for glyphs, so that for example the path for Iosevka's 0 glyph is only defined once, and then re-used a bunch of times
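A quick way to check that last point is to count how many glyphs are defined versus how many times they're referenced. This is a rough sketch: glyph_stats is a name I made up, and it does plain substring matching instead of real XML parsing, which is only good enough for machine-generated output where the tags are written as `<symbol ` and `<use `.

```rust
/// Returns (glyph definitions, glyph references) for an SVG document,
/// by counting `<symbol ` and `<use ` tags with plain substring matching.
fn glyph_stats(svg: &str) -> (usize, usize) {
    let definitions = svg.matches("<symbol ").count();
    let references = svg.matches("<use ").count();
    (definitions, references)
}
```

If deduplication is working, the second number should be much larger than the first.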
But, well, we used a CLI tool to do it. Ideally we'd be able to just do it from code, since we don't want any external dependencies (Chrome being the notable, and infuriating, exception).
GNOME has a pretty good story when it comes to Rust libraries. But the folks working on them are focusing mainly on cairo, gio, glib, pango, and gtk3/gtk4. There is a poppler crate on crates.io, but it is hopelessly out-of-date.
But the good news is: there's existing tooling for glib-based C libraries, and poppler is one of them. Can we use it to generate bindings before this article becomes so large it crashes your browser? Let's find out!
gobject introspection
In the year of our lord 2021, we could all use a little introspection. And APIs are absolutely no exception.
APIs are typically defined as a bunch of C headers, and that isn't machine-friendly for a bunch of reasons. I know that because I once tried writing a C preprocessor that basically converted #ifdef blocks into cargo features. It was awful.
So what a bunch of folks have been doing instead, is to have some canonical representation of the API as a structured language (that specifically isn't C), and then from there you can generate bindings with it.
That's what folks at Microsoft are doing with windows-rs, for example.
They actually have machinery involving clang and .NET (you can take a look at the win32metadata repository for more information), and the reference definitions look like this (as seen through ILSpy):
Apple is doing something similar with BridgeSupport, although I have found very little documentation about it, and at least one person claimed it was no longer supported.
And, well, the GNOME project has been doing the same thing! If the gobject-introspection Git history is to be trusted, they've started their effort in 2004! The Rust side of it, gtk-rs/gir was "only" started in 2015.
And like I said earlier, even though poppler is actually an offshoot from xpdf (and so it looks different from a lot of other GNOME-adjacent libraries), it does have a "glib interface" (alongside a Qt interface), and that glib interface has a .gir file, and so we can use it with gtk-rs/gir!
A .gir file is just plain XML, here's an excerpt from /usr/share/gir-1.0/Poppler-0.18.gir on a Fedora 35 install:
<?xml version="1.0"?> <!-- This file was automatically generated from C sources - DO NOT EDIT! To affect the contents of this file, edit the original C definitions, and/or use gtk-doc annotations. --> <repository version="1.2" xmlns="http://www.gtk.org/introspection/core/1.0" xmlns:c="http://www.gtk.org/introspection/c/1.0" xmlns:glib="http://www.gtk.org/introspection/glib/1.0"> <include name="GObject" version="2.0"/> <include name="Gio" version="2.0"/> <include name="cairo" version="1.0"/> <package name="poppler-glib"/> <c:include name="poppler.h"/> <namespace name="Poppler" version="0.18" shared-library="libpoppler-glib.so.8,libpoppler.so.112" c:identifier-prefixes="Poppler" c:symbol-prefixes="poppler"> <!-- (skipping a few things to find the interesting bits...) --> <class name="Page" c:symbol-prefix="page" c:type="PopplerPage" parent="GObject.Object" glib:type-name="PopplerPage" glib:get-type="poppler_page_get_type"> <method name="render" c:identifier="poppler_page_render"> <doc xml:space="preserve" filename="glib/poppler-page.cc" line="336">Render the page to the given cairo context. This function is for rendering a page that will be displayed. If you want to render a page that will be printed use poppler_page_render_for_printing() instead. 
Please see the documentation for that function for the differences between rendering to the screen and rendering to a printer.</doc> <source-position filename="glib/poppler-page.h" line="38"/> <return-value transfer-ownership="none"> <type name="none" c:type="void"/> </return-value> <parameters> <instance-parameter name="page" transfer-ownership="none"> <doc xml:space="preserve" filename="glib/poppler-page.cc" line="338">the page to render from</doc> <type name="Page" c:type="PopplerPage*"/> </instance-parameter> <parameter name="cairo" transfer-ownership="none"> <doc xml:space="preserve" filename="glib/poppler-page.cc" line="339">cairo context to render to</doc> <type name="cairo.Context" c:type="cairo_t*"/> </parameter> </parameters> </method> </class> </namespace> </repository>
And, you know, it doesn't have all the information one could dream of, but it's a perfectly fine start to generate Rust bindings.
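To make the idea concrete, here's a rough sketch of going from a structured description of a function to a Rust declaration. Everything in it (the Method struct, rust_type, extern_decl) is made up for illustration: this is not how gtk-rs/gir is implemented, and the type mapping only handles the "pointer to something" case from the excerpt above.

```rust
/// A C function, roughly as a binding generator might see it after
/// reading a `<method>` element. Hypothetical type, for illustration only.
struct Method {
    c_identifier: &'static str,
    /// (parameter name, C type) pairs, instance parameter first.
    params: Vec<(&'static str, &'static str)>,
    /// C return type, e.g. "void".
    return_type: &'static str,
}

/// Map a C type to its Rust FFI spelling.
/// Simplification: `T*` becomes `*mut T`, anything else is passed through.
fn rust_type(c_type: &str) -> String {
    match c_type.strip_suffix('*') {
        Some(pointee) => format!("*mut {}", pointee),
        None => c_type.to_string(),
    }
}

/// Render the declaration that would go inside an `extern "C"` block.
fn extern_decl(method: &Method) -> String {
    let params: Vec<String> = method
        .params
        .iter()
        .map(|(name, c_type)| format!("{}: {}", name, rust_type(c_type)))
        .collect();
    let ret = if method.return_type == "void" {
        String::new()
    } else {
        format!(" -> {}", rust_type(method.return_type))
    };
    format!("pub fn {}({}){};", method.c_identifier, params.join(", "), ret)
}
```

Describing the render method from the excerpt (c:identifier poppler_page_render, parameters of type PopplerPage* and cairo_t*, returning void) yields a one-line declaration with two `*mut` parameters. A real generator also has to deal with namespaces, ownership transfer, nullability and so on, which is exactly the information the XML carries and a C header doesn't.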
So after chatting with the wonderful folks in the GNOME/Rust Matrix room, I got to work and started making my own poppler-rs.
A lot of "Rust bindings to C libraries" are actually two crates: a foobar-sys crate that is full of unsafe functions, and a foobar crate that wraps foobar-sys's functionality with safe abstractions.
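Here's what that split looks like in miniature, squeezed into a single file. I'm using strlen from the C standard library as a stand-in for "some C library", the sys module plays the role of foobar-sys, and c_string_length is a made-up name for the safe wrapper. It assumes the platform's C library is linked by default, which is the case for the usual Rust targets.

```rust
use std::ffi::CString;
use std::os::raw::c_char;

// What would live in a `foobar-sys` crate: raw declarations, nothing else.
mod sys {
    use super::c_char;

    extern "C" {
        // size_t strlen(const char *s);
        pub fn strlen(s: *const c_char) -> usize;
    }
}

/// What a `foobar` crate would expose: a safe function that upholds the
/// C function's requirements (valid pointer, NUL-terminated data) itself.
/// Returns `None` if `s` contains an interior NUL byte.
pub fn c_string_length(s: &str) -> Option<usize> {
    let c_string = CString::new(s).ok()?;
    // SAFETY: `c_string` is a valid NUL-terminated buffer that outlives the call.
    Some(unsafe { sys::strlen(c_string.as_ptr()) })
}
```

Callers of c_string_length never have to write unsafe themselves; the one unsafe block is where the wrapper vouches for the pointer it hands to C.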
And that's the model gtk-rs/gir enforces as well, so I made a little workspace...
    # in poppler-rs/Cargo.toml
    [workspace]
    members = [
        "sys",
        "poppler",
    ]
And for the sys crate, I added a little config:
    # in poppler-rs/sys/Gir.toml
    [options]
    library = "Poppler"
    version = "0.18"
    target_path = "."
    min_cfg_version = "0.70"
    girs_directories = ["../../gir-files", "../gir-files"]
    work_mode = "sys"

    external_libraries = [
        "Gio",
        "GLib",
        "GObject",
        "Cairo",
    ]

    ignore = [
        "Poppler.MAJOR_VERSION",
        "Poppler.MINOR_VERSION",
        "Poppler.MICRO_VERSION",
    ]
MAJOR_VERSION etc. are defines in C. Because we link dynamically against poppler in most scenarios, and the binding is generated once and then used against many different versions of poppler, exposing them to Rust a) is unnecessary, and b) makes gir-generated unit tests fail (because the numbers don't match up, even if the libraries would be compatible).
And then after running gir in the sys/ directory, BOOM, we have a sys crate.
It has a single src/lib.rs file, that has a preamble...
    // in `poppler-rs/sys/src/lib.rs`

    // Generated by gir (https://github.com/gtk-rs/gir @ 8891a2f2c34b)
    // from ../../gir-files (@ c6afb5857607)
    // from ../gir-files (@ ec3e62ee546b)
    // DO NOT EDIT

    #![allow(non_camel_case_types, non_upper_case_globals, non_snake_case)]
    #![allow(clippy::approx_constant, clippy::type_complexity, clippy::unreadable_literal, clippy::upper_case_acronyms)]
    #![cfg_attr(feature = "dox", feature(doc_cfg))]

    use gio_sys as gio;
    use glib_sys as glib;
    use gobject_sys as gobject;
    use cairo_sys as cairo;

    #[allow(unused_imports)]
    use libc::{c_int, c_char, c_uchar, c_float, c_uint, c_double,
        c_short, c_ushort, c_long, c_ulong,
        c_void, size_t, ssize_t, intptr_t, uintptr_t, time_t, FILE};

    #[allow(unused_imports)]
    use glib::{gboolean, gconstpointer, gpointer, GType};

    // etc.
And then some enums...
    // Enums
    pub type PopplerActionLayerAction = c_int;
    pub const POPPLER_ACTION_LAYER_ON: PopplerActionLayerAction = 0;
    pub const POPPLER_ACTION_LAYER_OFF: PopplerActionLayerAction = 1;
    pub const POPPLER_ACTION_LAYER_TOGGLE: PopplerActionLayerAction = 2;
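Note that these aren't Rust enums at all, just a c_int alias and some constants. That's the safe choice at the FFI boundary: C can hand back any integer, and materializing a Rust enum from a value it doesn't declare is undefined behavior. Here's a sketch of how a higher-level layer could turn them into a proper enum. The LayerAction type and its TryFrom impl are hypothetical (gir's actual output for the safe crate looks different); only the constants mirror the generated code.

```rust
use std::convert::TryFrom;
use std::os::raw::c_int;

// Mirrors the sys-level definitions shown above.
pub type PopplerActionLayerAction = c_int;
pub const POPPLER_ACTION_LAYER_ON: PopplerActionLayerAction = 0;
pub const POPPLER_ACTION_LAYER_OFF: PopplerActionLayerAction = 1;
pub const POPPLER_ACTION_LAYER_TOGGLE: PopplerActionLayerAction = 2;

/// Hypothetical safe-side enum, not part of the generated bindings.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LayerAction {
    On,
    Off,
    Toggle,
}

impl TryFrom<PopplerActionLayerAction> for LayerAction {
    /// Unknown raw values are handed back to the caller untouched.
    type Error = PopplerActionLayerAction;

    fn try_from(raw: PopplerActionLayerAction) -> Result<Self, Self::Error> {
        match raw {
            POPPLER_ACTION_LAYER_ON => Ok(LayerAction::On),
            POPPLER_ACTION_LAYER_OFF => Ok(LayerAction::Off),
            POPPLER_ACTION_LAYER_TOGGLE => Ok(LayerAction::Toggle),
            other => Err(other),
        }
    }
}
```

The point of returning an error for unknown values is that a newer poppler could add a variant the binding has never heard of, and we'd rather find out than crash.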
And then some unions...
    // Unions
    #[repr(C)]
    #[derive(Copy, Clone)]
    pub union PopplerAction {
        pub type_: PopplerActionType,
        pub any: PopplerActionAny,
        pub goto_dest: PopplerActionGotoDest,
        pub goto_remote: PopplerActionGotoRemote,
        pub launch: PopplerActionLaunch,
        pub uri: PopplerActionUri,
        pub named: PopplerActionNamed,
        pub movie: PopplerActionMovie,
        pub rendition: PopplerActionRendition,
        pub ocg_state: PopplerActionOCGState,
        pub javascript: PopplerActionJavascript,
        pub reset_form: PopplerActionResetForm,
    }
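A union like this is the classic C "tagged union": you read the tag first, then pick the member that matches. Here's a scaled-down sketch of that pattern in Rust. The types, fields and constant values below are all invented for the example (they are not poppler's), and it assumes, as is usual for this C idiom, that every member struct starts with the same tag field.

```rust
use std::os::raw::c_int;

// Invented tag values, for illustration only.
const ACTION_URI: c_int = 1;
const ACTION_NAMED: c_int = 2;

#[repr(C)]
#[derive(Copy, Clone)]
pub struct ActionUri {
    pub type_: c_int,
    pub uri_len: c_int,
}

#[repr(C)]
#[derive(Copy, Clone)]
pub struct ActionNamed {
    pub type_: c_int,
    pub id: c_int,
}

#[repr(C)]
#[derive(Copy, Clone)]
pub union Action {
    pub type_: c_int,
    pub uri: ActionUri,
    pub named: ActionNamed,
}

/// Read the tag, then the member that the tag says is active.
pub fn describe(action: &Action) -> String {
    // SAFETY: every member begins with a `c_int` tag at offset 0, so
    // reading `type_` is valid no matter which member was written.
    let tag = unsafe { action.type_ };
    match tag {
        // SAFETY: the tag tells us which member was initialized.
        ACTION_URI => format!("uri action, length {}", unsafe { action.uri.uri_len }),
        ACTION_NAMED => format!("named action, id {}", unsafe { action.named.id }),
        other => format!("unknown action type {}", other),
    }
}
```

Every field read from a Rust union is unsafe, which is a good hint as to why you want a safe crate sitting on top of the sys one.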
Some callbacks (not shown here), some "records", which I guess is what structs are called in gobject-introspection:
    #[repr(C)]
    #[derive(Copy, Clone)]
    pub struct PopplerRectangle {
        pub x1: c_double,
        pub y1: c_double,
        pub x2: c_double,
        pub y2: c_double,
    }

    impl ::std::fmt::Debug for PopplerRectangle {
        fn fmt(&self, f: &mut ::std::fmt::Formatter) -> ::std::fmt::Result {
            f.debug_struct(&format!("PopplerRectangle @ {:p}", self))
                .field("x1", &self.x1)
                .field("y1", &self.y1)
                .field("x2", &self.x2)
                .field("y2", &self.y2)
                .finish()
        }
    }
And then some classes!
    #[repr(C)]
    pub struct PopplerPage(c_void);

    impl ::std::fmt::Debug for PopplerPage {
        fn fmt(&self, f: &mut ::std::fmt::Formatter) -> ::std::fmt::Result {
            f.debug_struct(&format!("PopplerPage @ {:p}", self))
                .finish()
        }
    }
And then, well, then there's every function in poppler-glib:
    #[link(name = "poppler-glib")]
    #[link(name = "poppler")]
    extern "C" {
        // (MANY functions skipped)

        // Oh look this one is gated behind a cargo feature automatically!
        #[cfg(any(feature = "v0_80", feature = "dox"))]
        #[cfg_attr(feature = "dox", doc(cfg(feature = "v0_80")))]
        pub fn poppler_print_duplex_get_type() -> GType;

        // skipping more...

        pub fn poppler_page_render(page: *mut PopplerPage, cairo: *mut cairo::cairo_t);

        // skipping everything else.
    }
And that's how you get a -sys crate.
You'll note that it has only poppler functions. It doesn't have, for example, cairo functions, which is a dependency of poppler. Those are in other crates, which have already been generated and published to crates.io:
    # in `poppler-rs/sys/Cargo.toml`
    [dependencies]
    cairo-sys-rs = "0.14.9"
    gio-sys = "0.14.0"
    glib-sys = "0.14.0"
    gobject-sys = "0.14.0"
    libc = "0.2"
Now that we have the low-level, unsafe crate, we can generate the high-level crate!
That one's a bit more complicated, because, again, the .gir files are missing some information that matters for languages like Rust.
    # in `poppler-rs/poppler/Gir.toml`
    [options]
    library = "Poppler"
    version = "0.18"
    target_path = "."
    min_cfg_version = "0.70"
    girs_directories = ["../../gir-files", "../gir-files"]
    work_mode = "normal" # 👈 this was "sys" for the previous crate
    deprecate_by_min_version = true
    single_version_file = true

    external_libraries = [
        "Gio",
        "GLib",
        "GObject",
        "Cairo",
    ]

    # This tells gir "these types exist in _other crates_, you don't need to
    # generate them yourself BUT you shouldn't skip functions that use these"
    # (Normally gir skips anything that uses types that aren't explicitly
    # allowlisted).
    manual = [
        "GLib.Bytes",
        "GLib.Error",
        "GLib.DateTime",
        "cairo.Context",
        "cairo.Surface",
        "cairo.Region",
    ]

    # This is the short way of telling gir what to generate
    generate = [
        "Poppler.Backend",
        "Poppler.Document",
    ]

    # This is the long way of telling gir what to generate, where we can ignore
    # specific "object functions" (methods, really..), change the constness of some
    # parameters, etc.
    [[object]]
    name = "Poppler.Page"
    status = "generate"
        [[object.function]]
        name = "render"
            [[object.function.parameter]]
            name = "cairo"
            const = true
        [[object.function]]
        name = "render_for_printing"
            [[object.function.parameter]]
            name = "cairo"
            const = true
        [[object.function]]
        name = "get_text_layout"
        ignore = true
        [[object.function]]
        name = "get_text_layout_for_area"
        ignore = true
        [[object.function]]
        name = "get_crop_box"
        ignore = true
        [[object.function]]
        name = "get_bounding_box"
        rename = "get_bounding_box"

    [[object]]
    name = "Poppler.Rectangle"
    status = "generate"
    boxed_inline = true
There's a couple interesting workarounds I've got baked in there, for some value of "interesting".
For example, the poppler_page_get_bounding_box function prototype looks like this:
gboolean poppler_page_get_bounding_box (PopplerPage *page, PopplerRectangle *rect);
And so by default, gtk-rs/gir generated something like this:
    impl Page {
        fn is_bounding_box(&self, rect: &mut Rectangle) -> bool;
    }
Ohhh because it returns a bool, right.
...hence the odd "rename get_bounding_box to get_bounding_box" configuration.
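If I were wrapping this by hand, I'd probably hide the out-parameter entirely and return an Option. Here's a self-contained sketch of that shape. Nothing in it talks to poppler: get_bounding_box_raw is a stand-in for the C-style getter, bounding_box and size are made-up names, and it assumes the boolean means "the rectangle was filled in" (which, to be clear, is a guess).

```rust
#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct Rectangle {
    pub x1: f64,
    pub y1: f64,
    pub x2: f64,
    pub y2: f64,
}

/// Stand-in for a C-style getter: fills `rect` and reports whether it did.
fn get_bounding_box_raw(has_content: bool, rect: &mut Rectangle) -> bool {
    if has_content {
        *rect = Rectangle { x1: 0.0, y1: 0.0, x2: 744.96, y2: 481.92 };
    }
    has_content
}

/// Hypothetical friendlier wrapper: no out-parameter, no bool to forget
/// to check.
pub fn bounding_box(has_content: bool) -> Option<Rectangle> {
    let mut rect = Rectangle::default();
    if get_bounding_box_raw(has_content, &mut rect) {
        Some(rect)
    } else {
        None
    }
}

/// Width and height of a rectangle given by two corners.
pub fn size(rect: &Rectangle) -> (f64, f64) {
    (rect.x2 - rect.x1, rect.y2 - rect.y1)
}
```

The width and height computed by size are the kind of numbers you'd hand to a drawing surface later on.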
get_crop_box generated code that straight up refused to compile, so I had to ignore it - and I ran into a couple other issues, but I have to say I've been using the 0.14 branch of gtk-rs/gir, and the development branch contains a lot of improvements already.
Wait, why did you use 0.14 then?
That's what the existing glib and cairo-rs crates were generated with.
So.. the versions have to match to be able to interoperate?
Precisely!
And again, just running gir generates a whole crate, a high-level, safe one this time:
    // in `poppler-rs/poppler/src/auto/page.rs`

    // This file was generated by gir (https://github.com/gtk-rs/gir)
    // from ../../gir-files
    // from ../gir-files
    // DO NOT EDIT

    use crate::Rectangle;
    use glib::object::ObjectType as ObjectType_;
    use glib::signal::connect_raw;
    use glib::signal::SignalHandlerId;
    use glib::translate::*;
    use std::boxed::Box as Box_;
    use std::fmt;
    use std::mem;
    use std::mem::transmute; // that's how you know it's gonna get good

    glib::wrapper! {
        #[doc(alias = "PopplerPage")]
        pub struct Page(Object<ffi::PopplerPage>);

        match fn {
            type_ => || ffi::poppler_page_get_type(),
        }
    }

    impl Page {
        // (still not sure why this returns a bool / when this would ever return
        // false, the docs are non-existent)
        #[doc(alias = "poppler_page_get_bounding_box")]
        pub fn get_bounding_box(&self, rect: &mut Rectangle) -> bool {
            unsafe {
                from_glib(ffi::poppler_page_get_bounding_box(self.to_glib_none().0, rect.to_glib_none_mut().0))
            }
        }

        // (skipped a bunch of methods)

        #[doc(alias = "poppler_page_render")]
        pub fn render(&self, cairo: &cairo::Context) {
            unsafe {
                ffi::poppler_page_render(self.to_glib_none().0, mut_override(cairo.to_glib_none().0));
            }
        }

        #[doc(alias = "poppler_page_render_for_printing")]
        pub fn render_for_printing(&self, cairo: &cairo::Context) {
            unsafe {
                ffi::poppler_page_render_for_printing(self.to_glib_none().0, mut_override(cairo.to_glib_none().0));
            }
        }

        // (skipped all the other methods)
    }
Just like before, the high-level poppler crate depends on high-level glib/cairo crates. And bitflags, for reasons™️
    # in `poppler-rs/poppler/Cargo.toml`
    [dependencies]
    glib = "0.14.8"
    libc = "0.2.107"
    cairo-rs = "0.14.9"
    bitflags = "1.3.2"
And now, FINALLY, we can use these bindings.
Using our fresh poppler-rs bindings
I made a tiny version of pdftocairo that exclusively renders to a cairo SVG surface, just to try things out. Here it is in its entirety:
    # in `pdftocairo/Cargo.toml`
    [package]
    name = "pdftocairo"
    version = "0.1.0"
    edition = "2021"

    [dependencies]
    # for utf-8 paths
    camino = "1.0.5"
    # for error handling
    color-eyre = "0.5.11"
    # *chants* poppler, poppler, poppler!
    poppler-rs = { path = "../poppler-rs/poppler" }
    # for rendering
    cairo-rs = { version = "0.14.9", features = ["svg"] }
    # for application-level tracing
    tracing = "0.1.29"
    tracing-error = "0.2.0"
    tracing-subscriber = { version = "0.3.1", features = ["env-filter"] }
    // in `pdftocairo/src/main.rs`

    use std::fs::File;

    use cairo::{Context, SvgSurface};
    use camino::Utf8PathBuf;
    use color_eyre::{eyre::eyre, Report};
    use poppler::Rectangle;
    use tracing::info;

    fn main() -> Result<(), Report> {
        if std::env::var("RUST_LOG").is_err() {
            std::env::set_var("RUST_LOG", "info");
        }
        color_eyre::install()?;
        install_tracing();

        let path = Utf8PathBuf::from("/tmp/export.pdf");

        info!(%path, "Reading file...");
        let data = std::fs::read(&path)?;
        info!(%path, "Reading file... done!");

        let doc = poppler::Document::from_data(&data[..], None)?;
        info!("Got the document! {:#?}", doc);
        info!("Producer = {:#?}", doc.producer());
        info!("Num pages = {:#?}", doc.n_pages());

        let page = doc.page(0).unwrap();
        info!("page = {:#?}", page);

        let mut bb: Rectangle = Default::default();
        page.get_bounding_box(&mut bb);
        info!("bb = {:#?}", *bb);

        info!("Creating file!");
        let export_path = Utf8PathBuf::from("/tmp/export.svg");
        let f = File::create(&export_path)?;

        info!("Creating surface...");
        let surface = SvgSurface::for_stream(bb.x2 - bb.x1, bb.y2 - bb.y1, f)?;

        info!("Creating context...");
        let cx = Context::new(&surface)?;

        info!("Rendering...");
        page.render(&cx);

        info!("Finishing output stream...");
        surface
            .finish_output_stream()
            .map_err(|e| eyre!("cairo error: {}", e.to_string()))?;

        info!(%export_path, "We're.. done?");

        Ok(())
    }

    fn install_tracing() {
        use tracing_error::ErrorLayer;
        use tracing_subscriber::prelude::*;
        use tracing_subscriber::{fmt, EnvFilter};

        let fmt_layer = fmt::layer();
        let filter_layer = EnvFilter::try_from_default_env()
            .or_else(|_| EnvFilter::try_new("info"))
            .unwrap();

        tracing_subscriber::registry()
            .with(filter_layer)
            .with(fmt_layer)
            .with(ErrorLayer::default())
            .init();
    }
And here's proof it works!
$ cargo build Finished dev [unoptimized + debuginfo] target(s) in 0.02s $ ./target/debug/pdftocairo 2021-11-24T18:14:36.936369Z INFO pdftocairo: Reading file... path=/tmp/export.pdf 2021-11-24T18:14:36.936467Z INFO pdftocairo: Reading file... done! path=/tmp/export.pdf 2021-11-24T18:14:36.939146Z INFO pdftocairo: Got the document! Document( ObjectRef { inner: 0x000055bfa9458400, type: PopplerDocument, }, ) 2021-11-24T18:14:36.939199Z INFO pdftocairo: Producer = Some( "Skia/PDF m74", ) 2021-11-24T18:14:36.939239Z INFO pdftocairo: Num pages = 1 2021-11-24T18:14:36.939284Z INFO pdftocairo: page = Page( ObjectRef { inner: 0x000055bfa9458440, type: PopplerPage, }, ) 2021-11-24T18:14:36.941495Z INFO pdftocairo: bb = PopplerRectangle @ 0x55bfa9467810 { x1: 0.0, y1: 0.0, x2: 744.9599599999999, y2: 481.91998, } 2021-11-24T18:14:36.941563Z INFO pdftocairo: Creating file! 2021-11-24T18:14:36.941622Z INFO pdftocairo: Creating surface... 2021-11-24T18:14:36.941667Z INFO pdftocairo: Creating context... 2021-11-24T18:14:36.941691Z INFO pdftocairo: Rendering... 2021-11-24T18:14:36.947172Z INFO pdftocairo: Finishing output stream... 2021-11-24T18:14:36.955067Z INFO pdftocairo: We're.. done? export_path=/tmp/export.svg
Oooh, Skia!
And here's the result:
$ head /tmp/export.svg <?xml version="1.0" encoding="UTF-8"?> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="744.95996pt" height="481.91998pt" viewBox="0 0 744.95996 481.91998" version="1.1"> <defs> <g> <symbol overflow="visible" id="glyph0-0"> <path style="stroke:none;" d="M 0.703125 0 L 0.703125 -8.8125 L 5.28125 -8.8125 L 5.28125 0 Z M 1.109375 -3.96875 L 2.84375 -6.140625 L 4.65625 -8.40625 L 3.53125 -8.40625 L 1.109375 -5.375 Z M 1.109375 -5.84375 L 2.28125 -7.296875 L 3.15625 -8.40625 L 2.03125 -8.40625 L 1.109375 -7.234375 Z M 1.109375 -7.703125 L 1.671875 -8.40625 L 1.109375 -8.40625 Z M 1.109375 -2.078125 L 3.890625 -5.578125 L 4.890625 -6.8125 L 4.890625 -8.21875 L 1.109375 -3.5 Z M 1.109375 -0.390625 L 1.25 -0.390625 L 4.890625 -4.9375 L 4.890625 -6.359375 L 1.109375 -1.625 Z M 1.625 -0.390625 L 2.75 -0.390625 L 4.890625 -3.0625 L 4.890625 -4.46875 Z M 3.125 -0.390625 L 4.25 -0.390625 L 4.890625 -1.203125 L 4.890625 -2.59375 Z M 4.890625 -0.390625 L 4.890625 -0.734375 L 4.625 -0.390625 Z M 4.890625 -0.390625 "/> </symbol> <symbol overflow="visible" id="glyph0-1"> <path style="stroke:none;" d="M 0.796875 0 L 0.796875 -8.8125 L 5.28125 -8.8125 L 5.28125 -7.65625 L 2.140625 -7.65625 L 2.140625 -5.15625 L 4.609375 -5.15625 L 4.609375 -4 L 2.140625 -4 L 2.140625 -1.15625 L 5.28125 -1.15625 L 5.28125 0 Z M 0.796875 0 "/> </symbol>
Uhh..
What's the matter? Can't you render SVG in your head?
Mhh if I told you you'd probably have me do it!
...fair.
We were using a tiny subset of what Inkscape can do: rendering a PDF file to an SVG surface, as paths. And it turns out, we only need the poppler and cairo libraries to do that.
Because both have a "glib" interface, we can use all the GTK-cinematic-universe tooling to generate Rust bindings for them. cairo already has an official binding, but the poppler one was out-of-date: we just regenerated it with gtk-rs/gir and we were on our way.