snim2.org

I am a Senior Lecturer in Computer Science at the University of Wolverhampton. My interests generally lie in the area of programming languages and tools, especially for the Internet of Things and other distributed systems.


Unit testing tips: diffing PDF files

How do you unit test a piece of code that generates a PDF file? There are a number of interesting answers to this question around the web, including some neat ideas such as:

  • Use something like PIL to convert the PDF to a PNG or similar, then iterate over the pixels in the resulting bitmap.
  • OCR the PDF file and check the resulting text against ground truth.
  • Use a specialist PDF-diff tool to test the generated PDF against ground truth.

This seems like overkill to me! A simple way forward is just to use the diff tool that comes as standard on UNIX platforms.

Usually diff is used with plain text files, but it can work with binary files as well. Here’s a very simple example:

$ diff report.pdf expected.pdf
Binary files report.pdf and expected.pdf differ
$

Hmm! Neat, but not terribly useful. What else can we do? A quick browse through the diff man-page shows that the -a command-line switch tells diff to treat a binary file as if it were text. This sounds like a step forward.

$ diff -a report.pdf expected.pdf
162,163c162,163
< /CreationDate (D:20140812210344+01'00')
< /ModDate (D:20140812210344+01'00')
---
> /CreationDate (D:20140812012140+01'00')
> /ModDate (D:20140812012140+01'00')
187c187
< /ID [<3428D71EEBFEECF7176993643DEA57D0> <3428D71EEBFEECF7176993643DEA57D0>]
---
> /ID [<3FD57F91F32489646331D1DBBF510CDA> <3FD57F91F32489646331D1DBBF510CDA>]
$

As you’d expect, there is some metadata inside the files that will differ between PDF files even when they have the same content: here, the creation and modification dates and the document ID. What we need to do next is to tell diff to ignore this metadata, and we can do that with the -I switch. We might also want to ignore whitespace, which we can do with -w:

$ diff -w -a -I '.*Date.*' -I '/ID.*' report.pdf expected.pdf
$

Just what we wanted! As is the way with UNIX tools, the command was successful (the files were ‘identical’) so we didn’t get any output. To put that in a unit testing context, we can write it up as a pytest unit test:

import os
import subprocess

def test_pdf():
    # Generate PDF here ...
    assert os.path.exists('expected.pdf')
    assert os.path.exists('report.pdf')
    # Diff the resulting PDF file with a ground truth.
    diff_command = ['diff', '-w', '-a', '-I', '.*Date.*', '-I', '/ID.*',
                    'report.pdf', 'expected.pdf']
    child = subprocess.Popen(diff_command,
                             stdout=subprocess.PIPE,
                             cwd=os.path.dirname(__file__))
    out, err = child.communicate()
    assert 0 == child.returncode
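
If you save this as something like test_pdf.py (a filename I’ve chosen here just so that pytest will collect it automatically), the test runs like any other:

$ py.test test_pdf.py

The test passes exactly when diff exits with status 0, that is, when the generated PDF matches the ground truth once the metadata is ignored.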

Tagged: diff, PDFs, software, TDD, testing, unit testing

New features I would love to see in writeLaTeX

Laboratory notes template on writeLaTeX.com

writeLaTeX is my new favourite thing. If you haven’t heard of it, writeLaTeX is an online service for writing collaborative LaTeX documents. Think of it like Google docs for scientists and people who like to typeset very beautiful documents.

Why does this matter? Well, it solves a whole bunch of simple problems for me. I can move between different machines at home and at work and keep the same environment, which is more than just auto-syncing my documents via Dropbox or similar: I get the same LaTeX environment whether I am working on a locked-down Windows machine at work or a completely open Ubuntu laptop running only FOSS software at home. Already that’s something that removes many of my document-writing headaches.

More than providing just a synchronisation service, I can collaborate with colleagues in real-time, so I never need to worry about using the “latest” version of any document, even if my colleagues don’t or can’t use versioning software like Git or Mercurial. Beyond that, writeLaTeX automatically compiles my projects in the background, so I can always see a nearly up-to-date version of the resulting PDF. My favourite way of editing on writeLaTeX is to have a full “editor” window in my main monitor (straight ahead of me) and a full “pdf render” window on another monitor (off to the side). It’s super-convenient and allows me to concentrate without feeling interrupted by compiler warnings or errors when I’m halfway through a complicated edit. I could go on and on, but you get the point – writeLaTeX is a very, very neat way to typeset beautiful documents.

Of course, writeLaTeX is not the only start-up in this space. Authorea and ShareLaTeX make similar offerings, and both have different and interesting strengths. It happens that when I needed a service like this, writeLaTeX was the app that had the built-in style and class files I needed and the right combination of features for me. In fact, the pre-installed TeX packages are exactly what you get from installing all of TeXLive on Ubuntu, so writeLaTeX essentially mirrors my own Linux set-up (minus my dodgy Makefiles). That said, I’m very excited that a few competitors are working on these problems. That tells me that online, collaborative LaTeX services have a serious long-term future, and that should benefit users of all these different services.

Once you start using a new shiny toy, there’s always the sense that this is *so* awesome, I wish it could do X… So this is my current wish list for writeLaTeX. This is no criticism of the awesome service, but if you happen to have a few million dollars lying around, please pay the company to implement the below…

Auto-sync with GitHub, BitBucket and similar

It’s really convenient to have all my writeLaTeX projects together on a writeLaTeX project page, but it also breaks the structure of my projects and documents and imposes a second, different structure.

This is what I call the expression problem of scientific projects (Computer Scientists will get the joke) – you can either organise your documents and code around each project you take part in (Option 1), or you can organise your documents around their type (Option 2). Either choice is good (it’s just a matter of personal taste) but it makes a big difference to your personal workflow and how quickly you can find information and track the progress of your projects. Like many things, consistency is the key principle here.

Option 1 looks like this:

science_project1/
....papers/
........paper1/
............main.tex
............figures/
................chart1.png
................petri_dish.png
............refs.bib
........paper2/ ...
....talks/
........talk1/
............main.tex
............figures/
................chart1.png
................petri_dish.png
............refs.bib
........talk2/ ...
....software/
........some_code.py ...
...
science_project2/ ...

Option 2 looks like this:

papers/
....paper_about_project1/
........main.tex
........figures/
............chart1.png
............petri_dish.png
........refs.bib
....paper_about_project_2/ ...
talks/
....talk_about_project1/
........main.tex
........figures/
............chart1.png
............petri_dish.png
........refs.bib
....talk_about_project2/ ...
software_about_project1/
....some_code.py ...
...

But what happens when you start to use services like writeLaTeX? Your whole workflow gets a lot more complex. You might have all of your projects synced to a service like GitHub, or not, but now your papers and talks are on writeLaTeX and can’t be “checked out”, your software might well be on GitHub or similar, and you might well be sending your figures and data off to FigShare. It’s suddenly more difficult to keep everything together, and it isn’t immediately clear how much progress you have made with each part of the project.

In my view the answer to this problem has to come in two parts. Firstly, a way to expose writeLaTeX projects as git repositories so that they can be incorporated as git submodules inside an existing GitHub project (other SCMs and hosting companies are available). This means that it doesn’t matter whether you choose Option 1 or Option 2 above to structure your project files. writeLaTeX could then issue pull requests on GitHub to “send” your updates back whenever you change your documents. Secondly, existing CI services such as Travis can be configured to send documents off to FigShare once a tagged release of a paper has been created. This costs a little time to set up, but it is an automated workflow that can be reused across different projects, so that small set-up cost is nicely amortized.

Linting

lint is a tool that checks code for errors before it is compiled. There are a number of linters for LaTeX (the one I currently use is chkTeX), and it would be useful to have them run automatically during the background build-compile-render cycle that writeLaTeX already performs.

If you are not on writeLaTeX, one option is to use a continuous integration tool to run the lint for you, together with your normal build cycle; a Travis recipe can run chkTeX over a project on every push.
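
As a sketch (reconstructed here from the build log below, rather than copied from the actual recipe), the relevant part of such a .travis.yml might look something like this:

install:
  - sudo apt-get install -y chktex
script:
  - chktex -W
  - chktex -q -n 6 *.tex chapters.*.tex 2>/dev/null | tee lint.out
  - test ! -s lint.out

And this is the result of the current Travis build of my UoW PhD template: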

$ chktex -W
ChkTeX v1.6.4 - Copyright 1995-96 Jens T. Berger Thielemann.
The command "chktex -W" exited with 0.
$ chktex -q -n 6 *.tex chapters.*.tex 2>/dev/null | tee lint.out
The command "tee lint.out" exited with 0.
$ test ! -s lint.out
The command "test ! -s lint.out" exited with 0.

A way to copy and share files between different projects

There are a few jobs that need to be done for any paper but are time-consuming busy work that ideally would be minimised. One of these is producing and curating long lists of references to prior art, usually in BibTeX. Another is pulling in tables and figures (usually to do with prior art) that can be used in different papers. An obvious example is a BibTeX file containing the author’s own papers. You might have a file called something like mypapers.bib which you certainly need in your own CV, but which you also need in pretty much all your papers and several talks. What happens when you update this file for your CV project? It isn’t shared between different projects, so you also need to update it in all your other projects. That might not be so bad when you have just added a newly published paper to your list, but if you find a typo in one of your old papers it’s a real pain. The same is true for curated lists of papers in the area you work in and all sorts of other files.

It would be nice to find some clever way to resolve this that still works when you have all your files nicely structured and curated using either Option 1 or Option 2 above. Maybe a neat thing to do would be to have some “dummy” projects which only contain common files, such as BibTeX files (and don’t get compiled with pdflatex or similar), then use something like Git submodules to “import” the dummy projects into “real” ones that do compile documents, along the lines of the sketch below.
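
As a rough sketch of how that might look from the command line (the repository URL and directory names here are invented for illustration):

$ cd papers/paper_about_project1
$ git submodule add https://github.com/you/shared-bibtex.git shared
$ git commit -m "Import shared BibTeX files as a submodule"

The paper can then refer to shared/mypapers.bib, and running git submodule update --remote later pulls in any fixes made from other projects.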

More help with BibTeX

If there’s one huge and pointless sink of valuable time it’s curating long lists of BibTeX references. In recent years a number of services have started to make this easier — Bibsonomy and Google Scholar being two very handy services — but there is still much that has to be done manually. A neat way to search for a citation and pull it into a BibTeX file from within writeLaTeX would be really, really cool.

Some crazy form of document review

Open document review has started to become common, at least for books. A great example of this is Real World OCaml where you can log in with a GitHub account and comment on any paragraph of the book. Comments then become issue tickets in a GitHub repository and the authors can resolve each comment (I notice Real World OCaml has logged an impressive 2457 closed tickets). This is a really neat solution to document review and would be a huge bonus for anyone writing in LaTeX.


Tagged: authorea, collaborative, document editing, latex, online service, saas, sharelatex, typesetting, writelatex

Europython 2014 talk on message passing concurrency and Python


Tagged: concurrency, csp, dynamic languages, manycore, multicore, parallel processing, python

Research diaries and lab notes

The idea of keeping a diary fills me with dread. It conjures up distant memories of receiving leather-bound paper diaries from well-meaning relatives at Christmas and the crushing obligation to write something, anything, every single day, when actually nothing very interesting was going on. The obligation to do something every day is a sure-fire killer of motivation for me. So, as you can imagine, I have never been keen on keeping a regular diary of research notes and results. Not that I haven’t tried. I have a paper notebook that I use to keep track of discussions and obligations from meetings, and at various times I’ve tried to use that as a discipline for writing down ideas and notes from my research work. Somehow though, it never stuck.

That is, it never stuck until I read this blog post by Mikhail Klassen on the writeLaTeX blog. Mikhail points out that having a digital diary has some compelling advantages. It allows you to keep track of intermediate results and ideas, links to software repositories and BibTeX citations. This means that next time you need to quickly put together a presentation or poster, or you are starting to write a paper, you can pull figures, citations and text directly from your diary. This is especially useful if a lot of your writing has equations and citations that are time-consuming to keep track of. So, keeping a diary means that a lot of the time-consuming tasks involved at the start of writing a paper or presentation just disappear – those costs are amortized with the costs of keeping the diary. This has enormous appeal to me. The time I get for research is not large, and anything I can do to make my work more efficient makes the process a lot less stressful.

So, having looked carefully at Mikhail’s template I was really impressed, but I wanted to tweak a few things. In particular I changed the layout of the whole diary and based my version on the excellent tufte-latex class, which is inspired by the work of Edward Tufte. I also added a couple of new sections at the top of the diary – Projects and Collaborations and Someday / Maybe. Projects and Collaborations is there to help keep track of ongoing commitments, and as a reminder that those projects need to be regularly progressed or abandoned. Someday / Maybe is there to keep track of vague ideas that sound good but that you aren’t yet committed to acting on. I find it useful to have a list of these, as they can easily get forgotten, and many good ideas which aren’t quite ready for action can be used as student projects or re-purposed. Other ideas can sit around for a long time, but suddenly become useful when a new collaboration comes about, or you find some scientific result or new technology which makes a previously very difficult idea tractable.
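
A minimal skeleton along these lines (my own illustration, not the actual template) might be:

\documentclass{tufte-book}
\begin{document}
\chapter{Projects and Collaborations}
\chapter{Someday / Maybe}
\chapter{Notes}
\section{2014-08-12}
Intermediate results, ideas, links and citations go here...
\end{document}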

Lastly, like Mikhail’s, my template and my own notes are on writeLaTeX, which is a cloud platform for writing LaTeX documents. writeLaTeX (and its cousins ShareLaTeX and Authorea) has some great features, like collaborative real-time document editing, auto-compilation so that you can see a current version of the PDF of your document as you type, a wealth of templates and a friendly near-WYSIWYG editor. writeLaTeX also has a limited sync-with-Dropbox feature for offline work. All of this makes diary entries really simple to write. I just have a writeLaTeX window open in my browser all day and I can write updates and upload new documents as I go along.

Oh, and because I have a pathological aversion to keeping a diary, I call mine “Lab Notes”. Much friendlier!

writeLaTeX.com

Example Lab Notes


Tagged: academic writing, advice, best practice, papers, productivity, research, undergraduate projects, writing

Automate, automate, automate

I’ve recently been working on a new Python project, which started off as a bit of an experiment at the recent PyPy London Sprint. Working on a brand new repository is always nice, a blank slate and a chance to write some really elegant code, without all the crud of a legacy project.

In this case, the infrastructure for the project is pretty involved. I was using the pytest unit testing framework and the rpython toolkit from pypy, both for the first time.

That led to an interesting situation. When I run the unit tests, I want to use the CPython interpreter. This means I can use all the standard library modules that I know well, and can test the basic algorithms I’m writing. When I want to “translate” my code into a binary executable, I use pypy and some of its rlib replacements for the Python standard library modules. When I get a runtime error in the translation, I need to know whether that is related to my use of the rlib libraries or whether my code is just plain wrong, and using CPython helps me to do that.

The problem is that I have to keep switching between different standard libraries and interpreters. Somewhere in my code there is a switch for this:   

DEBUG = True

In testing that switch should be True and in production it should be False, but changing that line manually is a real pain, so I need some scripts to catch when I’ve set the DEBUG flag to the wrong mode.

Test automation #1

Here’s my (slightly simplified) first go at automating a test script:

import subprocess

debug_file = ...
framework = 'pytest.py'
try:
    retcode = subprocess.check_output(['grep', 'DEBUG = False', debug_file])
    print 'Please turn ON the DEBUG switch in', debug_file, 'before testing.'
except subprocess.CalledProcessError:
    subprocess.call(('python', framework))

What does this do? First the script calls the UNIX utility grep to find out whether the DEBUG flag is set to the wrong value for testing:

retcode = subprocess.check_output(['grep', 'DEBUG = False', debug_file])

If it is, the script prints a warning message:

print 'Please turn ON the DEBUG switch in', debug_file, 'before testing.'

which tells me I have to edit the code, and if not, the script runs the tests:

subprocess.call(('python', framework))

Nice, but I still have to edit the file if the flag is wrong.

Test automation #2

Nicer would be for the script to change the flag for me. Fortunately, this is easily done with the Python fileinput module. Here’s the second version of the full test script (slightly simplified):

import fileinput
import subprocess
import sys

debug_file = ...
debug_on = 'DEBUG = True'
debug_off = 'DEBUG = False'

def replace_all(filename, search_exp, replace_exp):
    """Replace all occurrences of search_exp with replace_exp in filename.

    Code by Jason on:
    http://stackoverflow.com/questions/39086/search-and-replace-a-line-in-a-file-in-python
    """
    for line in fileinput.input(filename, inplace=1, backup='.bak'):
        if search_exp in line:
            line = line.replace(search_exp, replace_exp)
        sys.stdout.write(line)

def main():
    """Check and correct debug switch. Run testing framework.
    """
    framework = 'pytest.py'
    opts = ''

    try:
        retcode = subprocess.check_output(['grep', debug_off, debug_file])
        print 'Turning ON the DEBUG switch in', debug_file, 'before testing...'
        replace_all(debug_file, debug_off, debug_on)
    except subprocess.CalledProcessError:
        pass
    finally:
        subprocess.call(('python', framework, opts))
    return

if __name__ == '__main__':
    main()

Test automation #3

So, now the flag is tested, set correctly if needs be, and the tests are run. But I still have to run the test script! What a waste of typing. So, the next step is simply to call this script from a git pre-commit hook, for example:
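
A minimal sketch of such a hook, assuming the test script above is saved as run_tests.py in the repository root (that filename, and the choice of a shell wrapper, are mine rather than from the original post):

#!/bin/sh
# .git/hooks/pre-commit: run the test script; a non-zero exit status aborts the commit.
python run_tests.py || exit 1

Two things to note: the hook must be executable (chmod +x .git/hooks/pre-commit), and for the hook to actually block a bad commit the test script should exit with a non-zero status when the tests fail, for example by passing the return value of subprocess.call to sys.exit.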

Code for this post

The full history for this script can be found in the project’s repository.


Tagged: git, programming, pytest, python, software, unit testing, workflow

West Midlands Employment Data

At the Government Open Data Hack Day event organised by James Cattell and Gavin Broughton, Andy Pryke, Christophe Ladroue and I had a go at analysing employment statistics for the West Midlands. In particular we were looking for correlations between employment data and other factors, such as census data about age and gender. As with all data mining work, the most difficult and time-consuming job was cleaning the available data before it could be used in an analysis. Christophe wrote a very clear account of the work he did using R to deal with nomis data. You can see a summary of our results in the video below.

… and if you want to download the data yourself, it is publicly available here:

https://docs.google.com/spreadsheet/ccc?key=0AtT1QPEACWUldE9NTVduRGV


#efdhack2012 26th May 2012

This one’s a little different. Python West Midlands is hosting a hackday to kick off a new open source project for a very interesting little charity called Evidence for Development (EfD). EfD wants to help people make better decisions about aid projects – at local and national level – by putting real data about the real situation in the hands of the people making the decisions.

If you want to know whether your aid programme is making a difference to the right people, then you need to model the economy of your target village or district, before and after. Makes sense; simple science, right? The problem is you can’t afford a bunch of western econometricians crawling all over the place (costs too much, takes too long) and anyway their cash-based economic models don’t work that well in a place where cash is only a small part of the economy (grow your own; harvest wild food; get paid in kind or cash or both for day labour; trade crops, labour or other goods; etc., etc.). So EfD developed simple economic models that work in this environment, that can be learned and applied by locally trained people and that are built to run on laptops. No reliance on big foundations’ data centres.

Last year EfD, in partnership with Chancellor College of the University of Malawi and the University of Wolverhampton, developed a Python/MySQL app to model local economies that is already in use in several countries in Southern Africa.

This year the challenge is bigger – to build software that can model national and international economies. The model exists and works (it has a great track record of predicting famine effects from annual summary surveys of rural economies). But the only current implementations are proprietary, ill-supported and not extensible. Smells like open source spirit.

So for this hackday we’re going to have with us the two developers who led the IHM development last year (from Chancellor College in Zomba, Malawi) and the developers of the modelling methodologies from EfD (from Barnes and Surrey – exotic, eh?). We’ll have a pretty complete MySQL database schema to work on, and we hope to finish the day with a simple demo scenario that downloads reference data about a geographical area (a livelihood zone), produces a spreadsheet template to capture information about that livelihood zone (what they grow there, what they eat, how they make a living), runs some local completeness reports, and uploads the captured data for merging (with other livelihood zone surveys) to allow analysis of a national survey.

I’m not a software developer, can I still contribute?

Yes! Absolutely. There are a number of jobs that can be done without writing any code. We would really appreciate the support of contributors who can build a web presence for these projects, write user and developer documentation, help spread the word, and take on any number of other jobs! If you’re keen to help out, there will definitely be a place for you.

When:

10:30 onwards, 26th May 2012. Please sign up here.

Where:

Thyme Software, Coventry University Technology Park, Puma Way, Coventry, CV1 2TT [map]


Tagged: charity, development, event, evidence_for_development, hackathon, hackday, malawi, mysql, python, pythonwm, software, uk

The great Christmas email experiment of 2011-12

This year I took pretty much all the holiday time I could over Christmas, probably for the first time ever. As an experiment, I let all the emails I received over this period accumulate in my Inbox, with the exception of things like posts to mailing lists, which get automatically filtered and labelled and skip the Inbox. Generally, I try to follow an Inbox Zero policy, which means my Inbox is usually empty and every email I get is either dealt with as soon as I read it or saved in a “Next Action” list to be dealt with later. That policy makes it much easier to carve out large blocks of time for more difficult tasks, like writing lectures, marking or programming, which all require uninterrupted concentration. I think this works pretty decently, and at least I haven’t had to declare email bankruptcy.

So, the point of this experiment was really to see whether my Inbox Zero policy is working as well as I thought and, in particular, whether the bulk of the email I deal with is sensible content that really requires attention.

Of course, the “experiment” as such is a little silly; after all, this is email from a vacation period and out of term time, so the results are heavily skewed. Usually I get a lot more email per day and a lot more relevant, sensible email that needs attention, and the aim is always to maximise the time spent on those emails and minimise the time spent on unnecessary ones.

Starting point

Anyway, enough caveats. My starting point was this:

Inbox: 316

Action list: 50

Before going on vacation I cleared out both the Inbox and the Action List of everything that could be dealt with then. So, the starting point here is all the email accumulated over a short vacation and all the items on my to-do list that couldn’t be finished before the holiday started.

The data

Yesterday I spent a happy (!!) afternoon going through each email and either responding to it, deleting it, reporting it as SPAM or filing it. In a Google Docs spreadsheet I wrote down the sender (anonymously, unless the sender was a company), sender type and action for each email or group of emails from the same sender. I say “email”; actually I mean “email thread”, so one email on my spreadsheet here could well mean a thread of many emails from various senders. However, what I’m interested in here is really the aggregate data from the 300 emails, which you can see in this table:

Aggregated data from 300 emails

So, there are two things I’m interested in here:
  1. Where is the email from? Is it from people I need to communicate with, or from companies and others sending “news” and other updates that can be ignored or processed in a more convenient way, such as via an RSS reader? Obviously emails from colleagues (including external collaborators) and students are all important. Other senders vary considerably depending on the content of the email.
  2. How were the emails processed? Emails that were deleted or marked as SPAM are emails I don’t want to receive repeatedly, so are best unsubscribed from. Emails that needed real attention can be filtered to be marked as important if they aren’t already.

Where do emails come from?

330 emails broken down by sender type

So, thinking of this email as signal and noise, the signal here is email from students, colleagues, friends and open source projects. Of course, SOME of the other emails will be important and will need some action too, but this is a rough guide. The total number of “signal” emails, according to the sender, worked out as 78 out of 316, or around 25%.

Now, 25% to my mind is astonishingly low. Given that most of the email that hits my account gets filtered out and never sees the Inbox in the first place, 25% is really not what I expected to see here. 

What happened to all those emails?

300 email conversations broken down by next action

The other way to look at signal vs noise is how the emails were processed. The signal in this case is the emails that were actioned immediately or saved for working on next week, which was 73 out of 316, or just over 23%. That’s very close to the previous SNR, because the sender of a message is a good predictor of its importance.

Again though, 23% is astonishingly low. The main culprit is web apps and social media apps that send frequent notifications, updates and other fluff. Often when you sign up to these things they subscribe you to all sorts of email alerts automatically, and then it takes effort on your part to change your settings and unsubscribe. A better way to deal with this, if you use GMail, is to use the Gmail plus trick: sign up to each service as yourname+theservice@gmail.com (Gmail ignores everything after the + when delivering), which allows you to filter out all these emails automatically by matching on the To: address.

A point about unsubscribing from mailing lists 

When you unsubscribe from an email alert you are informing the sender that you no longer wish to be contacted. The very LAST thing you then need is another email saying “Well done! You have unsubscribed”, which you then have to deal with separately. Seriously, this is a terrible way to treat potential customers. Very few of the email alerts I unsubscribed from did this, but those that did really annoyed me. TripIt, Klout, SAA, Costa, the Electoral Reform Society and UCU: consider yourselves mildly whinged at. Hurumph.

End point

Just for the record…

Inbox: 0

Action list: 89

Actioned immediately: 34

The take home…

This stuff is boring common sense. It’s motherhood and apple pie. You know it all already. So you’re doing this already, right?

  • Email is a huge sink of time.
  • Process email in batch mode, once or twice a day. Don’t let incoming emails dictate your work schedule.
  • Unsubscribe from everything you can at the first chance you get. Better still, don’t sign up in the first place.
  • If you use GMail, use the Gmail plus trick.
  • If you sign up to a lot of web apps and different services with logins and passwords, keep confirmation emails in a specific folder or label (I use web-signups) so you can keep track of which services you already have an account for.
  • Filter and label emails automatically whenever you can. Don’t let anything into your Inbox that doesn’t need to be there (looking at you, posts to mailing lists).
  • Learn the keyboard shortcuts on your favourite email client. Use them. Banish the mouse.
  • Deal with emails that can be dealt with immediately, immediately.
  • Keep a “next action” folder of emails that cannot be dealt with immediately. Don’t have them hanging around your Inbox making you feel guilty, nervous and demoralised.
  • Keep a sensible hierarchy of folders or labels to organise your email. Or use something like ActiveInbox.

Tagged: email, productivity