Loading a UTF-8 JSON file in Python

2013-02-17 at 19:35:43 | categories: python, tips

JSON, JSON...

Another lesson learned. I had a bunch of JSON files which define some language strings. Some of these files use UTF-8, others are plain 7-bit ASCII. All I wanted to do was load a file, sort all the keys, pretty-print the whole thing and save it to another file. Pretty easy, right?

The task can theoretically be accomplished with cat file.json | python -mjson.tool > formatted_file.json. That works fine for 7-bit ASCII, but not well enough for UTF-8 encoded files, because the output characters get escaped: instead of a single character I got something like \u016. I did not know how to set ensure_ascii to False in that command invocation, so I wrote a simple script.
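
The escaping itself is easy to reproduce with json.dumps alone. A quick sketch (the u"\u017c" string is just an arbitrary non-ASCII character picked for illustration):

import json

# u"\u017c" is the Polish letter "ż" -- any non-ASCII character will do
print(json.dumps(u"\u017c"))                      # ensure_ascii defaults to True: prints the escaped "\u017c"
print(json.dumps(u"\u017c", ensure_ascii=False))  # prints the real character (assuming a UTF-8 capable terminal)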

After some trial and error I ended up with a script whose critical part is below. I left the comments in:

import json
import codecs

# just open the file (in binary mode, we decode it ourselves below)...
input_file  = open("input_file.json", "rb")
# need to use codecs for output: with ensure_ascii=False, json.dump emits
# unicode, and writing that to a plain file object would fail with a
# UnicodeEncodeError on the first non-ASCII character
output_file = codecs.open("output_file.json", "w", encoding="utf-8")

# read the file and decode a possible UTF-8 signature (BOM) at the
# beginning, which some of the files have
j = json.loads(input_file.read().decode("utf-8-sig"))

# then write it out, indented, with sorted keys, and with the characters
# represented as they originally were (no \uXXXX escapes)
json.dump(j, output_file, indent=4, sort_keys=True, ensure_ascii=False)

# close the files explicitly so everything gets flushed to disk
input_file.close()
output_file.close()

It worked very well: I could feed it all the JSON files I had and process them one by one. I hope this piece of code is useful for anyone running into the same problem.
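
If you have a whole directory of such files, the snippet above can be wrapped in a small helper and run in a loop. A rough sketch (the *.json pattern and the formatted_ prefix are just examples):

import codecs
import glob
import json

def reformat(in_path, out_path):
    # read, decode a possible BOM, then re-dump with sorted keys
    # and real UTF-8 characters instead of \uXXXX escapes
    with open(in_path, "rb") as f:
        data = json.loads(f.read().decode("utf-8-sig"))
    with codecs.open(out_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, sort_keys=True, ensure_ascii=False)

# example: process every *.json in the current directory,
# writing the result next to it with a "formatted_" prefix
for path in glob.glob("*.json"):
    reformat(path, "formatted_" + path)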