I have a .json with the following structure:
[
    {
        "Country": "Spain",
        "Age": "14"
    },
    {
        "Country": "China",
        "Age": "16"
    },
]
I try to read it with the following method:
import json
from pprint import pprint

with open('json.json') as f:
    data = json.load(f)

pprint(data)
but it throws me the following error:
ValueError: No JSON object could be decoded
The JSON is returned to me by the Octoparse software, so I don't think it's malformed.
How can I store the values in a JSON object local to my script?
I would like it to have the following format:
{"14":"Spain","16":"China"}
Thanks.
Diagnosis
Although the JSON pasted into the question is correct (except for a copying error that left an extra trailing comma), when the user runs the same code on their actual file they get a ValueError, which is not very informative. After some conversation with the user, I obtained the JSON file they are really working with and tried to replicate the run with Python 2 (the version the user is on). Sure enough, although the supplied JSON looks correct, I get the same error shown in the question: ValueError: No JSON object could be decoded
If, instead, I repeat the run using Python 3, the diagnosis is much more precise and confirms my suspicion that hidden characters at the beginning of the file are causing the problem: the error message reports an unexpected UTF-8 BOM and suggests decoding with utf-8-sig.
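The behaviour can be reproduced without the original file. The following is only a sketch under Python 3: the payload is a made-up two-field record (not the user's real data) with the three UTF-8 BOM bytes placed in front of it.

```python
import json

# Hypothetical payload: the UTF-8 BOM bytes followed by otherwise valid JSON.
raw = b'\xef\xbb\xbf[{"Country": "Spain", "Age": "14"}]'

# Decoding as plain UTF-8 keeps the BOM as the invisible character U+FEFF.
text = raw.decode('utf-8')

try:
    json.loads(text)
    message = None
except ValueError as exc:  # in Python 3, JSONDecodeError is a subclass of ValueError
    message = str(exc)

print(message)
```

Because the exception is caught as ValueError, the same snippet also runs on Python 2, where the message is the less informative one from the question.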
The problem
The file begins with a series of bytes called a "BOM" (Byte Order Mark), which are invisible when the file is displayed on screen or loaded in an editor, but not when it is read from a program.
The purpose of those bytes, if the file were in UTF-16, is to let programs that read it deduce the endianness of the architecture on which the file was generated (that is, whether it is little endian or big endian). In a UTF-8 file, however, these bytes serve no such purpose, because the UTF-8 format is immune to the endianness problem.
Even so, many editors and Windows programs still insert these bytes when saving as UTF-8, and this is apparently not compatible with the JSON standard.
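To check whether a given file has this problem, one can look at its first bytes. A small sketch follows; the helper name starts_with_utf8_bom is made up for illustration, while codecs.BOM_UTF8 is the standard library constant holding the three BOM bytes.

```python
import codecs


def starts_with_utf8_bom(raw_bytes):
    """Return True if the byte string begins with the UTF-8 BOM (EF BB BF)."""
    return raw_bytes.startswith(codecs.BOM_UTF8)


# Hypothetical sample data standing in for the first bytes of two files.
with_bom = codecs.BOM_UTF8 + b'[{"Country": "Spain"}]'
without_bom = b'[{"Country": "Spain"}]'

print(starts_with_utf8_bom(with_bom))
print(starts_with_utf8_bom(without_bom))
```

For a real file, the bytes would come from opening it in binary mode ('rb') and reading the first few bytes.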
Solution
With Python 3, open() accepts an encoding parameter that specifies the encoding of the file to read (if it is not passed, the platform's default encoding is used, which is often but not always UTF-8). In this case utf-8-sig would have to be passed, as Python 3 itself tells us in its error message.
However, the user is on Python 2, whose built-in open() does not accept that parameter, so the approach here is to read the entire file into a byte string and then decode that string to Unicode using the utf-8-sig codec. After that we use json.loads() instead of json.load(), since that way we can pass the correctly decoded Unicode string instead of the file.
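A minimal sketch of both variants described above. The function names are made up for illustration, and the file name json.json is simply the one used in the question. The first function should work on both Python 2 and Python 3; the second relies on the encoding parameter and therefore on Python 3.

```python
import json


def load_json_bytes_first(path):
    """Read the whole file as bytes, strip the BOM while decoding, then parse."""
    with open(path, 'rb') as f:
        raw = f.read()
    # 'utf-8-sig' drops a leading BOM if present and otherwise acts like UTF-8.
    text = raw.decode('utf-8-sig')
    return json.loads(text)


def load_json_py3(path):
    """Python 3 only: let open() handle the BOM through its encoding parameter."""
    with open(path, encoding='utf-8-sig') as f:
        return json.load(f)
```

Usage would be along the lines of data = load_json_bytes_first('json.json'). Since utf-8-sig also accepts files without a BOM, the same call can be used whether or not the exporting program added one.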
This solution holds the whole file in memory before parsing, both as a byte string and as the decoded text. In practice json.load() in Python 3 also reads the entire file before parsing it, so the difference is small, and since the file is not very large (61K) it is not a problem.
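Once the data is loaded, the format asked for in the question can be built with a dictionary comprehension. This is only a sketch: the records list below is the sample from the question, standing in for the result of json.loads(), and the variable names are illustrative.

```python
import json

# Sample records from the question, standing in for the parsed file contents.
records = [
    {"Country": "Spain", "Age": "14"},
    {"Country": "China", "Age": "16"},
]

# Map each age to its country. If two records share an age, the later one wins.
by_age = {record["Age"]: record["Country"] for record in records}

print(json.dumps(by_age, sort_keys=True))
```

Note that using the age as the key only works if ages are unique in the data; otherwise earlier entries are silently overwritten.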