Decode a UTF-8 string that may be cut off

Let's say we got some bytestring through a socket, but it was cut off in the middle of a UTF-8 character.

We can simulate this:

bs = "приклад".encode('utf-8')[:-1]  # last byte was lost
print(bs)
#< b'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xba\xd0\xbb\xd0\xb0\xd0'
bs.decode('utf-8')
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 12: unexpected end of data

So, is this string completely undecodable? Can't we just get "прикла"?
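Don't worry, we can. A codecs stream reader decodes every complete character and holds back the incomplete trailing bytes instead of raising an error. This works with Python 3.x and 2.7!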

import io
import codecs

bs = b'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xba\xd0\xbb\xd0\xb0\xd0'
stream = io.BytesIO(bs)
stream_reader = codecs.getreader('utf-8')(stream)
print(stream_reader.read())
#< прикла
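
The stream reader isn't the only way to get this behaviour. As a rough sketch of an alternative, the same codecs module offers incremental decoders, which can be fed bytes directly without wrapping them in a BytesIO. The helper name decode_truncated below is made up for illustration, and the second half assumes the lost byte shows up in a later chunk:

```python
import codecs


def decode_truncated(data, encoding='utf-8'):
    """Decode `data`, silently dropping an incomplete character at the end.

    `decode_truncated` is a hypothetical helper name, not a stdlib function.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    # final=False means "more bytes may follow", so an incomplete trailing
    # sequence is buffered inside the decoder instead of raising an error.
    return decoder.decode(data, final=False)


bs = b'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xba\xd0\xbb\xd0\xb0\xd0'
print(decode_truncated(bs))
#< прикла

# If the missing byte arrives later, feed both chunks to the same decoder
# and the split character is reassembled.
decoder = codecs.getincrementaldecoder('utf-8')()
chunks = [bs, b'\xb4']  # second chunk is the lost byte of "д"
text = ''.join(decoder.decode(chunk) for chunk in chunks)
print(text)
#< приклад
```

Note that only an incomplete sequence at the very end is tolerated. Invalid bytes in the middle of the data still raise UnicodeDecodeError, which is usually what you want.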
