Write serialize and deserialize functions for an array of strings

Problem

Write two functions to serialize and deserialize an array of strings. The strings can contain any Unicode character.
Do not worry about string overflow.

input = ['abdcd', '4agasd-dsfafdas', 'hi there I love you']

output = serialize(input)

deserialize(output) = ['abdcd', '4agasd-dsfafdas', 'hi there I love you']



Basically, you need to decide how to encode the serialized message so that you can deserialize it later.

For simplicity, I decided to encode the messages as below:

   [meta_data_length]>[meta_data with ',' delimiter][concatenated strings]

Here meta_data is the comma-separated list of string lengths, and the '>' after meta_data_length is a special delimiter (the first version of the code below uses '-' in the same position). We could avoid the delimiter entirely by giving the first length field a fixed size, such as 64 bits. For simplicity, the first version assumes the delimiter will not appear in the data.

To allow the data to contain '>', the updated version uses html.escape() and html.unescape() to encode and decode '>' in the data. (UPDATE: 2022-06-13) The original code had a bug: it could not deserialize the data correctly when a string contained the delimiter character.
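The fixed-size length field mentioned above could look roughly like the sketch below. It is only an illustration under assumptions that are not part of the original solution: the output is bytes rather than a string, the lengths count UTF-8 bytes, and the names serialize_fixed, deserialize_fixed and LEN_FIELD are made up for this example.

```python
import struct

# Assumed layout: [8-byte meta_data length][meta_data][concatenated UTF-8 data]
LEN_FIELD = struct.Struct(">Q")  # one unsigned 64-bit big-endian integer


def serialize_fixed(strings):
    encoded = [s.encode("utf-8") for s in strings]
    # Lengths are counted in bytes because the payload is bytes.
    meta_data = ",".join(str(len(e)) for e in encoded).encode("ascii")
    return LEN_FIELD.pack(len(meta_data)) + meta_data + b"".join(encoded)


def deserialize_fixed(blob):
    # The first 8 bytes always hold the meta_data length, so no delimiter is needed.
    (meta_len,) = LEN_FIELD.unpack_from(blob, 0)
    meta_end = LEN_FIELD.size + meta_len
    meta_data = blob[LEN_FIELD.size:meta_end].decode("ascii")
    lengths = [int(n) for n in meta_data.split(",")] if meta_data else []
    result = []
    start = meta_end
    for length in lengths:
        result.append(blob[start:start + length].decode("utf-8"))
        start += length
    return result
```

Because the reader always knows where the length field ends, the data can contain '-', '>' or ',' without any escaping.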

Once the serialization format is defined, we can write the two functions according to it.

Here is the first version of the Python code. It works as long as no string contains the '-' delimiter.

def serialize(input):
    # Record the length of each string so the concatenated data can be split later.
    len_list = []
    for i in input:
        len_list.append(str(len(i)))
    data = "".join(input)
    meta_data = ",".join(len_list)
    meta_len = len(meta_data)
    return "{}-{}{}".format(meta_len, meta_data, data)


def deserialize(input):
    # Limitation: split("-") yields extra tokens if any string contains '-',
    # so this version only handles data without the delimiter.
    tokens = input.split("-")
    meta_len = int(tokens[0])
    meta_and_data = tokens[1]
    meta_data = meta_and_data[:meta_len]
    data = meta_and_data[meta_len:]
    len_list = [int(i) for i in meta_data.split(',')]
    # Extract each string by walking through the data with a running offset.
    result = []
    s = 0
    for l in len_list:
        result.append(data[s:s+l])
        s += l
    return result


input = ['abdcd', '4agasddsfafdas', 'hi there I love you']
serialized = serialize(input)
print("input = {} after serialization = {}".format(input, serialized))
print("after deserialization = {}".format(deserialize(serialized)))

Practice statistics:

15:00: to write up the code

8:00: to fix the logical error. Had to debug the code by executing it. The end index for slicing the data was calculated incorrectly: it should be s+l instead of l itself.
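To make that slicing mistake concrete, here is a small illustration. The buggy loop is a reconstruction based on the description above (using l as the end index), not the actual original code, and the sample data and lengths are made up.

```python
data = "abdcd4agasdhi"
lengths = [5, 6, 2]

# Buggy variant: the length itself is used as the end index.
wrong = []
s = 0
for l in lengths:
    wrong.append(data[s:l])
    s += l

# Correct variant: the end index is the start offset plus the length.
right = []
s = 0
for l in lengths:
    right.append(data[s:s + l])
    s += l

print(wrong)  # ['abdcd', '4', '']
print(right)  # ['abdcd', '4agasd', 'hi']
```

The first string comes out right in both cases because its start offset is 0, which is why the bug only shows up from the second string onward.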

UPDATE(2022-06-13): Solved the problem again. Had to spend time figuring out how to avoid the deserialization failure when a data string contains the delimiter that separates meta_data_length from the rest of the message.
After trying several things, I decided to escape the delimiter with html.escape().
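A quick check of what html.escape() does to the delimiter may help here. The sample string is taken from the test input; the variable names are only for illustration.

```python
import html

original = "4agasd-dsf>afdas"
escaped = html.escape(original)

# '>' is replaced by '&gt;', so the escaped text no longer contains the delimiter.
print(escaped)  # 4agasd-dsf&gt;afdas

# Escaping changes the length, so the stored lengths must be those of the escaped text.
print(len(original), len(escaped))  # 16 19

# unescape() restores the original string.
print(html.unescape(escaped) == original)  # True
```

Note that html.escape() also rewrites '&', '<' and quote characters, which is harmless for this purpose because html.unescape() reverses all of them.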

'''Write 2 functions to serialize and deserialize an array of strings. Strings can contain any Unicode character.
Do not worry about string overflow.
input = ['abdcd', '4agasd-dsf>afdas', 'hi there I love you']
output = serialize(input)
deserialize(output) = ['abdcd', '4agasd-dsf>afdas', 'hi there I love you']'''
import html


def serialize(input):
    input_lens = []
    escaped_data = []
    for i in input:
        # Escape first so that '>' never appears in the data section.
        escaped = html.escape(i)
        # Store the length of the escaped text, since that is what gets sliced later.
        input_lens.append(str(len(escaped)))
        escaped_data.append(escaped)
    meta_data = ",".join(input_lens)
    meta_data_len = len(meta_data)
    data = "".join(escaped_data)
    return "{}>{}{}".format(meta_data_len, meta_data, data)


def deserialize(serialized):
    # The only '>' left in the message is the one after meta_data_len.
    header, rest = serialized.split('>', 1)
    meta_data_len = int(header)
    meta_data = rest[:meta_data_len]
    data = rest[meta_data_len:]
    # An empty meta_data means the input list was empty.
    data_len = meta_data.split(',') if meta_data else []
    output = []
    start = 0
    for length in data_len:
        end = start + int(length)
        token = data[start:end]
        output.append(html.unescape(token))
        start = end
    return output


# test
input = ['abdcd', '4agasd-dsf>afdas', 'hi there I love you']
output = serialize(input)
print("serialized = {}".format(output))
print("deserialized = {}".format(deserialize(output)))
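As a side note, escaping is not the only way to handle the delimiter. The length prefix consists of digits only, so the first '>' in the message is always the real delimiter, and splitting only on that first occurrence is enough. The sketch below shows this alternative; it is not the approach used above, and the names serialize_plain and deserialize_plain are made up for this example.

```python
def serialize_plain(strings):
    # Same layout as above, but the data is stored without escaping.
    meta_data = ",".join(str(len(s)) for s in strings)
    return "{}>{}{}".format(len(meta_data), meta_data, "".join(strings))


def deserialize_plain(serialized):
    # partition() splits on the first '>' only; later ones belong to the data.
    header, _, rest = serialized.partition(">")
    meta_len = int(header)
    meta_data, data = rest[:meta_len], rest[meta_len:]
    lengths = [int(n) for n in meta_data.split(",")] if meta_data else []
    result = []
    start = 0
    for length in lengths:
        result.append(data[start:start + length])
        start += length
    return result


sample = ['abdcd', '4agasd-dsf>afdas', 'a>b', '1,2', '']
print(deserialize_plain(serialize_plain(sample)) == sample)  # True
```

Commas and digits inside the data are also safe here, because meta_data is cut out by its recorded length and never searched for.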
