I recently had a baffling experience in Google Colab: the number of records in a dataframe with a non-null value for a particular field (“Outcome”) suddenly dropped.
I checked the raw source data in Excel and, filtering on the “Outcome” column, found that the number of Excel ‘blanks’ was lower than the number of records showing null for “Outcome” in the dataframe.
Trawling through the specific records that showed a null in the dataframe, I found that some of these were in fact not blank in Excel but contained the literal string “None”.
This had always been the case, but now these “None” cell values were suddenly being converted to Python nulls in the dataframe records (confusingly, these are displayed as ‘None’ or ‘NaN’ in Colab).
The dataframe was loaded from Excel using the pandas.read_excel function.
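The conversion is easy to reproduce in isolation. Here is a minimal sketch (the file name and column are made up for illustration, and writing the workbook requires openpyxl):

import pandas as pd

# Write a tiny workbook containing the literal string 'None' and a genuinely blank cell.
pd.DataFrame({'Outcome': ['Done', 'None', None]}).to_excel('demo.xlsx', index=False)

# Read it back with the defaults: both the 'None' string and the blank cell
# come through as nulls, indistinguishable from one another.
df = pd.read_excel('demo.xlsx')
print(df['Outcome'].tolist())  # expected: ['Done', nan, nan]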
So how do we get “None” strings to appear as “None” strings in the dataframe rather than Python ‘None’ nulls?
It turns out that by default read_excel converts certain cell values, including the string “None”, to null, as per its built-in list of default NA values.
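For reference, that default list can be inspected in pandas itself; the constant below is an internal detail, so its import path may vary across pandas versions:

from pandas._libs.parsers import STR_NA_VALUES

# The set of strings that read_excel (and read_csv) treat as NA by default.
# Note that both '' and 'None' are members, which explains the behaviour above.
print(sorted(STR_NA_VALUES))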
To get the “None” values read in as “None” rather than null, one has to set keep_default_na to False; if we still want empty cells to be read in as nulls, however, we also need to tell the function this explicitly by setting na_values to '' (an empty string), e.g.
df = pd.read_excel(io='TestWorkbook.xlsx', sheet_name='Sheet1', na_values='', keep_default_na=False)
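Re-reading the throwaway demo.xlsx from the earlier sketch with these two parameters confirms the fix:

df = pd.read_excel('demo.xlsx', na_values='', keep_default_na=False)
print(df['Outcome'].tolist())  # expected: ['Done', 'None', nan]
# The 'None' string now survives intact, while the genuinely blank cell is still read as null.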
This and further subtleties (e.g. Excel displaying in its user interface a value different from what is in fact stored) are described in this Stack Overflow post.
In summary: another unexpected change of behaviour in Google Colab Python code. With the help of the API documentation we can explain and work around it, though why the behaviour suddenly changed (previously ‘None’ strings in Excel were read into dataframes unchanged, rather than converted to nulls) I have no idea!