I have an .xlsx file and I would like to get it in the format of a MultiIndex.
I know how to get the values of the first or second column.
from openpyxl import load_workbook
import pandas as pd

wb = load_workbook(filename='Trees.xlsx')

# Test whether the cell value contains "Division"; if so, store it
divisions = (cell.value for cell in wb['Industry']['A']
             if cell.data_type == "s" and "Division" in cell.value)

# Loop over the second column; check whether there is already something in the first column
sub_divisions = (cell.value for cell in wb['Industry']['B']
                 if cell.data_type == "s")
I don't know how to fit one inside the other. I tried the following:
for cell in wb['Industry']['A']:
    if cell.data_type == "s" and "Division" in cell.value:
        divisions = divisions + cell.value
        for sub_cell in wb['Industry']['B'] # How to add the condition : from the row just below the one which had the last division
            if ((sub_cell.data_type == "s") and ("Division" not in cell['A'][cell.row+1].value)):
                sub_divisions = sub_divisions + sub_cell
It doesn't work; it's only a rough draft.
A friend advised me to use Elasticsearch with indexes, and I was also thinking of using what seems to be called the SIC code in the first column.
The .xlsx file is here.
I'm not entirely sure what kind of output you're expecting.
From the structure of the Excel sheet, I deduce that column "A" mixes two things: numbers, and heading texts such as "Division A: Agriculture, Forestry, And Fishing", etc. Each of these "headings" appears to be preceded by a blank line.
In column B there are basically text strings that act as "second-level headings", plus blank cells, where we have to assume that we are still under the same second-level heading unless a new first-level heading appears in the previous column.
The same is true for column C and D.
Therefore I understand that what you want is to actually store the numbers that appear in column A, but indexed (with a multi-level index) according to the texts of columns A (when it appears), B, C and D.
The problem is quite difficult because the sheet is not well structured: looking at a single column, such as C, when we see empty cells it is not easy to tell whether we are still within the same subcategory of C or have already moved to a new one (because new categories have appeared in columns A or B).
I also understand that when a new category appears in a column to the left, the category of the current column is "reset", so to speak, to a neutral value meaning "there is no subcategory yet", which I have chosen to represent as "---".
Proposed solution
This is the solution I propose (I show what the DataFrame looks like after each step so that it is easier to follow):
1. Read the sheet
After reading it, I delete the first row that was empty, and I rename the columns to letters to make it more manageable:
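A minimal sketch of this step, not necessarily the exact code used for the answer. It assumes the sheet is read with `header=None` so that no row is consumed as column names; the helper names `tidy_raw_sheet` and `read_industry_sheet` are hypothetical.

```python
import pandas as pd


def tidy_raw_sheet(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop the first (empty) row and rename the columns to A, B, C, ..."""
    df = raw.iloc[1:].reset_index(drop=True)
    # One letter per column, in spreadsheet order.
    df.columns = [chr(ord("A") + i) for i in range(df.shape[1])]
    return df


def read_industry_sheet(path: str = "Trees.xlsx") -> pd.DataFrame:
    """Read the 'Industry' sheet without treating any row as a header."""
    raw = pd.read_excel(path, sheet_name="Industry", header=None)
    return tidy_raw_sheet(raw)


# Usage, assuming Trees.xlsx is in the working directory:
# df = read_industry_sheet()
# print(df.head(10))
```

Keeping the clean-up in its own function makes it possible to try it on a small hand-built DataFrame without needing the workbook.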
Appearance of df (first 10 rows):
2. Separate the numbers in column A to another column (E)
I will create a new column (E) containing those numbers, or the text "---" when there is no number in A. At the same time, I will remove the numbers from column A (replacing them with NaN) so that only the texts remain:
Appearance of df now:
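As a sketch of step 2, the split could be written as follows. The function name `split_numbers_from_a` and the sample rows are hypothetical, and the check assumes that anything numeric in column A is a code while everything else is a heading or a blank.

```python
import numbers

import numpy as np
import pandas as pd


def split_numbers_from_a(df: pd.DataFrame) -> pd.DataFrame:
    """Move the numbers of column A into a new column E."""
    out = df.copy()
    # True where the cell in A holds a real number (NaN does not count).
    is_number = out["A"].map(
        lambda v: isinstance(v, numbers.Number) and not pd.isna(v)
    )
    # E keeps the numbers and gets "---" everywhere else.
    out["E"] = out["A"].where(is_number, "---")
    # A keeps only the texts; the numbers become NaN.
    out["A"] = out["A"].mask(is_number)
    return out


# Hypothetical sample, not taken from the real sheet.
sample = pd.DataFrame({"A": ["Division A: Agriculture", np.nan, 111, 112]})
result = split_numbers_from_a(sample)
```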
3. Fill the level 2, 3, etc. headings when they "reset"
Every time a new header appears at level N, all lower levels (N+1, N+2, ...) are "reset" (I assign "---" to them):
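One possible way to express this reset, as a sketch: for each row that has a header in some level, every still-empty cell in the columns to its right gets "---". The function name `reset_lower_levels` and the sample rows are hypothetical; the columns are cast to `object` first so that strings can be written into columns that were entirely empty.

```python
import numpy as np
import pandas as pd

LEVELS = ["A", "B", "C", "D"]


def reset_lower_levels(df: pd.DataFrame, levels=LEVELS) -> pd.DataFrame:
    """Write "---" into the empty lower levels of every header row."""
    # astype returns a new frame, so the input is left untouched.
    out = df.astype({col: object for col in levels})
    for i, level in enumerate(levels[:-1]):
        has_header = out[level].notna()
        for lower in levels[i + 1:]:
            # Only fill cells that are still empty.
            out.loc[has_header & out[lower].isna(), lower] = "---"
    return out


# Hypothetical sample, not taken from the real sheet.
sample = pd.DataFrame({
    "A": ["Division A", np.nan, np.nan, np.nan],
    "B": [np.nan, "Crops", np.nan, np.nan],
    "C": [np.nan, np.nan, "Grains", np.nan],
    "D": [np.nan, np.nan, np.nan, np.nan],
})
result = reset_lower_levels(sample)
```

Rows without any header (the last row of the sample) are left as NaN, so a later fill-down can still reach them.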
New look of df:
4. Remove NaNs from headers
Every time a new header appears in a column, from there on down I fill all the NaN cells of that column with that header. Thanks to the "---" values placed earlier, those act as a "brake" that stops the fill:
New look of df:
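A forward fill is one way to get this behaviour; the sketch below assumes pandas' `DataFrame.ffill`, which copies the last non-NaN value downwards, so a "---" cell is itself propagated and stops the previous header from spreading further. The function name `fill_headers_down` and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

LEVELS = ["A", "B", "C", "D"]


def fill_headers_down(df: pd.DataFrame, levels=LEVELS) -> pd.DataFrame:
    """Propagate each header (or "---") down over the NaN cells below it."""
    out = df.copy()
    out[levels] = out[levels].ffill()
    return out


# Hypothetical sample, not taken from the real sheet.
sample = pd.DataFrame({
    "A": ["Division A", np.nan, np.nan, "Division B", np.nan],
    "B": ["---", "Crops", np.nan, "---", "Metal"],
    "C": ["---", "---", "Grains", "---", "---"],
    "D": ["---", "---", "---", "---", "---"],
})
result = fill_headers_down(sample)
```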
5. Multi-index
Finally! With the header data thus replicated, we can create a multi-index dataframe using columns A, B, C and D as indexes and column E as final data:
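A sketch of this last step using `DataFrame.set_index` with a list of columns, which produces a `MultiIndex`. The function name `to_multiindex` and the sample rows are hypothetical.

```python
import pandas as pd


def to_multiindex(df: pd.DataFrame,
                  levels=("A", "B", "C", "D"),
                  value="E") -> pd.DataFrame:
    """Use the header columns as index levels and keep only the data column."""
    return df.set_index(list(levels))[[value]]


# Hypothetical sample, not taken from the real sheet.
sample = pd.DataFrame({
    "A": ["Division A", "Division A", "Division A"],
    "B": ["---", "Crops", "Crops"],
    "C": ["---", "---", "Grains"],
    "D": ["---", "---", "---"],
    "E": ["---", "---", 111],
})
result = to_multiindex(sample)
```

With the index in place, a selection such as `result.loc["Division A"]` returns every row under that division.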
Final appearance (first 15 rows):
Table-html version of the final result, where the structure of the multi-index is better appreciated: