Question

Asked: 2020-05-10 09:19:55 +0800 CST

How to design a MultiIndex DataFrame with levels included?

I have an .xlsx file and I would like to get it in the format of a MultiIndex.

I know how to get the values of the first or second column.

from openpyxl import load_workbook
import pandas as pd

wb = load_workbook(filename='Trees.xlsx')

# Test if the cell values are "Division", if it is the case, store it
divisions = (cell.value for cell in wb['Industry']['A']
                 if cell.data_type == "s" and "Division" in cell.value)

# Loop over the second column, check whether there is already something in the first column
sub_divisions = (cell.value for cell in wb['Industry']['B']
                 if cell.data_type == "s")

I don't know how to fit one inside the other. I tried the following:

for cell in wb['Industry']['A']:
    if cell.data_type == "s" and "Division" in cell.value:
        divisions = divisions + cell.value
        for sub_cell in wb['Industry']['B'] # How to add the condition : from the row just below the one which had the last division
            if ((sub_cell.data_type == "s") and ("Division" not in cell['A'][cell.row+1].value)):
                sub_divisions = sub_divisions + sub_cell

It doesn't work; it's only a rough draft.

A friend advised me to use Elasticsearch with indexes, and I was also thinking of using what seems to be called SIC in the first column.

The .xlsx file is here.

1 Answer

abulafia · Answer 1 · 2020-05-13T10:34:58+08:00

I'm not entirely sure what kind of output you're expecting.

Judging from the structure of the Excel sheet, column "A" mixes two things:

Text that acts as a kind of first-level "heading", such as "Division A: Agriculture, Forestry, And Fishing". Each of these "headings" appears to be preceded by a blank row.
Numbers that are apparently some kind of code and that, as I understand it, would be the final "data" of the dataframe.

Column B basically contains text strings that act as "second-level headings", plus blank cells, where we have to assume that we are still under the same second-level heading unless a new first-level heading has appeared in the previous column.

The same is true for columns C and D.

So I understand that what you actually want is to store the numbers that appear in column A, indexed (with a multi-level index) by the texts of columns A (when one appears), B, C and D.

The problem is fairly difficult because the sheet is not well structured. Looking at a single column, such as C, a run of empty cells does not tell us whether we are still within the same subcategory of C or have already moved to a different one (because new categories have appeared in columns A or B).

I also understand that when a new category appears in a column to the left, the category of the current column is "reset", so to speak, to a neutral value that means "there is no subcategory yet", which I have chosen to represent as "---".
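The "reset" rule described above can be sketched in plain Python before bringing pandas into it. This is only an illustration under assumptions: the rows are a made-up three-level toy (not the real sheet), `None` stands for an empty cell, and `fill_paths` is a hypothetical helper name.

```python
def fill_paths(rows, blank="---"):
    """Return the full heading path for every row.

    A non-empty cell at level N becomes the current heading for that level
    and resets every deeper level to the neutral value `blank`.
    """
    width = len(rows[0])
    current = [blank] * width
    paths = []
    for row in rows:
        for level, value in enumerate(row):
            if value is not None:
                current[level] = value
                # A new heading invalidates whatever was below it
                for deeper in range(level + 1, width):
                    current[deeper] = blank
        paths.append(tuple(current))
    return paths


# Toy rows: one column per heading level, None for an empty cell
rows = [
    ("Division A", None, None),
    (None, "Crops", None),
    (None, None, "Wheat"),
    (None, None, "Rice"),
    (None, "Livestock", None),
    ("Division B", None, None),
]
paths = fill_paths(rows)
for path in paths:
    print(path)
```

Each printed tuple is the complete path of a row, which is the information the multi-level index needs.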

Proposed solution

This is the solution I propose (I show what the dataframe looks like after each step so that it is easier to follow):

1. Read the sheet

After reading it, I delete the first row that was empty, and I rename the columns to letters to make it more manageable:

import pandas as pd
import numpy as np
df = pd.read_excel("Trees.xlsx")[1:]
df.columns=["A", "B", "C", "D"]

What df looks like (first 10 rows):

                                A                             B                              C                 D
1   Division A: Agriculture, F...                           NaN                            NaN               NaN
2                             100  Agricultural production-crop                            NaN               NaN
3                             110                           NaN                    Cash grains               NaN
4                             111                           NaN                            NaN             Wheat
5                             112                           NaN                            NaN              Rice
6                             115                           NaN                            NaN              Corn
7                             116                           NaN                            NaN          Soybeans
8                             119                           NaN                            NaN  Cash grains, nec
9                             130                           NaN  Field crops, except cash g...               NaN
10                            131                           NaN                            NaN            Cotton

2. Separate the numbers in column A to another column (E)

I will create a new column (E) containing those numbers, or the text "---" when there is no number in A. At the same time, I will remove the numbers from column A (replacing them with NaN) so that only the texts remain:

df['E'] = df.A.apply(lambda x: x if type(x)==int else "---")
df.A = df.A.apply(lambda x: np.nan if type(x) == int else x)

What df looks like now:

                                A                             B                              C                 D    E
1   Division A: Agriculture, F...                           NaN                            NaN               NaN  ---
2                             NaN  Agricultural production-crop                            NaN               NaN  100
3                             NaN                           NaN                    Cash grains               NaN  110
4                             NaN                           NaN                            NaN             Wheat  111
5                             NaN                           NaN                            NaN              Rice  112
6                             NaN                           NaN                            NaN              Corn  115
7                             NaN                           NaN                            NaN          Soybeans  116
8                             NaN                           NaN                            NaN  Cash grains, nec  119
9                             NaN                           NaN  Field crops, except cash g...               NaN  130
10                            NaN                           NaN                            NaN            Cotton  131
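The `type(x) == int` test only matches Python integers. If some codes were read as floats or as numeric strings (an assumption, I have not seen that in this sheet), a more tolerant split could use `pd.to_numeric` with `errors="coerce"`. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy stand-in for column A: headings mixed with codes (one stored as a float)
toy = pd.DataFrame(
    {"A": ["Division A: Agriculture", 100, 110, 111.0, "Division B: Mining", 1000]},
    dtype=object,
)

# Anything that does not parse as a number becomes NaN
codes = pd.to_numeric(toy["A"], errors="coerce")
is_code = codes.notna()

# E gets the code as an int, or "---" on heading rows
toy["E"] = [int(code) if ok else "---" for code, ok in zip(codes, is_code)]
# A keeps only the headings; code rows become NaN
toy["A"] = toy["A"].where(~is_code)

print(toy)
```

Note that a heading consisting only of digits would be treated as a code by this variant.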

3. Fill the level 2, 3, etc. headings when they "reset"

Every time a new header appears at level N, all lower levels (N+1, N+2, ...) are "reset" (I assign "---" to them):

df.B = np.where(pd.notnull(df.A), "---", df.B)
df.C = np.where(pd.notnull(df.B), "---", df.C)
df.D = np.where(pd.notnull(df.C), "---", df.D)

New look of df:

                               A                             B                              C                 D    E
1   Division A: Agriculture, F...                           ---                            ---               ---  ---
2                             NaN  Agricultural production-crop                            ---               ---  100
3                             NaN                           NaN                    Cash grains               ---  110
4                             NaN                           NaN                            NaN             Wheat  111
5                             NaN                           NaN                            NaN              Rice  112
6                             NaN                           NaN                            NaN              Corn  115
7                             NaN                           NaN                            NaN          Soybeans  116
8                             NaN                           NaN                            NaN  Cash grains, nec  119
9                             NaN                           NaN  Field crops, except cash g...               ---  130
10                            NaN                           NaN                            NaN            Cotton  131
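With more heading levels, the same cascade can be written as a loop over adjacent column pairs. The sketch below runs on a made-up three-level frame and uses `Series.mask` instead of `np.where`, which is my substitution, not part of the steps above:

```python
import numpy as np
import pandas as pd

# Toy frame: one column per heading level, NaN for empty cells
toy = pd.DataFrame({
    "A": ["Division A", np.nan, np.nan, np.nan, "Division B", np.nan],
    "B": [np.nan, "Crops", np.nan, np.nan, np.nan, "Metal mining"],
    "C": [np.nan, np.nan, "Cash grains", np.nan, np.nan, np.nan],
})

levels = ["A", "B", "C"]
# Go left to right, so a "---" written into B also resets C on the same row
for parent, child in zip(levels, levels[1:]):
    toy[child] = toy[child].mask(toy[parent].notna(), "---")

print(toy)
```

The order matters: each column must be processed after its parent, otherwise the reset does not propagate down to the deepest levels.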

4. Remove NaNs from headers

Every time a new header appears in a column, I fill all the NaN cells below it in that column with that header. Because I put "---" in certain places before, those act as a "brake" that stops the padding:

# DataFrame.ffill() is the current spelling of fillna(method="pad"),
# which recent pandas versions deprecate
df = df.ffill()

New look of df:

                                A                             B                              C                 D    E
1   Division A: Agriculture, F...                           ---                            ---               ---  ---
2   Division A: Agriculture, F...  Agricultural production-crop                            ---               ---  100
3   Division A: Agriculture, F...  Agricultural production-crop                    Cash grains               ---  110
4   Division A: Agriculture, F...  Agricultural production-crop                    Cash grains             Wheat  111
5   Division A: Agriculture, F...  Agricultural production-crop                    Cash grains              Rice  112
6   Division A: Agriculture, F...  Agricultural production-crop                    Cash grains              Corn  115
7   Division A: Agriculture, F...  Agricultural production-crop                    Cash grains          Soybeans  116
8   Division A: Agriculture, F...  Agricultural production-crop                    Cash grains  Cash grains, nec  119
9   Division A: Agriculture, F...  Agricultural production-crop  Field crops, except cash g...               ---  130
10  Division A: Agriculture, F...  Agricultural production-crop  Field crops, except cash g...            Cotton  131
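To see why the "---" placeholders matter, it may help to compare a forward fill with and without one. A small sketch on two made-up Series:

```python
import numpy as np
import pandas as pd

# Without a placeholder, the old heading is copied all the way down
plain = pd.Series(["Cash grains", np.nan, np.nan, np.nan], dtype=object)
# With a placeholder at position 2, the fill restarts from "---"
with_brake = pd.Series(["Cash grains", np.nan, "---", np.nan], dtype=object)

filled_plain = plain.ffill().tolist()
filled_brake = with_brake.ffill().tolist()

print(filled_plain)
print(filled_brake)
```

In the second Series, the rows after the placeholder inherit "---" instead of the stale "Cash grains" heading.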

5. Multi-index

Finally! With the header data thus replicated, we can create a multi-index dataframe using columns A, B, C and D as indexes and column E as final data:

df = df.set_index(['A','B','C','D'])

Final appearance (first 15 rows):

A                              B                            C                              D                                  
Division A: Agriculture, Fo... ---                          ---                            ---                             ---
                               Agricultural production-crop ---                            ---                             100
                                                            Cash grains                    ---                             110
                                                                                           Wheat                           111
                                                                                           Rice                            112
                                                                                           Corn                            115
                                                                                           Soybeans                        116
                                                                                           Cash grains, nec                119
                                                            Field crops, except cash gr... ---                             130
                                                                                           Cotton                          131
                                                                                           Tobacco                         132
                                                                                           Sugarcane and sugar beets       133
                                                                                           Irish potatoes                  134
                                                                                           Field crops, except cash gr...  139
                                                            Vegetables and melons          ---                             160
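Once the index is in place, rows can be selected by heading. The sketch below rebuilds a tiny made-up frame with the same column layout (A to D as index, E as data) and shows a few lookups; the shortened labels are illustrative, not copied from the sheet:

```python
import pandas as pd

toy = pd.DataFrame({
    "A": ["Division A"] * 4,
    "B": ["---", "Crops", "Crops", "Crops"],
    "C": ["---", "---", "Cash grains", "Cash grains"],
    "D": ["---", "---", "---", "Wheat"],
    "E": ["---", 100, 110, 111],
}).set_index(["A", "B", "C", "D"])

# Every row under one second-level heading
crops = toy.xs("Crops", level="B")

# A single leaf, addressed by its full path
wheat_code = toy.loc[("Division A", "Crops", "Cash grains", "Wheat"), "E"]

# Only the rows that carry a code (drops the pure heading rows)
codes_only = toy[toy["E"] != "---"]

print(crops)
print(wheat_code)
print(codes_only)
```

`xs` drops the level it selects on by default, so `crops` is indexed by A, C and D only.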

Table-html version of the final result, where the structure of the multi-index is better appreciated:
