feature_engineering

what is feature engineering? ⭐

Feature engineering is the process of creating new features from existing data.
It is a key step in machine learning that can significantly improve model performance.
It covers transforming, combining, and extracting features from raw data.
It requires creativity and knowledge of the problem domain.
It can matter more than the choice of algorithm.

types of feature engineering ⭐

1. Numeric transformations

Scaling and normalization
Logarithmic transformations
Squares and square roots
Interactions between features
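As a minimal sketch of the first two items, the block below standardizes a right-skewed feature and compares that with a log transform. The `income` array, its lognormal distribution, and the `skewness` helper are assumptions made for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical right-skewed feature (e.g. income); the values are synthetic
income = rng.lognormal(mean=10.0, sigma=1.0, size=1000).reshape(-1, 1)

# Standardization: zero mean and unit variance, shape of the distribution unchanged
scaled = StandardScaler().fit_transform(income)

# Log transform: compresses the long right tail
logged = np.log1p(income)

def skewness(x):
    """Sample skewness (third standardized moment); illustrative helper."""
    x = x.ravel()
    return float(np.mean((x - x.mean()) ** 3) / x.std() ** 3)

print(f"skew raw:    {skewness(income):.2f}")
print(f"skew scaled: {skewness(scaled):.2f}")
print(f"skew logged: {skewness(logged):.2f}")
```

Standardization is a linear rescaling, so it leaves the skew as it was; the log transform is what changes the shape of the distribution.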

2. Categorical features

One-hot encoding
Label encoding
Target encoding
Embeddings for categories
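The difference between integer-style encoding and one-hot encoding can be sketched as follows. The example assumes the categories have a natural order and uses scikit-learn's `OrdinalEncoder` with that order given explicitly; the tiny DataFrame is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"education": ["Bachelor", "PhD", "High School", "Master", "Bachelor"]})

# Explicit order, lowest to highest; without it the encoder sorts alphabetically
order = [["High School", "Bachelor", "Master", "PhD"]]
encoder = OrdinalEncoder(categories=order)
df["education_ord"] = encoder.fit_transform(df[["education"]]).ravel()

# One-hot encoding for comparison: one indicator column per category
one_hot = pd.get_dummies(df["education"], prefix="education")

print(df)
print(one_hot.columns.tolist())
```

Integer codes imply an ordering, which suits ordered categories and tree-based models; one-hot encoding avoids that implication at the cost of more columns.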

3. Time-based features

Component extraction (year, month, day)
Cyclical features (sin, cos)
Lags and rolling statistics
Seasonal features
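Lags and rolling statistics can be sketched with pandas `shift` and `rolling`. The `sales` series below is a hypothetical example with synthetic values.

```python
import pandas as pd

# Hypothetical daily sales series; the values are synthetic
sales = pd.DataFrame(
    {"sales": [10.0, 12.0, 9.0, 15.0, 14.0, 18.0, 20.0]},
    index=pd.date_range("2020-01-01", periods=7, freq="D"),
)

# Lag feature: the value from the previous day
sales["sales_lag_1"] = sales["sales"].shift(1)

# Rolling mean over the 3 previous days; shift(1) keeps the current day's
# value out of its own feature
sales["sales_roll_mean_3"] = sales["sales"].shift(1).rolling(window=3).mean()

print(sales)
```

The first rows are NaN because there is not enough history yet; they are usually dropped or imputed before training.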

4. Text features

TF-IDF
Word embeddings
N-grams
Sentiment analysis

example code (feature engineering)

import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic data
np.random.seed(42)
n_samples = 1000

# Basic features
age = np.random.normal(45, 15, n_samples)
# Clip income to a positive floor so the log and ratio features stay well defined
income = np.clip(np.random.normal(50000, 20000, n_samples), 1000, None)
credit_score = np.random.normal(700, 100, n_samples)

# Time-based features
dates = pd.date_range('2020-01-01', periods=n_samples, freq='D')
purchase_date = np.random.choice(dates, n_samples)

# Categorical features
education = np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n_samples)
employment = np.random.choice(['Full-time', 'Part-time', 'Unemployed'], n_samples)

# Text features
job_titles = np.random.choice([
    'Software Engineer', 'Data Scientist', 'Manager', 'Analyst',
    'Developer', 'Consultant', 'Director', 'Coordinator'
], n_samples)

# Target
purchase = (age/100 + income/100000 + credit_score/1000) / 3 + np.random.normal(0, 0.1, n_samples) > 0.5

# Build the DataFrame
data = pd.DataFrame({
    'age': age,
    'income': income,
    'credit_score': credit_score,
    'purchase_date': purchase_date,
    'education': education,
    'employment': employment,
    'job_title': job_titles,
    'purchase': purchase
})

print("Original data:")
print(data.head())

1. numeric transformations

# Basic transformations
data['age_squared'] = data['age'] ** 2
data['income_log'] = np.log1p(data['income'])  # log1p(x) = log(1 + x), defined at x = 0
data['credit_income_ratio'] = data['credit_score'] / data['income']
data['age_income_interaction'] = data['age'] * data['income'] / 1000000

# Polynomial features (degree 2: squares and pairwise products)
poly_features = PolynomialFeatures(degree=2, include_bias=False)
numeric_cols = ['age', 'income', 'credit_score']
poly_data = poly_features.fit_transform(data[numeric_cols])
poly_df = pd.DataFrame(poly_data, columns=poly_features.get_feature_names_out(numeric_cols))

# Add the polynomial columns that are not already in the DataFrame
for col in poly_df.columns:
    if col not in data.columns:
        data[col] = poly_df[col]

print("\nData after numeric transformations:")
print(data[['age', 'age_squared', 'income', 'income_log', 'credit_income_ratio']].head())

2. categorical features

# One-hot encoding
education_encoded = pd.get_dummies(data['education'], prefix='education')
employment_encoded = pd.get_dummies(data['employment'], prefix='employment')

# Target encoding (mean of the target for each category)
# Note: computed on the full dataset here for brevity, which leaks the target
# into the feature; in practice, compute the means on training data only
target_encoding = data.groupby('education')['purchase'].mean()
data['education_target_enc'] = data['education'].map(target_encoding)

# Frequency encoding (how often each category occurs)
freq_encoding = data['education'].value_counts(normalize=True)
data['education_freq_enc'] = data['education'].map(freq_encoding)

# Add the encoded features
data = pd.concat([data, education_encoded, employment_encoded], axis=1)

print("\nTarget encoding for education:")
print(target_encoding)
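Target encoding computed on all rows lets each row see its own label. One common way to limit this is out-of-fold encoding, sketched below under the assumption of a 5-fold split; the DataFrame `df` and the column name `education_te_oof` are hypothetical and independent of the example above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
# Synthetic stand-in data: a category column and a binary target
df = pd.DataFrame({
    "education": rng.choice(["High School", "Bachelor", "Master", "PhD"], 200),
    "purchase": rng.integers(0, 2, 200),
})

df["education_te_oof"] = np.nan

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(df):
    fit_rows = df.iloc[fit_idx]
    # Category means come only from rows outside the fold being encoded
    fold_means = fit_rows.groupby("education")["purchase"].mean()
    encoded = df.iloc[enc_idx]["education"].map(fold_means)
    # Categories unseen in the fitting rows fall back to the fitting rows' mean
    encoded = encoded.fillna(fit_rows["purchase"].mean())
    df.loc[df.index[enc_idx], "education_te_oof"] = encoded.to_numpy()

print(df.head())
```

In a full workflow this would be applied to the training rows only, with test rows encoded from means computed on the whole training set.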

3. time-based features

# Extract date components
data['purchase_year'] = data['purchase_date'].dt.year
data['purchase_month'] = data['purchase_date'].dt.month
data['purchase_day'] = data['purchase_date'].dt.day
data['purchase_dayofweek'] = data['purchase_date'].dt.dayofweek

# Cyclical features for the month (period 12)
data['month_sin'] = np.sin(2 * np.pi * data['purchase_month'] / 12)
data['month_cos'] = np.cos(2 * np.pi * data['purchase_month'] / 12)

# Cyclical features for the day of the week (period 7)
data['dayofweek_sin'] = np.sin(2 * np.pi * data['purchase_dayofweek'] / 7)
data['dayofweek_cos'] = np.cos(2 * np.pi * data['purchase_dayofweek'] / 7)

print("\nTime-based features:")
print(data[['purchase_date', 'purchase_month', 'month_sin', 'month_cos']].head())
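To see why the sin/cos pair is used instead of the raw month number, the sketch below compares distances between months on the unit circle. The helper `encode_month` is a hypothetical name introduced for this illustration.

```python
import numpy as np

def encode_month(month):
    """Map a month number (1-12) to a point on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])

dec, jan, jun = encode_month(12), encode_month(1), encode_month(6)

# Raw month numbers put December and January 11 apart;
# on the circle they are neighbours, while June sits opposite December
dist_dec_jan = np.linalg.norm(dec - jan)
dist_dec_jun = np.linalg.norm(dec - jun)

print(f"December-January: {dist_dec_jan:.3f}")
print(f"December-June:    {dist_dec_jun:.3f}")
```

Both components are needed: sine alone or cosine alone maps two different months to the same value.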

4. text features

# TF-IDF for job titles
tfidf = TfidfVectorizer(max_features=10, stop_words='english')
job_tfidf = tfidf.fit_transform(data['job_title'])
job_tfidf_df = pd.DataFrame(job_tfidf.toarray(), columns=tfidf.get_feature_names_out())

# Add the TF-IDF features
for col in job_tfidf_df.columns:
    data[f'job_tfidf_{col}'] = job_tfidf_df[col]

print("\nTF-IDF features:")
print(job_tfidf_df.head())

5. evaluating the impact of feature engineering

# Feature sets: the three raw columns vs. everything engineered above
X_original = data[['age', 'income', 'credit_score']]
X_engineered = data.drop(['purchase_date', 'education', 'employment', 'job_title', 'purchase'], axis=1)

y = data['purchase']
# Split both feature sets in one call so they share the same train/test rows
X_train_orig, X_test_orig, X_train_eng, X_test_eng, y_train, y_test = train_test_split(
    X_original, X_engineered, y, test_size=0.2, random_state=42
)

# Train the models
model_original = RandomForestClassifier(random_state=42)
model_engineered = RandomForestClassifier(random_state=42)

model_original.fit(X_train_orig, y_train)
model_engineered.fit(X_train_eng, y_train)

# Evaluate
y_pred_orig = model_original.predict(X_test_orig)
y_pred_eng = model_engineered.predict(X_test_eng)

print(f"Accuracy (original features): {accuracy_score(y_test, y_pred_orig):.3f}")
print(f"Accuracy (engineered features): {accuracy_score(y_test, y_pred_eng):.3f}")
print(f"Number of features (original): {X_original.shape[1]}")
print(f"Number of features (engineered): {X_engineered.shape[1]}")
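A single train/test split gives a noisy estimate, so a comparison like the one above is often repeated with cross-validation. The sketch below is self-contained and uses its own synthetic data; the target rule (a threshold on `age * income`) and the column name `age_x_income` are assumptions chosen so that an interaction feature is relevant.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 500
X_base = pd.DataFrame({
    "age": rng.normal(45, 15, n),
    "income": rng.normal(50000, 20000, n).clip(min=1000),
})
# Hypothetical target that depends on the product of the two features
y = (X_base["age"] * X_base["income"] > 45 * 50000).astype(int)

# Engineered set: the base columns plus the interaction term
X_eng = X_base.copy()
X_eng["age_x_income"] = X_eng["age"] * X_eng["income"]

model = RandomForestClassifier(n_estimators=50, random_state=42)
scores_base = cross_val_score(model, X_base, y, cv=5, scoring="accuracy")
scores_eng = cross_val_score(model, X_eng, y, cv=5, scoring="accuracy")

print(f"base:       {scores_base.mean():.3f} +/- {scores_base.std():.3f}")
print(f"engineered: {scores_eng.mean():.3f} +/- {scores_eng.std():.3f}")
```

Reporting the mean together with the spread across folds makes it easier to judge whether a difference between feature sets is larger than the noise.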

practical exercises

Experiment with different transformations - test a range of mathematical functions.
Create interaction features - combine different features in creative ways.
Feature selection - apply feature selection techniques after feature engineering.
Domain features - create features specific to the problem at hand.
Automated feature engineering - use libraries such as Featuretools.
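For the feature selection exercise, one possible starting point is univariate selection with scikit-learn's `SelectKBest`. The sketch below assumes synthetic data in which only the first two columns carry signal; the choice of `f_classif` as the scoring function is one option among several.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 500
informative = rng.normal(size=(n, 2))   # two columns that drive the target
noise = rng.normal(size=(n, 8))         # eight irrelevant columns
X = np.hstack([informative, noise])
y = (informative[:, 0] + informative[:, 1] > 0).astype(int)

# Keep the k features with the highest ANOVA F-score against the target
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
selected_idx = selector.get_support(indices=True)

print("selected columns:", selected_idx.tolist())
print("reduced shape:", X_selected.shape)
```

Univariate scores look at one feature at a time, so they can miss features that only matter in combination; model-based importance is a common complement.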

good practices

Validation: Always validate new features on held-out data, such as a validation set or cross-validation folds.
Overfitting: Watch out for overfitting when creating too many features.
Interpretability: Create features that make business sense.
Scaling: Remember to scale features after feature engineering.
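The validation and scaling points can be combined by putting the feature steps in a scikit-learn pipeline, so that every step is fitted on training rows only. This is a minimal sketch on synthetic data; the use of `LogisticRegression` as the final model and the target rule are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
# Two synthetic columns on very different scales (age-like and income-like)
X = rng.normal(loc=[45, 50000], scale=[15, 20000], size=(400, 2))
y = (X[:, 0] / 45 + X[:, 1] / 50000 > 2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature creation first, scaling second, model last
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
pipe.fit(X_train, y_train)  # scaler statistics come from X_train only

test_accuracy = pipe.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Because the scaler sits inside the pipeline, the test rows are transformed with the training mean and variance, which keeps information from the test set out of the fitted features.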

feature engineering tools

scikit-learn: PolynomialFeatures, StandardScaler
pandas: get_dummies, groupby
Featuretools: Automated feature engineering
tsfresh: Time-series features
textblob: Text analysis
