Feature Subset Optimization II: Heuristics, Baselines, and Validation Discipline

Advanced Combinatorial Search in DS

Created by Pavel · 12.03.2026 at 07:54 UTC · 1 completed

When exhaustive search is too expensive, a common practical approach is greedy forward selection.

Forward-selection loop:
1. start with empty set,
2. evaluate adding each remaining feature,
3. add the feature with best score gain,
4. stop by criterion (max features, no improvement, budget).

Why baseline comparison matters:
- compare greedy search against random subset sampling,
- otherwise observed gains may be due to chance.

Validation discipline:
- use fixed split protocol or cross-validation,
- keep selection and evaluation separated to avoid leakage,
- test stability across random seeds.

Edge cases:
- local optimum traps,
- noisy score function,
- ties between candidate features.

University approvals: 0

scikit-learn.org/stable/modules/feature_selection.html

article

scikit-learn.org

Tasks

Question 1

Why should heuristic feature search be compared with a random baseline?

Hint

Baselines calibrate claim strength.

Because random search is always optimal.

To test whether heuristic gains exceed chance performance.

To avoid using validation data.

Because heuristics cannot use constraints.

Question 2

Code task: implement one forward-selection step.

def add_best_feature(current_set, candidates, score_fn):
    """Return set after adding single candidate with highest score."""
    # TODO

score_fn(feature_set) returns numeric score (higher is better).

Submission format: submit the full function snippet shown above with # TODO replaced.

Hint

Do not mutate current_set while scoring proposals.

def add_best_feature(current_set, candidates, score_fn):
    # TODO: add single best-scoring candidate
    pass

Starter code is prefilled; replace TODO blocks with your solution.

Runtime output (stdout/stderr)

2 test cases will be used for grading

Run checks runtime behavior only. Final correctness is evaluated when you submit.

Question 3

Which practice best reduces selection bias while evaluating chosen feature subsets?

Hint

Prevent information leakage from evaluation set.

Use same data split repeatedly until score looks good.

Separate feature-selection process from final evaluation data.

Tune and evaluate on training data only.

Skip random seeds for reproducibility.

Question 4

Code task: implement improvement-based stop criterion.

def should_stop(last_best_score, new_best_score, min_gain=0.0):
    """Return True when improvement is smaller than min_gain."""
    # TODO

Submission format: submit a full function definition def should_stop(...): ....

Hint

Gain is new_best_score - last_best_score.

def should_stop(last_best_score, new_best_score, min_gain=0.0):
    # TODO: True if gain < min_gain
    pass

Starter code is prefilled; replace TODO blocks with your solution.

Runtime output (stdout/stderr)

2 test cases will be used for grading

Run checks runtime behavior only. Final correctness is evaluated when you submit.

Card Info

Topic: Combinatorial Search in DS
Difficulty: Advanced
Completed: 1 users

Creator

Pavel