Automating Spyware Detection with Machine Learning & GitHub Actions

 

Automating Spyware Detection with Machine Learning & GitHub Actions

🔍 Introduction

Spyware is a growing cybersecurity threat, silently infiltrating systems to steal sensitive data. Detecting such malware requires constant updates to detection models, but manually retraining and deploying models is inefficient. This project automates the entire process—training, versioning, and deploying spyware detection models—using GitHub Actions and GitHub Releases.

By integrating machine learning with continuous integration (CI), this system ensures that the latest spyware detection model is always available for use in real-time security applications.

🛠 Project Overview

The project consists of two repositories:

  1. spyware-detector-training (this repo) - Automates model training and deployment.
  2. spyware-detector (main project) - Uses the latest model from the training repo for real-time spyware detection.

📌 Key Features

Automated Model Training: The system retrains the model whenever the dataset is updated or the algorithm is modified.
Seamless Deployment: The trained model is published as a release on GitHub.
Main Project Integration: The spyware detector automatically fetches the latest model from the releases.

📂 Dataset & Feature Extraction

Spyware detection relies on behavioral analysis—extracting meaningful patterns from files and processes. The dataset includes:

  • File permissions and access logs
  • Network behavior analysis
  • System API calls

Feature Extraction Process

  1. Preprocessing: Cleaning and normalizing raw malware logs.
  2. Feature Engineering: Extracting crucial indicators (e.g., registry changes, process injections).
  3. Vectorization: Converting extracted data into a format suitable for ML models.

🤖 Machine Learning Model

The system currently uses Random Forest for classification, but it’s modular enough to support future improvements with deep learning or ensemble techniques.

📊 Model Pipeline

  • Train: The dataset is processed, features are extracted, and the model is trained.
  • Evaluate: Performance metrics like accuracy, recall, and F1-score are calculated.
  • Deploy: The trained model is saved and uploaded to GitHub Releases.

🚀 Automating Model Training with GitHub Actions

1️⃣ GitHub Actions Workflow

Whenever a change is pushed to spyware-detector-training, the following workflow is triggered:

name: Train and Release Model

on:
  push:
    branches:
      - main  # Trigger when pushing to main
  workflow_dispatch:  # Allow manual trigger

jobs:
  train-and-release:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install Dependencies
        run: |
          pip install -r requirements.txt

      - name: Train the Model
        run: |
          python main.py  # Modify if needed to run training

      - name: Archive Model
        run: |
          mkdir -p release
          cp models/saved/*/model.pkl release/model.pkl
          cp models/saved/*/metadata.json release/metadata.json
          cp models/saved/*/metrics.json release/metrics.json
          tar -czvf trained_model.tar.gz -C release .

      - name: Create GitHub Release
        uses: softprops/action-gh-release@v1
        with:
          tag_name: latest-model-$
          name: Latest Trained Model
          body: "This model was trained automatically on commit $"
          files: trained_model.tar.gz
        env:
          GITHUB_TOKEN: $

2️⃣ Main Project Fetches the Latest Model

The main spyware detector fetches the latest model from GitHub Releases using this script:

import requests

MODEL_URL = "https://github.com/ahmed-n-abdeltwab/spyware-detector-training/releases/latest/download/model.pkl"

def download_latest_model():
    response = requests.get(MODEL_URL)
    if response.status_code == 200:
        with open("model.pkl", "wb") as f:
            f.write(response.content)
        print("✅ Latest model downloaded successfully!")
    else:
        print("❌ Failed to download the latest model.")

download_latest_model()

📢 Future Improvements

🚀 Enhanced ML Models – Testing deep learning approaches like CNNs for behavior analysis.
🔄 Continuous Dataset Expansion – Adding real-time threat intelligence.
📡 Real-time Model Serving – Integrating a cloud API to provide live predictions.

🔗 Conclusion

This project showcases how automation and machine learning can enhance cybersecurity. By integrating GitHub Actions, automated model training, and CI/CD, we ensure that the spyware detection system remains up-to-date against evolving threats.

👉 Star the repo and contribute! Let’s build a more secure future. 🛡️🚀