Hello I'm

KPATOUKPA

Kpodjro

Data Scientist / Machine Learning Engineer

About me

Machine Learning Engineer with hands-on experience in developing and optimising large language models (LLMs) and applied AI systems. Currently completing a one-year work-study placement in Data Science and Generative AI, where I design and deploy end-to-end machine learning solutions in a real-world, production-oriented environment.

My background includes strong expertise in data mining, NLP, question answering systems and language model fine-tuning, with a 23% improvement in predictive performance achieved during a recent internship. I have also led research-oriented projects on semi-supervised learning, LLM-augmented recommendation and part-of-speech tagging.

I am currently pursuing a Master’s degree in Data Science at Université Paris Cité, specialising in machine learning algorithm development, advanced data analysis and decision support systems.

My experiences

Data Scientist - Generative AI

URSSAF Caisse Nationale

October 2025 – September 2026

  • Design and development of an internal chatbot based on LLMs.
  • Implementation of NLP pipelines for analysing large-scale datasets and user verbatims.
  • Development of predictive and scoring models to support decision-making and risk prioritisation.
  • Collaboration with business teams and IT to deploy AI solutions.
  • Contribution to the evaluation and improvement of generative AI models in production.

Technologies:

Python, PyTorch, LangChain, LangGraph, Docker, Kubernetes, GitLab, PostgreSQL, OpenSearch, OVH Cloud

End-of-studies internship

ATTIJARIWAFA BANK

March 2024 – August 2024

  • Data extraction from PDF and text files.
  • Embedding text in a ChromaDB database for semantic text retrieval.
  • Development of an enhanced RAG system for managing customer queries.
  • Fine-tuning of the Mistral model and comparison of results (against RAG).
  • Implementation of two user interfaces: one for chatbot use, one for real-time supervision based on 4 KPIs.

Technologies:

Mistral 7B, LangChain, PyTorch, Streamlit, VS Code, Hugging Face, MongoDB (NoSQL), Jira

Assistant Data Scientist Internship

MLView Consulting

August 2023 – September 2023

  • Exploratory analysis of marketing data in collaboration with the marketing team.
  • Training of a deep learning model to predict customer interest in specific campaigns or products.
  • Fine-tuning of the Llama 2 large language model (LLM) for personalized teaser generation.

Technologies:

LangChain, Hugging Face, Llama 2, TensorFlow, VS Code, Colab, Git & GitHub, SQL

Education

2024 – 2026

MSc in Machine Learning for Data Science
Paris Cité University, Paris, France

Reinforcement Learning & Graph Learning, Recommendation Systems, Clustering & Dimensionality Reduction, NLP, Data Visualization, Time Series, Mixture Models, Vertex AI (GCP)

2021 – 2024

Engineering Degree in Software and Intelligent Systems
Abdelmalek Essaadi University, Tangier, Morocco

Machine Learning, Deep Learning, AI Methodology, Computer Vision, Data Mining, Inferential Statistics, BI, UML, Informed Search Algorithms, Design Patterns

2019 – 2021

Associate's Degree in Mathematics, Computer Science & Physics
Hassan 1st University, Settat, Morocco

Algebra, Numerical & Complex Analysis, Statistics, C/C++, Mathematical Optimization, Databases, Graph Theory


Discover my

Research Papers

Unsupervised and Semi-Supervised Learning with CatGAN and RIM

Research project (in a team of 4 students):

10/2025 - 12/2025

Abstract: This research explores semi-supervised learning under extreme label scarcity. Several approaches are studied and compared, including supervised baselines, CatGAN (MLP and CNN variants), and Regularized Information Maximization (RIM). The results demonstrate that combining CatGAN with convolutional architectures significantly improves classification performance by leveraging unlabeled data and maximizing information content in the discriminator’s predictions.

Main tasks:

  • Implementation of supervised baselines using MLP and CNN discriminators trained on only 100 labeled samples;
  • Design and training of CatGAN in a semi-supervised setting (MLP and CNN variants);
  • Implementation of RIM as a non-adversarial semi-supervised baseline;
  • Analysis of training dynamics using entropy-based objectives and marginal entropy monitoring;
  • Comprehensive evaluation using accuracy, recall, F1-score, AUC, confusion matrices, and per-class performance.

LLM with Graph Augmentation for Recommendations

Research project (in a team of 4 students):

10/2024 - 05/2025

Abstract: Enhanced movie recommendations using LLMs (Gemini-1.5, Mistral) to enrich user/item profiles, significantly improving accuracy in LightGCN, MLP, and Matrix Factorization by addressing data sparsity and enabling nuanced personalization. Focused on responsible integration, acknowledging challenges like bias and cost.

Main tasks:

  • Getting familiar with the dataset and understanding the underlying problem;
  • Studying and building a baseline with models suited to recommender systems (here: LightGCN, MLP and Matrix Factorization);
  • Prompt engineering and enrichment of the dataset by generating meaningful attributes;
  • Performing MLOps on the new dataset (objective: predict the rating);
  • Studying the impact and making recommendations regarding the approach.

Improving Part-of-Speech Tagging in English with TreeTagger

Research project (in a team of 4 students):

01/2025 - 02/2025

Abstract: This paper explores the use of TreeTagger to accurately identify the different functions of the word "that" in English, such as conjunction, relative pronoun, determiner, or adverb. We first evaluate pre-trained models from the BNC and Penn corpora, then re-train TreeTagger with specific labels derived from the Brown corpus to enhance accuracy. Comparisons with Stanza and UDPipe are presented. The main findings demonstrate that re-training with the Brown corpus significantly improves the tool’s performance and ability to distinguish between various uses of "that".

Main Tasks:

  • Data collection and preparation, including annotating the Brown corpus with specific labels.
  • Initial evaluation of BNC and Penn models for categorizing "that".
  • Re-training TreeTagger with a tailored label set.
  • Comparison of performance with other tools (Stanza and UDPipe).
  • Analyzing the impact of training data size on tagging precision.
  • Proposing methods to better detect and categorize "that" in various linguistic contexts.

Explore My

Skills

Data Science Skills

Languages

Python, Java, R, C/C++, PL/SQL

IDE

Anaconda, Eclipse, VS Code, RStudio

Data Processing

pandas, NumPy, statsmodels, scikit-learn, PySpark

Modeling

scikit-learn, TensorFlow, Keras, PyTorch

Machine Learning

Supervised, Unsupervised, Reinforcement, Ensemble Learning

Deep Learning

CNN, RNN, LSTM, ANN, TensorFlow, Keras, GNN

Computer Vision

OpenCV, Tesseract, KerasCV, Pillow

NLP

BERT, KerasNLP, Llama 2, Mistral

R & R-Shiny

Statistical Modeling, Dashboard Development

Databases

MySQL, PostgreSQL, MongoDB, Oracle, Hive

Web and Other Skills

Frontend Development

HTML5 & CSS3, Streamlit, Flask, FastAPI, Shiny, Angular (Beginner)

Scraping

Scrapy, BeautifulSoup, Selenium, pytrends

Versioning and Collaboration

Git & GitHub

Containerization and Deployment

Docker

Work Automation & Cloud

Airflow, cron (Linux), AWS, GCP, Vertex AI

Project Management

Gantt Project, Jira

LaTeX

Scientific Document Preparation

Power BI & Tableau

Data Visualization, Dashboard Development

Browse my

Projects

Weak Signal Detection and Anomalous Behavior Analysis

Feature extraction (FFT, spectral features), LLE, t-SNE, K-Means, Hierarchical Clustering, Spectral Clustering, Fuzzy K-Means

Distributed and Streaming Optimization of K-Means for Big Data

PySpark, Spark Streaming, MLlib, Apache Beam, Big Data

NLP-Based People Insights from Large-Scale Textual Feedback

Python, NLP, Scikit-learn, UMAP, Sentiment Analysis

Implementation of a supermarket supply anticipation system based on sales

scikit-learn, TensorFlow, Keras, Matplotlib, SQL

Calculation of the probability of credit repayment

scikit-learn, seaborn, XGBoost, LightGBM

Real-time customer unsubscription prediction

Kafka Streams, PySpark, scikit-learn, Flask, Angular, Docker, SQL

Implementation of an Image Captioning platform

scikit-learn, OpenCV, Flask

Advanced RAG system with Llama 2

LangChain, Streamlit, FAISS, Llama 2

Image search by content based on color and shape

My

Licenses and Certifications

Certification 1

Advanced Machine Learning

Date of issue: 09/2022

Issuing organization: Huawei

Certification 2

TensorFlow for Computer Vision

Date of issue: 10/2023

Issuing organization: OpenCV University

Certification 3

GenAI for Geospatial Data

Date of issue: 10/2023

Issuing organization: NASA Space Challenge

Certification 4

Deep Learning on Time Series

Date of issue: 09/2023

Issuing organization: Kaggle

Certification 5

Introduction to Deep Learning

Date of issue: 10/2023

Issuing organization: Kaggle

Get in touch

Contact Me