I am a first-year PhD student in Computer Science at the University of Illinois at Urbana-Champaign (UIUC). My research focuses on Natural Language Processing (NLP), with a particular interest in the safety and alignment of large language models (LLMs). I develop methods and benchmarks to evaluate and enhance LLMs in areas such as multilingual and cross-cultural understanding, reasoning, and ethics. You can also find me active on X (Twitter).
We introduce MENAValues, a new benchmark to evaluate how LLMs align with the cultural values of the Middle East and North Africa. Our research reveals that LLM responses are sensitive to language and framing, with models showing shifts in cultural alignment and, in some cases, producing biased outputs. We also identify a phenomenon called "Logit Leakage," where hidden model preferences are exposed through log-probability analysis. This work highlights the importance of using frameworks like MENAValues to assess the cultural sensitivity of LLMs.
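As a rough illustration of what such log-probability analysis can look like, here is a generic sketch (not the paper's exact protocol): it scores each answer option under a causal LM and compares the totals. The model ID, statement, and answer options are illustrative placeholders.

```python
# Generic log-probability probe: compare how likely the model finds each
# answer option, even when its sampled text response would refuse or hedge.
# Model ID, statement, and options are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # stand-in for a larger instruction-tuned LLM
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Statement: Children must always obey their parents.\nDo you agree? Answer:"
options = [" Yes", " No"]

prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
scores = {}
for opt in options:
    ids = tok(prompt + opt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # log P(token_t | tokens_<t) for every position
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # sum only over the option's tokens (positions after the prompt)
    scores[opt] = token_lp[0, prompt_len - 1:].sum().item()

print(scores)  # a consistent gap between options reveals a latent preference
```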
This paper addresses gender bias and logical coherence in machine translation, particularly between gendered languages like English and genderless ones such as Persian. We introduce the Translate-with-Care (TWC) dataset, which includes 3,950 challenging scenarios to test translation systems. Our findings show that all tested models struggle with genderless content, often defaulting to masculine pronouns in professional contexts. We demonstrate that fine-tuning an open-source model on our dataset can significantly reduce these biases and errors, outperforming proprietary LLMs.
Tree-of-Experts (ToE) is a new prompting method that improves the generation of Winograd Schema Challenge questions, achieving 50% valid cases compared to 10% with existing methods. Using ToE, we created WSC+, a dataset of 3,026 LLM-generated questions that includes new categories for ambiguous and offensive content. Our findings show that while GPT-4 leads LLM performance on WSC+ with 68.7% accuracy, this falls well below human performance of 95.1%. We also found that LLMs don't necessarily answer their own generated questions better than those created by other models.
View Paper | View GitHub | View Video

TuringQ is the first benchmark that tests LLMs' reasoning abilities in theoretical computer science. We evaluated a range of LLMs with Chain-of-Thought prompting and developed an automated evaluation system whose judgments closely match those of human experts. Fine-tuning Llama3-8B on TuringQ improved both its theoretical reasoning and its performance on related tasks such as algebra, demonstrating the benchmark's value for advancing LLM capabilities in computational theory.
View Paper | View GitHub | View Dataset | View Model

Biased AI medical advice poses risks to patient safety as LLMs increasingly influence healthcare decisions. This study introduces two key resources: BiasMD (6,007 Q&A pairs for bias evaluation) and DiseaseMatcher (32,000 clinical Q&As covering 700 diseases). Using these datasets, we developed EthiClinician, a fine-tuned model that surpasses GPT-4 in ethical reasoning and clinical judgment, setting new standards for safer AI-driven healthcare outcomes.
View Paper | View GitHub | View BiasMD | View DiseaseMatcher | View Model

This survey offers a comprehensive overview of how generative AI is applied to character animation, covering facial animation, motion synthesis, and more. It highlights key research, datasets, and trends, providing a single, integrative perspective on the field. The paper also discusses open challenges and future research directions to help researchers and developers advance AI-driven animation technologies.
Persian Ease and Persian Formalizer are a pair of complementary language models fine-tuned for Persian text style transfer: Persian Ease transforms formal Persian text into a more casual, conversational style, while Persian Formalizer converts informal Persian text into formal language suitable for professional or academic contexts. Both models leverage fine-tuning techniques to preserve meaning while adapting the linguistic style appropriately.
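A hypothetical usage sketch, assuming the models are seq2seq checkpoints on the Hugging Face Hub; "google/mt5-small" below is a runnable public stand-in, not the actual Persian Ease or Persian Formalizer checkpoint.

```python
# Hypothetical usage sketch: swap in the actual PersianEase /
# PersianTextFormalizer checkpoints; "google/mt5-small" is only a stand-in.
from transformers import pipeline

formalizer = pipeline("text2text-generation", model="google/mt5-small")

informal = "چطوری؟ فردا میای بریم کتابخونه؟"  # informal Persian input
result = formalizer(informal, max_length=64)
print(result[0]["generated_text"])  # with the real checkpoint: a formal paraphrase
```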
View PersianEase | View PersianTextFormalizer

Implemented dual transformer-based models (mT5 and BERT) for Persian language processing, featuring Named Entity Recognition (NER) for token classification and an abstractive text summarization system.
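The rough shape of the two components, sketched with Hugging Face pipelines; both checkpoints named here are public stand-ins (one an English NER model), not the project's fine-tuned mT5 and BERT models.

```python
# Generic sketch of the two components using Hugging Face pipelines.
# Both model IDs are public stand-ins; substitute the project's checkpoints.
from transformers import pipeline

summarizer = pipeline("summarization", model="google/mt5-small")      # stand-in for PersianSummarizer
ner = pipeline("token-classification", model="dslim/bert-base-NER",   # English stand-in for PersianNER
               aggregation_strategy="simple")

text = "تهران پایتخت ایران است و بزرگ‌ترین شهر این کشور به شمار می‌رود."
print(summarizer(text, max_length=32, min_length=5)[0]["summary_text"])
print(ner(text))  # list of entity spans with labels and confidence scores
```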
View PersianSummarizer | View PersianNER

This project develops a model to accurately predict drug names in both Persian and English using embedding techniques such as FastText and BERT. By leveraging these embeddings, the model predicts drug names based on features and patterns in the input data. This bilingual approach enables pharmaceutical and healthcare applications to improve drug-name identification and suggestion in multilingual environments.
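As an illustration of the embedding-based approach (not the project's actual code), a tiny FastText example with gensim: subword embeddings let the model match misspelled or partial drug names against known ones. The corpus here is a toy placeholder.

```python
# Illustrative FastText sketch with gensim: character n-gram embeddings
# give vectors even to misspelled or unseen drug names.
from gensim.models import FastText

# toy bilingual corpus; the real project trains on much larger text
corpus = [
    ["acetaminophen", "reduces", "fever", "and", "pain"],
    ["ibuprofen", "is", "an", "anti-inflammatory"],
    ["استامینوفن", "تب", "و", "درد", "را", "کاهش", "می‌دهد"],
]
model = FastText(sentences=corpus, vector_size=32, min_count=1, epochs=100)

# a misspelled query still gets a vector from its character n-grams
print(model.wv.most_similar("acetaminophn", topn=3))
```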
View Project

This project conducts sentiment analysis on Twitter and YouTube data, implementing a specialized preprocessing pipeline to improve accuracy and contextual understanding. The pipeline includes essential NLP steps such as lemmatization, tokenization, NER, and spell checking, along with advanced customizations like bigram verification, contradiction resolution, and a slang dictionary tailored to social media language. These techniques enable more accurate and nuanced sentiment insights by accounting for informal language, abbreviations, and unique social media expressions.
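A simplified sketch of this kind of pipeline; the slang dictionary and cleanup rules below are tiny illustrative stand-ins for the project's actual resources.

```python
# Simplified preprocessing sketch: lowercase, strip URLs/mentions, tokenize,
# expand social-media slang, then lemmatize. All rules are illustrative.
import re
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

SLANG = {"u": "you", "gr8": "great", "imo": "in my opinion"}  # illustrative

def preprocess(post: str) -> list[str]:
    post = re.sub(r"https?://\S+|@\w+|#", " ", post.lower())  # strip URLs, mentions, hashtag marks
    tokens = re.findall(r"[a-z0-9']+", post)                  # simple word tokenizer
    tokens = [w for t in tokens for w in SLANG.get(t, t).split()]  # expand slang
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess("IMO this new update is gr8, u should try it! https://t.co/xyz"))
# ['in', 'my', 'opinion', 'this', 'new', 'update', 'is', 'great', 'you', 'should', 'try', 'it']
```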
View Project

I'm always open to conversations and potential collaborations! Feel free to reach out at zahraei2 [at] illinois [dot] edu.