Advanced Neural Networks and Deep Learning Final Project

ASL Typing AI Platform

Real-time ASL detection, correction, & completion on resource-constrained devices
Project image

By Andrew Xavier, Alban Dietrich, Kenneth Munyuza, and Sam D. Brosh, Columbia University

Video, Paper, and Repository

Introduction

American Sign Language (ASL) serves as a vital communication bridge for the deaf community. Because ASL relies on hand gestures rather than spoken words, its detection has become a focal point in image processing and classification. Recognizing the challenges and the steep learning curve associated with ASL, our team introduced the ASL Detection, Correction, and Completion (ASLDCC) application. This tool, compatible with both computers and smartphones, aims to make ASL communication more accessible and efficient.

About ASLDCC

ASLDCC is equipped with three main submodules:

  1. Detection Submodule: Recognizes the user's hand signs.

  2. Natural Language Processing (NLP) Submodule: Suggests corrections or completions based on detected signs.

  3. Full-Stack Submodule: Presents the detected letters and recommendations in a user-friendly interface.

The application is especially beneficial in settings that offer only a basic interface, since it eliminates the need for a keyboard. In an era dominated by Augmented Reality (AR), ASLDCC ensures that signers can communicate without worrying about errors or prolonged signing times.

How It Works

  • Detection: Using MediaPipe, the application detects hand landmarks in each frame of the live video feed. These landmark features were used to train a convolutional neural network (CNN) to identify ASL representations of alphabet letters. During inference, predictions are aggregated over consecutive frames to ensure the sign is intentional (a minimal sketch of this pipeline appears after this list).
Figure 1: Computer Vision Detection submodule pipeline
  • NLP Processing: Based on the detected signs, the NLP algorithm suggests possible word completions. For instance, if the letter "A" is detected, the algorithm might suggest "Able", "Apple", or "Apply". Candidate words for a given input are generated with a directed acyclic word graph (DAWG), and the top three candidates are ranked by a statistical model that uses the Brown Corpus as a reference set (a sketch of this step also appears after this list).
Figure 2: NLP submodule pipeline.
Figure 3: DAWG graph example using only our top 20 words; the actual DAWG used during inference contains 23,000 words.
  • Full-Stack Integration: The entire pipeline is integrated into a user-friendly interface that lets users view detected letters and select word suggestions.
Figure 4: Demo using the Desktop Version
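Below is a minimal sketch of the detection step, assuming MediaPipe Hands for landmark extraction and a Keras model for classification; the model path, label list, window length, and agreement threshold are illustrative placeholders rather than the project's actual values.

```python
# Minimal, hypothetical sketch of the detection loop.
# "asl_landmark_cnn.h5", LABELS, and the thresholds are placeholders.
from collections import Counter, deque

import cv2
import mediapipe as mp
import numpy as np
from tensorflow.keras.models import load_model

LABELS = list("ABCDEFGHIKLMNOPQRSTUVWXY")  # illustrative static-letter classes
model = load_model("asl_landmark_cnn.h5")  # placeholder path for the trained CNN

mp_hands = mp.solutions.hands
window = deque(maxlen=15)  # per-frame predictions aggregated over a sliding window

def landmarks_to_vector(hand_landmarks):
    """Flatten MediaPipe's 21 (x, y, z) hand landmarks into a 63-dim feature vector."""
    return np.array([[lm.x, lm.y, lm.z] for lm in hand_landmarks.landmark]).flatten()

cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.7) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            vec = landmarks_to_vector(results.multi_hand_landmarks[0])
            # Reshape as required by the trained network; a flat vector is assumed here.
            probs = model.predict(vec[np.newaxis, :], verbose=0)[0]
            window.append(LABELS[int(np.argmax(probs))])
        # Emit a letter only once most frames in the window agree, i.e. the sign is intentional.
        if len(window) == window.maxlen:
            letter, count = Counter(window).most_common(1)[0]
            if count / window.maxlen >= 0.8:
                print("Detected:", letter)
                window.clear()
cap.release()
```

The sliding-window vote is what keeps brief transitional hand positions from being typed as letters.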
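A similarly minimal sketch of the completion step follows, assuming the third-party DAWG package and NLTK's Brown corpus; the vocabulary here is simply every alphabetic Brown word, whereas the actual application restricts it to roughly 23,000 words.

```python
# Hypothetical sketch: prefix completion via a DAWG, ranked by Brown-corpus frequency.
from collections import Counter

import dawg
import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)

# Word frequencies from the Brown corpus serve as the statistical reference model.
freq = Counter(w.lower() for w in brown.words() if w.isalpha())

# Build the DAWG over the vocabulary (the deployed app uses ~23,000 words).
vocab = dawg.CompletionDAWG(freq.keys())

def suggest(prefix, k=3):
    """Return the k most frequent vocabulary words that start with `prefix`."""
    completions = vocab.keys(prefix.lower())
    return sorted(completions, key=lambda w: freq[w], reverse=True)[:k]

print(suggest("a"))   # top-3 completions for the detected letter "A"
print(suggest("ea"))  # e.g. might include "each", depending on corpus frequencies
```

Compared with a plain trie, a DAWG merges shared suffixes into single nodes, which keeps prefix lookups fast and the structure compact enough for resource-constrained devices.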

Performance & Results

Our ASL classification model, trained on diverse datasets, achieved 98.3% accuracy and an average F1-score of 0.985 across all classes. The application's efficacy was further demonstrated through its real-time performance on both mobile and web platforms. Notably, the web application provided slightly higher accuracy due to the stability of laptop configurations.

Figures 5 and 6: Performance of the CV Detection submodule (confusion matrix and per-class scores)
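For reference, per-class metrics like those shown in Figures 5 and 6 can be produced with scikit-learn; the labels below are small placeholders rather than the project's actual test data.

```python
# Illustrative computation of the reported metrics on a held-out test set.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_true = ["A", "B", "C", "A", "B", "C"]  # placeholder ground-truth labels
y_pred = ["A", "B", "C", "A", "C", "C"]  # placeholder CNN predictions

print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")
print(classification_report(y_true, y_pred, digits=3))  # per-class precision/recall/F1 (Figure 6)
print(confusion_matrix(y_true, y_pred))                 # rows = true class, columns = predicted (Figure 5)
```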

Example of Final Product on Mobile Device:

  1. User signs "E" and intends to type "each"
Mobile Example 1
  1. User signs "2" to select the second auto-complete option, "each"
Mobile Example 2
  1. "Each" is added to the current result message
Mobile Example 3

Conclusion & Future Prospects

ASLDCC represents a meaningful step toward addressing the communication challenges faced by the deaf community.

Future enhancements include refining the dataset, introducing hand-filtering mechanisms to eliminate background distractions, and expanding the model to recognize full ASL words, which involve dynamic hand movements.